Improve SQL Server Extended Events system_health Session

By:   |   Comments   |   Related: > Extended Events


Problem

I was recently trying to troubleshoot a SQL Server replication-related deadlock that our monitoring tool didn't capture, and tried to find information about it in the system_health Extended Events session. With the default retention settings and the amount of noise contributed by security ring buffer events, I quickly discovered that the session only had data going back less than two hours. Meaning unless I started investigating an event immediately after it happened, all evidence had rolled out forever.

Solution

There are a few things you can do to make your system_health data last longer. An elaborate way would be to archive the .xel files themselves (or periodically take snapshots into your own storage), but I have two simpler solutions that might work for you.

Change the retention settings for system_health Extended Events session

By default, the system_health session retains 4 rollover files of a maximum of 5 MB each, allowing you to retain 20 MB of data. If you have a lot of events, especially if they're noisy, this squanders your ability to look at events in the past.

Luckily, unlike with the default trace, you can change the retention settings. For example, if I wanted to keep 40 rollover files of 10 MB each, for 400 MB of retention, I could run the following:

USE [master];
GO ALTER EVENT SESSION [system_health] ON SERVER
  DROP TARGET package0.event_file;

ALTER EVENT SESSION [system_health] ON SERVER
  ADD TARGET package0.event_file
 (
    SET filename          =N'system_health.xel',
        max_file_size      = (10), -- MB
        max_rollover_files = (40)
  );

In SQL Server 2016, 2017, and 2019, the defaults – if you haven't altered them already – increase to 10 files of 100 MB each, allowing you to retain 1 GB of system_health data (see KB #4541132 for more details). You can still increase or decrease these settings after applying the relevant cumulative update, but what is the right number will be a balancing act between manageability and XML query performance.

Stop collecting unactionable noise for system_health Extended Events session

Increasing the retention is nice, but what if you're collecting 1 GB (or more) of garbage? On every system I looked at, 99% of the events we were collecting were security ring buffer errors we can't do anything about (they involve how apps connect, authenticate, and validate authentication). I ran the following query (I always dump XEvent session data to a #temp table before any further processing):

;WITH cte AS 
(
  SELECT ed = CONVERT(xml, event_data)
    FROM sys.fn_xe_file_target_read_file(N'system_health*.xel', NULL, NULL, NULL)
)
SELECT event_data = x.ed.query('.')
  INTO #t FROM cte CROSS APPLY cte.ed.nodes(N'.') AS x(ed); SELECT
  EventName,
  EventCount = COUNT(*),
  EarliestEvent_MinutesAgo = DATEDIFF(MINUTE, MIN(EventTime), SYSUTCDATETIME())
FROM
(
  SELECT
    event_data.value(N'(event/@timestamp)[1]', N'datetime'),
    event_data.value(N'(event/@name)[1]',      N'nvarchar(255)')
  FROM #t
) AS t(EventTime, EventName)
GROUP BY EventName
ORDER BY EventCount DESC;

Results:

Number of events, and age in minutes of the oldest event.

Opening one of the XML values yielded this data for the event:

XML for the security error ring buffer recorded event.

There is not a whole lot I can do, after the fact, about a harmless authentication validation for a session_id that is almost certainly no longer connected anyway. I have it on good authority that Microsoft is at least considering either augmenting this information with more details (so you can filter and maybe so it can be actionable), or removing it from system_health altogether.

Until then, I plan on dropping the event from system_health, in addition to increasing the retention, across our entire environment:

ALTER EVENT SESSION [system_health] ON SERVER
 DROP EVENT sqlserver.security_error_ring_buffer_recorded;

The risk here is very small, since I didn't find a single event with useful information, and the rollback action is simple (just add the event back):

ALTER EVENT SESSION [system_health] ON SERVER
  ADD EVENT sqlserver.security_error_ring_buffer_recorded
  (SET collect_call_stack=(0)); -- maybe you don't need the call stack!

Alternatively, you could add this event to your own XEvent session (I talk about moving the useful events from the default trace to your own session here, here, and here).

Summary

The system_health XEvent session can contain some very useful troubleshooting information, but that data can easily get pushed out by short retention settings or too much noise. You can make some minor changes to this session to make the data you keep more useful and last longer.

Next Steps

See these related tips and other resources:



sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Aaron Bertrand Aaron Bertrand (@AaronBertrand) is a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. He is editor-in-chief of the performance-related blog, SQLPerformance.com, and also blogs at sqlblog.org.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips



Comments For This Article

















get free sql tips
agree to terms