Monitoring SQL Server Availability Groups with AlwaysOn Extended Events Health Session


By:   |   Updated: 2018-01-30   |   Comments   |   Related: More > Availability Groups


Problem

In some environments, DBAs like to use to a third-party tool to monitor their servers especially when they have a lot of servers.  When you have a small number of servers having all your alerts come from SQL Server Agent is not so bad, but when you get to where you are managing more, it is nice to be able to have a dashboard to see what is happening across your environment.  Also, for alert 1480 which is the alert for the Availability Group is changing roles, you get multiple alerts per server for every failover.  If you use the AlwaysOn_health session, you can write a custom alert that only alerts once per failover.

Solution

There are just a few alerts we want to be alerted on when it comes to Availability Groups, the main ones are:

  1. When the role as changed to the primary alert number 1480
  2. When the availability group is not ready for automatic failover if it is set up for automatic failover alert number 41405

Introduction to the AlwaysOn_health Extended Events Session

Letís take a look at the AlwaysOn_health extended events session and how you would go about query it for the relevant error numbers.

First, letís take a look at where the AlwaysOn_health session is.  In SSMS, under Management you will find Extended Events, then under Sessions you will see the AlwaysOn_health session.

SSMS AlwaysOn_health Extended Events session

If you expand it, you can right-click on package0.eventfile and click View Target Data to get an idea of the type of data that is stored in this extended event.

Data from AlwaysOn_health Extended Events session

If you script out the extended event session, you get the following code and see that they track more error messages than just the availability group error messages.  Notice the 823 and 824 error numbers telling us about database corruption.

CREATE EVENT SESSION [AlwaysOn_health]
ON SERVER
ADD EVENT sqlserver.alwayson_ddl_executed ,
ADD EVENT sqlserver.availability_group_lease_expired ,
ADD EVENT sqlserver.availability_replica_automatic_failover_validation ,
ADD EVENT sqlserver.availability_replica_manager_state_change ,
ADD EVENT sqlserver.availability_replica_state ,
ADD EVENT sqlserver.availability_replica_state_change ,
ADD EVENT sqlserver.error_reported
( WHERE ( [error_number] = ( 9691 )
OR [error_number] = ( 35204 )
OR [error_number] = ( 9693 )
OR [error_number] = ( 26024 )
OR [error_number] = ( 28047 )
OR [error_number] = ( 26023 )
OR [error_number] = ( 9692 )
OR [error_number] = ( 28034 )
OR [error_number] = ( 28036 )
OR [error_number] = ( 28048 )
OR [error_number] = ( 28080 )
OR [error_number] = ( 28091 )
OR [error_number] = ( 26022 )
OR [error_number] = ( 9642 )
OR [error_number] = ( 35201 )
OR [error_number] = ( 35202 )
OR [error_number] = ( 35206 )
OR [error_number] = ( 35207 )
OR [error_number] = ( 26069 )
OR [error_number] = ( 26070 )
OR [error_number] > ( 41047 )
AND [error_number] < ( 41056 )
OR [error_number] = ( 41142 )
OR [error_number] = ( 41144 )
OR [error_number] = ( 1480 )
OR [error_number] = ( 823 )
OR [error_number] = ( 824 )
OR [error_number] = ( 829 )
OR [error_number] = ( 35264 )
OR [error_number] = ( 35265 )
)
) ,
ADD EVENT sqlserver.hadr_db_partner_set_sync_state ,
ADD EVENT sqlserver.lock_redo_blocked
ADD TARGET package0.event_file
( SET filename = N'AlwaysOn_health.xel', max_file_size = ( 5 ), max_rollover_files = ( 4 ))
WITH ( MAX_MEMORY = 4096KB ,
EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS ,
MAX_DISPATCH_LATENCY = 30 SECONDS ,
MAX_EVENT_SIZE = 0KB ,
MEMORY_PARTITION_MODE = NONE ,
TRACK_CAUSALITY = OFF ,
STARTUP_STATE = ON
);
GO

How to Query the Extended Events Session for When a Server Becomes the Primary

Now you can write some queries against the extended event and pull out the data you need for alerts.  For example, for Role Change alert number 1480 here is the query to pull out only when a server becomes the primary so you only get one alert.  The first part of the code goes and finds the current file on the operating system that holds the extended event session data.  Then we SELECT from that WHERE the availability_replica_state_change change became the PRIMARY_NORMAL state.

DECLARE @FileName NVARCHAR(4000)
SELECT @FileName = target_data.value('(EventFileTarget/File/@name)[1]','nvarchar(4000)')
FROM (
SELECT
CAST(target_data AS XML) target_data
FROM sys.dm_xe_sessions s
JOIN sys.dm_xe_session_targets t
ON s.address = t.event_session_address
WHERE s.name = N'AlwaysOn_health'
) ft
 
SELECT 
XEData.value('(event/@timestamp)[1]','datetime2(3)') AS event_timestamp,
XEData.value('(event/data[@name="previous_state"]/text)[1]', 'varchar(255)') AS previous_state,
XEData.value('(event/data[@name="current_state"]/text)[1]', 'varchar(255)') AS current_state,
XEData.value('(event/data[@name="availability_replica_name"]/value)[1]', 'varchar(255)') AS availability_replica_name,
XEData.value('(event/data[@name="availability_group_name"]/value)[1]', 'varchar(255)') AS availability_group_name
FROM (
SELECT CAST(event_data AS XML) XEData, *
FROM sys.fn_xe_file_target_read_file(@FileName, NULL, NULL, NULL)
WHERE object_name = 'availability_replica_state_change'
) event_data
WHERE XEData.value('(event/data[@name="current_state"]/text)[1]', 'varchar(255)') = 'PRIMARY_NORMAL'
ORDER BY event_timestamp DESC;

How to Query the Extended Events Session for Error Numbers

Next we can query for specific error numbers that are that referenced in the extended events session.  In the query below we are looking for the error numbers related to possible database corruption issues.

DECLARE @FileName NVARCHAR(4000)
SELECT @FileName = target_data.value('(EventFileTarget/File/@name)[1]','nvarchar(4000)')
FROM (
SELECT
CAST(target_data AS XML) target_data
FROM sys.dm_xe_sessions s
JOIN sys.dm_xe_session_targets t
ON s.address = t.event_session_address
WHERE s.name = N'AlwaysOn_health'
) ft
 
SELECT
XEData.value('(event/@timestamp)[1]','datetime2(3)') AS event_timestamp,
XEData.value('(event/data[@name="error_number"]/value)[1]', 'int') AS error_number,
XEData.value('(event/data[@name="severity"]/value)[1]', 'int') AS severity,
XEData.value('(event/data[@name="message"]/value)[1]', 'varchar(max)') AS message
FROM (
SELECT CAST(event_data AS XML) XEData, *
FROM sys.fn_xe_file_target_read_file(@FileName, NULL, NULL, NULL)
WHERE object_name = 'error_reported'
) event_data
WHERE XEData.value('(event/data[@name="error_number"]/value)[1]', 'int') IN (823, 824, 829)
ORDER BY event_timestamp DESC;
Next Steps
  • Setup these queries in your third-party monitoring tool and add the event_timestamp to the WHERE clause.
  • Check out these additional resources.
    • Using Extended Events to Monitor Availability Groups here.
    • SQL Server Agent Alerts here.


Last Updated: 2018-01-30


get scripts

next tip button



About the author
MSSQLTips author Tracy Boggiano Tracy Boggiano is the Database Superhero for ChannelAdvisor in North Carolina specializing in automation, performance tuning, and high availability/disaster recovery technologies.

View all my tips
Related Resources





Comments For This Article





download





Recommended Reading

Connect to SQL Server Availability Group replica with SSMS when Readable Secondary is Read-intent only

Configure SQL Server AlwaysOn Availability Group on a Multi-Subnet Cluster

Fix SQL Server AlwaysOn Availability Group Error: 1408 Joining database on secondary replica resulted in an error

What is SQL Server AlwaysOn?

Add a SQL Server Database to an Existing Always On Availability Group








get free sql tips
agree to terms


Learn more about SQL Server tools