Learn more about SQL Server tools

mssqltips logo
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Whitepapers          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

 
Failover Clustering in the Cloud - Understanding Your Options - Free Webinar
 

SQL Server AlwaysOn Availability Groups Monitoring and Alerting - Part 2


By:   |   Last Updated: 2018-12-03   |   Comments (3)   |   Related Tips: More > Availability Groups

Problem

In the part 1 of this tip series, I showed you how to build the inventory of replicas that form an Availability Group. However, that’s just half of the picture, because you must be able to check the status of the databases within Availability Groups and this has to be a constant check to make sure the databases are “available”.

Solution

I will be presenting a script that gathers basic, but very useful information for each Availability Group database. This information will be obtained from the performance counter objects and with PowerShell we are going to be able to gather this information from all the availability databases from the replicas inventory you already have.

SQL Server Availability Groups Monitoring Script

The Availability Groups inventory is built first, because that way you can run this second script against a smaller subset of SQL Server instances. You can probably do it within one run, but that wouldn’t be as efficient.

Also, there are some performance counter objects that must be obtained from the primary replica and others from the secondary ones.

Information from Primary Replica

  • Log Bytes Flushed/sec: A log flush occurs when data is written from the log cache to the physical transaction log file on disk, every time a transaction is committed.  In order to get the rate, you have to obtain a couple of samples and then subtract the values. That’s the reason why the “WAITFOR DELAY” is there, so you can have an accurate delta. This will give you an idea of the activity going on currently at the databases in the Primary Replica.
SELECT instance_name, cntr_value 
INTO #Logflushes1
FROM sys.dm_os_performance_counters 
WHERE object_name  = 'SQLServer:Databases' 
  AND counter_name = 'Log Bytes Flushed/sec';

WAITFOR DELAY '00:00:01';

SELECT instance_name, cntr_value 
INTO #Logflushes2
FROM sys.dm_os_performance_counters 
WHERE object_name  = 'SQLServer:Databases' 
  AND counter_name = 'Log Bytes Flushed/sec';

SELECT 
     ag.name AS 'GroupName'
     ,db_name(drs.database_id) AS 'Database'
     ,(SELECT (mf.size * 8) / 1024.0 FROM sys.master_files mf WHERE mf.type = 0 AND mf.database_id = drs.database_id) AS 'MDFSize'
     ,(SELECT (mf.size * 8) / 1024.0 FROM sys.master_files mf WHERE mf.type = 1 AND mf.database_id = drs.database_id) AS 'LDFSize'
     ,drs.synchronization_state_desc AS 'State'
     ,drs.synchronization_health_desc 'Health'
     ,ags.primary_replica AS 'PrimaryReplica'
     ,CONVERT(DECIMAL(10,2), log_flushes / 1024.0 / 1024.0) AS 'LogFlushed'
FROM sys.availability_groups ag
JOIN sys.dm_hadr_availability_group_states ags ON ag.group_id = ags.group_id
JOIN sys.dm_hadr_availability_replica_cluster_states cs ON ags.group_id = cs.group_id 
JOIN sys.dm_hadr_availability_replica_states rs  ON rs.replica_id = cs.replica_id 
JOIN sys.dm_hadr_database_replica_states drs ON rs.replica_id=drs.replica_id
JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value log_flushes
FROM #Logflushes1 l1
JOIN #Logflushes2 l2 ON l2.instance_name = l1.instance_name
     ) log_flushes ON log_flushes.instance_name = DB_NAME(drs.database_id) 
WHERE rs.role_desc = 'PRIMARY';

DROP TABLE #Logflushes1;
DROP TABLE #Logflushes2;

Information from Secondary Replica

  • Redone Bytes/sec: Amount of log records redone on the secondary database in the last second.
  • Log Bytes Received/sec: Amount of log records received by the secondary replica for the database in the last second.
  • Log Send Queue Size: Amount of log records of the primary database that has not been sent to the secondary databases, in kilobytes (KB).
  • Send Rate: Average rate at which primary replica instance sent data during last active period, in kilobytes (KB)/second.
  • Send Latency: Time (in seconds) it will take the Primary Replica to fully send all the log records to the Secondary Replica.
  • Redo Queue Size: Amount of log records that the Secondary Replica has to process after being received by the Primary Replica.
  • Redo Rate: Rate at which the Secondary Replica is effectively processing the log records received from the Primary Replica.
SELECT instance_name, cntr_value 
INTO #redo1
FROM sys.dm_os_performance_counters
WHERE object_name  = 'SQLServer:Database Replica' 
  AND counter_name = 'Redone Bytes/sec';

SELECT instance_name, cntr_value 
INTO #send1
FROM sys.dm_os_performance_counters 
WHERE object_name  = 'SQLServer:Database Replica' 
  AND counter_name = 'Log Bytes Received/sec';

WAITFOR DELAY '00:00:01';

SELECT instance_name, cntr_value 
INTO #redo2
FROM sys.dm_os_performance_counters
WHERE object_name  = 'SQLServer:Database Replica' 
  AND counter_name = 'Redone Bytes/sec';

SELECT instance_name, cntr_value 
INTO #send2
FROM sys.dm_os_performance_counters 
WHERE object_name  = 'SQLServer:Database Replica' 
  AND counter_name = 'Log Bytes Received/sec';

SELECT
      DB_NAME(rs.database_id) AS 'Database'
      ,r.replica_server_name AS 'SecondaryReplica'
      ,CONVERT(DECIMAL(10,2), rs.log_send_queue_size / 1024.0) AS 'LogSendQueueSize'
      ,CONVERT(DECIMAL(10,2), send_rate / 1024.0 / 1024.0) AS 'LogSendRate'
      ,CONVERT(DECIMAL(10,2), log_send_queue_size / CASE WHEN send_rate = 0 THEN 1 ELSE send_rate / 1024.0 END) AS 'SendLatency'
      ,CONVERT(DECIMAL(10,2), rs.redo_queue_size / 1024.0) AS 'RedoQueueSize'
      ,CONVERT(DECIMAL(10,2), redo_rate.redo_rate / 1024.0 / 1024.0) AS 'RedoRate' 
      ,CONVERT(DECIMAL(10,2), rs.redo_queue_size / CASE WHEN redo_rate.redo_rate = 0 THEN 1 ELSE redo_rate.redo_rate / 1024.0 END) AS 'RedoLatency'
FROM sys.dm_hadr_database_replica_states rs
JOIN sys.availability_replicas r ON r.group_id = rs.group_id AND r.replica_id = rs.replica_id
JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value redo_rate
      FROM #redo1 l1
      JOIN #redo2 l2 ON l2.instance_name = l1.instance_name
      ) redo_rate ON redo_rate.instance_name = DB_NAME(rs.database_id)
JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value send_rate
      FROM #send1 l1
      JOIN #send2 l2 ON l2.instance_name = l1.instance_name
      ) send_rate ON send_rate.instance_name = DB_NAME(rs.database_id);

DROP TABLE #send1;
DROP TABLE #send2;
DROP TABLE #redo1;
DROP TABLE #redo2;

PowerShell Script to Automate the Data Collection of SQL Server Availability Group Databases

Below is the PowerShell script that unifies all the code described above. This will let you grab all the information from your environment and store it in a central place for data collection, analysis and monitoring. Note: you will have to update some of the parameters in the PowerShell script below to match your environment settings.

$labServer = "XXX"
$inventoryDB = "XXX"

#Clean the table where the monitoring information will be stored
Invoke-Sqlcmd -Query "TRUNCATE TABLE XXX" -Database $inventoryDB -ServerInstance $server

Write-Host "Fetching all the Primary Replicas from the inventory"

#Fetch all the instances that have a PRIMARY role in an Availability Group
#This list of instances should ideally come from the list we already built in Part 1 of this series
/*
   This is an example of the result set that your query must return
   ############################################################
   # name                     # instance              # group #
   ############################################################
   # server1.domain.net,45000 # server1\MSSQLSERVER1  # AG1   #
   # server2.domain.net,45000 # server2 # server2     # AG1   #
   # server3.domain.net,45000 # server3 # server3     # AG2   #
   # server4.domain.net,45000 # server4\MSSQLSERVER2  # AG2   #
   ############################################################
*/

$PrimaryReplicasLookupQuery = /*Put in here your query that returns the Primary Replicas only*/
$PrimaryReplicas = Invoke-Sqlcmd -ServerInstance $server -Database $inventoryDB -Query $PrimaryReplicasLookupQuery

#Grab the current information from each Primary Replica
foreach ($PrimaryReplica in $PrimaryReplicas){

    #Fetch all the Secondary Replicas per Primary Replica
    Write-Host "Fetching all the Secondary Replicas of Availability Group" $PrimaryReplica.group

    $SecondaryReplicasLookupQuery = /*Pun in here your query that returns the Secondary Replica for the Primary Replica being evaluated*/
    $SecondaryReplicas = Invoke-Sqlcmd -ServerInstance $server -Database $inventoryDB -Query $SecondaryReplicasLookupQuery

    $PrimaryReplicaInformationQuery = "
    SELECT instance_name, cntr_value 
    INTO #Logflushes1
    FROM sys.dm_os_performance_counters
    WHERE object_name  = 'SQLServer:Databases' 
      AND counter_name = 'Log Bytes Flushed/sec';

    WAITFOR DELAY '00:00:01';

    SELECT instance_name, cntr_value 
    INTO #Logflushes2
    FROM sys.dm_os_performance_counters
    WHERE object_name  = 'SQLServer:Databases' 
      AND counter_name = 'Log Bytes Flushed/sec';

    SELECT 
        ag.name AS 'GroupName'
       ,db_name(drs.database_id) AS 'Database'
        ,(SELECT (mf.size * 8) / 1024.0 FROM sys.master_files mf WHERE mf.type = 0 AND mf.database_id = drs.database_id) AS 'MDFSize'
       ,(SELECT (mf.size * 8) / 1024.0 FROM sys.master_files mf WHERE mf.type = 1 AND mf.database_id = drs.database_id) AS 'LDFSize'
       ,drs.synchronization_state_desc AS 'State'
       ,drs.synchronization_health_desc 'Health'
       ,ags.primary_replica AS 'PrimaryReplica'
        ,CONVERT(DECIMAL(10,2), log_flushes / 1024.0 / 1024.0) AS 'LogFlushed'
    FROM sys.availability_groups ag
    JOIN sys.dm_hadr_availability_group_states ags ON ag.group_id = ags.group_id
    JOIN sys.dm_hadr_availability_replica_cluster_states cs ON ags.group_id = cs.group_id 
    JOIN sys.dm_hadr_availability_replica_states rs  ON rs.replica_id = cs.replica_id 
    JOIN sys.dm_hadr_database_replica_states drs ON rs.replica_id=drs.replica_id
    JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value log_flushes
         FROM #Logflushes1 l1
         JOIN #Logflushes2 l2 ON l2.instance_name = l1.instance_name
        ) log_flushes ON log_flushes.instance_name = DB_NAME(drs.database_id) 
    WHERE rs.role_desc = 'PRIMARY';

    DROP TABLE #Logflushes1;
    DROP TABLE #Logflushes2;
    "

    Write-Host "Obtaining information from Primary Replica" $PrimaryReplica.instance
    $PrimaryReplicaResults = Invoke-Sqlcmd -Query $PrimaryReplicaInformationQuery -ServerInstance $PrimaryReplica.name -ErrorAction Stop -querytimeout 30

    #Now that we have the information from the Primary Replica, let's complement it with the information captured for the Secondary Replica
    foreach ($SecondaryReplica in $SecondaryReplicas){
        $SecondaryReplicaInformationQuery = "
        SELECT instance_name, cntr_value 
        INTO #redo1
        FROM sys.dm_os_performance_counters
        WHERE object_name  = 'SQLServer:Database Replica' 
          AND counter_name = 'Redone Bytes/sec';

        SELECT instance_name, cntr_value 
        INTO #send1
        FROM sys.dm_os_performance_counters 
        WHERE object_name  = 'SQLServer:Database Replica' 
          AND counter_name = 'Log Bytes Received/sec';

        WAITFOR DELAY '00:00:01';

        SELECT instance_name, cntr_value 
        INTO #redo2
        FROM sys.dm_os_performance_counters
        WHERE object_name  = 'SQLServer:Database Replica' 
          AND counter_name = 'Redone Bytes/sec';

        SELECT instance_name, cntr_value 
        INTO #send2
        FROM sys.dm_os_performance_counters 
        WHERE object_name  = 'SQLServer:Database Replica' 
          AND counter_name = 'Log Bytes Received/sec';

        SELECT
             DB_NAME(rs.database_id) AS 'Database'
            ,r.replica_server_name AS 'SecondaryReplica'
            ,CONVERT(DECIMAL(10,2), rs.log_send_queue_size / 1024.0) AS 'LogSendQueueSize'
            ,CONVERT(DECIMAL(10,2), send_rate / 1024.0 / 1024.0) AS 'LogSendRate'
           ,CONVERT(DECIMAL(10,2), log_send_queue_size / CASE WHEN send_rate = 0 THEN 1 ELSE send_rate / 1024.0 END) AS 'SendLatency'
            ,CONVERT(DECIMAL(10,2), rs.redo_queue_size / 1024.0) AS 'RedoQueueSize'
           ,CONVERT(DECIMAL(10,2), redo_rate.redo_rate / 1024.0 / 1024.0) AS 'RedoRate'
           ,CONVERT(DECIMAL(10,2), rs.redo_queue_size / CASE WHEN redo_rate.redo_rate = 0 THEN 1 ELSE redo_rate.redo_rate / 1024.0 END) AS 'RedoLatency'
        FROM sys.dm_hadr_database_replica_states rs
        JOIN sys.availability_replicas r ON r.group_id = rs.group_id AND r.replica_id = rs.replica_id
        JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value redo_rate
            FROM #redo1 l1
           JOIN #redo2 l2 ON l2.instance_name = l1.instance_name
           ) redo_rate ON redo_rate.instance_name = DB_NAME(rs.database_id)
        JOIN (SELECT l1.instance_name, l2.cntr_value - l1.cntr_value send_rate
           FROM #send1 l1
           JOIN #send2 l2 ON l2.instance_name = l1.instance_name
           ) send_rate ON send_rate.instance_name = DB_NAME(rs.database_id);

        DROP TABLE #send1;
        DROP TABLE #send2;
        DROP TABLE #redo1;
        DROP TABLE #redo2;
        "

        Write-Host "Obtaining information from Secondary Replica" $SecondaryReplica.instance
        $SecondaryReplicaResults = Invoke-Sqlcmd -Query $SecondaryReplicaInformationQuery -ServerInstance $SecondaryReplica.name -ErrorAction Stop -querytimeout 30
    }

    #Fisrt let's insert the information of the Primary Replicas followed by their respective Secondary Replicas
    foreach($PrimaryReplicaResult in $PrimaryReplicaResults){

        #Build the INSERT statement if it returned at least 1 row
        if($PrimaryReplicaResults.Length -ne 0){

          #Build the insert statement
          $insert = "INSERT INTO XXX VALUES"
            $insert += "
          (
          '"+$PrimaryReplicaResult.PrimaryReplica+"',
             '"+$PrimaryReplicaResult.Database+"',
          "+$PrimaryReplicaResult.MDFSize+",
             "+$PrimaryReplicaResult.LDFSize+",
          '"+$PrimaryReplicaResult.State+"',
          '"+$PrimaryReplicaResult.Health+"',
          '"+$PrimaryReplicaResult.PrimaryReplica+"',
             "+$PrimaryReplicaResult.LogFlushed+",
             0,
             0,
             0,
             0,
             0,
             0,
             0      
           ),
         "
          
          #Store the Primary Replicas results in our centralized database
          Invoke-Sqlcmd -Query $insert.Substring(0,$insert.LastIndexOf(',')) -ServerInstance $server -Database $inventoryDB
       }
    }

    #Now let's update the information of each Primary Replica with the information of their respective Secondary Replica
    foreach($SecondaryReplicaResult in $SecondaryReplicaResults){

        #Build the UPDATE statement if it returned at least 1 row
        if($SecondaryReplicaResults.Length -ne 0){

          #Build the update statement
          $update = "UPDATE XXX"
            $update += "
            SET SecondaryReplica   = '"+$SecondaryReplicaResult.SecondaryReplica+"',
                LogSendQueueSize   = "+$SecondaryReplicaResult.LogSendQueueSize+",
                LogSendRate        = "+$SecondaryReplicaResult.LogSendRate+",
                SendLatency        = "+$SecondaryReplicaResult.SendLatency+",
                RedoQueueSize      = "+$SecondaryReplicaResult.RedoQueueSize+",
                RedoRate           = "+$SecondaryReplicaResult.RedoRate+",
                RedoLatency        = "+$SecondaryReplicaResult.RedoLatency+"
            WHERE PrimaryReplica   = '"+$PrimaryReplicaResult.PrimaryReplica+"' AND [Database] = '"+$SecondaryReplicaResult.Database+"';
         "
         
          #Execute the UPDATE statement
          Invoke-Sqlcmd -Query $update -ServerInstance $server -Database $inventoryDB
       }
    }
}

Write-Host "Done!"

After executing the PowerShell script, you should be able to have something that looks lot like the following:

availability group status results

With this final output, you can choose to do what you want:

Next Steps
  • You can add as many performance counters as you like. There’s a very useful list at Microsoft’s website, where you can even see from what replica you have to capture the value for a particular counter.
  • The scripts presented in this tip are focused towards Availability Groups that have 1 secondary replica per Availability Group. A future version will be able to handle any number of replicas available, although the presentation of the results will have to vary a bit (at least this can be a very good starting point if you have nothing at all).
  • Download the PowerShell script.


Last Updated: 2018-12-03


next webcast button


next tip button



About the author
MSSQLTips author Alejandro Cobar Alejandro Cobar is an MCP, SQL Server DBA and Developer with 10+ years of experience working in a variety of environments.

View all my tips
Related Resources





Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Email me updates 


Signup for our newsletter
 I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.



    



Monday, December 03, 2018 - 1:50:57 PM - Alejandro Cobar Back To Top

Hi Roger,

Thanks for the feedback!

You are right, the object name was specific for default instances... but I have sent the fix so it can be updated in the tip. This time it will work for named instances as well.

I also tried the primary replica query in an instance with 3 Availability Groups and got no error.

*It would be very helpful if you can tell me the whole scenario where you made it crash. I tried in a SQL 2016 version and got no errors at all.

-Alejandro


Monday, December 03, 2018 - 8:30:10 AM - Roger Hullstrung Back To Top

 Hello Alejandro,

The primary replica query returns an error when executing against an instance that contains multiple AGs:

Msg 512, Level 16, State 1, Line 15

Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.

Plus I had to comment out the object_name counter SQLServer:Databases, as this counter does not exist for named instance installs.

Roger

 


Monday, December 03, 2018 - 4:25:45 AM - Vesa Juvonen Back To Top

 

 Nice...

Msg 512, Level 16, State 1, Line 15

Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.

 

 


Learn more about SQL Server tools