Queries to Monitor SQL Server Replication Backlog
By: Tim Smith | Updated: 2016-02-24 | Comments | Related: More > Replication
We've run into an issue where keeping records of row counts didn't warn us about SQL Server replication stalling and we wanted to know if there were other approaches to solving this problem, especially related to transactions pending to be replicated.
Tracking row counts can help assuming that the people tracking the differences can recognize an abnormal miscount between the source and destination and that they have a large sample size to ascertain whether that's true. Unfortunately, without a statistical approach, both of those can be difficult to determine on a day-to-day basis (or case-by-case basis).
Imagine a billion rows of data being replicated throughout the day from source to destination; at any given moment, what should the uncommitted difference between the source and destination be (remember, .NET looks at the uncommitted row count)? Stated another way, when using a statistical approach (average plus three multiplied by three standard deviations as the outlier), 100% of the time I saw many warning signs (statistically significant miscounts between source and destination tables), there was an issue.
After having written that, "trust but verify" is a solid idiom and holding an approach accountable to scrutiny can help provide some peace of mind. I've talked with some developers who use a checksum like approach, such as aggregating a numeric column, but with data like bitcoin, swings can be fast in either direction giving a false impression or false flag (because a 10% jump or fall in less than a minute is not a problem with replication, but a valid move, where the source has a different data point temporarily than the destination). I like to remind people that as someone from West Texas, I've seen oil prices rise $25 in less than one day; luckily, in that particular case, replication wouldn't have been behind too far.
I recently spoke to Harshal Bhargav, an engineer at Microsoft who used the following query, that answers this question to a certain degree (do we have enough data points?):
SELECT name , rowcnt ---- NOTE: consider how you've configured replication for instance, if you've scaled a ---- distribution per publisher, like distribution_publisherone, then you would adjust this query FROM distribution.dbo.sysindexes WHERE name IN('ucMSrepl_transactions', 'ucMSrepl_commands')
In this query, we're retrieving the name of the index - two indexes, ucMSrepl_transactions (the clustered index for MSrepl_transactions) and ucMSrepl_commands (the clustered index for MSrepl_commands) - and the data level row count. As an alternative, we could use the SMO library with PowerShell to perform a soft read of the row count of both tables - either approach will give us the row count, though this may not be fully committed.
Let's put aside the distribution clean up job for a minute and state that this query can help us identify how many commands and transactions we have - whether pending or completed. Until the clean up runs, this should either stay the same or increase; you will see these counts decrease after the clean up runs, relative to how long you retain records. A great accountability check for row count accuracy is what is the average amount of commands or transactions? Let's quickly highlight that this may differ at certain points in the day - the Hong Kong Stock Exchange is only open over a period of time versus bitcoin always being open. The replication behavior of both will differ, and thus we will have more options for cleaning up replication if we have some low or downtime because we schedule the clean-up job to run outside certain windows of time that are busier.
Let's expand this approach to keeping a history, so that we can determine averages and standard deviations for calculating abnormal events, using outliers:
--DROP TABLE Logging.dbo.tb_ReplicationTransCmds CREATE TABLE Logging.dbo.tb_ReplicationTransCmds( TransCmdName VARCHAR(35), RepRowCount BIGINT, RepTimeStamp DATETIME ) INSERT INTO Logging.dbo.tb_ReplicationTransCmds SELECT name AS TransCmdName , rowcnt AS RepRowCount , GETDATE() AS RepTimeStamp FROM distribution.dbo.sysindexes WHERE name IN('ucMSrepl_transactions', 'ucMSrepl_commands')
With a time-based history, we can determine the increase or decrease in both over a period of time, retrieving the latest command and transaction count first:
;WITH RepGrowth AS( SELECT DENSE_RANK() OVER (ORDER BY RepTimeStamp DESC) ID , * FROM Logging.dbo.tb_ReplicationTransCmds ) SELECT t.* , (((t.RepRowCount - tt.RepRowCount)/CAST(tt.RepRowCount AS DECIMAL(22,8)))*100) Growth FROM RepGrowth t INNER JOIN RepGrowth tt ON t.ID = (tt.ID - 1) AND t.TransCmdName = tt.TransCmdName
Likewise, we can get the average, standard deviation and outlier value:
SELECT Name , AVG(RepRowCount) AvgValue , STDEV(RepRowCount) StDevValue , AVG(RepRowCount) + (3*STDEV(RepRowCount)) OutValue FROM Logging.dbo.tb_ReplicationTransCmds GROUP BY Name
The above query uses three multiples of the standard deviation; some environments may benefit from a lower standard deviation multiple, such as two. How does this compare to row counts on tables as for an accountability check on whether the row count increase (or decrease) is accurate? This actually carries a slight advantage over row counts in that abnormal behavior on this can predict possible miscounts on rows, but this can also create a possible false alert if a distribution clean up job only runs at certain times (we would expect both commands and transactions to be high if the clean up job hasn't run).
- The more you know what's normal for your replication, the easier it will be to detect when an issue arises. Unfortunately, determining normal often requires some history.
- Test this with some of your other alerts to measure how it relates with them.
- Check out these other Replication Tips.
Last Updated: 2016-02-24
About the author
View all my tips