Monitor SQL Server replication latency using tracer tokens
By: Chad Boyd | Updated: 2007-10-24 | Comments (1) | Related: More > Replication
Tracer tokens were introduced with Sql Server 2005 transactional replication as a way to monitor the latency of delivering transactions from the publisher to the distributor and from the distributor to the subscriber(s) (and as a side-effect, you can also get the time from the publisher to the subscriber(s)). A tracer token is basically a small piece of data that is inserted into the transaction log for the given publication just as if it were data for a standard database transaction (i.e. a new record that was inserted, or an update, or a delete, etc.) and is treated basically the same way as well (with the exception that no user data is obviously modified in either the publisher or subscriber). You insert tokens for a given publication at the publisher, and the progress of the token can be queried on the publisher as well.
Prior to Sql 2005, tracking the latency for transaction distribution in a replication topology was very tedious to say the least - some solutions included using 'dummy' tables to push record updates/inserts/deletes out to each subscriber, however much code had to be written to track the changes through the entire system, including the necessity typically to connect to the actual distributors/subscribers to query for the changed data and verify the replication of said data. With Sql 2005, tracer tokens provide a simple, easy, light-weight way for you to check the latency of a transactional replication topology automatically from a single point (i.e. your publisher).
There are multiple built-in system procedures that you can use to manage tracer tokens, some of which include:
- sp_posttracertoken (used to push a token into a publication)
- sp_helptracertokens (returns a row for each tracer token that has been inserted at a publisher for a given publication)
- sp_helptracertokenhistory (returns latency information for a given token id value retrieved from either of the procedures above)
- sp_deletetracertokenhistory (removes token records from the system meta tables)
Having these system procedures makes it very easy to automate the monitoring of latency for your publications, possibly fire errors/emails/alerts if the latency exceeds certain thresholds, and store historical latency information for reporting purposes or historical tracking. I've included a sample stored procedure that I call "sp_replCheckLatency_allPubs" that will do basically what it says - when run on a given instance, it will walk through every database on the server that includes a publication, then walk through each publication in the given database, and post a tracer token to it. Once all the publications in all DB's have had tokens posted, the procedure will then wait for a specified delay, and then proceed to walk back through each of the posted tokens to retrieve latency information (via the sp_helptracertokenhistory system procedure using the token ID value returned from the original posting of the token). If at any point a token is found where the latency for the token to reach the distributor/subscribers has exceeded a certain specified threshold, an error will be raised with detailed information about the publication, token latency, etc. Then, after all the tokens have been inspected for the specified threshold, all the latency information can be optionally logged to a table for historical tracking and reporting purposes, or just to have around for querying in the future to correlate network outages, lag times, etc. if necessary.
In an environment for my client, we use this procedure in a scheduled job on each publisher to run on a scheduled basis (every 10 minutes here today), and if an error is fired indicating replication latency over our specified threshold (in our environment we use a total latency to all subscribers of 1 minute or less), then an email is immediately sent to a DBA group and a copy of a report from Reporting Services is sent along as well to show the day's latency information. The report from reporting services is built from the captured data from the procedure, and allows DBAs and IT folks to see trending analysis on networks throughout the environment related to replication.
I've included the procedure below, you'll notice you have the ability to optionally specify parameters such as a single database vs. all database, a defined delay threshold, a retry threshold if responses aren't received, the ability to suppress output, and an optional fully-qualified table name for storing the latency information (we use a table within an admin database here).
---------------------------------------------------------------------- ------------------ CODE ONLY BELOW ------------------ ---------------------------------------------------------------------- use master; go if object_id('dbo.sp_replCheckLatency_allPubs') > 0 drop procedure dbo.sp_replCheckLatency_allPubs; go create procedure dbo.sp_replCheckLatency_allPubs @database nvarchar(500) = '', -- Database to limit results to - by default all db's are included in execution @waitDelaySeconds smallint = 15, -- Time, in seconds, to delay for before checking for a response from the -- dist/subs...must be between 0 and 60... @retryAttemptsIfNoResponse smallint = 2, -- # of times to retry checking for results prior to giving up and flagging an error -- for each publication...must be between 0 and 10... @noOutputDisplay bit = 0, -- If set (1), no result is returned, execution is simply performed with no reporting... @outputTable nvarchar(1500) = null -- Fully qualified (i.e. server.db.owner.tablename) table to store the output to - If -- the table exists, data will be appended to it as /* exec dbo.sp_replCheckLatency_allPubs @database = 'GratisInternet', @waitDelaySeconds = 20, @retryAttemptsIfNoResponse = default, @noOutputDisplay = default, @outputTable = 'GratisAdminDb.dbo.ztbl_repl_latency'; */ set nocount on; declare @dbname nvarchar(300), @sql nvarchar(max), @i int, @n int, @y smallint, @errString varchar(max), @postTime datetime, @waitFor nvarchar(100); if object_id('tempdb..#tokens') > 0 drop table #tokens; if object_id('tempdb..#response') > 0 drop table #response; -- Format... select @database = case when len(@database) > 0 then @database else '' end, @waitDelaySeconds = case when @waitDelaySeconds between 0 and 60 then @waitDelaySeconds else 15 end, @retryAttemptsIfNoResponse = case when @retryAttemptsIfNoResponse between 0 and 10 then @retryAttemptsIfNoResponse else 2 end, @noOutputDisplay = case when @noOutputDisplay > 0 then @noOutputDisplay else 0 end; select @waitFor = '000:00:' + cast(@waitDelaySeconds as varchar(10)) + '.000'; -- Table to hold listing of tokens... create table #tokens (id int identity(1,1) not null, dbname nvarchar(300), pub nvarchar(1000), tokenid int, postTime datetime); create table #response (id int, distLatency bigint, subscriber nvarchar(500), subscriberDb nvarchar(500), subLatency bigint, overallLatency bigint, sqlText nvarchar(max), postTime datetime, lastCheckTime datetime, errorString varchar(max)); declare dbCursor cursor local fast_forward for select name from sys.databases where is_published > 0 and ((name = @database) or (len(isnull(@database,'')) = 0)); open dbCursor; while 1=1 begin fetch next from dbCursor into @dbname; if @@fetch_status <> 0 break; select @sql = N' -- Use the database we are in... use ' + @dbname + '; declare @pub nvarchar(500), @token int; if isnull(object_id(''dbo.syspublications''),0) = 0 return; -- Get list of all publications in this db... declare pubsCursor cursor local fast_forward for select name from dbo.syspublications where status > 0; open pubsCursor; while 1=1 begin fetch next from pubsCursor into @pub; if @@fetch_status <> 0 break; -- Post a token to this publication... exec sys.sp_posttracertoken @publication = @pub, @tracer_token_id = @token output -- Add this info to the tracking table insert #tokens (dbname, pub, tokenid, postTime) select @dbname, @pub, @token, getdate(); end close pubsCursor; deallocate pubsCursor;'; exec dbo.sp_executesql @sql, N'@dbname nvarchar(300)', @dbname; end -- Close the cursor... close dbCursor; deallocate dbCursor; -- Pause for a bit... waitfor delay @waitFor; -- Now that we've posted a token, time to check the responses for each of them... select @i = min(id), @n = max(id), @sql = N'', @y = 0 from #tokens; -- Get each response... while @i <= @n begin select @sql = N'exec ' + dbname + '.dbo.sp_helptracertokenhistory @publication = ''' + pub + ''', @tracer_id = ' + cast(tokenid as varchar(50)) + ';', @postTime = postTime from #tokens where id = @i; -- Execute if we have something for this id... if len(@sql) > 0 begin -- Push in the response data... insert #response (distLatency, subscriber, subscriberDb, subLatency, overallLatency) exec (@sql); -- Add the sql text... update #response set sqlText = @sql, id = @i, lastCheckTime = getdate(), postTime = @postTime where sqlText is null; -- Did we get a response? If not, wait another bit of time, then recheck... if exists(select * from #response where id = @i and overallLatency is null) begin -- If we haven't got a response after @retryAttemptsIfNoResponse tries, this is an error... if @y >= @retryAttemptsIfNoResponse begin select @errString = 'REPLICATION LATENCY ERROR - Server ' + quotename(@@servername) + ', Publication ' + quotename(pub) +'. ' + char(13) + char(10) + 'Token initiated ' + quotename(convert(varchar(50), postTime, 109)) + '. No response by ' + quotename(convert(varchar(50), getdate(), 109)) + '.' + char(13) + char(10) + 'Latency exceeds ' + quotename(cast(datediff(second, postTime, getdate()) as varchar(50))) + ' seconds.' from #tokens where id = @i; -- Post the error string... update #response set errorString = @errString where id = @i; -- Reset the sub-counter... select @y = 0; end else begin -- Pause and retry this token... waitfor delay @waitFor; delete #response where id = @i; select @i = @i - 1, @y = @y + 1; end -- if @y >= 2 end -- if exists(select * from #response where id = @i and overallLatency is null) end -- if len(@sql) > 0 select @i = @i + 1, @sql = N''; end -- while @i <= @n -- Capture if we should if len(@outputTable) > 0 begin select @sql = ' if isnull(object_id(''' + @outputTable + '''),0) = 0 create table ' + @outputTable + ' ( distLatency bigint, subscriber nvarchar(500), subscriberDb nvarchar(500), subLatency bigint, overallLatency bigint, sqlText nvarchar(max), postTime datetime, lastCheckTime datetime, errorString varchar(max) ); insert ' + @outputTable + ' (distLatency,subscriber,subscriberDb,subLatency,overallLatency,sqlText,postTime,lastCheckTime,errorString) select distLatency,subscriber,subscriberDb,subLatency,overallLatency,sqlText,postTime,lastCheckTime,errorString from #response;'; exec (@sql); end -- Display results... if isnull(@noOutputDisplay,0) = 0 select distLatency, cast(subscriber as varchar(100)) as subscriber, cast(subscriberDb as varchar(100)) as subscriberDb, subLatency, overallLatency, postTime,lastCheckTime, cast(errorString as varchar(350)) as errorString, cast(sqlText as varchar(350)) as sqlText from #response; -- Raise an error if needed... if exists(select * from #response where len(errorString) > 0) raiserror ('Replication Agent Latency Detected. See ErrorString output for details.', 16, 15); if object_id('tempdb..#tokens') > 0 drop table #tokens; if object_id('tempdb..#response') > 0 drop table #response; go
About the author
View all my tips
Article Last Updated: 2007-10-24