Monitor SQL Server replication latency using tracer tokens

By:   |   Comments (1)   |   Related: > Replication


Tracer tokens were introduced with Sql Server 2005 transactional replication as a way to monitor the latency of delivering transactions from the publisher to the distributor and from the distributor to the subscriber(s) (and as a side-effect, you can also get the time from the publisher to the subscriber(s)). A tracer token is basically a small piece of data that is inserted into the transaction log for the given publication just as if it were data for a standard database transaction (i.e. a new record that was inserted, or an update, or a delete, etc.) and is treated basically the same way as well (with the exception that no user data is obviously modified in either the publisher or subscriber). You insert tokens for a given publication at the publisher, and the progress of the token can be queried on the publisher as well.

Prior to Sql 2005, tracking the latency for transaction distribution in a replication topology was very tedious to say the least - some solutions included using 'dummy' tables to push record updates/inserts/deletes out to each subscriber, however much code had to be written to track the changes through the entire system, including the necessity typically to connect to the actual distributors/subscribers to query for the changed data and verify the replication of said data. With Sql 2005, tracer tokens provide a simple, easy, light-weight way for you to check the latency of a transactional replication topology automatically from a single point (i.e. your publisher).

There are multiple built-in system procedures that you can use to manage tracer tokens, some of which include:

- sp_posttracertoken (used to push a token into a publication)
- sp_helptracertokens (returns a row for each tracer token that has been inserted at a publisher for a given publication)
- sp_helptracertokenhistory (returns latency information for a given token id value retrieved from either of the procedures above)
- sp_deletetracertokenhistory (removes token records from the system meta tables)

Having these system procedures makes it very easy to automate the monitoring of latency for your publications, possibly fire errors/emails/alerts if the latency exceeds certain thresholds, and store historical latency information for reporting purposes or historical tracking.  I've included a sample stored procedure that I call "sp_replCheckLatency_allPubs" that will do basically what it says - when run on a given instance, it will walk through every database on the server that includes a publication, then walk through each publication in the given database, and post a tracer token to it. Once all the publications in all DB's have had tokens posted, the procedure will then wait for a specified delay, and then proceed to walk back through each of the posted tokens to retrieve latency information (via the sp_helptracertokenhistory system procedure using the token ID value returned from the original posting of the token). If at any point a token is found where the latency for the token to reach the distributor/subscribers has exceeded a certain specified threshold, an error will be raised with detailed information about the publication, token latency, etc. Then, after all the tokens have been inspected for the specified threshold, all the latency information can be optionally logged to a table for historical tracking and reporting purposes, or just to have around for querying in the future to correlate network outages, lag times, etc. if necessary.

In an environment for my client, we use this procedure in a scheduled job on each publisher to run on a scheduled basis (every 10 minutes here today), and if an error is fired indicating replication latency over our specified threshold (in our environment we use a total latency to all subscribers of 1 minute or less), then an email is immediately sent to a DBA group and a copy of a report from Reporting Services is sent along as well to show the day's latency information. The report from reporting services is built from the captured data from the procedure, and allows DBAs and IT folks to see trending analysis on networks throughout the environment related to replication.

I've included the procedure below, you'll notice you have the ability to optionally specify parameters such as a single database vs. all database, a defined delay threshold, a retry threshold if responses aren't received, the ability to suppress output, and an optional fully-qualified table name for storing the latency information (we use a table within an admin database here).

Enjoy!

----------------------------------------------------------------------
------------------ CODE ONLY BELOW ------------------
----------------------------------------------------------------------

use master;
go
if object_id('dbo.sp_replCheckLatency_allPubs') > 0
 drop procedure dbo.sp_replCheckLatency_allPubs;
go
create procedure dbo.sp_replCheckLatency_allPubs
 @database   nvarchar(500) = '',  -- Database to limit results to - by default all db's are included in execution
 @waitDelaySeconds smallint = 15,   -- Time, in seconds, to delay for before checking for a response from the 
            -- dist/subs...must be between 0 and 60...
 @retryAttemptsIfNoResponse smallint = 2, -- # of times to retry checking for results prior to giving up and flagging an error 
            -- for each publication...must be between 0 and 10...
 @noOutputDisplay bit = 0,    -- If set (1), no result is returned, execution is simply performed with no reporting...
 @outputTable  nvarchar(1500) = null -- Fully qualified (i.e. server.db.owner.tablename) table to store the output to - If 
            -- the table exists, data will be appended to it
as
/*
exec dbo.sp_replCheckLatency_allPubs @database = 'GratisInternet', @waitDelaySeconds = 20, 
   @retryAttemptsIfNoResponse = default, @noOutputDisplay = default, 
   @outputTable = 'GratisAdminDb.dbo.ztbl_repl_latency';
*/
set nocount on;
declare @dbname  nvarchar(300),
  @sql  nvarchar(max),
  @i   int,
  @n   int,
  @y   smallint,
  @errString varchar(max),
  @postTime datetime,
  @waitFor nvarchar(100);
if object_id('tempdb..#tokens') > 0
 drop table #tokens;
if object_id('tempdb..#response') > 0
 drop table #response;
-- Format...
select @database = case when len(@database) > 0 then @database else '' end,
  @waitDelaySeconds = case when @waitDelaySeconds between 0 and 60 then @waitDelaySeconds else 15 end,
  @retryAttemptsIfNoResponse = case when @retryAttemptsIfNoResponse between 0 and 10 then @retryAttemptsIfNoResponse else 2 end,
  @noOutputDisplay = case when @noOutputDisplay > 0 then @noOutputDisplay else 0 end;
select @waitFor = '000:00:' + cast(@waitDelaySeconds as varchar(10)) + '.000';
-- Table to hold listing of tokens...
create table #tokens (id int identity(1,1) not null, dbname nvarchar(300), pub nvarchar(1000), tokenid int, postTime datetime);
create table #response (id int, distLatency bigint, subscriber nvarchar(500), subscriberDb nvarchar(500), subLatency bigint, overallLatency bigint, 
      sqlText nvarchar(max), postTime datetime, lastCheckTime datetime, errorString varchar(max));
declare dbCursor cursor local fast_forward for
 select name
 from sys.databases
 where is_published > 0
 and  ((name = @database)
   or (len(isnull(@database,'')) = 0));
open dbCursor;
while 1=1 begin
 fetch next from dbCursor into @dbname;
 
 if @@fetch_status <> 0
  break;
 
 select @sql = N'
  -- Use the database we are in...
  use ' + @dbname + ';
  declare @pub nvarchar(500),
    @token int;
    
  if isnull(object_id(''dbo.syspublications''),0) = 0
   return;
  -- Get list of all publications in this db...
  declare pubsCursor cursor local fast_forward for
   select name
   from dbo.syspublications
   where status > 0;
  open pubsCursor;
  while 1=1 begin
   fetch next from pubsCursor into @pub;
   
   if @@fetch_status <> 0 
    break;
   -- Post a token to this publication...  
   exec sys.sp_posttracertoken @publication = @pub, @tracer_token_id = @token output
   
   -- Add this info to the tracking table
   insert #tokens (dbname, pub, tokenid, postTime)
   select @dbname, @pub, @token, getdate();
  end
  close pubsCursor;
  deallocate pubsCursor;';
 
 exec dbo.sp_executesql @sql, N'@dbname nvarchar(300)', @dbname;
end
-- Close the cursor...
close dbCursor;
deallocate dbCursor;
-- Pause for a bit...
waitfor delay @waitFor;
-- Now that we've posted a token, time to check the responses for each of them...
select @i = min(id), @n = max(id), @sql = N'', @y = 0
from #tokens;
-- Get each response...
while @i <= @n begin
 select @sql = N'exec ' + dbname + '.dbo.sp_helptracertokenhistory @publication = ''' + pub + ''', @tracer_id = ' + cast(tokenid as varchar(50)) + ';',
   @postTime = postTime
 from #tokens
 where id = @i;
 
 -- Execute if we have something for this id...
 if len(@sql) > 0 begin
  -- Push in the response data...
  insert #response (distLatency, subscriber, subscriberDb, subLatency, overallLatency)
  exec (@sql);
 
  -- Add the sql text...
  update #response
  set  sqlText = @sql,
    id = @i,
    lastCheckTime = getdate(),
    postTime = @postTime
  where sqlText is null;
  
  -- Did we get a response? If not, wait another bit of time, then recheck...
  if exists(select * from #response where id = @i and overallLatency is null) begin
   
   -- If we haven't got a response after @retryAttemptsIfNoResponse tries, this is an error...
   if @y >= @retryAttemptsIfNoResponse begin
    select @errString = 'REPLICATION LATENCY ERROR - Server ' + quotename(@@servername) + ', Publication ' + quotename(pub) +'. ' + char(13) + char(10) +
       'Token initiated ' + quotename(convert(varchar(50), postTime, 109)) + '. No response by ' + quotename(convert(varchar(50), getdate(), 109)) + '.'  + char(13) + char(10) +
       'Latency exceeds ' + quotename(cast(datediff(second, postTime, getdate()) as varchar(50))) + ' seconds.'
    from #tokens
    where id = @i;
    -- Post the error string...
    update #response
    set  errorString = @errString
    where id = @i;
    -- Reset the sub-counter...
    select @y = 0;
   end else begin
    -- Pause and retry this token...
    waitfor delay @waitFor;
    delete #response where id = @i;
    select @i = @i - 1, @y = @y + 1;
   end -- if @y >= 2
   
  end -- if exists(select * from #response where id = @i and overallLatency is null)
 end -- if len(@sql) > 0
 
 select @i = @i + 1, @sql = N'';
end -- while @i <= @n
-- Capture if we should
if len(@outputTable) > 0 begin
 select @sql = '
  if isnull(object_id(''' + @outputTable + '''),0) = 0
   create table ' + @outputTable + ' (
    distLatency bigint, 
    subscriber nvarchar(500), 
    subscriberDb nvarchar(500), 
    subLatency bigint, 
    overallLatency bigint, 
    sqlText nvarchar(max), 
    postTime datetime, 
    lastCheckTime datetime, 
    errorString varchar(max)
   );
  insert ' + @outputTable + ' (distLatency,subscriber,subscriberDb,subLatency,overallLatency,sqlText,postTime,lastCheckTime,errorString)
  select  distLatency,subscriber,subscriberDb,subLatency,overallLatency,sqlText,postTime,lastCheckTime,errorString
  from  #response;';
 
 exec (@sql);
end
-- Display results...
if isnull(@noOutputDisplay,0) = 0
 select distLatency, cast(subscriber as varchar(100)) as subscriber,
   cast(subscriberDb as varchar(100)) as subscriberDb, subLatency,
   overallLatency, postTime,lastCheckTime,
   cast(errorString as varchar(350)) as errorString,
   cast(sqlText as varchar(350)) as sqlText
 from #response;
-- Raise an error if needed...
if exists(select * from #response where len(errorString) > 0)
 raiserror ('Replication Agent Latency Detected. See ErrorString output for details.', 16, 15);
 
if object_id('tempdb..#tokens') > 0
 drop table #tokens;
if object_id('tempdb..#response') > 0
 drop table #response;
go


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Chad Boyd Chad Boyd is an Architect, Administrator and Developer with technologies such as SQL Server, .NET, and Windows Server.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips



Comments For This Article




Thursday, April 12, 2018 - 11:33:23 AM - Asaf Back To Top (75686)

 When testing on SQL Server 2017, i noticed OBJECT_ID('tempdb..#tokens) returned integer value as negative value. Code checks for > 0 so cleanup of temp table was failing.

I replaced  the following;

 if object_id('tempdb..#tokens') > 0
  drop table #tokens;

 if object_id('tempdb..#response') > 0
  drop table #response;
  
With the following statements; 

if object_id('tempdb..#tokens') IS NOT NULL
  drop table #tokens;

 if object_id('tempdb..#response') IS NOT NULL
  drop table #response;















get free sql tips
agree to terms