Troubleshooting Slow SQL Server Replication issue due to Distributor Database Growth

By:   |   Comments (13)   |   Related: > Replication


Problem

My company has a very replication intensive environment (mostly transactional replication) and recently I started to see that replicating data was taking longer and longer.  In this tip I will show you some of the techniques that I used to troubleshoot the issue as well as what the final resolution was.

Solution

In addition to providing the solution, let me provide the symptoms to my issue and provide how I started troubleshooting the issue.

Basic Symptoms

Here are list of symptoms that I had.

  • Taking longer to replicate data.
  • High CPU on Distributor server
  • High Disk IO on Distributor server
  • High growth rate on Distribution database

Troubleshooting High CPU and High Disk IO

First, I wanted to know why I was getting high CPU and high Disk IO on the distributor server, so I setup Performance Monitor (also known as perfmon) on the server. You can read this article "Creating SQL Server performance based reports using Excel" on how I setup perfmon to collect data and how I found high CPU/Disk IO for about 7~8min every 10min. That means, we had a job running every 10min with a duration of 7~8min. 

I was able to found out there are two jobs that are causing the issue and they were related to replication clean up.

  • Agent history clean up: distribution
  • Distribution clean up: distribution

So, I disabled these two jobs and ran them manually to see if that was causing the high CPU/Disk IO and I was able to find out the "Distribution clean up: distribution" job that runs "EXEC dbo.sp_MSdistribution_cleanup @min_distretention = 0, @max_distretention = 48" stored procedure was causing the problem and when I ran it manually, it ran for more than 7min and returned the below message.

Removed 0 replicated transactions consisting of 0 statements in 451 seconds (0 rows/sec).

Troubleshooting Distribution cleanup issue

From previous research, I was able to find out why we were having high CPU issues and why the distributor database was getting bigger. The issue was that the clean up job was not removing the commands and I was able to see that table "dbo.MSrepl_commands" was 27GB and kept growing everyday.

So, I took a look at the Distributor Properties and I confirmed that retention was set correctly as you can see below.

transaction relations

Fixing Retention days for Each Publication

The next step is to check the retention days for each publication and that is where I found the problem. Many of our publication properties subscription expiration setup was set to "never expire" as you can see from the screen capture below.

refresh

Now, I can open each publication to see how it is setup, but as I mentioned we have many publications so I didn't want to waste my time opening each publication using SQL Server Management Studio.  So instead,  I used the query that I wrote in this tip "SQL Server Replication Scripts to Pull Replication Configuration Information" and added one more column to the script to pull retention days as shown below.

set transaction isolation level read uncommitted
select distinct 
srv.srvname publication_server 
, a.publisher_db
, p.publication publication_name
, p.retention
, ss.srvname subscription_server
, s.subscriber_db
from MSArticles a 
join MSpublications p on a.publication_id = p.publication_id
join MSsubscriptions s on p.publication_id = s.publication_id
join master..sysservers ss on s.subscriber_id = ss.srvid
join master..sysservers srv on srv.srvid = p.publisher_id
join MSdistribution_agents da on da.publisher_id = p.publisher_id 
and da.subscriber_id = s.subscriber_id
ORDER BY p.retention 
publisher

Confirmation

After I identified and changed the retention days to 120 hours and ran the above job, the distributor database space used was reduced from 27GB to 300MB.  After that I shrunk the database to 500MB and then I re-indexed all tables in the distribution database and ran the clean up job again. Now, it takes less 3 sec to run the clean up job.  Also, replication as a whole is a lot faster.

Next Steps
  • Setup an Alert to notify if any new publication is created where retention day is set to never expire
  • Implement additional replication monitoring
  • To shrink databases or database files look at DBCC SHRINKDATABASE and DBCC SHRINKFILE


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Kun Lee Kun Lee is a database administrator and his areas of interest are database administration, architecture, data modeling and development.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips



Comments For This Article




Thursday, December 19, 2019 - 12:56:48 PM - jordan Back To Top (83478)

I have the same problem, but my cleaning job never fails, however it doesn't erase anything.

My MS_repl_commands table has 81 million records and I haven't been able to release anything.

Immediate synchronization is at '0', and subscribers expire '120'

manually apply cleaning with max_distretention = 48

Then apply the shrink and only under 1 GB

but I still can't release the 50GB of the distribution BD

It did not work, apply the method and failed to reduce the size of the distribution base


Thursday, September 24, 2015 - 1:56:23 AM - Konstantinos Kats Back To Top (38742)

Hi Lee, I am facing a little different situation.

I have already taken the actions that you mention (retension is set to 120 hours for all publications) from day one. But the distribution cleanup job is always deadlocked by other sessions and we end up in having a huge distribution database (60GB, 90 million rows in MSrepl_commands) that is very difficult to housekeep.

Even when I stop the log reader and run the distribution cleanup manually, it takes hours before it even starts to delete any records.

Do you have any advice on how to speedup this process? In the execution plan the most costly step is the "Clustered Index Delete".

Thanks 


Friday, February 27, 2015 - 2:36:48 PM - Paul Back To Top (36385)

Not sure I follow, does this relate to timeframe with subscriptions being reinitialized (via snapshot) ?  with Trans REPL, I do it once, and the rest is handled with the log reader / distribution agents, so if I do not re-init a sub within that time frame, is it scrapped ?


Friday, October 24, 2014 - 5:19:56 PM - Marios Philippopoulos Back To Top (35056)

Hi, thank you for this informative article.

How can we decide how many hrs to set the subscription expiration period to?

What factors would dictate setting it to 5 days (120 hrs) vs., say, 30 days (720 hrs)?

Thanks again,

Marios Philippopoulos

 

 


Sunday, September 8, 2013 - 8:27:30 AM - Muhanned Maayeh Back To Top (26671)

Question for Transactional Replication, do you really set the subscription to expire after 336 or 120 hours.  I thought I read documentation per Microsoft that the recommendation is to never expire but for merge replication it recommends expiration of 336 hours.  Can you please clarify this point?  I have a transactional replication at a customer site and when the Distribution Clean job ran it produced slowness in our system where the job needed to be stopped and disabled.  Can you please advise on the proper course of action with a similar problem as described.  (note, using SQL 2008 R2 engine).


Tuesday, April 30, 2013 - 4:52:44 PM - Kun Lee Back To Top (23640)

Salu,

I assume that you are looking at the Distributor SQL server? I haven't seen that case. Let me research for you but I haven't come accross that issue before.

Kun


Tuesday, April 30, 2013 - 4:52:29 PM - Kun Lee Back To Top (23639)

Vijay,

To make long story short, even if it is simple mode, the log file won't flush out the log until Log Reader reads the data and mark it. You can easily reproduce the situation (please don't do on production db), you can create a publisher with simple recovery mode and setup replication and stop the log reader and start loading data to the db and watch the open transaction and log space by using

- DBCC SQLPERF(LOGSPACE)

- use <db name>

DBCC OPENTRAN

And you will see that. As matter of fact, you can search many issue because of the log reader wasn't mark it soon due to heavy load if you google it..

 


Friday, April 26, 2013 - 5:57:16 PM - Salu Back To Top (23587)

I am not seeing these two jobs on my SQL Server and I have transactional replication set up. Do you know why?

  • Agent history clean up: distribution
  • Distribution clean up: distribution

Thanks


Sunday, August 26, 2012 - 2:40:41 AM - Vijay Back To Top (19227)

In replication ,when Backup is SIMPLE RECOVERY MODE. In simple recovry mode. t-log will be treuncate. then,how wll be working replication when, back up is SIMPLE RECOVERY MODE.

 

Vijay


Friday, May 4, 2012 - 11:58:50 AM - sam Back To Top (17267)

Thanks! Saved the busy day!

 


Thursday, April 12, 2012 - 8:21:25 PM - Daniel Back To Top (16889)

Kun,

Fantasic article.  One of our Distribution DBs grew out of control just today.  Your solution has helped me greatly to resolve this issue and to learn more about how replication works!   

Thank you so much!

 


Monday, May 3, 2010 - 4:43:31 PM - drsloat Back To Top (5316)

That helped me zero in on my problem very quickly.  My cleanup job had been running for 15 minutes.  Thanks for the write up!


Tuesday, November 17, 2009 - 2:49:06 PM - mictiu Back To Top (4448)

Hi,

Thanks for the excellent tips .. but want to understand something more in details. In quite new to replication but from ur tips here .. i still can't get what ur meaning of

  • Setup an Alert to notify if any new publication is created where retention day is set to never expire
  • your retention day is set to never expire here refer to Subscription expiration in publication properties or Transaction Retention in Distribution properties. Sorry to ask such a question but just to make myself clear :) ....

    Can this setting help to prevent my replication to become inactive and need to be reinitialized ? Coz im also having problem where replication didn't run for numbers of days like 3 days ... then it become inactive. Please advice and if can notify me through email ..















    get free sql tips
    agree to terms