Tracking SQL Server Replication Subscriber Schema Changes Disrupting Data Flow
By: Tim Smith | Comments | Related: More > Replication
We've been experiencing SQL Server transactional replication issues where we discover that a subscriber's schema is not synced with a publisher's schema after we add a column to the publisher. It appears that the subscriber doesn't get the column, or something removes it. How we can track when this occurs or ways to prevent this issue from occurring?
For tracking this issue, especially if we have many different teams that use and change the objects involved in SQL Server replication, both on the publisher and subscriber, we'll want to look at a few logs as well as possibly compare the schemas, when replication is timed. Let's look at a test example of this and identify how we can troubleshoot the time that this occurs, as well as ways to prevent this. Let's create a test table in a test environment and add a few records to it and set it to transactional replication to another test environment:
CREATE TABLE tblPubSubTest( OurIDColumn INT, OurVCColumn VARCHAR(100), ColumnWeDropped VARCHAR(1) ) INSERT INTO tblPubSubTest VALUES (1,'One','1') , (2,'Two','2')
Once SQL Server replication is set up between the publisher and subscriber, let's drop a column on the subscriber table that is currently being replicated without removing it on the publisher - ColumnWeDropped for this example. Then, let's add another value to the publisher:
INSERT INTO tblPubSubTest VALUES (3,'Three','3')
In the MSdistribution_history table within the distribution database on the distribution server (or appropriate server relative to your set up), you will see two errors if there are still records that are being replicated immediately. Let's run the query to check the history first:
SELECT [start_time] , [time] , [comments] FROM MSdistribution_history ORDER BY [time] DESC
"Error executing a batch of commands. Retrying individual commands." "Invalid column name 'ColumnWeDropped'."
In the MSrepl_errors table, we'll see only one of those messages:
"Invalid column name 'ColumnWeDropped'."
Along with the message:
"if @@trancount > 0 rollback tran"
By contrast, if we had added a new column on the Publisher (without adding it on the subscriber), we would have seen a message in the MSdistribution_history table stating:
"A DDL change has been replicated."
This is good to know to help with process of elimination, as a change to the published table's schema will affect the subscriber when the full table is published. When updating the publisher table, we can verify the DDL confirmation in the comments of the MSdistribution_history table in the distribution database. When the error occurs because a column is removed on the subscriber, one point here is that this errors occurs after a record is added, updated or removed; it's true that if we drop a column on the subscriber table without affecting the publisher table and no records are replicated to the subscriber, the error won't appear immediately. Tracer tokens unfortunately do not generate this error, so either simulating a transaction - or if in an environment where replication is constant - will be required if using these table logs to confirm. An alternative solution is post deployment, comparing publisher and subscriber table schemas either based on INFORMATION_SCHEMA if they should match, or a configuration table if only a select set of columns are being replicated. An example where we dropped a column on the subscriber table, replication - provided inserts, updates and deletes continue - will fail and let us know that there is an invalid column. For tracking this helps isolate a time.
Now, for troubleshooting. Since this issue happened after the publisher's schema changed, let's identify when the DDL change happened as compared to when the first error appeared about the column being removed on the subscriber. The first appearance of the error tells us that replication failed because the column on the subscriber table was removed either then or before then (if there are times where replication isn't sending transactions), but we can also confirm that the DDL change was passed to the subscriber (for instance, the subscriber server wasn't offline). Since we've confirmed that the DDL changes were applied and that the column was later removed, due to the error message that occurred after replication continued, we can use the timeframe of the error to identify if a change was made by a team, a deployment or change removed columns, an automated process dropped columns, or something else changed the subscriber table's columns relative to the publisher's table. Once the tables match again and replication continues, we'll know by seeing the message "Initializing" along with replication confirmations in the MSdistribution_history table.
Be very careful about relying on Replication Monitor for detecting if there are errors; while these tables will report on issues immediately, I've caught Replication Monitor behind for these tables many times. While it generally reports errors correctly, it can generate false positives and this is one of the many reasons I tend to check the underlying tables directly - they'll provide information faster and are more up-to-date. In environments where schema changes occur frequently monitoring publisher and subscriber schemas is the best way to prevent and identify when these issues arise. If the environment publishes a fraction of a publisher table, they will need to design a configuration table to save all the information for monitoring - for instance the publisher and subscriber information and how many columns the subscriber should have. For environments that publish full tables, they can use the below query to do a column count check from the source to the destination:
SELECT p.publication AS PublicationName , pubs.srvname AS SourceServer , subs.srvname AS DestinationServer , p.publisher_db SourceDatabase , s.subscriber_db AS DestinationDatabase , a.source_owner AS SourceSchema , CASE WHEN a.destination_owner IS NULL THEN a.source_owner ELSE a.destination_owner END AS DestinationSchema , a.source_object AS SourceTable , a.destination_object AS DestinationTable ---- The below can be used to either directly query from PS, a linked server, or another preferred way , 'SELECT COUNT(COLUMN_NAME) AS SourceColumnCount FROM ' + p.publisher_db + '.INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = ''' + a.source_object + ''' AND TABLE_SCHEMA = ''' + a.source_owner + '''' AS SourceCheck , 'SELECT COUNT(COLUMN_NAME) AS SourceColumnCount FROM ' + s.subscriber_db + '.INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = ''' + a.destination_object + ''' AND TABLE_SCHEMA = ''' + CASE WHEN a.destination_owner IS NULL THEN a.source_owner ELSE a.destination_owner END + '''' AS DestinationCheck FROM distribution.dbo.MSarticles a INNER JOIN distribution.dbo.MSpublications p ON a.publication_id = p.publication_id INNER JOIN distribution.dbo.MSsubscriptions s ON s.publication_id = p.publication_id AND s.article_id = a.article_id INNER JOIN master..sysservers subs ON s.subscriber_id = subs.srvid INNER JOIN master..sysservers pubs ON s.publisher_id = pubs.srvid
I would set alerts around a schema mismatch when the schemas should match, otherwise, a configuration table that checks using the INFORMATION_SCHEMA is another way to alert on a mismatch. Since each environment (or application) may differ, one environment may be fine by running this both before and after a load, if they have timed loads, while another environment will need to run this throughout the day. Running this before and after a release, deployment or change is a good idea to confirm that the schemas match.
- Test the effects of both subtracting columns on both the publisher and subscriber to familiarize yourself with how replication logs each.
- Put the alerting on schemas around change schedules, replication timing, etc. which will differ depending on process and environment.
About the author
View all my tips