By: Jeffrey Yao | Comments | Related: > Replication
Problem
From time to time you may see replication error 20598 for DELETEs. This a pain to fix manually and we often do not want to resort to having to redo the snapshot or skip the error. In this tip I show how you can deal with this error.
Solution
In transactional replication, error 20598 is due to a missing row on the subscriber and there are two scenarios that can cause this error:
- An UPDATE command cannot be replicated, because there is no record that matches the update condition on the subscriber.
- A DELETE command cannot be replicated, because there is no record that matches the delete condition on the subscriber.
I have addressed the UPDATE scenario in part 1. In this tip, I will address the delete scenario. The traditional way to fix this error for a DELETE is similar to what I discussed in part 1, so I will bypass the common steps and only mention the steps unique to the delete scenario. There are actually two ways to fix this issue:
- On the subscriber, insert a dummy record with the exact primary key value of the deleted row. This PK value can be retrieved using sp_browsereplcmds.
- The other option is to modify the sp_MSDel_<schema><table_name> to comment out the error emitting codes when @@rowcount = 0.
The pros of solution 1 is that it is non-intrusive, i.e. no change in the subscription database except for adding the missing row. The cons is that it may be challenging to make a valid dummy record on a table that has many constraints. On the other hand, solution 2 just switches the pros and cons of solution 1.
No matter which solution you use, the data integrity is not compromised. From a solution workload perspective, solution 1 may take more time to generate a dummy record to meet constraints like a Unique Key constraint or user-defined check constraints. While solution 2 can be very simple and elegant in coding. As such, I choose solution 2.
Correct Transactional Replication Error 20598
Let's first look at a system generated sp_MSDel_<schema><table>. The following replication stored procedure is based on a table dbo.t2 which has a primary key with four columns.
-- table definition create table dbo.t2 (a int not null, b varchar(30) not null , c int, d datetime not null, u uniqueidentifier not null, e varchar(30) constraint pk_t2 primary key (a, b, d, u) ); go -- if dbo.t2 is replicated, the system generated replication del SP on the -- subscriber db side will be like the following create procedure [sp_MSdel_dbot2] @pkc1 int, @pkc2 varchar(30), @pkc3 datetime, @pkc4 uniqueidentifier as begin delete [dbo].[t2] where [a] = @pkc1 and [b] = @pkc2 and [d] = @pkc3 and [u] = @pkc4 if @@rowcount = 0 if @@microsoftversion>0x07320000 exec sp_MSreplraiserror 20598 end go
This stored procedure has simple in logic. It deletes a record based on the primary key and if no record is deleted (@@rowcount = 0) it raises error 20598 via "exec sp_MSreplraiserror 20598". To me, if replication cannot DELETE a row because the row is not there, it really does not matter. Replication should continue instead of being stuck on error 20598 forever. Looking at the stored procedure code, if we comment out the "if @@rowcount = 0" block, the stored procedure will bypass error 20598 and replication can continue.
PowerShell Solution to Automate Fixing Error 20598 for DELETEs
The solution is coded in PowerShell, it will first check the distribution.dbo.MSRepl_errors table for error 20598. If error 20598 exists, it gets the [xact_seqno] and [command_id] values and then uses these values to run distribution.dbo.sp_browsereplcmds to retrieve the stored procedure name, i.e. sp_MSDEL_xxxx. It then goes to the subscriber database and modifies the sp_MSDEL_xxx by commenting out the "if @@rowcount = 0" block. At the end, it will wait for a period of time and then un-comment the "if @@rowcount = 0" block and restore the sp_MSDEL_xxx to its original state.
<# Assumption: 1. The computer on which this script runs has SQL Server 2012 PowerShell module installed 2. The computer on which this script runs has PowerShell V3 installed Usage: 1. modify $Distributor/ $PubDB / $SubSvr / $SubDB to be your environment settings 3. Test in the test environment by initially set $debug = $true, recommended to run in the ISE first to get familiar with the script 4. When there are potential many rows missing (for deletion), please set $duration to a proper time (like 10+ min), and it will loop to fix all the missing rows one by one. #>; #Requires -Version 3.0 add-type -assembly "microsoft.sqlserver.smo, culture=neutral, version=11.0.0.0, PublicKeyToken=89845dcd8080cc91"; import-module sqlps -disablename [int]$Duration = 9 # minutes, how long you want this script to run, this is to address potential multiple missing row errors for deletion # At the end of script, it will reverse sp_MSDEL_XXX to its original state [boolean]$debug = $true; #when $true, it will print out the modifying script for sp_MSDEL_xxx [string]$Distributor='tp_w520'; #replication distributor server name [string]$SubSvr = 'tp_w520'; #subscriber server name [string]$SubDB = 'testdb'; #subscription database [string]$PubDB = 'testdb'; #publication database [string]$SPName = ''; [datetime]$end_time = (get-date).AddMinutes($duration); $svr = new-object "microsoft.sqlserver.management.smo.server" $SubSvr; $db = $svr.databases[$SubDB]; #first let's see what's the latest error 20598 [string]$maxid = '0'; [string]$xact_seq='0x00'; [string]$command_id='0' $qry = "select id from dbo.MSPublisher_Databases where publisher_db='$pubdb';" $result2 = invoke-sqlcmd -ServerInstance $Distributor -database Distribution -Query $qry; while ((get-date) -le $end_time) { #start of while $qry = @" declare @id int; select @id = max(id) from dbo.MSRepl_Errors where id > $maxid and error_code=20598 and time > dateadd(minute, -20, getdate()) and (xact_seqno<>$xact_seq or command_id <> $command_id); select top 1 id=cast(id as varchar(8)), xact_seqno=convert(varchar(42), xact_seqno,1), command_id=cast(Command_id as varchar(8)) from dbo.MSRepl_Errors where id = @id "@ $result = $null; $result=invoke-sqlcmd -ServerInstance $Distributor -Database Distribution -Query $qry; if ($result -eq $null) { start-sleep -Seconds 15; continue; } else { $maxid = $result.ID; $xact_seq=$result.xact_seqno; $command_id = $result.command_id; } $qry = @" exec sp_browsereplcmds @xact_seqno_start='$($result.xact_seqno)', @xact_seqno_end='$($result.xact_seqno)' , @command_id=$($result.command_id), @publisher_database_id=$($result2.id); "@; $result3 = invoke-sqlcmd -ServerInstance $Distributor -database Distribution -Query $qry; [string]$odbc_qry = $result3.command; [string]$reg_pattern ="(sp_msdel.+)(?=])" if ($odbc_qry -match $reg_pattern) { if ($SPName -eq $matches[1]) { get-sleep -seconds 10; continue; } else { if ($sp -ne $null) {#means we have different SP, so we need to restore the original $sp, i.e. uncommment out the commentted lines $sp.TextBody = $textbody; $sp.alter(); } $SPName = $matches[1]; } } else { write-error 'The current error cannot be solved by this solution, which aims only at missing rows when deleting.' return; } $sp = $db.storedprocedures.Item($SPName); $textbody = $sp.TextBody; $new_body =''; if ($TextBody -match "if @@rowcount[\w\W]+(?=end)") { $origtext = $matches[0] $replace_str = "/* " + $origtext + " */ `n"; $newbody = $textbody.replace($origtext, $replace_str) } if ($debug) { [string]$header = $sp.TextHeader $header = $header.replace('create proc', 'alter proc'); write-host @" use [$($SubDB)] $header $newbody "@ break; # if $debug, we need to break out because the replication is stuck in the first error # and thus there is no need to continue check } else { $sp.textbody = $newbody; $sp.alter(); } }#end of while if (-not $debug ) { if ($sp -ne $null) { # we have modified the SP, so now we need to reverse it back $sp.TextBody = $textbody; $sp.alter(); } }
How to Test
Setup replication with a few articles in a test environment (SQL Server 2005+), populate these tables and make sure the publication and subscription tables are in synch. First delete a few rows on each table at the subscription side and then delete the same (or more) rows on the publication side. Then you can start the replication monitor to see that replication error 20598 occurs. Now run this script in a PowerShell ISE (by providing the correct replication servers information in the script) and see how the replication error disappears.
Note: The replication distribution agent may stop after a bunch of errors. If so, just restart it, so the error can be logged into distribution.dbo.MSRepl_Errors table.
Next Steps
The script can be run from any computer as long as you can connect to the distributor/subscriber from this computer.
- Test the script in a test environment to get familiar with the script behavior
- Read this tip: Handling Data Consistency Errors in SQL Server Transactional Replication
- You can actually combine this script with the script in part 1 to make a complete solution for replication error 20598
- If your environment is not complex, for example, you do not have unique key or user-defined check constraints in your subscription tables (except for the PK or FK constraints), you can use a different solution, for example, select top 1 * from the publication table, and replace the row with the primary key values retrieved from sp_browsereplcmds then insert this row into the subscription table so replication can delete this row and move on. This method will avoid any modification to the sp_MSDEL_xxx stored procedure, and in my opinion, it is the best solution in an environment where any changes will be audited and need to be explained.
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips