By: Jeffrey Yao | Comments (6) | Related: > Replication
Problem
When using SQL Server replication, from time to time we will see replication error 20598 which is a pain to fix without having to resort to snapshot synchronization or error skipping. In this tip I show you an easier approach to fix these issues.
Solution
In SQL Server transactional replication, error 20598 refers to a missing row on the subscriber side, and there are two scenarios that can cause this error:
- An UPDATE command cannot be replicated because there is no record that matches the update condition on the subscriber side.
- A DELETE command cannot be replicated because there is no record that matches the delete condition on the subscriber side.
In this tip, we will address the UPDATE scenario.
Manual Way to Fix SQL Server Error 20598 for an UPDATE
The traditional manual way to fix this is:
- Start the replication monitor and in the replication monitor you can see
the error message as shown in Fig_1 below
- Copy the transaction sequence number, in this case it is:0x0000003E00000041000800000000
and notice Command ID 2 in the message
- Open an
SSMS
window and connect to the distributor and run the following command and you
will get results similar to Fig_2 below
USE Distribution; EXEC sp_browsereplcmds '0x0000003E00000041000800000000', '0x0000003E00000041000800000000' GO
- Focus on the [command] and [command_id] columns. Copy the value from the
[command] column where [command_id]=2
- From step 4, you can figure out the table name on which the replication
command fails. The table name is embedded in the system generated stored procedure
name. In my case, the [command] column value (for command_id=2) is as shown
below. The stored procedure is sp_MSupd_dbot2 and from
this we can tell that my table is dbo.t2.
{CALL [sp_MSupd_dbot2] (,,,,,'nice 3',3,'good',2014-08-31 07:49:47.960,{C1479523-6839-4C90-9429-EE31CD2D5831},0x20)}
Open an SSMS window and connect to subscriber database and exec sp_helptext sp_MSupd_dbot2. From this you can figure out the missing row based on the values of the parameters of sp_MSUpd_dbot2 (i.e. @pkc1,...,@pkcN) - You can then insert the missing row on the subscriber table using the following format:
insert into <subscriber_server>.<sub_db>.<schema>.<table_name> (<column_list>) select (<column_list>) from <publisher_server>.<pub_db>.<schema>.<table_name> where pk_col_1 = <value1> and pk_col_2 = <value2> and pk_col_n = <valuen>
or you can export the missing data from the publishing table to a file and then import the data back to the subscription table.
Fig_1: Replication Error 20598 in Replication Monitor
Fig_2: SP_BrowseReplCmds result
Technical Challenges to Automate this Process
Basically, we will try to mimic the steps as outlined in the manual steps mentioned above. However, there are a few technical challenges before we can have a qualified solution.
- We need to extract the table name based on the result of sp_browsereplcmds.
- We need to find the information about the primary key columns and their matching values for the table, so we can select the missing row from the publication table.
- We need to move the data from the publisher table to the subscriber table without any changes to the existing environment, such as adding a linked server or creating a specific folder/file etc.
PowerShell Solution to Automate Fixing Error 20598 for UPDATEs
The solution will use PowerShell to:
- connect to each replication server (Distributor, Publisher, Subscriber)
- then check dbo.MSRepl_errors for error code 20598 and grab the [xact_seqno] and [command_id] values
- use this data to connect to the distributor
- run sp_browsereplcmds to retrieve the table name and the primary key values
- use the table and the primary key values to retrieve the missing row from the publication table
- and then insert the row into the subscription table.
<#Assumption: 1. the schema of all the publishing tables remains the same, and by default, it is 'dbo' 2. The computer on which this script runs has sql server 2012 PowerShell module installed 3. The computer on which this script runs has PowerShell V3 or above and .net 3.5 or above installed Usage: 1. modify $schema / $PubSvr / $PubDB / $SubSvr / $SubDB / $Distributor to be your environment settings 2. Test in the test environment by initially set $debug = $true, recommended to run in the ISE first to get familiar with the script 3. When there are potential many rows missing (for update), please set $duration to a proper time (like 10 min), and it will loop to fix all the missing rows one by one. #> #Requires -Version 3.0 import-module sqlps -DisableNameChecking; [int] $duration = 3 # in minutes, how long this script will run, this is to handle multiple missing rows [boolean]$debug = $true; # when $true, it will print out a query statement that can be executed on the subscriber server to insert the missing row # it will not do any real fix work as such it will only print out the first fixing query. [string]$schema='dbo';# the schema of the all publishing tables, [string]$PubSvr = 'tp_w520\sql2k8r2'; #publisher, in my example, this is sql server 2008R2 [string]$PubDB = 'testdb'; #publication database [string]$SubSvr = 'tp_w520';#subscriber, in my example, this is sql server 2012 [string]$SubDB = 'testdb'; #subscription database [string]$Distributor = 'tp_w520';#replication distributor [datetime]$end_time = (get-date).AddMinutes($duration); #first let's see what's the latest error 20598 [string]$maxid = '0'; [string]$xact_seq='0x00'; [string]$command_id='0' $qry = "select id from dbo.MSPublisher_Databases where publisher_db='$pubdb';" $result2 = invoke-sqlcmd -ServerInstance $Distributor -database Distribution -Query $qry; while ((get-date) -le $end_time) { $qry = @" declare @id int; select @id = max(id) from dbo.MSRepl_Errors where id > $maxid and error_code=20598 and time > dateadd(hour, -1, getdate()) and (xact_seqno<>$xact_seq or command_id <> $command_id); select top 1 id=cast(id as varchar(8)), xact_seqno=convert(varchar(42), xact_seqno,1), command_id=cast(Command_id as varchar(8)) from dbo.MSRepl_Errors where id = @id "@ $result = $null; $result=invoke-sqlcmd -ServerInstance $Distributor -Database Distribution -Query $qry; if ($result -eq $null) { start-sleep -Seconds 15; continue; } else { $maxid = $result.ID; $xact_seq=$result.xact_seqno; $command_id = $result.command_id; } $qry = @" exec sp_browsereplcmds @xact_seqno_start='$($result.xact_seqno)', @xact_seqno_end='$($result.xact_seqno)' , @command_id=$($result.command_id), @publisher_database_id=$($result2.id); "@; $result3 = invoke-sqlcmd -ServerInstance $Distributor -database Distribution -Query $qry; [string]$odbc_qry = $result3.command; [string]$reg_pattern ="sp_msupd_$($schema)(.+?)\]" if ($odbc_qry -match $reg_pattern) { $TableName ="$($schema).$($matches[1])" } else { write-error 'The current error cannot be solved by this solution, which aims only at missing rows when updating.'} #need to get the parameter values and put into $pm if ($odbc_qry -match "(\{CALL \[sp_MSUpd_.+)\]") { if ($odbc_qry -match "\((.+?)\)") { [string]$pm = $matches[1]; } else { write-error "cannot find parameters"; return; } } else { write-error 'The current error cannot be solved by this solution, which aims only at missing rows when updating.' return; } #next we need to script out the columns of the Subscription Table $svr = new-object "microsoft.sqlserver.management.smo.server" $SubSvr $db = $svr.databases[$subdb]; $tbl=$db.Tables[$TableName.split('.')[1], $TableName.split('.')[0]]; [string]$col_list = ""; $tbl.Columns | % {$col_list += $_.Name + ',' }; $col_list = $col_list.remove($col_list.length-1); #we need to grab the where clause in the replication update stored proc of the subscription table $p = $db.StoredProcedures["sp_MSUpd_$($TableName.replace(""."", """"))"]; if ($p -eq $null) { write-error "cannot find the stored proce of sp_MS$($Fix_Type)_$($TableName.replace(""."", """"))" return; } $ar = $pm.split(",") [int]$i = 0; $where_clause = @{}; while ($i -lt $p.parameters.count ) { if ($p.parameters[$i].name -match '@pkc[0-9]{1,2}') { switch ($p.parameters[$i].DataType.Name) { 'datetime' { $ar[$i] = "'" + $ar[$i] + "'"; break;} 'uniqueidentifier' {$ar[$i]=$ar[$i].replace("{", "'");$ar[$i]=$ar[$i].replace("}", "'"); break;} } $where_clause.add($p.parameters[$i].name, $ar[$i]); #$where_clause. += New-Object -TypeName PSObject -Property @{ $($p.parameters[$i].name)=$ar[$i]; } } $i += 1; } $re = new-object "System.Text.RegularExpressions.regex" ("where.+?(?=if)", [System.Text.RegularExpressions.RegexOptions]::Singleline) $where=$re.match($p.TextBody) [string]$str = $where.Value; foreach ($k in $where_clause.keys) { $str=$str.replace($k, $where_clause[$k]); } [string]$qry = "select $col_list from $TableName $str"; if ($debug) { "insert into $subsvr.$subdb.$tablename ($col_list) ` select $col_list from $pubsvr.$pubdb.$TableName $str; " } else { # start to auto insert the missing row [System.Data.datarow]$r=$null; $r = invoke-sqlcmd -ServerInstance $SubSvr -Database $SubDB -Query $qry; if ($r -eq $null) {#if subsvr.subdb does not have the row #start to insert the missing row from the pubsvr.pubdb.tablename [string]$conn = "Data Source=$SubSvr; trusted_connection=true; initial catalog=$SubDB;" $sqlbcp = new-object "system.data.sqlclient.sqlbulkcopy" ($conn, [System.Data.SqlClient.SqlBulkCopyOptions]::KeepIdentity) $r=$null $r = invoke-sqlcmd -ServerInstance $PubSvr -Database $PubDB -Query $qry; $sqlbcp.DestinationTableName = $TableName; $tbl.Columns | % {$sqlbcp.ColumnMappings.Add($_.name, $_.name) } | out-null; $sqlbcp.WriteToServer($r); }#if subsvr.subdb does not have the row } #end to auto insert the missing row start-sleep -Seconds 15; # repl error will take some time before being written to distribution.dbo.msrepl_errors table. }
How to Test
To test this fix do the following:
- Set up replication with an article in a test environment and add some data to the table.
- Then delete multiple rows on the subscriber
- Then update these deleted rows on the publishing table.
- Then start the replication monitor to see that the replication error 20598 occurs (just as Fig_1 shows).
- Now run this script in a PowerShell ISE (by providing the correct replication servers information in the script) and the replication errors should disappear.
Note: The replication distribution agent may stop after a bunch of errors. If so, just restart it so the error can be logged into the distribution.dbo.MSRepl_Errors table.
Next Steps
The script can be run from any computer as long as you can connect to the distributor, publisher and subscriber from that computer.
- Test the script in a test environment
- Read this tip to learn about Handling Data Consistency Errors in SQL Server Transactional Replication
- You can customize the script to adapt to your needs. For example, in my environment I created a stored procedure [uspA] with parameters for distributor/publisher/subscriber and I also created a table [tblReplicationConfig] which will be populated by [uspA]. The PowerShell script is customized to retrieve the data from [tblReplicationConfig] and the PowerShell script is embedded in a job step of [jobA]. The [uspA] will first populate [tblReplicationConfig] with the input parameters and then start the Job [jobA]. The PowerShell script in [jobA] will then pick up the information from [tblReplicationConfig] and then try to fix the error. So the scenario is: when I receive replication 20598 errors, I run [uspA] with parameters and then [jobA] will fix the replication error if these 20598 errors are caused by an UPDATE on a missing row.
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips