Synchronize Table Data Using a Merge Join in SSIS

By:   |   Comments (14)   |   Related: More > Integration Services Data Flow Transformations


Problem

We have a need to synchronize data in tables across SQL Server instances. There are multiple ways this can be done such as using link servers, replication, TableDiff utility, etc., but in this tip I am going to demonstrate one-way synchronization using an SSIS Merge Join.

Solution

I have two SQL Server instances: serverA and serverB. I also have two different databases, TestingA on serverA and TestingB on serverB. Both of these databases have the same data in table Customer_Detail with a unique key of custID.   Then we changed some of the data on serverA and now I want the data on serverB to be the same for reporting purposes, so based on that I will demonstrate how to synchronize this data from serverA to serverB using a Merge Join in SSIS.

Below is what my SSIS Package looks like. I will discuss each of the steps in the package.

SSIS Package Synchronization

Step 1: Initiate SQL Connections

I have two OLE DB Sources, one for serverA (source) and the other for serverB (destination). Below are how these connections are setup.

Source connection: serverA

serverA - Description: serverA connection

Source connection: serverB

serverB - Description: serverB connection

Step 2: Apply sorting on tables

When applying a Merge Join, the data needs to be sorted for both inputs, hence I am applying a sort operation to both sides. An alternative to this step is to pre-sort the data in a query and use a SQL command instead of table. This needs to be done for both inputs.

SSIS Data sorting - Description: apply sorting

Step 3: Applying Merge Join

I have applied a full outer join on the unique column custid in the Merge Join. I need to get NULL records from the right and left tables where the custid does not match and this will determine if we need to perform a delete or insert operation.

Full Outer Join as Merge Join Transformation Editor - Description: apply merge join

Step 4: Split the Records

In the next step, I have applied a conditional split.  I have added three different conditions based on the full outer join result. These are the three conditions:

  1. UPDATE (custID match): If the custID exists on both the source and destination then we do an update.
  2. INSERT (custid not found in destination): If custID is found only on the left (source) then we need to insert this data into the destination table.
  3. DELETE (custID is not found in source): If custID is not found in the source table then we need to delete this data from the destination table.
Conditional Split Transformation Editor - Description: split operations

Step 5: Find Updated Record

As mentioned above the UPDATE (custID Match) condition, it returns the data where custID matches. We don't know if the data has changed on the source, so to minimize updating every record I have added a conditional split that compares each column to see if any data is different.  If it is, then we will do the update and if not we will not update the record.  Only the records that are different will go to Step 6.

Condition

(custName! = [custName (1)]) || (custAddress!= [custAddress (1)]) || (pincode != [pincode (1)]) || (buaddress != [buaddress (1)]) || (varmobile != [varmobile (1)]) || (dtentrydate != [dtentrydate (1)]) || (ref_userid != [ref_userid (1)])
finding updated records in the Conditional Split Transformation Editor - Description: finding updated records

Step 6: Performing Insert, Update and Delete to Destination Table

So we now have three streams of data to do either an Insert, Update or Delete.

Insert Records

I have an OLE DB Destination connection object Customer_detail and mapped the input columns as follows.

Insert the data
Mapping data

Update Records

For updating the data, I have an OLE DB Destination connection and added a sqlcommand as follows using the Advanced Editor.

updataing data - Description: Conditional base 
Updating the records
updating the data - Description: Updating the data

After initiating the command with parameter “?”, I then applied the input column mapping with available parameter destination columns based on the destination table.

Mapping columns - Description: Mapping columns

DELETE Records

For deleting data, I have an OLE DB Destination connection and added a sqlcommand as follows using the Advanced Editor.

Deleting the record - Description: Delete the records
configure command - Description: delete the records

For the delete record the column mapping is as follows.

Mapping the data - Description: Mapping columns

Running the Package

After setting up all of the above steps, randomly changing some of the data, I then run the package and get the following.

Synchronization  - Description: Package execution

Conclusion

As per the above package synchronization, I have performed insert, update and delete operations with data from the source to the destination. In SSIS, there are also other methods available for synchronization data such as using the tablediff utility in SSIS. I suggest you find the method that works best for your business needs.

Next Steps


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Bhavesh Patel Bhavesh Patel is a SQL Server database professional with 10+ years of experience.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips



Comments For This Article




Wednesday, December 9, 2020 - 12:28:11 AM - Noman Back To Top (87893)
I am following this procedure to segregate update and insert operations on my table but I have a question. Do I need to define the parameters first? when I use the SQL command with "?" the column mapping step doesn't show me any destination tables :(

Tuesday, May 14, 2019 - 9:22:59 PM - Henry Stinson Back To Top (80078)

Good article.  I would suggest that, since SORT is a blocking operation in SSIS, that the sort be done in the query (or stored procedure)  that pulls the data from A & B.  The Merge-Join is also a blocking operation, but you can tell that operation that each input data set is pre-sorted, to speed up the Merge-Join.

Otherwise, normally, Merge-Join would block to perform a sort.

I kind of skimmed over the rest of the article.  I plan to go back and study it in detail.


Thursday, May 9, 2019 - 9:21:13 AM - Oded Dror Back To Top (80003)

Bhavesh,

Is there any way you can give us the solution so we can test it?

Thanks,

Oded Dror


Tuesday, March 19, 2019 - 8:58:24 AM - JR Back To Top (79336)

 In the Update step, how would you handle a composite key with parameters? For example, if you have multiple columns as a key, instead of just a custid?

 


Saturday, February 23, 2019 - 4:57:23 AM - ashish jain Back To Top (79102)

 Thanks For the post. It helps a lot. I am not judging you just wanted correction so it will help others.

Change is required in the documentation for updated and delete mentioned OLEDB destination, But screenshot shows it is OLEDB command and that is correct. Actually, OLEDB destination doesn't support parameter for all parameter based destination OLEDB command is task is available.

https://docs.microsoft.com/en-us/sql/integration-services/data-flow/ole-db-destination?view=sql-server-2017

So change will be replaced

I have an OLE DB Destination

with

I have an OLE DB Command.


Monday, January 28, 2019 - 10:58:22 AM - David Potter Back To Top (78895)

I have a problem... Once I build the SQL command.. column = ?.. for each column in the table.. column mappings shows error.. 518 No Column information was returned by the SQL Command... but if I go into designer it shows the table and all the columns.. so not sure what the problem is.


Tuesday, May 29, 2018 - 10:41:52 AM - Mark Landry Back To Top (76043)

PS This behavior occurs for me b/c I chose to sort the data at the source ("order by") and removed the sort tasks from the package. The sort tasks will inherently read in all records before completing. So those tasks effectively buffer all input rows and there's no chance of the catch-22 I encountered.


Wednesday, May 16, 2018 - 2:34:48 PM - Mark Landry Back To Top (75959)

Overall this is an excellent example and saved me a good deal of effort. But it has one major fault that's not obvious.

If the target table is larger than the SSIS buffer, then the package could hang. That's because the SELECT from the target table (ServerB OLEDB Source) will transactionally block all subsequent INSERTs and UPDATEs. Since the SSIS pipeline is full, no more rows can be read from the target table and the package hangs indefinitely. It's not a database deadlock because the SELECT is free to continue but SSIS won't read any more rows because the buffer is full.

My solution was to increase DefaultMaxBufferSize and DefaultMaxBufferRows to accomodate the entire source table into the pipeline.

See https://www.desertislesql.com/wordpress1/?p=318


Tuesday, May 8, 2018 - 10:27:20 AM - Patel Bhavesh Back To Top (75899)

Hi Priya,

You can go with the primary key column but the case is every time same data updates if you not got the actual updated record.
There are multiple option you can keep and used method as per appropiate your buisness scenario..


 


Monday, May 7, 2018 - 4:20:45 PM - PRIYA Back To Top (75884)

 HELLO,

In the example that you have given there are only 7 columns so in the conditional split you have given all the 7 columns.  but the table im working on has 60+ columns.

do I have to give all the 60+ columns or is it okay if i just give the primary key columns?

Appreciate your response

 


Thursday, March 29, 2018 - 11:02:07 PM - Daniel Dickler Back To Top (75556)

Well done and very useful! Thanks for sharing!


Thursday, February 15, 2018 - 3:42:21 AM - Alex Tabone Back To Top (75217)

Many thanks for sharing this tutorial. One small correction: in Step 6, under "Update Records", it should be "OLE DB Command" (not "OLE DB Destination connection").


Sunday, December 3, 2017 - 12:20:50 AM - Bhavesh Patel Back To Top (73571)

Thanks Gerardo Arevalo for this advice. some times this sorting operatior cost created problem in server so with respect of this, I have mentioned alernate in step2. It is better to use T-SQL with Order statement instead of sorting component. for more clearification I have added reference of such discussion..

https://social.msdn.microsoft.com/Forums/sqlserver/en-US/231079cb-9aad-4c0f-b126-73d2acee1629/tuning-ssis-sort-transformation-vs-using-tsql-order-statement?forum=sqlintegrationservices

 

 

 

 

 

https://social.msdn.microsoft.com/Forums/sqlserver/en-US/231079cb-9aad-4c0f-b126-73d2acee1629/tuning-ssis-sort-transformation-vs-using-tsql-order-statement?forum=sqlintegrationservices

 


Tuesday, November 21, 2017 - 10:06:07 AM - Gerardo Arevalo Back To Top (70057)

Nice post, well documented. But, a  notice of caution before trying to implement that on a production server: sort and merge join components run entirely on memory, the SSIS server must have plenty enoung free memory to sort (twice) all the data, and then compare between them (join). I have seen to fail implementations like this while in dev boxes ran succesfuly as the (T/SQL) load in the prod server usually is ten times the dev load, letting less usuable memory for SSIS operations.















get free sql tips
agree to terms