How to make an SSIS Merge Join transformation fail safe from sorting irregularities
By: Siddharth Mehta | Comments (2) | Related: More > Integration Services Data Flow Transformations
The SQL Server Integration Services (SSIS) Merge Join transformation requires input data streams to be sorted for its operation. If during the design time you try to join two non-sorted inputs, it won't allow you to do so. One of the options is to sort the data in the database rather than using the Sort transformation in SSIS, which is a more speedier option. But if the sort order is changed unknowingly, SSIS would not raise an exception and the results would be unpredictable.
In any general SSIS Development approach that involved SQL Server as the source database, we use mostly parameterized stored procedures or views to fetch data from the source tables and sort it as per the requirement using the ORDER BY statement. For the discussion of this problem, we would consider stored procedures only. This approach of sorting data using ORDER BY introduces two different issues:
1) Stored procedures which sort data in this way, are not directly consumable for any other requirements that need the same data set but need a different sort order to use with Merge Join transformation. So the approach that one would take is, use Sort transformation in the SSIS package or recreate the same logic with a different sort order.
2) For some reasons, if one inadvertently changes to sort order or intentionally changes the sort order to tailor it for use in his/her own package and the original package which expects a particular sort order would not even realize that the package logic has been broken due to sort order changes.
The solution for this is to make the Sorting parameterized from the package. And in this tip we would simulate the problem and see how this solution can be implemented.
Follow the steps below to simulate the problem, and then we would look at how to solve this problem.
1) Download and install AdventureWorks database
2) Create a new SSIS Project, name it something meaningful. I did this test on the R2 version, and I named it Rnd R2.
3) We would use HumanResources.Employee and HumanResources.EmployeeAddress tables from the Adventureworks database for our test. Create two simple stored procedures as seen in the picture below.
4) Add a DataFlow Task to the package, and add two OLEDB Source Adapters. Name one of them as Employee and other as EmployeeAddress, and use the stored procedures we developed in both of these adapters accordingly. To use this stored procedure select "Data Access Mode" as "SQL Command" and key in "SQL command text" as "EXEC GetEmployees" for example.
5) Now right-click the source adapter and configure the sort hints as shown in the picture below. This provides hints to the stream that the data is sorted and sorting is done on EmployeeID. If you see carefully, we sorted data in both of our stored procedures using the same column.
6) Now add a Merge Join transformation, use Inner Join as the join type and join both of these sources. For our example, to keep things simple, we would just use Multicast as a destination. After you are done with this, your package should look something like the below picture.
7) Modify the GetEmployeeAddress and change the Order By from EmployeeID to AddressID and see the result. If the GetEmployeeAddress is sorted by EmployeeId it returns 290 rows, and after changing the sorting in the stored procedure to AddressID, only 5 rows are returned as shown below.
Now its time to implement our solution, and take control of the sorting into SSIS. For this we need to make the ORDER BY statement parameterized. Modify the GetEmployeeAddress as shown in the below picture. Also modify the "SQL Command Text" of EmployeeAddress Source Adapter to "EXEC GetEmployeeAddress 'EmployeeID'".
Execute the package and it should return again all the 290 rows. By implementing this solution, we take the control of sorting at the discretion of SSIS requirements, also making the stored procedure reuseable for all of the different packages that need to use the same logic, but different sorting. Also this reduces the risk of unpredictable behavior with the Merge Join transformation due to changes in the sort order.
- Implement the same steps for GetEmployee stored procedure and test the solution.
- Try implementing this logic in your existing solutions to make it more robust and controlled.
- Check out the SQL Server Integration Services tips.
- Check out the SQL Sever Integration Services tutorial.
About the author
View all my tips