SSIS Bad Habits: Using Sort and Merge Join Components
Although integrating data from multiple sources with the SSIS Merge Join alongside Sort components is a popular approach, it performs poorly, especially when handling medium and large data sets. In this tutorial, we look at three different ways of merging multiple data sources and compare their performance.
This tutorial explains the main cause of the performance problem faced while merging different data sources in SQL Server Integration Services and provides a few alternatives.
SSIS data transformation types
Before we start, it is worth mentioning the three types of transformations in SSIS:
- Non-blocking components: It is the fastest type of component. Data is manipulated synchronously, meaning that the component output uses the same buffer as the input. In other words, the number of input rows is always equal to the number of output rows. For example: derived column, data conversion, conditional split transformations.
- Semi-blocking components: Manipulate data asynchronously, which means that the component output uses a different buffer from the one used by the input data. In addition, semi-blocking implies that the component must wait for a certain amount of input data before it can generate a set of output rows. For example: Merge Join and Term Lookup transformations.
- Blocking components: Blocking means that the component must read all input data before generating the output data. For example: Sort and aggregate transformations.
The following table summarizes the difference between these types:
| |Non-blocking|Semi-blocking|Blocking|
|---|---|---|---|
|Input rows count affected|False|Possible|Possible|
|The output buffer is the same as the input buffer|True|False|False|
|Input rows count required to generate an output|1 row|One or more rows|All input data|
|Performance|Fastest|Slower than non-blocking|Slower than semi-blocking|
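To make the distinction concrete, here is a minimal sketch (not SSIS code; function and column names are hypothetical) contrasting a non-blocking transformation, which emits each output row as soon as its input row arrives, with a blocking one, which must read the entire input first:

```python
# Illustrative sketch (not SSIS code): how non-blocking and blocking
# transformations consume their input. All names are hypothetical.

def derived_column(rows):
    """Non-blocking: each input row yields one output row immediately."""
    for row in rows:
        yield {**row, "full_name": f"{row['first']} {row['last']}"}

def sort_rows(rows, key):
    """Blocking: every input row must be read before any output row."""
    buffered = list(rows)            # materialize the whole input
    buffered.sort(key=lambda r: r[key])
    yield from buffered

rows = [{"first": "Ada", "last": "Lovelace"},
        {"first": "Alan", "last": "Turing"}]

streamed = derived_column(iter(rows))
print(next(streamed)["full_name"])   # first row is available right away
```

The non-blocking generator never holds more than one row at a time, while the blocking sort must buffer everything, which is exactly why the Sort component is so expensive on large inputs.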
Why does SSIS Merge Join require sorting?
To answer this question, let us assume that there is no merge join component, and we are currently looking to design a component that allows joining two sources based on a key column (similar to INNER JOIN in SQL). Based on the above explanation, INNER JOIN operations cannot be executed synchronously since the output rows are not always equal to the input rows.
The SSIS data flow task transfers data as batches using pipelines. This means that there are two possible approaches for joining multiple sources within a data flow:
- Add a blocking component that reads all data from both data sources into two in-memory result sets. Then, the component should loop over each row in the first result set and search for the matching rows in the second one.
- Add a semi-blocking component that requires two inputs sorted on the column used as the joining key. At each step, the component compares the key values from the two input buffers. If they are equal, it outputs a joined row and continues. If the key values do not match, the component discards the row with the smaller key value and reads the next row from that input. Since the inputs are sorted, discarding that row is safe: its key is smaller than every remaining key in the other input, so it can never participate in another join operation.
While the first approach requires a lot of memory and computations to join both input data sources, the second approach is more realistic, especially when handling medium and large data sets or executing the operation on a machine with limited resources. On the other hand, designing the merge join as a semi-blocking component rather than reading all data and sorting it in memory will allow the SSIS developers to avoid sorting the data within the data flow and adopt some alternatives.
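The semi-blocking comparison loop described above can be sketched as follows. This is a simplified illustration, not the actual Merge Join implementation: it assumes unique keys on both sides and uses hypothetical row data.

```python
def merge_join(left, right, key):
    """Inner-join two inputs that are already sorted on `key`.

    Mirrors the semi-blocking logic described above: compare the current
    key from each input; on a match, emit a joined row; otherwise discard
    the row with the smaller key, since it cannot match anything that
    remains in the other (sorted) input. Assumes unique keys for brevity.
    """
    left, right = iter(left), iter(right)
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if l[key] == r[key]:
            yield {**l, **r}
            l, r = next(left, None), next(right, None)
        elif l[key] < r[key]:
            l = next(left, None)   # smaller key can never match again
        else:
            r = next(right, None)

users = [{"Id": 1, "Name": "Ann"}, {"Id": 2, "Name": "Bob"}, {"Id": 4, "Name": "Eve"}]
posts = [{"Id": 2, "Title": "Hello"}, {"Id": 3, "Title": "Hi"}, {"Id": 4, "Title": "Bye"}]
print(list(merge_join(users, posts, "Id")))
```

Note that the function only ever holds one current row per input, which is what makes the semi-blocking design so much cheaper in memory than buffering both result sets.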
The following section will describe the alternatives of using the SSIS merge join and sort component together to perform a join operation.
Joining the data in the source database
One approach that can be very efficient is joining the data within the source database. Even though join operations can be expensive in some cases, several factors make joining the data within the source database more efficient:
- The database engine query optimizer tries to execute the SQL commands most efficiently.
- There is a possibility of creating indexes and statistics that can support the join operation and decrease the amount of memory required.
- No network overhead.
- Sorting data could be performed by indexes rather than in memory.
If the source data set is an external file or resides on another server, these reasons make it worth bulk inserting the data into a staging table and performing the join operation within the database.
Important note: If the database server does not have sufficient resources or other heavy workloads are performed, this approach may not be efficient, and joining data on the ETL server could be more performant.
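The staging-table pattern can be sketched as follows. SQLite stands in for SQL Server purely for illustration, and the staging table and column names are hypothetical; in a real package, the loads would be bulk inserts from the SSIS data flow.

```python
import sqlite3

# Sketch of the staging-table pattern: bulk-load both sources into the
# database, then let the engine perform the join. SQLite stands in for
# SQL Server here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_users (Id INTEGER PRIMARY KEY, DisplayName TEXT)")
conn.execute("CREATE TABLE stg_posts (Id INTEGER PRIMARY KEY, OwnerUserId INTEGER, Title TEXT)")

conn.executemany("INSERT INTO stg_users VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob")])
conn.executemany("INSERT INTO stg_posts VALUES (?, ?, ?)",
                 [(10, 1, "First post"), (11, 2, "Second post")])

# The engine's optimizer chooses the join strategy; no in-pipeline sort
# or merge join component is needed on the ETL side.
rows = conn.execute("""
    SELECT u.DisplayName, p.Title
    FROM stg_users AS u
    INNER JOIN stg_posts AS p ON p.OwnerUserId = u.Id
    ORDER BY u.Id
""").fetchall()
print(rows)
```

The data flow then only needs a single source component running the join query, feeding the destination directly.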
Using covering indexes
The second approach is to create covering indexes for your SELECT queries so the data is loaded already sorted in a specific order. This is an important performance-tuning technique: it avoids in-memory sort operations even when the source database server does not have sufficient resources to complete the join operations, or when the source data sets are on different servers. In that case, you can create covering indexes for your ordered queries on each server and load the data in sorted order.
You can learn more about covering indexes in the following article: SQL Server Covering Index Performance (mssqltips.com).
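The idea can be sketched as follows, again with SQLite standing in for SQL Server and hypothetical table and index names: when the index contains every column the query needs, the ordered read is served by the index rather than by a separate sort.

```python
import sqlite3

# Covering-index sketch (SQLite standing in for SQL Server): the index
# contains every column the query reads, so the ordered scan is served
# by the index instead of an in-memory sort. Names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (Id INTEGER, DisplayName TEXT, Age INTEGER)")
conn.executemany("INSERT INTO Users VALUES (?, ?, ?)",
                 [(2, "Bob", 40), (1, "Ann", 35), (3, "Cal", 28)])

# Index keyed on Id, also covering DisplayName.
conn.execute("CREATE INDEX IX_Users_Id ON Users (Id, DisplayName)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT Id, DisplayName FROM Users ORDER BY Id"
).fetchall()
print(plan)        # the plan reads the covering index, with no sort step

rows = conn.execute(
    "SELECT Id, DisplayName FROM Users ORDER BY Id"
).fetchall()
print(rows)        # already sorted as it streams out of the index
```

A source query backed by such an index lets you set IsSorted on the SSIS source output truthfully, with no Sort component in the pipeline.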
Minimizing the number of sort components
Many developers think that they should never use the Sort component. However, there are some scenarios where the alternatives cannot be implemented. In such cases, we should minimize in-memory sorting as much as possible, even if we can only avoid it on one of the SSIS Merge Join component inputs.
To illustrate the performance impact of using the database engine to perform the join or sort operation, we ran an experiment using the StackOverflow2013 database. We created three packages that export data from the Users (2,465,713 rows) and Posts (17,142,169 rows) tables into a destination table. The results are compared at the end.
The first package uses a sort component for each data source, then uses the Merge Join component to combine both sources before inserting the data into the destination table.
We removed the sort components in the second package and created the needed indexes on both data sources. Then we used Merge Join to combine them.
To use the Merge Join transformation in this scenario, we should open the advanced editor of each OLE DB source and set the IsSorted property to True on each output buffer.
Next, we should set the SortKeyPosition property to 1 on the column used for sorting.
We removed the Merge Join component in the third package and used a SQL command to join both data sources.
The data is imported directly to the OLE DB destination component.
Comparison of Findings
We executed the packages on a laptop with four logical processors and 20 GB of RAM. The execution results were as follows:
- The first package took about four days to complete the execution.
- The second package took one day to complete the execution.
- The third package took about 17 minutes to complete the execution.
These results show that more than three days of execution time were saved by removing the Merge Join and Sort components from the data pipeline and implementing the join in the database engine.
After improving the performance of the Merge Join operation in SSIS, you can read more about SSIS package performance optimization methods: SQL Server Integration Services SSIS Performance Tuning Techniques (mssqltips.com)
Article Last Updated: 2022-04-05