SSIS Bad Habits: Using Sort and Merge Join Components

By:   |   Updated: 2022-04-05   |   Comments (4)   |   Related: > Integration Services Performance


Problem

Even if it is a popular approach of integrating data from multiple sources, using the SSIS Merge Join alongside the sort components has poor performance, especially while handling medium and large data sets. In this tutorial, we look at three different ways of merging multiple data sources and the performance differences.

Solution

This tutorial explains the main cause of the performance problem faced while merging different data sources in SQL Server Integration Services and provides a few alternatives.

SSIS data transformation types

Before we start, it is worth mentioning the three types of transformations in SSIS:

  • Non-blocking components: It is the fastest type of component. Data is manipulated synchronously, meaning that the component output uses the same buffer as the input. In other words, the number of input rows is always equal to the number of output rows. For example: derived column, data conversion, conditional split transformations.
  • Semi-Blocking components: Manipulate data asynchronously, which means that the component output uses a different buffer from the one used by the input data. At the same time, semi-blocking implies that the component must wait for a specific amount of input data to generate a set of output rows. For example: Merge Join and Term Lookup transformations.
  • Blocking components: Blocking means that the component must read all input data before generating the output data. For example: Sort and aggregate transformations.

The following table summarizes the difference between these types:

  Non-blocking Semi-blocking Blocking
Type Synchronous Asynchronous Asynchronous
Input rows count affected False Possible Possible
The output buffer is the same as the input buffer True False False
Input rows count required to generate an output 1 row One or more row All input data
Performance Fastest Slower than non-blocking Slower than semi-blocking
Memory consumption Low Normal High
Example Lookup Merge Join Sort

Why does SSIS Merge Join require sorting?

To answer this question, let us assume that there is no merge join component, and we are currently looking to design a component that allows joining two sources based on a key column (similar to INNER JOIN in SQL). Based on the above explanation, INNER JOIN operations cannot be executed synchronously since the output rows are not always equal to the input rows.

The SSIS data flow task transfers data as batches using pipelines. This means that there are two possible approaches for joining multiple sources within a data flow:

  1. Add a blocking component that reads all data from both data sources into two in-memory result sets. Then, the component should loop over each row in the first result set and search for the matching rows in the second one.
  2. Add a semi-blocking component that requires two sorted inputs based on the column used as a joining key. The component should compare the key values from each input buffer at each step. If they are equal, the component output is a joined row and continues. If the join key values are not matching, the component should discard the smaller value of the two inputs and read the next row from that input. Since the inputs are sorted, discarding that row is acceptable since it is smaller than any remaining rows in the other input, which means it is useless in any new join operations.

While the first approach requires a lot of memory and computations to join both input data sources, the second approach is more realistic, especially when handling medium and large data sets or executing the operation on a machine with limited resources. On the other hand, designing the merge join as a semi-blocking component rather than reading all data and sorting it in memory will allow the SSIS developers to avoid sorting the data within the data flow and adopt some alternatives.

Alternatives

The following section will describe the alternatives of using the SSIS merge join and sort component together to perform a join operation.

Joining the data in the source database

One approach that could be very efficient is joining the data within the source database. Even if they join, operations could be expensive in some cases. Several factors make joining the data within the source database more efficient:

  1. The database engine query optimizer tries to execute the SQL commands most efficiently.
  2. There is a possibility of creating indexes and statistics that can support the join operation and decrease the amount of memory required.
  3. No network overhead.
  4. Sorting data could be performed by indexes rather than in memory.

Those reasons make it worth bulk inserting the data into a staging table and performing the join operation within the database if the source data set is an external file or on another server.

Important note: If the database server does not have sufficient resources or other heavy workloads are performed, this approach may not be efficient, and joining data on the ETL server could be more performant.

Using covering indexes

The second approach is to create covering indexes for your select queries to load the data sorted in a specific order. This is one important technique used in performance tuning. It avoids performing in-memory sort operations even if the source database server does not have sufficient resources to complete the join operations or the source data sets are on different servers. You can create covering indexes for your ordered queries on each server and load the data in a sorted way.

You can learn more about covering indexes in the following article: SQL Server Covering Index Performance (mssqltips.com).

Minimizing the number of sort components

Many developers think that they should never use the sort components. In contrast, there are some scenarios in that alternatives cannot be implemented. In that case, we should try to minimize in-memory ordering as much as possible, even if this were applied to only one of the SSIS merge join component inputs.

Experiments

To illustrate the performance impact of using the database engine to perform the join or sort operation, we ran an experiment using the StackOverflow2013 database. We created three packages that export data from the Users (2465713 rows) and Posts (17142169 rows) tables and into a destination table. The results are compared at the end.

Test 1

The first packages use a sort component for each data source and then use the merge join component to combine both sources before inserting the data into the destination table.

First package data flow task where SSIS Merge join and sort are used

Test 2

We removed the sort components in the second package and created the needed indexes on both data sources. Then we used Merge Join to combine them.

Second package data flow task where only SSIS merge join is used

To use the Merge Join transformation in this scenario, we should open the advanced editor of each OLE DB source and set the IsSorted property to True on each output buffer, as shown in the image below.

OLE DB Source advanced configuration

Next, we should set the SortKeyPosition property to 1 on the column used for sorting.

Sorting key configuration in the OLE DB Source

Test 3

We removed the Merge Join component in the third package and used a SQL command to join both data sources.

Joining data using an SQL Command

The data is imported directly to the OLE DB destination component.

Third package data flow task after removing the SSIS merge join component

Comparison of Findings

We executed the packages on a laptop having four logical processors and 20 Gigabytes of RAM. The execution results were as follows:

  • The first package took about four days to complete the execution.
  • The second package took one day to complete the execution.
  • The third package took about 17 minutes to complete the execution.

Those results show that more than three days of the execution time was saved by removing the merge join and sort component from the data pipeline and implementing them in the database engine.

Next Steps


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Hadi Fadlallah Hadi Fadlallah is an SQL Server professional with more than 10 years of experience. His main expertise is in data integration. He's one of the top ETL and SQL Server Integration Services contributors at Stackoverflow.com. He holds a Ph.D. in data science focusing on context-aware big data quality and two master's degrees in computer science and business computing.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2022-04-05

Comments For This Article




Saturday, January 21, 2023 - 3:23:28 AM - Hadi Fadlullah Back To Top (90841)
@Patrick, in that case I prefer staging one of the data source on the other sever, then joining the staged table with the other data source using SQL

Saturday, January 21, 2023 - 3:21:48 AM - Hadi Fadlullah Back To Top (90840)
@AT, Your comment is highly appreciated

Sunday, November 6, 2022 - 6:23:09 PM - AT Back To Top (90667)
Thank you so much! This was the life saving tip! :)

Friday, September 30, 2022 - 11:04:21 AM - Patrick Back To Top (90537)
hi... what alternatives can you offer when the data sources come from different servers? In that scenario, the servers do not have "trust" enabled, so the data cannot be joined on the server.














get free sql tips
agree to terms