SSIS Bad Habits: Slow lookup transformations
Several Microsoft SQL Server Integration Services (SSIS) developers report bad package performance while using a lookup transformation. Why is this an issue?
This tutorial explains how lookups may highly decrease SSIS package performance and when they should only be used in Business Intelligence projects.
What is a lookup table?
Before getting started, it would be good to explain when a table is called a lookup table. This will help explain when a lookup transformation is needed within the ETL process.
As defined in Wikipedia, "In computer science, a lookup table (LUT) is an array that replaces runtime computation with a simpler array indexing operation. The process is termed as direct addressing… The savings in processing time can be significant because retrieving a value from memory is often faster than carrying out an expensive computation or input/output operation."
Lookup tables are mainly used to standardize and unify values within a text column in order to minimize the data size, ensure data integrity, and improve the data load performance. The values within a table are replaced with a code that points to the value stored within a lookup table.
When should a lookup table be used?
After explaining what a lookup table is and how it is used, a question may come to mind: When should we standardize the values of a column and use a lookup table as a reference? Is it always a good solution?
There is no general answer to this question. The usage of lookup tables depends on the data's intended usage. For example, using a lookup table as reference data is highly recommended for a transactional database, while it may be less important when transferring data to an external system. In addition, it depends on the data type and number of distinct values; it is useless to standardize the values of "first names" while it is useful for other columns such as gender or city.
SSIS lookup transformation
In SSIS, the lookup transformation is a data flow task component used to join additional columns to the data pipeline by looking up values in a table using equality join comparison. When released in SSIS 2005, this transformation only supported the usage of lookup tables stored within a relational database. In SSIS 2008, the lookup transformation is improved to allow using a table stored within the cache memory using a cache connection manager.
Currently, the SSIS lookup transformation supports two connection types to retrieve a lookup table:
- OLE DB connection manager: it only supports SQL Server, Oracle, and DB2 databases.
- Cache connection manager: to store data within cache, we should use the SSIS cache transform task.
It supports three data caching modes:
- Full cache: it fully loads the lookup table (reference dataset) in memory before executing the lookup transformation. It is the only available mode using a cache connection manager.
- Partial cache: the lookup data is generated and loaded during the package execution. The rows with matching entries in the reference dataset and – optionally - those without matching entries in the dataset are stored in the cache.
- No cache: the lookup data is generated and loaded during the package execution. No data is stored within the cache memory.
Note: this article only addresses the SSIS lookup transformation performance problems, while the fuzzy lookup transformation component is out of this article's scope.
Why may a lookup transformation decrease the SSIS package performance?
After briefly describing this transformation, I will try to explain some of the bad habits that may decrease the lookup transformation performance in this section.
Using large lookup tables
One of the most common misusages of the SSIS lookup transformation is dealing with a huge reference table. Why? This is due to:
- Large data cannot always fit in memory.
- We don't need the whole data during the lookup operation
- The join operation is performed on the fly during the package execution. This will ignore all indexes that can support the join operation.
A good alternative for this approach is performing the lookup transformation in the database engine rather than the data pipeline. We can use a SQL statement \ T-SQL code as a data access mode in the OLE DB source. This approach will let the SQL query optimizer benefit from all created indexes to improve the join operation and minimize the amount of data loaded, especially when using covering indexes.
Another alternative is to load the most commonly used values within the cache memory. We can use a cascading lookup pattern where one lookup uses the cache, with its No Match output falling through to a second lookup running in a partial cache mode to hit the database to handle the remaining rows.
Redundant lookup table loads
Another reason to decrease the SSIS package performance is loading data several times during the same execution. This is one of the main purposes for developing the cache connection manager. For example, a lookup table containing information about countries can be used several times during an ETL job since it is used for different entities such as persons, locations, and organizations.
Using the lookup transformation several times within the package to load the data from the same table can highly affect the package performance if the data volume is considerable. While loading the data into a cache will reduce the number of data loads and decrease the disk reads and package execution time. In addition, if the lookup transformation is repeated within multiple packages, we can also decrease the amount of data load by saving the data into a cache file (.caw) and loading it using the cache connection manager.
Staging data before lookup
When a lookup table is stored within an external file or a data source that is not OLE DB compatible, several SSIS developers insert this data into a staging table to use it in a lookup transformation. If the data size is acceptable, using a cache transform task to insert this data into the cache may guarantee higher performance.
Is always using a cache connection manager faster?
The usual answer is NO. Using the cache connection manager is not always faster than OLE DB since the cost of disk access can out weight the benefits of pre-creating the cache.
- After improving the performance of lookup operations in SSIS, you should read more about the SSIS package performance optimization methods: SQL Server Integration Services SSIS Performance Tuning Techniques (mssqltips.com).
About the author
View all my tips
Article Last Updated: 2022-06-07