Lookup and Cache Transforms in SQL Server Integration Services
One of the new SQL Server 2008 features in SSIS is an enhanced Lookup transform that includes the ability to cache the lookup data locally. Can you provide the details of this new functionality and an example of how to use it?
The enhanced Lookup transform works hand-in-hand with the Cache transform. The Cache transform is a brand new feature in SQL 2008, allowing you to cache the data used in the Lookup transform. The Lookup transform can then utilize this cached data to perform the lookup operation. Working with cached data will most likely be more efficient than querying the database. In addition the Cache Connection Manager (another new feature) can persist the cache to a local file, allowing the cache to be shared between multiple SSIS packages and steps within a package. A perfect example of where this will be useful is in the extract, transform and load (ETL) packages that we create to update a data warehouse. We typically build dimension tables in the data warehouse that use a surrogate key as a primary key (in lieu of the source key in the business application). As we are processing our fact tables we need to lookup the dimension surrogate keys based on the key in our source system and store the surrogate key in our fact tables.
To demonstrate how to use the new Lookup and Cache transforms, we will create a simple SSIS package as shown below:
The Build Customer Cache Data Flow task will use the new Cache Transform component to populate the cache with the source key and surrogate key for every row in the Customer dimension from the AdventureWorksDW 2008 database. Note that we could create a separate SSIS package to populate our cache then simply use the cache in multiple SSIS packages. We could run the SSIS package only when there has been a change to the Customer dimension. For purposes of demonstration, we will just use a single package.
The Lookup Customer Surrogate Keys Data Flow task will use the enhanced Lookup transform to lookup the Customer dimension surrogate key based on the source key in the Sales.Customer table of the AdventureWorks2008 OLTP database.
Before we walk through the details of the two Data Flow tasks, let's first setup the Cache Connection Manager. The Cache Connection Manager is a new connection manager which is used to define the cache. To add a Cache Connection Manager right click in the Connection Managers area of the Control Flow design surface and select New Connection. Select CACHE from the Add SSIS Connection Manager dialog then click the Add button:
After adding the Cache Connection Manager there are some properties that need to be configured. On the General tab click the Use file cache checkbox and enter a filename for the cache; this will persist the cache to a local file, allowing it to be reused on a subsequent run of this package or another package.
Click the Columns tab to configure the data columns to be stored in the cache. SourceKey is the name we are using for the lookup column; i.e. it is the column that is joined to the source table to perform the lookup; the Index Position must be 1. The SurrogateKey column is the value that we want to retrieve via the lookup; i.e. it's the value that we want to retrieve based on our SourceKey; its Index Position must be zero. The Columns dialog below assumes a single column lookup; if more than one column is required for the lookup, add the additional columns and increment the Index Position as appropriate.
Build Customer Cache
The Build Customer Cache Data Flow is shown below:
Get Customer Source Keys and Surrogate Keys is an OLE DB Source that selects the CustomerAlternateKey (i.e. the SourceKey) and the CustomerKey (i.e. the surrogate key) for every row in the Customer dimension. The SELECT statement aliases the columns to SourceKey and SurrogateKey for convenience; they will match the column names in the cache and the Mappings dialog in the Cache transform (shown below) will automatically perform the mapping.
Load Customer Lookup Cache is a Cache Transform component that loads the cache. The Cache Transform component requires a Cache Connection Manager which defines the cache. We defined the Cache Connection Manager earlier in the Setup section. Select the Cache connection manager on the Connection Manager page of the Cache Transformation Editor:
Click Mappings to map the input columns in the Data Flow to the destination columns in the cache. Based on the aliases in our SELECT statement above, the mapping is performed automatically.
Lookup Customer Surrogate Keys
The Lookup Customer Surrogate Keys Data Flow task is shown below:
Get Orders is an OLE DB Source that retrieves the SourceKey by joining the Sales.SalesOrderHeader table to the Sales.Customer table. The CAST is required to match the type in the cache and the Customer dimension. The SourceKey is what we join to the cache to perform our lookup.
Lookup Customer Surrogate Keys is a Lookup transform. It allows us to retrieve the SurrogateKey for the SourceKey in our Data Flow. There are a number of properties to configure starting with the General page:
For Cache mode we select Full cache which preloads the cache before performing the lookups. For Connection type we select Cache connection manager (a new feature) which allows us to preload the cache from a file that exists locally (we created it in the initial Data Flow task). These two options maximize performance of the Lookup transform in most instances. You can still choose an OLE DB connection manager which is essentially the functionality you get with SQL Server 2005. Another new feature in SQL Server 2008 is the ability to redirect rows with no match to a new output (Lookup No Match Output) rather than redirecting these rows to the Error output. In most cases a lookup with no match will need to be handled separately but it may not be treated as an error.
Click Connection to specify the Cache Connection Manager; we defined this in the Setup section. It links the Lookup transform to the underlying cache.
Click Columns to configure the lookup operation; the completed dialog is shown below:
Start out by right clicking on the lookup column in the Available Input Columns list; for our example right click the AccountNumber column. The Create Relationships dialog will be displayed; select AccountNumber as the Input Column and SourceKey as the Lookup Column.
Click SurrogateKey in the Available Lookup Columns list; this is the value that we want to retrieve from the lookup operation.
Save Customer Lookup Matches and Save Customer No Match Lookups are both OLE DB Destinations; we save the successful lookups and any source keys with no match in the cache to separate tables just to see the result of the Lookup transform. In a typical SSIS package you would assign new surrogate keys for the No Match Lookups then insert all of the rows extracted into the appropriate fact table. The No Match lookups represent what we call early arriving facts; i.e. a row from the source system that references a dimension source key that doesn't exist yet. For additional information on handling these early arriving facts, refer to our earlier tip Handling Early Arriving Facts in SQL Server Integration Services (SSIS) Packages.
Let's highlight the key points in our sample SSIS package that uses the enhanced Lookup and new Cache transforms:
- Use the new Cache Connection Manager to define your cache.
- Load the cache using the new Cache transform.
- Configure the enhanced Lookup transform to use the new Cache Connection Manager.
- Download a copy of the latest Community Technology Preview of SQL Server 2008 from this site. The above example was created using the February, 2008 CTP.
- Download the sample code and experiment with the new Lookup transform in an SSIS package. Note that you will probably have to change the cache filename in the Cache Connection Manager; the example uses a hard-coded path. You will also have to create the tables used in the Lookup Customer Surrogate Keys Data Flow task.
- Download the AdventureWorks sample databases for SQL Server 2008 here.
About the author
View all my tips