Lookup and Cache Transforms in SQL Server Integration Services

By:   |   Comments (7)   |   Related: More > Integration Services Data Flow Transformations


Problem

One of the new SQL Server 2008 features in SSIS is an enhanced Lookup transform that includes the ability to cache the lookup data locally. Can you provide the details of this new functionality and an example of how to use it?

Solution

The enhanced Lookup transform works hand-in-hand with the Cache transform. The Cache transform is a brand new feature in SQL 2008, allowing you to cache the data used in the Lookup transform. The Lookup transform can then utilize this cached data to perform the lookup operation. Working with cached data will most likely be more efficient than querying the database. In addition the Cache Connection Manager (another new feature) can persist the cache to a local file, allowing the cache to be shared between multiple SSIS packages and steps within a package. A perfect example of where this will be useful is in the extract, transform and load (ETL) packages that we create to update a data warehouse. We typically build dimension tables in the data warehouse that use a surrogate key as a primary key (in lieu of the source key in the business application). As we are processing our fact tables we need to lookup the dimension surrogate keys based on the key in our source system and store the surrogate key in our fact tables.

To demonstrate how to use the new Lookup and Cache transforms, we will create a simple SSIS package as shown below:

Sample pkg

The Build Customer Cache Data Flow task will use the new Cache Transform component to populate the cache with the source key and surrogate key for every row in the Customer dimension from the AdventureWorksDW 2008 database. Note that we could create a separate SSIS package to populate our cache then simply use the cache in multiple SSIS packages. We could run the SSIS package only when there has been a change to the Customer dimension. For purposes of demonstration, we will just use a single package.

The Lookup Customer Surrogate Keys Data Flow task will use the enhanced Lookup transform to lookup the Customer dimension surrogate key based on the source key in the Sales.Customer table of the AdventureWorks2008 OLTP database.

Initial Setup

Before we walk through the details of the two Data Flow tasks, let's first setup the Cache Connection Manager. The Cache Connection Manager is a new connection manager which is used to define the cache. To add a Cache Connection Manager right click in the Connection Managers area of the Control Flow design surface and select New Connection. Select CACHE from the Add SSIS Connection Manager dialog then click the Add button:

add cache connmgr

After adding the Cache Connection Manager there are some properties that need to be configured. On the General tab click the Use file cache checkbox and enter a filename for the cache; this will persist the cache to a local file, allowing it to be reused on a subsequent run of this package or another package.

cache connmgr general

Click the Columns tab to configure the data columns to be stored in the cache. SourceKey is the name we are using for the lookup column; i.e. it is the column that is joined to the source table to perform the lookup; the Index Position must be 1. The SurrogateKey column is the value that we want to retrieve via the lookup; i.e. it's the value that we want to retrieve based on our SourceKey; its Index Position must be zero. The Columns dialog below assumes a single column lookup; if more than one column is required for the lookup, add the additional columns and increment the Index Position as appropriate.

cache cols

Build Customer Cache

The Build Customer Cache Data Flow is shown below:

build cache dataflow

Get Customer Source Keys and Surrogate Keys is an OLE DB Source that selects the CustomerAlternateKey (i.e. the SourceKey) and the CustomerKey (i.e. the surrogate key) for every row in the Customer dimension. The SELECT statement aliases the columns to SourceKey and SurrogateKey for convenience; they will match the column names in the cache and the Mappings dialog in the Cache transform (shown below) will automatically perform the mapping.

oledb source

Load Customer Lookup Cache is a Cache Transform component that loads the cache. The Cache Transform component requires a Cache Connection Manager which defines the cache. We defined the Cache Connection Manager earlier in the Setup section. Select the Cache connection manager on the Connection Manager page of the Cache Transformation Editor:

cache xfm connmgr

Click Mappings to map the input columns in the Data Flow to the destination columns in the cache. Based on the aliases in our SELECT statement above, the mapping is performed automatically.

cache xfm map

Lookup Customer Surrogate Keys

The Lookup Customer Surrogate Keys Data Flow task is shown below:

lookup dataflow

Get Orders is an OLE DB Source that retrieves the SourceKey by joining the Sales.SalesOrderHeader table to the Sales.Customer table. The CAST is required to match the type in the cache and the Customer dimension. The SourceKey is what we join to the cache to perform our lookup.

get orders oledb

Lookup Customer Surrogate Keys is a Lookup transform. It allows us to retrieve the SurrogateKey for the SourceKey in our Data Flow. There are a number of properties to configure starting with the General page:

lookup general

For Cache mode we select Full cache which preloads the cache before performing the lookups. For Connection type we select Cache connection manager (a new feature) which allows us to preload the cache from a file that exists locally (we created it in the initial Data Flow task). These two options maximize performance of the Lookup transform in most instances. You can still choose an OLE DB connection manager which is essentially the functionality you get with SQL Server 2005. Another new feature in SQL Server 2008 is the ability to redirect rows with no match to a new output (Lookup No Match Output) rather than redirecting these rows to the Error output. In most cases a lookup with no match will need to be handled separately but it may not be treated as an error.

Click Connection to specify the Cache Connection Manager; we defined this in the Setup section. It links the Lookup transform to the underlying cache.

lookup conn

Click Columns to configure the lookup operation; the completed dialog is shown below:

lookup cols

Start out by right clicking on the lookup column in the Available Input Columns list; for our example right click the AccountNumber column. The Create Relationships dialog will be displayed; select AccountNumber as the Input Column and SourceKey as the Lookup Column.

lookup map

Click SurrogateKey in the Available Lookup Columns list; this is the value that we want to retrieve from the lookup operation.

Save Customer Lookup Matches and Save Customer No Match Lookups are both OLE DB Destinations; we save the successful lookups and any source keys with no match in the cache to separate tables just to see the result of the Lookup transform. In a typical SSIS package you would assign new surrogate keys for the No Match Lookups then insert all of the rows extracted into the appropriate fact table. The No Match lookups represent what we call early arriving facts; i.e. a row from the source system that references a dimension source key that doesn't exist yet. For additional information on handling these early arriving facts, refer to our earlier tip Handling Early Arriving Facts in SQL Server Integration Services (SSIS) Packages.

Summary

Let's highlight the key points in our sample SSIS package that uses the enhanced Lookup and new Cache transforms:

  • Use the new Cache Connection Manager to define your cache.
  • Load the cache using the new Cache transform.
  • Configure the enhanced Lookup transform to use the new Cache Connection Manager.
Next Steps
  • Download a copy of the latest Community Technology Preview of SQL Server 2008 from this site. The above example was created using the February, 2008 CTP.
  • Download the sample code and experiment with the new Lookup transform in an SSIS package. Note that you will probably have to change the cache filename in the Cache Connection Manager; the example uses a hard-coded path. You will also have to create the tables used in the Lookup Customer Surrogate Keys Data Flow task.
  • Download the AdventureWorks sample databases for SQL Server 2008 here.


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Ray Barley Ray Barley is a Principal Architect at IT Resource Partners and a MSSQLTips.com BI Expert.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips



Comments For This Article




Monday, March 17, 2014 - 5:50:10 PM - Claudia Back To Top (29788)

graciasssss!!!!!!!!!!!!!!!!

llevo mucho buscando esto :)


Tuesday, June 5, 2012 - 11:54:28 AM - Alex Back To Top (17805)

Thank you for your answer! Your help is very appreciated Ray.


Thursday, May 31, 2012 - 5:03:58 PM - Ray Barley Back To Top (17763)

Here's the description of Index Position from the Cache Connection Manager Editor documentation at http://msdn.microsoft.com/en-us/library/bb895364; my description in the tip left this out

Index Position

Specify which columns are index columns by specifying the index position of each column. The index is a collection of one or more columns.

For non-index columns, the index position is 0.

For index columns, the index position is a sequential, positive number. This number indicates the order in which the Lookup transformation compares rows in the reference dataset to rows in the input data source. The column with the most unique values should have the lowest index position.


Thursday, May 31, 2012 - 2:53:50 PM - Alex Back To Top (17755)

Why Index Position for SurrogateKey column must be zero and 1 for SourceKey?


Thursday, May 26, 2011 - 3:05:29 PM - Jeremy Kadlec Back To Top (13914)

Clement,

Per Ray, here's what you can do to make the sample code in the tip work:

Download AdventureWorks2008_SR4.exe from http://msftdbprodsamples.codeplex.com/releases/view/37109 then run it.

Install these two AdventureWorks Data databases:

  • AdventureWorks Data Warehouse 2008
  • AdventureWorks OLTP 2008

The OLTP database requires the following:
* Full-text Search must be installed
* The SQL Full-text Filter Daemon Launcher service must be running
* FILESTREAM must be enabled

If you are missing any of these you can't install AdventureWorks OLTP 2008.  You can find the
instructions for setting these up here:

http://msftdbprodsamples.codeplex.com/wikipage?title=Database%20Prerequisites

Thank you,
Jeremy Kadlec


Friday, May 6, 2011 - 6:07:09 AM - Clement Back To Top (13764)

Hello,

Thank you for the article.
The URL for downloading "the AdventureWorks sample databases for SQL Server 2008" is not working anymore.
The dead URL is http://www.codeplex.com/MSFTDBProdSamples/Release/ProjectReleases.aspx?ReleaseId=10901
FYI there is no AdventureWorks2008 database in AdventureWorks2008_SR4.exe, so the sample's connection manager don't work no more.

a+, =)
-=Clement=-

Configuration :
SQL Server v10.50.1600 (SQL Server 2008r2 + CU7)
AdventureWorks2008_SR4.exe


Wednesday, August 19, 2009 - 11:25:53 PM - blaze Back To Top (3925)

 Hi,

It is clear from your blog how to use Lookup with cach from OLEDB, but have you ever tried  using an excel file as the source?

I just can't make it work with an excell file as the source.  (using: excel 2007, SQL2008)

Regards,















get free sql tips
agree to terms