![]() |
|

SQL Monitor offers straightforward server monitoring through a web-based UI, to help you prioritize your workload:
Start monitoring your servers today with a free trial.
|
|
By: Ray Barley | Read Comments (14) | Related Tips: More > Integration Services Analysis Services Tasks |
Problem
We have experimented with the Slowly Changing Dimension (SCD) Data Flow Transformation that is available in the SSIS designer and have found a few issues with it. Our major concern is the use of the OLE DB Command Data Flow Transformation for all updates (inferred, type 1 and type 2) to rows in the dimension table. Do you have any suggestions?
Solution
You have hit upon the one issue with the Slowly Changing Dimension Data Flow Transformation that likely requires an alternative approach. The issue with the OLE DB Command is that it executes a single SQL statement for each row passing through the Data Flow. When the volume of rows is substantial, this creates an unnecessary and probably unacceptable performance hit. Let's take a step back and analyze the task at hand then discuss an alternative solution.
The term slowly changing dimensions encompasses the following three different methods for handling changes to columns in a data warehouse dimension table:
A dimension that implements Type 2 changes would typically have the following housekeeping columns to identify the current row and the effective date range for each row:
The effective date range columns retain the history of a natural key in the dimension, allowing us to see the column values at any point in time. Fact table rows can be joined to the dimension row where the fact row transaction date is between the effective date range of the dimension row.
When you add the SCD Data Flow Transformation to the Data Flow designer, you step through a wizard to configure the task, and you will wind up with the Slowly Changing Dimension task and everything that follows below being added to the Data Flow designer (the task names generated by the SCD wizard have been updated to add clarification):
Main points about the above screen shot:
Now that we have described how the SCD transformation implements slowly changing dimension processing, we can discuss an alternative solution. As an example we will use a Customer dimension that is updated with source system data in an Excel spreadsheet. The SSIS package Control Flow looks like this:
Main points about the above solution:
An additional detail about Type 1 and Type 2 processing is that a dimension may implement both. In other words some column changes may be handled as Type 1 and other column changes may be handled as Type 2. An elegant way to implement this is to take advantage of the SQL Server CHECKSUM function. CHECKSUM calculates a unique integer hash value based on the values of every column in a row or a subset of columns. We can use a hash value comparison to determine whether anything has changed in our list of columns in the staging table versus the dimension table.
Let's take a look at our Customer dimension table:
The checksum columns are defined as follows:
[Type2Checksum]
AS CHECKSUM([Address],[City],[Region],[PostalCode],[Country])There is a separate CHECKSUM value calculated for the list of Type 1 columns and the list of Type 2 columns. In our staging table we have the same two CHECKSUM computed columns; the column lists must match exactly in order for this to work.
As a general rule the staging table schema mirrors the dimension table schema but includes a couple of other housekeeping columns as shown below:
The housekeeping columns in staging are as follows:
IsNew is set to 1 if this is a new dimension row.
IsType1 is set to 1 if there is a change to any column handled as Type 1.
IsType2 is set to 1 if there is a change to any column handled as Type 2.
Finally let's review the single stored procedure that implements the Type 1 and Type 2 processing and is invoked in the Update Customer Dimension Execute SQL task as noted above. The first step is to update the housekeeping columns in the staging table to specify whether the row is new, has a Type 1 change, or a Type 2 change. Remember that Type 1 and Type 2 changes are not mutually exclusive; you can have one, both, or neither. We simply join the staging table to the dimension on the natural key and CurrentMember = 1 to set the housekeeping flags.
UPDATE stg SET
wk_Customer = dim.wk_Customer
,IsNew = CASE WHEN dim.wk_Customer IS NULL
THEN 1 ELSE 0 END
,IsType1 = CASE WHEN dim.wk_Customer IS NOT NULL
AND stg.Type1Checksum <> dim.Type1Checksum
THEN 1 ELSE 0 END
,IsType2 = CASE WHEN dim.wk_Customer IS NOT NULL
AND stg.Type2Checksum <> dim.Type2Checksum
THEN 1 ELSE 0 END
FROM dbo.stg_dim_Customer stg
LEFT OUTER JOIN dbo.dim_Customer dim ON
dim.nk_CustomerID = stg.nk_CustomerID
AND dim.CurrentMember = 1 |
The Type 1 changes are handled by updating the dimension table from staging where the IsType1 column = 1. Note that if there are multiple rows for the natural key in the dimension, all rows will be updated. This is typically how Type1 changes are handled but you can easily restrict the update to the current row if desired.
UPDATE dim SET [ContactName] = stg.[ContactName] ,[ContactTitle] = stg.[ContactTitle] ,[Phone] = stg.[Phone] ,[Fax] = stg.[Fax] FROM dbo.stg_dim_Customer stg JOIN dbo.dim_Customer dim ON dim.nk_CustomerID = stg.nk_CustomerID WHERE IsType1 = 1 |
The Type 2 changes are handled by expiring the current dimension row. The ExpirationDate is set to the ModifiedDate per the staging table less 1 minute.
UPDATE dim SET CurrentMember = 0 ,ExpirationDate = DATEADD(minute, -1, stg.ModifiedDate) FROM dbo.stg_dim_Customer stg JOIN dbo.dim_Customer dim ON dim.wk_Customer = stg.wk_Customer WHERE IsType2 = 1 |
A row is inserted into the dimension table for new rows as well as Type 2 changes. Typically the EffectiveDate in new rows may be set to the minimum value of the datetime column as a convenience instead of the actual ModifiedDate (i.e. created date) just so that if a fact row had a transaction date before the dimension row's EffectiveDate it would still be in the range of the earliest dimension row. The ExpirationDate is set to the maximum value of the datetime column; some folks prefer NULL which also works.
INSERT INTO dbo.dim_Customer ( nk_CustomerID ,CurrentMember ,EffectiveDate ,ExpirationDate ,CompanyName ,ContactName ,ContactTitle ,Address ,City ,Region ,PostalCode ,Country ,Phone ,Fax ) SELECT nk_CustomerID ,1 ,CASE WHEN IsNew = 1 THEN '1900-01-01' -- MIN of smalldatetime ELSE ModifiedDate END ,'2079-06-06' -- MAX of smalldatetime ,CompanyName ,ContactName ,ContactTitle ,Address ,City ,Region ,PostalCode ,Country ,Phone ,Fax FROM dbo.stg_dim_Customer stg WHERE IsType2 = 1 OR IsNew = 1 |
Let's take a look at an example of Type 2 processing in the dim_Customer table. The following query results show a customer after the region has been updated. Region is one of the columns that is handled as a Type 2 change. As you can see a new row has been inserted with CurrentMember = 1, an EffectiveDate = the MdifiedDate when the change was processed, and an ExpirationDate which is the maximum value for a smalldatetime. The original row was expired and its CurrentMember = 0 and ExpirationDate is set to the ModifiedDate from the source system less 1 minute. The 1 minute subtraction eliminates any overlap in the effective date range.
Next Steps
| Friday, June 20, 2008 - 12:32:50 AM - bhanurapeti | Read The Tip |
|
hi i had a problem with dealing this if i delete any row then the primary key column generating new column which is an identity column so when ever i running the package a new row is inserting can you help on this problem |
|
| Friday, June 20, 2008 - 2:38:30 AM - raybarley | Read The Tip |
|
Are you saying that you deleted a row from the dimension table then when you rerun the package it gets inserted again from the source? If so then you want to add a deleted flag column to the dimension. Rather than physically deleting the row you set the deleted flag. Then the package won't add the deleted row back. |
|
| Thursday, March 25, 2010 - 12:01:42 PM - toddmcdermid | Read The Tip |
|
The other slow part about using the included Slowly Changing Dimension component in SSIS is that it performs a lookup for each and every row that passes through it - and those lookups are all uncached, resulting in very slow performance. There are alternatives:
|
|
| Monday, September 20, 2010 - 1:49:26 PM - andygarner | Read The Tip |
|
I think this is a great tip - many thanks for sharing. Just one suggestion ... When updating the Type 1 fields the JOIN as published uses the natural key fields: dim.nk_CustomerID = stg.nk_CustomerID which is fine if there is only one field, but if the natural key is multipart it might be preferable to use the warehouse key: dim.wk_Customer = stg.wk_Customer (as is published for the Type 2 update). |
|
| Monday, September 20, 2010 - 2:16:05 PM - Ray Barley | Read The Tip |
You're referring to this T-SQL:UPDATE dim SET [ContactName] = stg.[ContactName] ,[ContactTitle] = stg.[ContactTitle] ,[Phone] = stg.[Phone] ,[Fax] = stg.[Fax] FROM dbo.stg_dim_Customer stg JOIN dbo.dim_Customer dim ON dim.nk_CustomerID = stg.nk_CustomerID WHERE IsType1 = 1 |
|
| Tuesday, April 05, 2011 - 4:23:50 AM - Sam | Read The Tip |
|
This is an excellent article. One thing I need to know: How do I deal with fact tables? I am assuming since you are using truncate that you are not staging the fact data? Only the Dimension data? So in that case are you going to go straight from source to DW for the fact data? Much appreciated. |
|
| Tuesday, April 05, 2011 - 7:27:24 AM - Ray Barley | Read The Tip |
|
This tip only talked about dimensions. Fact tables would typically go through a staging process where you extract data from the source system into staging then at a minimum join the data with the dimensions via the surrogate keys. In addition there may be other cleanup, transformations, calculations, etc. that are done in staging before populating the fact tables. One thing different about facts is that you usually only have inserts. Facts don't change, they represent a historical transaction of some sort - think of an invoice. When a purchase is made the invoice is generated. If a customer were to return the purchase they get a new invoice; the original one doesn't change. |
|
| Tuesday, April 05, 2011 - 7:38:57 AM - Sam | Read The Tip |
|
Thanks for the prompt response Ray What I mean is how can you truncate a dimension if you have foreign key constraints referencing the dimension table? Thanks a lot |
|
| Tuesday, April 05, 2011 - 7:51:48 AM - Ray Barley | Read The Tip |
|
The purpose of truncate is to clear out the staging table so that you can add in the rows from the source system tht have changed. The fact staging tables don't have foreign keys that reference the dimension staging tables. The ETL process performs lookups to supply the surrogate keys for the dimensions; the process also deals with missing dimensions (aka early arriving facts) so you don't need to rely on foreign keys for integrity. No magic - there are no foreign keys.
|
|
| Tuesday, April 05, 2011 - 8:09:04 AM - Sam | Read The Tip |
|
Cheers Ray, thats what I was kind of thinking. The staging environment won't use keys between the dimensions and fact tables.
|
|
| Sunday, May 29, 2011 - 12:57:06 AM - Rich | Read The Tip |
|
Is there a risk in this implementation on the checksum saying something is the same when in actuality it has changed - due to the algorithm associated with checksum()? |
|
| Sunday, May 29, 2011 - 8:07:35 AM - Ray Barley | Read The Tip |
|
According to the documentation, talking about how CHECKSUM computes a value based on all of the columns in a table or a specific list of columns: If one of the values in the expression list changes, the checksum of the list also usually changes. However, there is a small chance that the checksum will not change. So the answer is yes it could give the wrong result. |
|
| Monday, May 30, 2011 - 12:05:19 AM - Rich | Read The Tip |
|
What are your thoughts on using hashbytes? |
|
| Monday, May 30, 2011 - 8:23:03 AM - Ray Barley | Read The Tip |
|
I've used it to store the hash value of a password instead of the password itself. You could use it in lieu of checksum but it's awkward having to concatenate all of the columns into a variable that you pass in to the hashbytes function. CHECKSUM is certainly more convenient. Since the original tip was written SQL Server 2008 came out and it supports change data capture which I typically use instead of CHECKSUM. |
|
|
privacy | disclaimer | copyright | advertise | about authors | contribute | feedback | giveaways | user groups Some names and products listed are the registered trademarks of their respective owners. Edgewood Solutions LLC | MSSharePointTips.com | MSSQLTips.com |