Additional Optimizations for Type 2 Slowly Changing Dimensions in SQL Server Integration Services - Part 3

By:   |   Updated: 2018-09-03   |   Comments (3)   |   Related: 1 | 2 | 3 | 4 | More > Integration Services Development

Problem

In the previous two parts of this tip, an implementation of a SQL Server slowly changing dimension of type 2 was suggested. This design only uses out-of-the-box components and is optimized for performance. It doesn’t use the SCD Wizard. However, the initial design can be optimized even further. In this part, we’ll go over these optimizations.

Solution

Before you read this tip, it is advised to read part 1 and part 2 if you haven’t already.

There are three optimizations possible in the design:

  • Detect type 1 changes before they are sent to the update table.
  • Uses hashes to detect changes instead of comparing column values directly.
  • Handle intra-day changes.

Using Hashes in SQL Server T-SQL Code to Detect Changes

In this section we’ll introduce hashes for detecting changes. At the same time, we’ll use these to detect type 1 changes in the data flow itself, combining the first two points of the optimizations.

The point of using a hash is that you create a concatenation of all the columns you want to compare them and hash them into one single column. Now you have to compare only one single value instead of multiple columns. Once one of the columns has a different value, the hash will be different and we’ve effectively detected a change. There are two options for creating a hash:

  • Using T-SQL with the HASHBYTES function. In many data warehouse scenarios, data is first staged into a landing zone or staging environment. This means we read data from a relational source, so we can use T-SQL to add the hash column.
  • If you don’t have a relational source for your dimension data, you can add the hash in the data flow. The tip Using hash values in SSIS to determine when to insert or update rows gives an example using a script component.

In this tip, we’ll use HASHBYTES since we read from a staging table in SQL Server. The first step is to adjust the SELECT statement that reads the data from the staging table. We need to add two hashes: one for the type 2 columns and one for the type 1 columns.

SELECT
    [CustomerName]
   ,[Location]
   ,[Email]
   ,ValidFrom = CONVERT(DATE,DATEADD(DAY,-1,GETDATE()))
   ,[Hash_SCD1] = CONVERT(CHAR(66)
                   ,HASHBYTES(  'SHA2_256'
                                ,[Email]
                             )
                    ,1)
   ,[Hash_SCD2] = CONVERT(CHAR(66)
                   ,HASHBYTES(  'SHA2_256'
                                ,[Location]
                             )
                    ,1)
FROM [dbo].[CustomerStaging];

The output of this SELECT statement:

result set staging

In this simple example, there is only one column for each hash. However, in reality you’ll have many more columns. Typically you can use the CONCAT function or the CONCAT_WS function (since SQL Server 2017) to concatenate all the columns together. The advantage of those functions is you don’t have to cast columns to string values and they implicitly convert NULL values to blank strings. In the example, we’re using the SHA2 hash function. We’re also converting it to a character string using binary conversion (the CONVERT function has a third parameter, which is set to 1 in this case). We’re using a string value to compare since this works better in the SSIS conditional split transformation. Not using the binary conversion leads to strange symbols instead of a nice hexadecimal string:

hash values

In the left column, the binary conversion is used while in the right column the result of the hash function is directly cast to a char(66) string.

We’ve added hashes to the source data, but we need to compare it of course. You have two options:

  • Store the hash values from the source along with the other columns when loading new rows.
  • Use calculated column in the dimension tables to calculate the hashes on the fly.

The disadvantage of the first option is that you’ll have to recalculate the hash for every type 1 update or every time new columns are added or removed from the dimension. With calculated columns, you only have to update their definition when the table schema changes.

Let’s add the two calculated columns to the dimension:

ALTER TABLE [dbo].[DimCustomer]
ADD [Hash_SCD1] AS CONVERT(CHAR(66)
                   ,HASHBYTES(  'SHA2_256'
                                ,[Email]
                             )
                    ,1)
   ,[Hash_SCD2] AS CONVERT(CHAR(66)
                   ,HASHBYTES(  'SHA2_256'
                                ,[Location]
                             )
                    ,1);

The dimension data now looks like this:

dimension data with hashes

The SQL statement for the lookup component needs to be adapted as well:

SELECT
    [SK_Customer]
   ,[CustomerName]
   ,[Hash_SCD1_OLD] = [Hash_SCD1]
   ,[Hash_SCD2_OLD] = [Hash_SCD2]
FROM [dbo].[DimCustomer]
WHERE [ValidTo] IS NULL;

In the Columns pane of the lookup, the two hashes need to be retrieved:

retrieve hashes from dimension

The conditional split checking for type 2 changes needs altering as well. The Location column is now removed from the data flow, so we need to replace it with the hashes in the formula:

change type 2 conditional split

We can also add a new line that will check if there are type 1 columns with new values:

check for type 1 changes as well

In the data flow, we have to replace the “No Change” output with the “Type 1 change” output:

change output

The final data flow takes the following form:

final data flow with hashes

Now data will only be written to the “OLE_DST Write Type 1 changes” destination if one or more type 1 columns have new values. One last adjustment is to remove the WHERE clause of the UPDATE statement in the Execute SQL Task:

sql task query

Conclusion

In this tip we introduced a mechanism using hashing of columns to easily detect changes. When one of the columns has a new value, the hash will change entirely and it will differ from the hash of the previous version of the row. We used this hash mechanism to detect changes for type 1 and type 2 attributes. Because of this, we can optimize performance by only writing rows to the type 1 update table if there is an actual change.

Next Steps


Last Updated: 2018-09-03


get scripts

next tip button



About the author




Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name
*Email
Email me updates

Signup for our newsletter

I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.





Friday, August 30, 2019 - 11:20:55 AM - Dmitry Vakhrushev Back To Top

I believe the logical flow of the three articles was to show how to distinguish and apply Type1 and Type2 changes. Suppose, you had 2 records in the dimension with the email "testemail" and location "LocationA". Then the source data was changed: the email is now "testemailnew" and location "LocationZ". Your dimension table will have the third row inserted with those new values. However, the email won't be updated for the first two records. That is not a correct setup for the Type1 change that you described in the article. Thanks.


Friday, August 30, 2019 - 10:03:28 AM - Koen Verbeeck Back To Top

Hi Dmitry,

I would say the type 1 change is not "lost". It is included in the new type 2 record which was just inserted.
The choice is then if you want to "destroy" the previous history of your type 1 records (as you suggested by using the multicast) or not. In this tip, I deliberately made the choice not to, as it would complicate the set-up. In the end, it's personal preference and/or business requirements that dictate the choice.

Regards,

Koen


Friday, August 30, 2019 - 9:15:43 AM - DMITRY VAKHRUSHEV Back To Top

Thank you, Koen for the great SCD article series. I would like to add that if the update happens to both Type1 and Type2 columns at the same time (e.g. the email and location were changed at the same time) then in your version it's captured as Type2 change only and Type1 change gets lost. To adress that I suggest to add a "multicact" right after the Lookup and create two separate streams for Type1 and Type2.

P.S.

Not sure how to add a picture to the comments.



download

























get free sql tips

I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.



Learn more about SQL Server tools