Optimizing Type 2 Slowly Changing Dimensions in Integration Services - Part 2

By:   |   Comments (4)   |   Related: 1 | 2 | 3 | 4 | More > Integration Services Development


Problem

In the first part we introduced the reasoning behind type 2 slowly changing dimensions. We also discussed different implementation options in Integration Services (SSIS). Finally, we started with an implementation in the data flow using out-of-the-box transformations only, so we can build an optimized data flow for loading a type 2 dimension. In this tip, we’ll continue the implementation.

Solution

In the first part of the tip we ended with a data flow where we checked if a row was an insert or an update using the lookup component. If you haven’t read the first part yet, it’s recommended to do so.

For the updated rows, we’re going to verify if a row has columns that have changed compared with the most recent row of that business key. First, we’re going to check for type 2 changes. If a type 2 change has occurred, a new row will be inserted and the previous version will get a timestamp for the ValidTo column, to indicate how long that version was valid in time.

Add a conditional split to the data flow canvas and connect it with the match output of the lookup component.

add conditional split to check for type 2

In the condition, we check if the new value of the field Location is different from the current value (which was retrieved in the lookup):

check for type 2 changes

If a change has occurred, a new row has to be inserted into the dimension. But at the same time, we also need to update the previous version. To solve this issue, we’ll use a multicast to create two copies of the row. One we’ll send to the OLE DB Destination, the other to an update. We’re going to reuse the OLE DB Destination that inserts new rows into the dimension. To do this, we place a UNION ALL to merge the two streams together.

The data flow looks like this:

data flow after adding union all

The UNION ALL component has the following configuration:

union all new rows

The insert of new rows is now covered in the data flow. We’re still left with the updates though. There are two types of updates:

  • Updating the ValidTo field of the previous version when a Type 2 field has occurred
  • Updating all other fields when Type 1 changes have occurred

When using the SCD Type 2 wizard in SSIS, the OLE DB Command is used to issue the updates against the database. The problem with this transformation is that an update is sent to the database for every single row. This puts a burden on the transaction log and is much slower than a single batch update. For large dimensions, this can cause performance issues. To work around this, we’ll create two tables. The following T-SQL is used:

DROP TABLE IF EXISTS [dbo].[UPD_DimCustomer_SCD2];
CREATE TABLE [dbo].[UPD_DimCustomer_SCD2](
   [SK_Customer] [int] NOT NULL,
   [ValidFrom] [date] NOT NULL
);

DROP TABLE IF EXISTS [dbo].[UPD_DimCustomer_SCD1];
CREATE TABLE [dbo].[UPD_DimCustomer_SCD1](
   [CustomerName] [VARCHAR](50) NOT NULL,
   [Email] [varchar](50) NULL
);

Keep in mind that DROP TABLE IF EXISTS is only valid since SQL Server 2016. These T-SQL statements are executed in an Execute SQL Task right before the data flow task.

create update tables

After the data flow task, another Execute SQL Task can be added to drop the update tables, as some sort of clean-up. While debugging, it might be interesting not to drop these tables as you might want to inspect their contents.

drop update tables

To make sure this pattern works, you have to set the DelayValidation property of the data flow to True. If not, the data flow will fail validation because the update tables won’t exist yet when the package starts. Let’s continue with the data flow. Add another OLE DB Destination to write the type 2 changes to the dbo.UPD_DimCustomer_SCD2 table:

add second ole db destination

Here we’ll map the surrogate key retrieved from the lookup and the ValidFrom field:

map fields type 2

Finally, add a third OLE DB Destination to write the type 1 changes to the dbo.UPD_DimCustomer_SCD1 table.

finish data flow

In the mapping pane, the business key of the dimension – CustomerName – and all type 1 fields of the dimension are mapped to the update table.

map type 1 columns

The data flow is now finished. The last step is now configuring the updates. Add an Execute SQL Task between the data flow and the last Execute SQL Task:

add last execute SQL task

First, we need a T-SQL update statement that sets the ValidTo field of the previous version of a row:

UPDATE d
SET  [ValidTo] = DATEADD(DAY,-1,[u].[ValidFrom])
FROM [dbo].[DimCustomer]            d
JOIN [dbo].[UPD_DimCustomer_SCD2]   u ON [u].[SK_Customer] = [d].[SK_Customer];

Because we join on the surrogate key, this update statement will have good performance since this column is typically the primary key of the dimension (and thus has a clustered index on it when the defaults were followed).

The ValidTo field is set to one day prior to the ValidFrom field.

Next, we update the type 1 columns using the business key:

UPDATE d
SET [Email] = u.[Email]
FROM dbo.[DimCustomer]          d
JOIN dbo.[UPD_DimCustomer_SCD1] u ON u.[CustomerName] = d.[CustomerName]
WHERE ISNULL(d.[Email],’’) <> ISNULL(u.[Email],’’);

Here we join on the business key. Ideally, a unique index is placed on the business key columns to enforce uniqueness. Therefore, this update statement should have good performance as well. To minimize the number of updates, we also check if the value for email has changed it all. To take NULL values into account, we wrap both columns in an ISNULL function. If the column is not nullable, this isn’t necessary of course.

Testing the Package

The package is now finished, so let’s test some scenarios. Currently, the following data is in the staging table and in the dimension:

source data and dim data

Let’s change the location to California in the staging table, triggering a type 2 change. The data flow looks like this when running the package:

data flow with type 2 change

The data in the tables:

result of type 2 change

The package was run on the 4th of August. This means ValidFrom is the 3rd of August, since we fetch data that is valid until the previous day (since typically batch ETL runs are executed at night). You can change this depending on your business logic. The ValidTo field of the previous version (SK_Customer = 2) is set to one day prior, the 2nd of August.

Let’s make a type 1 change by changing the email address. Running the data flow results in the following:

type 1 change test

The data in the tables:

data after testing type 1

As you can see, the email is updated for every version.

Conclusion

In this part we finalized the design of the SSIS package that loads a Type 2 slowly changing dimension. However, there are still some optimizations possible:

  • Currently we always update type 1 fields, even when there is no actual change. There’s a check in the T-SQL update statement, but this is after we’ve written all the rows to the temporary table.
  • Detecting updates can be made more efficient by using hashes.
  • What about intra-day changes?

In the next part of the tip, we’ll cover each of those scenarios.

Next Steps


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Koen Verbeeck Koen Verbeeck is a seasoned business intelligence consultant at AE. He has over a decade of experience with the Microsoft Data Platform in numerous industries. He holds several certifications and is a prolific writer contributing content about SSIS, ADF, SSAS, SSRS, MDS, Power BI, Snowflake and Azure services. He has spoken at PASS, SQLBits, dataMinds Connect and delivers webinars on MSSQLTips.com. Koen has been awarded the Microsoft MVP data platform award for many years.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips



Comments For This Article




Friday, July 2, 2021 - 8:03:26 AM - Koen Verbeeck Back To Top (88936)
Hi Jeremy,

1. Yes you are correct. However, with multiple columns this can become cumbersome. I talk about an optimization using hashes in part 3 of the tip. You can find it here:
https://www.mssqltips.com/sqlservertip/5640/additional-optimizations-for-type-2-slowly-changing-dimensions-in-sql-server-integration-services--part-3/

2. No, you can combine everything in one single update, by combining the WHERE clauses using OR. Or you use the hashes as described above.

Regards,
Koen

Thursday, June 24, 2021 - 2:17:12 PM - Jeremy Back To Top (88890)
Hi Koen,

Two questions for you.

1. In the conditional split, if we have more than 1 Type2 attribute in our dimension, we'll have to include all of the those in the Expression seperated by the 'or' operator ('||'), is that correct?
(ex. Location_OLD != Location || otherType2Attribute_OLD != otherType2Attribute || ...etc)

2. If we have multiple type 1 updates, we will have to define multiple update statements in the last Execute SQL Task in Control Flow, is that correct?

ex.
--Update 1 Email
UPDATE d
SET
d.[Email] = u.[Email]
FROM dbo.[Dim_Customer] d
JOIN dbo.[UPD_DimCustomer_SCD1] u ON u.[ship_to_key] = d.[ship_to_key]
WHERE
ISNULL(d.[Email],’’) <> ISNULL(u.[Email],’’);

--Update 2 Customer Created Year
UPDATE d
SET
d.[customer_created_year] = u.[customer_created_year]
FROM dbo.[Dim_Customer] d
JOIN dbo.[UPD_DimCustomer_SCD1] u ON u.[ship_to_key] = d.[ship_to_key]
WHERE
ISNULL(d.[customer_created_year],’’) <> ISNULL(u.[customer_created_year],’’);

--Update 3 etc...

Thanks in advance for your tutorial!! This is a great walk-through!

Kind regards,

-J

Wednesday, February 12, 2020 - 1:53:35 AM - Koen Verbeeck Back To Top (84368)

Hi Pelle,

in this set-up, when an SCD Type 2 chagne occurs, a new line is inserted. The type 1 change will be reflected in the new row, but older rows will still have the old value (not-destructive). If you do want all rows to have the new type 1 value (destructive), you need to connect the multicast to the type 1 destination as well.

Regards,
Koen


Tuesday, February 11, 2020 - 10:45:56 AM - Pelle Back To Top (84347)

Hi Koen!
Found this article recently and try to following along.
I can't see how a change to both scd2 and scd1 attribute is handled e.g Location = Stocholm, email = [email protected].
Am I missing some thing here?















get free sql tips
agree to terms