By: Koen Verbeeck | Comments | Related: 1 | 2 | 3 | 4 | More > Integration Services Development
Problem
In the first two parts of this tip, an implementation of a slowly changing dimension of type 2 was suggested with SQL Server Integration Services. This design only uses out-of-the-box components and is optimized for performance. However, the initial design can be optimized even further. Part 3 introduced hashing to efficiently detect changes. In this part, we’ll take a look at intra-day changes.
Solution
Before you read this tip, it is advised to read part 1, part 2 and part 3 if you haven’t already.
There are three optimizations possible in the design of the SQL Server Integration Services SSIS package:
- Detect type 1 changes before they are sent to the update table.
- Uses hashes to detect changes instead of comparing column values directly.
- Handle intra-day changes.
In this final part, we’ll introduce a solution for dealing with intra-day changes. The first two optimizations were covered in part 3.
Handling Intra-Day Changes
What exactly is an intra-day change? When you run your ETL batch load multiple times in one day, it’s possible that a type 2 attribute (see part 1 for an introduction on the different types of slowly changing dimensions) changes multiple times in that same day. Ideally, you only need the last value of the day stored in your dimension. In this part of the tip, we’ll look at how we can enhance the pattern of loading a slowly changing dimension type 2 to deal with this scenario.
So, what is exactly the problem with changing a value multiple times a day? Let’s illustrate with an example. Suppose we have the following row in the dimension table:
CustomerName | Location | ValidFrom | ValidTo |
---|---|---|---|
CustomerA | Antwerp | 2018-01-01 | NULL |
Suppose a new comes in from the source, where the location has been changed to New York, on August the 6th. This would result in the following data in the dimension table:
CustomerName | Location | ValidFrom | ValidTo |
---|---|---|---|
CustomerA | Antwerp | 2018-01-01 | 2018-08-04 |
CustomerA | New York | 2018-08-05 | NULL |
Remember the new ValidFrom field value is GETDATE() -1 and the ValidTo field of the previous version is set to ValidFrom of the current row minus one day.
For some reason, the location changes again on the same day to Delhi. Maybe someone made a mistake and corrected it, or it is just possible that the data can change multiple times a day. Coincidentally, the ETL also runs again on the same day, picking up this new change. The ValidFrom date will again be the 5th of August:
CustomerName | Location | ValidFrom | ValidTo |
---|---|---|---|
CustomerA | Antwerp | 2018-01-01 | 2018-08-04 |
CustomerA | New York | 2018-08-05 | 2018-08-04 |
CustomerA | Delhi | 2018-08-05 | NULL |
And this is where the trouble starts. For the second version, we have a ValidTo value that is before the value of the ValidFrom field in time. If there’s a unique index on the CustomerName and ValidFrom columns, an error will be thrown because the unique constraint would be violated, since the 2nd and 3rd version have the same values for those fields.
There are a couple of reasons why the ETL would run multiple times a day:
- There was an error later in the ETL batch run and the whole ETL job has been restarted, which is why the dimension is loaded again.
- The ETL job is started on demand to pick up new data. For example, the finance department has entered a new forecast in the system and they want to see the numbers in the reports. Instead of waiting a whole day till the ETL has run again, they asked the IT department to start a new run.
- The ETL runs multiple times a day, as a business requirement.
Whatever the reason, running an ETL job multiple times a day can have unintended consequences, especially when loading slowly changing dimensions. You need to be prepared to handle those scenarios. There are two possible solutions:
- Also use a time portion in the ValidFrom and the ValidTo fields. However, this makes querying and lookups more complex as it’s harder to find the correct row for a certain date.
- Instead of inserting a new row, update the current record. The downside is the dimension table will only store the most recent values of the day and it “forgets” all intermediary versions. In this tip, we’ll use this option.
Let’s modify the data flow we created in the first three parts of this tip. The trick is checking if a type 2 change is an intra-day change before we sent the row to the OLE DB Destination to be inserted into the dimension.
First, we need the ValidFrom value from the current record, which we retrieve in the Lookup component. Its query becomes:
SELECT [SK_Customer] ,[CustomerName] ,[Hash_SCD1_OLD] = [Hash_SCD1] ,[Hash_SCD2_OLD] = [Hash_SCD2] ,ValidFrom_OLD = [ValidFrom] FROM [dbo].[DimCustomer] WHERE [ValidTo] IS NULL;
Now we use a conditional split to check if the new ValidFrom value is equal to the old one:
If there is such a change, we flag the record using a new field which is added by a derived column transformation:
The records having an intraday change are then merged with the type 1 changes:
After the UNION ALL, we set the IntraDay flag to 0 if it is NULL (these are the records with only a type 1 change).
Now we have almost finished modifying the data flow, but first we need to adapt the table holding the type 1 updates. It needs an extra column to hold the IntraDay flag, but it now also needs to store the Type 2 attributes. We’re also going to store the surrogate key of the current version to optimize performance of the UPDATE statement. It’s DROP and CREATE TABLE statement becomes:
DROP TABLE IF EXISTS [dbo].[UPD_DimCustomer_SCD1]; CREATE TABLE [dbo].[UPD_DimCustomer_SCD1]( [CustomerName] [VARCHAR](50) NOT NULL, [Email] [varchar](50) NULL, [Location] [varchar](50) NOT NULL, IntraDay [INT] NOT NULL, SK_Customer [INT] NOT NULL );
Don’t forget to update the mapping for the OLE DB Destination component!
The data flow now has the following layout:
The last step is to add an extra UPDATE statement to the Execute SQL Task in the control flow:
UPDATE c SET [Location] = u.[Location] FROM dbo.[DimCustomer] c JOIN dbo.[UPD_DimCustomer_SCD1] u ON u.[SK_Customer] = c.[SK_Customer] WHERE [u].[IntraDay] = 1 AND c.[ValidTo] IS NULL;
Testing the Package
Let’s test our solution. First, we change the location in the staging table to New York and we run the package.
Then we change the location to Delhi and then we run the package again.
As you can see, the location New York was replaced with Delhi.
Conclusion
In this final part of the tip on slowly changing dimensions, we discussed how you can handle intra-day changes inside the SSIS package. The result of this change is that only the last value of the day is kept, while all previous values are discarded.
Next Steps
- Read part 1, part 2 and part 3 of this tip to learn more on how to implement a pattern for loading a slowly changing dimension of type 2 and how to optimize performance using hashes.
- You can find the final SSIS package and an *.sql script to create the sample tables here.
- You can find more SSIS development tips in this overview.
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips