![]() |
|

|
|
By: Dallas Snider | Read Comments (15) | Related Tips: More > T-SQL |
In my 18-plus years of T-SQL experience, the MERGE statement has got to be one of the most difficult statements I have had to implement. It is powerful and multifunctional, yet it can be hard to master. Oftentimes I would find examples of the MERGE statement that just didn't do what I needed it to do, that is to process a Type 2 slowly changing dimension. Check out this tip to learn more.
The solution presented in this tip will walk through the steps of how to use the MERGE statement nested inside an INSERT statement to handle both new records and changed records in a Type 2 Slowly Changing Dimension table within a data warehouse. This solution will walk through the processing over three days. Each day, new and changed records are processed. Records are first inserted into a staging table and then the MERGE statement will insert new records into the slowly changing dimension table.
I suggest that you copy and paste each of the code samples into its own editor window in SQL Server Management Studio for ease of execution.
In Code Sample 1 below, we will create our staging table and our slowly changing dimension table.
--=============================================================================
-- Code Sample 1
--=============================================================================
-- Create the staging table for the type two slowly changing dimension table data
create table dbo.tblStaging
(
SourceSystemID int not null,
Attribute1 varchar(128) not null
constraint DF_tblStaging_Attribute1 default 'N/A',
Attribute2 varchar(128) not null
constraint DF_tblStaging_Attribute2 default 'N/A',
Attribute3 int not null
constraint DF_tblStaging_Attribute3 default -1,
DimensionCheckSum int not null
constraint DF_tblStaging_DimensionCheckSum default -1,
LastUpdated datetime not null
constraint DF_tblStaging_LastUpdated default getdate(),
UpdatedBy varchar(50) not null
constraint DF_tblStaging_UpdatedBy default suser_sname()
)
-- Create the type two slowly changing dimension table
create table dbo.tblDimSCDType2Example
(
SurrogateKey int not null identity(1,1) PRIMARY KEY,
SourceSystemID int not null,
Attribute1 varchar(128) not null
constraint DF_tblDimSCDType2Example_Attribute1 default 'N/A',
Attribute2 varchar(128) not null
constraint DF_tblDimSCDType2Example_Attribute2 default 'N/A',
Attribute3 int not null
constraint DF_tblDimSCDType2Example_Attribute3 default -1,
DimensionCheckSum int not null
constraint DF_tblDimSCDType2Example_DimensionCheckSum default -1,
EffectiveDate date not null
constraint DF_tblDimSCDType2Example_EffectiveDate default getdate(),
EndDate date not null
constraint DF_tblDimSCDType2Example_EndDate default '12/31/9999',
CurrentRecord char(1) not null
constraint DF_tblDimSCDType2Example_CurrentRecord default 'Y',
LastUpdated datetime not null
constraint DF_tblDimSCDType2Example_LastUpdated default getdate(),
UpdatedBy varchar(50) not null
constraint DF_tblDimSCDType2Example_UpdatedBy default suser_sname()
)
In Code Sample 2, we insert two "new" records from the source system into the staging table. Then, we use the BINARY_CHECKSUM function to create a checksum value for the records in the staging table. This checksum value will be utilized later during the MERGE statement to detect changed records.
--============================================================================= --Code Sample 2 --============================================================================= -- Start of Day 1 - truncate the staging table truncate table dbo.tblStaging -- insert a new record into the staging table into the staging table insert into dbo.tblStaging (SourceSystemID, Attribute1, Attribute2, Attribute3) values (1 , 'Mary Brown', 'Single' , 143) -- insert a new record into the staging table into the staging table insert into dbo.tblStaging (SourceSystemID, Attribute1, Attribute2, Attribute3) values (2 , 'Ricky Green', 'Married' , 189) -- Update the checksum value in the staging table update dbo.tblStaging set DimensionCheckSum= BINARY_CHECKSUM(SourceSystemID, Attribute1, Attribute2, Attribute3)
Code Sample 3 shows the MERGE statement (which happens to be embedded within an INSERT statement) that will be used to process the records between the two tables. The comments in the T-SQL code will help to explain what is happening throughout the statement. We will reuse Code Sample 3 throughout this tip.
There are two things that need to be noticed.
--=============================================================================
-- Code Sample 3
--=============================================================================
-- begin of insert using merge
insert into dbo.tblDimSCDType2Example
( --Table and columns in which to insert the data
SourceSystemID,
Attribute1,
Attribute2,
Attribute3,
DimensionCheckSum,
EffectiveDate,
EndDate
)
-- Select the rows/columns to insert that are output from this merge statement
-- In this example, the rows to be inserted are the rows that have changed (UPDATE).
select
SourceSystemID,
Attribute1,
Attribute2,
Attribute3,
DimensionCheckSum,
EffectiveDate,
EndDate
from
(
-- This is the beginning of the merge statement.
-- The target must be defined, in this example it is our slowly changing
-- dimension table
MERGE into dbo.tblDimSCDType2Example AS target
-- The source must be defined with the USING clause
USING
(
-- The source is made up of the attribute columns from the staging table.
SELECT
SourceSystemID,
Attribute1,
Attribute2,
Attribute3,
DimensionCheckSum
from dbo.tblStaging
) AS source
(
SourceSystemID,
Attribute1,
Attribute2,
Attribute3,
DimensionCheckSum
) ON --We are matching on the SourceSystemID in the target table and the source table.
(
target.SourceSystemID = source.SourceSystemID
)
-- If the ID's match but the CheckSums are different, then the record has changed;
-- therefore, update the existing record in the target, end dating the record
-- and set the CurrentRecord flag to N
WHEN MATCHED and target.DimensionCheckSum <> source.DimensionCheckSum
and target.CurrentRecord='Y'
THEN
UPDATE SET
EndDate=getdate()-1,
CurrentRecord='N',
LastUpdated=getdate(),
UpdatedBy=suser_sname()
-- If the ID's do not match, then the record is new;
-- therefore, insert the new record into the target using the values from the source.
WHEN NOT MATCHED THEN
INSERT
(
SourceSystemID,
Attribute1,
Attribute2,
Attribute3,
DimensionCheckSum
)
VALUES
(
source.SourceSystemID,
source.Attribute1,
source.Attribute2,
source.Attribute3,
source.DimensionCheckSum
)
OUTPUT $action,
source.SourceSystemID,
source.Attribute1,
source.Attribute2,
source.Attribute3,
source.DimensionCheckSum,
getdate(),
'12/31/9999'
) -- the end of the merge statement
--The changes output below are the records that have changed and will need
--to be inserted into the slowly changing dimension.
as changes
(
action,
SourceSystemID,
Attribute1,
Attribute2,
Attribute3,
DimensionCheckSum,
EffectiveDate,
EndDate
)
where action='UPDATE';
So let's review the steps to get this example to work:
--============================================================================= --Code Sample 4 --============================================================================= select * from dbo.tblStaging order by SourceSystemID select * from dbo.tblDimSCDType2Example order by SourceSystemID, SurrogateKey
In the figure below, we see the results from Code Sample 4 where our two new records in the staging table have been inserted into the slowly changing dimension table. Please notice the SurrogateKey, EffectiveDate, EndDate, and CurrentRecord columns as they will change as we move forward with this example.

In Code Sample 5 shown below, two new records (SourceSystemID 3 and SourceSystemID 4) and one updated record (SourceSystemID 2) are inserted into the staging table and checksums are calculated.
--============================================================================= --Code Sample 5 --============================================================================= --Start of Day 2 - truncate the staging table truncate table dbo.tblStaging -- insert a new record into the staging table insert into dbo.tblStaging (SourceSystemID, Attribute1, Attribute2, Attribute3) values (3 , 'Jane Doe', 'Single' , 123) -- insert a new record into the staging table insert into dbo.tblStaging (SourceSystemID, Attribute1, Attribute2, Attribute3) values (4 , 'John Doe', 'Married' , 246) -- insert a changed record into the staging table insert into dbo.tblStaging (SourceSystemID, Attribute1, Attribute2, Attribute3) values (2 , 'Ricky L. Green', 'Married' , 189) -- Update the checksum value in the staging table update dbo.tblStaging set DimensionCheckSum= BINARY_CHECKSUM(SourceSystemID, Attribute1, Attribute2, Attribute3)
After executing Code Sample 5 to insert records into the staging table, execute the MERGE/INSERT statement in Code Sample 3 and the SELECT statements in Code Sample 4. The results of the SELECT statements are shown in the figure below. Notice how there is a second record for SourceSystemID 2 (SurrogateKey 5) that reflects the change in the Attribute2 column. Also, notice how the DimensionCheckSum column is different between SurrogateKeys 2 and 5. Furthermore, the EndDate for SurrogateKey 2 has changed from 12/31/9999 to 01/27/2013 and the CurrentRecord is set to 'N'.

In Code Sample 6 shown below, one new record (SourceSystemID 5) and three updated records (SourceSystemIDs 3, 4 and 5) are inserted into the staging table and checksums are calculated.
--============================================================================= -- Code Sample 6 --============================================================================= --Start of Day 3 - truncate the staging table truncate table dbo.tblStaging -- insert a changed record into the staging table insert into dbo.tblStaging (SourceSystemID, Attribute1, Attribute2, Attribute3) values (3 , 'Jane Doe-Jones', 'Married' , 123) -- insert a changed record into the staging table insert into dbo.tblStaging (SourceSystemID, Attribute1, Attribute2, Attribute3) values (4 , 'John Doe', 'Married' , 220) -- insert a new record into the staging table insert into dbo.tblStaging (SourceSystemID, Attribute1, Attribute2, Attribute3) values (5 , 'Bill Smith', 'Married' , 198) -- insert a changed record into the staging table insert into dbo.tblStaging (SourceSystemID, Attribute1, Attribute2, Attribute3) values (1 , 'Mary Brown', 'Single' , 136) -- Update the checksum value in the staging table update dbo.tblStaging set DimensionCheckSum= BINARY_CHECKSUM(SourceSystemID, Attribute1, Attribute2, Attribute3)
After executing Code Sample 6 to insert records into the staging table, again execute the MERGE/INSERT statement in Code Sample 3 and the SELECT statements in Code Sample 4. The results of the latest SELECT statements are shown in the figure below. Notice how there are new records for SourceSystemID 1, 3 and 4 where the CurrentRecord column is set to 'Y' and the EndDate is 12/31/9999.

| Monday, February 18, 2013 - 10:39:35 AM - John | Read The Tip |
|
Disclaimer: I didn't tke the time to read your whole article, so please forgive me if you address this. My concern is about the "BINARY_CHECKSUM()" function. Will this option always tell you when data has changed? Consider a "pathological" worst case where "Attribute1" is set to "... San Diego, CA 91311" for one data row. Let's say the data row gets changed to "... San Diego, CA 91302". Here the ASCII value of one byte (1) was decremented by one to (0) and then the ASCII value of the next byte (1) was incremented by one to (2) -- with a net change of zero for the computed checksum. Will this case be correctly flagged as data changed? |
|
| Tuesday, February 19, 2013 - 7:12:18 AM - the sqlist | Read The Tip |
|
I don't get it. Isn't MERGE supposed to do the insert already, then why the INSERT ststement above it? The purpose of this statement is to do the I/U/D operations in one single shot based on the rules defined and by joining the target table rows with the source rows. The only output from MERGE would be the results of the changes from the OUTPUT clause. Is that needed in the target table too? |
|
| Tuesday, February 19, 2013 - 7:19:39 AM - the sqlist | Read The Tip |
|
From my understanding what that INSERT from MERGE statement will do is update the target table and then insert the rows that were updated and returned by the OUTPUT clause again into it. Is that really the intention? |
|
| Tuesday, February 19, 2013 - 10:31:35 PM - John | Read The Tip |
|
I kind of agree with the first post made by "the sqlist". Normally, when I've used the MERGE statement in the past, it is pretty much a self-contained statement that does the INSERT, UPDATE, and/or DELETE. Not fully sure why one needs to use a MERGE inside an INSERT statement. You can create a "derived table" expression within the USING-clause as necessary. And it's easy if you just code your UPDATE section based upon the PK in the target table. Here's a sample of a simple call:
MERGE INTO [dbo].[dimCUSTOMER_EMAIL] AS TGT |
|
| Wednesday, February 20, 2013 - 2:39:30 PM - Ty | Read The Tip |
|
The reason why he has the merge inside the insert is because the update in the merge isn't the data update. It's just updating the existing record and invalidating it as the current record. In other words, keeping a history of record updates. You'll note it flags the record as 'N' and updates the enddate. This effectively archives this record to determine when it last changed and what values it had at that time. It's the outer insert itself which 'updates' the table with the current data record. |
|
| Wednesday, February 20, 2013 - 2:41:34 PM - Ty | Read The Tip |
|
http://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_II |
|
| Wednesday, February 20, 2013 - 4:38:17 PM - the sqlist | Read The Tip |
|
Thaky you both for clarifications. I now know what a Type 2 Slowly Changing Dimension is. I actually used it a lot in my experience, also the The Type 6 pure one, but I never was not familiar with these terms. I have to admit that using the output result from MERGE statement like that is very clever and it also must be well performing compared to other, more clasic, methods. |
|
| Friday, March 01, 2013 - 7:29:07 AM - Terence Hill | Read The Tip |
| The comments nicely prove the worth of the article itself. I'd still say that the code lacks handling for the WHEN NOT MATCHED BY SOURCE case. We'd want to end up with no records having Current='Y' for the source ID, the last EndDate showing the point of deletion. Would "where action='UPDATE' and SourceSystemID IS NOT NULL" be sufficient? | |
| Friday, March 01, 2013 - 2:14:28 PM - Dallas Snider | Read The Tip |
|
I understand the well-documented concerns about CHECKSUM() or BINARY_CHECKSUM() failing to detect changes in records. However, I used the BINARY_CHECKSUM() as shown in the tip on a 4 million records and growing table with 30+ attributes passed to the function for almost four years in a production environment and it never failed to detect a change. |
|
| Friday, March 01, 2013 - 2:35:05 PM - Dallas Snider | Read The Tip |
|
With a Type 2 Slowly Changing Dimension (SCD), the idea is to track the changes to (or record the history of) an entity over time. To accomplish this tracking, rows should never be deleted and the attributes are never updated. Instead, changes in the data are applied through the end-dating of the existing current record and by flagging the record as no longer being current; while a new record is inserted with the changes in the attributes. The INSERT/MERGE code above accomplishes the goals of maintaining a Type 2 SCD with a “minimal” amount of code to execute.
|
|
| Friday, March 08, 2013 - 6:37:39 AM - John Martin | Read The Tip |
|
If you are worried about the checksums you can use the following instead of the checksum comparison in the "WHEN MATCHED" clause: WHEN MATCHED AND NOT EXISTS ( This handles nulls also. Take a look at http://sqlblog.com/blogs/paul_white/archive/2011/06/22/undocumented-query-plans-equality-comparisons.aspx
|
|
| Friday, March 08, 2013 - 1:00:14 PM - Lamprey | Read The Tip |
|
I have no issue with the article for what it is demonstrating. What I do have issue with is passing this off as something to do in a data warehouse. A data warehouse is no place for a dimensional model. The dimensional model fails at so many levels when trying to do a temporal data warehouse. If you are interested in implementing an actual temporal data warehouse, then I suggest you read Temporal Data & the Relational Model by CJ Date (http://www.amazon.com/Temporal-Relational-Kaufmann-Management-Systems/dp/1558608559). Assuming you have a solid relation background it shouldn't be too hard to extend which ever RDBMS you work with to handle temporal data. |
|
| Friday, March 29, 2013 - 4:05:06 PM - prince10 | Read The Tip |
|
Nice post I love the merge statement however, i am confused on some part, I am using a SP to do my SCD, where i call the SP but i do understand that merge statement is good. My question now is that, when i have additional columns to my dimensions namely:, ValidFrom ValidTo IsActive With this three coloumns above , how can i update the ValidFrom when it expires or new records comes in?Because most post i have seen dont have endDate, StartDate etc.
Any help or explanation on how to update those three columns will be usefull.
|
|
| Friday, March 29, 2013 - 10:36:11 PM - Dallas Snider | Read The Tip |
|
Prince10, Based on what you are stating, I believe your ValidFrom column is synonymous to my EffectiveDate column, your ValidTo column is synonymous to my EndDate column, and your IsActive column is synonymous to my CurrentRecord column. Please refer to how I handle these columns above. I hope this helps. Dallas |
|
| Saturday, March 30, 2013 - 12:00:37 PM - prince10 | Read The Tip |
|
Hello Dallas, Thanks for your respond. i have the below code that am using, tell me what the difference is between the merge and the insert statement below
DECLARE @CurrentTime DATETIME --STEP 1 Insert rows for source rows with changed attributes Insert into dbo.DimDepot SELECT S.DepotID, FROM dbo.[Staging.FinalDimDepot] S --- STEP 2. Expire active rows if attributes are changed. UPDATE dbo.DimDepot -- Step 3. Update IsActive = 2 rows to 1
INSERT INTO DimDepot SELECT DepotID, FROM dbo.[Staging.FinalDimDepot]
|
|
|
privacy | disclaimer | copyright | advertise | about authors | contribute | feedback | giveaways | user groups Some names and products listed are the registered trademarks of their respective owners. Edgewood Solutions LLC | MSSharePointTips.com | MSSQLTips.com |