Using hash values in SSIS to determine when to insert or update rows


By: Koen Verbeeck   |   Related Tips: More > Integration Services Development



Problem

When you want to load data into your data warehouse, for example into one of your dimensions, you need to determine if an incoming row is either an update - a historical change - or an insert, which is a new occurrence of a record. An update row is a source record whose business key - or natural key, as some folks prefer - is already present in the destination table. To process these updates as efficiently as possible, you need to determine if the update is an actual update, meaning that at least one of the columns has changed value. If none of the columns has actually changed, you can skip the update of that record and save some precious resources.

A common method to detect if a row has changed is to compare hashes: if the hash of the incoming record is different from the hash found in the destination table, one or more columns have changed. Christian Allaire describes the concept using the T-SQL Hashbytes function in the tip Using Hashbytes to track and store historical changes for SQL Server data. Integration Services itself doesn't have a hash implementation, so in this tip I'll show how you can implement a hash detection mechanism using a script component, as an alternative to the pure T-SQL solution.
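To illustrate the idea outside of SSIS, here is a minimal sketch in Python, with hashlib standing in for HASHBYTES or the script component shown later (the column values are made up for the demo):

```python
import hashlib

def hash_of(row):
    # one hash per row: concatenate the string-converted columns and digest them
    return hashlib.sha256("".join(map(str, row)).encode()).hexdigest()

stored = hash_of(("Jon", "Yang", "S"))   # hash kept alongside the destination row
incoming = ("Jon", "Yang", "M")          # marital status changed at the source

if hash_of(incoming) != stored:
    print("changed -> stage for update")
else:
    print("unchanged -> skip the update")
```

Instead of comparing every column pair by pair, you compare two short strings; the rest of the tip builds exactly this comparison into the data flow.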

Solution

The concept of the incremental load pattern - where you detect the inserts and the updates - is described clearly in this article by Andy Leonard: SSIS Design Pattern - Incremental Loads. We will closely follow those recommendations in this tip, but I'll make some performance improvements and add the hashing of course.

Demonstrate Data Load Options in SSIS

Let's keep it simple and use the AdventureWorksDW2012 database as the source for our data, which you can download here. If you don't have SQL Server 2012, you can use any earlier version of the AdventureWorksDW database as well. The following query will extract the data:

SELECT
  [CustomerAlternateKey] -- this is our business key (BK)
 ,[Title]
 ,[FirstName]
 ,[MiddleName]
 ,[LastName]
 ,[BirthDate]
 ,[MaritalStatus]
 ,[Suffix]
 ,[Gender]
 ,[EmailAddress]
 ,[YearlyIncome]
 ,[TotalChildren]
 ,[NumberChildrenAtHome]
 ,[EnglishEducation]
 ,[EnglishOccupation]
 ,[HouseOwnerFlag]
 ,[NumberCarsOwned]
FROM [dbo].[DimCustomer];

All of the data is selected from the source. The SSIS package will detect which rows are inserts and which rows are updates.

The destination table can be created with the following script:

IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[MyDestination]') 
               AND type in (N'U'))
BEGIN
 CREATE TABLE [dbo].[MyDestination](
  [SK_Customers] [int] IDENTITY(1,1) NOT NULL,
  [CustomerAlternateKey] [nvarchar](15) NOT NULL,
  [Title] [nvarchar](8) NULL,
  [FirstName] [nvarchar](50) NULL,
  [MiddleName] [nvarchar](50) NULL,
  [LastName] [nvarchar](50) NULL,
  [BirthDate] [date] NULL,
  [MaritalStatus] [nchar](1) NULL,
  [Suffix] [nvarchar](10) NULL,
  [Gender] [nvarchar](1) NULL,
  [EmailAddress] [nvarchar](50) NULL,
  [YearlyIncome] [money] NULL,
  [TotalChildren] [tinyint] NULL,
  [NumberChildrenAtHome] [tinyint] NULL,
  [EnglishEducation] [nvarchar](40) NULL,
  [EnglishOccupation] [nvarchar](100) NULL,
  [HouseOwnerFlag] [nchar](1) NULL,
  [NumberCarsOwned] [tinyint] NULL,
  [DWH_Hash] [char](66) NULL,
  CONSTRAINT [PK_MyDestination] PRIMARY KEY CLUSTERED
   ([SK_Customers] ASC)
 );
END

The incremental load package takes the following form:

Control Flow Incremental Load

The original incremental load pattern uses an OLE DB Command in the data flow to perform the updates. I dislike this component because it issues a statement for every single row that passes through the component, which is horribly slow for large data sets and it can also bloat the transaction log. Therefore I create a staging table that will hold the update rows so that a set-based UPDATE statement can be issued in the third task. This improves performance immensely. The update table is created with this statement:
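The gain is easy to sketch outside SSIS. The following Python/sqlite3 illustration (table and column names are invented for the demo) stages only the changed rows, then touches the destination with one set-based UPDATE instead of a statement per row:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE dest (sk INTEGER PRIMARY KEY, title TEXT)")
cur.execute("CREATE TABLE upd  (sk INTEGER PRIMARY KEY, title TEXT)")
cur.executemany("INSERT INTO dest VALUES (?, ?)",
                [(1, "Mr."), (2, "Ms."), (3, "Dr.")])

# only the rows that actually changed are written to the staging table
cur.execute("INSERT INTO upd VALUES (?, ?)", (2, "Prof."))

# one set-based statement instead of an UPDATE per row (the OLE DB Command approach)
cur.execute("""
    UPDATE dest
    SET title = (SELECT title FROM upd WHERE upd.sk = dest.sk)
    WHERE sk IN (SELECT sk FROM upd)
""")
con.commit()

print(cur.execute("SELECT title FROM dest ORDER BY sk").fetchall())
```

The database engine does one pass over the join instead of parsing, planning and logging a separate statement for every row.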

IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[UPD_MyDestination]') 
           AND type in (N'U'))
   DROP TABLE [dbo].[UPD_MyDestination];
GO
IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[UPD_MyDestination]') 
               AND type in (N'U'))
BEGIN
 CREATE TABLE [dbo].[UPD_MyDestination](
  [SK_Customers] [int] NOT NULL,
  [Title] [nvarchar](8) NULL,
  [FirstName] [nvarchar](50) NULL,
  [MiddleName] [nvarchar](50) NULL,
  [LastName] [nvarchar](50) NULL,
  [BirthDate] [date] NULL,
  [MaritalStatus] [nchar](1) NULL,
  [Suffix] [nvarchar](10) NULL,
  [Gender] [nvarchar](1) NULL,
  [EmailAddress] [nvarchar](50) NULL,
  [YearlyIncome] [money] NULL,
  [TotalChildren] [tinyint] NULL,
  [NumberChildrenAtHome] [tinyint] NULL,
  [EnglishEducation] [nvarchar](40) NULL,
  [EnglishOccupation] [nvarchar](100) NULL,
  [HouseOwnerFlag] [nchar](1) NULL,
  [NumberCarsOwned] [tinyint] NULL,
  [Hash] [char](66) NULL
 );
END
GO

Next, the data flow loads the data from the source and uses a Lookup component to find out whether a row is an insert or an update. The Lookup uses the following query to fetch the reference data set from the destination table:

SELECT SK_Customers,CustomerAlternateKey,DWH_Hash
FROM MyDestination;

Hint: always write a SQL query to select the lookup columns from the reference data set. Never use the drop down; it selects all columns and wastes memory. If you select only the columns you actually need, chances are good that you can fit everything in memory with the full cache option, even for very large reference sets.

The Lookup matches the source rows with the destination rows using the business key. If a match is found - an update - the surrogate key and the hash are retrieved. Make sure to configure the Lookup to redirect non-matching rows to the no match output.
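Conceptually, a full-cache Lookup behaves like an in-memory dictionary keyed on the business key; a rough Python sketch (the keys and rows are invented) of the match/no-match split:

```python
# reference set: business key -> surrogate key and stored hash, fetched once
reference = {
    "AW00011000": {"sk": 1, "dwh_hash": "0xAAA"},
    "AW00011001": {"sk": 2, "dwh_hash": "0xBBB"},
}

source_rows = [
    {"bk": "AW00011001", "first_name": "Eugene"},   # existing customer
    {"bk": "AW00099999", "first_name": "Newbie"},   # brand new customer
]

inserts, matches = [], []
for row in source_rows:
    hit = reference.get(row["bk"])
    if hit is None:
        inserts.append(row)        # "no match" output -> straight insert
    else:
        row.update(hit)            # attach surrogate key and stored hash
        matches.append(row)        # "match" output -> hash comparison next
```

This is also why the hint above matters: the whole `reference` dictionary has to fit in memory, so every column you leave out of the lookup query buys you room.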

Data Flow Incremental Load

The third task updates the destination table using the rows stored in the staging table.

UPDATE dest
SET  [Title]                = upd.[Title]
    ,[FirstName]            = upd.[FirstName]
    ,[MiddleName]           = upd.[MiddleName]
    ,[LastName]             = upd.[LastName]
    ,[BirthDate]            = upd.[BirthDate]
    ,[MaritalStatus]        = upd.[MaritalStatus]
    ,[Suffix]               = upd.[Suffix]
    ,[Gender]               = upd.[Gender]
    ,[EmailAddress]         = upd.[EmailAddress]
    ,[YearlyIncome]         = upd.[YearlyIncome]
    ,[TotalChildren]        = upd.[TotalChildren]
    ,[NumberChildrenAtHome] = upd.[NumberChildrenAtHome]
    ,[EnglishEducation]     = upd.[EnglishEducation]
    ,[EnglishOccupation]    = upd.[EnglishOccupation]
    ,[HouseOwnerFlag]       = upd.[HouseOwnerFlag]
    ,[NumberCarsOwned]      = upd.[NumberCarsOwned]
    ,[DWH_Hash]             = upd.[Hash]
FROM [dbo].[MyDestination]     dest
JOIN [dbo].[UPD_MyDestination] upd ON dest.SK_Customers = upd.SK_Customers;

The final task will drop the staging table, which is optional of course. If you drop it though, don't forget to set the package to DelayValidation=TRUE.

When this package is run for the first time, all of the source rows are new and are directly inserted into the destination.

Initial Load Without Hash

When the package is run again, all rows are matched against the destination table and sent to the staging table. This means that all of the rows were updated, even though this wasn't necessary at all.

Reload Without Hash

Now let's add a hash to detect if there were actual changes or not.

Adding the Hash Value to Detect Data Changes in SSIS

From the SSIS Toolbox, add a script component to the data flow canvas. A pop-up will ask you which type of script you want: a source, a transformation or a destination. Choose the transformation. Open up the editor. In this tip, I will use C# code. If you're more familiar with VB.NET, the code can easily be adapted. Go to the Input Columns pane and add every column, except the business key. It won't change since it is the business key after all, so it is not necessary to add it to the hash. All of the input columns can be left as ReadOnly.

Configure Input Columns

Now go to the Inputs and Outputs pane. There we will add the Hash column to the output. Give it the string data type with a length of 66. I will explain later why we are using this length.

Configure Output Column

Go back to the Script pane and open up the script editor. Since we are going to use a hash function, which is a cryptographic function, a using directive for the System.Security.Cryptography namespace needs to be added to the script.

Add Cryptography Namespace

The script takes the following steps:

  • Concatenate all input columns into one long string. If a column contains a NULL value, convert it to the empty string.
  • Compute the hash over this concatenated string. This will return an array of bytes.
  • This array is converted back to a string using a for loop and is finally added to the output. The characters '0x' are prefixed to the result, to resemble the output of the HASHBYTES function.
HASHBYTES output in SSMS
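The same three steps can be mirrored in Python for comparison (hashlib standing in for the .NET SHA256 class; UTF-16LE matches the Unicode encoding the script uses):

```python
import hashlib

def ssis_style_hash(*columns):
    # step 1: concatenate the columns, turning NULL (None) into the empty string
    concatenated = "".join("" if c is None else str(c) for c in columns)
    # step 2: SHA-256 over the Unicode (UTF-16LE) bytes returns 32 raw bytes
    digest = hashlib.sha256(concatenated.encode("utf-16-le")).digest()
    # step 3: hex-encode byte by byte and prefix '0x', like HASHBYTES output
    return "0x" + "".join(f"{b:02X}" for b in digest)

print(ssis_style_hash("Jon", None, "Yang"))
```

Note that with plain concatenation two NULL columns and one empty-string column hash identically; if that distinction matters for your data, insert a delimiter between the columns.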

The final code looks like this:

private SHA256 sha = new SHA256CryptoServiceProvider(); // used to create the SHA256 hash

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // concatenate all columns; NULL values are replaced with the empty string
    string hashSource =
          (Row.BirthDate_IsNull ? "" : Row.BirthDate.ToString())
        + (Row.EmailAddress_IsNull ? "" : Row.EmailAddress.ToString())
        + (Row.EnglishEducation_IsNull ? "" : Row.EnglishEducation.ToString())
        + (Row.EnglishOccupation_IsNull ? "" : Row.EnglishOccupation.ToString())
        + (Row.FirstName_IsNull ? "" : Row.FirstName.ToString())
        + (Row.Gender_IsNull ? "" : Row.Gender.ToString())
        + (Row.HouseOwnerFlag_IsNull ? "" : Row.HouseOwnerFlag.ToString())
        + (Row.LastName_IsNull ? "" : Row.LastName.ToString())
        + (Row.MaritalStatus_IsNull ? "" : Row.MaritalStatus.ToString())
        + (Row.MiddleName_IsNull ? "" : Row.MiddleName.ToString())
        + (Row.NumberCarsOwned_IsNull ? "" : Row.NumberCarsOwned.ToString())
        + (Row.NumberChildrenAtHome_IsNull ? "" : Row.NumberChildrenAtHome.ToString())
        + (Row.Suffix_IsNull ? "" : Row.Suffix.ToString())
        + (Row.Title_IsNull ? "" : Row.Title.ToString())
        + (Row.TotalChildren_IsNull ? "" : Row.TotalChildren.ToString())
        + (Row.YearlyIncome_IsNull ? "" : Row.YearlyIncome.ToString());

    // compute the hash over the concatenated string; this returns 32 bytes
    byte[] hashBytes = sha.ComputeHash(System.Text.Encoding.Unicode.GetBytes(hashSource));

    // convert the byte array to a hexadecimal string
    System.Text.StringBuilder sb = new System.Text.StringBuilder();
    for (int i = 0; i < hashBytes.Length; i++)
    {
        sb.Append(hashBytes[i].ToString("X2"));
    }

    Row.Hash = "0x" + sb.ToString(); // add the result to the output
}

The script uses the SHA256 hash function to compute the hash. This hash function outputs a 256-bit hash, is much safer than the commonly used MD5 hash and is less prone to hash collisions. The SHA256CryptoServiceProvider class is used to implement the hash, but you can also use the SHA256Managed class. There is not much difference between the two, except for the implementation behind the scenes. The CryptoServiceProvider class might be preferred over the managed class for regulatory reasons.

The 256-bit length of the hash ensures that the chance of hash collisions is minimal. 256 bits are 32 bytes, which convert to 64 hexadecimal characters. If you add the two extra characters of the '0x' prefix, you get the 66-character length used in the destination table and in the script component.
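The arithmetic is easy to verify with a quick Python one-off (any input will do):

```python
import hashlib

digest = hashlib.sha256(b"any input at all").digest()
hex_hash = "0x" + digest.hex().upper()

print(len(digest))    # 32 bytes = 256 bits
print(len(hex_hash))  # 64 hex characters + the 2-character '0x' prefix = 66
```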

Note: if you use SQL Server 2012 or a more recent edition as the source, you can also use the new CONCAT function to avoid the lengthy expression in the script component. However, the script component can easily be ported to earlier versions of SQL Server.

Add a Conditional Split transformation between the Lookup and the OLE DB Destination on the match output path. The expression is simple but effective:

Conditional Split Expression

The Conditional Split is one of the reasons to use a script component to calculate the hash. The T-SQL HASHBYTES() function returns a binary data type, which maps to the SSIS data type DT_BYTES. Unfortunately, this data type cannot be used in a comparison expression like the one above. Furthermore, if another source is used, you might not even use a function like HASHBYTES.

The final data flow has the following layout:

The Final Data Flow

If the destination table is truncated and the initial load is executed again, you can find the hash stored alongside the data in the destination table.

Hashes in the destination table

Now here is the moment that sums up this entire tip: if the package is run again, not a single row makes it to the destination table.

Updates do not make it through the conditional split

Let's change some rows in the source with the following script:

UPDATE [dbo].[DimCustomer]
SET Title = 'Test123'
WHERE CustomerKey % 1000 = 0;
-- this updates 19 rows

If the package is run again, only the changed rows are written to the staging table and subsequently updated. Mission accomplished.

Only changed rows make it through the conditional split

Conclusion

This tip demonstrated how you can efficiently manage the number of updates that come through your SSIS package with a hash. Since the hash is calculated in a script component, the solution is independent of any possible source. In the next tip, I'll demonstrate how this solution can be generated with the use of BIML!


About the author
Koen Verbeeck is a BI professional, specializing in the Microsoft BI stack with a particular love for SSIS.






Thursday, October 19, 2017 - 6:18:28 PM - Ken W Back To Top

Nice writeup Koen. I have a process that uses hashbytes to do this same sort of work in a t-sql solution. We still end up moving data around in SSIS anyway, so I might attempt a version more like what you have laid out here. We have to evaluate which columns we want to trigger changes, because it is a type 2 dimension update.

 


Friday, August 04, 2017 - 7:18:04 PM - Jibs Ajia Back To Top

 

Thanks Koen. Worked like a charm, pretty straightforward...


Tuesday, July 12, 2016 - 8:44:38 AM - Koen Verbeeck Back To Top

Hi Mac,

you'd have to store the hashes somewhere, so a table in SQL Server could be an option.
You'd have to do a lookup to retrieve the hash to compare it with the one coming from the script task.

Another option might be to calculate the hashes directly in the SQL statements.
I think Oracle has an equivalent function to hashbytes(). Downside is you have to calculate the hashes for both the incoming rows and all of the stored rows.


Monday, July 11, 2016 - 3:26:59 PM - Mac Hanna Back To Top

What if we want to create that package using a comparison between two tables in Oracle databases and not SQL Server databases? And if that will be for Oracle databases, can we create the third table in a SQL Server database? Because the access is limited in both Oracle DBs.


Thursday, November 19, 2015 - 3:41:25 PM - Koen Verbeeck Back To Top

There have been a number of comments of people with a good eye who noticed the hash value in the table wasn't being updated.
The update statement at the start of the tip has been updated so that the hash is now updated as well. 


Tuesday, November 17, 2015 - 9:19:01 AM - Koen Verbeeck Back To Top

Hi Ha Le Thien,

which columns exactly give NULL errors?

Also, don't forget to update the hash value in the UPDATE statement. This is an error in the article and has been pointed out in the comments.


Sunday, November 15, 2015 - 4:39:06 AM - Ha Le Thien Back To Top

Dear Koen ,

I followed your instructions exactly. However, there are some errors with the staging table. The second time the package runs, the lookup's match output fails to load into staging, with errors saying NULL values are not allowed in the staging table. Could you please check it again?

 


Tuesday, August 11, 2015 - 5:32:22 AM - Koen Verbeeck Back To Top

@Bob Frasca: (if you are still following this thread, almost one year later): I finally have been able to test your claim that you can still use the T-SQL Hashbytes function, but only if you convert the hashes to strings in the conditional split.

It does work, so thanks for that, but you have to store the hashes as binary values (and not as strings) in the destination table in order for it to work.

The solution presented in this tip is still valid for any source that is not SQL Server.
Say for example you read data from a large flat file and you don't want to use a staging table, then you can still use the script component.

Cheers,
Koen 


Tuesday, August 04, 2015 - 2:16:49 AM - Koen Verbeeck Back To Top

Hi Rodney,

the conditional split is case sensitive and white space sensitive.
Maybe you can put TRIM around the expressions to make sure there are no extra spaces. 


Monday, August 03, 2015 - 4:13:53 PM - Rodney Rau Back To Top

Koen,

It thinks they are all different, even though the bulk of them (95%) sure look like they are the same.

 


Monday, August 03, 2015 - 3:49:16 AM - Koen Verbeeck Back To Top

Rodney,

does the conditional split think all the hashes are the same, or that they are all different?


Friday, July 31, 2015 - 10:25:56 AM - Rodney Rau Back To Top

Hi Koen,

I receive no errors from the Conditional split, nor elsewhere.  I get the same results for every record in the batch whether the hashes differ or appear to be the same.  In all cases my cspl condition is met and no rows are directed to the default output. 

Thanks very much for your quick response!


Friday, July 31, 2015 - 2:56:00 AM - Koen Verbeeck Back To Top

Hi Rodney,

I don't have VS 2008 to test it myself.
Does the conditional split give any errors?
Or does it just not compare the hashes correctly?


Thursday, July 30, 2015 - 5:11:46 PM - Rodney Rau Back To Top

Thanks Koen,

Your tip worked great in my ssis/vs2013 packages. In my ssis/vs2008 package I could not get the Conditional Split to recognize what appear to be identical hash strings. Here's my condition expression from my 2008 cspl - (DT_STR,66,1252) [lkp_DWH_Hash] != (DT_STR,66,1252) [Hash]. I did have to use the SHA1 hash function rather than SHA256 to get my 2008 C# to compile. All the rest was the same. When I check the data viewers the Hash and lkp_DWH_Hash look identical. Much appreciate any thoughts you might have on my issue.

Thanks again.


Wednesday, March 25, 2015 - 10:00:43 AM - Tom Oakes Back To Top

This is an outstanding article and helped us a lot.  The OLE DB Command approach had me really bummed out.  

 

Thanks.

 


Saturday, March 14, 2015 - 12:39:52 AM - sandeep Pandey Back To Top

Ohh, Yes sorry I didn't see. Thanks Koen once again great contribution.


Friday, March 13, 2015 - 9:55:01 AM - Koen Verbeeck Back To Top

@Sundeep: yes, the hash needs to be updated as well. Yan Ouyang already found this issue a few months back (see earlier comments).


Friday, March 13, 2015 - 8:03:18 AM - sandeep Pandey Back To Top

 Hi Koen,

Very nice and a great way to start. I would like to add something you can cross-check: this will update every time the package runs, because the update task needs to update DWH_Hash in the update statement to get the new hash of the change.

Please cross-check from your side also.


Tuesday, February 24, 2015 - 7:47:11 AM - Koen Verbeeck Back To Top

@Andrew:

yes, I mention at the start of the article that this tip is an alternative to the pure T-SQL solution.
I also mention that since SQL Server 2012 you can use the CONCAT function in SQL Server which makes things even easier in TSQL.

However, the reason I wrote the script component at the time was because I couldn't get the conditional split to work with the binary output of HASHBYTES. Therefore I used the script instead, because it outputs a string. Now, Bob mentioned I should try a cast in the conditional split, but I still haven't tested this. If the cast works, TSQL is definitely easier than the script.

If you don't have a relational source though (such as a flat file), the script is your only option. Or you have to dump the data into a staging table and then do your comparison.

Cheers,
Koen 


Monday, February 23, 2015 - 5:55:59 PM - Andrew Jones Back To Top

Couldn't you generate the hash in the original select statement from the source without the need for the script task?


Thursday, December 04, 2014 - 9:47:52 PM - Yan Ouyang Back To Top

Thanks for your prompt reply, Koen. Actually I did update the DW_hash as you said. One thing is that the SHA256 hash function doesn't work in 2008R2, so I used SHA1 instead of SHA256. Anyway, thanks!


Tuesday, November 25, 2014 - 8:29:25 AM - Koen Verbeeck Back To Top

@Yan: you are absolutely correct. The update of the hash should be done in the update statement.
When you have calculated the hash with the script component, you can store it in the staging table as well. And then you can update it using the UPDATE statement.

The reason I forgot to include this in the tip is because I based my package upon an actual package, where I do not update the rows but rather insert new versions of a row, just like in a type II dimension. 


Tuesday, November 25, 2014 - 6:45:25 AM - Yan Ouyang Back To Top

Nice article. Help me a lot. But I have a question about the update: I did not see where the DWH_Hashs of the updated records update. 


Thursday, September 25, 2014 - 3:30:19 AM - Koen Verbeeck Back To Top

@Bob: I haven't had time yet to test your claim. I'll come back to you later :)

@Damian: if you want to detect deletes you must make sure you do a full load of the source every time. Otherwise how would you know if a row has been deleted or if it just wasn't in the load window? Typically you can detect deletes by checking all of the rows of the destination table against the rows in the staging table using a NOT EXISTS clause.


Wednesday, September 24, 2014 - 11:26:50 AM - Damian Back To Top

Great article, got this working despite being a newbie to SSIS.

I want to expand this to deal with deletes too.  Can anyone give me any pointers?


Tuesday, September 16, 2014 - 11:28:02 AM - Bob Frasca Back To Top

Your remark about using the T-SQL hashbytes function isn't entirely accurate. This expression works just fine.

((DT_WSTR,42)srchash != (DT_WSTR,42)desthash)


Thursday, April 17, 2014 - 11:25:15 AM - Steven Neumersky Back To Top

Definitely better than the CHECKSUM transform on SQLIS.com.  Good stuff....as usual.


Friday, March 21, 2014 - 7:48:43 AM - Phil Parkin Back To Top

Nice article. It would also be interesting to see some performance stats comparing the HASH-only update with the full update to get an idea of how much of an overhead is added by creating and matching on the hashes.

If it is likely that a large percentage of the UPDATEs are actual changes, I would expect the full update to win.


Thursday, March 06, 2014 - 7:46:07 AM - Koen Verbeeck Back To Top

Indeed it does, Brett. Except you used MD5 (I prefer SHA256, but this is more of a personal choice), Visual Basic and the ugly OLE DB command ;)

In your code you looped over the columns of the input row in the buffer, which is certainly more efficient.


Wednesday, March 05, 2014 - 9:02:08 PM - Brett Flippin Back To Top

Seems pretty similar to my post on SQLServerCentral in 2010. Take a gander:

 

http://www.sqlservercentral.com/articles/Integration+Services+%28SSIS%29/69766/


Tuesday, February 25, 2014 - 3:42:58 AM - David S Back To Top

Nice article. We did this some weeks back but utilised another software package (Pentaho), and went through the same process of comparing hash values. However, we decided to truly determine if a change was an update by determining which fields we wanted to monitor for changes.

We had to do this as our incoming data file was a FULL refresh of the data we had loaded, and we could not afford the downtime in replacing our whole data set.

1. When doing initial load of data creating a hash value of all KEY fields to monitor added together, and store this hash value along with the data.

2. On import of new data into the stage table, compare a hash summary of key fields to the hash value stored in the database. If the hash values differed in any way, then this was an update, and it was passed on as such.

3. We also added a final step of comparing the identity values in the incoming data file to what we had (as our incoming data file was a full refresh of the data every time); if anything was found in our current dataset which did not exist in the incoming data file, we marked the source record as disabled, so it would no longer be used.


Tuesday, February 18, 2014 - 9:30:50 AM - Cedric Back To Top

Nice, already knew how to do it, but this sure is the best manual I've ever seen that explains the magic behind it.


Tuesday, February 18, 2014 - 8:49:23 AM - Greg Robidoux Back To Top

Download has been fixed.


Tuesday, February 18, 2014 - 5:20:45 AM - Saiful Back To Top

DOWNLOAD LINK FOR THIS PACKAGE IS NOT WORKING....

