Using Hashbytes to track and store historical changes for SQL Server data



Problem

Change Data Capture (CDC) is a fundamental part of ETL, especially in a data warehousing context. One way to detect changes is with hashing algorithms, but even though SSIS is a great ETL tool, it lacks such a hashing function. The scenario presented here is rather simple, but it could also be applied to more complex ETL logic. Basically, we have a source table with no reliable timestamp to identify which data has changed since the last load (as is often the case in operational systems). The table data is replicated into a staging area, and we wish to store history in the data warehouse only for rows that have actually changed. One possible solution is to compare the staging and warehouse tables field by field, but obviously this is tedious and time consuming. We need to find an efficient and simple way to do this.

Solution

The solution presented here is to use the HashBytes SQL Server function to generate a hash code for every source record. HashBytes can use algorithms such as MD5 or SHA1. These hash functions are very sensitive to change and will generate a completely different key for a single character change.
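As a quick illustration (the values are made up), a single-character difference produces two entirely different MD5 hashes:

-- One character changed, two completely different hash values returned.
SELECT HASHBYTES('MD5', 'John Smith|123 Main St') AS hash_original,
       HASHBYTES('MD5', 'John Smyth|123 Main St') AS hash_changed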

To begin with, here is a very simple SSIS staging package:

Figure 1: A very simple SSIS staging package

The first SQL task simply truncates the staging table before inserting the complete data from the source table. The data flow task is nothing more than a one-to-one mapping from source to staging with a few technical fields added (load_id, load_date ...)

The sql_update_checksum SQL task is where it starts to get interesting. It will bulk update a "checksum" field for all the records inserted in the staging table. Here are the configuration screenshots:

Figure 2: The sql_update_checksum Execute SQL Task

This SQL task uses a direct input SQL statement, but with an input parameter.

Figure 3: The Execute SQL Task uses a direct input SQL statement with an input parameter

The parameter is a user-defined variable named "target_object_name" which contains the schema and staging table name we are feeding (for example "staging.customer").

Figure 4: Execute SQL Task Editor

Next is the SQL statement with the "?" placeholder for the parameter. I use an UPDATE statement to bulk update a "checksum" field. This is where the get_hash_fields UDF is invoked; it returns the list of field names which are then passed to HashBytes. The return value of HashBytes is a VARBINARY which I convert to a VARCHAR(34) (style 1 keeps the "0x" prefix, so 34 characters are needed to hold the 32 hex digits of an MD5 hash). Once converted to a varchar, the hash code looks like "0xEAD2DD130555A749BB128517DAA4C912". I clean up the string by removing the first 2 characters ("0x") and putting it in lowercase.
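To illustrate the cleanup on its own (the input string here is made up), the expression below returns just the 32 lowercase hex characters:

-- Style 1 produces a "0x"-prefixed hex string; SUBSTRING(..., 3, 32) drops the prefix.
SELECT SUBSTRING(LOWER(CONVERT(varchar(34), HASHBYTES('MD5', 'sample input'), 1)), 3, 32) AS checksum_value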

The problem with HashBytes is that it is not very user friendly, since it only accepts a single varchar or nvarchar value. This means that all the fields from our source records need to be concatenated into a single string, which in turn means that all non-string columns need to be converted to a varchar.
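To get a sense of what that means, here is a hand-built sketch using a few columns from the staging.customer example shown further down; every non-string column needs its own conversion, which quickly becomes unmanageable for wide tables:

-- Manual concatenation: each non-string column converted by hand before hashing.
SELECT HASHBYTES('MD5',
           ISNULL(RTRIM(CUSTOMER_NAME), '')
         + ISNULL(RTRIM(CONVERT(varchar(10), DUE_DATE, 112)), '')
         + ISNULL(RTRIM(CONVERT(varchar(10), BUSINESS_TYPE)), '')) AS row_hash
FROM staging.customer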

So to achieve this and to make our code reusable, what better way than to encapsulate this conversion logic inside a UDF?  Figure 5 shows the get_hash_fields function definition.

Figure 5:

CREATE function [dbo].[get_hash_fields] (@p_table_name VARCHAR(100),@p_schema_name VARCHAR(20))
RETURNS VARCHAR(MAX)
AS
BEGIN
    DECLARE @sqlString as varchar(max)
    SET @sqlString = ''

    SELECT @sqlString = @sqlString +
           CASE DATA_TYPE
                WHEN 'int'       THEN 'ISNULL(RTRIM(CONVERT(varchar(10),' + COLUMN_NAME + ')),'''')'
                WHEN 'datetime'  THEN 'ISNULL(RTRIM(CONVERT(varchar(10),' + COLUMN_NAME + ',112)),'''')'
                WHEN 'datetime2' THEN 'ISNULL(RTRIM(CONVERT(varchar(10),' + COLUMN_NAME + ',112)),'''')'
                WHEN 'date'      THEN 'ISNULL(RTRIM(CONVERT(varchar(10),' + COLUMN_NAME + ',112)),'''')'
                WHEN 'bit'       THEN 'ISNULL(RTRIM(CONVERT(varchar(1),' + COLUMN_NAME + ')),'''')'
                WHEN 'decimal'   THEN 'ISNULL(RTRIM(CONVERT(varchar('+ CONVERT(varchar(2),NUMERIC_PRECISION) +'),' + COLUMN_NAME + ')),'''')'
                ELSE 'ISNULL(RTRIM(' + COLUMN_NAME + '),'''')'
           END + '+'
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA = @p_schema_name
      AND TABLE_NAME = @p_table_name
      AND COLUMN_NAME NOT IN ('load_dts','load_cycle_id','period','checksum')

    RETURN LEFT(ISNULL(@sqlString, ''), LEN(@sqlString) - 1)
END

The function receives 2 arguments: @p_table_name which is the table for which we want to generate the column list, and @p_schema_name which is the associated schema for this table. As you can see, I am using a single SELECT statement to generate a concatenated string of all the fields in the table. The INFORMATION_SCHEMA.COLUMNS system view is used to retrieve all column names for this table. I am also filtering on a list of technical fields which should not be part of the hash code since they are not received from the source, but added by the ETL logic.

The CASE statement checks the data type of each column of the table (as defined in SQL Server's system views) and uses a corresponding CONVERT to turn the column into a varchar. (Of course, you could add entries to this CASE statement if you are using other data types.) So basically, the function builds a string of concatenated CONVERT expressions, which is then embedded in the dynamic SQL statement executed by the SSIS SQL task.
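As an illustration, the standalone query below (a sketch, not part of the original function) shows how two more data types, money and uniqueidentifier, could be mapped; the extra WHEN branches would simply be merged into the CASE expression of get_hash_fields:

-- Sketch: mapping additional data types to CONVERT expressions.
SELECT COLUMN_NAME,
       DATA_TYPE,
       CASE DATA_TYPE
            WHEN 'money'            THEN 'ISNULL(RTRIM(CONVERT(varchar(21),' + COLUMN_NAME + ',2)),'''')'
            WHEN 'uniqueidentifier' THEN 'ISNULL(RTRIM(CONVERT(varchar(36),' + COLUMN_NAME + ')),'''')'
            ELSE 'ISNULL(RTRIM(' + COLUMN_NAME + '),'''')'
       END AS convert_expression
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'staging'
  AND TABLE_NAME = 'customer'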

Example tables:

Suppose we have staging and data warehouse customer tables such as the ones shown below:

staging.customer

dwh.customer
Here is a function call and its output:

function call:

select dbo.get_hash_fields('customer','staging')

output string from the function:

ISNULL(RTRIM(CUSTOMER_ID),'')+ISNULL(RTRIM(CUSTOMER_NAME),'')+ISNULL(RTRIM(CUSTOMER_ADDR1),'')+ISNULL(RTRIM(CUSTOMER_ADDR2),'')+ISNULL(RTRIM(CUSTOMER_ADDR3),'')+ISNULL(RTRIM(CONVERT(varchar(10),BUSINESS_TYPE)),'')+ISNULL(RTRIM(CONVERT(varchar(9),BALANCE_DUE)),'')+ISNULL(RTRIM(CONVERT(varchar(10),DUE_DATE,112)),'')+ISNULL(RTRIM(CONVERT(varchar(10),PAY_FREQ)),'')

So in our SSIS SQL Task code from Figure 4, note the get_hash_fields function call. I hard coded @SchemaTableName as "staging.customer" here, but in the package this value is supplied by the SSIS parameter, which is why it shows as "?" in Figure 4.

DECLARE @sqlFieldString as varchar(max)
DECLARE @sqlStatementString as varchar(max)
DECLARE @TableName as varchar(100)
DECLARE @SchemaName as varchar(20)
DECLARE @SchemaTableName as varchar(121) -- wide enough for schema name + '.' + table name

SET @SchemaTableName = 'staging.customer'
SET @SchemaName = LTRIM(RTRIM(LEFT(@SchemaTableName,charindex('.',@SchemaTableName)-1)))
SET @TableName = LTRIM(RTRIM(RIGHT(@SchemaTableName,len(@SchemaTableName)-charindex('.',@SchemaTableName))))
select @sqlFieldString = [dbo].[get_hash_fields](@TableName,@SchemaName)
SET @sqlStatementString = 'UPDATE ' + @SchemaName + '.' + @TableName + ' SET [checksum] = substring(lower(convert(varchar(34), HASHBYTES(''MD5'',' + @sqlFieldString + '),1)),3,32)'
EXEC (@sqlStatementString)

If we run the above code, the final result is that we have updated the checksum field of every row in our "staging.customer" table. It is now ready to be compared to the checksum field of the "dwh.customer" table in the data warehouse, which will allow us to detect any new or changed records arriving from the source system.
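As a quick sanity check before loading, a simple query (again just a sketch) can count how many staging rows carry a checksum that is not yet present in the warehouse:

-- Rows that are either new or changed since the last load.
SELECT COUNT(*) AS new_or_changed_rows
FROM [staging].[customer] stg
WHERE NOT EXISTS (SELECT 1
                  FROM [dwh].[customer] dwh
                  WHERE dwh.[checksum] = stg.[checksum])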

Inserting New Data 

Once we have the checksum calculated for our staging table, the SQL statement to INSERT the changes into the warehouse can be as simple as this. Here I am using a LEFT JOIN to find any records whose checksum doesn't already exist in the "dwh.customer" table.

INSERT INTO [dwh].[customer]
SELECT stg.[load_dts]
   ,stg.[load_cycle_id]
   ,stg.[period]
   ,stg.[CUSTOMER_ID]
   ,stg.[CUSTOMER_NAME]
   ,stg.[CUSTOMER_ADDR1]
   ,stg.[CUSTOMER_ADDR2]
   ,stg.[CUSTOMER_ADDR3]
   ,stg.[BUSINESS_TYPE]
   ,stg.[BALANCE_DUE]
   ,stg.[DUE_DATE]
   ,stg.[PAY_FREQ]
   ,stg.[checksum]
FROM [staging].[customer] stg
LEFT JOIN [dwh].[customer] dwh on stg.[checksum] = dwh.[checksum]
WHERE dwh.[checksum] IS NULL  

Of course I am simplifying things: the SSIS package would also need some lookup components and logic to manage technical fields such as valid_from_dt, valid_to_dt and row_is_current.
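As a rough sketch of part of that extra logic (assuming the warehouse table also carries valid_to_dt and row_is_current columns, and that CUSTOMER_ID is the business key), closing out the current row when a changed version arrives could look like this:

-- Sketch only: expire the current warehouse row when the same business key
-- arrives from staging with a different checksum.
UPDATE dwh
SET    valid_to_dt    = GETDATE(),
       row_is_current = 0
FROM   [dwh].[customer] dwh
JOIN   [staging].[customer] stg
       ON  stg.[CUSTOMER_ID] = dwh.[CUSTOMER_ID]
      AND stg.[checksum] <> dwh.[checksum]
WHERE  dwh.row_is_current = 1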

What is most interesting about this design pattern is that it makes the hashing code reusable by hiding all the hashing logic inside a SQL function which dynamically builds the code. It also makes the CDC logic much simpler and uses the power of the SQL engine to bulk update the hashing codes for all rows.

Next Steps
  • In the case of more complex logic such as a type 2 dimension table being fed by multiple source tables, the pattern would be slightly different. In a nutshell, the ETL would involve the following steps:
    1. First data flow - business logic: Use a data source with a SQL command on your multiple source tables, and apply all business rules. Insert all records to a working table.
    2. SQL Task - hashing logic: Bulk update the checksum field for every record in the table, using the method described in this article.
    3. Second data flow - dimension logic: Use the working table as source and apply the normal logic you would apply to a type 2 dimension using the business key and the checksum field.

      (Note: ETL logic for a type 2 dimension is outside the scope of this article, but is very well documented in many books and on the web.)





About the author
Christian (Chris) Allaire has been developing applications on SQL Server for the last 12 years and specializes in SSIS and data integration.

This author pledges the content of this article is based on professional experience and not AI generated.




Comments For This Article




Thursday, October 14, 2021 - 8:32:04 AM - Niels Back To Top (89332)
Hi Chris,

I realise this is an old topic, however, it still suits my purposes very well, thank you for this.

I've implemented this as a combination of a stored procedure (mainly to be able to specify an algorithm, column name, etc.) and a UDF, which works like a charm so far.

I implemented most comments, including separators, and wrote a binary version besides the original function. However, it's worth noting that the binary version is much slower (about 10 times) in my specific scenario, since my datasets include many varchar columns. These don't need to be converted in the original solution but are converted in the binary version. If there's a clever way around this, or if I'm going about this all wrong, I'm all ears.

The reason I'm commenting here, is that I believe there is another conversion error (besides the decimal conversion as has been pointed out): In the case of INT, this only works for unsigned integers. If you convert to varchar(10) and you have an integer like -1234567890, you'll end up with * since it's actually 11 characters.

Cheers, Niels

Friday, September 24, 2021 - 3:26:25 PM - Phillip Putzback Back To Top (89264)
I ran into an issue when columns had spaces in the name. I updated the function to add brackets [ ] and it is working.

Wednesday, September 4, 2019 - 4:41:34 AM - Anjum Back To Top (82232)

Hey Chris,

Nice work. I have also been looking into the same logic for the past few days, using the checksum (hashbytes). But do you think using the primary key and indexing it with the hash would add more performance to this logic?

How about creating a key hash and a row hash, then checking whether the key hash is present and whether the row hash has changed, to figure out updates?


Tuesday, June 11, 2019 - 7:20:06 AM - Pete Back To Top (81409)

Hi, 

I'm just looking at this, and it's potentially very useful, thanks.

Just wanted to point out one small bug I came across in the 'get_hash_fields' function. When it builds the line for converting decimal data types, it uses the numeric precision to determine the length of the varchar target. This will result in an overflow error if the decimal value is fully populated.

Decimal (10,9) value  0.123456789  requires 11 characters. 

I changed the function to specify numeric precision + 1 to avoid the error:

WHEN 'decimal' THEN 'ISNULL(RTRIM(CONVERT(varchar('+ CONVERT(varchar(2),NUMERIC_PRECISION +1 ) +'),' + COLUMN_NAME + ')),'''')'

 

regards, Pete


Friday, September 29, 2017 - 12:51:04 PM - Derek Back To Top (66732)

I love this function and used it at my last job as part of our SSIS template. I am now at a MySQL shop and working on a function to accomplish the same thing using MySQL. Do you know if someone has already tackled this for MySQL?

 


Thursday, March 20, 2014 - 3:12:15 PM - Raymond A Beckett Back To Top (29823)

One performance enhancement... Don't convert non-character types into varchar/char fields. Convert them into binary/varbinary types matching the actual binary storage SQL Server uses.

i.e. int would convert to binary(4), money would be binary(8), varchar(10) would go into a varbinary(10), etc...

This should keep your data sizes within the 8k page boundary all records have (not counting (max) fields), unless the separator you use pushes it that last bit over. (I typically use a pipe character.) And the hashbytes routine has fewer bytes to process.

 


Wednesday, January 15, 2014 - 12:18:37 AM - Bob Back To Top (28078)

I've recently come across this great article and started using a similar approach. A column separator, as previously noted, is an essential addition. I would also review the length and format of the varchar conversions, especially for the datetime and datetime2 columns: the above code truncates the time off both, leaving only the date. Varchar(25) and style 21, for example, will give the date and time to 10 microseconds.


Sunday, July 29, 2012 - 10:30:52 AM - Chris Allaire Back To Top (18839)

Wow! Thanks for all your comments, that's what I get for going on vacation for a week.  I will address all of them here.

Concerning concatenation flaws, thanks for pointing that out.  I think simply adding a column separator character in the get_hash_fields function would fix that. 

Concerning MD5, of course there is no perfect hashing algorithm, but personally I can live with it. If you feel SHA1 is more secure, then no problem, the premise of the article is still valid. The 8K limitation is real; if you have very wide tables you should take that into account before using this technique.

Finally, the SQL Server 2012 CDC feature is very interesting but is not what I am talking about when I use the term "CDC". I have not used it yet, but I believe you need a homogeneous SQL 2012 environment to implement it.

Thanks again,

Chris 


Wednesday, July 25, 2012 - 11:48:29 AM - Robert McCauley Back To Top (18800)

Great article and introduction to hashing solutions to ETL problems!

However, what you're describing as Change Data Capture, I would call Change Tracking.  Change Data Capture (at least the SQL Server feature) involves storing the complete history of changes to column values.

The SQL Server "Change Tracking" feature is an alternate approach to detecting changes for ETL extracts. It's a lightweight but quite functional feature that is built into the database and comes free with SQL Server Standard Edition. You may find it performs better than trying to join very large staging and core tables to handle inserts.

http://msdn.microsoft.com/en-us/library/bb933875.aspx


Thursday, July 19, 2012 - 9:02:01 PM - Boris Tyukin Back To Top (18675)

forgot to mention one more thing - the HASHBYTES input string is currently limited to 8k characters, so for very wide tables you would have to use some workarounds - see here https://connect.microsoft.com/SQLServer/feedback/details/273429/hashbytes-function-should-support-large-data-types

 


Thursday, July 19, 2012 - 8:58:54 PM - Boris Tyukin Back To Top (18674)

Great post, Christian! Based on your experience, will you do something about possible hash collisions? MD5 should be good enough, but I agree with a previous comment that this is still a risky solution.

 


Thursday, July 19, 2012 - 11:33:49 AM - HashingDetectionFlaws Back To Top (18655)

A few flaws:

1) simple concatenation fails to detect many basic changes, for instance, simple concatenation of the following two names when firstname is the field just before lastname results in identical strings, despite clearly different data.

Firstname: alli

Lastname: ewheren

Firstname: allie

Lastname: wheren

This is even worse with numeric fields, where the following are identical strings:

Amount1: 123

Amount2: 456

Amount1: 1

Amount2: 23456

 

2) Always consider how you're going to find instances of hash collisions, which do occur.  Just because the chances of any two out of only two source lines having the same hash is low doesn't mean the chance of any two out of millions of rows is just as low - see the Birthday Paradox.


Thursday, July 19, 2012 - 10:06:58 AM - Christian Allaire Back To Top (18652)

Thanks Jonathan for your comment. Of course it is more convenient to use the Multiple Hash add-in. However, it is not always possible to do so, depending on the environment you are working in. Some companies will not allow the use of third-party components. The point of the article is to present an alternative way of tracking changes.


Thursday, July 19, 2012 - 7:33:17 AM - Ahmad Back To Top (18647)

Kindly, could you share the sample package you built with us?
Furthermore, I need to write/update dynamic SQL as you have done in this sample in a SQL Script Task, then execute it and use the results in a further data flow. Is it possible in SSIS to get output from a SQL Script Task? I am not familiar with front-end programming languages, so I am not confident building strings in any FE language other than SQL scripts :)

Thank you!


Thursday, July 19, 2012 - 12:21:28 AM - Jonathan Kehayias Back To Top (18638)

The SSIS Multiple Hash add-in, which is free on Codeplex (http://ssismhash.codeplex.com/), would allow the hash to be calculated on the fly in the SSIS package using multiple threads, so you don't have the extra steps, and in my experience it is much faster at computing this kind of hash. The output can also be indexed in binary form, allowing for very fast lookups against the hash to find deltas.














