Using hash values in SSIS to determine when to insert or update rows

By: Koen Verbeeck   |   Related: More > Integration Services Development


Problem

When you want to load data into your data warehouse, for example into one of your dimensions, you need to determine if an incoming row is either an update - a historical change - or an insert, which is a new occurrence of a record. An update row is a source record for which the business key - or the natural key, as some folks prefer - is already present in the destination table. To process these updates as efficiently as possible, you need to determine if the update is an actual update, meaning that at least one of the columns has changed value. If none of the columns have actually changed, you can skip the update of that record and save some precious resources.

A common method to detect if a row has changed is to compare hashes: if the hash of the incoming record is different from the hash found in the destination table, one or more columns have changed. Christian Allaire describes the concept using the T-SQL Hashbytes function in the tip Using Hashbytes to track and store historical changes for SQL Server data. Integration Services itself doesn't have a hash implementation, so in this tip I'll show how you can implement a hash detection mechanism using a script component, as an alternative to the pure T-SQL solution.
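
To make the idea concrete, here is a minimal sketch of that comparison in T-SQL; the table and column names are purely illustrative and not part of the solution built later in this tip:

-- update only the rows whose hash actually changed
UPDATE dest
SET    dest.SomeColumn = src.SomeColumn
      ,dest.RowHash    = src.RowHash
FROM   dbo.DimSomething   AS dest
JOIN   stg.SomethingInput AS src ON src.BusinessKey = dest.BusinessKey
WHERE  src.RowHash <> dest.RowHash; -- rows with an identical hash are skipped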

Solution

The concept of the incremental load pattern - where you detect the inserts and the updates - is described clearly in this article by Andy Leonard: SSIS Design Pattern - Incremental Loads. We will closely follow those recommendations in this tip, but I'll make some performance improvements and add the hashing of course. The explanation in this tip is valid for all versions of SSIS starting from SQL Server 2012 (and possibly 2005/2008 as well).

Demonstrate Data Load Options in SSIS

Let's keep it simple and use the AdventureWorksDW2017 database as the source for our data, which you can download here. If you don't have SQL Server 2017, you can use any earlier version of the AdventureWorksDW database as well. The following query will extract the data:

SELECT
  [CustomerAlternateKey] -- this is our business key (BK)
 ,[Title]
 ,[FirstName]
 ,[MiddleName]
 ,[LastName]
 ,[BirthDate]
 ,[MaritalStatus]
 ,[Suffix]
 ,[Gender]
 ,[EmailAddress]
 ,[YearlyIncome]
 ,[TotalChildren]
 ,[NumberChildrenAtHome]
 ,[EnglishEducation]
 ,[EnglishOccupation]
 ,[HouseOwnerFlag]
 ,[NumberCarsOwned]
FROM [dbo].[DimCustomer];

All of the data is selected from the source. The SSIS package will detect which rows are inserts and which rows are updates.

The destination table can be created with the following script:

IF NOT EXISTS (SELECT 1 FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[MyDestination]') 
               AND [type] IN (N'U'))
BEGIN
 CREATE TABLE [dbo].[MyDestination](
  [SK_Customers] [int] IDENTITY(1,1) NOT NULL,
  [CustomerAlternateKey] [nvarchar](15) NOT NULL,
  [Title] [nvarchar](8) NULL,
  [FirstName] [nvarchar](50) NULL,
  [MiddleName] [nvarchar](50) NULL,
  [LastName] [nvarchar](50) NULL,
  [BirthDate] [date] NULL,
  [MaritalStatus] [nchar](1) NULL,
  [Suffix] [nvarchar](10) NULL,
  [Gender] [nvarchar](1) NULL,
  [EmailAddress] [nvarchar](50) NULL,
  [YearlyIncome] [money] NULL,
  [TotalChildren] [tinyint] NULL,
  [NumberChildrenAtHome] [tinyint] NULL,
  [EnglishEducation] [nvarchar](40) NULL,
  [EnglishOccupation] [nvarchar](100) NULL,
  [HouseOwnerFlag] [nchar](1) NULL,
  [NumberCarsOwned] [tinyint] NULL,
  [DWH_Hash] [char](66) NULL,
  CONSTRAINT [PK_MyDestination] PRIMARY KEY CLUSTERED
   ([SK_Customers] ASC)
 );
END

The incremental load package takes the following form:

Control Flow Incremental Load

The original incremental load pattern uses an OLE DB Command in the data flow to perform the updates. I dislike this component because it issues a statement for every single row that passes through the component, which is horribly slow for large data sets and it can also bloat the transaction log. Therefore I create a staging table that will hold the update rows so that a set-based UPDATE statement can be issued in the third task. This improves performance immensely. The update table is created with this statement:

IF EXISTS (SELECT 1 FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[UPD_MyDestination]') 
           AND [type] IN (N'U'))
BEGIN
   DROP TABLE [dbo].[UPD_MyDestination];
END

CREATE TABLE [dbo].[UPD_MyDestination](
  [SK_Customers] [int] NOT NULL,
  [Title] [nvarchar](8) NULL,
  [FirstName] [nvarchar](50) NULL,
  [MiddleName] [nvarchar](50) NULL,
  [LastName] [nvarchar](50) NULL,
  [BirthDate] [date] NULL,
  [MaritalStatus] [nchar](1) NULL,
  [Suffix] [nvarchar](10) NULL,
  [Gender] [nvarchar](1) NULL,
  [EmailAddress] [nvarchar](50) NULL,
  [YearlyIncome] [money] NULL,
  [TotalChildren] [tinyint] NULL,
  [NumberChildrenAtHome] [tinyint] NULL,
  [EnglishEducation] [nvarchar](40) NULL,
  [EnglishOccupation] [nvarchar](100) NULL,
  [HouseOwnerFlag] [nchar](1) NULL,
  [NumberCarsOwned] [tinyint] NULL,
  [DWH_Hash] CHAR(66) NOT NULL
 );

Next, the data flow loads the data from the source and uses a Lookup component to find out whether a row is an insert or an update. The Lookup uses the following query to fetch the reference data set from the destination table:

SELECT SK_Customers,CustomerAlternateKey,DWH_Hash
FROM MyDestination;

Hint: always write a SQL query to select the lookup columns from the reference data set. Never use the drop down; it selects all columns and wastes memory. If you select only the columns you actually need, chances are good that you can fit everything in memory with the full cache option, even for very large reference sets.

The Lookup matches the source rows with the destination rows using the business key. If a match is found - an update - the surrogate key and the hash are retrieved. Make sure to configure the Lookup to redirect non-matching rows to the no match output.

Data Flow Incremental Load

The third task updates the destination table using the rows stored in the staging table.

UPDATE dest
SET  [Title]                = upd.[Title]
    ,[FirstName]            = upd.[FirstName]
    ,[MiddleName]           = upd.[MiddleName]
    ,[LastName]             = upd.[LastName]
    ,[BirthDate]            = upd.[BirthDate]
    ,[MaritalStatus]        = upd.[MaritalStatus]
    ,[Suffix]               = upd.[Suffix]
    ,[Gender]               = upd.[Gender]
    ,[EmailAddress]         = upd.[EmailAddress]
    ,[YearlyIncome]         = upd.[YearlyIncome]
    ,[TotalChildren]        = upd.[TotalChildren]
    ,[NumberChildrenAtHome] = upd.[NumberChildrenAtHome]
    ,[EnglishEducation]     = upd.[EnglishEducation]
    ,[EnglishOccupation]    = upd.[EnglishOccupation]
    ,[HouseOwnerFlag]       = upd.[HouseOwnerFlag]
    ,[NumberCarsOwned]      = upd.[NumberCarsOwned]
    ,[DWH_Hash]             = upd.[DWH_Hash]
FROM [dbo].[MyDestination]  dest
JOIN [dbo].[UPD_MyDestination] upd ON dest.SK_Customers = upd.SK_Customers;

The final task will drop the staging table, which is optional of course. If you drop it though, don't forget to set the package to DelayValidation=TRUE.
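
The statement in that final Execute SQL Task isn't shown here, but it can simply mirror the guard used when the staging table is created; something along these lines:

IF EXISTS (SELECT 1 FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[UPD_MyDestination]')
           AND [type] IN (N'U'))
BEGIN
   DROP TABLE [dbo].[UPD_MyDestination];
END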

When this package is run for the first time, all of the source rows are new and are directly inserted into the destination.

Initial Load Without Hash

When the package is run again, all rows are matched against the destination table and sent to the staging table. This means that all of the rows are updated, even though this wasn't necessary at all.

Reload Without Hash

Now let's add a hash to detect if there were actual changes or not.

Adding the Hash Value to Detect Data Changes in SSIS

In this tip, we'll use the script component to calculate the hash in .NET. However, you can also use the T-SQL HASHBYTES function if your source is SQL Server; most database vendors have similar functions. The script component is a good alternative when your source doesn't offer this functionality, for example when you are loading flat files or an Excel file, or when the source database doesn't support the hashing algorithm properly (Sybase ASE, for instance, has a hash function, but it doesn't include the SHA-2 algorithms).
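
As an illustration of that T-SQL alternative, here is a sketch of what the source query could look like with the hash calculated in SQL Server (the column list is shortened; non-string columns need an explicit CONVERT, and keep in mind that before SQL Server 2016 the input of HASHBYTES is limited to 8,000 bytes):

SELECT
   [CustomerAlternateKey]
  ,[Title]
  -- ... the other columns ...
  ,CONVERT(CHAR(66),
     HASHBYTES('SHA2_256',
         ISNULL([Title], N'') + N'|'
       + ISNULL([FirstName], N'') + N'|'
       + ISNULL([LastName], N'') + N'|'
       + ISNULL(CONVERT(NVARCHAR(10), [BirthDate], 126), N'')
         -- add the remaining columns here, each separated by a pipe
     ), 1) AS [DWH_Hash]
FROM [dbo].[DimCustomer];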

From the SSIS Toolbox, add a script component to the data flow canvas. A pop-up will ask you which type of script you want: a source, a transformation or a destination. Choose the transformation. Open up the editor. In this tip, I will use C# code. If you're more familiar with VB.NET, the code can easily be adapted. Go to the Input Columns pane and add every column, except the business key. It won't change since it is the business key after all, so it is not necessary to add it to the hash. All of the input columns can be left as ReadOnly.

Configure Input Columns

Now go to the Inputs and Outputs pane. There we will add the Hash column to the output. Give it the string data type with a length of 66. I will explain later why we are using this length.

Configure Output Column

Go back to the Script pane and open up the script editor. Since we are going to use a hash function, which is a cryptographic function, the System.Security.Cryptography namespace needs to be added to the script with a using directive.

Add Cryptography Namespace

We also add System.Text so we can use the StringBuilder class and the System.Reflection namespace for looping over the input columns.

The script takes the following steps:

  • Concatenate all input columns into one long string. If a column contains a NULL value, convert it to an empty string.
  • Compute the hash over this concatenated string. This will return an array of bytes.
  • This array is converted back to a string using a for loop and is finally added to the output. The characters '0x' are prefixed to the result to resemble the output of the HASHBYTES function.
HASHBYTES output in SSMS

Instead of hard-coding all of the input columns in the concatenation, we loop over the input buffer to retrieve all the columns. The final code looks like this:

private SHA256 sha = new SHA256CryptoServiceProvider(); // used to create the SHA256 hash

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    StringBuilder concat = new StringBuilder("");

    // loop over all the input columns of the buffer row
    foreach (PropertyInfo inputColumn in Row.GetType().GetProperties())
    {
        // ignore the _IsNull properties (used to check a column for NULL values) and the output column Hash
        if (!inputColumn.Name.EndsWith("IsNull") && !inputColumn.Name.Equals("Hash"))
        {
            // check for NULL values, otherwise the script fails
            if (inputColumn.GetValue(Row, null) == null)
            {
                concat.Append(""); // add an empty string
            }
            else
            {
                // add the value of the input column to the concat string
                concat.Append(inputColumn.GetValue(Row, null).ToString());
            }
            concat.Append("|"); // add pipe as delimiter
        }
    }

    // convert to string and remove the last pipe
    string hashSource = concat.ToString().Remove(concat.ToString().Length - 1);

    // calculate the hash over the concatenated values
    byte[] hashBytes = sha.ComputeHash(System.Text.UnicodeEncoding.Unicode.GetBytes(hashSource));

    // convert the byte array to a hexadecimal string
    StringBuilder sb = new StringBuilder("");
    for (int i = 0; i < hashBytes.Length; i++)
    {
        sb.Append(hashBytes[i].ToString("X2"));
    }

    Row.Hash = "0x" + sb.ToString(); // add the result to the output
}

The script uses the SHA256 hash function to compute the hash. This hash function outputs a 256-bit hash, is much safer than the commonly used MD5 hash, and is less prone to hash collisions. The SHA256CryptoServiceProvider class is used to implement the hash, but you can also use the SHA256Managed class. There is not much difference between the two, except the implementation behind the scenes. The CryptoServiceProvider class might be preferred over the managed class for regulatory reasons.

The 256-bit length of the hash ensures that the chance of hash collisions is minimal. 256 bits is 32 bytes, which is represented as 64 hexadecimal characters. Add the two extra characters for the '0x' prefix and you get the 66-character length used in the destination table and in the script component.
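
You can see this format for yourself in SSMS; the following query (any string input will do) returns a 66-character string, the '0x' prefix followed by 64 hexadecimal characters:

SELECT CONVERT(CHAR(66), HASHBYTES('SHA2_256', N'Mr.|John||Doe'), 1) AS HashString;
-- CONVERT with style 1 keeps the '0x' prefix; style 2 would drop it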

Note: if you use SQL Server 2012 or a more recent version as the source, you can also use the CONCAT function to concatenate the string values. However, the script component can easily be used against earlier versions of SQL Server as well.
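
As a sketch of how CONCAT shortens the concatenation: it implicitly converts its arguments to strings and treats NULL as an empty string, so the ISNULL and CONVERT calls can be dropped (column list shortened again):

SELECT CONVERT(CHAR(66),
         HASHBYTES('SHA2_256',
           CONCAT([Title], N'|', [FirstName], N'|', [LastName], N'|', [BirthDate])),
         1) AS [DWH_Hash]
FROM [dbo].[DimCustomer];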

Add a Conditional Split transformation between the Lookup and the OLE DB Destination on the match output path. The expression is simple but effective:

Conditional Split Expression
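
In case the screenshot is hard to read: the expression simply compares the hash retrieved by the Lookup with the newly calculated one, something along the lines of DWH_Hash != Hash (the exact names depend on how you named the lookup column and the script output column). Rows for which the expression is true are sent to the staging table; all other rows are unchanged and go to the unused default output.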

The final data flow has the following layout:

The Final Data Flow

If the destination table is truncated and the initial load is executed again, you can find the hash stored alongside the data in the destination table.

Hashes in the destination table

Now, here is the moment that sums up this entire tip: if the package is run again, not a single row makes it through the conditional split to the staging table.

Updates do not make it through the conditional split

Let's change some rows in the source with the following script:

UPDATE [dbo].[DimCustomer]
SET Title = 'Test123'
WHERE CustomerKey % 1000 = 0;
-- this updates 19 rows

If the package is run again, only the changed rows are written to the staging table and subsequently updated. Mission accomplished.

Only changed rows make it through the conditional split

Conclusion

This tip demonstrated how you can efficiently manage the number of updates that come through your SSIS package with a hash. Since the hash is calculated in a script component, the solution is independent of any possible source.

Next Steps



About the author
Koen Verbeeck is a seasoned business intelligence consultant at AE. He has over a decade of experience with the Microsoft Data Platform in numerous industries. He holds several certifications and is a prolific writer contributing content about SSIS, ADF, SSAS, SSRS, MDS, Power BI, Snowflake and Azure services. He has spoken at PASS, SQLBits, dataMinds Connect and delivers webinars on MSSQLTips.com. Koen has been awarded the Microsoft MVP data platform award for many years.

This author pledges the content of this article is based on professional experience and not AI generated.




Comments For This Article




Thursday, February 24, 2022 - 1:36:32 AM - Koen Verbeeck Back To Top (89828)
Hi Luther,

thanks for your suggestions!

Regards,
Koen

Wednesday, February 23, 2022 - 8:55:59 AM - Luther Atkinson Back To Top (89826)
I ran into a few issues adapting this to Visual Studio 2019, probably due to .NET changes or my using the sha512 instead of 256, but I also found what I think is a small shortcut. You can, when defining the inputs only select those columns you want to include in the calculation, so you can get rid of the check for column name. Also, and this might break in the future, but the ISNULL variable in the dataset seems to always be the next element, so I stepped my array by 2 and this is what I came up with.

StringBuilder concat = new StringBuilder("");

PropertyInfo[] myColumns = Row.GetType().GetProperties();
for (int ii = 0; ii < myColumns.Length; ii = ii + 2)
{
    if (!myColumns[ii].Name.Equals("HashVal"))
    {
        if ((bool)myColumns[ii + 1].GetValue(Row))
        {
            concat.Append("");
        }
        else
        {
            concat.Append(Convert.ToString(myColumns[ii].GetValue(Row).ToString()));
        }
    }
    concat.Append("|");
}
string hashSource = concat.ToString().Remove(concat.ToString().Length - 1);
SHA512 sha512 = SHA512.Create();
byte[] hashBytes = sha512.ComputeHash(Encoding.UTF8.GetBytes(hashSource));

StringBuilder sb = new StringBuilder("");
for (int i = 0; i < hashBytes.Length; i++)
{
    sb.Append(hashBytes[i].ToString("X2"));
}

string hushVal = "0x" + sb.ToString();

Row.HashVal = hushVal;


Friday, January 22, 2021 - 3:21:59 AM - Koen Verbeeck Back To Top (88081)
Hi Bullock,

yes, that's an old sore in SSIS. They deprecated OLE DB, but forgot some of their components only work with OLE DB. A work around is to use a cache connection manager. Load the cache using ADO.NET and then use the cache in the lookup.

Cheers,
Koen

Thursday, January 21, 2021 - 1:00:34 PM - Bullock Back To Top (88080)
This is a great help but I need to update tables in Azure with an ADO.net connection. My understanding is that SSIS lookup only works with OLE DB. Now I am back to Truncate and replace. Any ideas?

Monday, October 12, 2020 - 1:43:49 PM - Koen Verbeeck Back To Top (86631)
Hi Alexis,
you can try calculating the hash using the T-SQL hashbytes function and compare it with the generated hash from the script.
Also make sure you're using a delimiter between the column values.

ps: I hope you're not posting actual data from your production system?

Thursday, October 1, 2020 - 4:04:53 PM - Alexis Back To Top (86583)
Hi Koen,

I have noticed that some row wasn't add, I think that the issue is the Hash-66. Below I added the rows from .csv data (columns).

Last Name,First Name,ID Number,Pay Class,Worked Location,Worked Department,Hours,Pay Designation Name,Rate,Dollars,Date,Home Department
Loren,Daniel,102190,WKLY ADMIN FT,TX,510690,1.75,Regular ,$23.63 ,$41.35 ,09/06/2020,510660

Hash: 0x37799F49CA97E4E7020855FBE38AE7FBAE0943F52C13D21A40590545DDECC720

Can you suggest what should I do to fix the query or another solution?

Thank you!

Alex Alonso

Wednesday, September 23, 2020 - 10:15:50 AM - Alexis Back To Top (86524)
Thank you Koen for your help.

Now is working the Hash and Lookup.
Should be great if you include an option to attach pictures to support questions and solutions.

Best Regards,
Alex

Tuesday, September 22, 2020 - 11:09:48 PM - Koen Verbeeck Back To Top (86520)
Hi Alexis,
in a lookup component you have two inputs. One is the source, which you map in the data flow itself by connecting either a source or a previous transformation to the lookup component with an arrow. The other one is the reference data set, which you configure in the lookup itself. You need to tell the lookup component how the source data needs to be matched against the reference data. In the Columns pane, you drag a column from the available input fields to the corresponding available lookup fields. Like in this screenshot:

https://www.mssqltips.com/tutorialimages/9067_using-transformations-in-the-data-flow.007.png

Regards,
Koen

Tuesday, September 22, 2020 - 4:41:07 PM - Alexis Back To Top (86517)
Hi,

I like your SSIS tutorial, however, I have tried to replicate it and I had errors:
(LKP): the lookup transform must contain at least one input column joined....I have tried to many options but does not works
One or more component failed validation

Friday, March 6, 2020 - 4:49:07 PM - Shon Thompson Back To Top (84979)

Thanks so much for this example.  Here's what I did to resolve the issue with nulls.

PropertyInfo[] RowProperties = Row.GetType().GetProperties();
int arraylimit = RowProperties.GetUpperBound(0);

//foreach (PropertyInfo inputColumn in Row.GetType().GetProperties())
for (int PropertyIndex = 0; PropertyIndex <= arraylimit; PropertyIndex++)
{
    // ignore the _IsNull properties (used to check a column for NULL values) and the output column Hash
    if (!RowProperties[PropertyIndex].Name.EndsWith("IsNull") && !RowProperties[PropertyIndex].Name.Equals("Hash"))
    {
        // check for NULL values, otherwise the script fails
        //(Row.<#=ColumnName#>_IsNull ? "" : Row.<#=ColumnName#>.ToString())

        if (Convert.ToBoolean(RowProperties[PropertyIndex + 1].GetValue(Row)))    //Row.<#=ColumnName#>_IsNull is always the next property in the array
        {
            concat.Append(""); // add an empty string
        }
        else
        {
            concat.Append(Convert.ToString(RowProperties[PropertyIndex].GetValue(Row).ToString()));
        }
        concat.Append("|"); // add pipe as delimiter
    }
}


Wednesday, January 29, 2020 - 6:38:33 AM - Koen Verbeeck Back To Top (84031)

Hi Lawrence,

someone has created the first step to a more generic script at StackOverflow: https://stackoverflow.com/questions/59192552/ssis-script-component-iterate-through-columns-with-null-values

You could add more case statements to check for additional data types.
Regards,
Koen


Tuesday, January 28, 2020 - 11:05:41 AM - Lawrence Back To Top (84015)

Hi Koen

I've applied a TRY-CATCH block as a workaround but it really hits peformance, runs at third of the speed.

Any ideas on how the _IsNull fields could be used in the regular script above?  I might have to live with the slower performance as I want a generic script and BIML would be a lot more effort in my scenario.

Thanks, Lawrence


Tuesday, January 28, 2020 - 3:11:16 AM - Koen Verbeeck Back To Top (84006)

Hi Lawrence,

you're right. In a later version of the code I use in a Biml script I've replaced it with the following:

(Row.<#=ColumnName#>_IsNull ? "" : Row.<#=ColumnName#>.ToString())

Every column has an accompanying _IsNull property/column which indicates if the column is NULL or not.

Regards,
Koen


Monday, January 27, 2020 - 11:33:15 AM - Lawrence Back To Top (83999)

I receive script errors when an integer or date on the row is NULL.  It appears the (inputColumn.GetValue(Row, null) == null) check is not picking them up?

I've seen in other solutions that these data types required special handling?

Microsoft.SqlServer.Dts.Pipeline.ColumnIsNullException

  HResult=0x80131600

  Message=The column has a null value.


Friday, January 10, 2020 - 2:41:13 AM - Koen Verbeeck Back To Top (83665)

Hi Niel,

if you have to do this for many other tables, I would first make this script component into a DDL or maybe even a custom transformation so it's easy to re-use. Then you can generate 50+ tables using BIML (you can find a tutorial on Biml on this website). I personally created a Biml script that implements the script from this tip, but it's quite hard as it's C# (from the script) inside C# (from Biml) inside XML (also Biml).

Regards,
Koen


Thursday, January 9, 2020 - 3:54:55 PM - Niel Back To Top (83663)

This works perfectly, Thanks! How would you recommend repeating this process for many (50+) other tables? Do I need to create a Control Flow for each table?

Thanks Niel


Wednesday, October 16, 2019 - 9:31:13 AM - Koen Verbeeck Back To Top (82800)

Hi Tobias,

I would handle deletes separately and not in this flow. If you would like to do it in one flow, the Lookup would not be sufficient. You would need to use a Merge Join, which needs sorted input (a big no-no for large datasets). What I typically do is run an Execute SQL Task after the data flow where I run a DELETE WHERE NOT EXISTS statement. This statement checks the source table against the target table to detect any delete and then remove the rows from the target. If you want soft deletes, just change the DELETE into an UPDATE.

Regards,
Koen


Wednesday, October 16, 2019 - 7:38:08 AM - Tobias Seckinger Back To Top (82796)

Thank you for this article. What would your solution for deleting rows? 

I know, sometimes is there a special column (e. g. deleted) that indicate the deletion, but sometimes the row is really deleted in the source table.

Perhaps you could extend the example?


Wednesday, April 10, 2019 - 9:26:04 AM - Koen Verbeeck Back To Top (79517)

Hi MB,

SSIS streams the data, so as long as you don't have any blocking components in the data flow, it should work.

It might take a long time, but it should work :)


Tuesday, April 9, 2019 - 7:30:33 PM - MB Back To Top (79511)

 select * from source table. My source table has 19 billion rows in it. 19,000,000,000 rows. That's not going to work.


Friday, February 22, 2019 - 1:32:52 PM - Koen Verbeeck Back To Top (79097)

Hi Fred,

well, that's the thing: you cannot decrypt a hash (https://en.wikipedia.org/wiki/Hash_function#Non-invertible). Hashes are typically used to quickly compare large sets of values, since the resulting hash is way smaller than the original set.

Regards,
Koen


Friday, February 22, 2019 - 11:45:26 AM - Fred Back To Top (79094)

Thank you for your article. I used your script code to hash a field. Everything is working fine but now I need to decrypt that field. Could you please share the code if you have it.


Friday, August 24, 2018 - 1:52:33 AM - Koen Verbeeck Back To Top (77298)

Also, in the next steps section you are able to download the package.

Regards,
Koen


Friday, August 24, 2018 - 1:51:49 AM - Koen Verbeeck Back To Top (77297)

Hi Kalyan,

the complete script can be found in the tip. Right under "The final code looks like this:".

Regards,
Koen


Thursday, August 23, 2018 - 11:07:44 AM - kalyan Back To Top (77290)

 Hi Koen,

Is there any way of getting this complete script it'll be very helpful to me if you have send it to me please...

Thanks


Friday, June 1, 2018 - 12:01:05 PM - JR Back To Top (76073)

 Hi,

Thanks for this, really helpful. However, I have one question , when considering below code:

  ,[DWH_Hash] = upd.[Hash]

FROM [dbo].[MyDestination]  dest
JOIN [dbo].[UPD_MyDestination] upd

From the above code, how you are getting "upd.[Hash]" in your table? Based on what column you are generating "Hash" in the "upd" table?
Is it just the column name with no value in it? and later you are putting the value( once the package run for the 1st time) ?


Monday, February 19, 2018 - 2:53:38 AM - Koen Verbeeck Back To Top (75241)

Hi Devin,

the SSIS script component has the nasty habit of removing underscores from column names. You can either remove the underscores from the column names in the code or you can avoid the script component by using the T-SQL HASHBYTES if that's an option.

Koen


Friday, February 16, 2018 - 1:25:42 PM - Devin Back To Top (75228)

 Koen,

 

How would handle columns that have an underscore in their name?? The C# script will not build.

 

Thank you

 


Monday, October 23, 2017 - 6:05:48 AM - Koen Verbeeck Back To Top (68690)

 

Hi Ken,

I recently moved my ETL solution to the T-SQL hashbytes as well. The solution presented in this tip works fine when the data isn't coming from SQL Server.
It could be improved by removing the hardcoded columns and instead looping over the metadata columns and constructing it on the fly. (I believe this is discussed in the comments as well.) This would make the solution more robust and more flexible than the T-SQL solution.

Regards,
Koen


Thursday, October 19, 2017 - 6:18:28 PM - Ken W Back To Top (68554)

Nice writeup Koen. I have a process that uses hashbytes to do this same sort of work in a t-sql solution.  We still end up moving data around in SSIS anyway so I might attempt a version more like what you have laid out here. We have to evaluate which columns we want to spark changes because it is a type 2 dimension update.

 


Friday, August 4, 2017 - 7:18:04 PM - Jibs Ajia Back To Top (63983)

 

Thanks Koen. Worked like a charm, pretty straightforward...


Tuesday, July 12, 2016 - 8:44:38 AM - Koen Verbeeck Back To Top (41870)

Hi Mac,

you'd have to store the hashes somewhere, so a table in SQL Server could be an option.
You'd have to do a lookup to retrieve the hash to compare it with the one coming from the script task.

Another option might be to calculate the hashes directly in the SQL statements.
I think Oracle has an equivalent function to hashbytes(). Downside is you have to calculate the hashes for both the incoming rows and all of the stored rows.


Monday, July 11, 2016 - 3:26:59 PM - Mac Hanna Back To Top (41863)

What if we want to create that package using comparison between two tables in Oracle Databases and not SQl Server Databases. And if that will be for Oracle databases, can we create the third table in SQL Server Database? Decause the access is limited in both Oracle DBs. 


Thursday, November 19, 2015 - 3:41:25 PM - Koen Verbeeck Back To Top (39106)

There have been a number of comments of people with a good eye who noticed the hash value in the table wasn't being updated.
The update statement at the start of the tip has been updated so that the hash is now updated as well. 


Tuesday, November 17, 2015 - 9:19:01 AM - Koen Verbeeck Back To Top (39080)

Hi Ha Le Thien,

which columns exactly give NULL errors?

Also, don't forget to update the hash value in the UPDATE statement. This is an error in the article and has been pointed out in the comments.


Sunday, November 15, 2015 - 4:39:06 AM - Ha Le Thien Back To Top (39072)

Dear Koen ,

I did exactly your instruction. However , there is some errors happen with staging table. The second time in lookup action to check match outputs to load it into staging. It is not work cause some errors as it is not allow NULL value to insert in staging table . Could u please check it again ?

 


Tuesday, August 11, 2015 - 5:32:22 AM - Koen Verbeeck Back To Top (38419)

@Bob Frasca: (if you are still following this thread, almost one year later): I finally have been able to test your claim that you can still use the T-SQL Hashbytes function, but only if you convert the hashes to strings in the conditional split.

It does work, so thanks for that, but you have to store the hashes as binary values (and not as strings) in the destination table in order for it to work.

The solution presented in this tip is still valid for any source that is not SQL Server.
Say for example you read data from a large flat file and you don't want to use a staging table, then you can still use the script component.

Cheers,
Koen 


Tuesday, August 4, 2015 - 2:16:49 AM - Koen Verbeeck Back To Top (38355)

Hi Rodney,

the conditional split is case sensitive and white space sensitive.
Maybe you can put TRIM around the expressions to make sure there are no extra spaces. 


Monday, August 3, 2015 - 4:13:53 PM - Rodney Rau Back To Top (38353)

Koen,

It thinks they are all different, even though the bulk of them (95%) sure look like they are the same.

 


Monday, August 3, 2015 - 3:49:16 AM - Koen Verbeeck Back To Top (38348)

Rodney,

does the conditional split think all the hashes are the same, or that they are all different?


Friday, July 31, 2015 - 10:25:56 AM - Rodney Rau Back To Top (38336)

Hi Koen,

I receive no errors from the Conditional split, nor elsewhere.  I get the same results for every record in the batch whether the hashes differ or appear to be the same.  In all cases my cspl condition is met and no rows are directed to the default output. 

Thanks very much for your quick response!


Friday, July 31, 2015 - 2:56:00 AM - Koen Verbeeck Back To Top (38330)

Hi Rodney,

I don't have VS 2008 to test it myself.
Does the conditional split give any errors?
Or does it just doesn't compare the hashes correctly? 


Thursday, July 30, 2015 - 5:11:46 PM - Rodney Rau Back To Top (38329)

Thanks Koen,

Your tip worked great in my ssis/vs2013 packages.  In my ssis/vs2008 package I could not get the Conditional Split to recognize what appears to be like hashstrings.  Here's my condition expression from my 2008 cspl - (DT_STR,66,1252) [lkp_DWH_Hash] != (DT_STR,66,1252) [Hash].  I did have to use the SHA1 hash function rather that the SHA256 to get my 2008 c# to compile.  All the rest was same.  When I check the data viewers the Hash and lkp_DWH_Hash look identical.  Much appreciate any thoughts you might have on my issue.

Thanks again.


Wednesday, March 25, 2015 - 10:00:43 AM - Tom Oakes Back To Top (36684)

This is an outstanding article and helped us a lot.  The OLE DB Command approach had me really bummed out.  

 

Thanks.

 


Saturday, March 14, 2015 - 12:39:52 AM - sandeep Pandey Back To Top (36531)

Ohh, Yes sorry I didn't see. Thanks Koen once again great contribution.


Friday, March 13, 2015 - 9:55:01 AM - Koen Verbeeck Back To Top (36518)

@Sundeep: yes, the hash needs to be updated as well. Yan Ouyang already found this issue a few months back (see earlier comments).


Friday, March 13, 2015 - 8:03:18 AM - sandeep Pandey Back To Top (36516)

 Hi Koen,

Very nice and great to start with. I would like to add something later on you can cross check. This will update every time when package runs.

Because this update task need to update DWH_Hash at every update statement to get new Hash of change.

Please cross check from your side also.


Tuesday, February 24, 2015 - 7:47:11 AM - Koen Verbeeck Back To Top (36330)

@Andrew:

yes, I mention at the start of the article that this tip is an alternative to the pure T-SQL solution.
I also mention that since SQL Server 2012 you can use the CONCAT function in SQL Server which makes things even easier in TSQL.

However, the reason I wrote the script component at the time was because I couldn't get the conditional split to work with the binary output of HASHBYTES. Therefore I used the script instead, because it outputs a string. Now, Bob mentioned I should try to do a cast in the conditional split, but I still haven't tested this. If the cast works, TSQL is definitely easier than the script.

If you don't have a relational source though (such as a flat file), the script is your only option. Or you have to dump the data into a staging table and then do your comparison.

Cheers,
Koen 


Monday, February 23, 2015 - 5:55:59 PM - Andrew Jones Back To Top (36326)

Couldn't you generate the hash in the original select statement from the source without the need for the script task?


Thursday, December 4, 2014 - 9:47:52 PM - Yan Ouyang Back To Top (35517)

Thanks for your prompt reply, Koen. Actually I did update the DW_hash as you said. One thing is that SHA256 hash function doesn't work in 2008R2, so I used SHA1 instead of SHA256. Anyway, thanks!


Tuesday, November 25, 2014 - 8:29:25 AM - Koen Verbeeck Back To Top (35395)

@Yan: you are absolutely correct. The update of the hash should be done in the update statement.
When you have calculated the hash with the script component, you can store it in the staging table as well. And then you can update it using the UPDATE statement.

The reason I forgot to include this in the tip is because I based my package upon an actual package, where I do not update the rows but rather insert new versions of a row, just like in a type II dimension. 


Tuesday, November 25, 2014 - 6:45:25 AM - Yan Ouyang Back To Top (35392)

Nice article. Help me a lot. But I have a question about the update: I did not see where the DWH_Hashs of the updated records update. 


Thursday, September 25, 2014 - 3:30:19 AM - Koen Verbeeck Back To Top (34709)

@Bob: I haven't had time yet to test your claim. I'll come back at you later :)

@Damian: if you want to detect deletes you must make sure you do a full load of the source every time. Otherwise how would you know if a row has been deleted or if it just wasn't in the load window? Typically you can detect deletes by checking all of the rows of the destination table against the rows in the staging table using a NOT EXISTS clause.


Wednesday, September 24, 2014 - 11:26:50 AM - Damian Back To Top (34695)

Great article, got this working despite being a newbie to SSIS.

I want to expand this to deal with deletes too.  Can anyone give me any pointers?


Tuesday, September 16, 2014 - 11:28:02 AM - Bob Frasca Back To Top (34545)

Your remark about using the T-SQL hashbytes function isn't entirely accurate. This expression works just fine.

((DT_WSTR,42)srchash != (DT_WSTR,42)desthash)


Thursday, April 17, 2014 - 11:25:15 AM - Steven Neumersky Back To Top (30093)

Definitely better than the CHECKSUM transform on SQLIS.com.  Good stuff....as usual.


Friday, March 21, 2014 - 7:48:43 AM - Phil Parkin Back To Top (29827)

Nice article. It would also be interesting to see some performance stats comparing the HASH-only update with the full update to get an idea of how much of an overhead is added by creating and matching on the hashes.

If it is likely that a large percentage of the UPDATEs are actual changes, I would expect the full update to win.


Thursday, March 6, 2014 - 7:46:07 AM - Koen Verbeeck Back To Top (29656)

Indeed it does, Brett. Except you used MD5 (I prefer SHA256, but this is more of a personal choice), Visual Basic and the ugly OLE DB command ;)

In your code you looped over the columns of the input row in the buffer, which is certainly more efficient.


Wednesday, March 5, 2014 - 9:02:08 PM - Brett Flippin Back To Top (29651)

Seems pretty similar to my post on SQLServerCentral in 2010. Take a gander:

 

http://www.sqlservercentral.com/articles/Integration+Services+%28SSIS%29/69766/


Tuesday, February 25, 2014 - 3:42:58 AM - David S Back To Top (29561)

Nice article, we did this some weeks back but utilised another software package, (Pentaho), but went through the same process of comparing hash values. However, we decided to truly determine if a change was an update by determining which fields we wanted to monitor for changes.

We had to do this as our incoming data file was a FULL refresh of the data we have loaded, and we could not afford the downtime in replacing our whole data set.

1. When doing initial load of data creating a hash value of all KEY fields to monitor added together, and store this hash value along with the data.

2. On import of new data into stage table, compare a hash summary of key fileds to hash value stored in database. If the hash values differed in any way, then this was an update, and passed on as such.

3. We also added a final step of comparing the Identity values in the incoming data file to what we had, (as our incoming data file was a 'full refresh' of the data every time). If anything was found in our current dataset which did not exist in the incoming data file, we marked the source record as disabled, so it would no longer be used.


Tuesday, February 18, 2014 - 9:30:50 AM - Cedric Back To Top (29490)

Nice, already knew how to do it, but this sure is the best manual I've ever seen that explains the magic behind it.


Tuesday, February 18, 2014 - 8:49:23 AM - Greg Robidoux Back To Top (29488)

Download has been fixed.


Tuesday, February 18, 2014 - 5:20:45 AM - Saiful Back To Top (29487)

DOWNLOAD LINK FOR THIS PACKAGE IS NOT WORKING....














