Custom SQL Server Change Tracking Example

By: Ray Barley | Comments (1) | Related: More > Import and Export

Problem

We have several SQL Server data marts that need to get updated with the daily changes from our main business application database (i.e. our source system) which is a SQL Server database. Our Extract, Transform and Load (ETL) processes have been following the truncate and load pattern where we truncate our staging tables and load them with every row from the source system database for selected tables every day. We have built our own custom business logic to figure out what changed, and perform the appropriate updates in our data marts. Given that we are processing every row and not just the ones that changed, the duration of this process is no longer acceptable. We would like to transform our process into one that works like the built-in Change Tracking in SQL Server. We want to query the main business application database and just get the rows that have been inserted or updated. The problem is that all we have to work with is a daily database snapshot of our source system. The tables do have primary keys defined and a LAST_UPDATE column which is a DATE type. Can you provide a solution?

Solution

In this tip I will demonstrate a simple way of implementing a change tracking solution.

Our scenario is a common one where we need to extract the latest inserts and updates from our selected source system tables and load into staging tables which we will use to update our data marts. Since the source system tables have a LAST_UPDATE column, we can retrieve just the rows that have recently been inserted or updated rather than retrieving every row. I call this the incremental extract and load to staging pattern.

A database snapshot is essentially a read-only copy of a database. Since our source system is a database snapshot, we are not able to take advantage of the built-in SQL Server solutions that we could otherwise leverage in the source system such as Change Tracking and Change Data Capture. Change Tracking is a light-weight solution that logs the rows that have been inserted, updated or deleted into tables that you can query and use to retrieve the actual rows that have been inserted or updated. Change Data Capture is a much more robust solution where you can specify the tables and columns you want to track and SQL Server will capture the changes into specific tables that you can query.

In the following sections I will walk through the demo scenario, the incremental extract and load to staging pattern, the implementation details, and a demo of the solution.

Demo Scenario

Before digging in to the implementation details, I want to provide a demo scenario that I will reference. The steps to prepare the demo scenario are to setup the sample source database, database snapshot, and the sample staging database.

For the source database, I will use an old SQL Server sample database called pubs. It’s extremely easy to setup. Here are the steps:

Create a pubs database
Get the T-SQL script to create the pubs database objects and load the sample data from GitHub. Click on the instpubs.sql script and copy/paste the T-SQL code into a new SQL Server Management Studio (SSMS) query window
Execute the T-SQL script
I’m going to use the publishers table for the change tracking demo; execute the T-SQL script below to add the LAST_MODIFIED column, set its value to a particular date, and create a non-clustered index on it for efficient query performance:

/* SCRIPT: ADD_LAST_MODIFIED_TO_PUBLISHERS.sql */
USE pubs;
GO

ALTER TABLE [dbo].[publishers]
ADD [LAST_MODIFIED] DATE NOT NULL
DEFAULT GETDATE();
GO

-- set [LAST_MODIFIED] to a couple of days ago
-- so we can see changes when we test
UPDATE [pubs_snapshot].[dbo].[publishers]
SET [LAST_MODIFIED] = '20200301';
GO

CREATE NONCLUSTERED INDEX NC_publishers_LAST_MODIFIED
ON [dbo].[publishers] ([LAST_MODIFIED])
GO

Use the following T-SQL script to get the data file name for the pubs database (your filename may be different based on your SQL Server version):

 /* SCRIPT: GET_PUBS_DATABASE_FILENAME.sql */
USE pubs
GO

SELECT 
[name]
, [physical_name] 
FROM sys.database_files
WHERE [name] = 'pubs'
AND [type_desc] = 'ROWS'

Get the data file name for the pubs database.

Use the following T-SQL script to drop the database snapshot if it exists and create the database snapshot based on your filename from the previous step:

/* SCRIPT: CREATE_DATABASE_SNAPSHOT.sql

Change FILENAME path below as necessary */

USE master
GO

-- drop the snapshot if it exists
IF EXISTS (
	SELECT 
		s1.[name]			AS [SNAPSHOT_DATABASE]
	,	s1.[create_date]	AS [SNAPSHOT_CREATE_DATE]
	,	s0.[name]			AS [SNAPSHOT_OF_DATABASE]
	FROM sys.databases s1
	JOIN sys.databases s0
		ON s0.[database_id] = s1.[source_database_id]
	WHERE s1.[name] = 'pubs_snapshot'
)
BEGIN
	DROP DATABASE pubs_snapshot;
END

CREATE DATABASE pubs_snapshot
ON  (
	NAME=pubs
,	FILENAME='C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\DATA\pubs_snapshot.mdf'
) 
AS SNAPSHOT OF pubs;

For the staging database, I will use a new database called pubs_staging. The following are the high-level steps for implementing the simple change tracking solution using the publishers table:

Create a staging database named pubs_staging
Create the publishers table in the staging database with the same columns as in the pubs_snapshot database
Create a stored procedure to perform the incremental extract from the publishers table in the source system and update the publishers table in the pubs_staging database

I will cover the table and stored procedure details in the implementation section below.

Incremental Extract and Load to Staging Pattern

The pattern for the incremental extract and load to staging is very simple. Remember that our goal is to retrieve the rows from the source system that have been inserted or updated since the last time we executed the process. A key point is that while we may execute the incremental extract and load process on a very specific schedule (e.g. daily), we want the process to work regardless of how much time has passed since the last time we ran the process.

Both the source and staging tables have the same primary key constraint. Each time the incremental extract and load to staging process is run, the staging table is essentially synchronized to the source table.

The following are the steps in the incremental extract and load to staging pattern:

Query the staging table to get the MAX value of the LAST_MODIFIED column
Query the source table to get every row where the LAST_MODIFIED column value is greater than the MAX LAST_MODIFIED column in the staging table
For each row retrieved from the source table, perform an insert or update on the staging table based on the following criteria:
1. If the primary key value in the source table does not exist in the staging table, then insert the row into the staging table
2. If the primary key value in the source table does exist in the staging table, then update the row in the staging table from the source table row

The incremental extract and load to staging pattern provides us with a staging table that we can query and determine what rows have changed for any date or date range. When using the staging table to update our data marts, this is essentially the same capability that we get with Change Tracking or Change Data Capture although these built-in features do provide additional capabilities (e.g. what columns changed) and they do not require any coding other than configuration.

Now it is time to review the implementation details.

Implementation

In this section I will provide the implementation details on the incremental extract and load to staging. The T-SQL script to create the publishers table in the pubs_staging database is shown below:

USE [pubs_staging]
GO

CREATE TABLE [dbo].[publishers] (
	[pub_id]		[char](4) NOT NULL,
	[pub_name]		[varchar](40) NULL,
	[city]			[varchar](20) NULL,
	[state]			[char](2) NULL,
	[country]		[varchar](30) NULL,
	[LAST_MODIFIED] 	[date] NOT NULL,
	CONSTRAINT 		[UPKCL_pubind] PRIMARY KEY CLUSTERED 
	(
		[pub_id] ASC
	) 
)
GO

The publishers table has the same columns as the publishers table in the source database.

The following stored procedure is used to implement the incremental extract and load to staging pattern as described in the previous section:

Step 1: Get the MAX([LAST_MODIFIED]) from Staging

USE [pubs_staging]
GO

CREATE PROCEDURE [dbo].[MERGE_PUBLISHERS]
AS
BEGIN
	DECLARE @MAX_LAST_MODIFIED	DATE;

	-- Get the MAX LAST_MODIFIED from the staging table
	SELECT
		@MAX_LAST_MODIFIED = MAX([LAST_MODIFIED])
	FROM [pubs_staging].[dbo].[publishers];

	-- Handle edge case - staging table is empty and 
	-- @MAX_LAST_MODIFIED IS NULL; set to arbitrary date
	IF @MAX_LAST_MODIFIED IS NULL
		SET @MAX_LAST_MODIFIED = '18991231';

The main points are:

We get the MAX([LAST_MODIFIED]) from the publishers table in the pubs_staging database so that we can retrieve rows from the source table that have been inserted or updated after this date
If the publishers table in the pubs_staging database does not have any rows, the @MAX_LAST_MODIFIED variable will be NULL
If the @MAX_LAST_MODIFIED variable IS NULL, we use an arbitrary date to assure that the value of the LAST_MODIFIED column in every row in the source table will be greater than this date

Step 2: Retrieve the Inserted and Updated Rows from the Source Table

	-- Get rows from source table where LAST_MODIFIED
	-- is > MAX([LAST_MODIFIED]) in staging table.
	WITH CTE_SOURCE AS (
		SELECT
			pub_id
		,	pub_name
		,	city
		,	state
		,	country
		,	LAST_MODIFIED	
		FROM [pubs_snapshot].[dbo].[publishers]	
		WHERE [LAST_MODIFIED] > @MAX_LAST_MODIFIED
	)

The main points are:

I use a Common Table Expression (CTE) to specify the query to retrieve the rows from the source table that have been inserted or updated
The CTE will be referenced in the last step of the stored procedure below
A CTE is a technique that isolates a part of a complex T-SQL statement

Step 3: Update the Staging Table

MERGE [pubs_staging].[dbo].[publishers] AS tgt
	USING (
		SELECT
			pub_id
		,	pub_name
		,	city
		,	state
		,	country
		,	LAST_MODIFIED	
		FROM CTE_SOURCE
	) AS src
	ON (tgt.pub_id = src.pub_id)

	WHEN MATCHED THEN
		UPDATE SET
			pub_name = src.pub_name
		,	city = src.city
		,	state = src.state
		,	country = src.country
		,	LAST_MODIFIED = src.LAST_MODIFIED

	WHEN NOT MATCHED THEN
		INSERT (
			pub_id
		,	pub_name
		,	city
		,	state
		,	country
		,	LAST_MODIFIED	
		) 
		VALUES (
			src.pub_id
		,	src.pub_name
		,	src.city
		,	src.state
		,	src.country
		,	src.LAST_MODIFIED	
		);
END

The main points are:

This is the perfect scenario for using the T-SQL MERGE which works on a source and a target
The source is the publishers table in the database snapshot
The target is the publishers table in the pubs_staging database
Both the source and the target have the same primary key constraint; the ON clause joins the source and target based on the primary key value
The MATCHED clause updates the target when the pub_id value in the source matches a row in the target
The NOT MATCHED clause inserts the source row values into the target when the source pub_id value does not exist in the target

Demo

Now that I have completed covering the implementation details, it’s time for a demo. The following are the steps to demo this tip based on my description in the Demo Scenario section above:

Step 1: Execute stored procedure to extract and load; view results

/* SCRIPT: TEST_EXTRACT_AND_LOAD_TO_STAGING.sql */
USE [pubs_staging]
GO

-- Extract and load to staging
EXEC [dbo].[MERGE_PUBLISHERS];

-- See all rows loaded
SELECT * FROM [dbo].[publishers];

Execute stored procedure to extract, load, and view results.

The main points are:

The stored procedure extracts the rows from the source that have been inserted or updated
Since this is the first time it’s been run, all rows are inserted into the publishers table in the staging database

Step 2: Make changes to the publishers table in the pubs database

-- do insert in [pubs]
INSERT [pubs].[dbo].[publishers] (
	pub_id
,	pub_name
,	city
,	state
,	country
,	LAST_MODIFIED	
) 
VALUES (
	'9902'
,	'Bond Publishers'
,	'New Orleans'
,	'LA'
,	'USA'
,	'20200302'	
);

-- do update in [pubs]
UPDATE [pubs].[dbo].[publishers]
SET
	city = 'Irving'
,	LAST_MODIFIED = '20200302'
WHERE pub_id = 1756;

The main points are:

This is just a simple insert and update in the publishers table of the pubs database
Note that the LAST_UPDATE value is set to ‘20200302’ in each case; the stored procedure will extract these rows and update the publishers table in the staging database

Step 3: Execute the CREATE_DATABASE_SNAPSHOT script

This script was run in the Demo Scenario section above. It drops the database snapshot if it exists and creates it. Remember that our source database is the database snapshot which is essentially a read-only copy of a database at the point in time it is created. The problem statement notes that we use a database snapshot that is recreated daily from the main business application database.

Step 4: Execute stored procedure to extract and load; view results

This is just a repeat of Step 1; note that the new rows added and updated are highlighted below:

/* SCRIPT: TEST_EXTRACT_AND_LOAD_TO_STAGING.sql */
USE [pubs_staging]
GO

-- Extract and load to staging
EXEC [dbo].[MERGE_PUBLISHERS];

-- See all rows loaded
SELECT * FROM [dbo].[publishers];

Execute stored procedure to extract, load, and view results including the latest changes.

Next Steps

I have now demonstrated the solution using the incremental extract and load to staging pattern. This approach of extracting and loading just the rows that have changed since the last run can save a substantial amount of processing time.

I have a future tip in mind where I will implement this pattern using SQL Server Integration Services (SSIS) and Biml which will automate the repetitive task of manually coding the implementation of the incremental extract and load to staging pattern.

The following are the next steps:

Review your ETL processes and determine whether you could benefit by implementing this approach.
Take a look at the Biml Tutorial to get an understanding of how you can leverage Biml to automate repetitive ETL tasks.
Download the code from this tip here and experiment.

About the author

Ray Barley is a Principal Architect at IT Resource Partners and a MSSQLTips.com BI Expert.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips

Friday, April 10, 2020 - 6:26:59 PM - Robert Magrogan, Jr	Back To Top (85337)
Consider using CHECKSUM to compare the OLTP table with the data mart/warehouse table. Use no match result in the WHERE clause of you MERGE.

Custom SQL Server Change Tracking Example

Problem

Solution

Demo Scenario

Incremental Extract and Load to Staging Pattern

Implementation

Step 1: Get the MAX([LAST_MODIFIED]) from Staging

Step 2: Retrieve the Inserted and Updated Rows from the Source Table

Step 3: Update the Staging Table

Demo

Step 1: Execute stored procedure to extract and load; view results

Step 2: Make changes to the publishers table in the pubs database

Step 3: Execute the CREATE_DATABASE_SNAPSHOT script

Step 4: Execute stored procedure to extract and load; view results

Next Steps

About the author

Comments For This Article