By: Ray Barley
Problem
We have several SQL Server data marts that need to get updated with the daily changes from our main business application database (i.e. our source system) which is a SQL Server database. Our Extract, Transform and Load (ETL) processes have been following the truncate and load pattern where we truncate our staging tables and load them with every row from the source system database for selected tables every day. We have built our own custom business logic to figure out what changed, and perform the appropriate updates in our data marts. Given that we are processing every row and not just the ones that changed, the duration of this process is no longer acceptable. We would like to transform our process into one that works like the built-in Change Tracking in SQL Server. We want to query the main business application database and just get the rows that have been inserted or updated. The problem is that all we have to work with is a daily database snapshot of our source system. The tables do have primary keys defined and a LAST_UPDATE column which is a DATE type. Can you provide a solution?
Solution
In this tip I will demonstrate a simple way of implementing a change tracking solution.
Our scenario is a common one: we need to extract the latest inserts and updates from selected source system tables and load them into staging tables, which we then use to update our data marts. Since the source system tables have a LAST_UPDATE column, we can retrieve just the rows that have recently been inserted or updated rather than retrieving every row. I call this the incremental extract and load to staging pattern.
A database snapshot is essentially a read-only copy of a database. Since our source system is a database snapshot, we cannot take advantage of the built-in SQL Server features we could otherwise leverage in the source system, such as Change Tracking and Change Data Capture. Change Tracking is a lightweight solution that logs which rows have been inserted, updated or deleted into internal tables that you can query to retrieve the actual changed rows. Change Data Capture is a more robust solution where you specify the tables and columns you want to track and SQL Server captures the changes into change tables that you can query.
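For context, if we had write access to the source database itself, enabling Change Tracking would be purely a configuration task. A minimal sketch of that setup is shown below for comparison only; it cannot be applied in our scenario because a database snapshot is read-only:

-- For comparison only: Change Tracking setup on a writable database;
-- this cannot be run against a read-only database snapshot
ALTER DATABASE pubs
SET CHANGE_TRACKING = ON
(CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);
GO

USE pubs;
GO

ALTER TABLE [dbo].[publishers]
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = ON);
GO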
In the following sections I will walk through the demo scenario, the incremental extract and load to staging pattern, the implementation details, and a demo of the solution.
Demo Scenario
Before digging into the implementation details, I want to provide a demo scenario that I will reference. The steps to prepare the demo scenario are to set up the sample source database, the database snapshot, and the sample staging database.
For the source database, I will use an old SQL Server sample database called pubs. It’s extremely easy to set up. Here are the steps:
- Create a pubs database
- Get the T-SQL script to create the pubs database objects and load the sample data from GitHub. Click on the instpubs.sql script and copy/paste the T-SQL code into a new SQL Server Management Studio (SSMS) query window
- Execute the T-SQL script
- I’m going to use the publishers table for the change tracking demo; execute the T-SQL script below to add the LAST_MODIFIED column, set its value to a particular date, and create a non-clustered index on it for efficient query performance:
/* SCRIPT: ADD_LAST_MODIFIED_TO_PUBLISHERS.sql */
USE pubs;
GO

ALTER TABLE [dbo].[publishers]
ADD [LAST_MODIFIED] DATE NOT NULL DEFAULT GETDATE();
GO

-- set [LAST_MODIFIED] to a couple of days ago
-- so we can see changes when we test
UPDATE [pubs].[dbo].[publishers]
SET [LAST_MODIFIED] = '20200301';
GO

CREATE NONCLUSTERED INDEX NC_publishers_LAST_MODIFIED
ON [dbo].[publishers] ([LAST_MODIFIED]);
GO
- Use the following T-SQL script to get the data file name for the pubs database (your filename may be different based on your SQL Server version):
/* SCRIPT: GET_PUBS_DATABASE_FILENAME.sql */
USE pubs
GO

SELECT
  [name]
, [physical_name]
FROM sys.database_files
WHERE [name] = 'pubs'
AND   [type_desc] = 'ROWS'
- Use the following T-SQL script to drop the database snapshot if it exists and create the database snapshot based on your filename from the previous step:
/* SCRIPT: CREATE_DATABASE_SNAPSHOT.sql
   Change FILENAME path below as necessary */
USE master
GO

-- drop the snapshot if it exists
IF EXISTS (
   SELECT
     s1.[name]        AS [SNAPSHOT_DATABASE]
   , s1.[create_date] AS [SNAPSHOT_CREATE_DATE]
   , s0.[name]        AS [SNAPSHOT_OF_DATABASE]
   FROM sys.databases s1
   JOIN sys.databases s0 ON s0.[database_id] = s1.[source_database_id]
   WHERE s1.[name] = 'pubs_snapshot'
)
BEGIN
   DROP DATABASE pubs_snapshot;
END

CREATE DATABASE pubs_snapshot
ON (
  NAME = pubs
, FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\DATA\pubs_snapshot.mdf'
)
AS SNAPSHOT OF pubs;
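To confirm the snapshot was created, you can run a quick sanity check (my own addition, not one of the tip's scripts) that lists database snapshots and their source databases:

-- Optional sanity check: list database snapshots and their sources
SELECT
  s1.[name]        AS [SNAPSHOT_DATABASE]
, s1.[create_date] AS [SNAPSHOT_CREATE_DATE]
, s0.[name]        AS [SNAPSHOT_OF_DATABASE]
FROM sys.databases s1
JOIN sys.databases s0 ON s0.[database_id] = s1.[source_database_id];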
For the staging database, I will use a new database called pubs_staging. The following are the high-level steps for implementing the simple change tracking solution using the publishers table:
- Create a staging database named pubs_staging
- Create the publishers table in the staging database with the same columns as in the pubs_snapshot database
- Create a stored procedure to perform the incremental extract from the publishers table in the source system and update the publishers table in the pubs_staging database
I will cover the table and stored procedure details in the implementation section below.
Incremental Extract and Load to Staging Pattern
The pattern for the incremental extract and load to staging is very simple. Remember that our goal is to retrieve the rows from the source system that have been inserted or updated since the last time we executed the process. A key point is that while we may execute the incremental extract and load process on a very specific schedule (e.g. daily), we want the process to work regardless of how much time has passed since the last time we ran the process.
Both the source and staging tables have the same primary key constraint. Each time the incremental extract and load to staging process is run, the staging table is essentially synchronized to the source table.
The following are the steps in the incremental extract and load to staging pattern:
- Query the staging table to get the MAX value of the LAST_MODIFIED column
- Query the source table to get every row where the LAST_MODIFIED column value is greater than the MAX LAST_MODIFIED column in the staging table
- For each row retrieved from the source table, perform an insert or update on the staging table based on the following criteria:
- If the primary key value in the source table does not exist in the staging table, then insert the row into the staging table
- If the primary key value in the source table does exist in the staging table, then update the row in the staging table from the source table row
The incremental extract and load to staging pattern provides us with a staging table that we can query to determine which rows changed for any date or date range. When using the staging table to update our data marts, this is essentially the same capability that we get with Change Tracking or Change Data Capture, although those built-in features provide additional capabilities (e.g. identifying which columns changed) and require no coding beyond configuration.
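For example, a query along the following lines (the dates are illustrative) pulls the staging rows that changed within a given range:

-- Example: staging rows that changed in a date range (dates are illustrative)
SELECT *
FROM [pubs_staging].[dbo].[publishers]
WHERE [LAST_MODIFIED] >= '20200301'
AND   [LAST_MODIFIED] <  '20200303';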
Now it is time to review the implementation details.
Implementation
In this section I will provide the implementation details on the incremental extract and load to staging. The T-SQL script to create the publishers table in the pubs_staging database is shown below:
USE [pubs_staging]
GO

CREATE TABLE [dbo].[publishers] (
   [pub_id]        [char](4)     NOT NULL,
   [pub_name]      [varchar](40) NULL,
   [city]          [varchar](20) NULL,
   [state]         [char](2)     NULL,
   [country]       [varchar](30) NULL,
   [LAST_MODIFIED] [date]        NOT NULL,
   CONSTRAINT [UPKCL_pubind] PRIMARY KEY CLUSTERED ( [pub_id] ASC )
)
GO
The publishers table has the same columns as the publishers table in the source database.
The stored procedure that implements the incremental extract and load to staging pattern described in the previous section is shown below in three steps:
Step 1: Get the MAX([LAST_MODIFIED]) from Staging
USE [pubs_staging]
GO

CREATE PROCEDURE [dbo].[MERGE_PUBLISHERS]
AS
BEGIN
   DECLARE @MAX_LAST_MODIFIED DATE;

   -- Get the MAX LAST_MODIFIED from the staging table
   SELECT @MAX_LAST_MODIFIED = MAX([LAST_MODIFIED])
   FROM [pubs_staging].[dbo].[publishers];

   -- Handle edge case - staging table is empty and
   -- @MAX_LAST_MODIFIED IS NULL; set to arbitrary date
   IF @MAX_LAST_MODIFIED IS NULL
      SET @MAX_LAST_MODIFIED = '18991231';
The main points are:
- We get the MAX([LAST_MODIFIED]) from the publishers table in the pubs_staging database so that we can retrieve rows from the source table that have been inserted or updated after this date
- If the publishers table in the pubs_staging database does not have any rows, the @MAX_LAST_MODIFIED variable will be NULL
- If the @MAX_LAST_MODIFIED variable IS NULL, we use an arbitrary date to ensure that the value of the LAST_MODIFIED column in every row of the source table will be greater than this date (an equivalent ISNULL variation is sketched below)
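As an aside, the NULL check could also be folded into the assignment itself with ISNULL; this variation is functionally equivalent:

-- Equivalent variation: default the date within the assignment
SELECT @MAX_LAST_MODIFIED = ISNULL(MAX([LAST_MODIFIED]), '18991231')
FROM [pubs_staging].[dbo].[publishers];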
Step 2: Retrieve the Inserted and Updated Rows from the Source Table
   -- Get rows from source table where LAST_MODIFIED
   -- is > MAX([LAST_MODIFIED]) in staging table.
   WITH CTE_SOURCE AS (
      SELECT
        pub_id
      , pub_name
      , city
      , state
      , country
      , LAST_MODIFIED
      FROM [pubs_snapshot].[dbo].[publishers]
      WHERE [LAST_MODIFIED] > @MAX_LAST_MODIFIED
   )
The main points are:
- I use a Common Table Expression (CTE) to specify the query to retrieve the rows from the source table that have been inserted or updated
- The CTE will be referenced in the last step of the stored procedure below
- A CTE defines a named result set that can be referenced by the statement that immediately follows it, which is a handy way to isolate one part of a complex T-SQL statement
Step 3: Update the Staging Table
   MERGE [pubs_staging].[dbo].[publishers] AS tgt
   USING (
      SELECT
        pub_id
      , pub_name
      , city
      , state
      , country
      , LAST_MODIFIED
      FROM CTE_SOURCE
   ) AS src
   ON (tgt.pub_id = src.pub_id)
   WHEN MATCHED THEN
      UPDATE SET
        pub_name      = src.pub_name
      , city          = src.city
      , state         = src.state
      , country       = src.country
      , LAST_MODIFIED = src.LAST_MODIFIED
   WHEN NOT MATCHED THEN
      INSERT (
        pub_id
      , pub_name
      , city
      , state
      , country
      , LAST_MODIFIED
      )
      VALUES (
        src.pub_id
      , src.pub_name
      , src.city
      , src.state
      , src.country
      , src.LAST_MODIFIED
      );
END
The main points are:
- This is the perfect scenario for the T-SQL MERGE statement, which operates on a source and a target
- The source is the publishers table in the database snapshot
- The target is the publishers table in the pubs_staging database
- Both the source and the target have the same primary key constraint; the ON clause joins the source and target based on the primary key value
- The WHEN MATCHED clause updates the target row when the pub_id value in the source matches a row in the target
- The WHEN NOT MATCHED clause inserts the source row values into the target when the source pub_id value does not exist in the target
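One caveat worth noting: this MERGE never deletes, so rows removed from the source remain in staging. MERGE does support a WHEN NOT MATCHED BY SOURCE clause for deletes, but that is only correct when the USING source is the entire source table rather than the incremental CTE; otherwise every unchanged staging row would be deleted. A sketch of that full-compare variation (a different pattern from the incremental one in this tip) looks like this:

-- Hypothetical full-compare variation (NOT this tip's incremental pattern):
-- reads the ENTIRE source table so deletes can be detected
MERGE [pubs_staging].[dbo].[publishers] AS tgt
USING [pubs_snapshot].[dbo].[publishers] AS src
ON (tgt.pub_id = src.pub_id)
WHEN MATCHED THEN
   UPDATE SET
     pub_name      = src.pub_name
   , city          = src.city
   , state         = src.state
   , country       = src.country
   , LAST_MODIFIED = src.LAST_MODIFIED
WHEN NOT MATCHED BY TARGET THEN
   INSERT (pub_id, pub_name, city, state, country, LAST_MODIFIED)
   VALUES (src.pub_id, src.pub_name, src.city, src.state, src.country, src.LAST_MODIFIED)
WHEN NOT MATCHED BY SOURCE THEN
   DELETE;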
Demo
Now that I have covered the implementation details, it’s time for a demo. The following are the steps to demo this tip based on my description in the Demo Scenario section above:
Step 1: Execute stored procedure to extract and load; view results
/* SCRIPT: TEST_EXTRACT_AND_LOAD_TO_STAGING.sql */
USE [pubs_staging]
GO

-- Extract and load to staging
EXEC [dbo].[MERGE_PUBLISHERS];

-- See all rows loaded
SELECT * FROM [dbo].[publishers];
The main points are:
- The stored procedure extracts the rows from the source that have been inserted or updated
- Since this is the first time it’s been run, all rows are inserted into the publishers table in the staging database
Step 2: Make changes to the publishers table in the pubs database
-- do insert in [pubs]
INSERT [pubs].[dbo].[publishers] (
  pub_id
, pub_name
, city
, state
, country
, LAST_MODIFIED
)
VALUES (
  '9902'
, 'Bond Publishers'
, 'New Orleans'
, 'LA'
, 'USA'
, '20200302'
);

-- do update in [pubs]
UPDATE [pubs].[dbo].[publishers]
SET city = 'Irving'
  , LAST_MODIFIED = '20200302'
WHERE pub_id = '1756';
The main points are:
- This is just a simple insert and update in the publishers table of the pubs database
- Note that the LAST_MODIFIED value is set to ‘20200302’ in each case; the stored procedure will extract these rows and update the publishers table in the staging database
Step 3: Execute the CREATE_DATABASE_SNAPSHOT script
This script was first run in the Demo Scenario section above. It drops the database snapshot if it exists and then recreates it. Remember that our source database is the database snapshot, which is essentially a read-only copy of a database at the point in time it is created. The problem statement notes that the database snapshot is recreated daily from the main business application database.
Step 4: Execute stored procedure to extract and load; view results
This is just a repeat of Step 1; this time the results will also show the newly inserted row and the updated row:
/* SCRIPT: TEST_EXTRACT_AND_LOAD_TO_STAGING.sql */
USE [pubs_staging]
GO

-- Extract and load to staging
EXEC [dbo].[MERGE_PUBLISHERS];

-- See all rows loaded
SELECT * FROM [dbo].[publishers];
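To see just the rows affected by this run, you can filter on the new LAST_MODIFIED value; this optional check is my own addition, not part of the tip's script:

-- Optional check: view only the rows from the latest changes
SELECT *
FROM [dbo].[publishers]
WHERE [LAST_MODIFIED] = '20200302';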
Next Steps
I have now demonstrated the solution using the incremental extract and load to staging pattern. This approach of extracting and loading just the rows that have changed since the last run can save a substantial amount of processing time.
I have a future tip in mind where I will implement this pattern using SQL Server Integration Services (SSIS) and Biml, which will automate the otherwise repetitive task of hand-coding the incremental extract and load to staging pattern.
The following are the next steps:
- Review your ETL processes and determine whether you could benefit by implementing this approach.
- Take a look at the Biml Tutorial to get an understanding of how you can leverage Biml to automate repetitive ETL tasks.
- Download the code from this tip here and experiment.