By: Dinesh Asanka | Comments (4) | Related: More > Integration Services Development
Problem
SQL Server data warehouses contain large volumes of data from multiple heterogenous data sources. However, during the data extraction process, a major challenge is to retrieve only modifications that have occurred after the last load to the data warehouse. From each data source there can be modifications, additions and deletions which is a challenge to identify what is different from what is already in the data warehouse. Since loading the entire data set is not viable, there needs to be a better mechanism to retrieve only the changes for incremental data updates.
Solution
There are few solutions to tackle this challenge, from which you can choose the best to suit to your environment. Here are a few options:
- Typically, at the data source you can have auditing columns like UpdatedDateTime or ModifiedDateTime. These columns can be filtered to find out the newly added and modified records. A possible performance issue would be the need to scan the entire record set to find the specific rows unless you have indexes on these columns to minimize the performance impact. Also, records which are physically deleted from tables cannot be detected to make the necessary updates.
- Triggers can also be used to capture modifications in records. However, triggers can add a performance cost to existing transaction systems.
- Change Data Capture (CDC) is a feature of SQL Server Enterprise Edition that uses Transaction Log Sequence Numbers (LSNs) to identify insert, update, and delete operations that have occurred within a specified time period. CDC can be implemented in the Enterprise Edition of SQL Server 2008 or later version. Though it uses the transaction log, the database does not need to be in Full recovery model and CDC can be enabled even in the Simple recovery model.
We will use the CDC tasks in SSIS to see how this can be done.
Setup Change Data Capture (CDC) Example
Let's create a sample table to demonstrate the use of CDC.
USE Master GO CREATE Database SourceDB GO USE SourceDB GO IF EXISTS (SELECT 1 FROM Sys.tables WHERE NAME = 'Customers') BEGIN DROP TABLE Customers END CREATE TABLE Customers (ID INT IDENTITY PRIMARY KEY, Name VARCHAR(50), City VARCHAR(50), Status VARCHAR(2) )
After the table is created, let’s insert a few records as follows.
INSERT INTO Customers (Name,City,Status) VALUES ('John','New York','A'), ('Rose','Denver','A'), ('Kent','Settle','A') GO
Let’s create three tables to hold the incremental data by using the following scripts.
USE Master GO CREATE Database StagingDB GO USE StagingDB GO IF EXISTS (SELECT 1 FROM Sys.tables WHERE NAME = 'Customers_INS') BEGIN DROP TABLE Customers_INS END IF EXISTS (SELECT 1 FROM Sys.tables WHERE NAME = 'Customers_UPD') BEGIN DROP TABLE Customers_UPD END IF EXISTS (SELECT 1 FROM Sys.tables WHERE NAME = 'Customers_DEL') BEGIN DROP TABLE Customers_DEL END -- CREATE TABLE Customers_INS (ID INT PRIMARY KEY, Name VARCHAR(50), City VARCHAR(50), Status VARCHAR(2) ) CREATE TABLE Customers_UPD (ID INT PRIMARY KEY, Name VARCHAR(50), City VARCHAR(50), Status VARCHAR(2) ) CREATE TABLE Customers_DEL (ID INT PRIMARY KEY, Name VARCHAR(50), City VARCHAR(50), Status VARCHAR(2) )
Enabling CDC for SQL Server Database
The following script enables CDC for the database if it is not enabled and enables CDC for the customers table. It is essential to have a primary key constraint on the table where you enable CDC.
USE SourceDB GO -- Check sa has owner_sid permissions, if not then change db owner. IF (select SUSER_SNAME(owner_sid) from sys.databases WHERE NAME='SourceDB') <> 'sa' BEGIN EXEC sp_changedbowner 'sa' END GO -- Check SourceDB is not already CDC enabled, if not then enable CDC for the database if (select is_cdc_enabled from sys.databases WHERE NAME='SourceDB') = 'false' BEGIN EXEC sys.sp_cdc_enable_db END GO -- Check dbo.customers is not already CDC enabled, if not then enable CDC for the table if (select is_tracked_by_cdc from sys.tables WHERE SCHEMA_NAME(schema_id)='dbo' and NAME = 'Customers') = 'false' BEGIN EXEC sys.sp_cdc_enable_table @source_schema = N'dbo', @source_name = N'Customers', @role_name = NULL, @supports_net_changes = 1 END GO
It is important to note that, there is no necessity for tables in the source database to include a column that indicates the date and time of the last modification. This means that no structural changes are needed in order to enable CDC for extraction.
Make sure that SQL Server Agent is running as a SQL Server Agent Job is used to capture CDC data.
SSIS Package for Initial Load of Data
In any data warehouse, there is an initial load which will extract all data to a staging database at the start. This will be a one-time execution. We will use SSIS for extraction using the CDC Control Task.
In the control flow task of SSIS, we have the following: a CDC Control Task, a Data Flow Task and another CDC Control Task as shown below. We rename the first one as "Mark Initial Load Start" and the last one as "Mark Initial Load End".
Right click on "Mark Initial Load Start" to edit the task. This task will write an encoded value, including the starting LSN to a package variable.
For "Mark Initial Load Start" configure as follows:
- SQL Server CDC database ADO.NET connection manager: Configured to source database.
- CDC control operation: Select "Mark initial load start" from the drop down.
- Variable containing the CDC state: click New and create a new variable named CDC_State.
- Automatically store state in a database table: selected.
- Connection manager for the database where the state is stored: Staging database.
- Table to use for storing state: click New, and then click Run to create the cdc_states table. Following is the schema of the CDC State table. When the SSIS package is deployed to production, the cdc_state table should be created using the following T-SQL script.
- State name: CDC_State.
This is the cdc_states table that gets created. Here is the schema of the table and index.
CREATE TABLE [dbo].[cdc_states] ([name] [nvarchar](256) NOT NULL, [state] [nvarchar](256) NOT NULL) ON [PRIMARY] GO CREATE UNIQUE NONCLUSTERED INDEX [cdc_states_name] ON [dbo].[cdc_states] ( [name] ASC ) WITH (PAD_INDEX = OFF) ON [PRIMARY] GO
The Data Flow Task will extract all rows from the source and load them into the destination which is a staging table in this scenario. This data flow task does not require any CDC specific components.
For "Mark Initial Load End" configure as follows:
- SQL Server CDC database ADO.NET connection manager: Configured to source database.
- CDC control operation: Select "Mark initial load end" from the drop down.
- Variable containing the CDC state: Select the variable named CDC_State from the drop down.
- Automatically store state in a database table: selected.
- Connection manager for the database where the state is stored: Staging database
- Table to use for storing state: Select the cdc_states table which was created before.
- State name: CDC_State.
When this package is executed all records in the source table are inserted to the Customer_INS table that we created above. Also, the following get stored in the cdc_states table.
Update Some Data to See Impact of SSIS Incremental Updates
After extracting the initial load, the next task is to extract the incremental data load. To simulate this, let's execute the following query to add, modify and delete some records.
Before running these queries, the following are the records in the Customer table.
ID | Name | City | Status |
---|---|---|---|
1 | John | New York | A |
2 | Rose | Denver | A |
3 | Kent | Settle | A |
Run the following updates.
USE SourceDB GO INSERT INTO Customers (Name,City,Status) VALUES ('Kerry','Chicago','A'), ('Laurel','Texas','A') , ('Matt','Vegas','A') UPDATE Customers SET Status = 'I' WHERE ID = 5 UPDATE Customers SET Status = 'I' WHERE ID = 1 DELETE Customers WHERE ID =2 DELETE Customers WHERE ID =6 GO
After running the above queries, the following are the records in the table.
ID | Name | City | Status |
---|---|---|---|
1 | John | New York | I |
3 | Kent | Settle | A |
4 | Kerry | Chicago | A |
5 | Laurel | Texas | I |
If you analyze the records, you can see that record ID 1 is modified and record ID 2 is deleted. Record ID 4 is added, Record ID 5 is added and modified, record ID 6 is added and then deleted. However, when the incremental load is considered we need to consider only the net changes. For example, since Record ID 6 is ultimately deleted, it doesn’t need to be considered for the load.
SSIS Package for Incremental Data Updates
Let’s create the SSIS package to extract incremental data from sources. This is what the package will look like.
Truncate Staging Tables is an Execute SQL Task to delete previously inserted data in the staging tables, so that each time the staging tables get a fresh set of data.
Get Processing Range is a CDC Control Task like we configured before, the only difference is that we use Get processing range for the CDC control operation.
The next task, Extract Modified Records is a Data Flow Task which will look like this when configured.
These are the CDC tasks in the SSIS Toolbox.
Customer CDC is a CDC Source, configured as below for the customers table.
Important configuration is that the CDC processing mode is set to Net in order to capture the net changes rather than capturing all records which is not necessary for data warehousing.
The CDC Splitter, is used to split insert, delete and update records and split records are inserted to the different staging tables Customers_INS, Customers_UPD and Customers_DEL.
To help to see what is going on in the package I enabled the data viewer so records can be verified and checked to make sure the data is correct and they are going to the correct location.
In the above figure, we can see that record ID 6 is not captured as it is inserted and deleted in one session. This means that net changes are captured which will reduce the complexity and the extraction duration. Since there are no changes to record ID 3, that record is not captured for the extraction.
Finally, the Mark Processed Range CDC control will mark that this set is processed, so the next time the SSIS package is executed, only new records will be extracted and processed.
After getting the data to the different stating tables, the next step of loading the data will be much easier.
Next Steps
- You can get more details about the CDC Tasks from this link.
- Check out the CDC tips.
- Learn more about SSIS with this tutorial.
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips