Learn more about SQL Server tools

mssqltips logo
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Whitepapers          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

 

SQL Server Integration Services SSIS CDC Tasks for Incremental Data Loading


By:   |   Last Updated: 2019-03-21   |   Comments   |   Related Tips: More > Integration Services Development

Problem

SQL Server data warehouses contain large volumes of data from multiple heterogenous data sources. However, during the data extraction process, a major challenge is to retrieve only modifications that have occurred after the last load to the data warehouse. From each data source there can be modifications, additions and deletions which is a challenge to identify what is different from what is already in the data warehouse. Since loading the entire data set is not viable, there needs to be a better mechanism to retrieve only the changes for incremental data updates.

Solution

There are few solutions to tackle this challenge, from which you can choose the best to suit to your environment. Here are a few options:

  • Typically, at the data source you can have auditing columns like UpdatedDateTime or ModifiedDateTime. These columns can be filtered to find out the newly added and modified records.  A possible performance issue would be the need to scan the entire record set to find the specific rows unless you have indexes on these columns to minimize the performance impact. Also, records which are physically deleted from tables cannot be detected to make the necessary updates.
  • Triggers can also be used to capture modifications in records. However, triggers can add a performance cost to existing transaction systems.
  • Change Data Capture (CDC) is a feature of SQL Server Enterprise Edition that uses Transaction Log Sequence Numbers (LSNs) to identify insert, update, and delete operations that have occurred within a specified time period. CDC can be implemented in the Enterprise Edition of SQL Server 2008 or later version.  Though it uses the transaction log, the database does not need to be in Full recovery model and CDC can be enabled even in the Simple recovery model.

We will use the CDC tasks in SSIS to see how this can be done.

Setup Change Data Capture (CDC) Example

Let's create a sample table to demonstrate the use of CDC.

USE Master
GO

CREATE Database SourceDB
GO
 
USE SourceDB
GO

IF EXISTS (SELECT 1 FROM Sys.tables WHERE NAME = 'Customers')
BEGIN
   DROP TABLE Customers
END
 
CREATE TABLE Customers
(ID INT IDENTITY PRIMARY KEY,
Name VARCHAR(50),
City VARCHAR(50),
Status VARCHAR(2)
)

After the table is created, let’s insert a few records as follows.

INSERT INTO Customers
(Name,City,Status)
VALUES
('John','New York','A'),
('Rose','Denver','A'),
('Kent','Settle','A')
GO

Let’s create three tables to hold the incremental data by using the following scripts.

USE Master
GO

CREATE Database StagingDB
GO

USE StagingDB
GO

IF EXISTS (SELECT 1 FROM Sys.tables WHERE NAME = 'Customers_INS')
BEGIN
   DROP TABLE Customers_INS
END
IF EXISTS (SELECT 1 FROM Sys.tables WHERE NAME = 'Customers_UPD')
BEGIN
   DROP TABLE Customers_UPD
END
IF EXISTS (SELECT 1 FROM Sys.tables WHERE NAME = 'Customers_DEL')
BEGIN
   DROP TABLE Customers_DEL
END

--
CREATE TABLE Customers_INS
(ID INT PRIMARY KEY,
Name VARCHAR(50),
City VARCHAR(50),
Status VARCHAR(2)
)
 
CREATE TABLE Customers_UPD
(ID INT PRIMARY KEY,
Name VARCHAR(50),
City VARCHAR(50),
Status VARCHAR(2)
)

CREATE TABLE Customers_DEL
(ID INT PRIMARY KEY,
Name VARCHAR(50),
City VARCHAR(50),
Status VARCHAR(2)
)

Enabling CDC for SQL Server Database

The following script enables CDC for the database if it is not enabled and enables CDC for the customers table. It is essential to have a primary key constraint on the table where you enable CDC.

USE SourceDB
GO

-- Check sa has owner_sid permissions, if not then change db owner.
IF (select SUSER_SNAME(owner_sid) from sys.databases WHERE NAME='SourceDB') <> 'sa'
BEGIN
   EXEC sp_changedbowner 'sa'
END
GO
 
-- Check SourceDB is not already CDC enabled, if not then enable CDC for the database
if (select is_cdc_enabled from sys.databases WHERE NAME='SourceDB') = 'false'
BEGIN
   EXEC sys.sp_cdc_enable_db
END
GO
 
-- Check dbo.customers is not already CDC enabled, if not then enable CDC for the table
if (select is_tracked_by_cdc from sys.tables WHERE SCHEMA_NAME(schema_id)='dbo' and  NAME = 'Customers') = 'false'
BEGIN
EXEC sys.sp_cdc_enable_table
   @source_schema = N'dbo',
   @source_name   = N'Customers',
   @role_name     = NULL,
   @supports_net_changes = 1
END
GO

It is important to note that, there is no necessity for tables in the source database to include a column that indicates the date and time of the last modification. This means that no structural changes are needed in order to enable CDC for extraction.

Make sure that SQL Server Agent is running as a SQL Server Agent Job is used to capture CDC data.

SSIS Package for Initial Load of Data

In any data warehouse, there is an initial load which will extract all data to a staging database at the start. This will be a one-time execution. We will use SSIS for extraction using the CDC Control Task.

cdc control flow tasks

In the control flow task of SSIS, we have the following: a CDC Control Task, a Data Flow Task and another CDC Control Task as shown below. We rename the first one as "Mark Initial Load Start" and the last one as "Mark Initial Load End".

Initial Data Warehouse load

Right click on "Mark Initial Load Start" to edit the task. This task will write an encoded value, including the starting LSN to a package variable.

SSIS CDC Control Task Editor

For "Mark Initial Load Start" configure as follows:

  • SQL Server CDC database ADO.NET connection manager: Configured to source database.
  • CDC control operation: Select "Mark initial load start" from the drop down.
  • Variable containing the CDC state: click New and create a new variable named CDC_State.
  • Automatically store state in a database table: selected.
  • Connection manager for the database where the state is stored: Staging database.
  • Table to use for storing state: click New, and then click Run to create the cdc_states table.  Following is the schema of the CDC State table. When the SSIS package is deployed to production, the cdc_state table should be created using the following T-SQL script.
  • State name: CDC_State.

This is the cdc_states table that gets created.  Here is the schema of the table and index.

CREATE TABLE [dbo].[cdc_states] 
 ([name] [nvarchar](256) NOT NULL, 
 [state] [nvarchar](256) NOT NULL) ON [PRIMARY]
GO
 
CREATE UNIQUE NONCLUSTERED INDEX [cdc_states_name] ON 
 [dbo].[cdc_states] 
 ( [name] ASC ) 
 WITH (PAD_INDEX  = OFF) ON [PRIMARY]
GO

The Data Flow Task will extract all rows from the source and load them into the destination which is a staging table in this scenario. This data flow task does not require any CDC specific components.

For "Mark Initial Load End" configure as follows:

  • SQL Server CDC database ADO.NET connection manager: Configured to source database.
  • CDC control operation: Select "Mark initial load end" from the drop down.
  • Variable containing the CDC state: Select the variable named CDC_State from the drop down.
  • Automatically store state in a database table: selected.
  • Connection manager for the database where the state is stored: Staging database
  • Table to use for storing state: Select the cdc_states table which was created before.
  • State name: CDC_State.

When this package is executed all records in the source table are inserted to the Customer_INS table that we created above.  Also, the following get stored in the cdc_states table.

CDC State stored in the dbo.CDC_States table

Update Some Data to See Impact of SSIS Incremental Updates

After extracting the initial load, the next task is to extract the incremental data load. To simulate this, let's execute the following query to add, modify and delete some records.

Before running these queries, the following are the records in the Customer table.

ID Name City Status
1 John New York A
2 Rose Denver A
3 Kent Settle A

Run the following updates.

USE SourceDB
GO

INSERT INTO Customers
(Name,City,Status)
VALUES
('Kerry','Chicago','A'),
('Laurel','Texas','A') ,
('Matt','Vegas','A') 
 
UPDATE Customers 
SET Status = 'I'
WHERE ID  = 5
 
UPDATE Customers 
SET Status = 'I'
WHERE ID  = 1
 
DELETE Customers
WHERE ID =2
 
DELETE Customers
WHERE ID =6
GO

After running the above queries, the following are the records in the table.

ID Name City Status
1 John New York I
3 Kent Settle A
4 Kerry Chicago A
5 Laurel Texas I

If you analyze the records, you can see that record ID 1 is modified and record ID 2 is deleted. Record ID 4 is added, Record ID 5 is added and modified, record ID 6 is added and then deleted. However, when the incremental load is considered we need to consider only the net changes. For example, since Record ID 6 is ultimately deleted, it doesn’t need to be considered for the load.

SSIS Package for Incremental Data Updates

Let’s create the SSIS package to extract incremental data from sources. This is what the package will look like.

SSIS package to extract incremental data from sources

Truncate Staging Tables is an Execute SQL Task to delete previously inserted data in the staging tables, so that each time the staging tables get a fresh set of data.

Get Processing Range is a CDC Control Task like we configured before, the only difference is that we use Get processing range for the CDC control operation.

The next task, Extract Modified Records is a Data Flow Task which will look like this when configured.

Extraction of modified records using a data flow task

These are the CDC tasks in the SSIS Toolbox.

cdc data flow tasks

 

Customer CDC is a CDC Source, configured as below for the customers table.

CDC source is configured

Important configuration is that the CDC processing mode is set to Net in order to capture the net changes rather than capturing all records which is not necessary for data warehousing.

The CDC Splitter, is used to split insert, delete and update records and split records are inserted to the different staging tables Customers_INS, Customers_UPD and Customers_DEL.

To help to see what is going on in the package I enabled the data viewer so records can be verified and checked to make sure the data is correct and they are going to the correct location.

Use the SSIS Data Viewer to verify the records

In the above figure, we can see that record ID 6 is not captured as it is inserted and deleted in one session. This means that net changes are captured which will reduce the complexity and the extraction duration. Since there are no changes to record ID 3, that record is not captured for the extraction.

Finally, the Mark Processed Range CDC control will mark that this set is processed, so the next time the SSIS package is executed, only new records will be extracted and processed.

After getting the data to the different stating tables, the next step of loading the data will be much easier.

Next Steps


Last Updated: 2019-03-21


next webcast button


next tip button



About the author
MSSQLTips author Dinesh Asanka Dinesh Asanka is a 10 time Data Platform MVP and frequent speaker at local and international conferences with more than 12 years of database experience.

View all my tips




Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Email me updates 


Signup for our newsletter
 I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.



    



Learn more about SQL Server tools