How To Implement Batch Processing in SQL Server Integration Services
We routinely load data warehouses with multiple years worth of fact rows at a time. We'd like to perform this process in batches and be able to restart at the point of failure when an error occurs. Can you give us an example of how we might implement this batching capability in an SSIS package?
SSIS supports batch processing very nicely with the existing components in the Toolbox. A simple approach to implementing batch processing in SSIS is to come up with a way to group the rows to be processed into batches, process each batch, then update each group as processed. Let's begin by describing a scenario then implement an SSIS package to perform the work.
A common requirement in developing reporting applications is to aggregate data allowing report queries to run very quickly. Let's assume that we want to aggregate data by month. We also want to have the capability to make adjustments to the aggregated data and only recalculate the month(s) that have adjustments.
We can envision an SSIS package with the following steps:
- Get Batch List is an Execute SQL task that groups the source data to be processed into batches, creating a result set that contains a single row per batch
- Process Batch Loop is a Foreach Loop container that iterates over the result set rows; i.e. executes once for each row in the result set
- Transaction Container is a Sequence container that contains the tasks to be executed for each iteration of the loop; it controls the transaction used to commit if successful or rollback on error
- Append Batch to Sales History is an Execute SQL task that extracts a batch of rows and inserts them to a history table
- Compute Aggregation is an Execute SQL task that performs the aggregations on the batch and updates an aggregation table
- Mark Batch as Processed is an Execute SQL task that updates rows in the source table to indicate that they have been processed
In the following sections we will discuss each step in the SSIS package in detail. Let's begin with the setup then proceed through the steps.
For simplicity sake we'll get our source data from the AdventureWorks sample database that comes with SQL Server 2005. Use the following script to copy the SalesOrderHeader and SalesOrderDetail tables from AdventureWorks into a database called mssqltips (create this database if it doesn't exist):
USE mssqltips GO SELECT * INTO dbo.imp_SalesOrderHeader FROM AdventureWorks.Sales.SalesOrderHeader SELECT * INTO dbo.imp_SalesOrderDetail FROM AdventureWorks.Sales.SalesOrderDetail ALTER TABLE dbo.imp_SalesOrderHeader ADD Processed bit not null default 0 GO
The Processed column will be updated to 1 as the rows are processed.
In the SSIS package the following variables will be used:
We'll describe the variable usage in the sections below.
Get Batch List
Get Batch List executes a stored procedure that groups the source data into batches. While there are many ways to accomplish this task, in this case we simply group on year and month in the stored procedure stp_CreateOrderBatchList:
SELECT DATEPART(YYYY,OrderDate) OrderYear ,DATEPART(MONTH,OrderDate) OrderMonth FROM dbo.imp_SalesOrderHeader WHERE Processed = 0 GROUP BY DATEPART(YYYY,OrderDate) ,DATEPART(MONTH,OrderDate) ORDER BY DATEPART(YYYY,OrderDate) ,DATEPART(MONTH,OrderDate)
Note that the stored procedure only gets rows where the Processed column is equal to zero.
The Execute SQL task in our SSIS package executes the above stored procedure. The General properties page is shown below:
Note the ResultSet property is set to Full result set. The stored procedure just does a SELECT and the Execute SQL task stores the result set in a package variable. The Result Set properties page maps the result set to the package variable User::v_BatchList. The variable type must be System.Object. The Result Name of 0 (zero) is required.
Process Batch Loop
Process Batch Loop is a Foreach Loop container that iterates over the result set created in Get Batch List, one time for each row in the result set. There are two property pages to be configured - Collection and Variable Mappings. The Collection property page has the following settings:
In order to iterate through the result set created by Get Batch List, the Enumerator is set to Foreach ADO Enumerator and the ADO object source variable is set to User::v_BatchList. Get Batch List mapped the User::v_BatchList variable to the result set. The Enumeration mode is set to Rows in the first table (there is only one table in the result set).
The Variable Mappings property page has the following settings:
The stored procedure executed in Get Batch List returns a result set that has two columns - OrderYear and OrderMonth. The Variable Mappings property page maps the columns in each row of the result set to the package variables based on the ordinal position of the column (the first column is 0).
The Transaction Container is a Sequence container. The tasks inside of the container are all executed in a transaction. They either all succeed and are committed or they are rolled back on error. Set the TransactionOption property of the Sequence container to Required; this setting executes all tasks inside the container in the context of a transaction. A new transaction is created each time through the loop.
Append Batch to Sales History
Append Batch to Sales History is an Execute SQL task that calls a stored procedure to extract a single batch of data from the source table and append it to the sales history table. If transformations were required we would use a Data Flow task. The sales history table and stored procedure are as follows:
CREATE TABLE dbo.SalesHistory ( OrderYear int not null, OrderMonth int not null, ProductID int not null, OrderQty smallint not null, LineTotal money not null )
CREATE PROCEDURE dbo.stp_AppendSalesHistory @OrderYear int ,@OrderMonth int AS BEGIN SET NOCOUNT ON; INSERT INTO dbo.SalesHistory ( OrderYear ,OrderMonth ,ProductID ,OrderQty ,LineTotal ) SELECT DATEPART(YYYY,m.OrderDate) ,DATEPART(MONTH,m.OrderDate) ,d.ProductID ,d.OrderQty ,d.LineTotal FROM dbo.imp_SalesOrderHeader m JOIN dbo.imp_SalesOrderDetail d ON d.SalesOrderID = m.SalesOrderID WHERE Processed = 0 AND DATEPART(YYYY,m.OrderDate) = @OrderYear AND DATEPART(MONTH,m.OrderDate) = @OrderMonth END GO
Note that the stored procedure only gets rows where the Processed column is equal to zero.
The General property settings for the Execute SQL task are as follows:
The Parameter Mapping property settings for the Execute SQL task are as follows:
In the above settings the SQLStatement is set to execute the stored procedure, with placeholders for the required parameters. SSIS package variables are mapped to the parameters based on the ordinal number of the parameters in the stored procedure.
Compute Aggregation is an Execute SQL task that recalculates the summary data in the sales history summary table for the order year and order month batch being processed. The sales history summary table and stored procedure are as follows:
CREATE TABLE dbo.SalesHistorySummary ( OrderYear int not null, OrderMonth int not null, ProductID int not null, OrderQty smallint not null, LineTotal money not null )
CREATE PROCEDURE dbo.stp_CalcSalesHistorySummary @OrderYear int ,@OrderMonth int AS BEGIN SET NOCOUNT ON; DELETE FROM dbo.SalesHistorySummary WHERE OrderYear = @OrderYear AND OrderMonth = @OrderMonth; INSERT INTO dbo.SalesHistorySummary ( OrderYear ,OrderMonth ,ProductID ,OrderQty ,LineTotal ) SELECT OrderYear ,OrderMonth ,ProductID ,SUM(OrderQty) ,SUM(LineTotal) FROM dbo.SalesHistory WHERE OrderYear = @OrderYear AND OrderMonth = @OrderMonth GROUP BY OrderYear ,OrderMonth ,ProductID END GO
The above stored procedure first deletes any rows in the summary table for the order year and month being processed, then performs the aggregation and insert. The Execute SQL task property settings are the same as in Append Batch to Sales History except for the name of the stored procedure to execute; we'll skip showing the screen shots.
Mark Batch as Processed
Mark Batch as Processed is an Execute SQL task that updates the Processed column in the source table for the rows that have been processed in the current batch. It invokes the following stored procedure:
CREATE PROCEDURE dbo.stp_MarkOrdersProcessed @OrderYear int ,@OrderMonth int AS BEGIN SET NOCOUNT ON; UPDATE dbo.imp_SalesOrderHeader SET Processed = 1 WHERE DATEPART(YYYY,OrderDate) = @OrderYear AND DATEPART(MONTH,OrderDate) = @OrderMonth; END GO
The Execute SQL task property settings are also the same as before except for the name of the stored procedure; we will skip the screen shots.
Let's highlight the key points in our sample SSIS package that implements batch processing:
- Group the source data into batches; use the Execute SQL task and create a Full result set which saves the result set in a package variable.
- Iterate over the result set using a Foreach Loop Container.
- Use a Sequence Container to define a transaction and add the appropriate tasks inside the Sequence Container.
- The package design supports restarting the package at the beginning in the event of any error; the Sequence Container commits all work if successful or does a rollback if there are any errors. The source data is flagged when processed so it will not be processed again if the package is restarted.
- Download the sample code and experiment with implementing batch processing in an SSIS package. The sample code was tested using SQL Server 2005 Developer Edition with SP2.
- The Sequence container transaction makes use of the Microsoft Distributed Transaction Coordinator (MSDTC) service. Although this service is on by default, make sure that it is running in your environment. You can check the status of the MSDTC service under Services; the name of the service is Distributed Transaction Coordinator.
About the author
View all my tips