Capturing and Logging Data Load Errors for an SSIS Package



Problem

In an ETL solution, error logging is a fundamental requirement and is useful at various stages of the ETL execution life-cycle. Let's consider a scenario where the requirement is to insert records into a SQL Server table. The package should attempt to load all the records, and for whichever records fail, the error details reported by the database engine should be logged. We will look at how to implement this in an SSIS package.

Solution

From a high-level view, a data load goes through three main phases in the ETL execution life-cycle.

  1. When data is being extracted (i.e. read) from source systems
  2. When data is being transformed
  3. When data is loaded to the target systems

In the first phase, there can be errors while establishing a connection with the source systems, or the data read from the source might not match the mappings defined in the SSIS package. Several other kinds of errors can also occur during this phase.

In the second phase, when data is being transformed, the main category of error is one that occurs while manipulating the data. We are not considering errors caused by hardware or memory failures, as those categories of errors can occur at any phase of the ETL life-cycle.

In the final phase, when data is loaded into the target system, error logging is required at a very detailed level because there can be many reasons why loading a particular record failed. After data crosses the SSIS boundary and is handed over to the database driver, the database engine takes control of loading the data. If the data violates the criteria defined by the entity that stores it, an error message is generated and returned. Each target system has its own mechanism for reporting and translating the error.

For example, if one attempts to insert a record into a table and the record violates a primary or foreign key constraint, that record will certainly fail. Support teams who maintain the ETL solution want to know the cause of each and every record failure, along with supporting details that help them clearly understand the reason for the failure. One way to deal with this is to log the error message reported by the database engine itself into the error log. Most of the time, the reason the data manipulation failed becomes apparent from the error message reported by the target system.


Follow the steps below to develop a solution that deals with the problem in question.

1) Create a new SSIS project and name it something relevant. Rename the default package to something relevant as well.

2) This tip assumes the reader has the AdventureWorks sample database installed on their development machine. We will be using the Address table for our exercise. Create a new table named AddressClone with two columns: AddressID and City. Make AddressID the primary key column. Our purpose is to load the same records into the table twice and check whether the primary key errors reported by the database engine are captured by our SSIS package and reported in the error table that we will design in the following steps.
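A table definition along these lines will do (a minimal sketch; the City length mirrors the Person.Address.City column in AdventureWorks, so adjust it if your copy differs).

CREATE TABLE [dbo].[AddressClone]
( [AddressID] [int] NOT NULL PRIMARY KEY,
[City] [nvarchar](30) NULL )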

3) Use the below script to create a new error table to log the errors that we would capture from the database.

CREATE TABLE [dbo].[ErrorLog]
( [ErrorID] [int] IDENTITY(1,1) NOT NULL,
[ErrorRow] [xml] NULL,
[ErrorMessage] [varchar](1000) NULL,
[ErrorIssuingPackage] [varchar](50) NULL,
[ErrorSourceTable] [varchar](50) NULL,
[ErrorDestinationTable] [varchar](50) NULL,
[ErrorLoggingTime] [datetime] NULL )

4) Add a new Data Flow task to the package. Edit the Data Flow task and add an OLE DB Source adapter. Configure the adapter to read the top 10 rows from the Address table, as shown below.

(Screenshot: configuring the OLE DB Source to read the top 10 rows from the Address table in an SSIS package)
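If you prefer to configure the source with a SQL command instead of selecting the table, a query along these lines returns the same ten rows (the Address table is in the Person schema of AdventureWorks):

SELECT TOP 10 [AddressID], [City]
FROM [Person].[Address]
ORDER BY [AddressID]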

5) After the OLE DB Source adapter, add a Script transformation. When you add it to the package, you will be prompted with three options asking whether you want it to act as a Source, Transformation, or Destination. Select Destination.

6) Add a new ADO.NET connection to the package and configure it to connect to the AdventureWorks database.

7) Configure the Script transform to use the connection we created above on the Connection Managers page of the Script Transformation Editor, as shown below. Also make sure that the AddressID and City columns are available and checked on the Input Columns page.

(Screenshot: the Script Transformation Editor page used to configure the connection)

8) Edit the Script transform and add the following code. We first acquire a connection in the AcquireConnections method. In the PreExecute method we prepare the command object and configure it with the required parameters. In the ProcessInputRow method, which is executed for each record, we assign the values from the record to the respective parameters and then execute the command, which inserts the record into the target table.

Exception handling is done at this level using a try-catch block. If an error is reported by the database engine, it is captured by the catch block. Because we are only interested in errors reported by the database engine, we make the exception type specific by declaring the exception parameter "e" as type SqlException. When an exception is caught, we prepare another command, assign values that align with the columns of our error table, and insert the error details into the ErrorLog table we created in Step 3.
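A minimal sketch of such a script is shown below. It assumes the ADO.NET connection manager is exposed to the script component under the name Connection, that the input is the default Input 0 with the AddressID and City columns checked, and that the package is named ErrorLoggingDemo; these names are illustrative, so adjust them to match your own package.

using System;
using System.Data;
using System.Data.SqlClient;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    private SqlConnection connection;
    private SqlCommand insertCommand;
    private SqlCommand errorCommand;

    public override void AcquireConnections(object Transaction)
    {
        // "Connection" is the name given to the ADO.NET connection manager
        // on the script component's Connection Managers page (assumed).
        connection = (SqlConnection)Connections.Connection.AcquireConnection(Transaction);
    }

    public override void PreExecute()
    {
        base.PreExecute();

        // Command that attempts to load each record into the clone table.
        insertCommand = new SqlCommand(
            "INSERT INTO dbo.AddressClone (AddressID, City) VALUES (@AddressID, @City)",
            connection);
        insertCommand.Parameters.Add("@AddressID", SqlDbType.Int);
        insertCommand.Parameters.Add("@City", SqlDbType.NVarChar, 30);

        // Command that logs database errors into the ErrorLog table from Step 3.
        errorCommand = new SqlCommand(
            "INSERT INTO dbo.ErrorLog (ErrorRow, ErrorMessage, ErrorIssuingPackage, " +
            "ErrorSourceTable, ErrorDestinationTable, ErrorLoggingTime) " +
            "VALUES (@ErrorRow, @ErrorMessage, @ErrorIssuingPackage, " +
            "@ErrorSourceTable, @ErrorDestinationTable, GETDATE())",
            connection);
        errorCommand.Parameters.Add("@ErrorRow", SqlDbType.Xml);
        errorCommand.Parameters.Add("@ErrorMessage", SqlDbType.VarChar, 1000);
        errorCommand.Parameters.Add("@ErrorIssuingPackage", SqlDbType.VarChar, 50);
        errorCommand.Parameters.Add("@ErrorSourceTable", SqlDbType.VarChar, 50);
        errorCommand.Parameters.Add("@ErrorDestinationTable", SqlDbType.VarChar, 50);
    }

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        try
        {
            // Assign the current row's values to the parameters and attempt the insert.
            insertCommand.Parameters["@AddressID"].Value = Row.AddressID;
            insertCommand.Parameters["@City"].Value = Row.City;
            insertCommand.ExecuteNonQuery();
        }
        catch (SqlException e)
        {
            // The database engine rejected the row; log its message and the failed row.
            // (City values in this sample are simple; escape XML special characters for general use.)
            errorCommand.Parameters["@ErrorRow"].Value =
                string.Format("<row AddressID=\"{0}\" City=\"{1}\" />", Row.AddressID, Row.City);
            errorCommand.Parameters["@ErrorMessage"].Value = e.Message;
            errorCommand.Parameters["@ErrorIssuingPackage"].Value = "ErrorLoggingDemo"; // assumed package name
            errorCommand.Parameters["@ErrorSourceTable"].Value = "Person.Address";
            errorCommand.Parameters["@ErrorDestinationTable"].Value = "dbo.AddressClone";
            errorCommand.ExecuteNonQuery();
        }
    }

    public override void ReleaseConnections()
    {
        // Hand the connection back to the connection manager.
        Connections.Connection.ReleaseConnection(connection);
    }
}

Logging the failed row as a small XML fragment keeps the ErrorRow column generic, so the same pattern can be reused for destination tables with different column lists.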

9) Execute the package for the first time and there should be no errors; ten records should get inserted into the AddressClone table. Now execute the package again and check the error table, and you should find the error details as shown below.

(Screenshot: error messages reported by the target system captured in the ErrorLog table)
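To check the results of the two runs from a query window, something like the following will do:

SELECT COUNT(*) AS RowsLoaded FROM [dbo].[AddressClone]

SELECT * FROM [dbo].[ErrorLog]
ORDER BY [ErrorLoggingTime] DESC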

In this article, we saw how to capture error messages reported by the target system while loading data from SSIS. For the sake of this demonstration, we took SQL Server as the target system, and saw how we can capture errors for each record that failed to load into the database. This technique of error capture is very effective when the primary requirement is to attempt loading each record and also capture errors for each failed record.

Next Steps
  • Try implementing this for a non-relational target system like Excel.
  • Test whether SqlException works to catch the exceptions for your target system, or whether you need to change the exception type to capture the error
  • Read more tips about SSIS





About the author
Siddharth Mehta is an Associate Manager with Accenture in the Avanade Division, focusing on Business Intelligence.

This author pledges the content of this article is based on professional experience and not AI generated.




Comments For This Article




Monday, March 5, 2012 - 10:43:59 AM - Ralph Wilson

I realize that I have come late to the party but, since this just showed up in a SQLTIP email, I think my contribution may be worthwhile.

I, too, share the concern over the performance impact of handling things RBAR; however, there is a way around both that and the issue of input data coming in from columns that are larger than the table's columns.

Since I have often had to deal with excess column lengths from input data, I have gotten into the habit of using both the TRIM functions and the LEFT function, not only to remove excess spaces from either end of the string but also to reduce the string to the appropriate size.  Of course, this assumes that you have Business Rules that allow you to do this sort of thing.  (E.g. a Business Rule indicating that "The left-most X characters will be inserted into the [name of column] column.")

As for the RBAR performance issue, you have to think through the fact that, when there is a failure in the Data Flow Task, the entire batch of rows will be routed to the error handling, along with the Error Code and the Error Message.  That means that, if you are handling the entire set of source data and you route the error rows to an Exceptions table, then the entire set of input data will be routed to that table . . . which is only marginally useful.  However, if you change the size of the batches of rows being handled, then you will change the number of rows that are routed to the Exceptions table, e.g. if you are loading 1,000,000 rows of data then you may want to set your batches to something along the lines of 1000 rows.  Now, this doesn't solve the problem entirely; however, if you feed the "failed" batch of rows to another batching process that uses a batch size of, say, 100 rows, you can narrow things down a bit more before you go into the RBAR process and handle each row by itself (thus being able to route the individual rows that failed to the Exceptions table).

Also, if you create a copy of your target table and remove all constraints from that table (e.g. treat every column as varchar [of sufficient size to handle your "worst case" issues], allow every column to be NULL, remove all uniqueness constraints from any indexes, and also remove any Foreign Key constraints), then you can insert the data into the unconstrained Exceptions table.  Once the problem has been researched and either the data in the row corrected or the row deleted, the data in the Exceptions table can be easily reprocessed to maintain the original target table.


Tuesday, November 23, 2010 - 4:15:30 PM - Niall

I share Nitesh's concern.  In a large ETL process, it seems like the row-by-row insertions would kill performance.  Instead, how about

1) disable constraints

2) bulk insert

It certainly does kill performance, but in my situation, the spec for the input file says (for example) PersonName varchar(12), however the actual input file may sometimes have a field that is longer than that.  This will break the input and fail the insert, regardless of whether it's a bulk insert.  The only other way around it is to make all fields varchar(255), load them into a staging table, check that the values conform to the spec, and then insert them into the destination table. As this is not mission critical to load quickly, but it is mission critical to be accurate, I thought it better to check and handle errors in the input file.  I believe both solutions will probably be about as fast, however.


Wednesday, November 17, 2010 - 3:40:45 PM - Siddharth Mehta

Gentlemen,

The purpose of this article is to demonstrate how to catch errors returned by the database engine. It is solely up to the developer how one wants to record them in the target destination. You might just want to log them to the Windows Event Log, or collect them in a database or store them somewhere else. I am not recommending always using it, but in cases where you have destinations like Excel / Oracle or other non-SQL Server destinations, or even a mix of different data destinations, the support team might want a closer and more exact look at the errors returned by the DB engine, especially during the testing phase. One choice can be to implement this during the testing phase, and once the code is released to production and the support team is trained, this mode of recording DB engine errors can be retired. Remember, it's just a prototype that has been demonstrated on SQL Server.

1) A row by row operation comes into the picture when every row fails. In a typical ETL load, if you expect even 10% failures (which is again too high), then in a load of 10,000 rows, 1,000 rows would get logged. DW operations are expected to be very precise and have to have less than 5% failures, and data cleansing / data migration operations are one-time activities, so those processes would have fewer failures over a period of time.

2) If we need to check constraints that are known, then we can check them right in the ETL or even pre-ETL, instead of checking them after loading. We are trying to do this error handling for unforeseen errors reported by a variety of DB engines. For example, a foreign key or some new constraint has been added without sufficient impact analysis.

3) To make the script more generic, just wrap the error handling code in a parameterized function. You can make the code react conditionally based on a parameter, and that should make the code generic enough to be used across packages.

I hope this clarifies my viewpoint.


Wednesday, November 17, 2010 - 2:38:10 PM - Roger

I share Nitesh's concern.  In a large ETL process, it seems like the row-by-row insertions would kill performance.  Instead, how about

1) disable constraints

2) bulk insert

3) programmatically check for rows that will violate constraints; copy these to an error table or file and delete them from the original table

4) enable constraints

 

#3 can be done with efficient set operations.


Wednesday, November 17, 2010 - 12:43:26 AM - Nitesh Rai

Will it be a row by row insertion? If yes, then how about the performance of the package in case we have a huge amount of data?


Tuesday, November 16, 2010 - 2:30:01 PM - Tom Bakerman

Could you please provide some comments in the C# script?  While I can read the code and figure it out easily enough, I don't know, for example, which of those methods are required by the SSIS integration.

 

Can the error script be more generic, and shared among packages, or do you have to write a separate script for each package?


Monday, November 8, 2010 - 4:37:45 PM - Niall

Importing from a flat file.

 

I recently had a situation where I have 41 flat files to be read in, however the defined file structures didn’t always match the data.   (Integers with alpha characters in a field defined as INT, CSV files with different numbers of fields etc).   Not much I could do about this except figure out a way to handle it.

 

Handling the CSV files was easy, as someone here had already programmed a solution in another SSIS package, so I just copied that.   Basically, a table in the DB stores the file name and the maximum number of fields that are expected in that file.   A script task reads the file, and for each line, counts the number of commas – if there are too few, it adds commas until it reaches the maximum, then outputs that line to another file which becomes the actual input file used.

 

Handling fixed length or CSV/TSV/PSV files where the data contained in the columns did not match the column definition however was a bit more difficult.   I decided to use the Flat File Source Error Output, which retrieves the entire line (as a string) which you can output to an error table.   I add a derived column to record the Load Number (LoadID) and the relevant task name.  Each task is numbered, so I will easily be able to find the offending task.

Ok, now that handles the input file, but what if dropping the data into the table causes an error?  That's handled by an OLE DB Destination Error Output which automatically adds ErrorCode and ErrorColumn to the dataset.  Each field is defined as a varchar in my error table, and the ErrorCode and ErrorColumn are integers.  It's easy to find out which column is the offender, as it's all recorded in the XML for the DTSX.  Just do a find on the number contained in the ErrorColumn column.

 

Easy....

 

(I would've posted pics of the data flow, but don't have anywhere to put them)














