By: Siddharth Mehta | Last Updated: 2010-06-02 | Comments (4) | Integration Services NULL Management
One of the biggest challenges in a ETL project is faced during data cleansing. NULLs are one of the most unexpected and problematic issues that can make a SSIS package fail. Defensive programming against NULLs is very important, especially when you are dealing with data migration projects where data from OLTP or other source systems cannot always be predicted.
This is a multi-part series of NULL Defense, and in this first part we will look at NULL Defense against dates (datetime data type).
In order to create a test bed for our NULL Defense series, create a new database ( name it something like TestDB ) which we will use for simulating problems and developing solutions.
Create a new table called NULLDefense and populate it with data exactly as shown in the picture below. To keep the focus on the problem, we would keep the structure simple, but inclined towards the problem for which we will create our NULL Defense.
Now follow the steps below to create the SSIS package which will simulate the problem in question.
- Create a new SSIS Project
- Add a new package to it and name it "NULLDefense".
- Add a new Data Flow Task to the package.
- Add an OLE DB Source and name it "NULLDefense - TestDB".
- Edit the OLE DB Source and connect it to the table we just created.
- Add a Conditional Split transformation and name it "Split valid and invalid rec". The expected use of this transformation is to split records which contain dates and records which do not contain dates. Edit this transformation as shown in the picture below and also make sure you change the Default output name to "Valid Records" as shown below.
Since this is only for demonstration, we are not going to waste time inserting these records, so we will just use two Multicasts for the destinations to analyze the number of records going to each destination.
- Add a Multicast transform and name this "Invalid Records". This should be tied to the "Dates Unavailable" output from the Conditional Split.
- Add another Multicast transform and name this "Valid Records" and this should be tied to the "Valid Dates" output from the Conditional Split.
After you have completed these steps, your package should look something like the below picture.
Now if you take a look again at the data we have inserted into our table, we should have two records on the invalid output side and one record on the valid side as there are two records which do not have dates. The reason why we inserted a record with NULL and one record with '' is this is one of the scenarios that you will find in real life projects. Applications insert data into OLTP systems and many times application developers insert data using '' as a placeholder for no date instead of using a constant for NULL from the data access libraries (something like SQLDBNULL) which sends a NULL value to the database instead of a blank value. But from a user or business point of view any date that is not available is NULL, though blank and NULL might carry a different meaning and behavior from a technical standpoint.
Execute the package and you will find two records on the valid side and only one on the invalid side. You would find the same result even if you put a condition using the LEN function to measure the length of the value in the DOB field.
Go back to SSMS and view the data in the NULLDefense table we just created and you will find the reason for this. Whenever a blank value is entered inside a datetime field, when you run a query it will return a default value of "1900-01-01 00:00:00.000" as can be seen in the below picture.
We now move towards the solution for this.
I would not consider any options of making changes at the database level as we might not have the liberty of changing anything for our ETL solution as many other applications might be depending on the existing settings of the database.
Coming to the next alternative, we can place additional checks at the database level to ensure that the value is neither NULL nor the default value, but this kind of solution cannot be considered for a real-time solution as there might be numerous functions and conditions operating on this field and everywhere a dual check or replacement function would be required to fix this issue.
Finally, the option that I recommend is to bring uniformity in perception of NULL to our package and feeding only data that is either NULL or non-NULL to the package. This can be done through a very simple step at the database level itself, without changing any existing database settings, values or logic.
For our test package, just create a new "view" as shown in the picture below where we replace the default value with NULL.
Now change the connection of the source in our package to this view and execute the package. The results should now be as expected, which can be seen in the below figure.
Datetime is one special datatype which can trick you with NULLs and we just developed a defense against NULL date values.
- Try to size the efforts that you would need to patch this NULL or blank issue of dates if you would be fixing this in your existing SSIS packages.
- Implement this solution and compare the benefit and issues of implementing it at the database level instead of at the package level.
Last Updated: 2010-06-02
About the author
View all my tips