Learn more about SQL Server tools

mssqltips logo
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Whitepapers          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

 

How to add NULL defense for dates in SSIS packages


By:   |   Last Updated: 2010-06-02   |   Comments (4)   |   Related Tips: 1 | 2 | 3 | More > Integration Services NULL Management

Problem

One of the biggest challenges in a ETL project is faced during data cleansing. NULLs are one of the most unexpected and problematic issues that can make a SSIS package fail. Defensive programming against NULLs is very important, especially when you are dealing with data migration projects where data from OLTP or other source systems cannot always be predicted.

This is a multi-part series of NULL Defense, and in this first part we will look at NULL Defense against dates (datetime data type).

Solution

In order to create a test bed for our NULL Defense series, create a new database ( name it something like TestDB ) which we will use for simulating problems and developing solutions.

Create a new table called NULLDefense and populate it with data exactly as shown in the picture below. To keep the focus on the problem, we would keep the structure simple, but inclined towards the problem for which we will create our NULL Defense.

management studio

Now follow the steps below to create the SSIS package which will simulate the problem in question.

  • Create a new SSIS Project
  • Add a new package to it and name it "NULLDefense".
  • Add a new Data Flow Task to the package.
  • Add an OLE DB Source and name it "NULLDefense - TestDB".
  • Edit the OLE DB Source and connect it to the table we just created.
  • Add a Conditional Split transformation and name it "Split valid and invalid rec". The expected use of this transformation is to split records which contain dates and records which do not contain dates. Edit this transformation as shown in the picture below and also make sure you change the Default output name to "Valid Records" as shown below.
visual studio

Since this is only for demonstration, we are not going to waste time inserting these records, so we will just use two Multicasts for the destinations to analyze the number of records going to each destination.

  • Add a Multicast transform and name this "Invalid Records". This should be tied to the "Dates Unavailable" output from the Conditional Split.
  • Add another Multicast transform and name this "Valid Records" and this should be tied to the "Valid Dates" output from the Conditional Split.

After you have completed these steps, your package should look something like the below picture.

visual studio

Now if you take a look again at the data we have inserted into our table, we should have two records on the invalid output side and one record on the valid side as there are two records which do not have dates. The reason why we inserted a record with NULL and one record with '' is this is one of the scenarios that you will find in real life projects. Applications insert data into OLTP systems and many times application developers insert data using '' as a placeholder for no date instead of using a constant for NULL from the data access libraries (something like SQLDBNULL) which sends a NULL value to the database instead of a blank value. But from a user or business point of view any date that is not available is NULL, though blank and NULL might carry a different meaning and behavior from a technical standpoint.

Execute the package and you will find two records on the valid side and only one on the invalid side. You would find the same result even if you put a condition using the LEN function to measure the length of the value in the DOB field.

visual studio

Go back to SSMS and view the data in the NULLDefense table we just created and you will find the reason for this. Whenever a blank value is entered inside a datetime field, when you run a query it will return a default value of "1900-01-01 00:00:00.000" as can be seen in the below picture.

null defense

We now move towards the solution for this.

I would not consider any options of making changes at the database level as we might not have the liberty of changing anything for our ETL solution as many other applications might be depending on the existing settings of the database.

Coming to the next alternative, we can place additional checks at the database level to ensure that the value is neither NULL nor the default value, but this kind of solution cannot be considered for a real-time solution as there might be numerous functions and conditions operating on this field and everywhere a dual check or replacement function would be required to fix this issue.

Finally, the option that I recommend is to bring uniformity in perception of NULL to our package and feeding only data that is either NULL or non-NULL to the package. This can be done through a very simple step at the database level itself, without changing any existing database settings, values or logic.

For our test package, just create a new "view" as shown in the picture below where we replace the default value with NULL.

management studio

Now change the connection of the source in our package to this view and execute the package. The results should now be as expected, which can be seen in the below figure.

design

Datetime is one special datatype which can trick you with NULLs and we just developed a defense against NULL date values.

Next Steps
  • Try to size the efforts that you would need to patch this NULL or blank issue of dates if you would be fixing this in your existing SSIS packages.
  • Implement this solution and compare the benefit and issues of implementing it at the database level instead of at the package level.


Last Updated: 2010-06-02


next webcast button


next tip button



About the author




Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Email me updates 


Signup for our newsletter
 I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.



    



Tuesday, February 15, 2011 - 6:10:33 PM - JiriJ Back To Top

Can I use in SSIS  derived column expression like:
ISDate ([Date collected]) ? [Date collected] : NULL(DT_DBDATE)

check existing column (txt file) for valid date - if Date Collected value is valid date - load the date, else replace with NULL

I know that ISDATE function does not exist in SSIS - is there a function ISDate(value) like ISNULL?
Thanks,
Jiri 

 


Saturday, July 10, 2010 - 7:28:09 AM - Mithun Back To Top

Thanks for reply.

I understand what you mean to say. Also I think a constrsin on the column of the table alos will do a better solution.


 


Saturday, July 10, 2010 - 6:05:50 AM - siddhumehta Back To Top

Agreed on your point.

This is one of the common issues that arise due to careless coding on the application side or lack of tighter control on the DB side.

You run with the risk of losing such records, but the impact is minimum and the same can be re-processed again after a re-conciliation with the backoffice. This is more of a data cleansing issue, and it's a standard process. First you isolate the ambiguous and/or bad records, and then you get those records corrected.

The options are

1) Process all bad records to avoid the risk of losing few good ones,

2) Isolate and reprocess a few good records for the sake of correcting all bad records.


Saturday, July 10, 2010 - 4:34:30 AM - Mithun Back To Top

 Hi 

Thanks for the topic. I have one doubt here, as we are using the case statement what we are doing is 

CASE when "1900-01-01" then NULL

My doubt is if the table has a record with the DOB is "1900-01-01" we will miss that record right ? That means we are missing a rcord which has a valid DOB .   

In this case how we will do this?

 

 


Learn more about SQL Server tools