Learn more about SQL Server tools



solving sql server problems for millions of dbas and developers since 2006 join MSSQLTips for free SQL Server tips













































   Got a SQL tip?
            We want to know!

Getting Started with Data Quality Services of SQL Server 2012 Using SSIS - Part 4

MSSQLTips author Arshad Ali By:   |   Read Comments (3)   |   Related Tips: 1 | 2 | 3 | 4 | More > Data Quality Services
Problem

In my earlier tips, I talked about the Data Quality Service (DQS) in SQL Server 2012, which helps ensure data integrity and quality by data profiling, matching, cleansing, correcting and monitoring overall status of the data cleansing process. We also talked about using the interactive Data Quality Client tool and in this tip we will walk through how to do data cleansing in a automated mode. 

Solution

Data Quality Service (DQS) has a SSIS component which you can use for automating the data cleansing process or if you want to run it in batch mode (versus interactively in the Data Quality Client tool).

Before you can start using the Data Cleansing component in SSIS, you need to make sure you have already created and published a knowledge base for cleansing your source data. To learn more about the knowledge base or knowledge discovery and the process of creating a knowledge base, refer to my earlier tip in this series.

Please note, this data cleansing component is a SSIS component which can be installed and run on the same machine or on a different machine than the DQS server. This component works like a client and requires DQS Server for performing data cleansing operations. To learn more about DQS installation and configuration, please refer to my first tip in this series.

As you can see below, the SSIS component reads data from the source, sends it to the DQS Server for data cleansing and correction and writes the output to the destination. It also includes additional columns to indicate whether the value was already correct, was corrected by DQS, if any other value was suggested or the value was an invalid/unknown value.

the SSIS component reads data from the source
(Image source is from Gadi Peleg's presentation available here)

GGetting started with the DQS cleansing component in SSIS

I am assuming you have basic understanding of SSIS and you are aware of how to create a simple package in SSIS, if not please refer to this tutorial.

CCreate a SSIS package and drag a data flow task from the Toolbox on to the Control Flow; double click on data flow task to open the task in data flow pane. In the Toolbox of the data flow pane you will notice a new component (DQS Cleansing) appears as shown below:

DQS Cleansing appears as shown

Before we begin using this component, we need to first select/specify the data source which needs to be cleansed, in my case I have a csv file (incorrect values have been highlighted here for clarity) with this data, which I want to cleanse, as shown below:

we need to first select/specify the data source which needs to be cleansed

In this example I am going to use the same knowledge base (TitleDomain) which I created in my last tip of this series and which has these entries, as shown below, for valid Title domain:

I am going to use the same knowledge base (TitleDomain)

Now coming back to the data source for data cleansing, I have defined a flat file connection manager as shown below, for the above "csv" file:

I have defined a flat file connection manager

Now drag a Data Cleansing component from the Toolbox on to the data flow pane, drag the green arrow icon to the component from the flat file source and right click -> Edit on the Data Cleansing component as shown below:

drag Data Cleansing component from the Toolbox on to the data flow pane

In the Connection Manager page of DQS Cleansing Transformation Editor, you need to either specify or create a new connection manager (a new type of connection manager introduced with this release to connect to DQS Server) that connects to DQS Server where you have created the knowledge base for use, you can click on the Test Connection button to verify the connection is valid:

the Connection Manager page of DQS Cleansing Transformation Editor

Once connected to the DQS Server, the data quality knowledge base combo-box will display all the published knowledge bases available on the DQS Server. Select the appropriate one and then go to the Mapping page:

the Data quality knowledge base Combo-box will display all the published knowledge bases available on the DQS Server

On the Mapping page, select the source/input columns to be cleansed and select the appropriate domain (will be contained in the knowledge base which you selected in last step) for each column in the bottom grid as you can see below.  I want to cleanse the Title column of the source and hence I have selected the Title domain (available in the knowledge base) to be used for cleansing the incoming titles:

select the source/input columns to be cleansed

The DQS Cleansing component adds a column called [Record Status] which indicates what operation has been taken by the DQS Server for each value. For example, if the value was already correct or corrected by DQS; if not then if there is any suggestion or it is invalid for the domain or an unknown value which is not in the domain and does not match with the domain rules:

The DQS Cleansing component adds a column called [Record Status]

You need to use a Conditional Split Transformation for splitting the output from the DQS Cleansing component based on the values of the [Record Status] column as shown below. Correct and Corrected values can be combined together (using Union All Transformation) and written directly to the destination whereas Autosuggested, Invalid or Unknown values can be combined and written to an intermediate table, so those values can be manually fixed before writing to the destination (or the knowledge base can be enhanced to include these additional values or updated for existing values so that manual fixing can be minimized):

You need to use Conditional Split Transformation for splitting the output from the DQS Cleansing component

When you run the above package, you will notice that 10 records come from the source, out of the 10 records, five are correct, one gets corrected, two are autosuggested and two values are unknown as shown below:

you will notice that 10 records are coming from the source

In this tip, I demonstrated how you can use the SSIS transformation component to do data cleansing in a batch/automated mode. If you want to learn more about creating a knowledge base and doing the cleansing in an interactive mode using the Data Quality Client tool, refer to my earlier tips in the series.

Please note, the details and examples in this tip are based on SQL Server 2012 CTP 3, there might be some changes in the final release of the product.

Next Steps


Last Update: 6/6/2012


About the author
MSSQLTips author Arshad Ali
Arshad Ali is a SQL and BI Developer focusing on Data Warehousing projects for Microsoft.

View all my tips


print tip Print  
Become a paid author





join MSSQLTips for free SQL Server tips     



Learn more about SQL Server tools
Comments and Feedback:
Monday, August 27, 2012 - 10:07:52 PM - IanR Read The Tip

Hi Arshad. Thank you for a very informative series of articles. I have a few questions. Firstly, in a Data Quality Project, if a record gets flagged as New (meaning DQS can't find it in the KB) you have the opportunity to correct it and check the Approve button. Is there a way to get this combination added to the KB automatically so it will autocorrect next time, or do I have to add that in separately?

Secondly, on the Export screen (screen 4) the Output Data Preview contains all records, including ones I have tagged as Invalid. So when I export I get invalid records in my output. Is there a way to exclude invalid records from output?

Finally, in my SSIS package I am checking multiple fields per record. The output from the DQS Cleansing transform has a _Status for each field checked. If one field in the record gets a _Status of 'Correct' and another field gets a _Status of 'New', the overall Record Status is 'Correct'. Consequently when I check for Record Status in my Conditional Split transform, invalid records get through. I know I could check the _Status for each field, but if I am checking multiple fields that could get unwieldy. Any idea why the Record Status field doesn't display the worst status in the record rather than the best?

Thanks. Ian.


Monday, June 24, 2013 - 3:21:25 AM - Aravind Read The Tip

Hi Arshad,

I have not involved in DQS project but want to know the requirement template document of DQS.Will you be provide the sample requirement foc of DQS?.Thanks in advance.


Wednesday, June 04, 2014 - 11:33:41 PM - koti Read The Tip

While connecting DQS connection manager it saying that you don't have any access.I tryed so maney times creatig the login  in server side but it finally it saying that you don't have any access please help me on that..



Post a Comment or Question

Keep it clean and stay on the subject or we may delete your comment.
Your email address is not published. Required fields are marked with an asterisk (*)

*Name   *Email Notify for updates



Comments
Get free SQL tips:

*Enter Code refresh code


 
Sponsor Information







Copyright (c) 2006-2014 Edgewood Solutions, LLC All rights reserved
privacy | disclaimer | copyright | advertise | about
authors | contribute | feedback | giveaways | free t-shirt | user groups | community | events | first timer?
Some names and products listed are the registered trademarks of their respective owners.