Learn more about SQL Server tools



solving sql server problems for millions of dbas and developers since 2006 join MSSQLTips for free SQL Server tips













































   Got a SQL tip?
            We want to know!

Creating a knowledge base and cleansing data using Data Quality Services in SQL Server 2012 - Part 3

MSSQLTips author Arshad Ali By:   |   Read Comments   |   Related Tips: 1 | 2 | 3 | 4 | More > Data Quality Services
Problem

I have heard that SQL Server 2012 has a brand new feature called Data Quality Service (DQS) which greatly helps in ensuring data integrity and quality.  In this tip we will cover how to setup a knowledge base and walk through the data cleansing process using the Data Quality Tool.

Solution

This is third tip of the series. In the first tip I talked about data quality challenges and its source, then I talked about the Data Quality Service in SQL Server 2012 and its processes. In the second tip, I talked about different components of DQS and how to go about installing and configuring DQS and then I talked, how DQS works. In this tip, I am going demonstrate step by step how you can create a knowledge base and do data cleansing.

Creating a Data Quality Knowledge Base (DQKB)

As DQS allows us to create a knowledge base by discovering, building and managing the information or knowledge about the data, we will first create a knowledge base, then we will use that knowledge base for cleansing the data.

In this first example I will keep the knowledge base as simple as possible for clarity and better understanding. So here is the data which I consider is a valid title (employee title) and want to create a knowledge base from it.

Creating a Data Quality Knowledge Base (DQKB)

To create a new knowledge base, connect to the Data Quality Client tool and click on the "New Knowledge base" icon in the left most section and then a screen will appear as shown below. Specify the name of the knowledge base you are creating, next you can specify if you want to create this knowledge base from an existing knowledge base or data file and finally select the activity that you want to perform for this knowledge base, for example, here we will select "Knowledge Discovery" and then click the Create button:

connect to Data Quality Client tool and click on "New Knowledge base" i

On the next screen, the knowledge base creation wizard, we need to provide the data source for the knowledge discovery as well as mapping of a source column with the domain of the knowledge base. Your data source can be a table from a SQL Server database or it can be an Excel file. In the mappings grid, select the column from the data source and click on the "Create a domain" icon on the top to create a domain in the knowledge base.

provide the data source for the knowledge discovery

Clicking on the "Create a domain" icon as shown above will bring up a screen like below, where you can specify the name of the domain, data type and output format (uppercase, lowercase or capitalize); click on the OK button to create the new domain.

 the "Create a domain" icon

In the mapping grid, select the Source Column from the first column of the grid and map it with the domain that we created above. Once you are done with mapping all the source columns with the domains, you can click the Next button on the bottom of the screen to move ahead:

select the Source Column from the first column of the grid and map it with the domain

On the next screen of the wizard (Discover), you need to click on the Start button to analyze the data source for knowledge discovery. It will run for sometime depending on the amount of data that you have at the source and finally the result will be shown as below, click on the Next button to move ahead:

finally the result will be shown as below

On the next screen (Manage Domain Values) of the wizard, you can specify which values from the data source to be considered valid values for the domain that you created or if the value has to be corrected with another value; apart from that you can also specify any new value for the domain as it is extensible:

Manage Domain Values

Once you are done with Domain Value Management, you can click the Finish button which will show a confirmation page as shown below. Here you can either select No to just save the knowledge base or Publish to publish the knowledge base so that it can be used by the data quality project for data quality improvement.

we have created and published a knowledge base by using the data discovery technique

So far we have created and published a knowledge base by using the data discovery technique and now that knowledge base is available for use in data quality projects for data cleansing.

Creating a Data Quality Project

So here we have data coming from the data source and we need to do data cleansing. If you see below, there are some values that are incorrect or not complete. Our expectation from this data cleansing is that these incorrect values should be corrected and incomplete values should be made complete. Let's run through an example to see it in the action.

Creating a Data Quality Project

Launch the Data Quality Client tool and connect to the data quality services server and click on New Data Quality Project, which will bring up a screen like this. Here you need to specify the name of the data quality project, a description and the knowledge base which will be used in this project for data cleansing/matching. Selecting a knowledge base will display the domains from the knowledge base in the tree view on the right side, select Cleansing from the activity section. Finally click the Create button to create the data quality project:

Launch Data Quality Client tool and connect to the data quality services server

On the next screen of the data quality project creation wizard, specify the data source (it could be SQL Server or an Excel file) and map the source column with the domain from the knowledge base and finally click the Next button to move ahead:

specify the data source (it could be SQL Server or Excel file)

On the next screen of the data quality project creation wizard, click on Start and the wizard will start analyzing and cleansing the data. It will take some time depending on the amount of data that you are trying to analyze/cleanse. The result summary will be shown as below in the profiler section of the page:

the wizard will start analyzing and cleansing the data

On the next screen of the data quality project creation wizard, you can see all the values (Correct values, Corrected values, Suggested values and New values); for the suggested/new/corrected values you can either Approve or Reject and finally click on the Next button to move ahead. You can also update your knowledge base from here if you get some new values (found during data cleaning of the data from the data source) so for future runs this data can be used for cleansing.

 you can see all the values (Correct values, Corrected values, Suggested values and New values)

 You can also update your knowledge base back from here

On the final screen of the data quality project creation wizard, you can export the cleansed data (after your approval) to either a database table or a csv file by clicking on the Export button.

export the cleansed data (after all approval)

In the above screen, I am exporting the cleansed data to a csv file which looks like this. As you can see below all the incorrect and incomplete data has been fixed or cleansed.

all the incorrect and incomplete data has been fixed or cleansed

In this tip, I demonstrated step by step how you can create a knowledge base and do data cleansing from a data source using the interactive mode in the Data Quality Client tool. In the next tip, I will show how you can use a SSIS transformation component to do data cleansing in a batch/automated mode.

Next Steps


Last Update: 5/24/2012


About the author
MSSQLTips author Arshad Ali
Arshad Ali is a SQL and BI Developer focusing on Data Warehousing projects for Microsoft.

View all my tips


print tip Print  
Become a paid author





join MSSQLTips for free SQL Server tips     



Learn more about SQL Server tools
Post a comment or let the author know this tip helped you.

       All comments are reviewed, so stay on subject or we may delete your comment.

*Name   *Email Notify for updates



       Note: your email address is not published. Required fields are marked with an asterisk (*)


Get free SQL tips:

*Enter Code refresh code     





 
Sponsor Information







Copyright (c) 2006-2014 Edgewood Solutions, LLC All rights reserved
privacy | disclaimer | copyright | advertise | about
authors | contribute | feedback | giveaways | free t-shirt | user groups | community | events | first timer?
Some names and products listed are the registered trademarks of their respective owners.