Creating a knowledge base and cleansing data using Data Quality Services in SQL Server 2012 - Part 3
I have heard that SQL Server 2012 has a brand new feature called Data Quality Service (DQS) which greatly helps in ensuring data integrity and quality. In this tip we will cover how to setup a knowledge base and walk through the data cleansing process using the Data Quality Tool.
This is third tip of the series. In the first tip I talked about data quality challenges and its source, then I talked about the Data Quality Service in SQL Server 2012 and its processes. In the second tip, I talked about different components of DQS and how to go about installing and configuring DQS and then I talked, how DQS works. In this tip, I am going demonstrate step by step how you can create a knowledge base and do data cleansing.
Creating a Data Quality Knowledge Base (DQKB)
As DQS allows us to create a knowledge base by discovering, building and managing the information or knowledge about the data, we will first create a knowledge base, then we will use that knowledge base for cleansing the data.
In this first example I will keep the knowledge base as simple as possible for clarity and better understanding. So here is the data which I consider is a valid title (employee title) and want to create a knowledge base from it.
To create a new knowledge base, connect to the Data Quality Client tool and click on the "New Knowledge base" icon in the left most section and then a screen will appear as shown below. Specify the name of the knowledge base you are creating, next you can specify if you want to create this knowledge base from an existing knowledge base or data file and finally select the activity that you want to perform for this knowledge base, for example, here we will select "Knowledge Discovery" and then click the Create button:
On the next screen, the knowledge base creation wizard, we need to provide the data source for the knowledge discovery as well as mapping of a source column with the domain of the knowledge base. Your data source can be a table from a SQL Server database or it can be an Excel file. In the mappings grid, select the column from the data source and click on the "Create a domain" icon on the top to create a domain in the knowledge base.
Clicking on the "Create a domain" icon as shown above will bring up a screen like below, where you can specify the name of the domain, data type and output format (uppercase, lowercase or capitalize); click on the OK button to create the new domain.
In the mapping grid, select the Source Column from the first column of the grid and map it with the domain that we created above. Once you are done with mapping all the source columns with the domains, you can click the Next button on the bottom of the screen to move ahead:
On the next screen of the wizard (Discover), you need to click on the Start button to analyze the data source for knowledge discovery. It will run for sometime depending on the amount of data that you have at the source and finally the result will be shown as below, click on the Next button to move ahead:
On the next screen (Manage Domain Values) of the wizard, you can specify which values from the data source to be considered valid values for the domain that you created or if the value has to be corrected with another value; apart from that you can also specify any new value for the domain as it is extensible:
Once you are done with Domain Value Management, you can click the Finish button which will show a confirmation page as shown below. Here you can either select No to just save the knowledge base or Publish to publish the knowledge base so that it can be used by the data quality project for data quality improvement.
So far we have created and published a knowledge base by using the data discovery technique and now that knowledge base is available for use in data quality projects for data cleansing.
Creating a Data Quality Project
So here we have data coming from the data source and we need to do data cleansing. If you see below, there are some values that are incorrect or not complete. Our expectation from this data cleansing is that these incorrect values should be corrected and incomplete values should be made complete. Let's run through an example to see it in the action.
Launch the Data Quality Client tool and connect to the data quality services server and click on New Data Quality Project, which will bring up a screen like this. Here you need to specify the name of the data quality project, a description and the knowledge base which will be used in this project for data cleansing/matching. Selecting a knowledge base will display the domains from the knowledge base in the tree view on the right side, select Cleansing from the activity section. Finally click the Create button to create the data quality project:
On the next screen of the data quality project creation wizard, specify the data source (it could be SQL Server or an Excel file) and map the source column with the domain from the knowledge base and finally click the Next button to move ahead:
On the next screen of the data quality project creation wizard, click on Start and the wizard will start analyzing and cleansing the data. It will take some time depending on the amount of data that you are trying to analyze/cleanse. The result summary will be shown as below in the profiler section of the page:
On the next screen of the data quality project creation wizard, you can see all the values (Correct values, Corrected values, Suggested values and New values); for the suggested/new/corrected values you can either Approve or Reject and finally click on the Next button to move ahead. You can also update your knowledge base from here if you get some new values (found during data cleaning of the data from the data source) so for future runs this data can be used for cleansing.
On the final screen of the data quality project creation wizard, you can export the cleansed data (after your approval) to either a database table or a csv file by clicking on the Export button.
In the above screen, I am exporting the cleansed data to a csv file which looks like this. As you can see below all the incorrect and incomplete data has been fixed or cleansed.
In this tip, I demonstrated step by step how you can create a knowledge base and do data cleansing from a data source using the interactive mode in the Data Quality Client tool. In the next tip, I will show how you can use a SSIS transformation component to do data cleansing in a batch/automated mode.
- Review Introducing Data Quality Services
- Review Data Quality Services "How Do I?" Videos
- Review SQL Server 2012 related tips
About the author
View all my tips