The amount of data is increasing on a regular basis and so is the responsibility to ensure the quality of data remains intact. The data quality issues include inconsistency, incompleteness, invalidity, inaccuracy, duplicity etc. These data quality issues can arise from different sources like erroneous data entry, corruption in transmission or storage, consolidating data from different sources with different standards and formats, etc. and that can lead to incorrect data analysis, reporting, data mining and hence impact the business functions of the organization.
In SQL Server 2012 a new feature called Data Quality Service (DQS) greatly helps in ensuring data quality by data profiling, matching, cleansing, correcting data and monitoring the overall status of these processes. Let's explore this feature in this tip series.
This is second tip of the series, in the first tip I talked about data quality challenges and its source, then I talked about Data Quality Service of SQL Server 2012 and its processes. In this tip, I am going to talk about different components of DQS and how to go about installing and configuring DQS and then I will talk about how DQS works.
There are three different components of DQS as follow:
Data Quality Services Server - This is the main service/server that provides overall features of DQS and is installed on top of the SQL Server 2012 database engine.
Data Quality Client - DQS client is a tool (user interface) for the DQS administrator, KB editor and KB operator which can be used for knowledge base management, knowledge discovery, matching policy definitions, exploring reference data, creating and maintaining data quality projects as well as to monitor and administer the DQS service. This tool can be deployed on the same machine as the DQS service/server or on a separate client machine.
Integration Services component for data cleansing - This new component is a synchronous data flow transformation component which enables correcting the input data according to domain values, rules, and reference data associated with a designated DQS knowledge base. This is to be used when you want to perform data cleansing operation in batch (automated or as part of data integration) or as part of a SSIS data flow pipeline. To learn more about this feature watch for the next tip in this series.
DQS does not get installed by default when you install SQL Server 2012, you have to explicitly select it on the feature selection page of SQL Server 2012 Installation Wizard as shown below:
Data Quality Services
Data Quality Client
Integration Services (if you want to install the SSIS Component for DQS)
If you notice above, the Data Quality Services can be installed with the Database Engine although the Data Quality Client and Integration Services component can be installed separately on different machines without having the database engine.
After installation if you try to open the Data Quality Client or use the SSIS component it will not work and will fail with the following error. The reason is that with the installation wizard only code bits are installed and then you need to configure DQS explicitly by executing DQSInstaller.exe from the installed path. To learn more about installing and configuring DQS you can refer to this video.
After successful configuration, you will notice new DQS databases created as shown below for storing data related to data quality services.
Post installation you need to perform certain operations like granting DQS roles to intended users and enabling the TCP/IP network protocol (in SQL Server Configuration Manager) if Data Quality Server will be accessed from a remote computer. Click here for more information.
How DQS works
The first thing that you need to do is to create a Data Quality Knowledge Base (DQKB) or you need to use reference data providers to improve the integrity and quality of the data. A DQKB can be created by acquiring the knowledge through data samples or user feedback. The DQKB can be further enriched through a discovery process or manually by user knowledge.
When you launch the Data Quality Client it will prompt you to specify the DQS server name and once you are connected the screen would look similar to the below screenshot (if some or all the options are grayed out it means you do not have sufficient permissions and hence you need to follow the post installation steps as discussed above).
There are three sections of this screen:
Knowledge Base Management - This section gives us an option of creating a new knowledge base or opening an existing knowledge base for further enrichment.
Data Quality Projects - This section gives us an option of creating a new data quality project or opening an existing data quality project. To perform cleansing and matching tasks, a new data quality project has to be created, map the source columns to the domains in the knowledge base and start cleansing. The tool performs cleansing and suggests corrections and suggestions. We have an option to approve/reject the corrections/suggestions and see the final result. For example, EmailAddress should end with "@abc.com" and this what is defined in the knowledge base. During the cleansing process DQS compares an EmailAdress with the knowledge base and corrects/suggests the value accordingly for the invalid data. We can also export the cleansed and matched data to an external file or database for further use.
Data Cleansing - Data cleansing is the process of analyzing the data quality in the data source, and manually applying rules and making changes to the data./li>
Data Matching - Data matching analyzes your source data and clusters matching records. The challenge for the matching system is the degree to which it can understand and overcome the errors and variations that naturally exist in the data and across different data sources. DQS helps you identify and remove duplicate records from your data source in a guided and easy way.
Administration - This section gives us an option for monitoring the data quality activities and configure the server, for example for reference data providers. To learn more about DQS Administration click here.
Getting started with Data Quality Services
Once you are done with installation and configuration of DQS as discussed above, you move ahead with the data cleansing operation. There are basically two steps of working with Data Quality Services as shown below:
Creating knowledge base - Before you start working with Data Quality Services, you first need to create a knowledge base, which is nothing but the knowledge about the data. This knowledge about the data from the knowledge base is used by DQS to identify incorrect, incomplete and invalid data and improve the integrity and quality of the data. The knowledge base can be created using either computer assistance (by knowledge discovery from sample data) or interactive method by manually entering it and hence making it extensible.
The knowledge in the knowledge base is maintained in domains (single column domain or composite domain - combination of two or more single domains which represents the data in efficient way) which will be specific to a data field. You can create domain rules for data validation and standards as well. The data validation means the check to ensure the conformity of the data, for example and email should be a valid email id. The data standardization means converting data to a standard format or representation.
You can also define a matching policy which will be used by DQS to calculate the probability of a duplicate match and de-duplication process.
A knowledge base can also be created by importing the data back from the data quality services project. A knowledge base can be created from scratch or can be based on an existing knowledge base or can be imported from a data file.br>
Creating Data Quality Services Project - Once you are done with creating a knowledge base, you can create a data quality services project for data cleansing. In this project you specify what knowledge base that project will use and the data source to be cleansed (it could be SQL Server and Excel). After analysis, this wizard gives you a list of all correct values, incorrect values, suggested values and new values.
In this tip, I talked about different components of DQS and how to go about installing and configuring DQS and then I talked about how DQS works. In the next tip, I am going demonstrate step by step how you can create a knowledge base and do cleansing of data.