Data Wrangling in SQL Server for Cleaning, Transforming and Merging Data
Managing and analyzing data with SQL Server poses challenges when working with raw datasets. It becomes necessary to wrangle the data to make sense of it. Data wrangling proves to be a task that is both time-consuming and complex, as it entails data analysis, cleaning, transforming, enriching and merging data from diverse sources.
I will briefly remind you about the data wrangling concept in this article. Then, I will dive into the SQL Server functions used to create a custom data wrangling tool.
Data Wrangling Process: Quick Reminder
Data wrangling refers to cleaning, transforming, and structuring raw data into a format suitable for data analysis and decision making. This includes removing duplicates, dealing with missing data, correcting inconsistencies, and integrating multiple sources to prepare data for analysis so that insights can be gained from it.
Why is it Important?
Raw data can be quite messy, unstructured and filled with a bunch of noisy information such as outliers. But when we perform data wrangling, we clean up complex data sets, making it much simpler to analyze and get valuable insights including data visualizations.
What Differs Between Data Cleaning and Pre-Processing?
When working with data, we have this concept called data pre-processing. It's a big umbrella term covering all the tasks in preparing the data for analysis, including data wrangling. Besides data wrangling, pre-processing may involve integration, transformation, and feature selection. Interestingly, when we talk about "data pre-processing," we usually mean these operations before storing and loading the data into the warehouse. It's like part of the ETL process.
On the other hand, "data wrangling" is more commonly used in the context of ELT. In this case, we perform data cleaning right when we need the data for analysis, and it's consumed directly after that. This approach works well with larger datasets. But people often use both terms interchangeably, and the specific tasks involved can vary depending on the situation.
Data Wrangling in SQL Server
Several T-SQL functions and techniques are commonly used in SQL Server to wrangle data. Let's consider the Stack Overflow 2010 database, assuming we want to perform data-wrangling tasks using the tables provided. Here are a couple of examples:
Date and Time Functions
SQL Server has several built-in functions for working with date and time values, which can be helpful in data wrangling tasks. For instance, you may need to change a date to match your local culture information.
Changing the Date Format
-- MSSQLTips.com SELECT Id, Title, CONVERT(varchar, CreationDate, 107) AS FormattedCreationDate FROM Posts;
We use the CONVERT function in this query to change the date format. The third argument, 107, specifies the format style we want to use, representing "MMM dd, yyyy." The result will be stored in a new column called "FormattedCreationDate." You can customize the format by choosing a different style code according to your preference.
Extracting Date Parts
We want to extract the year, month, and day from the "Date" column in the "Badges" table. We can use the following SQL query:
-- MSSQLTips.com SELECT Id, Name, UserId, YEAR(Date) AS Year, MONTH(Date) AS Month, DAY(Date) AS Day FROM Badges;
This query will return the badge information and separate columns for the year, month, and day parts extracted from the "Date" column.
Calculating Date Difference
Suppose we want to calculate the difference in days between the "CreationDate" column in the "Comments" table and the current date. We can use the following SQL query:
-- MSSQLTips.com SELECT Id, CreationDate, DATEDIFF(DAY, CreationDate, GETDATE()) AS DateDifference FROM Comments;
This query will retrieve the comment details and a new column called "DateDifference," representing the number of days between creation and current dates.
Several built-in functions can be used to clean and manipulate string values. For example, these functions can be used in the following tasks:
Removing White Spaces
Let's say we have a column called "Name" in the "Users" table, and we want to remove any leading or trailing white spaces from the names. We can use the TRIM function to achieve this:
-- MSSQLTips.com SELECT Id, TRIM(DisplayName) AS CleanedName FROM Users;
This query will return the user IDs and a new column called "CleanedName," where leading and trailing white spaces are removed from the names.
Replace Erroneous Values
Suppose we have a column named "Score" in the "Posts" table, and numbers less than one represent some erroneous values since one is the lowest possible reputation of a Stack Overflow user. We want to replace those negative values with "1":
-- MSSQLTips.com SELECT Id, Title, CASE WHEN Score < 1 THEN 1 ELSE Score END AS UpdatedScore FROM Posts;
We use a CASE statement in this query to check if the "Score" is less than one. If it is, we replace it with 1; otherwise, we keep the original score. The new column "UpdatedScore" will contain the corrected values.
Extracting Specific Values from a String
Suppose we have a column called "Text" in the "Comments" table and want to extract the usernames mentioned in the comments. Assuming the usernames are enclosed between @ and space, we can use string the SUBSTRING function to extract them:
-- MSSQLTips.com SELECT Id, Text, SUBSTRING(Text, CHARINDEX('@', Text) + 1, CHARINDEX(' ', Text) - CHARINDEX('@', Text) - 1) AS UserName FROM Comments WHERE Text LIKE '%@% %';
In this query, we use the SUBSTRING and CHARINDEX functions to extract the text between @ and space. The extracted usernames will be returned in a new "UserName" column. The condition in the WHERE clause %@% % filters comments that mention a specific username.
Several built-in functions can be used to fix or convert values data types. For example, these functions can be used in the following tasks:
Assuming you have a column named "CreationDate" in the "Posts" table, which is stored as a string. You can parse it into a date data type using the following SQL query:
-- MSSQLTips.com SELECT CAST(CreationDate AS datetime) AS ParsedCreationDate FROM Posts
This query will convert the "CreationDate" column from a string to a datetime data type, allowing you to perform date-based operations on it.
Check If Values Can Be Converted to Another Data Type
In the Users table, we have a column named "Age" that stores each user's age as a string. If we want to check if the "Age" column values can be converted to integers, we should use the query below:
-- MSSQLTips.com SELECT Age FROM Users WHERE TRY_CONVERT(int, Age) IS NOT NULL
The TRY_CONVERT function attempts to convert the "Age" values to integers. If the conversion is successful, it returns the integer value. Else, it returns NULL.
As shown in the screenshot above, no records were returned since this column does not contain numeric data and needs to be cleaned.
The aggregation operators are crucial in extracting valuable summaries from the original dataset. By using aggregation functions such as SUM, COUNT, AVG, MIN, and MAX, we can quickly calculate key metrics and gain insights into our data. Moreover, aggregation operators prove incredibly useful in data profiling tasks. They allow us to assess the quality of your data by providing a snapshot of important statistics, such as the total number of records, the distinct values in a column, or the distribution of values across different categories. These operators are powerful tools for analyzing data and evaluating its quality. Aggregation requires using the GROUP BY clause to determine the key columns:
SELECT Location, COUNT(*) FROM Users
The following are some detailed tips previously published on MSSQLTips.com:
- SQL Aggregate Functions Code Samples
- Group By in SQL Server with CUBE, ROLLUP and GROUPING SETS Examples
- GROUP BY SQL Queries to Summarize and Rollup Sets of Records
Window functions allow us to perform various tasks, like adding row numbers and ranking values. In addition, one of the most valuable aspects of window functions is their ability to handle data gaps. We can replace NULL or inaccurate values with useful aggregates like the minimum and maximum values, effectively filling in missing pieces of the dataset. This helps us organize and analyze our data and ensure our results are more complete and accurate. They truly offer a powerful set of tools for data manipulation and analysis.
The tips below contain more details on window functions in SQL Server:
Pivoting and Unpivoting
Pivoting and unpivoting are valuable techniques in data wrangling that allow us to transform data between row-wise and column-wise formats. These techniques are especially beneficial when dealing with data in a wide format, where each row represents a single record, and each column represents a variable. By pivoting, we can restructure the data into a column-wise format, making it easier to analyze and compare variables across different records.
Conversely, unpivoting comes into play when we need to convert data from a column-wise format back to a row-wise format. Unpivoting helps us reshape the data into a long format, where each row represents a unique combination of variables and their corresponding values. These techniques provide flexibility and efficiency in manipulating data to suit our analytical needs.
You can learn more about pivoting and unpivoting in SQL Server by referring to the following tips:
Common Table Expressions (CTE)
With CTEs, we can create temporary result sets that can be referenced multiple times within a single SQL statement, making your code more readable and efficient. They act as temporary views, allowing us to break down complex tasks into smaller, more manageable steps.
CTEs are particularly useful when dealing with recursive queries, where a query refers to itself. We can use CTEs to build hierarchical structures or perform operations like graph traversals.
You can learn more about CTE in SQL Server by referring to the following tips:
- You can learn more about data wrangling by checking the following articles:
About the author
View all my tips
Article Last Updated: 2023-08-03