Data Wrangling in SQL Server for Cleaning, Transforming and Merging Data

By:   |   Updated: 2023-08-03   |   Comments   |   Related: > Data Cleansing for Validation


Problem

Managing and analyzing data with SQL Server poses challenges when working with raw datasets. It becomes necessary to wrangle the data to make sense of it. Data wrangling proves to be a task that is both time-consuming and complex, as it entails data analysis, cleaning, transforming, enriching and merging data from diverse sources.

Solution

I will briefly remind you about the data wrangling concept in this article. Then, I will dive into the SQL Server functions used to create a custom data wrangling tool.

Data Wrangling Process: Quick Reminder

Data wrangling refers to cleaning, transforming, and structuring raw data into a format suitable for data analysis and decision making. This includes removing duplicates, dealing with missing data, correcting inconsistencies, and integrating multiple sources to prepare data for analysis so that insights can be gained from it.

Why is it Important?

Raw data can be quite messy, unstructured and filled with a bunch of noisy information such as outliers. But when we perform data wrangling, we clean up complex data sets, making it much simpler to analyze and get valuable insights including data visualizations.

What Differs Between Data Cleaning and Pre-Processing?

When working with data, we have this concept called data pre-processing. It's a big umbrella term covering all the tasks in preparing the data for analysis, including data wrangling. Besides data wrangling, pre-processing may involve integration, transformation, and feature selection. Interestingly, when we talk about "data pre-processing," we usually mean these operations before storing and loading the data into the warehouse. It's like part of the ETL process.

On the other hand, "data wrangling" is more commonly used in the context of ELT. In this case, we perform data cleaning right when we need the data for analysis, and it's consumed directly after that. This approach works well with larger datasets. But people often use both terms interchangeably, and the specific tasks involved can vary depending on the situation.

Data Wrangling in SQL Server

Several T-SQL functions and techniques are commonly used in SQL Server to wrangle data. Let's consider the Stack Overflow 2010 database, assuming we want to perform data-wrangling tasks using the tables provided. Here are a couple of examples:

Date and Time Functions

SQL Server has several built-in functions for working with date and time values, which can be helpful in data wrangling tasks. For instance, you may need to change a date to match your local culture information.

Changing the Date Format

-- MSSQLTips.com
SELECT
    Id,
    Title,
    CONVERT(varchar, CreationDate, 107) AS FormattedCreationDate
FROM
    Posts;

We use the CONVERT function in this query to change the date format. The third argument, 107, specifies the format style we want to use, representing "MMM dd, yyyy." The result will be stored in a new column called "FormattedCreationDate." You can customize the format by choosing a different style code according to your preference.

Changing the date format

Extracting Date Parts

We want to extract the year, month, and day from the "Date" column in the "Badges" table. We can use the following SQL query:

-- MSSQLTips.com
SELECT
    Id,
    Name,
    UserId,
    YEAR(Date) AS Year,
    MONTH(Date) AS Month,
    DAY(Date) AS Day
FROM
    Badges;

This query will return the badge information and separate columns for the year, month, and day parts extracted from the "Date" column.

Extracting date parts

Calculating Date Difference

Suppose we want to calculate the difference in days between the "CreationDate" column in the "Comments" table and the current date. We can use the following SQL query:

-- MSSQLTips.com
SELECT
    Id,
    CreationDate,
    DATEDIFF(DAY, CreationDate, GETDATE()) AS DateDifference
FROM
    Comments;

This query will retrieve the comment details and a new column called "DateDifference," representing the number of days between creation and current dates.

Calculating date difference

String Functions

Several built-in functions can be used to clean and manipulate string values. For example, these functions can be used in the following tasks:

Removing White Spaces

Let's say we have a column called "Name" in the "Users" table, and we want to remove any leading or trailing white spaces from the names. We can use the TRIM function to achieve this:

-- MSSQLTips.com
SELECT
    Id,
    TRIM(DisplayName) AS CleanedName
FROM
    Users;

This query will return the user IDs and a new column called "CleanedName," where leading and trailing white spaces are removed from the names.

Removing white spaces

Replace Erroneous Values

Suppose we have a column named "Score" in the "Posts" table, and numbers less than one represent some erroneous values since one is the lowest possible reputation of a Stack Overflow user. We want to replace those negative values with "1":

-- MSSQLTips.com
SELECT
    Id,
    Title,
    CASE
        WHEN Score < 1 THEN 1
        ELSE Score
    END AS UpdatedScore
FROM
    Posts;

We use a CASE statement in this query to check if the "Score" is less than one. If it is, we replace it with 1; otherwise, we keep the original score. The new column "UpdatedScore" will contain the corrected values.

Replace erroneous values

Extracting Specific Values from a String

Suppose we have a column called "Text" in the "Comments" table and want to extract the usernames mentioned in the comments. Assuming the usernames are enclosed between @ and space, we can use string the SUBSTRING function to extract them:

-- MSSQLTips.com
SELECT
    Id,
    Text,
    SUBSTRING(Text, CHARINDEX('@', Text) + 1, CHARINDEX(' ', Text) - CHARINDEX('@', Text) - 1) AS UserName
FROM
    Comments
WHERE
    Text LIKE '%@% %';

In this query, we use the SUBSTRING and CHARINDEX functions to extract the text between @ and space. The extracted usernames will be returned in a new "UserName" column. The condition in the WHERE clause %@% % filters comments that mention a specific username.

Extracting specific values from a string

Conversion Functions

Several built-in functions can be used to fix or convert values data types. For example, these functions can be used in the following tasks:

Parsing Values

Assuming you have a column named "CreationDate" in the "Posts" table, which is stored as a string. You can parse it into a date data type using the following SQL query:

-- MSSQLTips.com
SELECT CAST(CreationDate AS datetime) AS ParsedCreationDate
FROM Posts

This query will convert the "CreationDate" column from a string to a datetime data type, allowing you to perform date-based operations on it.

Parsing values

Check If Values Can Be Converted to Another Data Type

In the Users table, we have a column named "Age" that stores each user's age as a string. If we want to check if the "Age" column values can be converted to integers, we should use the query below:

-- MSSQLTips.com
SELECT Age
FROM Users
WHERE TRY_CONVERT(int, Age) IS NOT NULL

The TRY_CONVERT function attempts to convert the "Age" values to integers. If the conversion is successful, it returns the integer value. Else, it returns NULL.

Check if values can be converted to another data type

As shown in the screenshot above, no records were returned since this column does not contain numeric data and needs to be cleaned.

Aggregation

The aggregation operators are crucial in extracting valuable summaries from the original dataset. By using aggregation functions such as SUM, COUNT, AVG, MIN, and MAX, we can quickly calculate key metrics and gain insights into our data. Moreover, aggregation operators prove incredibly useful in data profiling tasks. They allow us to assess the quality of your data by providing a snapshot of important statistics, such as the total number of records, the distinct values in a column, or the distribution of values across different categories. These operators are powerful tools for analyzing data and evaluating its quality. Aggregation requires using the GROUP BY clause to determine the key columns:

SELECT Location, COUNT(*)
FROM Users

The following are some detailed tips previously published on MSSQLTips.com:

Window Functions

Window functions allow us to perform various tasks, like adding row numbers and ranking values. In addition, one of the most valuable aspects of window functions is their ability to handle data gaps. We can replace NULL or inaccurate values with useful aggregates like the minimum and maximum values, effectively filling in missing pieces of the dataset. This helps us organize and analyze our data and ensure our results are more complete and accurate. They truly offer a powerful set of tools for data manipulation and analysis.

The tips below contain more details on window functions in SQL Server:

Pivoting and Unpivoting

Pivoting and unpivoting are valuable techniques in data wrangling that allow us to transform data between row-wise and column-wise formats. These techniques are especially beneficial when dealing with data in a wide format, where each row represents a single record, and each column represents a variable. By pivoting, we can restructure the data into a column-wise format, making it easier to analyze and compare variables across different records.

Conversely, unpivoting comes into play when we need to convert data from a column-wise format back to a row-wise format. Unpivoting helps us reshape the data into a long format, where each row represents a unique combination of variables and their corresponding values. These techniques provide flexibility and efficiency in manipulating data to suit our analytical needs.

You can learn more about pivoting and unpivoting in SQL Server by referring to the following tips:

Common Table Expressions (CTE)

With CTEs, we can create temporary result sets that can be referenced multiple times within a single SQL statement, making your code more readable and efficient. They act as temporary views, allowing us to break down complex tasks into smaller, more manageable steps.

CTEs are particularly useful when dealing with recursive queries, where a query refers to itself. We can use CTEs to build hierarchical structures or perform operations like graph traversals.

You can learn more about CTE in SQL Server by referring to the following tips:

Next Steps


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Hadi Fadlallah Hadi Fadlallah is an SQL Server professional with more than 10 years of experience. His main expertise is in data integration. He's one of the top ETL and SQL Server Integration Services contributors at Stackoverflow.com. He holds a Ph.D. in data science focusing on context-aware big data quality and two master's degrees in computer science and business computing.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2023-08-03

Comments For This Article

















get free sql tips
agree to terms