Implement Data Enrichment in Lakehouse using Synapse Analytics Mapping Data Flow

By: Fikrat Azizov   |   Updated: 2021-12-14   |   Related: Azure Synapse Analytics


Problem

This article is part of a series of tutorials dedicated to building a Lakehouse solution based on Azure Synapse Analytics. In the previous post, Implement data deduplication in Lakehouse using Synapse Analytics Mapping Data Flow, we discussed techniques to deduplicate the data. In this tutorial, we're going to explore another common Lakehouse transformation - data enrichment.

Solution

Data enrichment may be required for a variety of reasons, such as:

  • Extracting useful features from the source data. For example, the source data may include a column containing JSON or comma-delimited strings, and we may need to extract certain parts of it.
  • Fetching data from external sources. For example, the source data may include an identifier column, and we might want to bring in its description from another data source (see the PySpark sketch after this list).
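
For illustration, here is a minimal PySpark sketch of both patterns. It is not part of the data flow we'll build below - the file paths, the CustomerType lookup table and the CustomerTypeID column are hypothetical and only serve to show the idea:

    # Illustrative sketch of the two enrichment patterns (hypothetical paths and lookup table)
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, col

    spark = SparkSession.builder.getOrCreate()

    # Pattern 1: extract a useful part of a delimited string column
    customers = spark.read.format("delta").load("/bronze/Customer")            # hypothetical path
    customers = customers.withColumn("EmailDomain", split(col("EmailAddress"), "@").getItem(1))

    # Pattern 2: bring in a description from another data source via a lookup join
    customer_types = spark.read.format("delta").load("/bronze/CustomerType")   # hypothetical path
    enriched = customers.join(customer_types, on="CustomerTypeID", how="left")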

Data enrichment, like deduplication, can be part of the transformations between the Bronze and Silver layers.

The Data Flow Design

I'm going to leverage the deduplication data flow we built in the previous post and add extra transformations to it to demonstrate data enrichment techniques.

Let's open Azure Synapse Studio, clone the DataflowBonzeSilver data flow and name the copy DataflowBonzeSilver_Customer. Unlike the DataflowBonzeSilver flow, this data flow is going to be table-specific: it will include only the Customer table transformations.

This flow has multiple transformations designed for deduplication purposes. Here's how it looks so far:

Figure 1: The existing deduplication data flow

We'll add data enrichment transformations at the end of it, just before the RemoveMetadataColumns transformation. Here's how our target data flow will look:

Figure 2: The target data flow design

We'll also need to customize some of the existing transformations, to make this flow specific to the Customer table.

Here is a brief description of these transformations:

  • ParsedEmailAddress. This transformation will parse the EmailAddress column from the source and generate a new column with a complex data type.
  • ExtractEmailAddressParts. This transformation will read the column generated in the previous step and split it into two parts - the user name and the domain.

Configure Debugging Settings

Let's turn on debug mode and add debug parameters that will allow us to build the pipeline interactively.

Please enter the following values as debug parameters:

  • SourceTableName: "Customer"
  • TargetTableName: "Customer"
  • PrimaryKey: "CustomerID"
  • TimestampColumn: "DateInserted"

Here's the screenshot:

Figure 3: Debug settings with the parameter values

Source Transformation

Let's select the BronzeDelta transformation and navigate to the Source options tab. Use the Browse button to navigate to the folder that hosts the Bronze Delta Lake files for the Customer table, as follows:

Figure 4: Source options of the BronzeDelta transformation

We'll also need to fetch the schema for this table, so let's navigate to the Projection tab and click the Import schema button, as follows:

Figure 5: Importing the schema on the Projection tab

Here's the screenshot of the Projection tab with the imported schema:

Figure 6: The Projection tab with the imported schema
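
As a side note, if you want to double-check the Bronze data outside of the data flow, an equivalent read in a Synapse Spark notebook might look like the sketch below. The storage path is an assumption - substitute the folder you selected with the Browse button in Figure 4:

    # Read the Customer table's Bronze Delta Lake folder directly with Spark
    # (the path is an assumption; use the folder chosen in Figure 4)
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    bronze_customer = spark.read.format("delta").load(
        "abfss://datalake@mystorageaccount.dfs.core.windows.net/bronze/Customer")
    bronze_customer.printSchema()   # compare with the imported projection in Figure 6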

ParsedEmailAddress Step

Before adding a new transformation, let's select the Deduplicate step and examine the data from its output:

Figure 7: Data preview of the Deduplicate step

Our goal is to split the EmailAddress column into two parts - the user name and the domain.

Let's add a Parse transformation after the Deduplicate step and select the Delimited text option. Expand the Delimited text settings section, select the Edit checkbox and enter the @ symbol as the delimiter:

Figure 8: Delimited text settings of the Parse transformation

Enter the following settings:

  • ParsedEmailAddress in the Column text box, to define the new column's name.
  • EmailAddress in the Expression text box, to indicate the source column to be parsed.
  • (UserName as string, Domain as string) in the Output column type text box, to define a schema of the new column.

Here's the screenshot:

Figure 9: Parse transformation settings

Let's validate the results:

Figure 10: Data preview of the Parse transformation

Notice that we created a ParsedEmailAddress column with a complex data type. Next, we'll split this column into its parts.
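
Before we do, here is a minimal PySpark sketch of the parsing logic we just configured, where the complex column becomes a struct built from the two halves of the split. It illustrates the logic only - it is not the data flow's actual implementation, and the sample row is made up:

    # Build a ParsedEmailAddress struct with UserName and Domain parts,
    # mirroring the Parse transformation (illustrative sketch with made-up data)
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, struct, col

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "jane.doe@example.com")], ["CustomerID", "EmailAddress"])

    parts = split(col("EmailAddress"), "@")
    df = df.withColumn("ParsedEmailAddress",
                       struct(parts.getItem(0).alias("UserName"),
                              parts.getItem(1).alias("Domain")))
    df.show(truncate=False)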

ExtractEmailAddressParts Step

Next, let's add a Derived column transformation after the ParsedEmailAddress step and add the following calculated columns:

  • UserName: ParsedEmailAddress.UserName
  • Domain: ParsedEmailAddress.Domain

Here's the screenshot:

Figure 11: Derived column settings of the ExtractEmailAddressParts step

Let's validate the results:

Figure 12: Data preview of the ExtractEmailAddressParts step

RemoveMetadataColumns Step

As you may remember from the previous post, this step was designed to eliminate all the columns that are not required in the target table. We'll need to customize it by adding the ParsedEmailAddress column to the exclusion list. Select this transformation, open the expression builder for the left-hand side textbox and modify the existing expression as follows:

name!="PkColumn"&&name!="TSColumn"&&name!="RowVersion"&&name!="ParsedEmailAddress"

Here's the screenshot:

Figure 13: The modified rule-based mapping in the RemoveMetadataColumns transformation

Here's the screenshot with the validation results:

Figure 14: Data preview of the RemoveMetadataColumns step

Building the Parent Pipeline

Because the data flow we built here was for a specific table, our parent pipeline can be simple. Let's create a data integration pipeline, add a data flow activity to it and select the DataflowBonzeSilver_Customer data flow, as follows:

Figure 15: The parent pipeline with the data flow activity

Navigate to the Parameters tab and enter the Customer table-related parameters, as follows:

Figure 16: The data flow parameters for the Customer table

This concludes the design of the parent pipeline. Let's run it in debug mode and ensure that it succeeds:

Figure 17: Debug run of the parent pipeline

Next Steps



About the author
Fikrat Azizov has been working with SQL Server since 2002 and has earned two MCSE certifications. He's currently working as a Solutions Architect at Slalom Canada.

This author pledges the content of this article is based on professional experience and not AI generated.


