Data Transformation and Migration Using Azure Data Factory and Azure Databricks

By:   |   Updated: 2023-03-24   |   Comments (2)   |   Related: > Azure Data Factory


Problem

Microsoft Azure Databricks and Data Factory are both cloud-based services for data integration and (Extract, Transform, and Load) ETL that can manage a variety of data formats, including batch-streaming and unstructured and structured data. Azure Data Factory (ADF) is a solution for orchestrating data transfer at scale and ETL procedures for Data Integration services. Azure Databricks is a fully managed platform for analytics, data engineering, and machine learning, executing ETL and creating Machine Learning models.

Data ingested in large quantities, either batch or real-time, must be transformed appropriately to the proper format and stored in a database in the cloud or on-premises.

Solution

The need for data coming in the correct format for data analyst and scientist are on the rise more than ever before, where Azure Databricks is used to perform the entire ELT process and load the transformed data into a database.

This entire project will be carried out using ADF and Azure Databricks: mounting Azure SQL and Azure Data Lake Gen2 storage on Azure Databricks, writing the entire transformation code in PySparks on Azure Databricks, and creating a trigger to activate the ETL process of Azure Databricks right from Azure Data Factory.

Project Architecture

Project Architecture

However, before we start, let us first understand Databricks.

Understanding Databricks

Databricks is a cloud-based platform that combines data analytics, data engineering, and machine learning to process and analyze enormous amounts of data and enable collaboration between engineers, business analysts, and data scientists.

The cloud-based platform is built on top of distributed cloud computing infrastructure like Azure, AWS, or Google Cloud, which enables executing programs on CPUs or GPUs dependent on analysis requirements. In addition, Databricks can be integrated with another cloud platform to help manage enormous amounts of data and execute machine learning tasks by organizations.

What is Azure Databricks?

Azure Databricks is an Apache Spark-based analytics platform built on Microsoft Azure. Azure Databricks is used in opening lake houses and processing large amounts of data, allowing collaboration between data scientists, data engineers, and business analysts to obtain meaningful insights.

Azure Databricks by Microsoft

Azure Databricks by Microsoft

Create Azure Databricks

To work with Azure Databricks, you must first have an active Azure subscription. You will create all the necessary resources for this project in your Azure Portal site.

Create a Resource Group

Azure resource group is a container that holds related resources for an Azure solution. However, you can include all of the solution's resources in the resource group or only the resources you want to manage collectively.

Check out Connect Power BI to Azure Data Lake Gen2 Storage for Reporting for more information.

Azure Databricks Service

In the Azure Portal, go to the search resources, service, and docs at the top and type "databricks". This will bring up a list of resources. Select Azure Databricks.

Search results

Now, click + Create. This will open a new window where you are expected to input all necessary information.

Azure Databricks | Create

Basic Configuration

Complete all necessary information:

  • Subscription: Select your Azure subscription.
  • Resource Group: Select the resource group created. This will be used to house all Azure resources.
  • Workspace Name: Provide the workspace with a unique name.
  • Pricing Tier: We have 2 options, Standard and Premium. For this project, we will be using the Standard Tier. Note: I suggest using the Premium Tier if you work for a larger organization with multiple collaboration features.
Create Azure Databricks workspace

Networking

In the networking stage, you can leave it in the default settings.

Azure Databricks | Networking

Review + Create

This is the final step: click on the review + create button. This will start the validation process.

Azure Databricks | Validation

The Deployment stage can take a couple of minutes. When done, you will get a successful message.

Now click on "Go to resource." This will take you to the Azure Databricks resource we just created.

Azure Databricks | Deployment Complete

In the Azure Databricks resource, click on "Launch Workspace." This will take you to the Azure Databricks environment.

Azure Databricks environment

Databricks Cluster

Databricks clusters are a set of computation resources and configurations on which data engineers, science, and analytics run their workloads, such as production ELT pipelines, streaming analytics, ad-hoc analytics, and machine learning.

Types of Databricks Cluster

In Azure Databricks, the workloads are manually or automatically executed through notebook instructions. There are two types of clusters supported in Databricks:

  1. All Purpose Clusters: These clusters are employed in interactive notebooks for group data analysis. The CLI, UI, or REST API are used in creating them. You can also manually stop the "All Purpose Cluster."
  2. Job Clusters: These clusters are used for quick and reliable automated tasks. They are generated when a job is conducted on your brand-new Job Cluster, and the cluster is terminated when the job is finished. Note: This cluster cannot be restarted.

Create an All-Purpose Cluster

Let us create an all-purpose or interactive cluster in Azure Databricks. You might notice the name is now called "Compute," formally called Cluster.

Step 1: Create a Cluster

In your Azure Databricks environment, scroll to the bottom left corner and click "Compute." Under the All-purpose compute, click on "Create compute."

Create a cluster

Step 2: Databricks Cluster Setting

Setting up the Azure Databricks cluster will require the following configuration:

  1. Node: We will use a "Single Node" to create it for one user. Spark jobs are executed on the driver node in a single node cluster with no workers.
  2. Runtime Version: Select the Runtime: 11.3 LTS (Scala 2.12, Spark 3.30) version for the runtime.
  3. Node Type: For this project, we will use the "Standard_DS3_v2 14GB Memory, 4 Cores." You can go for a higher version of the node depending on the project you want to work on.
  4. Terminate: This is used to terminate the cluster after a certain period of inactivity.
  5. Create: Click the Create Cluster button when you are done setting up the cluster configuration.
Databricks cluster settings

This process will take a couple of minutes to create. After the cluster is created, you will get a "successful" message with a green check mark indicating your cluster is actively running.

Cluster creation success message

Mount Azure Data Lake Gen2 to Azure Databricks

In mounting Azure Data Lake Gen2 to Databricks, we need to complete the following:

  • Create Azure Service Principal.
  • Grant access for Azure data lake to Azure Service Principal.
  • Create the mount in Databricks using Service Principal.

App Registration

We first must create an app in the Azure Service portal before we mount Azure Storage to Databricks.

Follow the steps listed below to create an App:

Step 1

Enter "Azure Active Directory" in the search box of your Microsoft Azure Portal, then select "App Registration."

Default Directory | App registrations

Step 2

In the app registration window, click on the "New Registration" button to register the app.

App registrations | New registration

Step 3

Give the app a unique name, leave other default settings, and click Register.

Naming, support account types, register

A new window appears, providing you with the necessary IDs to copy. These IDs are needed in mounting your data lake storage in Databricks:

  • Application ID
  • Directory ID
Application and Directory IDs

Step 4

In the new app created, you need to get the Secret ID. In the app, click "Certificates & secrets," then "New Client Secrets," give it a description, and click Add.

Secret ID

Step 5

Copy the App Secret Value. Note: The secret value appears only once.

Copy App Secret Value

Grant App Access to Data Lake Storage

We need to grant access to the app just created to Azure data lake storage.

Step 1

In your Azure Portal, navigate to the Data Lake Gen2 resource you already have. If you do not have Azure Data Lake Gen2 in your resources, check out our previous article on creating data lake storage: Connect Power BI to Azure Data Lake Gen2.

In the Data Lake Gen2 resource, click on Access Control (IAM), click +Add, and select "Add role assignment."

Access Control (IAM)

Step 2:

Search for "Storage Blob Data Contributor" in the role assignment window. Click next after selecting it.

Search for Storage Blob Data Contributor

Step 3:

Click the "Select member" icon on the member's page, look for the APP name you created, and then add it. Click "Review + assign" after that.

Add role assignment

Set Mount Configuration

Now that all registration is completed in Azure Portal, you will need to navigate the Azure Databricks environment.

Step 1: Configuration

In your Azure Databricks environment, click on the "Workspace" at the top left corner. Right-click on the workspace and select Create | Notebook. This process will help create a new Notebook for writing code using the PySpark.

Workspace | Notebook

In the Notebook, write the code below, which will help with the mount configuration settings. You must provide the "Application ID, App Key, and Directory ID" code obtained during your app registration.

configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<app-id>",#application-id
           "fs.azure.account.oauth2.client.secret": "<secret_key>",#secret key
           "fs.azure.account.oauth2.client.endpoint": "<Tenant _id>"}#Tenant ID Directory

Step 2: Mount Folder Directory

You will need to indicate the source and mount point name.

  • <storage_name>: Name of Azure Data Lake Gen2 created.
  • <folder_name>: This is the name of the container (folder) found inside the Azure data lake storage.
dbutils.fs.mount(
  source = "abfss://<folder_name>@<storage_name>.dfs.core.windows.net/",#Mount your data lake gen 2
  mount_point = "/<storage_name>/<folder_name>",
  extra_configs = configs)

Step 3: Check the Folder

The line of code below checks for files in the folder we just mounted.

dbutils.fs.ls("/mnt/<storage_name>/<folder_name>") # This shows the list of the Mounted folder.

Note: To save money, you can always terminate the cluster running.

Data Transformation in Databricks using PySpark

This stage requires data to be transformed and processed to the right format, which will be migrated to the Azure SQL for a data analyst to derive insight and other useful information.

Step 1: Read Data from Mount Point

Mount the data directory we want to work on with the exact CSV file directory.

#Read Data from Mount Point
df_pyspark_2 = spark.read.csv("/mnt/adfdatalakeurbizedge/cleanfile/sales_data.csv", header=True, inferSchema=True)
#InferSchema endures the right datatype are in each individual column
  • Header: True promotes the first row as a header.
  • inferSchema: This helps auto-detect the right data types.

Step 2: Exploratory Data Analysis

Let us get information about the data we want to transform and migrate.

Show Data. Azure Databricks uses Apache PySpark, which is similar to Python Pandas.

df_pyspark_2.show() #show() is used to show the data frame we want to work on
Show data

Data Type. The printSchema() function is used to get the right data type of each column. It is the same as pandas info().

#View each columns datatypes
df_pyspark_2.printSchema()
Data type

Calculations & New Columns. You need to import the column function. This will give PySpark the ability to create a new column when trying to get the total amount sold.

from pyspark.sql.functions import to_date, col
df_pyspark_2 = df_pyspark_2.withColumn("Total_Amount_Sold", col("Quantity") * col("Price")*(1-col("Sales Discount")))
df_pyspark_2.select(["Pizza Sold","Total_Amount_Sold"]).show()# Select the #Pizza sold and Total Amount sold Column
Calculations and new columns

Groupby and Aggregate. The groupBy function works the same way as that of SQL. It will provide the unique of all the "PizzaSold" column and the sum aggregate of the "Total_Amount_Sold" column.

df_pyspark_2 = df_pyspark_2.groupBy("PizzaSold").agg({"Total_Amount_Sold":"sum"})
Groupby and Aggregate

Now that we have completed the transformation, we need to migrate the transformed data from Azure Data Lake Gen2 to Azure SQL Database.

Create Azure SQL Database

To create an Azure SQL database, you will need to follow the steps listed below:

Step 1

In your Azure Portal, search for Azure SQL resource. This should take you to the resource environment. Click the "Create +" sign at the top left corner.

Azure SQL | Create

Step 2

In the SQL deployment environment, you will have 3 options. For this project, we will use the "SQL databases – Single database." Then click Create.

SQL deployment options

Step 3

For the SQL configuration settings, you must fill in all necessary settings from your resource group, database name, and Server name.

Create SQL Database Basics

Step 4

Create a username and password to login to your Azure SQL. For other settings, click "Networking" and provide the necessary information. If not, click on "Review + create." This will create the Azure SQL resource.

Step 5

Now that you have successfully created the Azure SQL database, you need to login. Click on the Azure SQL resource you just created, then select "Query Editor (Preview),"and enter your necessary login information.

Query editor (preview)

Create a Sales Data Table in Azure SQL

In the Azure Portal SQL resource, write this query below to create a table.

create table sales_data (
   pizza_sold varchar(30),
   Target_Amount_Sold DECIMAL(10, 2)
);
Create a table

Write to Azure SQL. This is the final step. At this point, we will import the transformed data to Azure SQL. You are expected to fill in the following information:

  • Server name: This is obtained from the Azure SQL resources created in Azure Portal.
Write to Azure SQL | Servername
  • Database: This is the name given to the database when creating the Azure SQL resources.
  • Table: The table name created is "Sales data."
  • User: Username to your Azure SQL resources.
  • Password: Password to your database resources.
df.write .format("jdbc").option("url", "jdbc:sqlserver://<servername>:1433;databaseName=<databasename>").option("dbtable", "[<optional_schema_name>].<table_name>").option("user", "<user_name>").option("password", "<password>").save()

Trigger Databricks Pipeline in Azure Data Factory

The Azure Data Factory is a cloud-based service that performs the complete ETL process for ingesting, preparing, and transforming all data at scale. Apart from the ETL operation performed by ADF, it can also be used in running Databricks Notebook directly from the ADF environment. The process uses the Job cluster in Azure Databricks.

Azure Databricks Access Token

The Access token is required when creating a connection between ADF and Databricks.

The steps listed below show how to generate an Azure Databricks access token.

Step 1

At the top right corner of your Azure Databricks account, select "User Settings."

User Settings

Step 2

Under the Access tokens, click Generate new token.

Access tokens | Generate new token

Step 3

Add a Token Name and the lifetime you want your token to last, and click Generate.

Generate token

Step 4

From the new window, the access token is generated. You are expected to copy and keep the token in a secure place to which only you have access.

Copy created token

Create Azure Databricks Linked Service

In your ADF environment, navigate to the left pane and click the "Manage" icon. In the Manage session, click on "+ New." Select the "Compute" tab in the new window, then click on Azure Databricks.

Manage linked services

Set Up Serviced Link for Databricks

In your Azure Databricks linked service, you must fill in the correct configuration settings.

Note: Ensure all information in the Linked Service is the same as the Azure Databricks cluster created. The image below helps explain the configuration process for the Databricks service.

New linked service

Activity Pipeline for Azure Databricks

Step 1: Create a Pipeline Activity

In the ADF author window, create a new activity. Search for the Databricks and drag Notebook to the activity pipeline environment. You are also expected to name your Notebook activity or leave it as the default name.

Create a pipeline activity

Step 2: Activity Settings and Configuration

  • Databricks Linked Service: In your activity setting (Notebook), click on the Azure Databricks tab and select the correct Databricks linked service we created.
Databricks lined service
  • Notebook: Click on the "Settings" tab, then browse to the directory of the Notebook in the Azure Databricks root folder.
Notebook

Publish Pipeline

To save the entire activity we created in ADF, click on the Publish all icon at the top left corner.

Publish pipeline

Debug Pipeline

To run the pipeline, click on the "Debug" icon. This will run the entire Notebook activity pipeline. This starts an in-job cluster in Azure Databricks. It should take a couple of minutes, depending on the ETL pipeline you want to run.

To monitor the pipeline activity and check the debugging status, click on the "monitor" tab and select debug. This should provide you with the status.

Debug pipeline

Query Result

In your Azure Portal, select the Azure SQL resource table you migrated the transformed data into and run the SELECT * FROM query to see your result.

Query result

Automate Pipeline

Data Engineering involves automating the ETL process. To achieve this, we will create a "storage event trigger," which activates the entire ETL process whenever data is uploaded into the Azure Data Lake Gen 2 storage.

Truncate Data Table

Truncating removes all data from a table but keeps the table structure intact. Let’s remove the sales_data and repeat the ETL process, but this time with an event-based trigger.

Truncate table [dbo].[sales_data];
Truncate data table

Create Storage Event Trigger

Now that we have deleted all the data from the Azure SQL, let's create a "storage-based event trigger" in ADF.

The steps below show how to create an event-based trigger in ADF:

  1. Type: This indicates the type of trigger we want to work with.
  2. Subscription: Select the subscription you are currently on.
  3. Storage account name: We will use the Azure Data Lake Gen 2 storage.
  4. Container name: Select the folder to which the file will be uploaded in your data lake.
  5. The blob path begins with: Copy the file name upon which you want to base your trigger.
Create an event-based trigger in ADF

Publish All

You need to click on the Continue icon to activate your trigger.

Test Trigger

Now that we are done creating the event-based storage trigger let's head to Azure Data Lake Gen 2 and upload the sales_data to the storage.

Upload Data

In your Azure Data Lake Gen2, navigate to the cleanfile folder, and click "Upload." Select the sales_data file from your local machine.

Upload data

Note: As soon as the files are uploaded, this triggers the entire pipeline in Azure Data Factory.

Monitor

In ADF, click on the monitor pane with Triggered by: All configured then you will see that the pipeline ran successfully.

Monitor

Test

Now that the pipeline has run successfully head to your Azure SQL and run the SELECT FROM function to view the entire data.

Test

Conclusion

In this article, you learned different Azure Databricks applications and how to mount Azure Data Lake Gen 2 on Azure Databricks to perform data transformation and migration to Azure SQL database. We also created an event-based storage trigger, which helped us improve the entire process and reduce human interaction to the nearest minimum.

It is important to note that including other Azure resources, like Azure Function and Azure Logic Apps, can further improve the ETL process. It will also further help in automating the entire process.

To learn more about Azure, check out the Azure Tips segment on MSSQLTips.com or subscribe to the MSSQLTips.com newsletter.

Next Steps


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Temidayo Omoniyi Temidayo Omoniyi is a Microsoft Certified Data Analyst, Microsoft Certified Trainer, Azure Data Engineer, Content Creator, and Technical writer with over 3 years of experience.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips


Article Last Updated: 2023-03-24

Comments For This Article




Monday, September 11, 2023 - 10:43:11 AM - Becks Back To Top (91549)
Thank you so much. This is really helpful

Friday, July 14, 2023 - 4:48:22 AM - Stephanie GUO Back To Top (91386)
Hi Temidayo,

Nice work and really helpful! Just wondering whether you are able to share the dataset that has been used on this project. Many thanks!