Azure Data Factory Pipeline to fully Load all SQL Server Objects to ADLS Gen2


By: Ron L'Esteve   |   Updated: 2020-01-30   |   Related: More > Azure

Problem

Moving data from on-premises SQL Server to Azure Data Lake Storage has become an increasingly vital need for many businesses. Customers with numerous on-premises SQL Servers, each hosting hundreds of databases, are interested in leveraging Azure's data services to build an E-L-T process that fully loads Azure Data Lake Storage with their on-premises SQL Server databases and tables in a structured, partitioned, and repeatable process across every server in their SQL Server ecosystem. How might they solve this problem, and which tools could be used for the solution?

Solution

Azure Data Factory has been a critical E-L-T tool of choice for many data engineers working with Azure's Data Services. The ability to leverage dynamic SQL and parameters within ADF pipelines allows for seamless data engineering and scalability. In this article, I will demonstrate the process of creating an end-to-end Data Factory pipeline that moves all on-premises SQL Server objects, including databases and tables, to Azure Data Lake Storage Gen2 with a few pipelines that leverage dynamic parameters.

Create Prerequisite Resources

Azure Data Factory V2: This will be used as the E-L-T orchestration tool. For more detail on setting up ADF v2, visit: Quickstart: Create a data factory by using the Azure Data Factory UI

Azure Data Lake Store Gen2: The data will be stored in ADLS2. For more information on setting up ADLS2, visit: Create an Azure Data Lake Storage Gen2 storage account

On-Premises SQL Server: My source data is stored in an on-premises SQL Server, which will be needed for this exercise. For more information on obtaining a Developer, Express or Trial version of SQL Server, visit: Try SQL Server on-premises or in the cloud

Azure SQL Database (Standard): I will use Azure SQL Database to store my pipeline parameter values and tables so as not to make any changes to my source on-premises objects. For more detail on creating an Azure SQL Database, visit: Quickstart: Create a single database in Azure SQL Database using the Azure portal, PowerShell, and Azure CLI

Azure Data Factory Self Hosted IR: The self-hosted IR will allow me to link my on-premises resources to Azure. For more detail on creating a Self-hosted IR, visit: Connect to On-premises Data in Azure Data Factory with the Self-hosted Integration Runtime - Part 1

Prepare and Verify SQL Server Database Objects

I'll begin the process by navigating to SQL Server Management Studio and connecting to my on-premises SQL Server containing two OLTP SQL Databases.

For more detail related to finding sample SQL Databases, visit: SQL Server Samples on GitHub

SQL Objects List of SQL Server database on-prem

I will also verify that there are tables in both databases:

WWIDB Tables List of tables in the WWI DB
Adventure works Tables List of tables in the Adventure works DB

Prepare and Verify Azure SQL Database Objects

I will create a pipeline parameter table in my Azure SQL Database to store the table names, catalog names, and process flags, which will drive the pipeline configurations at runtime.

Azure SQL DB Pipeline parameter table in ASQLDB
SET ANSI_NULLS ON
GO
 
SET QUOTED_IDENTIFIER ON
GO
 
CREATE TABLE [dbo].[pipeline_parameter](
   [Table_Name] [nvarchar](500) NULL,
   [TABLE_CATALOG] [nvarchar](500) NULL,
   [process_type] [nvarchar](500) NULL
) ON [PRIMARY]
GO

Prepare Azure Data Lake Store Gen2 Container

I will also create an ADLS2 container and folder for my server.

Datalake folders1 Container and folder in ADLS2

I will also confirm that there is no existing data in my server level folder:

Datalake2 no data in data lake currently-confirmed

Create Azure Data Factory Pipeline Resources

I am now ready to begin creating my Azure Data Factory pipeline by navigating to my Azure Data Factory resource and clicking 'Author & Monitor':

ADF Author step to open ADFv2

Once the Azure Data Factory canvas loads, I will click 'Create Pipeline'.

Create Pipeline Step to click create pipeline and open pipeline canvas

Create Self Hosted IR

I will create my Self-Hosted Integration Runtime and verify that it is in 'Running' status.

For more detail on creating a Self-Hosted IR, please see: Create and configure a self-hosted integration runtime.

A self-hosted integration runtime can run copy activities between a cloud data store and a data store in a private network.

SelfHosted IR Step to confirm ADFv2 Self hosted ir is running

Create Linked Services

Once my self-hosted IR is created, I will create all my required Linked services, which include SQL Server, Azure SQL Database, and Azure Data Lake Store Gen 2.

Create Linked Services Step to confirm all linked services are created.

Create Datasets

My pipelines will require the following three datasets.

Datasets Needed These are the data sets needed by pipeline

DS_ADLS2

I will configure my ADLS2 dataset with the Parquet format and will also add the following parameterized file path connections, which partition my data by YYYY/MM/DD/HH. I'll also set the compression type to 'snappy' for improved performance.

DS_ADLS2_connection Configure DS_ADLS2 connection

@concat('lake/raw/on_prem2/',dataset().table_catalog_name)
@{item().Table_Name}/@{formatDateTime(utcnow(),'yyyy')}/@{formatDateTime(utcnow(),'MM')}/@{formatDateTime(utcnow(),'dd')}/@{item().Table_Name}@{formatDateTime(utcnow(),'HH')}
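To illustrate what these expressions evaluate to at runtime, here is a minimal Python sketch that mirrors the concat and formatDateTime logic; the catalog, table, and timestamp values are hypothetical examples, not output captured from ADF:

```python
from datetime import datetime, timezone

def adls_path(table_catalog: str, table_name: str, now: datetime) -> str:
    # Directory: concat('lake/raw/on_prem2/', dataset().table_catalog_name)
    directory = "lake/raw/on_prem2/" + table_catalog
    # File path: Table_Name/yyyy/MM/dd/ partition folders, with the
    # two-digit UTC hour appended to the file name
    file_name = f"{table_name}/{now:%Y}/{now:%m}/{now:%d}/{table_name}{now:%H}"
    return f"{directory}/{file_name}"

path = adls_path("AdventureWorksLT", "SalesLT.Customer",
                 datetime(2020, 1, 30, 14, 5, tzinfo=timezone.utc))
print(path)
# → lake/raw/on_prem2/AdventureWorksLT/SalesLT.Customer/2020/01/30/SalesLT.Customer14
```

Because the hour is baked into the file name, re-running the pipeline in a later hour produces a new file rather than overwriting the previous load.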

Within this dataset, I will also add the following parameters, which I will use at a later stage.

DS_ADLS2 Parameter Configure DS_ADLS2 parameter

DS_SQLSERVER

Next, I will add a dataset connection to my on-premises SQL Server. Note that I will leave the table set to 'None' to allow for traversing through all tables in the SQL Server.

DS_SQLSERVER Configure DS_SQLSERVER

DS_ASQLDB_PIPEINE_PARAMETER

My final dataset will be for a connection to my pipeline parameter table in an Azure SQL Database.

DS_ASQLDB_PIPEINE_PARAMETER Step to configure DS_ASQLDB_PIPEINE_PARAMETER

Create Azure Data Factory Pipelines

P_Insert_Base_Table_Info

This pipeline will query the on-premises INFORMATION_SCHEMA.TABLES view as its source to get the table name and database name and will then output the results to a basic parameters table in Azure SQL Database. The purpose of this pipeline is to populate the pipeline parameter table that will drive the rest of our pipelines.

To create this pipeline, I will add a copy activity to the pipeline canvas and set my source as my on-premises server dataset. I will also add the following query as the source. It queries the specified database and lists the tables that can be used in the rest of the process. For now, I will manually change the database name for each database I need and re-run the pipeline.

USE AdventureWorksLT
SELECT QUOTENAME(TABLE_SCHEMA) + '.' + QUOTENAME(TABLE_NAME) AS Table_Name,
       TABLE_CATALOG
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
P_Insert_Base_Table_Info Copy Data step to configure P_Insert_Base_Table_Info Copy Data activity
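Rather than editing the USE clause by hand each time, the per-database queries could also be generated up front. A rough Python sketch of that idea (the helper and the database names are illustrative, not part of the pipeline):

```python
# Template for the table-listing metadata query run against each database
METADATA_QUERY = """USE {db}
SELECT QUOTENAME(TABLE_SCHEMA) + '.' + QUOTENAME(TABLE_NAME) AS Table_Name,
       TABLE_CATALOG
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'"""

def metadata_queries(databases):
    # One query per source database, ready to paste into the copy activity
    return [METADATA_QUERY.format(db=db) for db in databases]

for q in metadata_queries(["AdventureWorksLT", "WideWorldImporters"]):
    print(q, end="\n\n")
```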

Next, I will set my Sink:

P_Insert_Base_Table_Info Sink step to configure P_Insert_Base_Table_Info sink

Lastly, I will ensure that my source to sink mappings are accurate:

P_Insert_Base_Table_Info Mapping step to configure P_Insert_Base_Table_Info mapping

The resulting pipeline_parameter table in the Azure SQL database would look like this:

pipeline_parameter table step to confirm the pipeline parameter table has data

P_SQL_to_ADLS_0

This pipeline will be executed from within P_SQL_to_ADLS_1 as part of the iteration that creates all tables within the specified database in ADLS Gen2. ADF v2 currently does not support nesting ForEach loops, which is why I will use this pipeline as a workaround and nest it within my main database-level ForEach loop.
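Conceptually, the workaround is two levels of iteration, with the inner loop hidden behind an Execute Pipeline call. A rough Python sketch of the control flow (function and variable names are illustrative, not ADF APIs):

```python
def p_sql_to_adls_0(database, tables):
    """Inner, table-level pipeline: ForEach table -> copy to ADLS Gen2."""
    return [f"copied {database}.{t}" for t in tables]

def p_sql_to_adls_1(catalog):
    """Outer, database-level pipeline: ForEach database, where an
    Execute Pipeline activity stands in for the disallowed nested loop."""
    results = []
    for database, tables in catalog.items():
        results += p_sql_to_adls_0(database, tables)  # Execute Pipeline
    return results

print(p_sql_to_adls_1({
    "AdventureWorksLT": ["SalesLT.Customer", "SalesLT.Product"],
    "WWIDB": ["Sales.Orders"],
}))
# → ['copied AdventureWorksLT.SalesLT.Customer',
#    'copied AdventureWorksLT.SalesLT.Product',
#    'copied WWIDB.Sales.Orders']
```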

Now I am ready to create my next pipeline. I will add a Lookup and Foreach activity to the pipeline canvas.

P_SQL_to_ADLS_0 step to configure P_SQL_to_ADLS_0 activities

The lookup activity simply looks up the pipeline parameter table that we populated in the prior pipeline.

P_SQL_to_ADLS_0 lookup step to configure  P_SQL_to_ADLS_0 settings

Next, within settings of the For Each activity, ensure that Items is set to:

@activity('Get-Tables').output.value
P_SQL_to_ADLS_0 foreach step to configure  P_SQL_to_ADLS_0 settings

I have also added a copy activity within the ForEach Activity. Click Edit Activities to view the details.

P_SQL_to_ADLS_0 Foreach Activities step to configure  P_SQL_to_ADLS_0 for each activities settings

My source dataset is the on-prem SQL Server.

P_SQL_to_ADLS_0 foreach settings step to configure  P_SQL_to_ADLS_0 for each copy table settings

Additionally, I will use the following dynamic source query which will look up the Table Catalog and Table Name:

USE @{item().Table_Catalog} SELECT * FROM @{item().Table_Name}
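For each row the Lookup returned, the ForEach activity binds item() to that row, so the expression above expands to one concrete query per table. A minimal Python sketch of that substitution (the lookup output shape and the row values are hypothetical):

```python
# Hypothetical shape of @activity('Get-Tables').output: 'value' holds one
# record per pipeline_parameter row; ForEach iterates over output.value.
lookup_output = {
    "count": 2,
    "value": [
        {"Table_Name": "[SalesLT].[Customer]", "Table_Catalog": "AdventureWorksLT"},
        {"Table_Name": "[SalesLT].[Product]", "Table_Catalog": "AdventureWorksLT"},
    ],
}

queries = [
    # @{item().Table_Catalog} and @{item().Table_Name} resolve to row columns
    f"USE {item['Table_Catalog']} SELECT * FROM {item['Table_Name']}"
    for item in lookup_output["value"]
]
for q in queries:
    print(q)
# → USE AdventureWorksLT SELECT * FROM [SalesLT].[Customer]
# → USE AdventureWorksLT SELECT * FROM [SalesLT].[Product]
```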

My sink dataset will be my specified ADLS2 account container and folder structure. I will also add table_name and Item_name to the dataset properties that we configured while creating the DS_ADLS2 dataset.

P_SQL_to_ADLS_0 sink settings step to configure  P_SQL_to_ADLS_0 sink settings

This pipeline is now complete, and we can move on to the final pipeline.

P_SQL_to_ADLS_1

This pipeline will start by getting the distinct database names from the pipeline parameter table and will then iterate over every database, executing the table-level pipeline once per database.

To begin this pipeline, similar to the last pipeline, I will add a Lookup and ForEach activity.

P_SQL_to_ADLS_1 settings step to configure  P_SQL_to_ADLS_1 activity settings

Within the settings tab of the lookup activity, I will set my source dataset to my Azure SQL DB pipeline parameter table and then add the following query as the source. This query gives me the distinct databases listed in my pipeline parameter table, which will be used in the downstream ForEach activity:

SELECT DISTINCT [Table_Catalog]
FROM [dbo].[pipeline_parameter]
P_SQL_to_ADLS_1 source settings step to configure  P_SQL_to_ADLS_1 source settings

The settings within the ForEach loop will contain the following for Items, which simply passes the database name to this Item value:

@activity('Get-Databases').output.value
P_SQL_to_ADLS_1 foreach settings step to configure  P_SQL_to_ADLS_01 foreach settings

Within the activities, I'll have one ExecutePipeline activity. Click Edit Activity to view the details.

P_SQL_to_ADLS_1 foreach activity settings step to configure  P_SQL_to_ADLS_1 foreach  activity settings

Note that the Execute Pipeline activity invokes the table-level pipeline, P_SQL_to_ADLS_0, that we created previously.

P_SQL_to_ADLS_1 ExecutePipeline settings step to configure  P_SQL_to_ADLS_1 ExecutePipeline settings

This concludes all of the necessary steps to create my pipelines.

Run the Data Factory Pipeline & Verify Azure Data Lake Store Gen 2 Objects

I will now run pipeline P_SQL_to_ADLS_1, which will invoke pipeline P_SQL_to_ADLS_0 from within its Lookup and ForEach activities.

After running the pipeline, I can see that there are two database-level folders for my two databases.

ADLS2 Folders for DBs Step to confirm that ADLS2 folders are created for both databases

I can also see the appropriate table-level folders:

ADLS2 Folders for Tables Step to confirm that ADLS2 folders are created for tables within the databases folders

Additionally, when I drill into the folders, I can see that they are appropriately time-partitioned and that the table file has been created.

ADLS2 file Step to confirm that ADLS2 partitioned folders and files are created
About the author
Ron L'Esteve is a seasoned Data Architect who holds an MBA and MSF. He has over 15 years of consulting experience with Microsoft Business Intelligence, data engineering, and emerging cloud and big data technologies.
