Import Data from Excel to Azure SQL Database using Azure Data Factory

By: Ron L'Esteve | Comments (2) | Related: > Azure Data Factory

Problem

The need to load data from Excel spreadsheets into SQL Databases has been a long-standing requirement for many organizations for many years. Previously, tools such as VBA, SSIS, C# and more have been used to perform this data ingestion orchestration process. Recently, Microsoft introduced an Excel connector for Azure Data Factory. Based on this new Excel connector, how can we go about loading Excel files containing multiple tabs into Azure SQL Database Tables?

Solution

With the new addition of the Excel connector in Azure Data Factory, we now have the capability of leveraging dynamic and parameterized pipelines to load Excel spreadsheets into Azure SQL Database tables. In this article, we will explore how to dynamically load an Excel spreadsheet residing in ADLS gen2 containing multiple Sheets into a single Azure SQL Table and also into multiple tables for every sheet.

Pre-Requisites

Create an Excel Spreadsheet

The image below shows a sample Excel spreadsheet containing four sheets containing the same headers and schema that we will use in our ADF Pipelines to load data in Azure SQL Tables.

Upload to Azure Data Lake Storage Gen2

This same Excel spreadsheet has been loaded to ADLS gen2.

LakeFile Excel File uploaded to DataLake

Within Data Factory, we can add an ADLS gen2 linked service for the location of the Excel spreadsheet.

Create Linked Services and Datasets

ADLSLinkedService Linked service for ADLS2

We'll need to ensure that the ADLS gen2 linked service credentials are configured accurately.

ExcelLinkedService Linked Service for Excel

When creating a new dataset, notice that we have Excel format as an option which we can select.

The connection configuration properties for the Excel dataset can be found below. Note that we will need to configure the Sheet Name property with the dynamic parameterized @dataset().SheetName value. Also, since we have headers in the file, we will need to check 'First row as header'.

ExcelConnection Excel connection details

Within the parameters tab, we'll need to add SheetName.

ExcelParams Excel parameters in connection

Next, a sink dataset to the target Azure SQL Table will also need to be created with a connection to the appropriate linked service.

Create a Pipeline to Load Multiple Excel Sheets in a Spreadsheet into a Single Azure SQL Table

In the following section, we'll create a pipeline to load multiple Excel sheets from a single spreadsheet file into a single Azure SQL Table.

Within the ADF pane, we can next create a new pipeline and then add a ForEach loop activity to the pipeline canvas. Next, click on the white space of the canvas within the pipeline to add a new Array variable called SheetName containing default values of all the sheets in the spreadsheet from Sheet1 through Sheet4, as depicted in the image below.

ExceltoSQLPipeline1 Display of the variables

Next, add @variables('SheetName') to the items property of the ForEach Settings.

ForEachSettings1 Foreach settings for the pipeline

Next, navigate into the ForEach activity and add a CopyActivity with source configurations as follows.

CopyDataSource1 Source settings for the copy data activity

Within the sink configurations, we'll need to set the table option property to 'Auto Create Table' since we currently do not have a table created.

CopyDataSink1 Sink settings for the copy data activity

After executing the pipeline, we can see that the four Sheets have been loaded into the Azure SQL Table.

PipelineResults1 Results for the pipeline execution

When we navigate to the Azure SQL Table and query it, we can see that the data from all the Excel Sheets were loaded into the single Azure SQL Table.

SQLSelectResults1 select statement to show the data

Create a Pipeline to Load Multiple Excel Sheets in a Spreadsheet into Multiple Azure SQL Tables

In this next example, we will test loading multiple Excel sheets from a spreadsheet into multiple Azure SQL Tables. To begin, we will need a new Excel lookup table that will contain the SheetName and TableName which will be used by the dynamic ADF pipeline parameters.

The following script can be used to create this lookup table.

CreateLookupTbl Script to create a lookup table.

SET ANSI_NULLS ON
GO
 
SET QUOTED_IDENTIFIER ON
GO
 
CREATE TABLE [dbo].[ExcelTableLookUp](
   [SheetName] [nvarchar](max) NULL,
   [TableName] [nvarchar](max) NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

Once the table is created, we can insert the SheetNames and corresponding TableNames into the table:

loadlookupandquery load the lookup table and query it

Next, we will also need to create a new dataset with a connection to the Excel Look up table.

SQLConnection2 connection for the excel sql lookup table

The connection properties of the Excel Spreadsheet will be similar to the previous pipeline where we parameterized SheetName as follows.

SpreadsheetSource Spreadsheet connection details.

In this scenario, we will also need to add a parameter for the TableName in the Azure SQL Database dataset connection as follows.

In the Azure SQL DB connection section, we'll leave the schema as hardcoded and would need to add the parameter for the TableName as follows.

ExcelDatasetSQL Dataset for the excel connection in sql

In this pipeline, we will also need a lookup table which will serve the purpose of looking up the values in the SQL lookup table through a select * lookup on the table.

The values from the lookup can be passed to the ForEach loop activity's items property of the settings tab, as follows:

ForEachLoop2 FE Look settings for the pipeline

Next, within the ForEachLoop activity, we'll need a Copy Data activity with the source dataset properties containing the parameterized SheetName value, as follows.

Next, the sink dataset properties will also need to contain the parameterized TableName value, as follows. Note that the table option is once again set to 'Auto Create Table'.

After we run this pipeline, we can see that the pipeline succeeded and four tables were created in the Azure SQL Database.

Pipeline2Results Results for the pipeline execution

Upon navigating to the Azure SQL Database, we can see that all four table were created with the appropriate names based on the TableName values we defined in the SQL Lookup table.

As a final check, when we query all four tables, we can see that they all contain the data from the Excel Sheets which confirms that the pipeline executed successfully and with the correct mappings of sheets to multiple tables which were defined in the lookup tables.

QueryTables2 Query all the tables that were created and loaded

Next Steps

For more details on the Excel Connector, read the Microsoft article - ADF Adds Connectors for Delta Lake and Excel.
For a list of all the other Azure Data Factory Connectors, read Azure Data Factory Connector overview.
Explore my article, Using SQL Server Integration Services to Generate Excel Files Based on Criteria which was built using SSIS and explore how to re-create this similar process in Azure Data Factory and also explore other capabilities and patterns to work with Excel files in ADF.

About the author

Ron L'Esteve is a trusted information technology thought leader and professional Author residing in Illinois. He brings over 20 years of IT experience and is well-known for his impactful books and article publications on Data & AI Architecture, Engineering, and Cloud Leadership. Ron completed his Master�s in Business Administration and Finance from Loyola University in Chicago. Ron brings deep tec

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips

Thursday, November 4, 2021 - 6:00:13 AM - satish	Back To Top (89409)
I am working on the same exercise for my company. The one doubt i have is the reason for Datalake gen 2 instead of table or blob storage. Any advantage provided by ADLS gen2? I believe its costlier.

Friday, August 20, 2021 - 4:48:56 AM - kim	Back To Top (89155)
Hi, thanks for the guide. What if some users during a day drops new version of the excel spreadsheet and when you execute the pipeline the import job should only import the latest sheet. Is there a setting for that? br kim