How to Load Multiple Files in Parallel in Azure Data Factory - Part 1


By:   |   Updated: 2020-01-23   |   Comments (7)   |   Related: More > Azure Data Factory


Problem

At our company we're using Azure Data Factory as the orchestrator for our data pipelines in Azure. We need to load flat files from various locations into an Azure SQL Database. The schema of the flat files can change per type of file and even the delimiter changes sometimes. Manually creating a dataset and a pipeline in ADF for each file is tedious and time consuming. Is there a better way to do this?

Solution

In Azure Data Factory, a dataset describes the schema and location of a data source, which are .csv files in this example. However, a dataset doesn't need to be so precise; it doesn't need to describe every column and its data type. You can also use it as just a placeholder for the .csv file type in general. For example, if you have two .csv files, the first with the comma as delimiter and the second one with the semicolon as delimiter, you can still use the same dataset. You only need to put a parameter on the definition of the delimiter.

Furthermore, using the For Each activity we can loop over metadata and load multiple files in parallel. When you combine both features, we can create a metadata-driven pipeline where we will load multiple types of flat file dynamically. In this tip, we will lay out the architecture for such a solution.

Test Set-up

We're going to load 3 files stored in Azure Blob Storage into an Azure SQL DB. The files contain data about the top 250 movies. The first file contains the first 100 movies, using a semicolon:

file 1

The second file contains the next 200 movies, also with a semicolon:

file 2

The third file contains the last 50 movies. However, the file was created on a different system and there the comma was used as a delimiter:

file 3

The three files are stored inside an Azure Blob Container. The first two in a virtual folder called semicolon, the last file in a virtual folder named comma.

source folders

Download the files here.

This is our source data. To be able to drive our data factory pipeline dynamically, we need metadata. This metadata is stored in a table using the key-value pair method. This allows us to story any type of metadata in the table. The table has the following schema:

CREATE TABLE dbo.Metadata_ADF(
    ID INT IDENTITY(1,1) NOT NULL,
    SourceType VARCHAR(50) NOT NULL,
    ObjectName VARCHAR(500) NOT NULL,
    ObjectValue VARCHAR(1000) NOT NULL
);

To load the files, we need the following metadata:

  • The location in the container, aka the virtual folder.
  • The delimiter used
  • The destination table name

The following SQL scripts inserts the metadata into the table:

INSERT INTO dbo.[Metadata_ADF]
(
    [SourceType]
   ,[ObjectName]
   ,ObjectValue
)
VALUES  ('BlobContainer','semicolondata','semicolon')
        ,('BlobContainer','commadata','comma');
 
INSERT INTO dbo.[Metadata_ADF]
(
    [SourceType]
   ,[ObjectName]
   ,ObjectValue
)
VALUES  ('Delimiter','semicolondata',';')
        ,('Delimiter','commadata',',');
 
INSERT INTO dbo.[Metadata_ADF]
(
    [SourceType]
   ,[ObjectName]
   ,ObjectValue
)
VALUES  ('SQLTable','semicolondata','topmovies_semicolon')
        ,('SQLTable','commadata','topmovies_comma');

The destination tables are created using the following DDL:

CREATE TABLE [dbo].[topmovies_semicolon](
   [Index] [BIGINT] NULL,
   [MovieTitle] varchar(500) NULL
);
 
CREATE TABLE [dbo].[topmovies_comma](
   [Index] [BIGINT] NULL,
   [MovieTitle] varchar(500) NULL
);

The ADF Pipeline

Step 1 – The Datasets

The first step is to add datasets to ADF. Instead of creating 4 datasets: 2 for blob storage and 2 for the SQL Server tables (each time one dataset for each format), we're only going to create 2 datasets. One for blob storage and one for SQL Server. For each dataset, the relevant attributes will be updated dynamically at runtime by parameters. Let's start with the blob dataset.

In the ADF editing environment, click on the ellipsis next to the datasets header to add a new dataset:

add new dataset

Choose Azure Blob Storage as the data store:

azure blob storage as data store

In the next screen, you need to choose the format of your data, which is DelimitedText in our example.

choose file format

At the time of writing, this is new functionality. In older tips, such as Connect to On-premises Data in Azure Data Factory with the Self-hosted Integration Runtime - Part 2, you can see this step is not present when a dataset on blob storage is created.

Next, we need to specify a name and if the first row contains headers.

set dataset properties

For the Linked service, you can either choose an existing service or create a new one (an example is given in this tip). In the next screen, click on Browse to choose the container.

configure file path

Using the dialog, navigate to the correct container.

choose correct container

Once the dataset is created, go to the Schema tab.

csv dataset created

Here we can specify the metadata of the columns. It might be possible that ADF automatically picked up some metadata about the columns, if a file was present. Since we want a truly dynamic dataset, we're going to clear all metadata.

clear metadata of the schema

For this to work, the destination table needs to have the exact same columns as the source files. Next, we add some parameters to the dataset in the Parameters tab. We need a parameter for the folder and a parameter for the delimiter:

specify dataset parameters csv

In the Connection tab, we can assign the parameters to specific fields of the configuration. When you click for example in the Directory field, a link will appear to add dynamic content.

add dynamic content link

When you click on this link, you can simply select the parameter from the list:

add parameter for the folder

This will add an expression to the text box at the top:

parameter expression

Click Finish to go back to the connection tab. Similarly, we can add a parameter for the delimiter. But first we have to check the Edit checkbox, so the dropdown changes into a text box.

edit delimiter

When both parameters are assigned, the connection tab should look like this (observe there's no value specified for the file name):

parameterized connection

We also need a dataset for our sink, which is SQL Server. Add a new dataset and choose Azure SQL Database as the data store:

azure sql db dataset

Specify a name for the dataset, create a linked service or choose an existing one and do not import the schema. The table name can be left empty.

set properties

For this dataset, we need to create one parameter: the table name.

add table name parameter

In the Connection tab, choose Edit beneath the table, specify dbo as the schema and add the parameter in the second box as the table name.

connection tab

Conclusion

This concludes the first part of the tip. We have created a key-value pair table in SQL Server holding all of the metadata. This metadata will populate the parameters of the datasets we created. Stay tuned for the second part of this tip, where we will create the metadata-driven pipeline which will copy multiple blob files to a database at once.

Next Steps


Last Updated: 2020-01-23


get scripts

next tip button



About the author
MSSQLTips author Koen Verbeeck Koen Verbeeck is a BI professional, specializing in the Microsoft BI stack with a particular love for SSIS.

View all my tips





Comments For This Article




Wednesday, May 13, 2020 - 2:41:32 AM - Koen Verbeeck Back To Top (85641)

Hi Daniel,

Maybe the incremental load pattern is more your use case? Check out this tip:
https://www.mssqltips.com/sqlservertip/6365/incremental-file-load-using-azure-data-factory/

I think you have to select "recursively" to traverse any subfolder.

Regards,
Koen


Tuesday, May 12, 2020 - 6:07:08 PM - Daniel Halabe Back To Top (85639)

Hi Koen, 

I've been struggling to figure out how to do this with .csv files that are structured the same way and placed in blob storage every day in a nightly export. 

Any chance you could point me in the right direction? I figure this would be easy to you as the above example is a much more complicated scenario. 


Monday, February 17, 2020 - 4:32:34 AM - Koen Verbeeck Back To Top (84569)

Hi Sam,

it's already here!

https://www.mssqltips.com/sqlservertip/6282/azure-data-factory-multiple-file-load-example--part-2/


Wednesday, February 12, 2020 - 11:57:11 PM - Sam Back To Top (84410)

Hi there, looks great! When is part 2 coming?


Friday, January 24, 2020 - 9:56:48 AM - Jeremy Kadlec Back To Top (83974)

Everyone,

The supporting files for this tip have been included as URLs.  Please reference them to test the logic of the tip.

Thank you,
Jeremy Kadlec
MSSQLTips.com Community Co-Leader


Friday, January 24, 2020 - 3:06:01 AM - Koen Verbeeck Back To Top (83971)

Hi Michael,

The files are available for download in part 2. I'll ask if they can also be included in this part.

Regards,
Koen


Thursday, January 23, 2020 - 2:58:01 PM - Michael Kirkpatrick Back To Top (83963)

It would be great if we could download the CSV files.



download





Recommended Reading

Azure Data Factory Pipeline Email Notification Part 1

Azure Data Factory Lookup Activity Example

Azure Data Factory Get Metadata Example

Azure Data Factory vs SSIS vs Azure Databricks

Getting Started with Delta Lake Using Azure Data Factory








get free sql tips
agree to terms


Learn more about SQL Server tools