Incremental File Load using Azure Data Factory


By:   |   Updated: 2020-04-03   |   Comments   |   Related: More > Azure

Problem

I want to load data from different flat files stored in Azure Blob Storage to an Azure SQL Database. However, I only want to process new files. Files which already have been processed should be ignored. I cannot just delete them, since they are used by other processes as well. Can I achieve this requirement with Azure Data Factory?

Solution

With Azure Data Factory (ADF), we can copy data from a source to a destination (also called sink) using the Copy Data activity. In the source settings of this activity we can see there are a modified start and end datetime we can use to filter the files.

modified start and end date

The question is: what do we need to enter in those text boxes? Luckily, ADF provides us with a wizard and a template to help us get started.

The Azure Data Factory Copy Data Tool

The Copy Data Tool provides a wizard-like interface that helps you get started by building a pipeline with a Copy Data activity. It also allows you to create dependent resources, such as the linked services and the datasets (for more information about these concepts, check out this tip - Azure Data Factory Overview).

In this tip, we’ll load data from a CSV file into a table in Azure SQL DB. The sample files can be downloaded here, while the destination table can be created with the following DDL statement:

CREATE TABLE [dbo].[topmovies](
   [Index] [BIGINT] NULL,
   [MovieTitle] [NVARCHAR](500) NULL
);

On the home page of ADF, choose Copy data.

choose copy data wizard

In the first screen, name the task. This will be the name of the pipeline later on. You also need to choose a schedule. For incremental load to work, you need to choose a regularly schedule. A one-time run will not work and any configurations for incremental load will be disabled in the later steps.

assign name and schedule

Regarding the schedule, you can either choose between a "normal schedule" as the trigger type, or a tumbling window. Here we can see a regular schedule:

regular schedule

And here we have a tumbling window:

tumbling window schedule

The tip Create Tumbling Window Trigger in Azure Data Factory ADF dives a bit deeper into tumbling windows and the difference with a schedule trigger.

In the next screen, you can choose between existing linked services, or the option to create a new one. If you don’t have a connection to Azure Blob Storage, you need to create it here. An example is given in the tip Transfer On-Premises Files to Azure Blob Storage.

choose source connection

Once the connection to Azure Blob Storage is created, you need to specify the location of the files you want to copy:

choose input folder

In the same screen, you can also specify the loading behavior. In our case, we definitely want the "Incremental load: LastModifiedDate" option.

specify file loading behavior

If you’ve chosen a regular schedule, you can see the start and end of the window.

window start and end parameter

In the example above, the schedule runs every 24 hours, so every file with a modified date between the schedule time and the schedule time minus 24 hours will be picked up by ADF. In the case of a tumbling window, no parameters are shown. They’ll show up in a later step though.

file loading behavior

As the last step in the source configuration, you need the file format. If you’ve specified a folder or a specific file, you can choose the Detect text format, which will detect the settings for you and render a preview of the file at the bottom.

file format settings

On the other hand, if you specified a wildcard pattern, such as myfolder/*.csv, you’ll get an error message:

auto detect error

In step 3 of the wizard, we’ll configure the destination. Choose Azure SQL DB from the list of available linked services, or create a new connection.

choose destination

In the next step, choose the destination table:

choose destination table

If the metadata is known from the source file, the columns are mapped automatically onto those of the table:

 
column mapping

For a tumbling window, you can find the window parameters here. In step 4 of the copy data tool, you can specify a couple of data movement settings. You can leave everything to the default settings.

data movement settings

Step 5 gives you an overview of the to-be generated pipeline:

pipeline overview

In the last step, everything is created and published to the ADF environment:

If we take a look at the Copy Data activity in the generated pipeline, we can see the tumbling window parameters are filled in in the source settings:

parameters provided

The first time the pipeline runs, it will only pick up files if there’s a modified data after the start date of the schedule. Files already present will be ignored.

initial run

If a new file is added to the blob container, it will be picked up in the next run:

next run with one file copied

Using an Azure Data Factory Pipeline Template

Another option to create a pipeline with this incremental load pattern is using a template. On the home page, choose Create pipeline from template.

create pipeline from template

In the template gallery, choose the Copy new files only by LastModifiedDate template.

pick template

On the next screen, you can choose the source and the destination connections. A big difference with the Copy Data Tool is the template only supports binary sources, which rules out Azure SQL DB as a destination.

select source and destination

The template will create a pipeline with a Copy Data activity, together with a bunch of pipeline parameters:

pipeline with parameters

It’s up to the user to build a framework using this pipeline and parameters in order to implement the incremental load pattern.

Conclusion

In Azure Data Factory, we can copy files from a source incrementally to a destination. This can either be achieved by using the Copy Data Tool, which creates a pipeline using the start and end date of the schedule to select the needed files. The advantage is this setup is not too complicated. The disadvantage is that if files already exist before the schedule starts, they are not copied to the destination. Or, if you want to reload all files, you have to change the schedule (tumbling windows deal best with this scenario) or by re-copying the files so their modified date changes.

The other option is using the template. This sets the foundation of the incremental load pattern, but leaves everything else to the developer. The advantage is you can probably build this out to a more flexible framework than the one used by the Copy Data Tool.

Next Steps
  • If you want to follow along, you can find the sample data here.
  • There are already quite some tips on Azure Data Factory on the MSSQLTips website. You can find an entire list here.
  • For more Azure tips, check out these articles.


Last Updated: 2020-04-03


get scripts

next tip button



About the author
MSSQLTips author Koen Verbeeck Koen Verbeeck is a BI professional, specializing in the Microsoft BI stack with a particular love for SSIS.

View all my tips
Related Resources





Comments For This Article





download


Recommended Reading

Adding Users to Azure SQL Databases

Azure Data Factory vs SSIS vs Azure Databricks

Azure Data Factory Pipeline Email Notification Part 1

Connect to On-premises Data in Azure Data Factory with the Self-hosted Integration Runtime - Part 1

Continuous database deployments with Azure DevOps





get free sql tips
agree to terms


Learn more about SQL Server tools