By: Koen Verbeeck
Problem
In this two-part tip, we create a metadata-driven pipeline that copies multiple flat files from Azure Blob Storage to an Azure SQL Database. The flat files can have different delimiters, they are stored in different folders, and there are multiple destination tables as well.
Solution
In part 1 of this tip, we created the metadata table in SQL Server and we also created parameterized datasets in Azure Data Factory. In this part, we will combine both to create a metadata-driven pipeline using the ForEach activity.
If you want to follow along, make sure you have read part 1 for the first step.
Step 2 – The Pipeline
With the datasets ready, we can now start on the pipeline. The first action is retrieving the metadata. In a new pipeline, drag the Lookup activity to the canvas.
![pipeline lookup](/tipimages2/6282_load-multiple-files-in-parallel-in-azure-data-factory.001.png)
With the following query, we can retrieve the metadata from SQL Server:
```sql
SELECT  b.[ObjectName]
       ,FolderName = b.[ObjectValue]
       ,SQLTable   = s.[ObjectValue]
       ,Delimiter  = d.[ObjectValue]
FROM [dbo].[Metadata_ADF] b
JOIN [dbo].[Metadata_ADF] s ON b.[ObjectName] = s.[ObjectName]
JOIN [dbo].[Metadata_ADF] d ON b.[ObjectName] = d.[ObjectName]
WHERE b.[SourceType] = 'BlobContainer'
  AND s.[SourceType] = 'SQLTable'
  AND d.[SourceType] = 'Delimiter';
```
The advantage of parameterized datasets is that you can use them dynamically, but you can also reuse a dataset for any query against the same database. This means we can use the dataset we created before as a placeholder for this metadata query (as long as the metadata table and the destination tables are in the same database, of course).
![specify lookup metadata query](/tipimages2/6282_load-multiple-files-in-parallel-in-azure-data-factory.002.png)
All we have to do is change the Use query property to Query and fill in the query above. Since the query returns more than one row, don’t forget to deselect the First row only checkbox. We did define a parameter on the table name though, so we need to specify some value; otherwise an error is returned. This can be solved by specifying a dummy value, such as _notSet in this example.
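In the pipeline JSON, the Lookup activity configuration might look like the sketch below. The activity name (Retrieve Metadata) and the dummy TableName value are taken from this tip; the exact property layout, the elided query text, and the choice of dataset reference are illustrative.

```json
{
  "name": "Retrieve Metadata",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT b.[ObjectName], FolderName = b.[ObjectValue], ... FROM [dbo].[Metadata_ADF] b ..."
    },
    "dataset": {
      "referenceName": "sql_movies_dynamic",
      "type": "DatasetReference",
      "parameters": { "TableName": "_notSet" }
    },
    "firstRowOnly": false
  }
}
```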
We can preview the data:
![metadata preview](/tipimages2/6282_load-multiple-files-in-parallel-in-azure-data-factory.003.png)
If the dummy table name was not specified, the following error would have been returned by the preview:
![preview error](/tipimages2/6282_load-multiple-files-in-parallel-in-azure-data-factory.004.png)
Next, we add a ForEach iterator to the pipeline and connect it with the Lookup activity.
![add foreach](/tipimages2/6282_load-multiple-files-in-parallel-in-azure-data-factory.005.png)
On the settings pane of the ForEach activity, click on Add dynamic content.
![settings foreach dynamic content](/tipimages2/6282_load-multiple-files-in-parallel-in-azure-data-factory.006.png)
Here we’re going to select the output of the Lookup activity.
![add output to expression](/tipimages2/6282_load-multiple-files-in-parallel-in-azure-data-factory.007.png)
At the end of the expression, add .value so the full expression becomes:
@activity('Retrieve Metadata').output.value
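To see why .value is appended: the Lookup activity returns an object whose value property holds the array of rows, and the ForEach iterator needs that array, not the wrapping object. With the two folders from this tip, the output looks roughly like the sketch below (the ObjectName, FolderName, and SQLTable values are illustrative):

```json
{
  "count": 2,
  "value": [
    { "ObjectName": "movies_semicolon", "FolderName": "semicolon", "SQLTable": "movies_1", "Delimiter": ";" },
    { "ObjectName": "movies_comma",     "FolderName": "comma",     "SQLTable": "movies_2", "Delimiter": "," }
  ]
}
```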
The settings tab should now look like this:
![settings tab finished](/tipimages2/6282_load-multiple-files-in-parallel-in-azure-data-factory.008.png)
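In the underlying pipeline JSON, these settings correspond to something like the following sketch (the ForEach name is illustrative and the activities array is left empty here):

```json
{
  "name": "ForEach1",
  "type": "ForEach",
  "dependsOn": [
    { "activity": "Retrieve Metadata", "dependencyConditions": [ "Succeeded" ] }
  ],
  "typeProperties": {
    "items": {
      "value": "@activity('Retrieve Metadata').output.value",
      "type": "Expression"
    },
    "activities": [ ]
  }
}
```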
Go to the Activities tab and click on Add activity.
![add activity to foreach](/tipimages2/6282_load-multiple-files-in-parallel-in-azure-data-factory.009.png)
This will take you to a new pipeline canvas, where we add the Copy Data activity.
![load multiple files in parallel in azure data factory 010](/tipimages2/6282_load-multiple-files-in-parallel-in-azure-data-factory.010.png)
You can go back to the original pipeline by selecting the DynamicPipeline link at the top of the canvas. Go to the Source tab of the Copy Data activity and select the csv_movie_dynamic dataset. You have to specify the parameter values for the FolderName and the DelimiterSymbol parameters. This can be done using the following expression:
@{item().ObjectValue}
Here ObjectValue is a metadata column from the Lookup activity. You need to replace it with the actual column name you need. The tab for the source then becomes:
![source configuration](/tipimages2/6282_load-multiple-files-in-parallel-in-azure-data-factory.011.png)
A wildcard for the file name was also specified, to make sure only CSV files are processed. For the sink, we need to specify the sql_movies_dynamic dataset we created earlier.
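Put together, the source side of the Copy Data activity might look like the fragment below. The metadata column names (FolderName, Delimiter) come from the aliases in the Lookup query; the store-settings layout is a sketch:

```json
"inputs": [
  {
    "referenceName": "csv_movie_dynamic",
    "type": "DatasetReference",
    "parameters": {
      "FolderName": "@{item().FolderName}",
      "DelimiterSymbol": "@{item().Delimiter}"
    }
  }
],
"typeProperties": {
  "source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
      "type": "AzureBlobStorageReadSettings",
      "wildcardFileName": "*.csv"
    }
  }
}
```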
![sink configuration](/tipimages2/6282_load-multiple-files-in-parallel-in-azure-data-factory.012.png)
Here, we need to specify the parameter value for the table name, which is done with the following expression:
@{item().SQLTable}
It’s also possible to specify a SQL statement that is executed at the start of the Copy Data activity, before any data is loaded. In the Pre-copy script field, we can use an expression that truncates the destination table:
TRUNCATE TABLE dbo.@{item().SQLTable}
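On the sink side, the same pattern applies. A sketch of the relevant fragment (the property layout is illustrative; the parameter expression and pre-copy script are the ones used above):

```json
"outputs": [
  {
    "referenceName": "sql_movies_dynamic",
    "type": "DatasetReference",
    "parameters": { "TableName": "@{item().SQLTable}" }
  }
],
"typeProperties": {
  "sink": {
    "type": "AzureSqlSink",
    "preCopyScript": "TRUNCATE TABLE dbo.@{item().SQLTable}"
  }
}
```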
Make sure no mapping is specified at all in the Mapping tab. Again, this set-up will only work if the columns in the flat file are the same as in the destination table.
![no mapping](/tipimages2/6282_load-multiple-files-in-parallel-in-azure-data-factory.013.png)
Step 3 – Debugging The Pipeline
Go back to the original pipeline (containing the Lookup and the ForEach) and hit the debug button at the top. If everything runs successfully, you’ll get an output like this:
![success debug](/tipimages2/6282_load-multiple-files-in-parallel-in-azure-data-factory.014.png)
By clicking on the output arrow for the Retrieve Metadata activity, we can view the metadata retrieved from the SQL Server database:
![output lookup](/tipimages2/6282_load-multiple-files-in-parallel-in-azure-data-factory.015.png)
For the ForEach iterator, we can verify two items were declared in the input:
![input data](/tipimages2/6282_load-multiple-files-in-parallel-in-azure-data-factory.016.png)
For the input of one of the CopyData activities, we can verify the truncate table statement for the sink was constructed correctly:
![input copydata](/tipimages2/6282_load-multiple-files-in-parallel-in-azure-data-factory.017.png)
For the output of the CopyData activity (for the files with a semicolon), we can see two files were indeed processed in parallel:
![copydata output semicolon](/tipimages2/6282_load-multiple-files-in-parallel-in-azure-data-factory.018.png)
While for the other CopyData activity, only one file (with a comma) was processed:
![copydata output comma](/tipimages2/6282_load-multiple-files-in-parallel-in-azure-data-factory.019.png)
When you click on the glasses next to one of the CopyData activities in the debug output, you can see more performance-related information:
![copydata performance](/tipimages2/6282_load-multiple-files-in-parallel-in-azure-data-factory.020.png)
As you can see, two files were processed with a total of 200 rows for that one specific CopyData activity.
It’s important to note that there are actually two parallel operations in this set-up:
- The ForEach loop will process each blob folder separately in parallel (semicolon files vs comma files).
- In an iteration of the ForEach loop, the CopyData activity itself will process all blob files found in one folder also in parallel (2 files found with semicolon data).
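Both levels of parallelism can be tuned. On the ForEach, isSequential and batchCount control how many iterations run concurrently, while on the Copy Data activity, parallelCopies caps how many files are read at once. A sketch (the numeric values are examples, not recommendations):

```json
{
  "type": "ForEach",
  "typeProperties": {
    "isSequential": false,
    "batchCount": 20,
    "activities": [
      {
        "type": "Copy",
        "typeProperties": {
          "parallelCopies": 4
        }
      }
    ]
  }
}
```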
Conclusion
In this tip, we saw how to build a metadata-driven pipeline in Azure Data Factory. Metadata stored in a key-value pair table in SQL Server populates the parameters of the different datasets. Because of those parameters, a single dataset can dynamically pick up multiple types of files or write to different tables in a database. The ForEach iterator loops over the metadata and executes a CopyData activity for each item.
Next Steps
- If you want to follow along, you can download the flat files with the movie titles here. Make sure you also have read part 1 of this tip series.
- For more information about the activities in Azure Data Factory, check out the following tips:
- You can find more Azure tips in this overview.
About the author
![MSSQLTips author Koen Verbeeck](/images/Koen-Verbeeck-2018-2.png)