Implement Incremental Ingestion for Large Fact Tables on Lakehouse using Synapse Mapping Data Flow - Part 2
This tip is part of a series of posts dedicated to the building of end-to-end Lakehouse solutions leveraging Azure Synapse Analytics. In the previous post, Incremental Ingestion of Fact Tables on Lakehouse using Azure Synapse Analytics Mapping Data Flow - Part 1, we've explored the ways to build incremental data upload logic and created a mapping flow to fetch the data for two fact tables for a specified date range. In this tip, we'll create a data integration pipeline that divides historical data into smaller pieces based on certain time intervals and calls the mapping flow for each interval.
Our incremental logic will be based on ModifiedDate column, and because original data coming with the Adventure Works has single value, let's connect to Azure SQL DB and run below query, to introduce a date variations spanning a few weeks:
UPDATE [SalesLT].[SalesOrderHeader] SET ModifiedDate='2022-01-01' WHERE SalesOrderID<71831; UPDATE [SalesLT].[SalesOrderHeader] SET ModifiedDate='2022-01-08' WHERE SalesOrderID between 71831 AND 71901 UPDATE [SalesLT].[SalesOrderHeader] SET ModifiedDate='2022-01-17' WHERE SalesOrderID between 71902 AND 71920 UPDATE [SalesLT].[SalesOrderHeader] SET ModifiedDate='2022-01-25' WHERE SalesOrderID >71920
Let's also run Azure SQL DB to Raw and Raw to Bronze layer pipelines we created in earlier posts to fetch the data to the Bronze layer. Next, run the following SQL script to ensure that the Bronze table has received the latest updates:
SELECT TOP 100 * FROM OPENROWSET( BULK 'https://synstg.blob.core.windows.net/syn-fs/delta/bronze/SalesOrderHeader/', FORMAT = 'DELTA' ) AS [result] order by modifieddate desc
Here's is the screenshot with the results:
Data Integration Pipeline Design
Let's create a data integration pipeline and add two parameters of the string type- WindowStart and WindowEnd. These parameters needed to receive trigger window timestamps. Here's the pipeline with the parameters:
Next, add a data flow activity and point it to the data flow BronzeSilverSalesOrders created in the previous post:
Next, navigate to the Parameters tab and provide the following pipeline expressions as the parameter values:
- WindowStart [email protected]().parameters.MyStartDate
- WindowEnd - @pipeline().parameters.MyEndDate
These expressions will pass the pipeline parameters to the mapping data flow.
Here is the screenshot:
As mentioned earlier, our purpose is to design the orchestration that could handle both historical and future time periods. Out of various of trigger types available in Azure Synapse Analytics only the tumbling triggers allow creating backdated window ranges. Because my historical data spans several weeks in January 2022, I'm going to create a weekly trigger starting at January 1, 2022.
So, let's create a trigger, as follows:
Select tumbling trigger type, set starting date as ‘01/01/2022', select hourly frequency, enter 168 hours and ensure that Start trigger on creation checkbox is selected, as follows:
Expand the Advanced panel and limit the max concurrency setting to one- this will ensure that multiple instances pipeline does not run simultaneously. This is important because Delta Lake tables do not allow concurrent writes. Here's the screenshot:
Confirm your changes, and move to the trigger parameters screen. Enter following expressions for the pipeline parameters:
- MyStartDate - @formatDateTime(trigger().outputs.windowStartTime,'yyyy-MM-dd')
- MyEndDate - @formatDateTime(trigger().outputs.windowEndTime,'yyyy-MM-dd')
These expressions convert the trigger's window boundaries into string values and assign them to the corresponding pipeline parameters. Save the trigger settings, commit and publish your changes.
Once the trigger is published, we can navigate to the Monitor page and find multiple instances of this pipeline running sequentially, as follows:
We can also see details of each execution, by opening the corresponding link and going into data flow details:
My data flow execution results for Jan 1 - Jan 8 window looks like this:
Finally, let's validate the data in the destination Delta table, by running following SQL script:
SELECT TOP 100 * FROM OPENROWSET( BULK 'https://synstg.blob.core.windows.net/syn-fs/delta/silver/SalesOrderHeader/', FORMAT = 'DELTA' ) AS [result]
Here is the screenshot with the results:
This concludes the data pipeline design to allow incremental ingestion of the historical data, and this trigger will also kick off on a weekly interval for future periods.
- Read: Common Data Warehouse Development Challenges
- Read: Create a trigger that runs a pipeline on a tumbling window
- Read: Mapping data flows in Azure Data Factory
About the author
View all my tips
Article Last Updated: 2022-04-21