Building Data Pipelines with Azure Data Factory to Move Data
By: Koen Verbeeck
In this video we look at how you can use Azure Data Factory to build a workflow to load and move data in Azure.
- Azure Data Factory Overview
- Transfer Data to the Cloud Using Azure Data Factory
- How to Load Multiple Files in Parallel in Azure Data Factory - Part 1
- Azure Data Factory Multiple File Load Example - Part 2
Welcome to this video of moving data around in Azure, today's topic is about Azure Data Factory. My name is Koen Verbeeck and I work as a senior business intelligence professional at AE in Belgium. I write some articles for MSSQLTips.com. I have some certifications. And I've been a Microsoft data platform MVP for a couple of years now. If you have any questions about the video, you can put them in the comments below or you can contact me on Twitter, LinkedIn, or my blog.
So the topic of the Azure Data Factory, which called the cloud ETL, the integration services in Azure. Very important, today I'm going to talk only about Azure Data Factory version 2.0. We don't even talk about version one anymore. It's obsolete. Don't work with it. Always go for version two or later.
Azure Data Factory is a data integration service, and you can look at it like ELT or ETL without writing any code and of course without any code means you can maybe from time to time you have to write an expression but you don't have to write lengthy scripts. It's a user interface, just like you're used to with integration services. You build scalable data pipelines. So you can move data around and load data from sources Azure Data Factory can look, for example, from Blob storage, or data lake and you can dump the data into Azure SQL Database or other destinations. It also supports Git integration, however, it only supports Azure DevOps or GitHub, and those are the only two supports for the moment. So there's, for example, no Bitbucket integration, but the Git integration is good. It means you can source control everything, you can work at branches and you can have a DevOps solution who work with your environments.
A very important concept in Azure Data Factory are linked services. And you can compare those as the connection managers in integration services, basically the same thing. And I've listed at the time I made a slide, those are all the services you can connect to with Azure Data Factory. Most of them are probably too small to read, but just to show you, okay there's a lots of them you have Hadoop and big data related to them. You have of course, lots of them in Azure like Azure database, Azure data lake, Azure Blob storage. We also have some from Amazon, Amazon Redshift, S3, Apache Impala, and lots of other services. If the service is not listed here, you cannot connect with Azure Data Factory and you have to maybe look for another solution. One can be for example, you have to use an Azure Function or an Azure Logic App to move the data which you can learn about in other videos in this course.
Once you've created a link service, you can also create datasets. And dataset is a definition of a what type of data do we have? It can be, for example, a SQL Server table or it can be a file, a CSV file or a JSON file somewhere on blob storage. And datasets says it's okay, what is this data? What are the columns? What are the names of the columns? What are data types? Where can I find this data? What is the schema? And so on. So this is a dataset. And then you build pipelines. Pipelines are a bit like the control flow in integration service. So you have the source on the left, and you do a specific series of tasks all the way to the right, where it drops the data and do some other stuff and you can do all sorts of things. And you can say, okay, if this tasks fails, and into this task, if it leads then you do this and so on. So basic control flow functionality.
All right. At the left, you can see the activities and these are all that you can do and you can also see the Databricks, for example, is there as well. So you can execute a Databricks notebook or a JAR file using Azure Data Factory and lots of other activities available as well. A pipeline is triggered by introducing a schedule or other type of triggers. So you can have different types of schedules like normal SQL Server Agent scheduled run every day at 8 a.m. You can also have a tumbling window schedule. You can also have like a blob storage trigger, so if a new file is added, it will trigger this pipeline. And so on. And at the right side, you can also see there's a button view at the code because everything you do in Azure Data Factory generates basically JSON files. So which makes Azure Data Factory very easy to automate and generate, in contrast with integration services where it has very big clunky ETL file.
Here is just very simple, JSON. It doesn't even contain the layout information of the pipeline. All right, let's take a quick look at Azure Data Factory. So this is the home screen of Azure Data Factory, you need to create your Azure Data Factory environment in the Azure portal but once it's finished, you will get a link and you will arrive in this area in a factory environment, where you can create your pipelines, set up a code repository using it, and so on. There's also some videos available to help you get started and it also some templates and portals also to get you started.
Let's go to the Author page. So here we have a couple of pipelines and you can see a pipeline here at the right. And you have all this task and if you click on a task, you can configure it. Okay, these are the connections you need to connect, execute this specific stored procedure, in this case, do something wrong, and so on and so on. And here we can see if all other types of tasks and you can have the if it fails, or succeeds, you can do specific actions. And you have like general settings like timeout, and so on. And a web activity, for example, like this one can be used to trigger a logic app or an Azure Function or stuff like that, or maybe call an API. And here at the left, we can see all the available activities.
Another great functionality of Azure Data Factory is that it supports dynamic pipelines. What do I mean with that? So you read some metadata from, for example, a SQL Server table. And then you can have a ForEach Loop and it will be use this metadata. For example, in this case, it's going to retrieve the metadata of like four Logic Apps, for example. And then they're going to execute those four Logic Apps at the same time in this ForEach loop, click here on go to the activities, and in this case, okay, it's writing some data. So you have a Source, and you have a Sink, so it's going to read CSV file, different kinds of CSV files. And is going to write it to the Sink in this case, SQL Server. And for this to work, you need dynamic datasets. And if you take a look at this one, you can see for example, it just has a connection. But some of the items have parameters like the folder name and the delimiter and stuff like that. And there's no schema. So you can read any type of CSV file, you can parameterize everything send it to that ForEach Loop in the pipeline, and so executes with all of those in parallel, so you need to create only one pipeline to transfer as many CSV files as you want.
So this is in short, a very short introduction to pipelines, to Azure Data Factory. If you want more information there are a lot of articles on MSSQLTips.com about Azure Data Factory. Now, let's go back to the slides. Here are some of those Azure Data Factory tips. So how to load multiple files in parallel, that's the dynamic situations that I talked about how to load multiple files and so on. And the top link is just an overview of what Azure Data Factory is all about. And there are many, many more tips on MSSQLTips.com about Azure Data Factory. So I encourage you to definitely check those out. All right. Thank you for listening and see you in the next video.