By: Koen Verbeeck
In this video we look at some additional features of using Azure Data Factory to build a workflow to load and move data in Azure.
Recommended Reading
- Azure Data Factory Mapping Data Flow for Datawarehouse ETL
- Configure an Azure SQL Server Integration Services Integration Runtime
- Connect to On-premises Data in Azure Data Factory with the Self-hosted Integration Runtime - Part 1
- Executing Integration Services Packages in the Azure-SSIS Integration Runtime
- Migrating SQL Server Integration Services to the Cloud (webcast)
Video Transcript
Hello, and welcome everyone to this video series about moving data around in Azure. This video will talk about some extra features of Azure Data Factory. But let me introduce myself first. My name is Koen Verbeeck, I work as a Senior Business Intelligence Professional for company AE in Belgium. I have some certifications, I write a lot of articles for MSSQLTips.com, and I'm a Microsoft Data Platform MVP for a couple of years now. If you have any questions, please drop them below in the comments section, or you can contact me on Twitter, LinkedIn, or on my blog.
Now, as I mentioned, we will be talking about Azure Data Factory. We already introduced Azure Data Factory itself in a previous video, but there are some extra features which deserve some spotlight as well. One of those features is the mapping data flow. I can compare this data flow with the data flow of integration services. So while a normal pipeline ADF is used for administrating tasks, and orchestrating them - and saying, okay, if this fails, then run this, if that succeeds, go to this activity - this data flow will actually read data into memory, process it, and write it back to a sink just like in integration services. Behind the scenes, however, an active Azure Databricks cluster is being run. So, mapping data flow, is in fact a layer on top of Azure Databricks, which is in itself also a layer of a big data solution. So, mapping data flows, are more for transforming big data on the fly for larger volumes. This is not the kind of flow I would use for just processing small files, because there is just too much overhead associated with it.
Another feature, is a wrangling data flow, which looks a lot like a Power Query designers. So Power Query was added in Excel, which is now being renamed to just the Get Data Experience in Excel, but it's also the same Get Data Experience in Power BI Desktop. So if you create a new model in Power BI Desktop, and click on Get Data, the Query Editor does Power Query as well, you also have a Power Query source and integration services now, and even in analysis services tabular, if you import data, it's the same editor. It's an engine, working with M Query language, which is really user friendly, it can easily apply multiple formulas, and you can step through the data; you can actually see the data changing. So while a mapping data flow is more like a conceptual level, you say, okay, read from this source, apply this transformation, and write it to this destination - you don't actually see the data. While with the wrangling data flow, you actually see the data coming in, and you see it change while you're working it. So that's the biggest difference. However, when I created slides, and when I created demos, the wrangling data flow was still in preview, so make sure, if you do something, in your support or and production environment, or not, and because it was in preview, you could get errors like this one, saying okay, you cannot do this, it's not supported yet, but while they're working on it they're going to support more and more use cases with this this wrangling data flow.
So how does this - all of this - work in Azure data factory? Well we have multiple integration runtimes. So you have the normal Azure Integration Runtime, which just executes normal activities of a pipeline, in a scalable way. It also integrates everything across Azure regions, and so on. And you can created your own integration runtimes as well. If you want to tweak the settings of that Databricks runtime, that runs your mapping data flows. You also have a Self-hosted Integration Runtime, which allows you to run compute on your own local network, instead of in Azure. And you can also use it as a data gateway to import on-premises data to Azure Data Factory. So if you have a pipeline as your data factory, and you want to connect to your local on-premises SQL Server database, then you need to install this Self-hosted Integration Runtime. And one of the last features I want to talk about is the Azure SQL Server Integration Services Integration Runtime - whew! That's a very long name. This is used to run and execute SQL Server integration services packages, in the Azure cloud. So it basically spins up a cluster of virtual machines, and it will run your SSIS Packages on that cluster. Now I'm not going to talk too much about this last option, because there is a whole webinar, a one hour webcast, on this topic alone. And that you can find on the MSSQTips.com website, so I'm going to leave it out of the demos.
Alright, so this is an example of how to set up this Integration Runtime, but again, I refer you to that one-hour-long webcast, if you want more details on how to migrate at your on-premises, integration services project through Azure Data Factory. Alright, let's take a quick look at some demos. So, here we can see a map - sorry, yeah - a mapping data flow, and it reads data from a certain source, in this case, again, Top 250 Movies, and the nice thing about mapping data flows - it allows schema drift, for example. So there are some features coming in from the big data world, which are certainly not available in classic integration services. And, then we do some transformations and this is about the only part in Azure Data Factory where you probably have to write the most expressions, so you say, okay we do sort of a draft column, we do this transformation, and the good thing is, you can do transformations in one specific column, but you can also define a transformation - okay, do this transformation, all the string columns coming in from my source. Then we do yet another transformation and then we convert one column to integer. And then we write a destination - yeah, we write a data destination which is again, a data set, and this is our top movies enabled in the SQL Server database.
If you want to debug this, you need to have an environment running in it that has one of those integration runtimes. And you can enable it here at the bottom - sorry, wrong button, sorry - you can enable it over here, so if you click on the Data flow debug toggle, it's going to start up one of those integration runtimes, which is behind the scenes, actually, an Azure Databricks Cluster. Right, so now it's running, so I can actually debug this.
First, I'm going to show you where you can find those integration runtimes. So, if you go to Connections, we have our linked services over here, and then, we go to Integration Runtimes, and here you can see a couple of those integration runtimes that I have defined. See how I have Self-Hosted, but I had to unconnect for the moment. Then I have one, one or two, for running integration services packages. And then you have the default one, which is the Auto Resolve Integration Runtime. But, I also created my own one - my own Data Flow Runtime. Very original name, I know. And the reason why would you create one like this, if you wanted to tweak some of the parameters, you can choose to create your own Data Flow Runtime. If you enable the debug, it will ask you which runtime do you want to use? And one of the advantages, again, is that you can set up time to list it - you say, okay if there's no activity for, like, 15 or 20 minutes then disable this runtime, and save some costs.
Alright, so let's go to one of the pipelines. So this one we'll call my mapping data flow, so here we can click on debug, and now it will start running. Now, if you click on debug, and the debugging runtime was not running yet, it has to start that cluster first, so then you can add like five minutes to your runtime. So, if you want to debug something, don't forget to enable the data flow debug first, go grab a cup of coffee, and then click on the debug, because, otherwise, you just have to wait a long time. Alright, so it's queued, not it's executing. But as I mentioned before, mapping data flows are big data solutions, so there's quite some more revs - I'm not just processing a CSV file of 250 rows, and it's just way too much overhead for such a small file, so, this is not going to be very efficient, but again, it's just a demo. So, I'll just keep on refreshing until something shows up. And just, hang on for this for a while, and let's take a look at the wrangling data flow.
As I mentioned, this looks a lot like the Power Query editor, so we can start here at the source. It's going to read the data as is. And like in Power Query we can step through the transformations, and I have to do some, like, weird transformations, because of the preview, and not all of the default transformations that I would use were available. So I have to jump through some hoops, just so I can split this column, so. But in the end, I made it. And you can see here the results. So, I split this column into an index column, and into a column containing the actual movie titles. So, as you can see, this is a much more intuitive way of transforming data, than the mapping data flow. Alright. If you go to mapping data flow, it has now succeeded. So if I go to SQL Server, and execute this Query, you can see there are some data in my table. But it took almost one minute and a half, just to read and transform 250 rows. Again, not the most ideal situation. But, if you want to move, like, lots of gigabytes of data, or even more, this is a real scalable framework that you can use. And then the overhead's less important.
Alright, let's return to the slides. So, as I mentioned, there's a whole webcast - there's a link here at the bottom - about moving integration services to the cloud. There are already quite some tips on how to configure this environment, how to execute integration services, packages, and so on. So lots of material on the MSSQLTips website, and you can also find the tips on those mapping data flows. So, this concludes the video, and thank you for watching.