Conclusion to Moving Data Around in Azure
By: Koen Verbeeck
In this last video we summarize the different options for moving data around in Azure.
Welcome to this last video of the Moving Data Around in Azure video series. In this last video we'll draw some conclusions. My name is Koen Verbeeck. I work as a senior business intelligence consultant for the company AE in Belgium, I write articles for MSSQLTips.com, I have some certifications, and I've been a Microsoft Data Platform MVP for a couple of years now. If you have any questions about this video series, please drop them below in the comments, or you can contact me on Twitter, LinkedIn, or on my blog.
All right, so the conclusion. Let's start with the services that require coding. First we have Azure Functions, which is a good fit if you like coding, because you're just writing a script. You have a number of languages available: C#, Python, PowerShell. Lots of options, so any language you like is probably there, and you can compare it with a Script Task in Integration Services.
So if you have something in your ETL flow, your data pipeline, where the tool you're using, like Azure Data Factory, needs to do something that isn't there, you can easily extend it using an Azure Function. And if you create an Azure Function, you can easily call it from Azure Data Factory as well, so it integrates nicely into those data pipelines in Azure. It's event-based, so it's usually triggered by something. This can be an HTTP trigger, if you use Azure Data Factory for example, but it can also be a file trigger: if a new file arrives in Azure Blob Storage, it can automatically trigger an Azure Function that processes the file and dumps it into SQL Server or something like that. Functions are typically suitable for smaller, specific tasks. As a rule of thumb, an Azure Function has a maximum runtime of 10 minutes (that is the limit on the Consumption plan). If you need code that runs for much longer, you probably want to look into Azure Batch instead.
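As a rough illustration of the kind of small, specific task described above, here is a minimal Python sketch of the logic a file-triggered function might run when a new CSV file arrives. The function name `parse_new_file`, the column names, and the sample data are made up for this example. In a real Azure Function this logic would sit inside a handler bound to a blob trigger, and the rows would be written to SQL Server rather than returned; both of those parts are left out so the sketch runs with the standard library only.

```python
import csv
import io
from typing import Dict, List


def parse_new_file(blob_name: str, content: bytes) -> List[Dict[str, str]]:
    """Turn the bytes of a newly arrived CSV file into rows ready to load."""
    text = content.decode("utf-8")
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        row = dict(record)
        # Keep track of which file each row came from.
        row["source_file"] = blob_name
        rows.append(row)
    return rows


# Hypothetical sample file, standing in for a blob that just arrived.
sample = b"id,amount\n1,10.5\n2,20.0\n"
rows = parse_new_file("sales/2020-05-12.csv", sample)
print(len(rows), rows[0])
```

Keeping the parsing in a plain function like this makes it easy to test locally before wiring it to a trigger.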
Then we have Azure Databricks, which is a commercial solution on top of Apache Spark, so it's a big data solution. You can use it for data engineering and data science. You have these notebooks where you can mix code and documentation. And you have various languages available. You have Scala, which is the original Spark language. You also have SQL, Python, and R, and I think you now have C# available as well. So it's a framework: you can use it for big data processing, but it can also handle streaming data. Lots of options available. It's sort of an outlier, because it's in Azure, but it's a big data solution. But you can call it from Azure Data Factory, so you can integrate it into your normal data pipelines.
All right. Now we have Logic Apps. This is the no-code department of Azure. Logic Apps are easy-to-build workflows, where you can say: if something happens, then do this and do this, loop over this, and then send an email to that person, for example. You can, for example, build a Logic App that refreshes your Power BI dataset and then checks that the dataset has refreshed successfully. There are lots of connectors available out of the box. Like Azure Functions, they're also event-based. Again, you can trigger them from Azure Data Factory, or you can have a file trigger, so the same options as over there. If a connector is not available yet, you can build your own custom connector using APIs and Swagger files. Logic Apps is one of my favorites because it's so easy to use. It's very intuitive and you can get a lot done very quickly.
Lastly we have Azure Data Factory, which is the most comprehensive and flexible tool in the Azure stack itself. You can create pipelines, and those pipelines can call other functionality like Logic Apps, Functions, and Databricks. It's a very visual tool, but behind the scenes everything is in JSON, so you can also generate your pipelines if you really want to. You can also create dynamic pipelines, as we showed in the demo. Aside from those pipelines, you also have mapping data flows, which are a layer on top of Azure Databricks that makes working with Databricks a bit easier, because you don't have to write code. You don't have to write your own notebook. You just drag and drop and write some expressions, but be aware that not all Azure Databricks functionality is available yet in those mapping data flows. They will always lag a bit behind in functionality. You can use them to process big and small data. And then you have the wrangling data flows, which are an implementation of the Power Query Editor, a very visual way of transforming your data.
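To make the remark about generating pipelines a bit more concrete, here is a minimal Python sketch that builds a pipeline definition as JSON, with one copy activity per table. The pipeline name, table names, and dataset names (`Csv_...`, `Sql_...`) are hypothetical, and the JSON shape is a simplified assumption that leaves out most properties. Check it against the Azure Data Factory documentation or an exported pipeline before using it for real.

```python
import json
from typing import Dict, List


def build_pipeline(name: str, tables: List[str]) -> Dict:
    """Build a simplified pipeline definition with one copy activity per table."""
    activities = []
    for table in tables:
        activities.append({
            "name": f"Copy_{table}",
            "type": "Copy",
            # Dataset names are hypothetical; they would have to exist in the factory.
            "inputs": [{"referenceName": f"Csv_{table}", "type": "DatasetReference"}],
            "outputs": [{"referenceName": f"Sql_{table}", "type": "DatasetReference"}],
            # Assumed source and sink types for a CSV-to-Azure-SQL copy.
            "typeProperties": {
                "source": {"type": "DelimitedTextSource"},
                "sink": {"type": "AzureSqlSink"},
            },
        })
    return {"name": name, "properties": {"activities": activities}}


pipeline = build_pipeline("LoadStaging", ["Customer", "Orders"])
pipeline_json = json.dumps(pipeline, indent=2)
print(pipeline_json)
```

Generating the definition from a list of tables is mostly useful when you have many near-identical pipelines; for a handful, the visual editor or a single dynamic pipeline is usually simpler.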
So if you ask me "What is your go-to choice for moving data around in Azure?", it's probably Azure Data Factory. You can use it to build highly sophisticated, orchestrated pipelines. You can also easily use it for a one-time ad hoc flow, because it's just so easy to use and you don't have to write much code. And it integrates very nicely with Logic Apps and Functions and even Databricks. So my number one choice will probably be Azure Data Factory. All right, so this concludes the video series. I hope it was useful. Again, if you have any comments, just add them in the comment section. Thank you for watching.
Last Update: 5/12/2020