Big Data Analytics with Azure Databricks to Move Data
By: Koen Verbeeck
In this video we look at how you can use Azure Databricks as a unified data analytics platform using different languages such as Python, SQL, Scala, Java, etc. to handle large volumes of data for analytic processing.
- Using Azure Databricks to Query Azure SQL Database
- Securely Manage Secrets in Azure Databricks Using Databricks-Backed
- Manage Secrets in Azure Databricks Using Azure Key Vault
Hello and welcome to this video in the series Moving Data Around in Azure. In this video, we'll do a short introduction to Azure Databricks, a big data platform supported in Azure. A little introduction, my name is Koen Verbeeck. I work as a senior business intelligence consultant at AE, and I write some articles for mssqltips.com, I have some certifications, and I'm also a Microsoft MVP for the Microsoft Data Platform. If you have any questions, you can post them in the comments below, contact me on Twitter, LinkedIn, or on my blog.
All right, Azure Databricks, first a little history. So Azure Databricks, it all started a while back, a good 15 years back ago, when Google wrote a series of papers on how to handle large scale data, which resulted in the MapReduce algorithm, which again, resulted in Hadoop and HDFS, and the entire big data ecosystem. However, at the time it was really complicated to work with big data. You had to write your own, not produce jobs in Java, you had to work a lot in the command line, and it felt a little bit like going back to the old stone age of IT. So this guy, Matei, he founded Spark. So it's layer on top of the big data ecosystem, and handles a lot of this stuff for you. So this resulted in the Apache Spark. So it's been adopted by the Apache Foundation, which means its open source and free to use, and Azure Databricks is the commercial solution of Apache Spark. So you pay to use Azure Databricks, but you get some extra features.
All right, so Azure Databricks, you can call it a unified data analytics platform. It can be used by multiple types of people. It can be used for, by the data engineer to move data around, which is of course the topic of today, to create pipelines that do lots of stuff with data. It can pull data in, transform it using Python, Java, or any, or a couple of languages, and drop it maybe a data lake somewhere else. So it's great at doing sort of ETL flows on top of your data lake, for example. It can also be used by machine learning data science people, so to create models in Python, for example, do some machine learning, and using those models, do some predictions, forecasting. So it can be used by multi-types of people. Also ad hoc analysis can be done using the notebook format in Azure Databricks.
All right, so Azure Databricks uses clusters behind the scenes. So you have a cluster manager and you have a driver which drives of course different workers, which will execute your Spark code for you. So when you have an Azure Databricks environment, you need to create one or more clusters. You say, okay, I wanna have a cluster with at least two to eight workers, this type of memory, and you can say, okay, I wanna have these type of virtual machines behind the scenes. You wanna, you can optimize for high concurrency, and one of the main benefits is that you can also set a timeout. So for example, if you don't use a cluster for 30 minutes, it will automatically shut down, and save you costs, which is, yeah, really a great benefit in the cloud.
So I mentioned you can use different languages in Databricks, so you can use Scala, which is a sort of Java based language, which is the original language of Databricks and Spark. But you can also use SQL and Python for example. Now, the people who have been working with Python a lot know, okay, we have dataframes, but these are not the Pandas DataFrames. You also have a Spark DataFrame, and this has a schema, a format, and a location. But you can easily convert a Spark DataFrame to a Pandas DataFrame, if that's what you want. Development in Spark is done using notebooks, and here I can see a screenshot of such a notebook, and at the top you can select your cluster and you can also see which language is being used, but I'm gonna go deeper into the notebook in this demo.
All right, let's switch to my Databricks environment and here we can see the same notebook in the screenshot and you can see here that I configured a cluster, and I also, you can see here at the top that I'm using the Python language. All right, so you can use different languages, but there's one main language that you have to choose for your notebook, so, and it's the default language in every cell that you write. So here this is a cell with code. It's by default Python, unless you specify something else. A good advantage of Databricks and notebooks in general is that you can mix documentation and code together. So it's really useful for giving demos, for learning, because you can easily document your code along with the code itself of course.
So here I have my cluster. It says 12 cores, and so the configuration, it's running. You can see it by the green icon over here, and the first step that I do, which I have already done, is running, is creating a file system. So Databricks uses a file system behind the scenes. So, we could say, okay, a link to a blob storage account and there you can store all these files, and it will create some sort of database behind the scenes. So if I run this, I run this cell, it will send this information to the cluster and it will just show everything from the file system. Okay, and here you can see another example, and this really looks like if you're used to working with Linux, you see that it's basically the same commands, like ls to show me everything that's in this folder, and here we can see the Top250Movies CSV file, which is the one we're going to read in using this notebook, and dump the contents to SQL Server.
So here I read by CSV file using the Spark code, so I say the dataframe, we're gonna read the CSV file. It has a header, it has this type of encoding, and you can find it here. Now, we can show you the first 10 rows, or we can just display the dataframe. Now it's running. It's in the Spark Job, and you can see, okay, here are the first rows of my top 250 movies, and as I mentioned, we can convert it easily to a Pandas DataFrame. So first we need to import the Pandas library, and then okay, we say, okay, convert this dataframe to a Pandas DataFrame, yeah, then just a couple of seconds and if you run this one, you can see that now I have sort of a different dataframe, but with the same data.
You can see here that I have an index over here, and the actual data is just one column, which includes the position of the movie and then the movie title itself. All right, so we can see 250 movies. Now, okay, having this 249.Ip Man, it's not ideal. We're gonna split these out. So this is standard Python code. So we're gonna split it, and then I'm gonna drop an extra column, and here you can see the result. Now I have beautiful index column with the actual position of the movie and the title of the movie. We just convert some data types, and then we're gonna connect to SQL Server. But at first, I need to make sure we can access the driver, and I do this in Scala, and you can change the language of a single cell by having this percentage sign and the name of the language in front of it. Okay, like this. I set my entire connection parameters. Then I'm gonna test my connection, just a random SQL query. Okay, it works. And now I'm gonna read the data from the database. Oh no, so this is the data, sorry, from my test query. Okay, it's like some log data from an integration runtime of Azure Data Factory. And now I'm gonna create a database view on top of the data inside the Databricks database. So okay, we create a dataframe and we're gonna createOrReplace a view over here, and we're gonna call it temphvmovies. And then you can execute some Spark SQL to drop the table if it exists, and then we create a new table, which is just a result of my view. So it's doing some Hive code, for people who are used to working with the big data ecosystem. And as you can see in the next step, we're just gonna execute some SQL code. So we just can switch the language to SQL, so it's no longer Python, and we can just run a SELECT * from this Hive table that we just created once this job has done running.
So as I mentioned a couple of times, Databricks is a big data ecosystem. So it's used for transforming large volumes of data or handle streaming data, stuff like that. So there's a bit of overhead, because it has to use this massive cluster, which is why, okay, dealing with a table of 250 rows plain SQL in SQL Server will always be fast for such small volumes. But that's the thing, okay, this is just a demo, and for larger volumes, the overhead, it's reduced. All right, so we can execute now the SQL query. So it is now running the command on the Hive table in my cluster. And then while it's running, in the last step we're gonna say, okay, write to this table in SQL Server. If it already exists, if there's data in it, just overwrite it, and the source is the hvmovietable. All right, so now I'm done. Now I'm gonna say, okay, Run this command to send the data to SQL Server. And it's done, all right?
This concludes the demo, so let's go back to the slides. If your interest has piqued by this presentation and the demo, there are some tips on MSSQLTips that you can also read that give you a little bit more information about Databricks and how you can use it in certain use cases. All right, thank you for watching.
Last Update: 4/27/2020