Web Scraping a Web Page using Azure Synapse Notebooks and Python

Problem

A great business analyst is never saturated with data, she/he needs more and more. Data analysis frequently calls for data enrichment, to enhance the dataset we already have by adding extra information. Data enrichment makes our data more useful and helps to get deeper insights. Sometimes this data may be internal company data, but in other situations we search for third party data over the internet such as historic sales data, market research data, product data, pricing data, etc.

It’s great when we find the data available in a download format, however, from time to time we may need to scrape data directly from an HTML website. The technique of importing information from a website is called web scraping or data scraping. Data Engineers have a lot of tools that can help with querying web data, such as Chrome plugins and Excel Power Query, but my favorite data manipulation tool is Python which we will use this in this tutorial.

Solution

I find Pandas to be the most useful library in Python. Pandas is a powerful and flexible open-source data analysis tool. Pandas makes it amazingly easy to scrape an HTML table ( <table> tag in HTML code ) from a web page directly into the DataFrame, two dimensional tabular data structure, similar to a database table.

For this tutorial I will use Azure Synapse Analytics, a unified data analytics service in Azure cloud, that helps Data Engineers load, explore and process data using either SQL or Spark engine. I will use Spark engine and query the data using Synapse Notebook, a web interface that can contain live code, text and visualizations.

Azure Spark Pool Configuration

First, we need to create a Spark Pool inside Synapse workspace. Here you can find tutorial on how to create a Spark Pool. Apache Spark in Azure Synapse Analytics has a lot of libraries for common data engineering, data exploration and visualizations. For this example, we will need an additional library lxml that does not come with Apache spark distribution. We will need to install it into our Spark pool. After installation, it will be available for all Notebooks using this Spark pool.

In order to install Python libraries, we need to create a requirements.txt file. It’s a simple text file, like the one you see below. As an example, I have more than just the lxml library, but we will not use the other two in this example.

After file creation, navigate in the Azure Portal to Settings> Packages and upload your file to the Spark pool:

Reading the Data

Now we will open Azure Synapse Studio from Synapse Workspace and go to Develop section to create a new Notebook.

If you want to read more on Notebooks usage, you can use this tutorial.

I will load the libraries that I will use in this example. Fsspec requires the lxml package that we have installed in the previous step.

import pandas as pd
from datetime import datetime
import fsspec

As an example, we will use Country codes translation table from the HTML table on this page https://www.worlddata.info/countrycodes.php.

Pandas library has a nice function that reads all <table> objects from an HTML web page into a list we have called dfs. The table that we are interested in, is the first item in a list. We will access it as dfs[0] and print the data:

datasource_url = 'https://www.worlddata.info/countrycodes.php'
dfs = pd.read_html(datasource_url)
df = dfs[0]
df

Cleaning the Data

Accessing data was very easy but I will probably want to clean the data a little.

I will change all data to be lowercase because its easy to join lowercase data. We will also remove white spaces from the column titles. White spaces in a column title sometimes might break data manipulation logic.

#change all data to be lowercase
df = df.applymap(lambda s:s.lower() if type(s) == str else s)        
 
#remove whitespaces from column names
[df.rename(columns={col: ''.join(col.lower().split())}, inplace=True) for col in df.columns]

Saving the Data

I want to save the data to Azure Data Lake and make it available for anyone who will want to use it.

To save the data to Azure Data Lake I need to create a connection, called “linked service”, in Azure Synapse Studio. This connection that will contain credentials we will use for storage access in a secure form. We will not need to provide credentials in the script in clear text.

We can add the linked service it from the “Manage” section. In the wizard you will need to provide the credentials for the container access.

In this example we will save the data in the csv format using dataframe.to_csv operator. If we want a different format, we can save the data in any format, like PARQUET using .to_parquet or to AVRO using https://github.com/ynqa/pandavro library.

adls_account_name = '<Storage account name>' 
sas_key = TokenLibrary.getConnectionString('Linked service name')
 
fsspec_handle = fsspec.open('abfs://country-codes/codes_mappping.csv', 
                                account_name = adls_account_name, 
                                sas_token=sas_key, 
                                mode="wt") 
 
with fsspec_handle.open() as f:
    df.to_csv(f)

In the above example we have downloaded a dataset from a web page using Pandas Python library, cleaned the data and saved the result set to a csv file on Azure Data Lake Storage.

Next Steps

Maria Zakourdaev

Maria Zakourdaev is a Data Platform Microsoft MVP and a technology expert with more than 20 years of experience, a community leader with a profound knowledge of Microsoft products and services, while also being able to bring together diverse platforms, products, and solutions to solve real-world problems.

Maria has a hands-on experience managing various data management technologies in Azure, AWS and Google public cloud platforms, including SQL Server, Azure CosmosDB, AWS DynamoDB, MemSQL, MySQL, Postgresql, Snowflake, Redis, Amazon Redshift, Couchbase and Elasticsearch, Spark, Databricks, Big Query.

Maria is an organiser of the annual conference “Data TLV” ( ex “SQL Saturday Israel”), a free training event for any data professionals: http://datatlv.com.

MSSQLTips Awards: Trendsetter (25+ tips) – 2022 | Author Contender – 2022 | Rookie Contender – 2018

One comment

Samer Sammour
January 2, 2023 / 8:08 am Reply
Hi,
I was trying to follow the instructions above. Unfortunately, I could not find the package sittings in the Azure Portal. You can see the list of available settings here https://learn.microsoft.com/en-us/azure/azure-portal/set-preferences.
Even if I go to workspace settings,>> configurations + libraries, I cannot find the same screen shown above.

Azure Spark Pool Configuration

Reading the Data

Cleaning the Data

Saving the Data

One comment

Leave a ReplyCancel Reply