Leveraging PolyBase to load data into Azure SQL Data Warehouse

By:   |   Comments (4)   |   Related: > Azure


Most companies are faced with the ever-growing big data problem.  It is estimated that there will be 40 zettabytes of new data generated between 2012 to 2020.  See the Computerworld article for details.  Most of this data will be generated by sensors and machines.   However, only a small portion of the data is available for users.  How can IT professionals help business lines gather and process data from various sources?


There have been two schools of thought when dealing with big data.  Please see the article from The Data Warehouse Institute on this subject.

Schema on write is represented by the traditional relational database.  Raw data is ingested by an extract, transform and load (ETL) process.  The data is stored in tables for quick retrieval.  Each column in the table is declared as a particular data type.  This ensures domain integrity of the information.  Constraints are added to the tables to ensure entity and referential integrity.  Typically, changes to these systems can be time consuming and costly.  Only a portion of the total data owned by the company resides in the database.

Schema on read is represented by technologies such as Hadoop or PolyBase.  These technologies process stored data as simple text files.  It is assumed that data integrity was applied during the generation of these text files.  The actual definition of the table is applied during the read operation.  Therefore, changes can be done quickly at the query level and do not take a-lot of time.  Almost all the company’s data can be placed in blob storage or a logical data lake.  Data Scientist can investigate this data to add predictive analytics to the company’s repertoire. 

Today, we are going to talk about how to load data from Azure Blob Storage into Azure SQL data warehouse using PolyBase.  This technology was introduced by Microsoft in 2012 to allow a relational database such as Parallel Data Warehouse (MPP) to talk to files stored on Hadoop’s Distributed File System (HDFS).  Please see this ZD Net article covering its first introduction.  Since its release, the technology has grown to read data off different storage media and different file formats.

Business Problem

There are many financial companies that invest in the stock market via mutual funds. Our fictitious company named Big Jon Investments is such a company.  They currently invest in the S&P 500 mutual fund but are thinking of investing in individual funds starting in 2017.  The investments department will need historical data to make informed choices when picking stocks.  Our boss has asked us to pull summarized daily trading data for the 505 stocks that make up the list.  Last year’s historical data should be stored in comma delimited files and will be the start of a new data mart in Azure.  We accomplished creating the comma delimited files in a prior tip. 

Now, our boss wants us to copy the comma delimited files to Azure blob storage and load to Azure SQL data warehouse table to the combined results.

How can we solve this business problem?

We are going to use a combination of techniques that we learned in prior articles to accomplish the following tasks.

Task Description
1 Use PowerShell to sign into Azure.
2 Create and load an Azure blob container.
3 Create and define an Azure SQL data warehouse.
4 Defining an external data source.
5 Leverage PolyBase to create external tables

Sign into Azure

We must log in with a valid subscription owner before we can do any work in the PowerShell ISE.  You will be prompted for credentials which is a valid user name and password.  The new sign in experience breaks this process into two dialog boxes or tasks.

The first task is to supply a valid user name.

Azure SQL DW & PolyBase - User name - Description: This screen asks for a valid user name.

The second task is to supply a valid password.

Azure SQL DW & PolyBase - Password - Description: This screen asks for a valid password.

A user account might have several subscriptions associated with it.  The output below shows the subscriptions associated with [email protected] user name or account owner.  The last step is to choose the correct subscription to work with.  This step does not apply since I have only one subscription linked to my user name.

Azure SQL DW & PolyBase - Subscription - Description: Screen shot from PowerShell ISE showing Azure subscription.

In a nutshell, you need to be logged into Azure with a validate account and subscription before any work can start.

Create Blob Container

Most of the objects you create in Azure are contained inside what is called a resource group.  For this project, I am going to create a resource group named rg4tips18, a storage account named sa4tips17 and a blob container named sc4inbound.  Please see my article on “Using Azure to store and process large amounts of SQL data” for the PowerShell cmdlets to accomplish these tasks.  Each of the commands was executed in the PowerShell Integrated Scripting Environment (ISE).

The image below shows the newly created resource group named rg4tips18.

Azure SQL DW & PolyBase - Resource Group - Description: Screen shot from PowerShell ISE showing Azure resource group.

The storage account is a container for objects like blob storage, table storage, file shares and message queues.  The image below shows the newly created storage account named sa4tips18.

Azure SQL DW & PolyBase - Storage Account - Description: Screen shot from PowerShell ISE showing Azure storage account.

I decided to create two storage containers.  The sc4inbound container will hold the newest text files that need to be loaded into the data warehouse.  The sc4archive container will be a landing zone for files that have been processed but are being saved according to our company’s retention period.  We will be using the first container in this article.  When we automate this process in the future, we will write code to move the processed files from inbound to archive containers.  See image below for details around storage containers.

Azure SQL DW & PolyBase - Storage Containers - Description: Screen shot from PowerShell ISE showing Azure storage containers.

After running a PowerShell script, I like to validate the creation of the objects for the project before proceeding.  The image below was taken from the Azure portal and shows the two storage containers in our one storage account.

Azure SQL DW & PolyBase - Blob Services - Description: Screen shot from Azure Portal showing the two containers in the storage account.

To recap, there is a hierarchy of objects when defining Azure storage.  One might compare this hierarchy to the Russian Matryoshka doll in which dolls or objects fit inside another.

Load Blob Container

Now that we have our container defined in Azure, let us take a look at the data files that we want to load.  The S&P data for the year 2017 is comprised of 505 comma separated value files located on a on my laptop’s solid-state drive.  I am enclosing a zip file that contains this data so that you can follow along with the examples.

Azure SQL DW & PolyBase - Local User Directory - Description: Screen shot from Windows Explorer showing local user directory.

There are two additional files that I want to load.  The first file is called a packing list.  It is obvious that the file contains a list of all the comma separated value files that need to be processed.

Azure SQL DW & PolyBase - Packing List - Description: An user defined file listing stock files to be loaded.

The second file is a summary of the data files to load.  It contains the number of files to process and the total records contained by the files.  This information is important when creating an automated process.  It allows the developer to write a validation check after load process has completed.

Azure SQL DW & PolyBase - Packing Summary - Description: A summary file showing the number of files and records contained within the packing list.

The image below, taken from the Azure portal, shows the total number of S&P 500 stock data files loaded into the storage container.  This includes the two packing list control files.  Enclosed is the PowerShell script that was used to create and load data into a blob storage container.

Azure SQL DW & PolyBase - Container - Description: Screen shot from the portal showing the correct number of files.

Since the packing list file is key to the load process, I am going to validate its existence. The image below was taken from the Azure portal.  All of the files that are stored in the sc4inbound directory can be displayed via paging provided by this screen. While Azure Blob Storage does not support directories per se, you can add pathing to the name of the blob file.  The portal interface will display the files in a structured manor resembling directories.  Below, the new path was added to each file and the files were filtered by the letter P.

Azure SQL DW & PolyBase - Listing of files - Description: Screen shot from Azure Portal show files in the container that start with the letter P.

Access to Azure blob storage can be defined by using storage access keys or shared access signatures.  The database scoped credential used by PolyBase supports only storage access keys.  The screen shot from the PowerShell ISE shows the two keys associated with our account.  We will use this information later in the article.

Azure SQL DW & PolyBase - Storage Keys - Description: Screen shot from PowerShell ISE showing the two keys associated with the storage account.

To summarize this section, each of the data files was loaded into a blob storage container.  The two packing files are used to control and validate the load process.

Creating the Azure SQL Server

Almost everything in Azure is contained by a resource group.  For this project, I am going to define a logical SQL database server named sql4tips18 inside the resource group named rg4tips18.  The administrator account or user name for the server is jminer and the password is MS#tIpS$2018.  Please see my article called “Deploying Azure SQL Database Using Resource Manager PowerShell cmdlets” for the details on these tasks.

The image below, taken from the PowerShell ISE command window, shows the Azure SQL server residing in the East US 2 region.

Azure SQL DW & PolyBase - Logical SQL Server - Description: Screen shot from PowerShell ISE showing the logical SQL Server.

Security is a very important concern by every company.  That is why Microsoft blocks all connections to the database by using a firewall.  The image below shows a firewall rule named fr4laptop being added to allow my laptop access to the database server.

Azure SQL DW & PolyBase - Firewall rules - Description: Screen shot from PowerShell ISE showing the firewall rules for the logical server.

The enclosed PowerShell script contains the cmdlets used to create this logical server.

Defining the Azure SQL data warehouse

I am going to use SQL Server Management Studio to manage my new Azure SQL server.  We need to connect to the server to start crafting Transact SQL scripts.  Choose the database engine as the server type.  Enter the fully qualified name of the Azure SQL Server.  Pick SQL Server authentication as the security option and supply the login/password of the server administrator.  The image below shows a typical connect to server login window.

Azure SQL DW & PolyBase - SSMS Login - Description: The login screen for SQL Server Management Studio.

By default, you should be in the master database.  Right now, there is one user defined database named dw4stocks.  Creation of an Azure SQL Data Warehouse can be done via PowerShell or T-SQL.  Please see my article on “Deploying Azure SQL Data Warehouse Using Resource Manager” for the PowerShell cmdlets to accomplish these tasks. 

The T-SQL code below drops and recreates the existing database using a DW1000 service object or performance tier and maximum storage capacity of 10 TB

-- Create [dw4stocks] Azure SQL DW 
-- Delete existing database 
-- Create new database 
MAXSIZE = 10240 GB, 
EDITION = 'datawarehouse', 

The Azure offerings from Microsoft change on a monthly basis.  In November 2017, Microsoft announced the addition of the Compute performance level tier.  The number of concurrent queries, compute nodes, distribution per compute nodes, memory per distribution and maximum memory per data warehouse have been increased significantly.  The prior offering has been renamed as the Elasticity performance tier.  Change the service objective to DW1000c to start using this new tier.

I want to create three user defined schemas.  The ACTIVE schema will have the combined data.  The STAGE schema will contain intermediate results and auditing information.  The BS schema will contain the external table defined for blob storage.  I thought about using ET as the abbreviation for external tables.  However, I did not want people to think of Steve Spielberg’s block buster movie named E.T. which is about a visitor from another planet.

The following tables are key to our solution.

Schema Table Description
ACTIVE STOCKS The final table with surrogate id.
STAGE STOCKS The staging table w/o surrogate id.
STAGE AUDIT Audit trail of T-SQL statements.

The [ACTIVE].[STOCKS] table will contain all of S&P 500 historical data.

-- Create ACTIVE table 

The [STAGE].[AUDIT] table keeps track of the actions performed by our load process.

-- Create AUDIT table 
  AU_CHANGE_DATE [datetime] NOT NULL, 
  AU_CMD_TEXT [varchar](1024) NOT NULL, 
  AU_CHANGE_BY [nvarchar](256) NOT NULL, 
  AU_APP_NAME [nvarchar](128) NOT NULL, 
  AU_HOST_NAME [nvarchar](128) NOT NULL 

Please note there are no primary key constraints or default constraints on these tables.  That is because the Azure SQL data warehouse is not fully compliant with T-SQL. 

There is a new concept that you have to worry about during table creation.  How is the data going to be distributed to the 60 files that are spread across the compute nodes?  There are two ways to distribute data.  First, we can use hash distribution on a given column.  The stock symbol might be a good candidate if data is equally distributed.  Second, we can use a round robin distribution if we do not know a good hash key.  I choose the second method given the fact that our data size is very small.  Please read this MSDN article on distributed table design.

If you successfully executed the enclosed Transact SQL script, your object explorer window should have three newly defined tables.  The image below also contains an external table for the packing list, a view to dynamically create T-SQL statements and a stored procedure to load data from blob storage using external tables (PolyBase).  More information on these objects to follow later in the article.

Azure SQL DW & PolyBase - Object Explorer - Description: Screen shot from SSMS showing the objects that make up our solution.

In short, the Azure SQL data warehouse can be managed by SQL Server Management Studio just like any other database.  While some T-SQL constructs are the same in this offering, there are many features that are not supported.  I suggest you refer to the T-SQL documentation often to make sure a feature is available.

Defining an external table

Up to this point, I have not talked about anything new.  In order for the Transact SQL commands to access the Azure blob storage, we need to define an external table.  The technology behind external tables is PolyBase.  There are five simple steps that are required to create this object.

First, we need to create a master key from a given password.  I suggest using a strong random password generated by one of the free online password sites such as https://www.random.org/passwords/.  The screen shot below shows the newly created master key.

Azure SQL DW & PolyBase - Master Key - Description: Screen shot from SSMS showing the master key for the logical server.

Second, we want to define a database credential called ‘CRD_AZURE_4_STOCKS’.  The secret is the storage access keys that we captured above.  The identity string can be any valid name.  The screen shot below shows our newly defined credential.

Azure SQL DW & PolyBase - Database Credential - Description: Screen shot from SSMS showing the database credential.

Third, we must create the external data source named ‘EDS_AZURE_4_STOCKS’ using database credential we just created.  Make sure you choose the type as HADOOP.  The location is the fully qualified path to the Azure blob storage container.  Use the newly defined database credential to access the blob storage.

Azure SQL DW & PolyBase - External Data Source - Description: Screen shot from SSMS showing the external data source.

Fourth, we require an external file format named ‘EFF_AZURE_4_STOCKS’ to be created.  This object defines the format of the files stored in blob storage.

Azure SQL DW & PolyBase - External File Format - Description: Screen shot from SSMS showing the external file format.

The fifth and final step puts all the pieces together by defining an external table.  The location parameter is full name of the file that is saved in the storage container.  The data source and file format parameters use the database objects we previously defined.  It is best to state the fields in the external table as strings or variable characters.  That way, bad data in the file will not cause an error when reading from the external table.

Azure SQL DW & PolyBase - External Table - Description: Screen shot from SSMS showing the external table.

External tables act like read-only versions of regular tables.  Only a SELECT statement is valid with these tables.  Therefore, the following T-SQL will list all the entries in the packing list file.

-- Show packing list data 

The image below shows the first 9 records in the packing list file.

Azure SQL DW & PolyBase - External Table 4 Packing List - Description: Screen shot from SSMS showing the external table defined for the packing list file.

To rehash this section, there are five objects (steps) that are used to define an external table.  This collection of technologies is called PolyBase.

Life without cursors

The following algorithm is the pseudo code for our business solution. 

Usually, I would create a cursor on the external table for the packing list file.  Unfortunately, Azure SQL data warehouse does not support any of the cursor constructs in T-SQL.  How do we translate each record in the packing list file into Dynamic T-SQL to do the following:  drop an external table, create an external table, and insert data into our staging table from the external table?

Step Description
1.0 Truncate the staging table.
2.0 Read the packing list file.
3.0 For each record in the packing list file.
3.1 Drop existing external table.
3.2 Create new external table.
3.3 Insert data from external table to staging table.
3.4 Insert audit records for each above statement.

Azure SQL data warehouse does support a WHILE loop statement.  The key to this technique is to create a user defined view that has the dynamic T-SQL (text) already built for consumption.  Thus, we just need assign each file name a row number so that we can process the view incrementally in the loop.

Since the view represents nothing new in terms of technology, I am enclosing the completed T-SQL script for both the view and store procedure for your review.  The image below was captured from the query window within SQL Server Management Studio.  The view named TSQL_FOR_PACKING_LST_FILES supplies text or dynamic T-SQL to the stored procedure we are going to define.

Azure SQL DW & PolyBase - View with dynamic T-SQL - Description: Screen shot from SSMS showing the user defined view for our drop, create and insert statements.

Custom Stored Procedure

A custom stored procedure is a great way to package a bunch of T-SQL statements into a repeatable process.  The [ACTIVE].[LOAD_FROM_BLOB_STORAGE] takes one parameter as input.  The verbose flag indicates whether or not the code displays debugging information to the results window.

The T-SQL code below implements the previously discussed algorithm as a stored procedure.  To reduce code, I removed the typical TRY/CATCH logic that I use with my stored procedures.

-- Load from blob storage 
-- Drop stored procedure 
    SELECT * FROM sys.objects 
    WHERE object_id = OBJECT_ID(N'[ACTIVE].[LOAD_FROM_BLOB_STORAGE]') AND type in (N'P', N'PC') 
-- Create stored procedure 
  -- Global variables 
  -- Loop variables 
  -- 1.0 - Clear data from stage 
  -- 3.0 - While loop replaces cursor 
    -- Local variables 
    -- Debugging 
        PRINT ' ' 
    -- 2.0 - Current file name 
    -- 3.1 - Remove existing external table 
    EXEC sp_executesql @SQL_STMT; 
    -- 3.4 - Insert Audit record 
    VALUES ( @CURR_DATE, @SQL_STMT, @USER_NAME, 'Drop External Table', 'db4stocks' ); 
    -- 3.2 - Create existing external table 
    EXEC sp_executesql @SQL_STMT; 
    -- 3.4 - Insert Audit record 
    VALUES ( @CURR_DATE, @SQL_STMT, @USER_NAME, 'Create External Table', 'db4stocks' ); 
    -- 3.3 Insert into staging table 
    EXEC sp_executesql @SQL_STMT; 
    -- 3.4 - Insert Audit record 
    VALUES ( @CURR_DATE, @SQL_STMT, @USER_NAME, 'Load Staging Table', 'db4stocks' ); 
    -- Debugging 
        PRINT ' ' 
    -- Update date 
    -- Increment the counter 
    SET @LOOP_CNT +=1; 

Here are some things to note about the above code.  Basically, the code is a work around the lack of cursor support in Azure SQL data warehouse.

Normally, when using a cursor, we would fetch the next row of data until there was no more data.  Instead, we get the number of files to process by selecting a total count from the packing list external table.  This value is stored in a variable named @FILE_CNT.  Next, we set a loop counter variable named @LOOP_CNT to 1.  The while loop continues until this counter exceeds the file counter.

Again, if we were using a cursor, the file name would be available to the stored procedure as a variable.  We would stitch together dynamic T-SQL within the loop of the cursor.  In the code above, we are selecting a single row and single field from the view that has the pre-built T-SQL we want to execute. 

Last but not least, audit tables that I have defined in the past come with default values in which I pull in key information such as current date/time and current user.  Because Azure SQL data warehouse does not support default values, we need to pass this information along to the INSERT statement.

The T-SQL snippet below calls the above stored procedure.  This stored procedure will process the 505 files using external tables and PolyBase.  The resulting data is stored in the staging table.  We know that some records in the table have zero values since they represent dividends.  The final insert statement moves the data from the staging table to the active table when leaving behind these zero value records.

-- Test load process 
-- Import blob files into stage 
-- Move from stage to active 

The image below shows the completed execution of the load process.  While PolyBase makes it easy to load files from Azure blob storage, it is definitely not the fastest technology in the world.  It takes 40 minutes and 34 seconds to process the records stored in 505 files.

Azure SQL DW & PolyBase - Test load process - Description: Screen shot from SSMS showing the total time for the load process.

If we take a look at the total number of records and divide by the total process time in seconds, we are able to process about 52 records per second.

Azure SQL DW & PolyBase - Staging Table Record Count - Description: Screen shot from SSMS showing the total number of records in the staging table.

The real speed with Azure SQL data warehouse is the ability to aggregate large amounts of data.  One of the analysis at Big Jon Investments has asked for the lowest and highest stock sale price by month for all the S&P 500 stocks.  The below query returns in less than a second with the answers.  The resulting data set is 6,054 records in size.

Azure SQL DW & PolyBase - Simple Aggregate Query - Description: Screen shot from SSMS showing a TSQL query to get high and low sell values of stocks by month.

To make a brief statement about this section, PolyBase allows a relational database to use schema on read technology.  The execution time for this technology is not fast. However, the ease of use makes it very attractive.  Azure SQL data warehouse is not fully T-SQL compliant.  But with enough perseverance, you will find ways to work around these limitations.  The real attractive feature about the warehouse is the blinding speed in which huge amounts of data can be aggregated.


Big Jon’s Investments wanted to collect S&P 500 historical data from a third party and load it into an Azure SQL data warehouse.  Azure blob storage was used as a logical data lake for these comma separated files.  PowerShell was used to collect and post the historical data to blob storage.  Microsoft provided the data warehouse developer with external tables (PolyBase) to read files located in blob storage. 

The external table depends on five objects being defined.  First, a master key needs to be defined for a given unique password.  Second, storage access keys need to be retrieved from the blob storage account.  Third, a database credential should be defined using the storage access key.  Fourth, an external data source is created for access to the blob storage container.  Fifth, an external file format is defined to describe the contents of the file.  Last but not least, the external data source and external file format are used to define the external table.

In the past, I wrote an article about “Bulk Inserting Data into an Azure SQL Database with PowerShell”.  One issue that we could not address is the 4 TB limit of this offering.  Today, I talked about Azure SQL Data Warehouse which can scale up to 240 TB of uncompressed data and almost 1 PB of compressed data using column store technology.  However, there are still some outstanding items to address to provide an automated solution. 

First, how do we move all the data files and packing list to an archive directory after successful processing?  Second, how can we schedule the load from the blob storage process?  These topics will be talked about in the near future.

Next Steps
  • Using Azure Automation to schedule a PowerShell workflow.

sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip

About the author
MSSQLTips author John Miner John Miner is a Data Architect at Insight Digital Innovation helping corporations solve their business needs with various data platform solutions.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips

Comments For This Article

Wednesday, August 22, 2018 - 8:11:32 PM - Martin Piecyk Back To Top (77277)

Thanks for the great and informative article! It made it very easy to change our Azure SQL Data Warehouse from using slower bulk-inserts to faster PolyBase Blob processing.

Wednesday, April 25, 2018 - 4:09:11 AM - Badeeul Back To Top (75778)

Hi John,

I'm not able to download the script of TSQL_FOR_PACKING_LST_FILES. Could you please mail me the script?



Tuesday, March 27, 2018 - 2:59:51 PM - John Miner Back To Top (75532)

Hi Irfan,

That is the beauty about Microsoft supplying the developer with various tools. You can put the utlities together to solve the problem the way you see fit.

Here are just a couple of ways to accomplish the copy.

1 - AZ Copy data files to blob storage.

2 - Use PowerShell cmdlets to copy files to blob storage.

3 - Schedule a Data Factory job to do the same.



Tuesday, March 27, 2018 - 11:38:29 AM - Irfan Khan Back To Top (75528)

 I hope we can use Azure Data Factory to run the pipeline, provided that first part of pipeline is to move data from source to azure blob. Right?

get free sql tips
agree to terms