Copy Data from and to Snowflake with Azure Data Factory


By: Koen Verbeeck   |   Updated: 2020-08-04   |   Related: > Azure Data Factory


Problem

We are using Snowflake for our data warehouse in the cloud. We would like to use the Azure toolset for managing the data pipelines. Is it possible to use Azure Data Factory to get data in or out of Snowflake?

Solution

Snowflake is a cloud-based data warehouse solution, which is offered on multiple cloud platforms. Once you’ve configured your account and created some tables, you most likely have to get data into your data warehouse. But sometimes you also have to export data from Snowflake to another source, for example providing data for a third party.

If you’re invested in the Azure stack, you might want to use Azure tools to get the data in or out, instead of hand-coding a solution in Python, for example. The main tool in Azure to move data around is Azure Data Factory (ADF), but unfortunately integration with Snowflake was not always supported. This meant workarounds had to be created, such as using Azure Functions to execute SQL statements on Snowflake. This concept is explained in the tip Create an Azure Function to execute SQL on a Snowflake Database - Part 2. Luckily, Snowflake integration has now been implemented, which makes implementing pipelines more straightforward.

At the time of writing, not all functionality in ADF has been implemented yet. Snowflake support in data flows is still in the pipeline, and you cannot use a Snowflake linked service in the Execute Stored Procedure activity. If you created such a linked service, you cannot select it in the activity:

no stored proc support

In this tip, we’ll show you how you can create a pipeline in ADF to copy the data from a .csv file in Azure Blob Storage to a table in Snowflake, and vice versa.

If you need more information about Snowflake, such as how to set up an account or how to create tables, you can check out the Snowflake tutorial.

Copying Data from Snowflake to Azure Blob Storage

The first step is to create a linked service to the Snowflake database. ADF has recently been updated, and linked services can now be found in the new management hub:

create new linked service in management hub

In the Linked Services menu, choose to create a new linked service:

create new linked service

If you search for Snowflake, you can now find the new connector:

new snowflake connector

You can specify the integration runtime you wish to use to connect, the account name (without the https:// prefix), the username and password, the database and the warehouse. Remember, you always need to specify a warehouse for the compute engine in Snowflake. You can also specify additional connection properties, such as a default role.
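If you'd rather not pass the role and warehouse through the connection properties, you can also set defaults on the Snowflake user that ADF connects with. A minimal sketch; the user, role and warehouse names below are made up for illustration:

```sql
-- Hypothetical names; run this as a role with the necessary privileges.
-- With these defaults set, the ADF linked service only needs the
-- account, user, password and database.
ALTER USER adf_user SET
    DEFAULT_ROLE      = sysadmin,
    DEFAULT_WAREHOUSE = compute_wh;
```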

specify linked service properties

First, let's create a dataset for the table we want to export.

create new dataset

In this tip, we’re using the Stack Overflow sample data, but any dataset can be used. We’re going to export the data from the Badges table to a csv file. This table has over 28 million rows and is about 244 megabytes in size.

badges table size

In the New Dataset dialog, search for the Snowflake dataset:

create snowflake dataset

In the next screen, select the Snowflake linked service we just created and choose the desired table from the list.

configure snowflake dataset

The following step is to create a dataset for our CSV file. Select Azure Blob Storage from the available locations:

create blob dataset

Next, choose the DelimitedText format:

choose delimited text format

If you haven’t already, create a linked service to a blob container in Azure Blob Storage. Next, specify the name of the dataset and configure the file path and the file name of the csv file. Since the file does not exist yet, we’re not going to import the schema.

configure csv dataset

These are the default settings for the csv file, with the first row configured as the header:

csv file default settings

However, it seems auto-detecting the row delimiter does not work:

auto detecting row delimiter does not work

So, make sure to give it an explicit value:

set row delimiter

Now we can create a new pipeline. Add a Copy data activity.

add pipeline with copy data activity

For the source, choose the Snowflake dataset:

copy data source

Since the Badges table is quite big, we’re going to enlarge the maximum file size using one of Snowflake’s copy options, as demonstrated in the screenshot. For the sink, choose the CSV dataset with the default options (the file extension is ignored since we hard-coded it in the dataset).

Once everything is configured, publish the new objects:

publish all the things

Once you run the pipeline, you can see the COPY INTO statement being executed in Snowflake:

snowflake history

The full SQL statement:

copy into location
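For readers who can't see the screenshot, the statement ADF generates is roughly of the following shape. This is a hedged reconstruction, not the exact text from the screenshot: the database name, container URL, SAS token and file size are placeholders.

```sql
-- Sketch of the COPY INTO <location> statement ADF runs in Snowflake.
-- All identifiers and credentials below are placeholders.
COPY INTO 'azure://myaccount.blob.core.windows.net/mycontainer/badges.csv'
FROM "MYDB"."DBO"."BADGES"
CREDENTIALS = (AZURE_SAS_TOKEN = '?sv=...')  -- SAS URI authentication
FILE_FORMAT = (TYPE = CSV, COMPRESSION = GZIP,
               FIELD_DELIMITER = ',', RECORD_DELIMITER = '\n')
HEADER = TRUE
SINGLE = TRUE                    -- write one output file
MAX_FILE_SIZE = 268435456;       -- the enlarged copy option from the source settings
```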

In about 1 minute, the data from the Badges table is exported to a compressed CSV file:

pipeline run succesful

We can verify the file is actually created in the Azure Blob container:

file is created

A Couple of Pitfalls

When exporting data from Snowflake to another location, there are some caveats you have to take into account.

  • JSON is not yet supported. Only the DelimitedText and Parquet file formats are supported for copying data directly from Snowflake to a sink.
json not supported
  • When using Azure Blob Storage as a source or sink, you need to use SAS URI authentication. The reason for this is that a COPY INTO statement is executed in Snowflake and it needs to have direct access to the blob container. An example of creating such a SAS URI is shown in the tip Customized Setup for the Azure-SSIS Integration Runtime.
sas uri is needed
  • If the table contains too much data, you might go over the maximum file size. You can enlarge this as we’ve shown earlier. Also make sure you’re using compression. If the output is still too big, you might want to create a solution that writes to multiple files. Mapping data flows have this ability, but they do not support Snowflake at the time of writing.
file size exceeded
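Until mapping data flows support Snowflake, one workaround for the file size limit is to run the export yourself in Snowflake, where COPY INTO can split the output over several files. A sketch with placeholder names and credentials:

```sql
-- With SINGLE = FALSE (the Snowflake default), the output is split into
-- multiple numbered files, each capped at MAX_FILE_SIZE bytes.
-- URL, token and table name are placeholders.
COPY INTO 'azure://myaccount.blob.core.windows.net/mycontainer/badges_'
FROM dbo.BADGES
CREDENTIALS = (AZURE_SAS_TOKEN = '?sv=...')
FILE_FORMAT = (TYPE = CSV, COMPRESSION = GZIP)
MAX_FILE_SIZE = 104857600;   -- roughly 100 MB per file
```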

Copying Multiple Files from Azure Blob Storage to Snowflake

Let’s reverse the roles. Now we’re going to copy data from multiple CSV files to a Snowflake table. First, let’s clone the CSV file we created in the previous section:

multiple source files

In the configuration of the dataset, we’re going to leave the filename blank:

dataset config for multiple files

In Snowflake, we’re going to create a copy of the Badges table (only the schema, not the data) with the following SQL statement:

create table dbo.BADGES_TEST like dbo.BADGES;

The Snowflake dataset is then changed to this new table:

point to new table

Create a new pipeline with a Copy Data activity (or clone the pipeline from the previous section). For the source, choose the csv dataset and configure the filename with a wildcard:

configure source with wildcard

For the sink, choose the Snowflake dataset and configure to truncate the destination table before the data is copied:

configure sink

When the pipeline is started, the destination table will be truncated, but its schema will also be retrieved (for the mapping). Finally, the COPY INTO statement will be executed.

The full SQL statement:

full copy into table statement
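Again, as a stand-in for the screenshot, the generated statement looks roughly like the sketch below. The database name, URL, SAS token and regular expression are placeholders, but they illustrate how the wildcard from the dataset ends up as the PATTERN option:

```sql
-- Sketch of the COPY INTO <table> statement ADF runs in Snowflake.
-- All identifiers and credentials below are placeholders.
COPY INTO "MYDB"."DBO"."BADGES_TEST"
FROM 'azure://myaccount.blob.core.windows.net/mycontainer/'
CREDENTIALS = (AZURE_SAS_TOKEN = '?sv=...')
PATTERN = 'badges.*[.]csv'   -- the wildcard, translated to a regex
FILE_FORMAT = (TYPE = CSV, SKIP_HEADER = 1,
               FIELD_OPTIONALLY_ENCLOSED_BY = '"');
```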

You can see the wildcard from the filename is translated into an actual regular expression. After about one minute, the two CSV files are copied into the table. This is 56 million rows and almost half a gigabyte. The performance of the COPY INTO statement is quite good.

statistics

Conclusion

In this tip, we’ve shown how you can copy data from Azure Blob storage to a table in a Snowflake database and vice versa using Azure Data Factory. At the moment, ADF only supports Snowflake in the Copy Data activity and in the Lookup activity, but this will be expanded in the future.

Next Steps








About the author
Koen Verbeeck is a BI professional, specializing in the Microsoft BI stack with a particular love for SSIS.



