Synapse Analytics vs Snowflake for Delta Format Files, Machine Learning, Table Relationships and Constraints
By: Ron L'Esteve | Updated: 2022-05-19 | Comments | Related: > Azure Synapse Analytics
When evaluating the capabilities of Synapse Analytics and Snowflake for your Lakehouse Platform, you may be interested in knowing about how both Synapse and Snowflake handle Delta format files, Machine Learning workloads, table relationships, constraints, and more. Having answers to these questions will help with thoroughly evaluating the features and offerings of both Synapse and Snowflake to determine if one or both are a good fit for your Modern Data Lakehouse Platform.
In this article, you will learn more about the advanced features of both Synapse Analytics workspaces and Snowflake. For Synapse Analytics, you will learn about Delta tables, Machine Learning, SQL Server Integration Services Integration Runtime (SSIS IR), Map Data Tool, Data Sharing, SQL Incremental, relationships, and constraints. For Snowflake, you will learn about its support for Delta Lake, Machine Learning, and how constraints are handled on the platform.
Synapse Analytics workspace is a unified data and analytics platform that brings storage, serving, reporting, and advanced analytics options along with it. While it is possible to query parquet files from the workspace, in this section, you will learn more about its support for Delta Lake. In addition, you will also learn more about advanced machine learning features, SQL Server Integration Services Integration Runtime support, the availability of constraints within the out-of-the-box data warehouses and databases, out-of-the-box Azure SQL DB incremental support, data sharing capabilities, and the new Map Data Tool.
Working with Spark Pools within the Synapse Analytics workspace is like Spark compute clusters within Databricks. Like Databricks, when creating a Delta table in a Spark database, simply run SQL syntax like the code shown below which includes defining the code language using the %%sql magic command, running the create table statement which includes defining the data's schema, and finally specifying the use of Delta format along with the ADLSgen2 location folder. Complex merge statements with custom logic can be written to load data into the tables and the table can be queried using common SQL syntax. If you are running this code, please ensure that your ADLSgen2 account is linked.
%%sql -- Create Delta Lake table, define schema and location CREATE TABLE Delta_Customer ( Customer STRING NOT NULL, Customer_ID INT NOT NULL, BeginDate DATE NOT NULL, EndDate DATE NOT NULL, CurrentRecord INT NOT NULL ) USING DELTA -- specify data lake folder location LOCATION '/Delta/Customer/'
Like the Databricks Delta log, Synapse will also maintain a _delta_log. It also supports the DESCRIBE command to retrieve the history of the table operations performed on the Delta tables. Similarly, the VACCUM command can also be run to remove old data files. The concept of time travel is also a capability to read previous versions of data by running code similar to the following:
# Load a previous version of the Delta_Customer table into a dataframe df = spark.read.format("delta").option("versionAsOf", 3).load("/Delta/Customer/") df.show()
Serverless SQL Pools can also query Delta Lake. The advantage of using Serverless SQL pools includes quicker retrieval of data since there is no wait time for Spark nodes to spin up. Synapse supports querying Delta Lake using Serverless SQL Pools in multiple languages by running the OPENROWSET function, as shown in the code below. Your URI is a link to the ADLSgen2 account containing your data so it will be important to ensure that your _delta_log table is contained within the root folder that is specified. Also, ensure to specify the Delta format. The query can be as simple as the one shown in the code below, or quite complex to include multiple joins, filters, orderings, and groupings. See this for more information about querying Delta Lake from Azure Synapse Analytics Workspace. Be sure to stay informed about the limitations of this capability.
SELECT * FROM OPENROWSET( BULK 'https://adlsaccount.blob.core.windows.net/data/Delta/Customer/', --Specify Delta Lake folder FORMAT = 'delta') as rows --Specify delta format ORDER BY Customer
Both Databricks and Synapse Analytics workspaces provide robust Machine Learning capabilities. With SynapseML, Data Scientists working in the Synapse Analytics workspace have access to open-source libraries to enable the creation of scalable ML Pipelines through its multilingual APIs which will run on your Apache Spark Clusters. It can also be used locally and with Databricks. With Synapse ML, you will have access to a variety of Cognitive Services such as multivariate anomaly detection, Deep Learning models, Responsible AI, Light GBM, OpenCV, and more. Read this for more information about getting started with Synapse ML, including code samples.
SQL Server Integration Services Integration Runtime (SSIS IR)
Azure SSIS Integration Runtime (IR) is one of two IRs within the Synapse Analytics workspace with which developers can lift and shift their on-premises SSIS workloads into Synapse Analytics. They can run packages that have been deployed in SSIS and hosted by Azure SQL Database or Managed Instance. In addition, they can run SSIS packages that have been deployed in Azure Files. Once configured, the SSIS IR can be selected from the dropdown within the Execute SSIS package activity, as shown in the figure below. Packages can be chosen from either SSISDB, File Systems (Package or Project), Embedded packages, or Package stores. Other required configuration parameters include Folder, Project, and Package location. Other optional configurations can be included in the Settings tab. Notice that there are a variety of other tabs that can also be included as part of this SSIS IR activity.
Map Data Tool
The Map Data tool is a GUI-driven, code-free ETL tool to enable users to create their mappings and data flows from source systems to Synapse Lake database tables. This tool supports quick time to insights since developers can easily build their ELT pipelines without the need to write custom code or scripts. The user will first choose their primary source table, along with the mapping method, source column, and target column, as shown in the figure below Mapping methods can include common mappings such as surrogate keys composed of multiple columns, lookups, unpivots, and direct mappings. Aggregate mappings are also supported such as averages, standard deviations, and sums. Finally, derived columns are also supported. Once you have the desired mappings established, you can switch on the 'Data flow debug' setting to display a preview of the target dataset. If all looks well from a target dataset standpoint, when you click 'Create pipeline', a new Execute Pipeline activity will be created for you which contains a Mapping Data Flow activity with the pre-built mappings, aggregations, and derived columns that were selected in the Map Data Tool. This is a neat capability that further demonstrates the code-free features of Synapse Analytics workspaces.
Sharing data within Synapse Analytics is also supported by a native Azure capability called Azure Data Share. Much like Databricks Delta Sharing and Snowflake Data Sharing, Azure Data Share within Synapse Analytics also provides sharing capabilities for both providers and consumers. With Azure Data Share, data providers can share data from a list of supported data stores with consumers and partners. Currently, ADLSgen2 and Synapse Dedicated SQL Pools are supported data stores. Azure Data Share supports various sharing methods ranging from full or incremental snapshots. It provides robust management and monitoring capabilities to track the shared assets. Consumers can access data that is shared with them through invitations and can integrate Azure Data Share with REST APIs. So, if you need to share data with consumers and are not on the Databricks on Snowflake platform, rest assured that you can use Azure Data Share for this use case.
Synapse Analytics Pipelines supports a new out-of-the-box feature called 'Enable incremental extract' for Azure SQL Database sources, as shown in the figure below. When enabled, the source will only read incremental data from a previous run or full data for a first run. Previously, this could be achieved by building custom incremental and Change Data Capture (CDC) pipelines from source to target. Additional options to specify the incremental date column and whether to start from the beginning enables engineers to build incremental ELT patterns without having to write any custom code or design custom pipelines. While this feature is currently only available for Azure SQL Database, it is a promising feature that significantly accelerates time to market through this out of box feature which would have previously taken a significant amount of time to custom develop the full and incremental pipeline.
While thinking through the concept of the Lakehouse, customers often ask if they will be able to apply constraints to their data in the Lakehouse, much like they would in their traditional SQL-based systems. The concept of keys, relationships, and partitions are all capabilities of Lake databases in Synapse Analytics. Primary, Foreign, and Composite keys are all supported within Lake Databases and their relationships can visually be depicted in the Lake database designer which displays the relationships between multiple tables, as shown in the figure below. As for Synapse Analytics Dedicated SQL Pools (SQLDW), constraints are not well suited for this environment since primary keys and unique keys are allowed but not enforced. Also, there is no foreign key concept. Lastly, primary keys work when Not Enforced and Nonclustered. This means that duplicate entries will be allowed so ensure that your ELT process enforced data quality and cleansing logic. To create a table with constraints, the syntax would need to be similar to the code which follows: CREATE TABLE Customer(Id int PRIMARY KEY NONCLUSTERED NOT ENFORCED, value int)
Both Databricks and Synapse Analytics provide Delta Lake support which enables support for workloads that are run on open-source Apache Spark compute through well-integrated APIs. The support for the Delta format is less well-known with Snowflake. Organizations that are considering Snowflake as part of their Lakehouse are also interested to know about the Machine Learning capabilities of Snowflake. In this section, you will learn more about this support for Delta format which Snowflake offers in addition to its Machine Learning capabilities and support for constraints.
While Snowflake does support integration with Delta format, it is both an experimental and proprietary process. Snowflake supports the reading of Delta Tables by using a manifest file that contains a list of data files to read for querying the Delta Table. This integration can be achieved by first generating a manifest file using Databricks. As an example, the following code can be used to generate this manifest file with a SQL Query within the Databricks workspace.
GENERATE snowflake_delta_manifest FOR TABLE DeltaDimCustomer
After this, Snowflake can be configured to read this manifest file through either an external table or view. You will need to begin by first creating a stage using code like the following from Snowflake. A Snowflake Stage is used to reference the inside and outside of Snowflake. For external Stages, Snowflake supports ADLSgen2, AWS S3, and GCP buckets.
create or replace stage delta_stage_table url='<path-to-delta-table>'
Once this Stage is created, you can create an external table using code as follows.
CREATE OR REPLACE EXTERNAL TABLE delta_DimCustomer( filename VARCHAR AS split_part(VALUE:c1, '/', -1) ) WITH LOCATION = @delta_stage_table/snowflake_delta_manifest/ FILE_FORMAT = (TYPE = CSV) PATTERN = '.*[/]manifest' AUTO_REFRESH = true;
When the Delta table is updated, the manifest file must be re-generated through either explicit or automatic updates. Be sure to understand the limitations around Data Consistency and Schema Evolution. When the manifest file is updated, it overwrites the existing manifest file. If the table is partitioned, the partitioned manifest files will be updated automatically, however since all manifests of the partitions cannot be updated concurrently, this may lead to versioning issues or other errors. Since Snowflake uses a schema defined in its table definition, it does not support Delta Lake's Schema Evolution capability since it will not query the updated schema until after the table definition is updated to the new schema. See this for more information on Snowflake to Delta Lake integration.
Databricks and Synapse Analytics workspaces support Machine Learning through various libraries, run-times, APIs, and other out-of-the-box functionality. When considering a Lakehouse, customers are interested in understanding if Snowflake also provides support for Machine Learning workloads and model development. Snowflake supports executing Python, Spark, and Java code therefore models can be trained and executed in Snowflake for development purposes by using custom code. Once developed, these models will need a separate environment to run. The samples shown in the following GitHub Repository demonstrate that models can be pushed inside Snowflake UDFs to run the model as part of SQL. While this is not as robust as having access to open-source libraries, packages, runtimes, APIs, and compute on Spark, it does demonstrate that Snowflake can support ML Model Development initiatives through custom scripts and processes.
Snowflake provides support for constraints including UNIQUE, PRIMARY KEY, FOREIGN KEY, and NOT NULL. A table can have multiple unique and foreign keys but only one primary key. NOT NULL constraints are always enforced, while other constraints can be defined by are not always explicitly enforced. When a table is copied using a CREATE TABLE..LIKE or CREATE TABLE..CLONE command, all constraints are also copied over. Constraints can be defined on single or multi-columns. Oftentimes, customers look closely at the capabilities of constraints when choosing an enterprise data warehouse platform, so understanding Snowflake's capabilities around constraints will be useful.
Synapse Analytics workspaces present a modern approach to managing Data Engineering and Advanced Analytics workloads with the advantage of not requiring cluster startups. Synapse provides great support for Parquet and supports a variety of data disciplines within its unified platform including Power Bi Reporting, ELT Pipelines, Machine Learning, and much more. Its Serverless offering can be accessed directly from commonly used IDEs including Power BI and SQL Server Management Studio (SSMS) through various types of enterprise authentications. While these are all great features of Synapse Analytics workspaces, its support for DELTA and data caching can be improved. When compared to Databricks, there is also more overhead related to access setup of credentials, data formats, and data structure specifications. While Databricks is a more mature Lakehouse Platform technology with robust integration and support for DELTA format, it lacks a persistent Data Warehouse layer technology, which Synapse provides with its Dedicated SQL Pools offering. On the other hand, while Synapse Analytics has room for growth and maturity in the Data Lakehouse Platform space, it is quickly growing and evolving its feature offerings to be on par with those of Databricks. As you evaluate if these technologies are a fit for your organization, also consider understanding how these resources can be deployed securely from an infrastructure perspective to enable private connectivity to ensure that organizational security requirements are met.
Snowflake has oftentimes been compared to Synapse Analytics Dedicated SQL Pools (SQLDW) as organizations evaluate a persistent Data Warehouse consumption layer for their Data Lakehouse Platform. At its core, the Synapse Analytics offerings are well integrated with other Azure native technologies and support native DELTA and Spark-based analytics on top of your Data Lake in addition to its persistent Datawarehouse, SQL Dedicated Pools. It is also great at handling ML, AI, and steaming workloads. Snowflake, on the other hand, is designed for conventional BI workloads while also offering robust support for streaming. Additionally, Snowflake supports ML and DELTA format, with certain limitations and less robustness than Synapse. Snowflake offers unlimited scale, superior partner integrations, and marketplace offerings, and great value out-of-the-box, while Synapse may have a slightly elevated learning curve due to its advanced offerings. Ultimately, Synapse Analytics and Snowflake have their advantages and disadvantages. When evaluating them for your organization, by considering the use case and business requirements, you may decide that all three of these great technologies will earn a place in your Modern Data Lakehouse Platform.
- For more information about deploying Snowflake securely using Private Link, please read: Azure Private Link & Snowflake — Snowflake Documentation
- For more information about understanding Private Endpoints with Synapse Analytics, please read: Understanding Azure Synapse Private Endpoints | The Data Guy
- See this for more details about comparing Databricks with Synapse Analytics workspaces
- Read my previous tip about Understanding the Capabilities of Snowflake.
- Read my previous tip about Azure Data Lakehouse Ingestion and Processing Options (mssqltips.com).
- Read my previous tip about Data Lakehouse Storage and Serving Layers (mssqltips.com)
- Read my previous tip about Lakehouse Reporting, Data Modeling, and Advanced Analytics in Azure (mssqltips.com)
About the author
View all my tips
Article Last Updated: 2022-05-19