Azure Data Factory vs SSIS vs Azure Databricks
Choosing the right E-T-L tool can be difficult given the many offerings in Microsoft’s ever-growing data integration ecosystem. Technology professionals ranging from Data Engineers to Data Analysts often need guidance when deciding between Azure Data Factory (ADF), SQL Server Integration Services (SSIS), and Azure Databricks for their data integration projects.
Both SSIS and ADF are robust GUI-driven data integration tools used for E-T-L operations, with connectors to multiple sources and sinks. SSIS development is hosted in SQL Server Data Tools, while ADF development is a browser-based experience, and both have robust scheduling and monitoring features. With the recent general availability of Mapping Data Flows, ADF now also supports aggregations, derived columns, fuzzy lookups, and other visually designed data transformations, similar to SSIS, that allow Data Engineers to build E-T-L in a code-free manner. Both ADF’s Mapping Data Flows and Databricks utilize Spark clusters to transform and process big data and analytics workloads in the cloud. This article covers the similarities and differences between ADF, SSIS, and Databricks, and provides guidance to help determine how to choose between these data integration services.
When should I use SSIS, Azure Data Factory, or both?
When choosing between Azure Data Factory (ADF) and SQL Server Integration Services (SSIS) for a new project, it is critical to understand whether your organization has an Azure footprint and, if so, whether the project could be hosted in Azure. If the answer is yes, then ADF is the right tool for the job. On the other hand, if the new project must remain on-premises, either for security reasons or because an SSIS ecosystem already exists, then SSIS is the tool of choice. For example, customers on Azure Gov Cloud may find that ADF and/or Mapping Data Flows are unavailable there, in which case SSIS may be the better tool for their projects.
SSIS is included with several editions of SQL Server, ranging in price from free (Express and Developer editions) to ~$14K per core (Enterprise), and SSIS integration runtime nodes start at $0.84 per hour on Azure. That said, data volume can become a concern from both a price and performance standpoint when running big data workloads with SSIS, since hardware must be purchased and often maintained.
Azure Data Factory’s (V2) pay-as-you-go plan starts at $1 per 1,000 orchestrated runs and $1.50 per 1,000 self-hosted IR runs. ADF would be a great resource for organizations that have hundreds of SSIS packages they do not want to re-write in ADF but would like to reduce operational costs, increase high availability, and increase scalability by leveraging Azure. For this scenario, a hybrid approach that lifts and shifts SSIS workloads to the cloud (see Lift and shift SQL Server Integration Services workloads to the cloud) would be ideal.
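As a rough sketch of how these run-based rates add up (illustrative only; an actual ADF bill also includes data movement DIU-hours, Data Flow compute, and other meters not modeled here):

```python
def adf_orchestration_cost(orchestrated_runs: int, self_hosted_runs: int = 0) -> float:
    """Estimate ADF V2 activity-run charges using the rates quoted above:
    $1 per 1,000 orchestrated runs, $1.50 per 1,000 self-hosted IR runs."""
    return orchestrated_runs / 1000 * 1.00 + self_hosted_runs / 1000 * 1.50

# A hypothetical month with 50,000 orchestrated and 20,000 self-hosted runs:
print(f"${adf_orchestration_cost(50_000, 20_000):.2f}")  # → $80.00
```

This kind of back-of-the-envelope estimate is often enough to compare run-based ADF billing against the fixed hourly cost of an SSIS integration runtime node.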
From a data velocity perspective, ADF natively supports event-based and tumbling window triggers in addition to scheduled batch triggers, whereas SSIS natively supports only batching, though custom triggers can potentially be built for near real-time data streams. See Developing a File Watcher Task for SQL Server Integration Services for more information on continuously checking a directory for incoming files before processing them.
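To illustrate, an ADF tumbling window trigger is defined declaratively. The sketch below builds the JSON body such a trigger takes in the ADF ARM/REST schema; the pipeline name and schedule values are hypothetical placeholders:

```python
import json

def tumbling_window_trigger(pipeline_name: str, start_time: str,
                            interval_minutes: int = 60) -> dict:
    """Build the JSON definition of an ADF tumbling window trigger.
    Each fired window passes its start/end boundaries to the pipeline."""
    return {
        "properties": {
            "type": "TumblingWindowTrigger",
            "typeProperties": {
                "frequency": "Minute",
                "interval": interval_minutes,
                "startTime": start_time,
                "delay": "00:00:00",
                "maxConcurrency": 1,
            },
            "pipeline": {
                "pipelineReference": {
                    "referenceName": pipeline_name,
                    "type": "PipelineReference",
                },
                # Window boundaries are exposed as trigger outputs:
                "parameters": {
                    "windowStart": "@trigger().outputs.windowStartTime",
                    "windowEnd": "@trigger().outputs.windowEndTime",
                },
            },
        }
    }

# Hypothetical pipeline name; each hourly window re-runs it with new boundaries.
trigger = tumbling_window_trigger("CopySalesPipeline", "2020-06-01T00:00:00Z")
print(json.dumps(trigger, indent=2))
```

The pipeline can then use `windowStart`/`windowEnd` in source queries so each run processes exactly one non-overlapping slice of data.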
From a data variety perspective, ADF can natively connect to more than 90 sources, ranging from REST APIs to CRM systems to complex JSON structures, while SSIS is better suited for structured data sources but can integrate with third-party or custom C# connectors for JSON, REST APIs, and more.
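To show the kind of shredding a JSON/REST source requires, here is a minimal Python sketch (standard library only, hypothetical payload) that flattens a nested API response into tabular column names, the sort of work ADF’s connectors or a custom SSIS C# component would perform:

```python
import json

def flatten(obj, prefix=""):
    """Recursively flatten nested JSON objects and arrays into a single
    dict keyed by dot/index-separated column names."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{i}."))
    else:
        flat[prefix.rstrip(".")] = obj
    return flat

# Hypothetical REST API response with nested objects and an array:
payload = json.loads("""{
  "customer": {"id": 42, "name": "Contoso"},
  "orders": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]
}""")
print(flatten(payload))
# → {'customer.id': 42, 'customer.name': 'Contoso', 'orders.0.sku': 'A1', ...}
```

The flattened keys map naturally onto relational columns, which is why semi-structured sources need this extra step before they land in a structured sink.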
From a programmability perspective, Azure Data Factory does not have a native programming SDK but does support automation through PowerShell without any third-party components, whereas SSIS has a programming SDK, along with automation through BIML and a variety of other third-party components.
When should I use Azure Data Factory, Azure Databricks, or both?
Both Data Factory and Databricks are cloud-based data integration tools available within Microsoft Azure’s data ecosystem and can handle big data, batch/streaming data, and structured/unstructured data. Both have browser-based interfaces along with pay-as-you-go pricing plans. ADF’s recently released Mapping Data Flows feature uses scaled-out Apache Spark clusters, similar to Databricks’ underlying architecture, and performs comparably for big data aggregations and transformations. It is important to note that Mapping Data Flows currently does not support connectivity to on-premises data sources; ADF’s original Copy Activity, however, does not use Spark clusters but rather self-hosted integration runtimes, and it does allow connectivity to on-premises SQL Servers. Databricks can also connect to on-premises data sources and may outperform ADF on big data workloads since it runs directly on Spark clusters.
From a velocity perspective, both ADF and Databricks support batch and streaming options. ADF does not natively support real-time streaming capabilities; Azure Stream Analytics would be needed for that. Databricks supports Structured Streaming, an Apache Spark API that can handle real-time streaming analytics workloads.
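Conceptually, a windowed streaming aggregation buckets events into fixed time windows as they arrive. The plain-Python sketch below mimics what a Structured Streaming query grouping by window and key would compute, using a few simulated events rather than a live stream:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def windowed_counts(events, window_seconds=60):
    """Count (timestamp, key) events per fixed (tumbling) time window --
    a toy model of a Spark Structured Streaming windowed aggregation."""
    counts = defaultdict(int)
    for ts, key in events:
        # Align each event down to the start of its window.
        offset = ts.timestamp() % window_seconds
        window_start = ts - timedelta(seconds=offset)
        counts[(window_start, key)] += 1
    return dict(counts)

# Simulated click events; the third one falls into the next 60-second window.
base = datetime(2020, 6, 1, 12, 0, 0)
events = [
    (base + timedelta(seconds=5), "clicks"),
    (base + timedelta(seconds=42), "clicks"),
    (base + timedelta(seconds=75), "clicks"),
]
result = windowed_counts(events)
print(result)
```

In real Structured Streaming the same logic runs incrementally over unbounded input with watermarking for late data; this batch sketch only illustrates the window-assignment arithmetic.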
From a development interface perspective, ADF’s drag-and-drop GUI is very similar to that of SSIS, which fosters a low learning curve and ease of use for developers familiar with the code-free interface of SSIS. Additionally, cluster types, cores, and nodes in the Spark compute environment can be managed through the ADF activity GUI to provide more processing power to read, write, and transform your data. Databricks does require a commitment to learn Spark with Scala, Java, R, or Python for Data Engineering and Data Science related activities. This can equate to a higher learning curve for traditional MSSQL BI Developers who have been ingrained in the SSIS E-T-L process for over a decade. For data engineers and scientists who are familiar and comfortable with the Databricks programming languages, Databricks offers a neat and organized method of writing and managing code through notebooks.
The last and most notable difference between ADF and Databricks relates to their primary purposes. ADF, which resembles SSIS in many respects, is mainly used for E-T-L, data movement, and orchestration, whereas Databricks can be used for real-time data streaming and collaboration across Data Engineers, Data Scientists, and more, along with supporting the design and development of AI and Machine Learning models by Data Scientists. For example, MLflow from Databricks simplifies the machine learning lifecycle by tracking experiment runs between multiple users within a reproducible environment and managing the deployment of models to production. Additionally, Databricks supports a variety of third-party machine learning tools.
Once these Databricks models have been developed, they can easily be integrated into complex ADF E-T-L pipelines through ADF’s Databricks Notebook activity, which also provides a seamless experience for passing parameters from ADF to Databricks. Additionally, the Databricks notebooks can be scheduled and monitored via ADF.
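For illustration, the sketch below builds the JSON body of an ADF Databricks Notebook activity, including `baseParameters` for passing a pipeline value into the notebook (the linked service name, notebook path, and parameter are hypothetical placeholders):

```python
import json

def databricks_notebook_activity(name: str, notebook_path: str,
                                 parameters: dict) -> dict:
    """Build the JSON definition of an ADF 'DatabricksNotebook' activity.
    Values in baseParameters arrive in the notebook as widget values."""
    return {
        "name": name,
        "type": "DatabricksNotebook",
        "linkedServiceName": {
            "referenceName": "AzureDatabricksLinkedService",  # hypothetical
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "notebookPath": notebook_path,
            "baseParameters": parameters,
        },
    }

# Hypothetical scoring notebook that receives the pipeline's run date:
activity = databricks_notebook_activity(
    "ScoreModel",
    "/Shared/score_model",
    {"run_date": "@pipeline().parameters.runDate"},
)
print(json.dumps(activity, indent=2))
```

Inside the notebook, the parameter would typically be read back with `dbutils.widgets.get("run_date")`, completing the ADF-to-Databricks handoff.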
In this article, I explored the differences and similarities between ADF, SSIS, and Databricks, with recommendations on when to choose one over the other and when to use them together. In summary, the choice truly depends on a number of factors such as performance, cost, preference, security, and feature capability.
- For more information on Copy performance and scalability achievable using ADF, see Copy activity performance and scalability guide.
- To create, start, and monitor a tumbling window trigger in ADF, see Create a trigger that runs a pipeline on a tumbling window.
- To better understand event-based triggers that you can create in your Data Factory pipelines, see Create a trigger that runs a pipeline in response to an event.
- To understand the Azure Data Factory pricing model with detailed examples, see Understanding Data Factory pricing through examples.
- To understand how to link Azure Databricks to your on-prem SQL Server, see Deploy Azure Databricks in your Azure virtual network (VNet injection).
- For more information on the most popular third-party ML tools in Databricks, see Third-party machine learning integrations.
- For more detail on tuning ADF’s Mapping Data Flow performance, see Mapping data flows performance and tuning guide.
- For more information on running a Databricks notebook against the Databricks jobs cluster within ADF and passing ADF parameters to the Databricks notebook during execution, see Run a Databricks notebook with the Databricks Notebook Activity in Azure Data Factory.
Article Last Updated: 2020-06-08