SSIS Bad Habits: Inefficient data staging strategies - Part 2
When external files are located on the network and need to be imported into a SQL database, several ETL jobs require that data be transferred to the destination server before importing. This can be accomplished using multiple SSIS packages, where one processes and stages the external files into the destination server, and another imports the data into the destination database. Many SSIS developers use text, XML, or JSON files to stage the data as they are widely used, while this choice may decrease the package performance.
In a previously published article in the SSIS bad habits series, we explained data staging and different approaches to stage flat files into a staging database. We also noted that using a data flow task may be a better choice when transformations are required outside the database engine or when reading the data source requires implementing complex logic such as reading data from a web service.
This article explains when databases are not suitable for data staging and what formats SSIS developers should use to stage their data when ETL jobs are developed using SQL Server Integration Services.
When is the SQL database not suitable for data staging?
As noted in the previous articles in this series, the available system resources and the intended use of the data are the main factors considered when designing ETL projects. Even if staging data in a SQL database is better from a data load performance point of view, there are several scenarios where the data needs to be staged using other formats such as flat files, XML, JSON, and Excel. For example, we were asked several times to develop an SSIS package that performs several data cleaning operations and then to stage the data as flat files on another server. This occurred before inserting the data into the destination SQL Server database since different data warehousing tasks needed to consume the staged flat files.
Popular data staging approaches using SSIS
There are several approaches to stage data using SSIS. The previous articles explained how to stage data into a SQL server database using Bulk Insert or Data Flow tasks. This article will discuss other file formats used to stage the data.
Staging Data using flat files
One of the most popular approaches to stage data is using flat files and mostly comma-separated values (*.csv) since most database engines, business intelligence applications, office applications, and information systems support these types.
Several SSIS components can read and store data within flat files:
- Flat File Connection Manager: A Flat File connection manager enables a package to access data in a flat file.
- Flat File Source: The Flat File source reads data from a text file.
- Flat File Destination: The Flat File destination writes data to a text file.
- Script Task/Component: The Script Task and Component are two SSIS objects allowing the developers to extend the SSIS functionalities by adding a custom C# or VB.NET code. These objects can be used to explicitly manipulate flat files.
Staging data using XML files
Another widely used approach is staging data using XML files. Using the eXtensible Markup Language files is preferred for many companies since they are considered a universal data format for data exchange.
Several SSIS components can be used to manipulate XML files:
- XML Task: The XML Task is used to retrieve XML documents, apply operations using Extensible Stylesheet Language Transformations (XSLT) style sheets and XPath expressions, merge multiple documents, or validate, compare, and save the updated documents to files and variables.
- XML Source: The XML source reads an XML data file and populates the columns in the source output with the data.
- Script Task/Component: Since SSIS does not contain an XML destination component, scripts can be used as a workaround to stage data into XML files.
Staging data using JSON files
As mentioned previously, the main factors to consider when designing data warehousing operations are the available system resources and the intended use of the data. This is why, in several cases, the application/services used to read or consume the staged data force the developer to use a specific file format.
In contrast, it is sometimes meaningless to choose a file format only because it is widely used or considered a standard data exchange format. In modern data warehousing operations, data loading performance and the ability to run in parallel are critical. For this reason, Microsoft developed two SSIS components for data staging operations: the Raw file source and Raw file destination. Raw Files are used to dump data between different ETL stages. The data is stored in binary format and can only be used by the SSIS Raw Files component. Weirdly, most SSIS developers do not even know about these components.
A detailed comparative study of the use of flat files and raw files during data staging was posted on John Welch's blog: Performance of Raw Files vs. Flat Files. The results show that exporting data to raw files is slightly faster than using flat files. In contrast, the data read operations are significantly faster.
Data in Raw Files does not require translation or parsing, unlike SSIS Flat Files, which need parsing and validation. Besides, Raw Files are highly recommended when implementing parallel data import logic; we can split data over multiple Raw Files and then import them in parallel using Raw. However, Raw Files cannot be edited or consumed outside of SSIS, which limits their use to data staging.
In conclusion, you will be best served by a widely used data format like CSV, XML, or JSON if exporting data to be used in other systems or published. In contrast, Raw Files are recommended when dumping data for use in another ETL stage.
- After choosing the right strategy to stage data using SSIS, read more about importing and exporting data using Raw Files: SSIS Raw File Source and Destination Example (mssqltips.com)
- If using the JSON format, refer to the following article for more details: Importing JSON Files Using SSIS (mssqltips.com)
- If using the CSV format, refer to the following article for more details: Export Data to Excel from SQL Server Integration Services Package (mssqltips.com)
- For using XML files in SSIS, refer to the following article: Importing XML documents using SSIS (mssqltips.com)
- Read more about SSIS package performance optimization methods here: SQL Server Integration Services SSIS Performance Tuning Techniques (mssqltips.com)
About the author
View all my tips
Article Last Updated: 2022-08-04