SSIS Bad Habits: Inefficient data staging strategies - Part 2

By:   |   Updated: 2022-08-04   |   Comments (1)   |   Related: More > Integration Services Best Practices


Problem

When external files are located on the network and need to be imported into a SQL database, several ETL jobs require that data be transferred to the destination server before importing. This can be accomplished using multiple SSIS packages, where one processes and stages the external files into the destination server, and another imports the data into the destination database. Many SSIS developers use text, XML, or JSON files to stage the data as they are widely used, while this choice may decrease the package performance.

Solution

In a previously published article in the SSIS bad habits series, we explained data staging and different approaches to stage flat files into a staging database. We also noted that using a data flow task may be a better choice when transformations are required outside the database engine or when reading the data source requires implementing complex logic such as reading data from a web service.

This article explains when databases are not suitable for data staging and what formats SSIS developers should use to stage their data when ETL jobs are developed using SQL Server Integration Services.

When is the SQL database not suitable for data staging?

As noted in the previous articles in this series, the available system resources and the intended use of the data are the main factors considered when designing ETL projects. Even if staging data in a SQL database is better from a data load performance point of view, there are several scenarios where the data needs to be staged using other formats such as flat files, XML, JSON, and Excel. For example, we were asked several times to develop an SSIS package that performs several data cleaning operations and then to stage the data as flat files on another server. This occurred before inserting the data into the destination SQL Server database since different data warehousing tasks needed to consume the staged flat files.

Popular data staging approaches using SSIS

There are several approaches to stage data using SSIS. The previous articles explained how to stage data into a SQL server database using Bulk Insert or Data Flow tasks. This article will discuss other file formats used to stage the data.

Staging Data using flat files

One of the most popular approaches to stage data is using flat files and mostly comma-separated values (*.csv) since most database engines, business intelligence applications, office applications, and information systems support these types.

Several SSIS components can read and store data within flat files:

  • Flat File Connection Manager: A Flat File connection manager enables a package to access data in a flat file.
  • Flat File Source: The Flat File source reads data from a text file.
  • Flat File Destination: The Flat File destination writes data to a text file.
  • Script Task/Component: The Script Task and Component are two SSIS objects allowing the developers to extend the SSIS functionalities by adding a custom C# or VB.NET code. These objects can be used to explicitly manipulate flat files.

Staging data using XML files

Another widely used approach is staging data using XML files. Using the eXtensible Markup Language files is preferred for many companies since they are considered a universal data format for data exchange.

Several SSIS components can be used to manipulate XML files:

  • XML Task: The XML Task is used to retrieve XML documents, apply operations using Extensible Stylesheet Language Transformations (XSLT) style sheets and XPath expressions, merge multiple documents, or validate, compare, and save the updated documents to files and variables.
  • XML Source: The XML source reads an XML data file and populates the columns in the source output with the data.
  • Script Task/Component: Since SSIS does not contain an XML destination component, scripts can be used as a workaround to stage data into XML files.
XML Source description from the SSIS toolbox

Staging data using JSON files

In the last decade, JSON has started to slowly replace XML since it is used and supported by new data technologies and Web APIs. The JavaScript Object Notation (JSON) is considered a new standardized data format for data storage and exchange. Since no SSIS component is designed to handle JSON files, developers must use a script component to read the data from a JSON file.

Discussion

As mentioned previously, the main factors to consider when designing data warehousing operations are the available system resources and the intended use of the data. This is why, in several cases, the application/services used to read or consume the staged data force the developer to use a specific file format.

In contrast, it is sometimes meaningless to choose a file format only because it is widely used or considered a standard data exchange format. In modern data warehousing operations, data loading performance and the ability to run in parallel are critical. For this reason, Microsoft developed two SSIS components for data staging operations: the Raw file source and Raw file destination. Raw Files are used to dump data between different ETL stages. The data is stored in binary format and can only be used by the SSIS Raw Files component. Weirdly, most SSIS developers do not even know about these components.

Raw file source description from the SSIS toolbox

A detailed comparative study of the use of flat files and raw files during data staging was posted on John Welch's blog: Performance of Raw Files vs. Flat Files. The results show that exporting data to raw files is slightly faster than using flat files. In contrast, the data read operations are significantly faster.

Raw file destination description from the SSIS toolbox

Data in Raw Files does not require translation or parsing, unlike SSIS Flat Files, which need parsing and validation. Besides, Raw Files are highly recommended when implementing parallel data import logic; we can split data over multiple Raw Files and then import them in parallel using Raw. However, Raw Files cannot be edited or consumed outside of SSIS, which limits their use to data staging.

Summary

In conclusion, you will be best served by a widely used data format like CSV, XML, or JSON if exporting data to be used in other systems or published. In contrast, Raw Files are recommended when dumping data for use in another ETL stage.

Next Steps





get scripts

next tip button



About the author
MSSQLTips author Hadi Fadlallah Hadi Fadlallah is an SQL Server professional with more than 10 years of experience. His main expertise is in data integration. He's one of the top ETL and SQL Server Integration Services contributors at Stackoverflow.com. On the academic level, Hadi holds two master's degrees in computer science and business computing and is seeking a Ph.D. in data science.

View all my tips


Article Last Updated: 2022-08-04

Comments For This Article




Thursday, August 4, 2022 - 11:39:55 AM - Jimbo99 Back To Top (90340)
I'd rather be secure and relatively inefficient than sorry that I allowed an insecurely connected server to pollute & potentially corrupt a database or even server. Any feeds of data that connects to the outside world is a potential for corrupt data. A certain level of inefficiency is like a trap or firewall process of a system to correct an issue before it corrupts.














get free sql tips
agree to terms