Import Data Using SSIS and Transposing Columns to Rows

By:   |   Comments (4)   |   Related: More > Integration Services Development


Problem

I have a requirement to import a text file, using SQL Server Integration Services (SSIS), that has multiple columns that need to be transposed to multiple rows. I can't find any way to do this in SSIS. How can this be done?

Solution

This limitation of importing a varying number of columns can be overcome with a script task where we can use C# or VB.NET code to transpose columns to rows and store the valid data in the database, redirect the error rows to the database and fail the package if the file is not in the expected format. This script technique is quite useful when we want to load data into a non-SQL environment (Example: EDM platforms) as it avoids loading the data into SQL and makes use of T-SQL functions. Also, this script is flexible as it works irrespective of data type unlike some T-SQL functions.

Overview

There could be a scenario where a data provider has ‘N’ number of measures and may provide a data feed with dynamic columns depending on the data that was asked for and it is important to develop a solution which would accommodate this varying nature.

For example, a financial markets data feed could contain multiple measures and a data provider (i.e. Bloomberg, Reuters, FactSet, etc.) could supply only the information that we ask for instead of all the columns in every single feed.

Example Data Feed where number of facts could vary each time:

Company,Date,Datasource,Open Price,High Price,Low Price,Average Price 
Microsoft,2018-03-10,Traders,100.30,140.20,104.35,120.10 
Amazon,2018-03-10,Traders,120.75,140.23,130.80,130.50 

We need to build a solution to accommodate this varying nature of columns and the output that is needed is:

Company,Date,Datasource,Property Name,Property Value 
Microsoft,2018-03-10,Traders,Open Price,100.30 
Microsoft,2018-03-10,Traders,High Price,140.20 
Microsoft,2018-03-10,Traders,Low Price,104.35 
Microsoft,2018-03-10,Traders,Average Price,120.10
Amazon,2018-03-10,Traders,Open Price,120.75
Amazon,2018-03-10,Traders,High Price,140.23 
Amazon,2018-03-10,Traders,Low Price,130.80 
Amazon,2018-03-10,Traders,Average Price,130.50 

An example package has been developed to demonstrate this solution.

This package contains a script component under a data flow task which will act as a data source to read the input file and does the following steps:

  1. Check whether the header row has fixed columns coming through correctly. For example, you are expecting the first few columns to be fixed in the file and check whether those columns are coming through as the package may not behave correctly if it is not the case. The package needs to fail as the source file is not in the expected format.
  2. Check whether any rows are missing data by comparing the number of columns in each row with the header and redirect these rows to an error log.
  3. Transpose the varying number of columns to rows by reading the columns for each line in the file and storing the values in new rows. These are classified as valid rows and these output to the next step in the process.

Step 1 - Variable/Parameters Definition

We will create the following parameters for the package:

  • FileName – Fully qualified name for the input file.
  • LogFilePath – Path to store the error log file (with timestamp) where the incomplete rows would be redirected.

Step 2 - Configuring script component under a Data Flow

A data flow task was created and a script component was added by setting its type to Source and then the variable was configured in the script component. Please refer to the below picture for variable configuration.

Script Transformation Editor - Description: Script Editor screesnshot to show variable and parameters configurations to be passed onto the script

The following outputs and columns have been added in the Inputs and Outputs section in the Script Transformation Editor. TransposedRows output will contain the transposed rows and ErrorRows will contain the error rows (for example, incomplete rows or to contain an error message that the file is not in the expected format).

Script Outputs - Description: 2 Outputs have been created for the script to output transposed rows and error rows to redirect them to DB and error log respectively

Step 3 – C# script to process the file and output the rows

The script is then edited to process the file and create the output rows. The below code contains the comments to understand each bit of the code.

public override void CreateNewOutputRows()
{
    string[] columnvalues;
    //string variable to store line from the file when we loop through the lines
    string line=null; 
    int lineno = 0; //variable to keep track of line no whilst processing the file
    StreamReader fr = File.OpenText(Variables.FileName); //Open the File
    int headercolumnslength = 0; //variable to record header columns length
    string[] PropertyName=new string[1]; // Array to store dynamic column names

    if (fr != null)
    {
        line = fr.ReadLine(); //Read the first line of the file
        //Assuming the file is comma delimited, we split the column values
        columnvalues = line.Split(','); 

         //Calculate the no. of columns in the header row
        headercolumnslength = columnvalues.Length; 
        lineno++;

        /*Check whether there are any issues in the fixed columns in the file,
          Any issues here to be added to the errorrows output and fail the dataflow task  
         */

        if (columnvalues[0].Contains("Company") == false || columnvalues[1].Contains("Date") == false || columnvalues[2].Contains("Datasource") == false)
        {
            AddErrorRow("Error", 0, "Header is not in the expected format, Required fields   missing from file");
            bool pbCancel = false;
            this.ComponentMetaData.FireError(100, "Script Component", "Header not in excepted format", "", 0, out pbCancel);
            fr.Close();
            fr = null;
        }
        /*Loop through the file line by line
        1. Check whether the number of columns in each line matches with no. of columns in the header line. 
           Any mismatch to be redirected to ErrorRows output
        2. The example file contains 3 fixed columns and therefore the varying no. of columns 
           in the first row (header row) to be saved onto PropertyName column in the TransposedRows output.
        3. Varying no. of columns for each line to be saved onto PropertyValue column in the TransposedRows output

           */
        else
        {
            while (fr.EndOfStream == false)
            {
                if (lineno != 1)
                {
                    line = fr.ReadLine();
                    columnvalues = line.Split(',');
                }
                if (lineno++ >= 2) //proceed if the line is not header
                {
                   if (columnvalues.Length != headercolumnslength)
                    {
                        AddErrorRow("Error", lineno - 1, "Number of columns does not match header row for the Company " + columnvalues[0]);
                    }
                    else
                    {
                        for (int i = 3; i < columnvalues.Length; i++)
                        {
                            TransposedRowsBuffer.AddRow();
                            TransposedRowsBuffer.Company = columnvalues[0];
                            TransposedRowsBuffer.Date = columnvalues[1];
                            TransposedRowsBuffer.Datasource = columnvalues[2];
                            TransposedRowsBuffer.PropertyName = PropertyName[i - 3];
                            if (string.IsNullOrEmpty(columnvalues[i]) == false)
                            { TransposedRowsBuffer.PropertyValue = Convert.ToDecimal(columnvalues[i]); }
                        }
                    }
                }
                else
                {
                    PropertyName = new string[columnvalues.Length - 3];
                    for (int i = 3; i < columnvalues.Length; i++)
                    {
                        PropertyName[i - 3] = columnvalues[i];
                    }
                }
            }
        }
        if (fr != null)
        {
            fr.Close();
        }
    }

}
/*AddErrorRow Function to be called to add any error rows to the ErrorRows Output with appropriate message*/
private void AddErrorRow(string errorType, long lineno,string message)
{
    ErrorRowsBuffer.AddRow();
    ErrorRowsBuffer.Lineno = lineno;
    ErrorRowsBuffer.FileName = Variables.FileName;
    ErrorRowsBuffer.ErrorMessage = message;
    ErrorRowsBuffer.Timestamp = DateTime.Now.ToString("yyyy-MM-dd h:mm tt");
}
			

The script does the following:

  • Checks whether the file is in the expected format – In this example, we have 3 fixed columns and that’s why we assign the value 3 in the “for” loop. It reads the first line which is the header and finds out whether the expected fixed columns are coming through. If it is not coming through as expected then the dataflow task fails by the FireError method after adding the message in the ErrorRows Output.
  • Check whether the rows are incomplete – It reads each line in the files and compares it to the number of columns in the header to identify whether there are any ragged rows and if anything found needs to be marked in the ErrorRows Output, so that those lines could be redirected to an error log.
  • Transpose columns to rows
    • The header row is taken first and puts the varying column values into the PropertyName array.
    • Then the other lines in the file are read one by one to extract the actual values and stored in the PropertyValue array along with the respective PropertyName and these rows are then stored as TransposedRows output.
  • AddError is a function which can be called to store any error message in the ErrorRows output.

Step 4 – Redirecting the outputs

The outputs need to be redirected accordingly. The below screenshot shows that the “TransposedRows” output which contain the valid rows (after transposing from columns to rows) is being redirected to a SQL destination and “ErrorRows” output gets redirected to an error log which is a flat file destination.

Package Design - Description: Screenshot to show the implementation of script and outputs getting redirected to respective destinations

Execution Results

Let’s take an example file with 3 rows (2 valid rows and 1 ragged row) and we would expect the 2 valid rows get transposed and redirected to the database and the 1 bad row gets redirected to the error log.

In the below example, the company Amazon is incomplete (ragged) and should be redirected to the error log.

Company,Date,Datasource,Open Price,High Price,Low Price,Average Price 
Microsoft,2018-03-10,Traders,100.30,140.20,104.35,120.10 
Alphabet,2018-03-10,Traders,70.80,90.60,72.20,80.10 
Amazon,2018-03-10,Traders,120.75,140.23 
Package Execution workflow - Description: Screenshot to show what happens after executing the package.

Results stored in Database

We can see that for Microsoft and Alphabet we have 4 rows for each company.

Execution Results in DB - Description: Execution Results in DB

Results redirected to Error Log

We can see below that the row for Amazon was a problem and was written to the error log file.

Execution Result Error Log  - Description: Execution Result Error log

Similarly, if the header row does not have the right columns that you were expecting the package would fail after adding the message in the ErrorRows output and redirecting it to the log file.

Next Steps


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Balamurugan Loganathan Balamurugan Loganathan is a UK based Data Warehousing and Business Intelligence professional with 13 years of experience.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips



Comments For This Article




Monday, April 23, 2018 - 6:16:51 AM - Balamurugan Loganathan Back To Top (75758)

 @Anne, This script is quite supple as it would work on any number of columns that come with different data types whereas TSQL functions have certain limitations (fixed no. of columns or columns to have same data type). Hope this explains it.

 


Saturday, April 21, 2018 - 11:00:58 AM - C Matthews Back To Top (75748)

 

 Very helpful.  I have run across this problem a number of times.


Friday, April 20, 2018 - 3:05:32 PM - Isaiah A Back To Top (75743)

Excellent article. 

 


Friday, April 20, 2018 - 11:50:20 AM - Anne Back To Top (75741)

 

 Thanks for the great article.

I'm curious if there is a way to do this using SQL instead of C# for what the script to do since I am not a programmer of C#.















get free sql tips
agree to terms