Learn more about SQL Server tools



solving sql server problems for millions of dbas and developers since 2006 join MSSQLTips for free SQL Server tips













































   Got a SQL tip?
            We want to know!

Import UTF-8 Unicode Special Characters with SQL Server Integration Services

MSSQLTips author Scott Murray By:   |   Read Comments (3)   |   Related Tips: More > Integration Services Development
Problem

As you work with large scale SQL Server Integration Services ETL processes and sequences, you are bound to have to work with UTF-8 encoded text files. Without proper handling, UTF-8 / Unicode characters can cause havoc with your SSIS load tasks. What are some ideas for handling the UTF-8 file formats with SSIS?  Check out this tip to learn more.

Solution 

First, providing some background about UTF-8 and Unicode would likely go a long way into explaining how to handle these different code page types. Unicode is an encoding standard maintained by the Unicode Consortium; most of the biggest players in the technology field (Google, SAP, Microsoft, Oracle) along with many others belong to the consortium. The consortium was setup to standardize the numbering scheme assigned to represent a letter or character across all languages, countries, software products, and hardware platforms. Drilling down further, UTF-8 is actually an encoding method for handling all the characters in the Unicode set of characters and stands for Unicode Transformation Format. I know I am going to do a disservice to the encoding process by explaining the transformation in these terms, however, UTF is sort of like the decryption key (or secret decoder ring!) used to map your backend bytes to the actual characters displayed on the screen. Of course, there is much more to it, so feel free to take a look at the link in the next steps section of this tip. To ease the conversion from prior encoding standards, keep in mind the first 128 UTF-8 characters match the ANSI character encoding standards.

Working with SSIS and UTF-8 Unicode Data

Enough of the theory and background; let us put this knowledge into practice. First, we need to create a UTF-8 encoded text file with some special characters. As you can see from the screen prints below, most of the rows contain one or several special characters. I actually created two different text files; the first text file is a bar delimited, 4 column file as shown in the upper screen print below.  The second file is a ragged right, fixed width 4 column file.

TextFile

Ragged Right

To create the files in the correct encoding format, I used a free tool called NotePad++.  As you can see in the below screen print, the encoding for our sample files is set to UTF-8.

UTF Encode 8

Fortunately, SSIS is smart enough to handle many different encoding code pages. Furthermore, UTF-8 Unicode encoding format is one of the most popular encoding pages as it is used by a wide variety of web pages. Using the bar delimited file, noted in the initial screen prints above, we can quickly load the data with minimal adjustment to the flat file connection. First, you will need to open up SQL Server Data Tools or SSDT (formerly known as BIDS), and then create a new SSIS Package with a proper and descriptive name.  Next as shown below,  is to create a connection to the bar delimited flat file. After browsing to the flat file, you will notice that on the General Options page that the Code page defaults to the 65001 (UTF-8) code page which is the proper UTF-8 code page. On this same window, adjust the Connection manager name to an appropriate name, change the format to delimited, and then select the "Column names are in the first data row" check box.

Delimited Advanced

Next, on the columns screen, verify that the column delimiter is set to vertical bar. 

columns window

Then on the Advanced screen, the data type for each string field must be changed from string [DT_STR] to Unicode string [DT_WSTR].

delimited advanced screen

Finally, we can preview the flat file connection setup, as revealed below.

Delimited Preview

Based on the above preview, the character sets looks like they are displaying the appropriate characters especially from the UTF-8 encoding standpoint. Next, we can setup up our data flow with a flat file source and a SQL Server destination and finally mapping the source and destination columns

Data Flow

Flat File Source

Next the SQL Server destination is created and finally the source and destination columns are mapped as previewed in the next two screen shots.

Data Flow SQL Destination

Data flow mapping

At last, we are ready to test run our data flow package which reports success, as shown below, with two green check marks!

Data Flow Success

Running a query against the table displays success also; please note all the fields are NVARCHAR data types.

Query Results

Of course, you are probably saying, "so what is the problem"? With delimited flat file sets, nothing.  However, let us move on to a fixed length file. Running through the same connection file process, we start with browsing and selecting our fixed format flat file. Be sure that the Code page is 65001 (UTF-8), the format is set to Ragged right, and that the "Column names in the first data row" check box is selected. So far, so good.

Ragged right fixed

However, as displayed on the below screen print, we immediately begin to see some problems as we attempt to set the fixed column widths. On this screen you set the column widths for each field by click on the ruler area. Taking a look at the first column, the name field, we see that even though the column is supposed to be just 14 characters wide, some of the names with special characters push beyond 14 characters wide. We could adjust the column size by one or more units, but then some of the columns would actually contain characters belonging to another field. Also notice, that each subsequent column is pushed right based on the number of special characters on that row.

Ragged Right columns window

I decided that maybe the problem is just a preview issue and trudge forward onto the advanced screen. Just as with the delimited file, we need to adjust all the character columns to use a Unicode string [DT_WSTR] data type.

Advanced Fixed File

Finally, reviewing the preview window, as illustrated below, the same problem which appeared on column screen surfaces on the preview screen. Again, I contemplated that maybe the issue was just a "preview" issue and moved forward with creating the data flow.

Ragged Right Preview

Following the same process noted above for the delimited file, we create a data flow with a flat file source and SQL Server destination are created. Finally we run the fixed file package.

Data flow success

Wow, the data loaded successfully. However reviewing the below query results shows that the preview problem raised earlier permeated to load. The good thing is that characters were loaded correctly, but just into the wrong column.

Fixed file Query Results

After completing some various web search results, I come to the conclusion that the problem is a known issue with a related Microsoft Connect entry: http://connect.microsoft.com/SQLServer/feedback/details/742632/european-character-causes-error-when-importing-flat-file-utf-8-code-page-with-ragged-right-format. So what is our work around? Since the characters are being loaded correctly, we actually have a few options.  First, and my preferred option is to load the entire row of data into a single column and then use SQL to complete the parsing process. Another option could be to write a VB or C script component to parse out the characters. For the rest of this example, we will focus on the SQL solution.  First we need to create a staging table with a single column large enough to load the entire row; the data type for the column needs to be NVARCHAR. I have heard that some folks mention that they cannot fit the entire row into one column, so you can potentially load the data into two or more rows. 

Stage Table

Now we can test run the package to load the entire row into the staging table whose result is displayed below.

Single Row query results

Next we create a query to parse the long string into the appropriate column parts using the SUBSTRING function. Using the original file layout of column widths and start positions can be of great help when creating the Substring query.

substring query

Finally, we can add this query as a source for a second data flow to move the data from the single column staging table to the final table, as illustrated in the next two screen prints.

Stage to Prod Data Flow

Staging Data Source

Using this alternate method, we now have the fixed ragged right file loaded into the table as noted below.

Final Query Results

Conclusion

Even though SSIS supports a large variety of various code page encoding formats, loading data from a UTF-8 encoded flat files that contain special characters (and other encoding formats) may not load as expected. A delimited file with special characters loads fairly easily and without many issues outside of setting the data types correctly. However if you have a ragged right fixed format flat file, the load will generally occur without error, but as seen in the preview mode of the flat file connection manager, the fields and rows of data that contain special characters will often push other characters to the right and thus load incorrectly parsed data into the wrong fields. This situation can be worked around by loading the entire row as a Unicode string into a column wide enough to accept all the row data. Once loaded into the single column, the data can then be parsed using the SQL Server Substring function. As an alternative, you could also parse the row using a script component in SSIS. Although the issue has been acknowledged on Microsoft Connect, it has been around since SQL Server 2005 and was not addressed in SQL Server 2012. Thus, we will likely need to deal with the workaround for the near future.

Next Steps



Last Update: 12/5/2013


About the author
MSSQLTips author Scott Murray
Scott Murray has a passion for crafting BI Solutions with SharePoint, SSAS, OLAP and SSRS.

View all my tips


print tip Print  
Become a paid author





join MSSQLTips for free SQL Server tips     



Learn more about SQL Server tools
Comments and Feedback:
Wednesday, January 08, 2014 - 8:19:55 AM - John Read The Tip

Well done!


Thursday, August 07, 2014 - 11:48:39 PM - Sam Read The Tip

This was a huge help!

I super appreciate you writing this. Was stuck on this import issue for a few hours.

Thanks!

Sam


Tuesday, August 19, 2014 - 2:07:47 PM - Heather Phillips Read The Tip

I would like to open a DAT file from a Windows Mail or Microsoft Outlook email. It does not open with file viewers etc. It has codes in the notepad file that I am able to view and it's an email that was saved. Are you able to help me?

 

I would be so grateful if you could.

 

Thank you



Post a Comment or Question

Keep it clean and stay on the subject or we may delete your comment.
Your email address is not published. Required fields are marked with an asterisk (*)

*Name   *Email Notify for updates



Comments
Get free SQL tips:

*Enter Code refresh code


 
Sponsor Information







Copyright (c) 2006-2014 Edgewood Solutions, LLC All rights reserved
privacy | disclaimer | copyright | advertise | about
authors | contribute | feedback | giveaways | free t-shirt | user groups | community | events | first timer?
Some names and products listed are the registered trademarks of their respective owners.