Learn more about SQL Server tools

 

Tutorials          DBA          Dev          BI          Career          Categories          Events          Whitepapers          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Events

DBA    Dev    BI    Categories

 

Import UTF-8 Unicode Special Characters with SQL Server Integration Services


By:   |   Read Comments (10)   |   Related Tips: More > Integration Services Development

Problem

As you work with large scale SQL Server Integration Services ETL processes and sequences, you are bound to have to work with UTF-8 encoded text files. Without proper handling, UTF-8 / Unicode characters can cause havoc with your SSIS load tasks. What are some ideas for handling the UTF-8 file formats with SSIS? Check out this tip to learn more.

Solution

First, providing some background about UTF-8 and Unicode would likely go a long way into explaining how to handle these different code page types. Unicode is an encoding standard maintained by the Unicode Consortium; most of the biggest players in the technology field (Google, SAP, Microsoft, Oracle) along with many others belong to the consortium. The consortium was setup to standardize the numbering scheme assigned to represent a letter or character across all languages, countries, software products, and hardware platforms. Drilling down further, UTF-8 is actually an encoding method for handling all the characters in the Unicode set of characters and stands for Unicode Transformation Format. I know I am going to do a disservice to the encoding process by explaining the transformation in these terms, however, UTF is sort of like the decryption key (or secret decoder ring!) used to map your backend bytes to the actual characters displayed on the screen. Of course, there is much more to it, so feel free to take a look at the link in the next steps section of this tip. To ease the conversion from prior encoding standards, keep in mind the first 128 UTF-8 characters match the ANSI character encoding standards.

Working with SSIS and UTF-8 Unicode Data

Enough of the theory and background; let us put this knowledge into practice. First, we need to create a UTF-8 encoded text file with some special characters. As you can see from the screen prints below, most of the rows contain one or several special characters. I actually created two different text files; the first text file is a bar delimited, 4 column file as shown in the upper screen print below. The second file is a ragged right, fixed width 4 column file.

TextFile

Ragged Right

To create the files in the correct encoding format, I used a free tool called NotePad++. As you can see in the below screen print, the encoding for our sample files is set to UTF-8.

UTF Encode 8

Fortunately, SSIS is smart enough to handle many different encoding code pages. Furthermore, UTF-8 Unicode encoding format is one of the most popular encoding pages as it is used by a wide variety of web pages. Using the bar delimited file, noted in the initial screen prints above, we can quickly load the data with minimal adjustment to the flat file connection. First, you will need to open up SQL Server Data Tools or SSDT (formerly known as BIDS), and then create a new SSIS Package with a proper and descriptive name. Next as shown below, is to create a connection to the bar delimited flat file. After browsing to the flat file, you will notice that on the General Options page that the Code page defaults to the 65001 (UTF-8) code page which is the proper UTF-8 code page. On this same window, adjust the Connection manager name to an appropriate name, change the format to delimited, and then select the "Column names are in the first data row" check box.

Delimited Advanced

Next, on the columns screen, verify that the column delimiter is set to vertical bar.

columns window

Then on the Advanced screen, the data type for each string field must be changed from string [DT_STR] to Unicode string [DT_WSTR].

delimited advanced screen

Finally, we can preview the flat file connection setup, as revealed below.

Delimited Preview

Based on the above preview, the character sets looks like they are displaying the appropriate characters especially from the UTF-8 encoding standpoint. Next, we can setup up our data flow with a flat file source and a SQL Server destination and finally mapping the source and destination columns

Data Flow

Flat File Source

Next the SQL Server destination is created and finally the source and destination columns are mapped as previewed in the next two screen shots.

Data Flow SQL Destination

Data flow mapping

At last, we are ready to test run our data flow package which reports success, as shown below, with two green check marks!

Data Flow Success

Running a query against the table displays success also; please note all the fields are NVARCHAR data types.

Query Results

Of course, you are probably saying, "so what is the problem"? With delimited flat file sets, nothing. However, let us move on to a fixed length file. Running through the same connection file process, we start with browsing and selecting our fixed format flat file. Be sure that the Code page is 65001 (UTF-8), the format is set to Ragged right, and that the "Column names in the first data row" check box is selected. So far, so good.

Ragged right fixed

However, as displayed on the below screen print, we immediately begin to see some problems as we attempt to set the fixed column widths. On this screen you set the column widths for each field by click on the ruler area. Taking a look at the first column, the name field, we see that even though the column is supposed to be just 14 characters wide, some of the names with special characters push beyond 14 characters wide. We could adjust the column size by one or more units, but then some of the columns would actually contain characters belonging to another field. Also notice, that each subsequent column is pushed right based on the number of special characters on that row.

Ragged Right columns window

I decided that maybe the problem is just a preview issue and trudge forward onto the advanced screen. Just as with the delimited file, we need to adjust all the character columns to use a Unicode string [DT_WSTR] data type.

Advanced Fixed File

Finally, reviewing the preview window, as illustrated below, the same problem which appeared on column screen surfaces on the preview screen. Again, I contemplated that maybe the issue was just a "preview" issue and moved forward with creating the data flow.

Ragged Right Preview

Following the same process noted above for the delimited file, we create a data flow with a flat file source and SQL Server destination are created. Finally we run the fixed file package.

Data flow success

Wow, the data loaded successfully. However reviewing the below query results shows that the preview problem raised earlier permeated to load. The good thing is that characters were loaded correctly, but just into the wrong column.

Fixed file Query Results

After completing some various web search results, I come to the conclusion that the problem is a known issue with a related Microsoft Connect entry: http://connect.microsoft.com/SQLServer/feedback/details/742632/european-character-causes-error-when-importing-flat-file-utf-8-code-page-with-ragged-right-format. So what is our work around? Since the characters are being loaded correctly, we actually have a few options. First, and my preferred option is to load the entire row of data into a single column and then use SQL to complete the parsing process. Another option could be to write a VB or C script component to parse out the characters. For the rest of this example, we will focus on the SQL solution. First we need to create a staging table with a single column large enough to load the entire row; the data type for the column needs to be NVARCHAR. I have heard that some folks mention that they cannot fit the entire row into one column, so you can potentially load the data into two or more rows.

Stage Table

Now we can test run the package to load the entire row into the staging table whose result is displayed below.

Single Row query results

Next we create a query to parse the long string into the appropriate column parts using the SUBSTRING function. Using the original file layout of column widths and start positions can be of great help when creating the Substring query.

substring query

Finally, we can add this query as a source for a second data flow to move the data from the single column staging table to the final table, as illustrated in the next two screen prints.

Stage to Prod Data Flow

Staging Data Source

Using this alternate method, we now have the fixed ragged right file loaded into the table as noted below.

Final Query Results

Conclusion

Even though SSIS supports a large variety of various code page encoding formats, loading data from a UTF-8 encoded flat files that contain special characters (and other encoding formats) may not load as expected. A delimited file with special characters loads fairly easily and without many issues outside of setting the data types correctly. However if you have a ragged right fixed format flat file, the load will generally occur without error, but as seen in the preview mode of the flat file connection manager, the fields and rows of data that contain special characters will often push other characters to the right and thus load incorrectly parsed data into the wrong fields. This situation can be worked around by loading the entire row as a Unicode string into a column wide enough to accept all the row data. Once loaded into the single column, the data can then be parsed using the SQL Server Substring function. As an alternative, you could also parse the row using a script component in SSIS. Although the issue has been acknowledged on Microsoft Connect, it has been around since SQL Server 2005 and was not addressed in SQL Server 2012. Thus, we will likely need to deal with the workaround for the near future.

Next Steps



Last Update:






About the author
MSSQLTips author Scott Murray Scott Murray has a passion for crafting BI Solutions with SharePoint, SSAS, OLAP and SSRS.

View all my tips


 









Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Notify for updates 


Get free SQL tips:

*Enter Code refresh code     



Tuesday, March 22, 2016 - 2:47:46 PM - Bosko Back To Top

:) :)  Python one liner conversions...

 

python.exe -c "open('out.csv','wb').write( unicode( open('inp.csv','rb').read(), 'utf8').encode('UTF-16LE') )"


Wednesday, March 16, 2016 - 11:41:30 AM - Scott Murray Back To Top

It all depends on the data in the file and what it has been assigned.  Yes the check box is the "default" setting, but the DSN may discover other data source types based on the 1st xx number of lines


Wednesday, March 16, 2016 - 10:26:04 AM - Jared Back To Top

I am confused as to the redundancy in unicode settings available.  You can set the code page in the general file connection settings to 65001 (UTF-8), as you demonstrated.  Then what does the "Unicode" checkbox on the same page do exactly?  And why is it necessary to go to each field in a file source and change the datatype to DT_WSTR?  Shouldn't they already be unicode strings?


Wednesday, September 30, 2015 - 12:44:04 PM - Robert Carnegie Back To Top

Alternatively, you may be able to convert a file from UTF-8 to UTF-16 and then treat it as UCS-2 in SQL Server.  Even use BCP or BULK INSERT.  Notepad++ has a setting to alter the encoding when you save a file, but I've seen Notepad++ do funny things when editing a large file.  Java comes with a converter "native2ascii" which also did some funny things, but the data seemed to come out OK in SQL Server.  I used a recipe like this in a CMD file.  This is actually converting to and then from UTF-32.

SET FILE1=%1
SET FILE2=C:\OUTPUT\UCS.CSV
SET CONVERT=C:\jdk1.6.0_10\bin\native2ascii
%CONVERT% -encoding UTF-8 %FILE1% | %CONVERT% -reverse -encoding X-UTF-16LE-BOM > %FILE2%

By dropping any UTF-8 text file on the script's icon, the file data should be shoved into file UCS.CSV.

At the time (2011) I wrote, "Conversion by this method prior to import is not completely satisfactory, because the output file is a little larger than expected from the input size, without explanation, except that something like blank lines seems to appear in the converted output.  That appears to be filtered out in import, but it may be not the only unintended alteration."

Looking at it subsequently, I found I also couldn't remember why I apparently deliberately chose an encoding with "Byte order mark" (BOM) in the output.  It may be better without.

 


Thursday, April 16, 2015 - 3:07:06 PM - MMM Back To Top

Great post, saved me tons of hours of working. Very nice work around, really appreciate it, THANKS!!!


Wednesday, October 29, 2014 - 6:08:16 PM - Jason Simmons Back To Top

Very helpful article.  I have been dealing with the exact same issue and spent 24 hours tweaking my database to no avail.... this got me what I needed and I am well on my way with loading data.

 

Thanks again!

 

~Jason


Tuesday, October 21, 2014 - 5:02:09 AM - Pat Back To Top

Very helpful tips. Thank you!


Tuesday, August 19, 2014 - 2:07:47 PM - Heather Phillips Back To Top

I would like to open a DAT file from a Windows Mail or Microsoft Outlook email. It does not open with file viewers etc. It has codes in the notepad file that I am able to view and it's an email that was saved. Are you able to help me?

 

I would be so grateful if you could.

 

Thank you


Thursday, August 07, 2014 - 11:48:39 PM - Sam Back To Top

This was a huge help!

I super appreciate you writing this. Was stuck on this import issue for a few hours.

Thanks!

Sam


Wednesday, January 08, 2014 - 8:19:55 AM - John Back To Top

Well done!


Learn more about SQL Server tools