SQL Server Bulk Insert For Full Files, Misformatted Data and Unusual Characters


By:   |   Updated: 2018-08-01   |   Comments (2)   |   Related: More > Import and Export

Problem

Some of our ETL data come in as XML, JSON or flat files, which we retain in full, as we may send these to other sources exactly as we receive them. In some cases, we already have T-SQL that parses the data for our SQL Server platform. How can we use bulk insert to retain data files in full, and also to parse delimited files that use characters other than the standard comma or vertical bar, such as tab-delimited files?

Solution

Bulk insert allows us to specify delimiters other than commas and vertical bars - even ordinary letters (such as the character "t" if we wanted). We can also use the tool to insert an entire file of data if we only want to store a file's contents without parsing it. In this tip, we'll look at some other delimiters we may face when we receive files, along with some other techniques we can use with bulk insert.

In our first example, we'll insert a csv file without specifying any column delimiter (field terminator), which stores each line of the file unparsed in a single column. Since we're familiar with inserting comma-delimited data with bulk insert, this shows us that we can also store comma-delimited lines whole. After creating our table, which uses a single VARCHAR(MAX) column, we see that without any row terminator specified, bulk insert assumes that a new line is our row terminator, and the first insert returns five rows of data. What if we wanted all the data in one row? After we truncate the table, we perform the bulk insert again, this time specifying 'EOF' as the row terminator. Note that 'EOF' is not a special keyword here - bulk insert treats it as a literal character sequence, and because that sequence doesn't appear in our file, the entire file is treated as one row of data, which we see in the result.

--CREATE TABLE tbAllData (
--	AllData VARCHAR(MAX)
--)

BULK INSERT tbAllData
FROM 'C:\import\csvdata.csv'

SELECT *
FROM tbAllData

TRUNCATE TABLE tbAllData

BULK INSERT tbAllData
FROM 'C:\import\csvdata.csv'
WITH (
	ROWTERMINATOR = 'EOF'
)

SELECT *
FROM tbAllData

TRUNCATE TABLE tbAllData
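For contrast, if we did want bulk insert to parse this same file into columns, we would add a field terminator and a table whose columns match the file's layout. The sketch below assumes a hypothetical three-column layout - the actual column list would need to match csvdata.csv:

--Sketch for contrast: parsing the file by commas instead of storing it whole.
--The three-column layout here is a hypothetical example.
CREATE TABLE tbCsvParsed (
	Column1 VARCHAR(100),
	Column2 VARCHAR(100),
	Column3 VARCHAR(100)
)

BULK INSERT tbCsvParsed
FROM 'C:\import\csvdata.csv'
WITH (
	FIELDTERMINATOR = ','
	,ROWTERMINATOR = '0x0a'
)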

In most cases, we won't be inserting csv data as one row, but we might with HTML, XML or JSON, especially if we already have T-SQL tools to handle these formats, or if we use them for an API layer or output where the format must be exactly what we received. In our next example, we insert XML and HTML data into the same table, and on each import we see that all the data end up in one row within the table.

BULK INSERT tbAllData
FROM 'C:\Import\xmldata.xml'
WITH (
	ROWTERMINATOR = 'EOF'
)

SELECT *
FROM tbAllData

TRUNCATE TABLE tbAllData

BULK INSERT tbAllData
FROM 'C:\Import\htmldata.html'
WITH (
	ROWTERMINATOR = 'EOF'
)

SELECT *
FROM tbAllData

TRUNCATE TABLE tbAllData
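An alternative for whole-file loads like these is OPENROWSET with the BULK provider and the SINGLE_CLOB option, which always returns the file as a single value (in a column named BulkColumn), with no row terminator trick needed:

--Alternative: OPENROWSET with SINGLE_CLOB returns the entire file as one value,
--regardless of which characters appear inside it
INSERT INTO tbAllData (AllData)
SELECT BulkColumn
FROM OPENROWSET(BULK 'C:\Import\xmldata.xml', SINGLE_CLOB) AS FileData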

Let's look further at the HTML insert, as the literal string 'EOF' may appear within our file (this could also happen with XML or JSON). If the terminator string exists within our file, it will start a separate row. In the below images, we add the text 'EOF' within the HTML file, then perform the insert again. The result is that we get two rows of data, and the 'EOF' string itself is not within the data, as it demarcated the new row. This provides a caution about inserting data files as entire rows of data: if our row delimiter ("rowterminator" in bulk insert) exists within the file and yet we want the entire file in one row, we'll either need to remove that character sequence throughout the file, or choose a terminator that does not appear in the file, which we can append to the end of the file if necessary.

BULK INSERT tbAllData
FROM 'C:\Import\htmldata.html'
WITH (
	ROWTERMINATOR = 'EOF'
)

SELECT *
FROM tbAllData

TRUNCATE TABLE tbAllData
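One way to act on that caution is to append a sentinel that we know cannot occur in the file and use it as the row terminator. The sketch below uses xp_cmdshell (if enabled - any external script could append the sentinel instead), and the sentinel '@@EOF@@' is an arbitrary string assumed not to appear in the data (terminators have a short maximum length, so keep it brief):

--Sketch: append an arbitrary sentinel to the end of the file, then use it
--as the row terminator so the whole file lands in one row
EXEC xp_cmdshell 'echo @@EOF@@>> C:\Import\htmldata.html'

BULK INSERT tbAllData
FROM 'C:\Import\htmldata.html'
WITH (
	ROWTERMINATOR = '@@EOF@@'
)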

In our next example, we insert tab-delimited data, with the tab character ("\t") specifying a new column and the new line character specifying a new row of data. This insert is similar to a comma or vertical bar delimited file in that we use the delimiter the same way we would use vertical bars or commas. As with commas or vertical bars, some lines may hold misformatted data, and we can use bulk insert's error file option to help with this (shown in the next example).

CREATE TABLE tbTab (
	DataPointX VARCHAR(10),
	DataPointFour VARCHAR(10),
	ActionColumn VARCHAR(10),
	SentColumn VARCHAR(10)
)

BULK INSERT tbTab
FROM 'C:\Import\tabdata.txt'
WITH (
	FIRSTROW=2
	,ROWTERMINATOR = '0x0a'
	,FIELDTERMINATOR = '\t'
)

SELECT *
FROM tbTab

TRUNCATE TABLE tbTab
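As a side note, the tab character can also be written with its hex code, just as we wrote the newline as '0x0a' above - helpful when an editor converts tabs to spaces or otherwise mangles the '\t' escape. This insert should be equivalent to the one above:

BULK INSERT tbTab
FROM 'C:\Import\tabdata.txt'
WITH (
	FIRSTROW=2
	,ROWTERMINATOR = '0x0a'
	,FIELDTERMINATOR = '0x09'	--0x09 is the hex code for tab, equivalent to '\t'
)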

We will sometimes import data that are misformatted, and we can use bulk insert's ERRORFILE option to continue the insert while logging errors, which provides an alternative to removing misformatted data prior to import. Using ERRORFILE, bulk insert can still load the valid rows even when some rows fail. We'll look at two examples: one in which the misformatted rows stay below our threshold of maximum allowed errors, and one in which they exceed it. In the first example, we use a misformatted comma-delimited file in which one line has too many commas (line 7). In the second example, we add six bad lines to the file so that it crosses our maximum threshold. The two blocks below are the files misformat1 and misformat2, with the inserts following.

Test,Result1,Result2,Result3,Result4
Test1,1,1,0,0
Test2,1,0,1,0
Test3,0,1,1,0
Test4,0,1,0,0
Test5,0,0,0,1
Test6,0,0,1,0,0
Test7,0,0,0,0
Test8,0,1,0,0

Test,Result1,Result2,Result3,Result4
Test1,1,1,0,0
Test2,1,0,1,0
Test3,0,1,1,0
Test4,0,1,0,0,1
Test5,0,0,0,1,1
Test6,0,0,1,0,0
Test7,0,0,0,0
Test8,0,1,0,0,0
Test9,0,1,0
Test10,0,0,0,0
Test11,0,0,0,1,0
Test11,0,1,0,1

CREATE TABLE TestResults(
	Test VARCHAR(10),
	Result1 TINYINT,
	Result2 TINYINT,
	Result3 TINYINT,
	Result4 TINYINT
)

BULK INSERT TestResults
FROM 'C:\import\misformat1.txt'
WITH (
	FIELDTERMINATOR = ','
	,ROWTERMINATOR = '0x0a'
	,FIRSTROW=2
	,ERRORFILE='C:\import\errorfile1.log'
	,MAXERRORS=5
)

SELECT *
FROM TestResults

TRUNCATE TABLE TestResults

BULK INSERT TestResults
FROM 'C:\import\misformat2.txt'
WITH (
	FIELDTERMINATOR = ','
	,ROWTERMINATOR = '0x0a'
	,FIRSTROW=2
	,ERRORFILE='C:\import\errorfile2.log'
	,MAXERRORS=5
)

SELECT *
FROM TestResults

TRUNCATE TABLE TestResults
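If we decide the second file's error rate is acceptable after all, raising MAXERRORS above the number of bad rows lets the insert complete while still logging every failure. A sketch - note the new error file name, since bulk insert will not write over an existing error file from a previous run:

BULK INSERT TestResults
FROM 'C:\import\misformat2.txt'
WITH (
	FIELDTERMINATOR = ','
	,ROWTERMINATOR = '0x0a'
	,FIRSTROW=2
	,ERRORFILE='C:\import\errorfile3.log'	--new name; the earlier log already exists
	,MAXERRORS=10
)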

With the first file, bulk insert raises an error for the bad row but still inserts the valid rows, since one error stays under our MAXERRORS threshold. With the second file, once bulk insert exceeds MAXERRORS, it fails the entire insert. We'll also see that in the error files we specified (errorfile1 and errorfile2), the bad records are saved:

Test6,0,0,1,0,0

Test4,0,1,0,0,1
Test5,0,0,0,1,1
Test6,0,0,1,0,0
Test8,0,1,0,0,0
Test9,0,1,0
Test10,0,0,0,0
Test11,0,0,0,1,0

This provides us with an alternative to pre-validating our data, but it may be impractical depending on how large our files are and what percentage of each file we're willing to save to the error log - if too much of a file is erroneous, we may want to reconsider the delimiter or how the data are formatted.
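If we do want to pre-validate, one approach that stays in T-SQL is to load the file unparsed (as in the first example) and count the delimiters per line before running the real import. A sketch reusing tbAllData from earlier, assuming a valid line has five comma-separated fields (four commas):

--Load each line whole, then flag lines whose comma count is not 4
TRUNCATE TABLE tbAllData

BULK INSERT tbAllData
FROM 'C:\import\misformat2.txt'
WITH (
	ROWTERMINATOR = '0x0a'
)

SELECT AllData
FROM tbAllData
WHERE (LEN(AllData) - LEN(REPLACE(AllData, ',', ''))) <> 4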

Next Steps
  • When we insert data using bulk insert, we must know the format of the data and create a compatible table that matches it; otherwise, we will get errors when bulk insert tries to parse the data. In this tip we looked at a few formats: an entire file stored whole, tab-delimited data, and files broken into rows by new line characters. This does not cover every possibility, as different companies may use different delimiters (though comma and vertical bar are typically the most common). When we get a new file type, we should understand its format and adjust our specifications appropriately.
  • In some cases, our data prior to import are corrupt, misformatted, or have a wrapped format, such as quotes around values on some lines. We can use PowerShell to parse these lines separately by moving them to another file and reviewing them, or we can try to create a custom function to address them during import - though this latter step may take more time depending on the size of the file.
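  • For quote-wrapped fields specifically, SQL Server 2017 added native CSV support to bulk insert through the FORMAT and FIELDQUOTE options, which handle commas inside quoted values without needing a format file. A sketch against a hypothetical quoted file:

--SQL Server 2017 and later: FORMAT = 'CSV' with FIELDQUOTE handles values
--such as "Smith, Tim" - the file name here is a hypothetical example
BULK INSERT TestResults
FROM 'C:\import\quoteddata.csv'
WITH (
	FORMAT = 'CSV'
	,FIELDQUOTE = '"'
	,FIRSTROW = 2
)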


About the author
MSSQLTips author Tim Smith Tim Smith works as a DBA and developer and also teaches Automating ETL on Udemy.


Wednesday, August 01, 2018 - 3:18:34 PM - Kyle

Is there an easy way to deal with double quotes around fields that have commas in the data? I think it can be done with a BULK INSERT formatting file, but I'm wondering if there's a new easy way of doing it now.


Wednesday, August 01, 2018 - 6:47:16 AM - Ron Abbott

Very interesting tip, thanks.

I have an additional requirement - is there any way that Bulk Insert can cater for fields with a large number of characters - say up to 4,000 ?

I used the SQL import data wizard and was able to specify the incoming fields as DB_TEXT, which allowed this, but I would prefer an option to do this with Bulk Insert.

 


