SQL Server Bulk Insert For Full Files, Misformatted Data and Unusual Characters
Some of our ETL data come in as XML, JSON or flat files, which we retain in full, as we may send these to other sources as we receive them. In some cases, we already have T-SQL which parses the data for our SQL Server platform. How can we use bulk insert to retain data files in full, along with using bulk insert to parse some delimited files with other characters outside the standard comma and vertical bar, such as tab delimited files?
Bulk insert allows us to specify other delimiters than commas, vertical bars, specific letters (such as the character "t" if we wanted). We can even use the tool to insert an entire file of data, if we only want to store the data from a file without parsing it. In this tip, we'll be looking at some other delimiters we may face when we get files along with some other methods we can use around bulk insert.
In our first example, we'll insert a csv file without specifying any column delimiter (field terminator), which will insert the entire data file. Since we're familiar with inserting comma-delimited data with bulk insert, this shows us that we can also insert comma-delimited data as an entire data column. After creating our table, which allows for a data type of VARCHAR(MAX), we see that without any row terminator specified, bulk insert assumes that new line is our row terminator. In the first example, we get five rows of data as a result. What if we wanted to get all the data in one row? After we truncate the table, we perform the bulk insert again, this time specifying the end of file (EOF) as the row terminator. This means that our bulk insert will treat the entire file as one row of data and we see the result of the entire file in one row.
--CREATE TABLE tbAllData ( -- AllData VARCHAR(MAX) --) BULK INSERT tbAllData FROM 'C:\import\csvdata.csv' SELECT * FROM tbAllData TRUNCATE TABLE tbAllData BULK INSERT tbAllData FROM 'C:\import\csvdata.csv' WITH ( ROWTERMINATOR = 'EOF' ) SELECT * FROM tbAllData TRUNCATE TABLE tbAllData
In most cases, we won't be inserting csv data into one row of data, but we might with HTML, XML or JSON, especially if we already have T-SQL tools to handle these, or if we use these for an API layer or output where the format will be exactly what we received. In our next example, we insert XML and HTML data into the same table and we see the result on each import where all the data end up in one data row within the table.
BULK INSERT tbAllData FROM 'C:\Import\xmldata.xml' WITH ( ROWTERMINATOR = 'EOF' ) SELECT * FROM tbAllData TRUNCATE TABLE tbAllData BULK INSERT tbAllData FROM 'C:\Import\htmldata.html' WITH ( ROWTERMINATOR = 'EOF' ) SELECT * FROM tbAllData TRUNCATE TABLE tbAllData
Let's look further at the HTML insert again, as the EOF character may be within our file (this could also be possible with XML or JSON). If we have the EOF character within our file, it will become a separate row. In the below images, we see that we add an EOF character within the HTML file, then perform the insert again. The result is that we get two rows of data and the EOF character is not within the data, as it demarcated the new row. This provides a caution about inserting data files as entire rows of data: if our row delimiter ("rowterminator" in bulk insert) exists throughout the file and yet we want the entire file in one row, we'll either need to remove the character throughout the file, or specify a new character that ends the file, which we can add if necessary.
BULK INSERT tbAllData FROM 'C:\Import\htmldata.html' WITH ( ROWTERMINATOR = 'EOF' ) SELECT * FROM tbAllData TRUNCATE TABLE tbAllData
In our next example, we insert tab delimited data and want the tab character ("\t") to specify a new column along with the new line character specifying a new row of data. This insert is similar to a comma or vertical bar delimited file in that we use the delimiter in the same way that we would use it with vertical bars or commas. Like with commas or vertical bars, there will be some data that are misformatted on some lines and we can use the error file command in bulk insert to help with this (shown in the next example).
CREATE TABLE tbTab ( DataPointX VARCHAR(10), DataPointFour VARCHAR(10), ActionColumn VARCHAR(10), SentColumn VARCHAR(10) ) BULK INSERT tbTab FROM 'C:\Import\tabdata.txt' WITH ( FIRSTROW=2 ,ROWTERMINATOR = '0x0a' ,FIELDTERMINATOR = '\t' ) SELECT * FROM tbTab TRUNCATE TABLE tbTab
We will sometimes import data that are misformatted and we can use bulk insert's ERRORFILE command to continue performing the insert and log errors, which provides an alternative to removing misformatted data prior to import. Using the ERRORFILE command, we can still insert data even if there are errors. We'll look at two examples, one in which we have misformatted data below our threshold of maximum errors allowed and an example where we have above the amount of errors and the results. In the below code we do this with a misformatted comma delimited file - one of the lines of the file has too many commas (line 7) in our first example. In our second example, we'll add six bad lines to the file so that it crosses our maximum threshold. The below two blocks are the files misformat1 and misformat2 with the inserts following.
Test,Result1,Result2,Result3,Result4 Test1,1,1,0,0 Test2,1,0,1,0 Test3,0,1,1,0 Test4,0,1,0,0 Test5,0,0,0,1 Test6,0,0,1,0,0 Test7,0,0,0,0 Test8,0,1,0,0
Test,Result1,Result2,Result3,Result4 Test1,1,1,0,0 Test2,1,0,1,0 Test3,0,1,1,0 Test4,0,1,0,0,1 Test5,0,0,0,1,1 Test6,0,0,1,0,0 Test7,0,0,0,0 Test8,0,1,0,0,0 Test9,0,1,0 Test10,0,0,0,0 Test11,0,0,0,1,0 Test11,0,1,0,1
CREATE TABLE TestResults( Test VARCHAR(10), Result1 TINYINT, Result2 TINYINT, Result3 TINYINT, Result4 TINYINT ) BULK INSERT TestResults FROM 'C:\import\misformat1.txt' WITH ( FIELDTERMINATOR = ',' ,ROWTERMINATOR = '0x0a' ,FIRSTROW=2 ,ERRORFILE='C:\import\errorfile1.log' ,MAXERRORS=5 ) SELECT * FROM TestResults TRUNCATE TABLE TestResults BULK INSERT TestResults FROM 'C:\import\misformat2.txt' WITH ( FIELDTERMINATOR = ',' ,ROWTERMINATOR = '0x0a' ,FIRSTROW=2 ,ERRORFILE='C:\import\errorfile2.log' ,MAXERRORS=5 ) SELECT * FROM TestResults TRUNCATE TABLE TestResults
In the first file with an erroneous value, bulk insert errors, but still inserts the valid values. In the second example, once bulk insert hits too many errors, it fails the entire insert. We'll also see that in the error log files we specified (errorlog1 and errorlog2), the bad records are saved to the files:
Test4,0,1,0,0,1 Test5,0,0,0,1,1 Test6,0,0,1,0,0 Test8,0,1,0,0,0 Test9,0,1,0 Test10,0,0,0,0 Test11,0,0,0,1,0
This provides us with an alternative to pre-validating our data, but it may be impractical depending on how large our files are and what percent of the files we're willing to save to the error log - if too much of the file is erroneous, we may want to reconsider the delimiter or how the data are formatted.
- When we insert data using bulk insert, we must know the format of the data and create a compatible table that matches the format, otherwise we will get errors when it tries to parse the data. In this tip we look at a few data types, such as several examples with an entire file, tab delimited and a file broken up by new line characters. This does not cover every possibility, as different companies may use different delimiters (though comma and vertical bar are typically the most common). When we get a new file type, we should understand what format it is and adjust our specifications appropriately.
- In some cases, our data prior to import are either corrupt, misformatted, or have a wrapped format, such as quotes around characters on some lines. We can use PowerShell to parse these separately by moving them to another file and looking at them, or we can try to create a custom function to address them during an import - though this latter step may take more time depending on the size of the file.
About the author
View all my tips