Importing XML documents using SQL Server Integration Services

By:   |   Comments (22)   |   Related: More > Integration Services Development


Problem

What is the process for importing an XML document into SQL Server via SSIS? What are some helpful hints for loading the data effectively and efficiently?

Solution

SSIS is a robust and versatile ETL tool that can handle most any type of load process that it is tasked with completing. XML files are no exception to this statement as Integration Services can load various XML files at various levels of complexity.

Before we dive into the load process, we need to cover a few basics about XML files. First, an XML file is an extensible Markup Language file; XML itself is a Markup Language which shares some similarities to HTML. Just like HTML, XML and XML files use tags; in XML's case, these tags are used to define, describe, structure, and nest data. Furthermore, and contrary to HTML, each tag is not predefined, but is established by the users who generates the files. Also, XML files are all about the information and data within a file and not about how to display the data or information. Thus, an XML file is often self defining although special files called XML Schema Definition files (XSD) can also be used to validate, format, and describe a particular XML file. Of course, XSD's are coded in XML.

Import XML File

The SSIS import process starts with creating a new SSIS solution and project in SQL Server Data Tools (SSDT formerly known as BIDS or Business Intelligence Design Studio). Be sure to name the new solution and project appropriately as shown below. Also, rename the package to a descriptive name.

New Project

Next we can drag and drop a new data flow task onto the design area and again rename the new data flow to a meaningful name, as illustrated below.

Data flow

Next, double click on the data flow task to define the details of the data flow. From the SSIS toolbox, Other Sources category, drag the XML Source onto the design grid and rename as needed. 

XML Source Drag

Double click on the XML Source to browse to the location of the XML file as shown below. This screen is where some of the decision making must take place and also where many problems surface. First the Data access mode is selected; for this example use an actual file location, but XML File from variable or XML data from variable can also be selected. Thus, you can browse to the location of the XML file.

Finally, the last three items, Use inline schema, XSD location, and Generate XSD make a significant difference in what happens during the XML processing (commonly known as shredding). The Use Inline schema option tells SSIS that the schema definition actually resides in the XML file itself. In essence the XSD file is embedded in the XML file. If this option is selected than the XSD location and Generate XSD options are grayed out. Alternately, you can browse and select the location of a separate XSD file; this file will outline in detail the tags within the XML file. Generally if you get an XML file from an alternate source, the XSD should already be embedded inline within the XML file or the organization providing the XML should provide you with the appropriate XSD file. 

XML Source

In the below example, I am using sample "purchase order" and "customers and orders" XML files which can be downloaded from the MSDN site: http://msdn.microsoft.com/en-us/library/bb387034.aspx and http://msdn.microsoft.com/en-us/library/bb387025.aspx respectively.  I also downloaded related XSD files at: http://msdn.microsoft.com/en-us/library/dd489284(v=vs.110).aspx and http://msdn.microsoft.com/en-us/library/bb675181.aspx. Both of these XML files are moderately complex with several layers (or nested) values. Unfortunately, as an XML file's hierarchy structure and complexity increases, often times the supplied XSD file do not parse and load the data correctly. Often the load will occur without error, however no data is transferred. Thus, the Generate XSD button will need to be used to generate a new XSD file to be used for importing. The Generate XSD process works well most often, but you may need to adjust it for special cases within your XML file. 

PO Source

Unfortunately, in our example using the downloaded XSD file for the purchase order XML file did not load any data. However, utilizing the Generate XSD, will load, which will be detailed next, was successful. The generate XSD process actually creates a new XSD file which must then be browsed to and selected in the XSD location field.

Once the files are selected, the next step is to review the potential columns to load. For XML files which contain hierarchies and layers of elements within other elements, a separate tabular set of data is generated for each item falling under the higher element. In the Purchase Order XML file example, the purchase order set also contains shipping addresses, item, and items elements. Thus, a tabular set is generated for each purchase order set, each address set, each items set, and each item set.  

source column

For our PO example, each tabular set can be loaded individually into its own table.  Thus, we actually can add multiple SQL Server destinations, and split the data flow accordingly. When each source is connected to the destination, you must select which XML tabular set you would like to flow through to the destination as shown below.

Input Output

Keeping with our purchase order example, the below screen print shows two of the four data flows are loaded in SQL Server. Notice how the data flow path shows the XML output set on top of the blue line. Of course, you could do the same for the "red line" or error row data flow.

Split Data Flow

Our final step is to map source columns to the SQL Server destination table columns. Thus, double clicking on the destination, as shown below, requires selection of the data source and table name where the data will be placed.

SQL Destination

Next clicking on the Mapping property allows us to complete the column by column mapping as illustrated below. Depending on the insert you are completing you may have to adjust the settings on the Advanced property window to account for items such as identity inserts, trigger firings, and constraint checks.

Mapping

For the items set used in our example, 5 items loaded successfully as illustrated below.

Load Success

Load Query Success

Often when you load XML files you will run into two main issues. First, using the Generate XSD option creates an XSD file where column lengths are not specified for string based columns. As such, you will often see the below warning on the source screen when navigating the columns property window.

Length Warning

The above warning stems from having the XSD file define string based columns which do not have a minimum and maximum length noted for that particular element within the XSD. This issue can be alleviated by adding text similar to the below notation to your string columns. Notice how I commented out the original line and then added 8 new lines of XML to further define the column lengths.

XSD Length

The second issue that frequently occurs concerns the default data types defined in the XSD and again flows through to the source. The most common of these is that the XSD will define a numeric field as an unsigned integer while the SQL Server destination is expecting a signed integer. Thus the following error results.

XSD Integer

To correct this error we have two choices. We can either modify the XSD file to change the data type for the field from unsignedint to just int (which signifies a signed integer) as show below.

XSD Integer Update

As an alternative, we can also go to the XML Source Advanced Editor by right clicking on the XML Source task and selecting Show Advanced Editor. Next, as displayed in the subsequent image, we can navigate to the Input and Output Properties tab and then to the columns in question within the Output Columns lists. Finally, we can change the data type which is output from unsigned to signed integer. A third alternative is to use the data conversion transformation to convert the columns types.

Adv Edit

Conclusion

Loading data from a XML file is not nearly as easy as loading text files or flat files. However, well formatted XML documents can be loaded using the XML Source data flow. When completing the load and in addition to the XML file itself, the XML format and schema must also be defined in SSIS either by using an XSD file or via inline schema contained within the XML file itself. The XML Source task actually contains a button which can generate the XSD file if needed. XML files often contain multiple hierarchical structures or elements nested within other elements; in order to compensate for this setup, SSIS will actually output multiple tabular sets for these hierarchical structures. Thus a single XML file might be loaded into many different tables. Finally, some common errors often occur and center around data length and data type conflicts between the XSD file schema and the SQL Server table. These errors and warnings can often be corrected by either adjusting the XSD file or the Output Column properties in the XML Source.

Next Steps


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Scott Murray Scott Murray has a passion for crafting BI Solutions with SharePoint, SSAS, OLAP and SSRS.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips



Comments For This Article




Wednesday, December 12, 2018 - 5:43:33 PM - Dan Shargel Back To Top (78476)

 Thanks for the example. I used this today to import a large XML file for the first time into 5 tables.


Thursday, March 23, 2017 - 11:21:31 AM - Mike Back To Top (51570)

This works great for loading in one file but when I load in multiple files the id that is created starts over.  So in my case I have an InventoryTransaction and then a complex type of PriceInfo with PriceType, CountryCode, and Price.  Then it added an InventoryTransaction_id column to connect the two tables.  When I load files two and three the InventoryTransaction_id starts back over so I have 3 InventoryTransaction records with id of 2 and three PriceInfo records with InventoryTransaction_id of 2.  

What I would like to do is either have the InventoryTransaction_id not start over or be able to say in the XSD which column should be the key to use between the two tables.  There is a column in the InventoryTransaction called HistorySequence that would be the key because it is unique.


Tuesday, October 25, 2016 - 8:59:30 AM - Scott Back To Top (43634)

I would try using an XML parse to see if there is anything unusual about your XML format.


Tuesday, October 25, 2016 - 1:51:47 AM - Priyanka Srivastava Back To Top (43628)

 Hi Scott,

 

Thanks for the wonderful article explaining the XML load.

However, when I am trying to implement this in my SSIS package, I am getting an error.

The error is due to I am loading only 1 set out of all the available tabular set from the XML Source. Like in your example.

"Thus, a tabular set is generated for each purchase order set, each address set, each items set, and each item set.  "

And you choose to load only Item and Address and ignore rest.

But while doing so I am getting error message 

 TITLE: Package Validation Error

------------------------------

 

Package Validation Error

 

------------------------------

ADDITIONAL INFORMATION:

 

Error at Data Flow Task [SSIS.Pipeline]: "XML Source.Outputs[dataLevel]" contains no output columns. An asynchronous output must contain output columns.

 

Error at Data Flow Task [SSIS.Pipeline]: "XML Source.Outputs[characteristic]" contains no output columns. An asynchronous output must contain output columns.

 

Error at Data Flow Task [SSIS.Pipeline]: "XML Source.Outputs[applicant]" contains no output columns. An asynchronous output must contain output columns.

 

Error at Data Flow Task [SSIS.Pipeline]: "XML Source.Outputs[CreditCardsVSAReportReportData]" contains no output columns. An asynchronous output must contain output columns.

 

Error at Data Flow Task [SSIS.Pipeline]: "XML Source.Outputs[characteristicValue]" contains no output columns. An asynchronous output must contain output columns.

 

Error at Data Flow Task [SSIS.Pipeline]: One or more component failed validation.

 

Error at Data Flow Task: There were errors during task validation.

 

 (Microsoft.DataTransformationServices.VsIntegration)

 

------------------------------

BUTTONS:

 

OK

 

Thanks,

Priyanka


Monday, March 21, 2016 - 3:03:01 PM - Mr DataWrangler Back To Top (41014)

 Very interesting read. In future SSIS articles, I would consider including downloadable code (sample SSIS package)

Thanks, DW

 


Tuesday, February 2, 2016 - 6:52:44 AM - SQL Daddy Back To Top (40573)

 Hello Scott,

I have gone through your artcle . Its really great. I like the way You have explained all the things in your artilcle and Its really much helpful .

 

 

Thank You.

 


Tuesday, January 12, 2016 - 12:10:42 AM - Karthick Back To Top (40400)

Neatly put up. Thanks a lot for the XML notes.


Thursday, October 15, 2015 - 7:48:51 AM - Kunal Choudhary Back To Top (38901)

Hello Scott

can you help me in the following Scenario.

this is my req.

on monthly basis i receive a zip files containg Multiple Csv later which gets converted into Xml and then the data is loaded into MySql by some java utlities. I have to try this using SSIS. Is this is possible in ssis . i downloaded that zip file but when i am unziping it using winrar it is creating a folder containing two file. One file with same name as foldername with .sf extension and another again zip file when  iam trying to unzip that file it is giving the error as mgiht go curropted or not downloaded properly  and i also wanted a know wheather multiple xml files with same data structure can be loaded


Thursday, October 15, 2015 - 7:40:02 AM - Kunal Choudhary Back To Top (38900)

Thanks very Much Scott

This article helped me a lot....

 


Monday, February 9, 2015 - 11:31:01 PM - Noah Meyer Back To Top (36191)

This is extremely informative and helpful, Scott.  Thank you so much!


Wednesday, September 24, 2014 - 11:45:16 AM - Scott Murray Back To Top (34700)

Jay... take a look at the XML data types...http://technet.microsoft.com/en-us/library/bb510446(v=sql.105).aspx

 

 


Tuesday, September 23, 2014 - 8:32:52 PM - JaY Back To Top (34680)

HI Scott,

Very Informative article thank you for that, my concern is is there any way I can save whole XML file in one column ?? in a table . 

Please let me know

Thanks a lot in advance


Friday, September 19, 2014 - 8:28:17 AM - Jay Back To Top (34624)

I tried it the ForEachLoop by assining the list of XML files to an object variable and then passing that variable to theXMLSource.  But the XML Source kept complaining that the variable did not contain an xml file.

 

So I did this instead:

  1. Created a Sequence Container (A)
  2. Created a Script Task with the code below* and placed it in Sequence Container A
  3. Created a second Sequence Container (B)
  4. Placed a Data Flow Task in Sequence Container B
    1. Created an XML Source that points at a hard coded XML Location & Schema
    2. Created a Destination that the XML Source would send the data to
  5. This worked for me

 

*CODE IN SCRIPT TASK
//This code reads each xml file in a directory and appends the xml to a singls DataSet and then writes one combined xml file.
DataSet ds = new DataSet();
string xmlDir = Dts.Variables["XMLDataDirectory"].Value.ToString();
string xmlFile = Dts.Variables["XMLDataFile"].Value.ToString();
foreach (string fileName in Directory.GetFiles(xmlDir)){
if (fileName.Contains(".xml"))
ds.ReadXml(fileName);
}
if (File.Exists(xmlFile))
File.Delete(xmlFile);
ds.WriteXml(xmlFile, XmlWriteMode.IgnoreSchema);
Dts.TaskResult = (int)ScriptResults.Success;

Thursday, September 18, 2014 - 1:41:42 PM - scott Back To Top (34608)

You should be able to use a ForEachLoop container to achieve that process.


Thursday, September 18, 2014 - 1:25:27 PM - Jay Back To Top (34607)

Is there a way process a directory of xml files?

My XML files all have the same schema but each file contains a data dump from a specific US State.

I am able to import the data using the XMl Source and a hard coded xml file path & name but how can I make this a dynamic process?  Can it be done using SSIS controls or do I need to write a script in C#?


Tuesday, July 15, 2014 - 3:23:11 AM - Prashant Shelavadi Back To Top (32703)

 

Great job Easy to understand about XML Source 


Monday, May 12, 2014 - 12:21:08 PM - Matt Massey Back To Top (30737)

Thanks for the write up. This really helped me out!


Friday, May 9, 2014 - 4:40:07 PM - Scott Coleman Back To Top (30709)

I think you mean SSIS shredding does not always work.  SQL Server shredding is very reliable, in my experience.  Figuring out all the XPATH syntax may take a while though.

You're going to need a server with a lot of memory to deal with a 1.3GB XML file, no matter what language you're using.  You may want to split the XML file into smaller files for easier handling.

I took a look at the documentation at the URL you provided and found this sample:

<dblp>
  <article key="journals/jods/HurtadoPW08" mdate="2008-04-15">
    <author>Carlos A. Hurtado</author>
    <author>Alexandra Poulovassilis</author>
    <author>Peter T. Wood</author>
    <title>Query Relaxation in RDF.</title>
    <pages>31-61</pages>
    <year>2008</year>
    <volume>10</volume>
    <journal>J. Data Semantics</journal>
    <ee>http://dx.doi.org/10.1007/978-3-540-77688-8 2</ee>
    <crossref>journals/jods/2008-10</crossref>
    <url>db/journals/jods/jods10.html#HurtadoPW08</url>
  </article>
<dblp> 

Since there are multiple authors per article, you need two tables.  These queries should import article records, if my assumptions about field types and sizes are reasonable.

-- Read the XML file into a T-SQL variable
DECLARE @xml XML ;
SELECT @xml = BulkColumn FROM OPENROWSET(BULK '\\a\b\c.xml', SINGLE_BLOB) x ;

-- Query it with XML nodes() and value() methods
INSERT INTO dbo.articles
SELECT title = article.value('title[1]', 'varchar(100)'),
       articleKey = article.value('@key', 'varchar(100)'),
       mdate = article.value('@mdate', 'smalldatetime'),    -- Or 'date' on SQL 2008 or later
       journal = article.value('journal[1]', 'varchar(100)')
       journalYear = article.value('year[1]', 'smallint'),
       volume = article.value('volume[1]', 'varchar(50)'),
       number = article.value('number[1]', 'smallint'),
       pages = article.value('pages[1]', varchar(50)'),
       ee = article.value('ee[1]', 'varchar(500)'),
       url = article.value('url[1]', 'varchar(500)'),
       crossref = article.value('crossref[1]', 'varchar(100)')
FROM @xml.nodes('/dblp/article') x(article) ;

-- To link authors to articles, from the <author> node you use '../@key' to get the key attribute of the parent <article> node
INSERT INTO dbo.authors
SELECT articleKey = author.value('../@key', 'varchar(100)'),
       author = author.value('text()[1]', 'varchar(100)')
FROM @xml.nodes('/dblp/article/author') x(author) ;

Since you have have to build XPATH queries that fit the structure, it does not need any XSD or DTD information.


Tuesday, March 11, 2014 - 7:25:01 AM - Scott Murray Back To Top (29706)

Sounds like you may need use a pro software package to shred the XML first into flat files. The SQL Server shredding is rebost, but does not always work. I also am aware of times when you reaching the memory limits of your local machine.


Tuesday, March 11, 2014 - 5:33:28 AM - Aurangzeb Muzammil Back To Top (29702)

Hi and thanks for response,

DBLP is a database for bibliographic data of computer science only, e.g. research papers, conferences and etc. Its xml file is provided on the web which is of 1.3 GB size now and expanding day by day. Its DTD and XSD is also given on dblp web. I used your prescribed method on this current page to load that xml to my local DB but unfortunately it did not work. I am not much fimiliar with xml. It warned for upper limits not specified for each column. I also tried to generate the xsd form xml file but it gave a message like "dtd is prohibited" and there were errors for pipeline as wll.

I need to parse and save the data of that dblp into my local db for my semester project of data wherehouse and apply some model on that as our teachers instrucuted us to do..

link for dblp is as under;

http://dblp.uni-trier.de/

Regards...


Monday, March 10, 2014 - 12:21:28 PM - scott murray Back To Top (29690)

Sorry I am not familar with the DBLP XML.  Sounds like you need to get the schema files to setup the load.


Sunday, March 9, 2014 - 4:06:42 PM - Aurangzeb Muzammil Back To Top (29685)

hi,

Ausome work... 

I am CS scholar and need your help to import DBLP's xml file into sql server maintaining the primary and foreign key relation ship..

it would be a great help for me.. 

thanks...















get free sql tips
agree to terms