Importing XML documents using SQL Server Integration Services
What is the process for importing an XML document into SQL Server via SSIS? What are some helpful hints for loading the data effectively and efficiently?
SSIS is a robust and versatile ETL tool that can handle most any type of load process that it is tasked with completing. XML files are no exception to this statement as Integration Services can load various XML files at various levels of complexity.
Before we dive into the load process, we need to cover a few basics about XML files. First, an XML file is an extensible Markup Language file; XML itself is a Markup Language which shares some similarities to HTML. Just like HTML, XML and XML files use tags; in XML's case, these tags are used to define, describe, structure, and nest data. Furthermore, and contrary to HTML, each tag is not predefined, but is established by the users who generates the files. Also, XML files are all about the information and data within a file and not about how to display the data or information. Thus, an XML file is often self defining although special files called XML Schema Definition files (XSD) can also be used to validate, format, and describe a particular XML file. Of course, XSD's are coded in XML.
Import XML File
The SSIS import process starts with creating a new SSIS solution and project in SQL Server Data Tools (SSDT formerly known as BIDS or Business Intelligence Design Studio). Be sure to name the new solution and project appropriately as shown below. Also, rename the package to a descriptive name.
Next we can drag and drop a new data flow task onto the design area and again rename the new data flow to a meaningful name, as illustrated below.
Next, double click on the data flow task to define the details of the data flow. From the SSIS toolbox, Other Sources category, drag the XML Source onto the design grid and rename as needed.
Double click on the XML Source to browse to the location of the XML file as shown below. This screen is where some of the decision making must take place and also where many problems surface. First the Data access mode is selected; for this example use an actual file location, but XML File from variable or XML data from variable can also be selected. Thus, you can browse to the location of the XML file.
Finally, the last three items, Use inline schema, XSD location, and Generate XSD make a significant difference in what happens during the XML processing (commonly known as shredding). The Use Inline schema option tells SSIS that the schema definition actually resides in the XML file itself. In essence the XSD file is embedded in the XML file. If this option is selected than the XSD location and Generate XSD options are grayed out. Alternately, you can browse and select the location of a separate XSD file; this file will outline in detail the tags within the XML file. Generally if you get an XML file from an alternate source, the XSD should already be embedded inline within the XML file or the organization providing the XML should provide you with the appropriate XSD file.
In the below example, I am using sample "purchase order" and "customers and orders" XML files which can be downloaded from the MSDN site: http://msdn.microsoft.com/en-us/library/bb387034.aspx and http://msdn.microsoft.com/en-us/library/bb387025.aspx respectively. I also downloaded related XSD files at: http://msdn.microsoft.com/en-us/library/dd489284(v=vs.110).aspx and http://msdn.microsoft.com/en-us/library/bb675181.aspx. Both of these XML files are moderately complex with several layers (or nested) values. Unfortunately, as an XML file's hierarchy structure and complexity increases, often times the supplied XSD file do not parse and load the data correctly. Often the load will occur without error, however no data is transferred. Thus, the Generate XSD button will need to be used to generate a new XSD file to be used for importing. The Generate XSD process works well most often, but you may need to adjust it for special cases within your XML file.
Unfortunately, in our example using the downloaded XSD file for the purchase order XML file did not load any data. However, utilizing the Generate XSD, will load, which will be detailed next, was successful. The generate XSD process actually creates a new XSD file which must then be browsed to and selected in the XSD location field.
Once the files are selected, the next step is to review the potential columns to load. For XML files which contain hierarchies and layers of elements within other elements, a separate tabular set of data is generated for each item falling under the higher element. In the Purchase Order XML file example, the purchase order set also contains shipping addresses, item, and items elements. Thus, a tabular set is generated for each purchase order set, each address set, each items set, and each item set.
For our PO example, each tabular set can be loaded individually into its own table. Thus, we actually can add multiple SQL Server destinations, and split the data flow accordingly. When each source is connected to the destination, you must select which XML tabular set you would like to flow through to the destination as shown below.
Keeping with our purchase order example, the below screen print shows two of the four data flows are loaded in SQL Server. Notice how the data flow path shows the XML output set on top of the blue line. Of course, you could do the same for the "red line" or error row data flow.
Our final step is to map source columns to the SQL Server destination table columns. Thus, double clicking on the destination, as shown below, requires selection of the data source and table name where the data will be placed.
Next clicking on the Mapping property allows us to complete the column by column mapping as illustrated below. Depending on the insert you are completing you may have to adjust the settings on the Advanced property window to account for items such as identity inserts, trigger firings, and constraint checks.
For the items set used in our example, 5 items loaded successfully as illustrated below.
Often when you load XML files you will run into two main issues. First, using the Generate XSD option creates an XSD file where column lengths are not specified for string based columns. As such, you will often see the below warning on the source screen when navigating the columns property window.
The above warning stems from having the XSD file define string based columns which do not have a minimum and maximum length noted for that particular element within the XSD. This issue can be alleviated by adding text similar to the below notation to your string columns. Notice how I commented out the original line and then added 8 new lines of XML to further define the column lengths.
The second issue that frequently occurs concerns the default data types defined in the XSD and again flows through to the source. The most common of these is that the XSD will define a numeric field as an unsigned integer while the SQL Server destination is expecting a signed integer. Thus the following error results.
To correct this error we have two choices. We can either modify the XSD file to change the data type for the field from unsignedint to just int (which signifies a signed integer) as show below.
As an alternative, we can also go to the XML Source Advanced Editor by right clicking on the XML Source task and selecting Show Advanced Editor. Next, as displayed in the subsequent image, we can navigate to the Input and Output Properties tab and then to the columns in question within the Output Columns lists. Finally, we can change the data type which is output from unsigned to signed integer. A third alternative is to use the data conversion transformation to convert the columns types.
Loading data from a XML file is not nearly as easy as loading text files or flat files. However, well formatted XML documents can be loaded using the XML Source data flow. When completing the load and in addition to the XML file itself, the XML format and schema must also be defined in SSIS either by using an XSD file or via inline schema contained within the XML file itself. The XML Source task actually contains a button which can generate the XSD file if needed. XML files often contain multiple hierarchical structures or elements nested within other elements; in order to compensate for this setup, SSIS will actually output multiple tabular sets for these hierarchical structures. Thus a single XML file might be loaded into many different tables. Finally, some common errors often occur and center around data length and data type conflicts between the XSD file schema and the SQL Server table. These errors and warnings can often be corrected by either adjusting the XSD file or the Output Column properties in the XML Source.
- Bulk Import / Export XML - http://technet.microsoft.com/en-us/library/ms191184.aspx
- Use Management Studio to load XML - http://www.mssqltips.com/sqlservertip/2899/importing-and-processing-data-from-xml-files-into-sql-server-tables/
- Using XQuery in Management Studio - http://www.mssqltips.com/sqlservertip/2738/examples-of-using-xquery-to-update-xml-data-in-sql-server/
About the author
View all my tips