Using T-SQL to Fix Bad Data in XML Documents Using FLWOR

By: Laeeq Hamid | Comments | Related: > XML

Problem

I have an XML document with some incorrect data in it. Is there a way of iterating through the nodes of XML document using T-SQL and correcting the data before processing the contents?

Solution

FLWOR allows us to iterate through the nodes of an XML document. We can use this feature to construct or reconstruct an XML document. Generally, there are two issues with the contents of data 1) a missing piece of information or 2) incorrect information. In this article we will go through both of these issues and reconstruct an XML document to a level which can be acceptable to the business.

FLWOR Statements

FLWOR stands for FOR, LET, WHERE, ORDER BY and RETURN. It is a standard method defined by the W3 Consortium to read XML documents which is sort of equivalent to SELECT <ColumnName> FROM <tablName> WHERE <Criteria> ORDER BY <ColumnName> queries and it is supported by different relational database management systems including SQL Server.

We will use FLWOR with SQL Server XML functions query, value and node.

The query function takes an XPath expression as a parameter and returns a fragment of XML or the whole XML document as a single fragment and it also allows us to construct XML documents.
The value function returns the value of a particular element or attribute at a specific location. It takes two parameters, an XPath expression and the data type of value returned.
The node function shreds or transforms XML nodes to a tabular structure. It takes XPath expression as a single parameter and returns a number of rows. The number of rows returned by the node function is equivalent to the number of times the path is provided to the node function as a parameter repeats in an XML document. It requires aliasing the output to a table and column name. We combine the output from the node and value functions to read data from a specific node of XML.

Now, let’s see an example.

Identifying the Problem

The XML document we will use as an example has a list of products and sale amounts for the first quarter of a year (amount of sales in January, February and March).

Below is a piece of code for the declaration of this XML document and its transformation to a tabular structure without using FLOWR.

The original XML document used as an example

The first section of the code below is a declaration of an XML document.

The second section is a combination of nodes and value functions to display a list of products. For the nodes functions is uses a path to the product nodes as ‘/Products/Product’ and we’ll be able to extract 5 rows, one for each product. The output of nodes is stored in table-column alias as Products(Product). The first column in the result set is ProductID, which is an attribute and it is read using the value function with path ‘/@ID[1]’ . The / operator will establish a path from the root node and @ID[1] is for reading the first occurrence of the attribute ID and then the outcome is converted to varchar(255). The second column is product name which is an element node and its value is read using path ‘Product[1]’ and it is converted to varchar(255).

The third section of this code is similar to the second, but sales details are added against each product which is achieved by cross applying the Product node to its SaleHistory node. The two other columns in this section are sales in January and total sales in the quarter which is January to March. Sales in January is read using XPath expression ‘Month[@MonthName="January"][1]’ which means that SQL Server will read the value of the first occurrence of the element node Month, and match it against attribute ‘MonthName’ January. Using the second parameter of the value function, the output of sale in January is converted to VARCHAR(255). The other column is a sum of sale amounts from January, February and March, which is calculated using the sum function and is converted to FLOAT, because this data type is more suitable for data with mathematical operations.

DECLARE @x XML =
'<Products>
   <Product ID="1">
      <ProductName>Yo-Yo</ProductName>
      <Price>2.50</Price>
      <SaleHistory>
         <Month MonthName="January">250</Month>
         <Month MonthName="February">50</Month>
         <Month MonthName="March">175</Month>
      </SaleHistory>
   </Product>
   <Product ID="2">
      <ProductName>Rubik''s Cube</ProductName>
      <Price>5.00</Price>
      <SaleHistory>
         <Month MonthName="January">35</Month>
         <Month MonthName="February">38</Month>
         <Month MonthName="March">42</Month>
      </SaleHistory>
   </Product>
   <Product ID="3">
      <ProductName>VR Glasses</ProductName>
      <Price>100.00</Price>
   </Product>
   <Product ID="4">
      <ProductName>Power Bank</ProductName>
      <Price>15.00</Price>
      <SaleHistory>
         <Month MonthName="January">20</Month>
         <Month MonthName="February">-25</Month>
         <Month MonthName="March">5</Month>
      </SaleHistory>
   </Product>
   <Product ID="5">
      <ProductName>Abacus</ProductName>
      <Price>15.00</Price>
      <SaleHistory>
         <Month MonthName="January">115</Month>
         <Month MonthName="February">125</Month>
         <Month MonthName="March">90</Month>
      </SaleHistory>
   </Product>
</Products>'

SELECT 
   Product.value('@ID[1]','VARCHAR(255)')ProductID
  ,Product.value('ProductName[1]','VARCHAR(255)')ProductName
FROM @x.nodes('/Products/Product') Products(Product)

SELECT 
   Product.value('@ID[1]','VARCHAR(255)')ProductID
  ,Product.value('ProductName[1]','VARCHAR(255)')ProductName
  ,Sale.value('Month[@MonthName="January"][1]','VARCHAR(255)')SaleInJanuary
  ,Sale.value('sum(Month)','FLOAT')[Sale(Jan, Feb & Mar)]
FROM @x.nodes('/Products/Product') Products(Product)
CROSS APPLY Product.nodes('./SaleHistory') SaleHistory(Sale)

By running this code, we will see that there are two problems in our XML document:

There are 5 products in total, but when we add sales details, ProductID 3 which is VR Glasses does not show up in the result. This is due to a missing SaleHistory node for VR Glasses.
The total sales of Power Bank in January is 20, but combined sales of January, February and March is 0. This is logically impossible and it’s due to bad data sitting in the SaleHistory which makes the quarterly sales of Power Bank equal 0.

VR Glasses are not appearing in the sale history, because when we CROSS APPLY the Product node to the SaleHistory node we get an empty dataset, due to a missing sales history node for VR Glasses. And a CROSS APPLY of an empty dataset to any dataset results in an empty dataset. The fact is, we cannot avoid this CROSS APPLY situation for reading XML data from the SaleHistory node, because it is inside the Product node, which is similar to a one-many relationship in a relational data model.
The sale of Power Bank in January is 20 and in March it is 5, but in our XML document the sale in February is -25, which cannot be correct and it brings down the total sales of Power Bank in the first quarter of the year to 0.

XML Reconstruction

In a production environment we cannot always control the quality of XML documents, especially when exchanging data with external clients, but we can make ourselves prepare for it.

The code below is based on FLWOR statements which has FOR and RETURN clauses in it. These two clauses are mandatory. The other three clauses LET, WHERE and ORDER BY are optional.

Since this is an XPath based syntax that’s why SQL Server enforces it to be case sensitive. Below is a code for reconstructing XML and removing the two data issues.

SET @x = @x.query('<Products> {for $i in /Products/Product                              
         return                                          
            if(empty($i/SaleHistory)) then (<Product ID= "{($i/@ID)}">  
                    {$i/ProductName}{$i/Price}<SaleHistory/></Product>) 

            else( <Product ID="{($i/@ID)}">
                    {$i/ProductName}{$i/Price}
                    <SaleHistory>{ 
                    for $j in $i/SaleHistory/Month
                    return <Month MonthName="{$j/@MonthName}">
                    {if (data($j)>=0) then data($j) else(0)}
                    </Month>}
                    </SaleHistory></Product>            
            )} </Products>')

When an XML document is exchanged, the knowledge of its schema and structure is already known. We will use this prior knowledge to reconstruct the XML document by iterating through its nodes and correcting errors. The ‘For’ statement allows us to iterate through nodes, in our case we have two iterations. The first one is to iterate through product nodes using ‘for $i in /Products/Product’ and then in each product node we’ll iterate through the SaleHistory nodes using ‘for $j in $i/SaleHistory/Month’. The empty function is used as ‘if(empty($i/SaleHistory))’, which checks whether a SaleHistory node is available for a product or not. If it is not available, a node is created with no data in it, otherwise the original node is returned. And then the sale amount data is checked by using the data function as ‘if (data($j)>=0)’, if the value is greater or equal to 0 then the original value is returned. Otherwise, the value is overwritten to 0. The rest of the code around the for and return statements is the XML structure. The output of this code is an error free document.

At this stage our XML document is fine. If we run the same SQL query, which returned the unexpected results in the first place will return realistic results this time.

-- Executing the same query after fixing XML document           
SELECT 
    Product.value('@ID[1]', 'VARCHAR(255)') ProductID
   ,Product.value('ProductName[1]','VARCHAR(255)') ProductName
   ,Sale.value('Month[@MonthName="January"][1]','VARCHAR(255)') SaleInJanuary
   ,Sale.value('sum(Month)','FLOAT') [Sale(Jan, Feb & Mar)]
FROM @x.nodes('/Products/Product') Products(Product)
CROSS APPLY Product.nodes('./SaleHistory') SaleHistory(Sale)

We can see that VR Glasses now appears in the sales details with a NULL which makes sense, because there is no sale record for it. And the total sales of Power Bank is now a realistic number, because the negative sale value was set to 0.

Details of the sales after fixing the data issues of our XML document

Finally, we can use let and where clauses of FLWOR to filter out the results we want. Let is similar to assigning a value to a variable. As mentioned in the final section of the code, Product ID is stored in $j variable, which is used in the where clause as ‘where $j = (1,2,3,4,5) to include all 5 products. And in the last line of this code the SQL ORDER BY clause is used to sort results in ascending order for product names. If we wanted to control the sort order of the product nodes during XML reconstruction, we could use the FLOWR order by clause as:

 for $i in /Products/Product                   
   let $k:= $i/ProductName
   order by $k[1] ascending
   return
   . . .

In FLOWR, the order by clause default sort order is ascending. We can specify either of the keywords ascending or descending e.g. ascending keyword is used for ascending sort order and descending keyword is used for descending order, which is similar to ASC and DESC keywords in TSQL.

Final Set of Code

Now, we are at the final stage of our code. The extracted product nodes are cross applied to its SalesHistory node by using a table-column alias of product nodes which is Products.Product.nodes(’./SaleHistory’). This will allow us to read the sale history in the SELECT section of the SQL query, and it is wrapped around the ISNULL function to replace any NULL values with N/A as we don’t have sales history for that product.

DECLARE @x XML =
'<Products>
   <Product ID="1">
      <ProductName>Yo-Yo</ProductName>
      <Price>2.50</Price>
      <SaleHistory>
         <Month MonthName="January">250</Month>
         <Month MonthName="February">50</Month>
         <Month MonthName="March">175</Month>
      </SaleHistory>
   </Product>
   <Product ID="2">
      <ProductName>Rubik''s Cube</ProductName>
      <Price>5.00</Price>
      <SaleHistory>
         <Month MonthName="January">35</Month>
         <Month MonthName="February">38</Month>
         <Month MonthName="March">42</Month>
      </SaleHistory>
   </Product>
   <Product ID="3">
      <ProductName>VR Glasses</ProductName>
      <Price>100.00</Price>
   </Product>
   <Product ID="4">
      <ProductName>Power Bank</ProductName>
      <Price>15.00</Price>
      <SaleHistory>
         <Month MonthName="January">20</Month>
         <Month MonthName="February">-25</Month>
         <Month MonthName="March">5</Month>
      </SaleHistory>
   </Product>
   <Product ID="5">
      <ProductName>Abacus</ProductName>
      <Price>15.00</Price>
      <SaleHistory>
         <Month MonthName="January">115</Month>
         <Month MonthName="February">125</Month>
         <Month MonthName="March">90</Month>
      </SaleHistory>
   </Product>
</Products>'

SET @x = @x.query('<Products> {for $i in /Products/Product                              
         return                                          
            if(empty($i/SaleHistory)) then (<Product ID= "{($i/@ID)}">  
                    {$i/ProductName}{$i/Price}<SaleHistory/></Product>) 

            else( <Product ID="{($i/@ID)}">
                    {$i/ProductName}{$i/Price}
                    <SaleHistory>{ 
                    for $j in $i/SaleHistory/Month
                    return <Month MonthName="{$j/@MonthName}">
                    {if (data($j)>=0) then data($j) else(0)}
                    </Month>}
                    </SaleHistory></Product>            
            )} </Products>')
	
SELECT 
   Product.value('ProductName[1]','VARCHAR(255)') ProductName
  ,ISNULL(Sale.value('Month[@MonthName="January"][1]','VARCHAR(255)'),'N/A') [Sale in Jan]
  ,ISNULL(Sale.value('Month[@MonthName="February"][1]','VARCHAR(255)'),'N/A') [Sale in Feb]
  ,ISNULL(Sale.value('Month[@MonthName="March"][1]','VARCHAR(255)'),'N/A') [Sale in Mar]
 ,Sale.value('sum(Month)','FLOAT') [Sale (Jan, Feb & Mar)]
FROM @x.nodes('for $i in /Products/Product
            let $j:= $i/@ID
            where $j = (1,2,3,4,5)
            return $i'
            ) Products(Product)
CROSS APPLY Product.nodes('./SaleHistory') SaleHistory(Sale)
ORDER BY ProductName

Here is the final query results.

Next Steps

Check out these related tips:

About the author

Laeeq Hamid has more than 10 years of experience working with SQL Server databases. Laeeq gets involved in database application design, monitoring, development, data migration, performance tuning and ETL.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips