Data Extraction from XML to JSON in SQL Server 2025

Problem

XML is a type of file format often used for standardizing data by creating a structured tag-defined hierarchical output. In the life of the data professional, XML is a frequent site, for example, when dealing with legacy systems or industrial data. How can we extract the contents of an XML data file to store in a database table? What if we also wanted to model and store the XML data in native JSON format? Let’s see how we can convert XML to JSON.

Solution

To read and transform the contents of an XML file, we can write a Python script. Having the data in memory, we can convert it in a straightforward way to JSON as well. Finally, using the newly available native JSON datatype, we can store the output as JSON in a table in SQL Server 2025.

Scenario

For this scenario, I have decided to use a synthetic XML file representing the output from a software used for vehicle technical inspections. In such a scenario, the generated XML output can easily grow to multiple, sometimes nested, tags and hundreds of rows. To fully leverage the data from multiple tests over a certain period, we need to apply automated transformation logic on the data. In this case, both occur: the data gets to a table and is converted to JSON.

Here is a preview of one of the sample files:

The information we are interested in is structured in the following way, under a root tag <inspectionResults>:

<vehicle> tag containing VIN number and test metadata, such as start and end time.
A <test> tag containing:
- <system> tag for each vehicle system being inspected, such as brakes, exhaust, lights, etc.
- <systemtest> tag for each test for the respective system. Examples could include the braking force of the wheels for a brake test or CO emission content for an exhaust emissions test.
- <measurements> tag containing a list of measurements for each system test.
- <measurement> tag containing name, value, and unit of the measurement.

To analyze these data further, especially when multiple files are in question, we should structure the data in a tabular form.

Environment Setup

Create a project folder and open it in VS Code. Then, create a requirements.txt file containing four lines: attrs, pandas, pyodbc, SQLAlchemy. Next, hit Ctrl+Shift+P and choose Python: Create Environment.

Follow the prompts for creating a local virtual environment. Make sure to check requirements.txt so the environment agent will install the required Python packages directly:

To complete the namespace setup, create two subfolders:

files: containing source XML files.
program: containing the Python packages we are about to write.

Here is how the namespace should look at this point:

Once you have your environment ready, please proceed to the next steps where we set up the complete solution.

XML File Processing

The first step is to create a script to process the source XML files and return a Pandas dataframe:

--MSSQLTips.com Python
01: import os
02: import xml.etree.ElementTree as ET
03: import pandas as pd
04: import logging
05: from attrs import define
06: 
07: logger = logging.getLogger(__name__)
08: logging.basicConfig(level=logging.INFO,
09:                     format="%(asctime)s [%(levelname)s] %(message)s",
10:                     handlers=[
11:                         logging.FileHandler("debug.log"),
12:                         logging.StreamHandler()
13:                     ])
14: 
15: 
16: @define
17: class XMLProcessor:
18: 
19:     @staticmethod
20:     def process_xml(path: str) -> pd.DataFrame:
21:         cols = ['VIN', 'System', 'SystemTest', 'SystemMeasurement', 'MeasurementName',
22:                 'MeasurementValue', 'MeasurementUnit', 'FullResult']
23: 
24:         rows = []
25: 
26:         arr = os.listdir(path)
27:         files = [f for f in arr
28:                  if f.endswith('xml')]
29: 
30:         for xml_file in files:
31:             logger.info(f'Processing {xml_file}')
32:             tree = ET.parse(fr'{path}\{xml_file}')
33:             root = tree.getroot()
34: 
35:             ser_num = root[1].attrib['serialNumber']
36: 
37:             for test in root[2].iter():
38:                 try:
39:                     for system in test.findall('system'):
40:                         for systemtest in system.findall('systemtest'):
41:                             for measurements in systemtest.findall('measurements'):
42:                                 for measurement in measurements.findall('measurement'):
43:                                     system_name = system.attrib['des']
44:                                     system_test_name = systemtest.attrib['name']
45:                                     meas_name = measurements.attrib['name']
46:                                     meas = measurement.attrib['name']
47:                                     meas_val = measurement.attrib['value']
48:                                     meas_unit = measurement.attrib['units']
49: 
50:                                     full_result_data_json = {
51:                                         'VIN': ser_num,
52:                                         'data': [
53:                                             {
54:                                                 'System': system_name,
55:                                                 'SystemTest': system_test_name,
56:                                                 'SystemMeasurement': meas_name,
57:                                                 'MeasurementName': meas,
58:                                                 'MeasurementValue': meas_val,
59:                                                 'MeasurementUnit': meas_unit
60:                                             }
61:                                         ]
62:                                     }
63: 
64:                                     rows.append({
65:                                         'VIN': ser_num,
66:                                         'System': system_name,
67:                                         'SystemTest': system_test_name,
68:                                         'SystemMeasurement': meas_name,
69:                                         'MeasurementName': meas,
70:                                         'MeasurementValue': meas_val,
71:                                         'MeasurementUnit': meas_unit,
72:                                         'FullResult': full_result_data_json
73:                                     })
74:                 except Exception as e:
75:                     logger.error(f'Error {e}')
76: 
77:             logger.info(f'Finished processing {xml_file}')
78: 
79:             df = pd.DataFrame(rows, columns=cols)
80:             return df

Let us break it down line by line as usual:

1 – 5: Import the necessary modules. Importantly, we use xml.etree.ElemenTree to parse out the XML files.
7 – 13: Create an instance of logging that will output the important script events to a file debug.log.
17: Define a class XMLProcessor.
20: Define a method that does not require class instantiation and accepts a single argument for a path containing the files.
21: Define the list of columns we want to end up with in the resulting dataframe.
24: Empty list data structure that we will append the resulting rows to.
26: Get a list of all files and directories from the source path.
27: Store all files with the .xml extension in a list. In other use cases, we can adapt the code to connect to a cloud service such as Azure blob storage or AWS S3 to get the files from there.
30: Create an outermost for-loop to go over each XML file.
32: Parse the XML into a tree structure.
33, 35: Get the root element of the tree and acquire the car VIN number.
37 – 73: Since we already know the expected tree structure, we create a nested for-loop for extracting the data behind each tag we are interested in. Then, in the innermost loop:
- 50 – 62: Create a dictionary that will become the JSON object stored natively later in the database.
- 64 – 73: To the previously initialized rows list, we add the current record.

74: Catch any exception and log it.
79, 80: Turn the rows list into a dataframe and return it.

The script should be in the /program subfolder.

Database Table Model

The next step is to create a database table model in our code. This is not a required step; you could create the table manually in advance or call .to_sql on the dataframe object. There are many benefits of using a predefined model, as described in the following tip: Python Modules for Developing Data Engineering Workloads.

Let’s create a model:

--MSSQLTips.com Python
01: from sqlalchemy import (
02:     MetaData,
03:     Table,
04:     Column,
05:     Integer
06:     )
07: from sqlalchemy.dialects.mssql import VARCHAR, JSON
08: from attrs import define, field
09: 
10: 
11: @define
12: class VehicleInspectionResultsModel():
13:     metadata: MetaData = field(factory=MetaData)
14:     schema_name: str = field(default='dbo')
15:     vehcile_inspection_results: Table = field(init=False)
16: 
17:     def __attrs_post_init__(self):
18:         """Initialize the tables with the given schema."""
19:         self.vehcile_inspection_results = Table(
20:             'VehicleInspectionResults',
21:             self.metadata,
22:             Column('Id', Integer, primary_key=True, autoincrement=True),
23:             Column('VIN', VARCHAR(17), primary_key=True, nullable=False),
24:             Column('System', VARCHAR(128), nullable=False),
25:             Column('SystemTest', VARCHAR(128), nullable=False),
26:             Column('SystemMeasurement', VARCHAR(128), nullable=False),
27:             Column('MeasurementName', VARCHAR(128), nullable=False),
28:             Column('MeasurementValue', VARCHAR(128), nullable=False),
29:             Column('MeasurementUnit', VARCHAR(32), nullable=False),
30:             Column('FullResult', JSON, nullable=True),
31:             schema=self.schema_name
32:         )

Here is the breakdown:

01 – 06: Import the necessary SQLAlchemy datatypes.
07: Import the specific SQL data types we need.
12 – 31: Define a class called VehicleInspectionResultsModel that will contain representations of database tables as attributes. It has the following attributes:
- Metadata for database metadata.
- Schema name.
- A table, in this case, only one. In more complex scenarios, our model class could contain multiple tables.

The class has a method initializing the single table. The initialization follows the SQLAlchemy Table datatype. The number of columns and their datatype must correspond to the source data we have previously prepared as a dataframe.

JSON Data Type Caveat

SQL Server 2025 is still in public preview; therefore, some of the newest features may not work out of the box. If we execute the script to create the model now, the FullResult column will be created as NVARCHAR(MAX) instead of a native JSON. To fix this, we can create the following simple stored procedure that we will call after the model has been created:

--MSSQLTips.com T-SQL
CREATE PROCEDURE [dbo].[ConvertColumnToJSON]
    AS
    BEGIN
        ALTER TABLE dbo.VehicleInspectionResults 
        ALTER COLUMN FullResult JSON NULL
    END
GO

Database Connection

For this scenario, I am using SQL Server 2025 for the final data destination. This brand-new release is currently in public preview. You can download it from the Microsoft website. After installing, let’s check what the newest version number is, then create the target database:

check db engine version and create database

As a result, we see the current public preview version is 17.0.700.9, and we have a new database. Let us create the script that will connect us to that database:

--MSSQLTips.com Python
01: """
02: MSSQLTips.com General database connection configuration class
03: """
04: import os
05: from attrs import define, field, validators
06: from sqlalchemy.engine import Engine, create_engine, URL
07: 
08: 
09: @define
10: class RDSConfig:
11:     DB_SERVER: str = field(
12:         factory=lambda: os.getenv("DB_SERVER", ".\\MSSQLSERVER02"),
13:         validator=validators.instance_of(str)
14:     )
15:     DB_DATABASE: str = field(
16:         factory=lambda: os.getenv("DB_DATABASE", "MSSQLTips2025"),
17:         validator=validators.instance_of(str)
18:     )
19:     DB_USERNAME: str = field(
20:         factory=lambda: os.getenv("DB_USERNAME", ""),
21:         validator=validators.instance_of(str)
22:     )
23:     DB_PASSWORD: str = field(
24:         factory=lambda: os.getenv("DB_PASSWORD", ""),
25:         validator=validators.instance_of(str)
26:     )
27:     # DB_PORT: int = field(
28:     #     factory=lambda: int(os.getenv("DB_PORT", " ")),
29:     #     converter=int,
30:     #     validator=validators.instance_of(int)
31:     # )
32:     DB_DRIVER: str = field(
33:         factory=lambda: os.getenv("DB_DRIVER", "mssql+pyodbc"),
34:         validator=validators.instance_of(str)
35:     )
36: 
37:     @property
38:     def url(self) -> URL:
39:         return URL.create(
40:             drivername=self.DB_DRIVER,
41:             host=self.DB_SERVER,
42:             database=self.DB_DATABASE,
43:             username=self.DB_USERNAME,
44:             password=self.DB_PASSWORD,
45:             # port=self.DB_PORT,
46:             query=dict(driver='ODBC Driver 18 for SQL Server',
47:                        Trusted_Connection='True',
48:                        TrustServerCertificate='yes')
49:         )
50: 
51:     def get_engine(self) -> Engine:
52:         return create_engine(self.url)

Using attrs and SQLAlchemy, we create a reusable database connection class. The class can be further abstracted to be more general and applicable to different engines.

Lines 11 to 35 Define the attributes representing the mandatory connection properties applicable to any database.
37 – 49 Using the URL method, we create a URL for connecting to the database using the mandatory connection properties. Using this approach on SQL Server 2025 on localhost, we must pass a dictionary with some extra connection properties, such as the driver, TrustedConnection, and TrustServerCertificate.
51: Create a function to return the connection URL.

Putting It All Together

Having all the modules, now it is time to see how everything comes together. Create a new module:

--MSSQLTips.com Python
01: from file_processor import XMLProcessor
02: from rds import RDSConfig
03: from models import VehicleInspectionResultsModel
04: 
05: # connect
06: db_config = RDSConfig()
07: engine = db_config.get_engine()
08: conn = engine.connect()
09: 
10: # get data
11: data_df = XMLProcessor.process_xml(r'./files')
12: data_dict = data_df.to_dict(orient='records')
13: print(data_df)
14: print(data_dict)
15: 
16: # create model
17: data_model = VehicleInspectionResultsModel(schema_name='dbo')
18: data_model.metadata.create_all(engine)
19: 
20: conn.exec_driver_sql('ConvertColumnToJSON')
21: conn.commit()
22: 
23: conn.execute(data_model.vehcile_inspection_results.insert(), data_dict)
24: conn.commit()
25: 
26: conn.close()

Here is how the solution works:

01 – 03: Import our custom modules:
- For extracting xml data (1),
- Connecting to database (2), and
- Creating a model (3).

06 – 08: Instantiate an RDSCOnfig object, get the engine, and connect using SQLAlchemy connect().
11 – 14: Process all files in the files folder.
- 12: Convert the dataframe to a dictionary so it can be used as an execute parameter.
17, 18: Instantiate the data model and create it using the database engine from line 7 following the typical SQLAlchemy flow.
20, 21: Execute the previously defined stored procedure for turning the FullResult column to native JSON.
23, 24: Using the model from line 17, call the insert method using all the data we acquired from line 12 and the XMLProcessor class.
24, 26: Commit and close the connection.

Running this script will populate the target table with the XML data. Note the warning message telling us the ODBC driver did not recognize the installed SQL Server version due to the version not being officially released yet:

On the left-hand side, you can also see what the final solution namespace looks like with all custom modules:

Root
- /.venv
- /files
- /program
- debug.log
- requirements.txt

Checking the Results

Finally, let’s head over to the database and review the results.

Here is the target table with a native JSON column:

The data from the single XML file:

We should use the IS_JSON function to check if we have valid JSON objects everywhere. The function returns 1 if that is true:

DECLARE @count_all INT= (SELECT Count(*)
   FROM   VEHICLEINSPECTIONRESULTS)
DECLARE @count_json INT= (SELECT Sum(is_json) AS total_json_count
   FROM   (SELECT ISJSON(fullresult) AS IS_JSON
           FROM   VEHICLEINSPECTIONRESULTS) AS json_count)
SELECT @count_all  count_all_records
       , @count_json count_json_records

Looks like that is the case:

check row count and row count with valid json

Finally, let us make a combo query taking the VIN of the car and inspection data as pure JSON:

SELECT Id
    , VIN
    , JSON_QUERY(FullResult, '$.data[*]' WITH ARRAY WRAPPER) AS InspectionResultJSON
FROM VehicleInspectionResults

While the JSON_QUERY function has been available since SQL Server 2016, the native JSON data type is new to version 17 or SQL Server 2025. Let us try another query returning only the specific system measurements. The convenience comes from the fact that we can use the JSON key simply as a property in the query:

SELECT Id
    , VIN
    , JSON_QUERY(FullResult, '$.data[*].SystemMeasurement' WITH ARRAY WRAPPER) AS InspectionResultJSON
FROM VehicleInspectionResults

Conclusion

Using a modular data engineering approach, we created separate modules for extracting, transforming, and modeling source data, and finally stored it in SQL Server 2025 using the new JSON data type. The source code can be adapted to different source files, source data stores, and destination models.

Next Steps

Resources

Community

Subscribe