By: Maria Zakourdaev | Updated: 2018-07-13 | Comments | Related: More > Import and Export
SQL Server DBAs have many ways to bulk import data into a database table. We can use the BCP utility, the bulk copy tool that can load big amounts of data from csv/text files into a SQL Server database table. Another option for importing flat files would be the Import/Export Wizard. This wizard is very easy to use and customizable, but also only limited to flat file formats. We also can bulk import data using the BULK INSERT command. BULK INSERT has the same functionality as “BCP in”. The difference is that the BULK INSERT command reads the data from the SQL Server engine where the BCP utility is an external tool that does not have to be executed from the SQL Server machine.
All the above methods are quite fast and customizable. We can choose field and row terminators, data types, character formats and for the BULK INSERT command we can use a format file and skip columns.
These days, there is another format, other than csv and txt files, which is being used everywhere, from browser-server communications to simply being a useful database data type. The JSON format (Java Script Object Notation) is a very common data file that contains attributes – value pairs as an XML alternative.
How can you load JSON files into a SQL Server table? What if we have a folder full of JSON files and you need to load all of them into your SQL Server database? Moreover, sometimes those files are huge, and you need to load only a subset of the columns and filter out some rows?
In this tip we will look at different approaches to loading JSON format files into SQL Server.
This is how you can prepare a small JSON file to play with the below solutions.
select * from msdb.sys.objects FOR JSON PATH
You should copy the result of the above query into a flat file and save it. I called mine small_json.json.
Import JSON Data into SQL Server with OPENROWSET and OPENJSON
The first solution that we have is an OPENROWSET method which is an ad-hoc alternative to a linked server. OPENROWSET is being executed as part of the INSERT…SELECT… statement and can return the whole file contents as a single clob when we use the option BULK. Starting with SQL Server 2016 we have a new function OPENJSON function that can parse the text and convert it into rows and columns.
SELECT import_data.* FROM OPENROWSET (BULK 'C:\Files\import_data\small_json.json', SINGLE_CLOB) as j CROSS APPLY OPENJSON(BulkColumn) WITH( name varchar (255), object_id varchar (255) , schema_id varchar (255) , parent_object_id varchar (255) , type varchar (255) , type_desc varchar (255) , create_date varchar (255) , modify_date varchar (255), is_ms_shipped varchar (255), is_published varchar (255) , is_schema_published varchar (255) ) AS import_data
The resultset can be loaded into any table, fully or only several columns using the INSERT INTO clause. We can also limit the resultset by filtering out unneeded rows by using a WHERE clause.
Unfortunately, OPENROWSET cannot accept variables when we want to dynamically loop over the directory with files and load all of them. The only way to achieve the required effect is to execute it as a dynamic SQL. I will be using my directory with 10 big json files.
This script does the following:
- Creates a temporary Dir table
- Uses xp_DirTree procedure that can list all files in the folder to populate the Dir table
- Creates the destination Dim_Countries table with PRIMARY KEY that ignores duplicates.
- Iterates over the file list and puts into Dim_Countries table data from Country column filtering out data on AdClicks column.
- At the end we have distinct list of Coutries in the Dim_Countries table
DROP TABLE IF EXISTS Dir; CREATE TABLE Dir (id int identity(1,1),subdir varchar(250),d int, f int); INSERT INTO Dir EXEC xp_DirTree 'C:\Files\import_data\',1,1; DECLARE @File varchar(250), @FileId int, @MaxId int, @cmd varchar(max) SELECT @FileId = MIN(id), @MaxId = MAX(id) FROM Dir DROP TABLE IF EXISTS Dim_Countries; CREATE TABLE Dim_Countries (Country varchar(250) NOT NULL PRIMARY KEY WITH (IGNORE_DUP_KEY = ON)); WHILE @FileId IS NOT NULL BEGIN SELECT @File = 'C:\Files\import_data\' + subdir FROM Dir WHERE id = @FileId SET @cmd = ' INSERT INTO Dim_Countries SELECT Country FROM OPENROWSET (BULK '''+ @File+''', SINGLE_CLOB) as j CROSS APPLY OPENJSON(BulkColumn) WITH(  varchar (255),  varchar (255) ,  varchar (255) ,  varchar (255), Country varchar (255) ,  varchar (255) ,  varchar (255),  varchar (255),  varchar (255),  varchar (255),  varchar (255),  varchar (255),  varchar (255),  varchar (255), AdClicks varchar (255),  varchar (255),  varchar (255),  varchar (255),  varchar (255)) AS import_data WHERE AdClicks > 0' EXEC (@cmd) SELECT @FileId = MIN(id) FROM Dir WHERE id > @FileId END SELECT * FROM Dim_Countries
Import JSON Data into SQL Server with a Python Script
There is an additional option that is available starting with SQL Server 2017 that can help us to work with JSON files. We can use the Python JSON library to load the JSON files, fully or partially.
To enable this functionality, you will need to use sp_configure as follows:
EXEC sp_configure 'external scripts enabled',1
In order to read our small JSON file, we will use sp_execute_external_script procedure with language set to Python. Inside the Python script I have used the JSON library to read JSON files and pandas library to format and return the resultset back to SQL Server. To add metadata to our resultset, I have used an EXECUTE command option WITH RESULT SET and have listed all columns.
This script does the following:
- Uses Python open function to get a file object from friends.json file
- Uses json.load to deserialize JSON data from a file object
- Puts the result into jvalues variable which is a Python data structure called a list (list is a collection of data types that is changeable, allow duplicate members and has an order).
- The resultset is being converted into the OutputDataSet DataFrame object which is the default variable that is being used to return the data to SQL Server from the Python runtime.
exec sp_execute_external_script @language =N'Python', @script=N' import pandas as pd import json jvalues = list() data = json.load(open("C:\\Files\\small_json.json", ''r'')) jvalues = list() for val in data: jvalues.append(val) dataset = pd.DataFrame(data=jvalues) OutputDataSet=dataset ' WITH RESULT SETS (( name varchar (255), object_id varchar (255), schema_id varchar (255), parent_object_id varchar (255), type varchar (255), type_desc varchar (255), create_date varchar (255), modify_date varchar (255), is_ms_shipped varchar (255), is_published varchar (255), is_schema_published varchar (255) ))
If I need to take all files from the specific directory, Python has easy method to loop over directory files. The below script also takes only the Country column and limits the number of inserted rows by filtering on the AdClicks column.
This script does the following:
- Loops over my directory of files using os.listdir (method to list files in directory),
- Line by line in each JSON file checks AdClicks field
- If the above column is > 0 takes out only Country field
- Puts the result into jvalues variable which is a set (set is a collection data types that has no duplicate elements).
- The resultset is being converted into the OutputDataSet variable and is being returned to sql server client connection.
- The result of the below script is distinct list of countries across all files in the directory where AdClicks field was more than 0.
DROP TABLE IF EXISTS Dim_Countries; CREATE TABLE Dim_Countries (Country varchar(250) ) INSERT INTO Dim_Countries EXEC sp_execute_external_script @language =N'Python', @script=N' import pandas as pd import json path = "C:\\Files\\import_data\\" files = os.listdir(path) jvalues = set() for file in files: data = json.load(open(path + file, ''r'')) for val in data: if val["AdClicks"] > 0: jvalues.add(val["Country"]) dataset = pd.DataFrame(data=list(jvalues)) OutputDataSet=dataset ' select * from Dim_Countries
Load Speed Comparisons – OPENROWSET(BULK..) vs Python
To loop over 10 files, 34 MB each (100000 items) returning a distinct Country column and filtering by the AdClicks column took 19 seconds with Python and 46 seconds using the OPENROWSET command. If we move the removing duplicates part from the INSERT to be executed using SELECT DISTINCT after the script, it takes 40 seconds.
Which one is better, Python or OPENROWSET?
When we need to work with JSON files we can use either OPRROWSET(BULK…) or Python external procedures to load JSON data into the database tables. Both methods work well as we saw in the above tests. The Python script is slightly faster. Both methods can load a subset of the data columns and filter out unneeded rows. We can do any data manipulation logic, but OPENROWSET(BULK…) applies it after the data has been loaded and this produces some transaction log overhead.
Also, the Python script does not use SQL Server resources at all because it is being executed outside of the database engine, more like Full Text Search. Just keep in mind that it is still uses the same server resources, like CPU and memory.
- Check out these other articles:
Last Updated: 2018-07-13
About the author
View all my tips