Build a standard import and export process with SQL Server 2017 and R
In any organization, data is stored in various locations and in several formats. Importing and exporting files and data is a very routine task in SQL Server. Developers use different mechanisms and tools as well as customized code to import and export different formats of data programmatically into and out of SQL Server either temporarily or permanently. Typically ETL tools are employed for this purpose.
But during the development phase, developers may not have ETL packages in place for importing and exporting data. Therefore, at times, they resort to manual methods or develop custom code for this. Maintaining different types of codebase to import/export data from various file formats requires more resources for developing, maintaining and operating the codebase. A better solution would be to have a common data gateway in the form of a library that can facilitate import/export of data into SQL Server, which would hide the underlying implementation complexity of importing/exporting data for each type of data source.
The rio package in R-services integrated with SQL Server 2017 provides an ability to address the problem. We will demonstrate how simple it is to work with various data formats with R services in T-SQL in this tip.
R comprises of a powerful tool of open source libraries and 'rio' is one amongst them. rio simplifies the process of importing and exporting data into and out of R in SQL Server. It allows almost all common data formats to be read or exported with the same function. Besides making import-export easy, it also uses R as a simple data parsing and conversion utility.
Before we begin, ensure system has the required software components like SQL Server 2017, SSMS, SQL Server R services (in-database), etc. Subsequently, to ensure rio is fully functional, we need to install this package. If you are new to R and installing R packages on SQL Server, consider following this link on how to install packages on R server.
Once the package is installed, open a new query window on SSMS and execute the code mentioned below. Let's go over the steps of the script.
EXECUTE sp_execute_external_script @language = N'R', @script = N' library(rio) print (packageVersion("rio")) '
The sp_execute_external_script is the store procedure that helps to execute external scripts (R in this case) inside the database engine.
- @language indicates the script language, which is R.
- @script specifies the R script that will be executed and its data type is nvarchar(max).
The library function references the rio package and function “packageVersion” returns the version of the rio package and is printed using the print function.
Successful execution of the code validates that the package has been installed correctly and is accessible to SQL Server.
Since the required environment has been set, letís go ahead and see how it works.
Import the Data into SQL Server
One of the popular file formats is the flat file format like CSV, TSV, delimited, etc. Let’s get started with the CSV file format. You can download a sample CSV file from here or feel free to use any CSV file that you have created or you already have in your system.
Below is the CSV file that we will be using in this tip. It is named CSVFile.csv and is saved in directory C:\temp\CSVFile.csv.
Once the CSV file is ready, execute the below code in SSMS.
EXECUTE sp_execute_external_script @language = N’R’, @script = N’ Library(rio) csvfile <- import(“C:\\temp\\CSVFile.csv”) print(csvfile)
In this code, apparently apart from referencing the rio package, we are using the 'import' function to read the CSV file followed by printing it in the output section using the print function.
Note: We need to use '\\'instead of '\'while providing the location of the CSV file. This is because '\'alone is considered an escape character.
Export Data from SQL Server
Now let’s try to output the same CSV data that we read in above into a better format than just printing it on the output console. In the below code, we are defining the output schema by using the 'WITH RESULT SETS' option and emitting output from the R-Script to T-SQL code by assigning the imported data to OutputDataSet data frame.
EXECUTE sp_execute_external_script @language = N'R', @script = N' library(rio) CSVFile <- import("C:\\temp\\CSVFile.csv") OutputDataSet <- CSVFile ' WITH RESULT SETS((SepalLength numeric, SepalWidth numeric, PetalLength numeric, PetalWidth numeric, Species nvarchar(10)))
Import an Excel File
Let’s switch to one another popular file format - Excel. We can read an Excel file same way as a CSV file. Execute the below code in order to import the Excel file and data can be seen in the result window. The Excel file is named ExcelFile.xlsx and is saved in directory C:\temp\ExcelFile.csv in the system. You can use any sample Excel data or can download the file from here.
EXECUTE sp_execute_external_script @language = N'R', @script = N' library(rio) excelfile <- import("C:\\temp\\ExcelFile.xlsx") print(excelfile) '
Import JSON File
Now let’s try to work with one of the leading formats for storing and exchanging data typically over web, which is JSON. You can refer to any sample JSON file or can download one from here. Below is the JSON data file that we will be referring to in this tip.
Once the below code is executed, it will import the JSON data and structure and the JSON data can be queried using JSON$columns.
EXECUTE sp_execute_external_script @language = N'R', @script = N' library(rio) JSONFile <- import("C:\\temp\\JSONFile.JSON") print(JSONFile$columns) '
Import Data from a GZip File
The rio package not only helps to read uncompressed/unzipped files, but it also allows direct import from compressed (.gzip) files. Zip files makes data storage and transfer way more efficient and faster by keeping related files together.
You can take any sample CSV file and zip it in the format of <filename>.csv.gz or can download it from here. Execute below code to import zip file in R.
EXECUTE sp_execute_external_script @language = N'R', @script = N' library(rio) gzipFile <- import("C:\\temp\\CSVFile.csv.gz") print(gzipFile) '
These were a few examples of how data with different file formats can be imported with the exactly the same code using the rio package in R. For more references on supported formats for the import function, please refer to the below chart.
Now let’s move forward and see how the export functionality works with rio in R in T-SQL.
Using rio to Convert Data Between File Formats
The rio package also allows you to easily convert data from one format to another.
Execute the below code in SSMS, the CSV flat file is the source file here. We have used the same CSV file as above and we are trying to export it to HTML format by using the export function.
EXECUTE sp_execute_external_script @language = N'R', @script = N' library(rio) CSVInput <- import("C:\\temp\\CSVFile.csv") HTMLOutput <- export(CSVInput,"C:\\temp\\HTMLOutputFile.html") '
Once the code is executed, data from CSV will be written into HTML format and 'HtmlOutputFile' file will be created at this location - C:\temp. The R exported data can be seen by clicking on the HTML file that gets created after executing the code.
Using rio to Convert from TXT to HTML Format
The functionality explained in the above step can be achieved using the CONVERT function too. However, it comes with a limitation. Let’s say there is a need of performing data manipulation and calculations on the imported file before we actually want to write it to the required format. The convert function will not provide that option, as it imports from the source file and exports to the destination file format instantly, without leaving any room for data transformation.
Execute the code below that uses the 'convert' function to transform the source file format into the destination file format (csv to html in this case).
Using an INSERT INTO Command with rio
We have reached the last leg of this tip where we will see how to pull data into a SQL Server table from any one of the file formats in R using the rio package in T-SQL.
For this, let's first quickly create a table in SQL Server named CSVImports. We will be using the CSV flat file used above or feel free to use any other format for this exercise. Columns of this newly created table should match the columns of the CSV file.
Execute the below code to load the data into the CSVImports table. Here the code is same as what was used above. Since we have made use of the 'INSERT' statement, the clause 'WITH RESULT SETS' is not relevant, so it is not used in the code. Also, if you look carefully, because of the pre-defined schema of the SQL table, the output is in good shape.
INSERT INTO [CSVImports] EXECUTE sp_execute_external_script @language = N'R', @script = N' library(rio) CSVFile <- import("C:\\temp\\CSVFile.csv") OutputDataSet <- CSVFile ' SELECT * FROM [CSVImports]
In this way, developers can use a common function from the rio package to read data from a variety of data formats, without the need to write custom and different code for different data sources / formats. Also, once the data is ready, it can be exported to the intended and supported formats with the exact same code, which reduces code duplication and complexity. This combined functionality enables developers to build a common data gateway for controlling data traffic in and out of SQL Server.
- Install the rio package in R services and try importing and exporting files (rio supported formats) into SQL Server.
- Try out some manipulation/transformation on the data that you have imported into R and export it into any desired/supported formats like html/xlsx, etc.
- Consider exploring other options available in rio from this link.
About the author
View all my tips