Read and Write Excel files in real-time with R in SQL Server 2017
By: Siddharth Mehta | Comments | Related: More > SQL Server 2017
Excel is one of the dominating file formats used by business analysts to store and maintain their data and analysis. Transactional data, analytical data, operational data and other varieties of data are hosted in diverse types of data repositories. The general practice is to load data from file-based source systems like Excel into data repositories, as this data can be either master or transactional data that may be required to be used in conjunction with other data in data repositories.
This approach has couple of drawbacks. First, this approach would be useful if the Excel file(s) from which data is loaded, would not be used further after data is loaded in data repositories i.e. migrate data from Excel file to database. If the data continues to persist and even keep changing in the Excel sheet, there would be a persistent need to keep synchronizing the data from the Excel sheet with its copy in the database. This leads to data duplication as well as data discrepancies or data dependencies at times.
In this tip, we will learn how to not only read, but also write to Excel files in real-time with just SQL Server 2017 and R. By real-time it means that whenever there is a need for the data hosted in an Excel file, the data would be read into the logic without permanently staging the data in SQL Server tables.
R contains several libraries for interacting with Excel documents, and we will be using one of these libraries for this. SQL Server 2017 ships with an option to install and use the R language with T-SQL. In this tip we will use just T-SQL and R along with a freely available R library for the above-mentioned purpose.
Before we start the actual implementation to interact with an Excel file, we need an Excel file with some sample data. Though you may need Microsoft Excel to author an Excel file, you do not need Excel installed on the machine where SQL Server or Machine Learning Server (R Server) is installed.
Step 1 - The first step is to setup the machine with the required software components i.e. SQL Server 2017 or any higher version of SQL Server installed with SSMS, and R installed as well as configured on this instance of SQL Server. If you need to refer the installation steps, you can follow the installation section of the R tutorial.
Step 2 - We will need to install a R package named – openxlsx. You can read the instructions from this link on how to install packages on a R server.
Step 3- Once you have installed these packages, the first step is to verify whether the library has been installed correctly and is accessible in the T-SQL code to the SQL Server instance.
Let’s try to understand this code.
- Sp_execute_external_script is the store procedure using which external scripts can be executed in SQL Server.
- @language parameter signifies the script language being used, which is R in this case.
- @script parameter contains the actual R script which will be executed
- The library function references the openxlsx package for use with rest of the code.
If the below code executes successfully, that means your package is correctly installed, configured and accessible to SQL Server.
execute sp_execute_external_script @language = N'R', @script = N' library(openxlsx) '
Step 4 - Now it’s time to read and create an Excel file, which we can read in the above code. Our intention is to read any Excel file with some data in it, so use any sample Excel file that you may have or create a new one with minimal data. Below is the Excel file we will be using in this tip. The file is named SampleExcel.xlsx.
Step 5 - Place the file in a directory that is accessible to SQL Server and Machine Learning Server (R Server). Execute the below code to read data from this file.
execute sp_execute_external_script @language = N'R', @script = N' library(openxlsx) DataFrame1 <- read.xlsx(xlsxFile = "C:\\Temp\\SampleExcel.xlsx", sheet = 1, skipEmptyRows = FALSE) OutputDataSet <- DataFrame1 ' with result sets ((Id int, Value varchar(2)))
In this code we are using the read.xlsx function from the openxlsx package and referencing the sample file that we created in the previous step. Make sure to use “\\” as single slash would be considered as an escape character. We are passing parameters to read data from first sheet. The output is assigned to a data frame, which we are then assigned to the output dataset. Finally, this output dataset is fitted to the schema mentioned in the result sets and we will get to see the result as shown below once the code is executed.
Step 6 - In the above setup we successfully read the data from an Excel file without loading it in SQL Server. We loaded that data as we needed the data via code i.e. in Real-Time. This helps reduce repeated data loading and can decouple storage and compute. Now let’s say that we read the data, performed some calculations, and now we intend to update this data or create a new dataset in a new file. Openxlsx library supports a variety of data write use-cases too on Excel files. We will learn how to create a new Excel file and modify some settings in the next step.
Step 7 - We will extend the code we created in Step 5 and create a new Excel file. Type the below code and execute to create a new Excel file with data.
execute sp_execute_external_script @language = N'R', @script = N' library(openxlsx) DataFrame1 <- read.xlsx(xlsxFile = "C:\\Temp\\SampleExcel.xlsx", sheet = 1, skipEmptyRows = FALSE) wb <- createWorkbook() addWorksheet(wb, "Siddharth Mehta") writeData(wb, "Siddharth Mehta", DataFrame1, startCol = 2, startRow = 3, rowNames = TRUE) saveWorkbook(wb, "C:\\Temp\\CopyOfSampleExcel.xlsx", overwrite = TRUE)'
Step 8 - Let’s understand the above code. After loading data from an Excel file into the data frame, the next line creates a workbook object using the createWorkbook function. Then using this object, we added a new worksheet to this workbook using the addWorksheet function and we gave it a name in the function parameter. After that we wrote data to this worksheet and passed the data frame as a parameter which we created from the Excel file that we read in the earlier step. We also provided options to start writing data from the second column and third row. Finally, in the last step we used the saveWorkbook function to save this workbook object and provided the filename and path to save the file.
Step 9 - Once you execute the above code, the new Excel file should get created. On opening this Excel file, it should look as shown below.
- This tip demonstrates a very tiny portion of the Excel file read and write options supported by the openxlsx library. Try exploring other options available in this library. You can read the openxlsx library documentation from here.
About the author
View all my tips