Ingest data from a web page into SQL Server 2017 using R
This is the era of big data and data exists in every form in structured and unstructured format. One of the biggest mediums of data source is the open web where innumerable web pages host volumes of data. It is very possible that one may be browsing different sites for interesting information and one may come across some information on a web page that is of high interest. Copying data from the web page and adding it to a data repository is one method of doing it, but if this process of reading data from the web is automated, then it can become a regular data source. This then opens the possibility of using the entire web as your data source. The primary challenge in this process has been the dependency on application programmers to develop code that reads a web page and inserts data into a SQL Server table. In this tip we will develop code that can read data from web page just with T-SQL and R.
R libraries can be used from T-SQL to ingest data from the web into SQL Server which we will show in this tip.
Installing In-Database Machine Learning Services
R is a statistical programming language that draws its power from thousands of open-source libraries that adds huge volumes of functionality. SQL Server 2017 In-Database Machine Learning Services supports execution of R scripts from T-SQL. In this tip, we will develop a working example to read data from the web and ingest it into SQL Server.
To ingest data from a web page, first we need a web page which has data we want to use. Every web page does not have data in the form of an html table, but that does not mean the data is unstructured. It still has an html structure / pattern around the data. For this example, I have used the MSSQLTips Authors web page which contains a list of all the contributing authors on MSSQLTips.com. Below is the URL and a glimpse of the webpage showing my picture. If you analyze this page, you will find that all authors have an “Image”, “Name”, “Description”, and some other attributes like “Author Since” and “Tip Contributions”. This is a structure or a pattern in itself, though not a tabular structure.
Let’s study the HTML structure in which the authors are listed. If you right-click the page and click “View Source”, you will be able to view the source of the page. Depending upon which browser you are using, you will find the source of the page as shown below. I searched for my name and I found three occurrences as highlighted below. Let’s say, the one highlighted in Orange is the Author Name we are interested in fetching in our T-SQL code. If you carefully observe, the “div” of all the author names have a class of “large-7” and wrapped in a hyperlink tag.
Now let’s try to extract the list of authors in T-SQL using the code below. The sp_execute_external_script is used to execute R / Python Scripts in SQL Server 2017. The language parameter specifies the language being used is R. The script parameter specifies the R script to be executed. “rvest” is one of the R packages that can work with HTML / XML Data. Here we are going to use this package, hence we are using the library function to load this package. If your R installation does not have this package you can install it by executing the command in R console, which is commented with a # sign below.
We are reading the HTML of the webpage by using the read_html function by passing the web URL to this function and assigning the result to the authorshtml variable. This variable contains the HTML, which is filtered in the next statement. %>% is the pipe operator in R, which passes the output from the expression of the left side as the input to the expression on the right side. In the below code, the output of authorhtml is passed to html_nodes function to filter elements that match the pattern of “div and large-7” appearing in the HTML. The output of this is passed to the html_nodes function to filter the hyperlink HTML elements which has the tag “a”. The output is passed to the html_text function to extract the text of this hyperlink tag, which should be the name of the author.
The final output is assigned to the authors variable. SQL Server mandates that any data exchanges between T-SQL and R should be in the form of a data frame, so we are using the data.frame function to create a data frame using the data in authors variable and assigning it to the default output data set i.e. OutputDataSet. As the R script is emitting output, to collect the data in T-SQL we need to specify the definition of the output using the “with result sets” statement. Here we are just going to read a single field Authors, so we have created a single field in the result sets and given it a data type of varchar(max). You can limit the size of this field, but if you are unsure of the max length of the author name or not sure if the length may change in the future, you can use the length as max.
execute sp_execute_external_script @language = N'R', @script = N' #install.packages("rvest") library(rvest) authorshtml <- read_html("https://www.mssqltips.com/sql-server-mssqltips-authors/") authors <- authorshtml %>% html_nodes("div .large-7") %>% html_nodes("a") %>% html_text() OutputDataSet <- data.frame(authors) ' with result sets ((Authors varchar(max)))
After executing the code, you should find the output as below. The list of the authors is available in a tabular single field as the output of this code. You can wrap this code in a view / stored procedure / function and may even use it inside a SQL Server Agent Job to refresh data in your Authors table where you may want to frequently read the data from web and refresh it with latest data. The interesting part is that we are able to ingest data from a web page using just 4 lines of code in R script and less than 10 lines of total code.
In this way we can open the gates to ingest data from the web into SQL Server using In-Database Machine Learning Services and R in SQL Server 2017, without any dependency on application components or complex scripting in SQL Server.
- Consider ingesting data from different web-pages where you may find data of interest using the above code example.
About the author
View all my tips