Ingest data from a web page into SQL Server 2017 using R

By:   |   Comments (6)   |   Related: > SQL Server 2017


Problem

This is the era of big data and data exists in every form in structured and unstructured format. One of the biggest mediums of data source is the open web where innumerable web pages host volumes of data. It is very possible that one may be browsing different sites for interesting information and one may come across some information on a web page that is of high interest. Copying data from the web page and adding it to a data repository is one method of doing it, but if this process of reading data from the web is automated, then it can become a regular data source. This then opens the possibility of using the entire web as your data source. The primary challenge in this process has been the dependency on application programmers to develop code that reads a web page and inserts data into a SQL Server table. In this tip we will develop code that can read data from web page just with T-SQL and R.

Solution

R libraries can be used from T-SQL to ingest data from the web into SQL Server which we will show in this tip.

Installing In-Database Machine Learning Services

R is a statistical programming language that draws its power from thousands of open-source libraries that adds huge volumes of functionality. SQL Server 2017 In-Database Machine Learning Services supports execution of R scripts from T-SQL. In this tip, we will develop a working example to read data from the web and ingest it into SQL Server.

To ingest data from a web page, first we need a web page which has data we want to use. Every web page does not have data in the form of an html table, but that does not mean the data is unstructured. It still has an html structure / pattern around the data. For this example, I have used the MSSQLTips Authors web page which contains a list of all the contributing authors on MSSQLTips.com. Below is the URL and a glimpse of the webpage showing my picture. If you analyze this page, you will find that all authors have an “Image”, “Name”, “Description”, and some other attributes like “Author Since” and “Tip Contributions”. This is a structure or a pattern in itself, though not a tabular structure.

Sample Webpage - Description: Sample Webpage containing author list

Let’s study the HTML structure in which the authors are listed. If you right-click the page and click “View Source”, you will be able to view the source of the page. Depending upon which browser you are using, you will find the source of the page as shown below. I searched for my name and I found three occurrences as highlighted below. Let’s say, the one highlighted in Orange is the Author Name we are interested in fetching in our T-SQL code. If you carefully observe, the “div” of all the author names have a class of “large-7” and wrapped in a hyperlink tag.

Page Source - Description: HTML Source code of the web page

Now let’s try to extract the list of authors in T-SQL using the code below. The sp_execute_external_script is used to execute R / Python Scripts in SQL Server 2017. The language parameter specifies the language being used is R. The script parameter specifies the R script to be executed. “rvest” is one of the R packages that can work with HTML / XML Data. Here we are going to use this package, hence we are using the library function to load this package. If your R installation does not have this package you can install it by executing the command in R console, which is commented with a # sign below.

We are reading the HTML of the webpage by using the read_html function by passing the web URL to this function and assigning the result to the authorshtml variable. This variable contains the HTML, which is filtered in the next statement. %>% is the pipe operator in R, which passes the output from the expression of the left side as the input to the expression on the right side. In the below code, the output of authorhtml is passed to html_nodes function to filter elements that match the pattern of “div and large-7” appearing in the HTML. The output of this is passed to the html_nodes function to filter the hyperlink HTML elements which has the tag “a”. The output is passed to the html_text function to extract the text of this hyperlink tag, which should be the name of the author.

The final output is assigned to the authors variable. SQL Server mandates that any data exchanges between T-SQL and R should be in the form of a data frame, so we are using the data.frame function to create a data frame using the data in authors variable and assigning it to the default output data set i.e. OutputDataSet. As the R script is emitting output, to collect the data in T-SQL we need to specify the definition of the output using the “with result sets” statement. Here we are just going to read a single field Authors, so we have created a single field in the result sets and given it a data type of varchar(max). You can limit the size of this field, but if you are unsure of the max length of the author name or not sure if the length may change in the future, you can use the length as max.

execute sp_execute_external_script 
@language = N'R',
@script = N'
 
#install.packages("rvest")
library(rvest)
authorshtml <- read_html("https://www.mssqltips.com/sql-server-mssqltips-authors/")
authors <- authorshtml %>% html_nodes("div .large-7") %>% html_nodes("a") %>% html_text() 
OutputDataSet <- data.frame(authors)
 
'
with result sets ((Authors varchar(max)))
   

After executing the code, you should find the output as below. The list of the authors is available in a tabular single field as the output of this code. You can wrap this code in a view / stored procedure / function and may even use it inside a SQL Server Agent Job to refresh data in your Authors table where you may want to frequently read the data from web and refresh it with latest data. The interesting part is that we are able to ingest data from a web page using just 4 lines of code in R script and less than 10 lines of total code.

Ingested Data from Web Page - Description: Ouput of code that ingests data from web page and shows it in T-SQL.

In this way we can open the gates to ingest data from the web into SQL Server using In-Database Machine Learning Services and R in SQL Server 2017, without any dependency on application components or complex scripting in SQL Server.

Next Steps
  • Consider ingesting data from different web-pages where you may find data of interest using the above code example.


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Siddharth Mehta Siddharth Mehta is an Associate Manager with Accenture in the Avanade Division focusing on Business Intelligence.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips



Comments For This Article




Monday, August 14, 2017 - 10:09:18 AM - Siddharth Mehta Back To Top (64817)

Hi BV_Data,

The error message says that your R insallation does not have Rvest. You will have to install the same.

 

Regards,

Siddharth


Monday, August 14, 2017 - 9:27:05 AM - BV_Data Back To Top (64815)

I am getting this error: Not sure what I am doing wrong.

 

Msg 39004, Level 16, State 20, Line 0

A 'R' script error occurred during execution of 'sp_execute_external_script' with HRESULT 0x80004004.

Msg 39019, Level 16, State 1, Line 0

An external script error occurred: 

Error in library(rvest) : there is no package called 'rvest'

Calls: source -> withVisible -> eval -> eval -> library

 

Error in ScaleR.  Check the output for more information.

Error in eval(expr, envir, enclos) : 

  Error in ScaleR.  Check the output for more information.

Calls: source -> withVisible -> eval -> eval -> .Call

Execution halted

Msg 11536, Level 16, State 1, Line 1

EXECUTE statement failed because its WITH RESULT SETS clause specified 1 result set(s), but the statement only sent 0 result set(s) at run time


Thursday, August 3, 2017 - 5:35:15 PM - Francesco Mantovani Back To Top (63879)

It's not working for me because "SQL Server was unable to communicate with the LaunchPad service."

If I try to run Administrative Tools > Services > SQL Server launchpad it says: Error 1053: The service did not respond to the start or control request in a time  fashion.


Thursday, August 3, 2017 - 1:35:46 PM - bernie black Back To Top (63860)

Nice tip !  Excellent.


Thursday, August 3, 2017 - 8:54:37 AM - Robert Sterbal Back To Top (63847)

 Did you develop code to collect tabular data as well? Something that returns Author Name, twittter handle, blog url and other links?

 


Tuesday, July 11, 2017 - 10:46:08 PM - Daniel Farina Back To Top (59267)

Hi Siddharth!

Great tip! I just wonder if it is possible to ingest web data over a https connection on a site that requieres logging in. For example, what if I need to ingest the emails of my gmail account?

Thanks!

 















get free sql tips
agree to terms