Web Screen Scraping with Python to Populate SQL Server Tables

By: Rick Dobson | Comments (11) | Related: > SQL Server 2017

Problem

My company manually tracks Internet web sites for competitive market intelligence and other data that help us to serve our customers better. I was recently tasked with developing automated procedures for extracting information from web sites to our SQL Server. Please present some sample Python and T-SQL scripts working together to illustrate how to scrape web pages to populate SQL Server tables.

Solution

SQL Server is great at storing data. Web sites are great at sharing data with anyone that views their pages in a browser. Web browsers read the html code at a url and display content embedded in the html. Web sites also sometimes offer an API for extracting data to facilitate the automatic extraction of web site content. For example, MSSQLTips.com demonstrated in a prior tip how to use the Google Finance API to extract historical stock prices with Python to SQL Server.

Unfortunately for those wishing to download data from web pages into SQL Server, not all web sites offer an API to facilitate programmatic download of content. Additionally, even when an API is offered, it may not provide easy programmatic access to all elements at the web site. Web scraping offers capabilities for extracting data from a web site when there is no programmatic support for downloading content.

This tip introduces the topic of web scraping with Python and how to use it to populate tables in SQL Server from web sites when there is no API to download data. The example for this tip centers on collecting information about tips published in MSSQLTips.com by different authors. However, you should keep in mind that the same general techniques can be applied to any substantive domain, such as prices for products from different retailers, financial and economic data, baseball player batting averages, and temperatures in different cities.

Libraries to use for web scraping with Python

A prior tip shows step-by-step instructions for installing Python 3.6 for use by all users on a server. The same tip also includes a simple example for getting started with Python programming and incorporating output from a Python program into SQL Server. While support for external scripting languages, such as Python and R, was introduced into recent versions of SQL Server, such as SQL Server 2016 and SQL Server 2017, the Python installation steps in the prior tip will work with any version of SQL Server that supports the xp_cmdshell extended stored procedure.

There are two key libraries to reference when web scraping with Python 3.6. The Requests library is a built-in Python library that lets your code specify a url and receive a response from the site. If your code connects to the url successfully and the web site is operational, Python will get a response from the web site just like a browser does when you point a browser at a url. The text from your Python request to the url is coded as html text. You can inspect the returned html text and compare it to a browser display of url content to get a feel for how data is encoded in the html.

The BeautifulSoup library is currently one of the most popular external libraries for parsing with Python the html returned by a programmatic request to a web site. Your code can use BeautifulSoup library objects and their methods for selectively excerpting html with embedded data returned from a request to a url. The following screen shot shows the process that I used to install the BeautifulSoup library on a computer that already had Python 3.6 installed. Notice that I used an admin command prompt because I wanted to install BeautifulSoup for use by all users on a computer.

Visually inspecting a source page for web scraping

This tip demonstrates how to extract tip titles, dates, and urls for tips by selected contributing authors to the MSSQLTips.com web site. The source for this information will be obtained from author profile pages at the MSSQLTips.com site; each contributing author has an author profile page listing tips by an author. The following two screen shots show excerpts from the author profile page for Jeremy Kadlec - one of the co-founders of MSSQLTips.com. There are many authors who contribute tips to the web site for SQL Server administrators, developers, and even general users.

The hyperlinks on the author profile pages have two main sections aside from user support and sponsor/partner promotion. The first main author section starts with the author's name in a html h1 heading format. After the author's name, the first section provides a general description of the author along with special accomplishments, such as awards for reaching a certain number of contributed articles. The next screen shot presents an excerpt from the first main section for Jeremy Kadlec. Notice especially the prominent display of the author name in a h1 format.

Scraping information on Jeremy Kadlec's Header

The second main section for Jeremy Kadlec appears in the screen shot below. This section enumerates the titles for articles contributed by an author. The tip titles are derived from text in a hyperlink after the web location referenced in the link's href attribute. Additionally, tip titles are followed by month, day, and year when a tip initially appeared on the web site.

Scraping information on Jeremy Kadlec's Detailed Tips

The following Python script shows a very simple and short code sample to download the html for the web page from which the preceding two screen shots were derived. This code sample is from the Integrated Development Language Environment (IDLE) application that installs as part of Python 3.6. Comment lines start with a hash (#) sign.

The first line of code imports the BeautifulSoup library for use within the script. While nice to have when your plan is to extract selective content from web pages, this library is not strictly necessary in this case because all this script does is download the whole page's html for visual inspection.

The next line of code imports the requests built-in Python library. The Python requests object has a get method which allows you to specify a url address for a web page. The get method can return a response just like when you insert a url as an address in a browser. The address used in the code sample below is for Jeremy Kadlec's author profile page at the MSSQLTips.com site. You can confirm this by verifying that the get argument has the same value as the address box in the preceding two screen shots.

The response from the get method (r) has several properties - one of which is the text property. The code sample below assigns the text property value to the data variable that is, in turn, printed. This step makes the page's html available for inspection.

#import BeautifulSoup
from bs4 import BeautifulSoup

#import request
import requests

#specify a web page for inspection
r=requests.get('https://www.mssqltips.com/sqlserverauthor/38/jeremy-kadlec/')
data = r.text

#return whole page for inspection
print (data)

The following screen shot shows an excerpt from the printed output generated by the preceding script with added comments. The excerpted text appears in a blue font and added comments appear in a black font.

Notice that the author name - Jeremy Kadlec - appears towards the top of the excerpt between opening and closing h1 tags. As it turns out, this is the only h1 tag pair on the author profile page. Therefore, you can readily extract an author's name by retrieving the text between opening and closing h1 tags on the response page from an author profile url.
The hyperlink reference address, tip title, and publication date for each tip appears within opening and closing li tags. The black comments towards the bottom of the excerpt mark these field values for Jeremy Kadlec's first two tips appearing on his author profile page.
The information for each tip starts with a hyperlink tag pair (<a> and </a>). A hyperlink reference (href value) points to the web location for a tip. All tip href values start with /sqlservertip/, which directs the hyperlink follower to an internal location within the MSSQLTips.com site.
- When specifying the url for a title from outside the site, you should prefix the internal hyperlink address with the name for the website and optionally a protocol for connecting to the web site
- For example, the url corresponding to the first tip should be represented as /sqlservertip/1130/sql-server-business-intelligence-tools/
The tip title appears as plain text after the hyperlink reference.
A tip's publication date appears within parentheses after the closing hyperlink tag (</a>) within each li tag pair. The publication date in parenthesis is text within the opening and closing li tags for a tip.

Parsing the html for a source page with Python

The Python script below indicates an approach to using the BeautifulSoup external library to parse author name, text from the li tags for tips, and hyperlink references from the html for an author's profile page on the MSSQLTips.com web site. The script specifically references the author profile for Jeremy Kadlec. The hyperlinks extracted are for the whole page whether or not they specifically pertain to a tip by the author for an author profile page. Once the hyperlinks are imported within SSMS, T-SQL code can be used to extract just the hyperlinks for tips - namely, those starting with /sqlservertip/.

This script starts similarly to the one for printing an author profile page for visual inspection. Changes to facilitate the parsing of the html for a page start after the text property of the response to a Python request is transferred to the variable named data. This variable is used as a parameter for the BeautifulSoup function library call. The other parameter to the function library designates the html.parser. There is another BeautifulSoup parser for xml documents. The output from the call to the function library is saved in another variable named soup.

The remainder of the script extracts three sets of data from the soup variable.

The findAll function for the h1 tag extracts all h1 tags in the author profile page. There is just one opening and closing h1 tag pair per author profile page. The text between the opening and closing h1 tags is the author name. Therefore, the print command after the first findAll function adds a single line to the output from the script with contents like: <h1> Author Name</h1>
Next, a second findAll function extracts all li tags from the author profile page. There is one pair of li tags for each tip by an author. The li tags extracted by the second findAll function contain the tip title and its publication date, which are in the text property value for each li tag pair. Therefore, the print command at the end of this code block adds one line with a title and date for each tip to the output.
The third findAll function extracts all the anchor tags. Recall that anchor tags denote hyperlinks, and there are hyperlinks for several different purposes within an author profile page. The print command in this code block prints the href value for each hyperlink on an author page. A subset of the href values are for tips; the returned href values will be filtered subsequently within a T-SQL script to extract just hyperlinks for tips from the author highlighted on the author profile page.

#import BeautifulSoup
from bs4 import BeautifulSoup

#import request
import requests

#retrieve html of target page
#and prepare for page parsing by BeautifulSoup
r=requests.get('https://www.mssqltips.com/sqlserverauthor/38/jeremy-kadlec/')
data = r.text
soup = BeautifulSoup(data, "html.parser")

#return author name
for h1 in soup.findAll('h1'):
        print (h1.get)

#returns text for tip title followed by date
for li in soup.findAll('li'):
        print (li.text)

#return url for links on a page
for link in soup.findAll('a'):
        print (link.get('href'))

Two additional nearly identical scripts were created to retrieve title information for two other MSSQLTips.com authors (Tim Wiseman and Rick Dobson). The only difference between all three scripts is the argument for the get method of the requests object. You can find the complete list of MSSQLTips.com authors at /sql-server-mssqltips-authors/. Then, all you need to do is to follow an author's link to discover the url for an author's profile page.

For your easy reference, the author profile page urls for Tim Wiseman and Rick Dobson appear below. Notice that the author profile urls are different by both name and a number preceding the author name. By the way, I chose Tim Wiseman as an author because he also programs in both Python and T-SQL. You may find his work at MSSQLTips.com and elsewhere of value especially if you are interested in using SQL Server and Python together.

https://www.mssqltips.com/sqlserverauthor/86/tim-wiseman/

https://www.mssqltips.com/sqlserverauthor/57/rick-dobson/

The three Python scripts were saved in the c:\python_programs folder. The script names were: jeremy_kadlec_screen_scrape.py, tim_wiseman_screen_scrape.py, and rick_dobson_screen_scrape.py.

Transferring data field values from parsed html to a SQL Server table

A prior tip demonstrated in detail how to retrieve output from Python scripts for use in SSMS. The approach illustrated in the prior tip uses the xp_cmdshell extended stored procedure and a Windows batch file that runs a Python script with the Python interpreter (python.exe).

Because the current tip uses three different Python scripts, the process was updated slightly. Instead of using a separate Windows batch file for the output from each of the three Python scripts, a single Windows batch file (mssqltips_titles_for_authors_screen_scrape.bat) was used with a parameter that points at a path with a Python script file. The code for the Windows batch file appears below. When running the batch file, assign the Python script name and path to the parameter (%1).

cd C:\Program Files\Python36python.exe %1

The next script shows a commented T-SQL script for extracting the title information for Jeremy Kadlec's tips into a temp table within SQL Server using the Python script and Windows batch files presented above.

Here's a walkthrough of key script code segments.

The batch of code from EXEC sp_configure 'xp_cmdshell', 1 RECONFIGURE through EXEC sp_configure 'xp_cmdshell', 0 RECONFIGURE runs the jeremy_kadlec_screen_scrape.py script file in the c:\python_programs\ path and stores the output in the #Result temp table.
The next block of code extracts the author's name and stores the value in the @author local variable; this local variable is used to denote which titles are authored by which author. So far, there is just one author, but this can be easily changed (see the next script for more on this).
Next, a couple of sequence objects (CountBy1_for_titles and CountBy1_for_links) are created for matching rows from two temporary tables.
- One of these temp tables (#titles) stores tip title and publication date.
- The other temp table (#title_links) stores the href value for the hyperlink that you can follow to a tip.
- The NEXT VALUE FOR function generates sequence instance values within each temp table to facilitate matching the titles and tip hyperlink href values based on the order in which they appear on the source page.
- As a result, the two temp tables can be joined by sequence instance values to unite title and publication date with hyperlink href value for each tip.
The join by sequence values in the next to last code block populates the #titles_date_link_author temp table with tip title, publication date, url for the tip, and author name for the tip.
The final line of the script below prints all the rows from the #titles_date_link_author table. This line is optional especially if you will be populating the temp table with other sets of title information by different authors for inclusion in the #titles_date_link_author temp table.

-- for use by Python dev team members


-- enable the xp_cmdshell stored procedure
EXEC sp_configure 'xp_cmdshell', 1
RECONFIGURE

--------------------------------------------------------------------------------------

-- create a fresh temp table (#Result) for storing unprocessed result set from 
-- running xp_cmdshell for Windows batch that invokes
-- python script
if exists (
 select * from tempdb.dbo.sysobjects o
 where o.xtype in ('U') 
 and o.id = object_id(N'tempdb..#Result')
)
BEGIN
  DROP TABLE #Result; 
END

CREATE TABLE #Result
(
  line varchar(500)
)

--------------------------------------------------------------------------------------

-- insert python script output into #Result temp table
INSERT #Result exec xp_cmdshell 
 'C:\python_programs\mssqltips_titles_for_authors_screen_scrape.bat "c:\python_programs\jeremy_kadlec_screen_scrape.py"'

--------------------------------------------------------------------------------------

-- disable the xp_cmdshell stored procedure
EXEC sp_configure 'xp_cmdshell', 0
RECONFIGURE
GO

 -- extract author name from h1 tags
 -- and assign to local variable
declare @author varchar(50) = 
(
select 
substring
(
 line
,charindex('<h1>',line)+4
,(charindex('</h1>',line)) - (charindex('<h1>',line)+4)
)
from #Result
where line like '%<h1>%'
)

-- create sequence objects for matching
-- #titles table and #title_links tables
begin try
drop sequence dbo.CountBy1_for_titles
end try
begin catch
print 'dbo.CountBy1_for_titles did not exist'
end catch
begin try
drop sequence dbo.CountBy1_for_links
end try
begin catch
print 'CountBy1_for_links did not exist'
end catch

CREATE SEQUENCE CountBy1_for_titles
    START WITH 1  
    INCREMENT BY 1 ;  

CREATE SEQUENCE CountBy1_for_links
    START WITH 1  
    INCREMENT BY 1 ;  

-- conditionally drop a temp table (#titles) for storing 
-- title and publication date values from 
-- result set from running xp_cmdshell for Windows batch that invokes
-- python script

if exists (
 select * from tempdb.dbo.sysobjects o
 where o.xtype in ('U') 
 and o.id = object_id(N'tempdb..#titles')
)
BEGIN
  DROP TABLE #titles; 
END

-- titles with dates and source_row_number
select 
 NEXT VALUE FOR CountBy1_for_titles AS source_row_number
,substring
(
 line
,1
,patindex('%([1-9]%',line)-2
)  [title]

,cast
(
 substring
 (
  line
  ,patindex('%([1-9]%',line)+1,charindex(')'
  ,line,patindex('%([1-9]%',line)) - patindex('%([1-9]%',line)-1
 )
as date) [date]
into #titles
from #Result
where 
charindex('(',line) > 0
and charindex(')',line) > 0
and 
isdate
(
 substring
 (
   line
  ,patindex('%([1-9]%',line)+1,charindex(')'
  ,line,patindex('%([1-9]%',line)) - patindex('%([1-9]%',line)-1
 )
) = 1

if exists (
 select * from tempdb.dbo.sysobjects o
 where o.xtype in ('U') 
 and o.id = object_id(N'tempdb..#title_links')
)
BEGIN
  DROP TABLE #title_links; 
END

-- links with source_row_number
select
NEXT VALUE FOR CountBy1_for_links AS source_row_number
,'https://www.MSSQLTips.com'+line [link for title]
into #title_links
from #Result
where line like '/sqlservertip/%'

-- create #titles_date_link_author and save join 
-- between #titles and #title_links in it
if exists (
 select * from tempdb.dbo.sysobjects o
 where o.xtype in ('U') 
 and o.id = object_id(N'tempdb..#titles_date_link_author')
)
BEGIN
  DROP TABLE #titles_date_link_author; 
END

select
#titles.title
,#titles.date
,#title_links.[link for title]
,@author [author for title]
into #titles_date_link_author
from #titles
left join #title_links
on #titles.source_row_number = #title_links.source_row_number

-- this line is included to show the result set,
-- but you may not want it especially if you
-- plan to concatenate title information for 
-- several different authors
select * from #titles_date_link_author

The preceding script is necessarily unique from scripts for tips by other authors besides Jeremy Kadlec for at least a couple of reasons.

First, you should not drop and recreate the #titles_date_link_author table when you are inserting a new result set for a different author in the #titles_date_link_author table. Instead, just insert rows for the new author to the rows already in the table for any previous authors.
Second, the Python script name for the result set will be different. For example, the script name for adding title information for Tim Wiseman to the #titles_date_link_author table must change. Use c:\python_programs\tim_wiseman_screen_scrape.py as the parameter value for the mssqltips_titles_for_authors_screen_scrape.bat file. Recall the parameter value was c:\python_programs\jeremy_kadlec_screen_scrape.py in the preceding script.

The following script shows the two changes implemented in a separate T-SQL script for tip title information for Tim Wiseman. Additionally, there is no trailing select statement to display all rows in the #titles_date_link_author table.

-- for use by Python dev team members


-- enable the xp_cmdshell stored procedure
EXEC sp_configure 'xp_cmdshell', 1
RECONFIGURE

--------------------------------------------------------------------------------------

-- create a fresh temp table (#Result) for storing unprocessed result set from 
-- running xp_cmdshell for Windows batch that invokes
-- python script
if exists (
 select * from tempdb.dbo.sysobjects o
 where o.xtype in ('U') 
 and o.id = object_id(N'tempdb..#Result')
)
BEGIN
  DROP TABLE #Result; 
END

CREATE TABLE #Result
(
  line varchar(500)
)

--------------------------------------------------------------------------------------

-- insert python script output into #Result temp table
INSERT #Result exec xp_cmdshell 
 'C:\python_programs\mssqltips_titles_for_authors_screen_scrape.bat "c:\python_programs\tim_wiseman_screen_scrape.py"'

--------------------------------------------------------------------------------------

-- disable the xp_cmdshell stored procedure
EXEC sp_configure 'xp_cmdshell', 0
RECONFIGURE
GO

 -- extract author name from h1 tags
 -- and assign to local variable
declare @author varchar(50) = 
(
select 
substring
(
 line
,charindex('<h1>',line)+4
,(charindex('</h1>',line)) - (charindex('<h1>',line)+4)
)
from #Result
where line like '%<h1>%'
)

-- create sequence objects for matching
-- #titles table and #title_links tables
begin try
drop sequence dbo.CountBy1_for_titles
end try
begin catch
print 'dbo.CountBy1_for_titles did not exist'
end catch
begin try
drop sequence dbo.CountBy1_for_links
end try
begin catch
print 'CountBy1_for_links did not exist'
end catch

CREATE SEQUENCE CountBy1_for_titles
    START WITH 1  
    INCREMENT BY 1 ;  

CREATE SEQUENCE CountBy1_for_links
    START WITH 1  
    INCREMENT BY 1 ;  

-- conditionally drop a temp table (#titles) for storing 
-- title and publication date values from 
-- result set from running xp_cmdshell for Windows batch that invokes
-- python script

if exists (
 select * from tempdb.dbo.sysobjects o
 where o.xtype in ('U') 
 and o.id = object_id(N'tempdb..#titles')
)
BEGIN
  DROP TABLE #titles; 
END

-- titles with dates and source_row_number
select 
 NEXT VALUE FOR CountBy1_for_titles AS source_row_number
,substring
(
 line
,1
,patindex('%([1-9]%',line)-2
)  [title]

,cast
(
 substring
 (
  line
  ,patindex('%([1-9]%',line)+1,charindex(')'
  ,line,patindex('%([1-9]%',line)) - patindex('%([1-9]%',line)-1
 )
as date) [date]
into #titles
from #Result
where 
charindex('(',line) > 0
and charindex(')',line) > 0
and 
isdate
(
 substring
 (
   line
  ,patindex('%([1-9]%',line)+1,charindex(')'
  ,line,patindex('%([1-9]%',line)) - patindex('%([1-9]%',line)-1
 )
) = 1


-- hyperlinks for titles with source_row_number
if exists (
 select * from tempdb.dbo.sysobjects o
 where o.xtype in ('U') 
 and o.id = object_id(N'tempdb..#title_links')
)
BEGIN
  DROP TABLE #title_links; 
END

select

NEXT VALUE FOR CountBy1_for_links AS source_row_number
,'https://www.MSSQLTips.com'+line [link for title]
into #title_links
from #Result
where line like '/sqlservertip/%'

-- create #titles_date_link_author and save join 
-- between #titles and #title_links in it
insert into #titles_date_link_author
--
select
#titles.title
,#titles.date
,#title_links.[link for title]
,@author [author for title]
from #titles
left join #title_links
on #titles.source_row_number = #title_links.source_row_number

Finally, please consider looking in the download file set for this tip because it contains a script file that populates a #titles_date_link_author table with tip title information for Jeremy Kadlec, Tim Wiseman, and Rick Dobson. The following three screen shots show tip title information for the most recent 10 tips for each author as of the time this tip was prepared (early October 2017). You can verify the tip title information by visiting the MSSQLTips.com author profile page for each author.

Next Steps

Install Python on your computer if it is not already installed. See this tip for detailed instructions on how to install Python in a Windows environment so that all users on a computer can run Python script files.
If you are concerned about maintaining a very high level of security on your SQL Server instance, review another tip on how to securely configure SQL Server when running Python scripts with the xp_cmdshell extended stored procedure.
Next, download the files for this tip into a c:\python_programs directory on your computer for running SQL Server.
Then, run the T-SQL script for getting title information for Jeremy Kadlec. Confirm that the title information from the result set matches the MSSQLTips.com author profile page content for Jeremy Kadlec.
After that, run the script for returning tip title information for Jeremy Kadlec, Tim Wiseman, and Rick Dobson. Confirm that there are at least 280 tips in the #titles_date_link_author table; 280 is the correct number of tips as of the date that this tip was submitted for publication.
Next, change the list of authors by adding a new author to the list of authors for which you return tip title information. This involves
- Creating a new Python script file to generate output for the new author
- Adding a new code segment to the end of the T-SQL script for title tip information for Jeremy Kadlec, Tim Wiseman, and Rick Dobson
- Pointing the new Python script file segment at the new author

About the author

Rick Dobson is an author and an individual trader. He is also a SQL Server professional with decades of T-SQL experience that includes authoring books, running a national seminar practice, working for businesses on finance and healthcare development projects, and serving as a regular contributor to MSSQLTips.com. He has been growing his Python skills for more than the past half decade -- especially for data visualization and ETL tasks with JSON and CSV files. His most recent professional passions include financial time series data and analyses, AI models, and statistics. He believes the proper application of these skills can help traders and investors to make more profitable decisions.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips

Wednesday, December 15, 2021 - 7:42:18 PM - Tim Edwards	Back To Top (89591)
Thanks, Greg, I really appreciate it! And Rick, great tip as always, I look forward to digging into this more!

Wednesday, December 15, 2021 - 6:07:16 PM - Greg Robidoux	Back To Top (89590)
Hi Tim, Thanks for letting us know the zip file was not there. Should be fixed now. -Greg

Wednesday, December 15, 2021 - 5:55:19 PM - Tim Edwards	Back To Top (89589)
I see this "Next, download the files for this tip into a c:\python_programs directory on your computer for running SQL Server." in the next steps section, but I don't see a link to actually download the files associated with this tip. Am I missing something?

Tuesday, November 27, 2018 - 1:48:56 PM - Adrian Clark	Back To Top (78337)
Hi, I couldn't locate the code for "mssqltips_titles_for_authors_screen_scrape.bat". Please could you let me know how to access this. Regards, Adrian