How to Extract URLs from HTML using Stored Procedure in SQL Server


By:   |   Updated: 2020-12-10   |   Comments (3)   |   Related: More > T-SQL


Problem

I need to extract all URLs from a source file such as an email, a web page or just text files. In this tip I will present a SQL Server stored procedure you can use to extract all URLs from a file.

Solution

The solution involves creating a T-SQL stored procedure in the SQL Server master database, called dbo.usp_extract_links, that accepts a single file name with the full path parameter and lists the URL links that exists in it.

Here is the stored procedure logic:

  1. The procedure creates a global temporary table that will store the input file contents.
  2. The procedure executes a dynamic T-SQL BULK INSERT statement that populates a global temporary table.
  3. The procedure opens a cursor that loops over the lines of the global temporary table.
  4. The procedure scans each line for a URL link (a href= tag value).
  5. If such value exists, it is stored in a table variable. If not, the procedure advances the cursor to the next loop iteration.
  6. When the loop ends, the table variable holds the URL link values.
  7. The global table is dropped and the cursor is deallocated.
  8. The table variable contents are displayed as output.

SQL Server Stored Procedure to Extract URL from Source File

-- ====================================================================================
-- Author:		Eli Leiba
-- Create date: 2020-10-08
-- Description:	extracts the URL Links from a source web (html) file 
-- Result is the URL list
-- ====================================================================================
CREATE PROCEDURE dbo.usp_extract_links (@htmfile nvarchar(3000))
AS
BEGIN
   DECLARE @bulkinstsql varchar (1000), @currline nvarchar(3000), @res nvarchar(2000), @link nvarchar(1000)
   DECLARE @P1 int, @P2 int
   DECLARE @RESTABLE TABLE (id INT identity, URLlink nvarchar(1000))
   
   CREATE TABLE ##htmlTable (dateline nvarchar(3000))
   
   SET @bulkinstsql = 'BULK INSERT ##htmlTable from ' +'''' + @htmfile + '''';
   
   EXEC (@bulkinstsql)
   
   DECLARE c_line cursor for select detLine from ##htmlTable
   
   OPEN c_line 
   
   FETCH NEXT FROM c_line INTO @currline  

   WHILE @@FETCH_STATUS = 0  
   BEGIN
      SET @link = NULL;
      SET @P1 = CHARINDEX ('href=', @currline,1) 
      IF (@p1 = 0) SET @link = null
      ELSE
      BEGIN
         SET @res = SUBSTRING (@currline, @P1 + 6, len (@currline) - @P1 - 7)
         SET @P2 = CHARINDEX ('>', @RES,1)
         IF @p2 = 0 SET @link = null
         ELSE
         BEGIN
            SELECT @link = substring (@res, 1, @p2 - 2)
         END
      END
      IF @link is NOT NULL insert into @RESTABLE values (@link)
      FETCH NEXT FROM c_line INTO @currline 
   END

   CLOSE c_line 
   DEALLOCATE c_line

   DROP table ##htmlTable

   SELECT * from @RESTABLE
END
GO

Sample Stored Procedure Execution

Below is sample code from a web page.  I saved the web page source code in an html file called sample.htm in c:\sqltips directory.

<!DOCTYPE html>
<html class="no-js" lang="en">
<head>
<meta http-equiv="Content-Language" content="en-us" />
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />
<link rel="SHORTCUT ICON" href="/favicon.ico" />
<meta content="width=device-width, initial-scale=1.0" name="viewport" />
<title>Join the MSSQLTips SQL Server Community</title>
<meta name="description" content="Signup for the MSSQLTips.com newsletter and get free SQL Server tips delivered to you." />
<link rel="canonical" href="https://www.mssqltips.com/get-free-sql-server-tips/" />
<link rel="stylesheet" href="/css/foundation-minimized.css">
<link rel="stylesheet" href="/css/app.css">
</head>
<body style="background-color:white;">
<a name="top"></a>
<div class="sticky">
<nav class="top-bar" data-topbar role="navigation" ">
<ul class="title-area">
<li class="name">
<a href="https://www.mssqltips.com/"><img alt="mssqltips logo" src="/images/mslogo_bw_89x45-2.png"></a>
</li>
<li class="toggle-topbar menu-icon"><a href="#"><span>Menu</span></a></li>
</ul>
<section class="top-bar-section">
<ul class="right">
<li class="has-dropdown"><a onclick="_gaq.push(['_trackEvent', 'hmenu', 'beginner', 'click']);" href="/sql-server-tutorials/">Beginner</a>
<ul class="dropdown">
<li><a onclick="_gaq.push(['_trackEvent', 'hmenu', 'beginner', 'click']);" href="https://www.mssqltips.com/sqlservertip/1948/what-is-sql-server/">What is SQL Server?</a></li>
<li><a onclick="_gaq.push(['_trackEvent', 'hmenu', 'beginner', 'click']);" href=" https://www.mssqltips.com/sqlservertutorial/9214/sql-server-101-tutorial-outline-and-overview/">SQL Server 101</a></li>
<li><a onclick="_gaq.push(['_trackEvent', 'hmenu', 'beginner', 'click']);" href="https://www.mssqltips.com/sqlservertip/6422/sql-server-concepts/">SQL Server Concepts</a></li>
<li><a onclick="_gaq.push(['_trackEvent', 'hmenu', 'beginner', 'click']);" href="https://www.mssqltips.com/sqlservertip/6429/sql-server-download-quick-links/">SQL Server Download Links</a></li>
<li><a onclick="_gaq.push(['_trackEvent', 'hmenu', 'beginner', 'click']);" href="https://www.mssqltips.com/sqlservertutorial/3700/my-first-sql-server-business-intelligence-project-tutorial/">My First BI Project </a></li>
<li><a onclick="_gaq.push(['_trackEvent', 'hmenu', 'beginner', 'click']);" href="https://www.mssqltips.com/sqlservertip/2677/sql-server-career-planning/">SQL Server Career Plan</a></li>
<li><a onclick="_gaq.push(['_trackEvent', 'hmenu', 'beginner', 'click']);" href="https://www.mssqltips.com/sql-server-webcasts/">Webinars</a></li>
<li><a onclick="_gaq.push(['_trackEvent', 'hmenu', 'beginner', 'click']);" href="https://www.mssqltips.com/sql-server-categories/">All Categories</a></li>
</ul>
</li>
</body>
</html>
			

The procedure is executed as follows:

exec dbo.usp_extract_links 'c:\sqltips\sample.htm'			

Here are the results:

sql server
Next Steps
  • You can create and compile this stored procedure in your master database and use it as a simple TSQL tool for extracting URL links from web page source HTML and JS code.
  • The procedure was tested for SQL Server 2019, but should work with most versions of SQL Server.


Last Updated: 2020-12-10


get scripts

next tip button



About the author
MSSQLTips author Eli Leiba Eli Leiba is a senior application DBA, a teacher and a senior database consultant with 19 years of RDBMS experience.

View all my tips



Comments For This Article




Saturday, December 12, 2020 - 1:19:33 PM - Eli Leiba Back To Top (87905)
Yes , This issue is not handled by this procedure and the procedure needs modifications to support it.

Eli

Friday, December 11, 2020 - 6:23:07 PM - Tim Kehoe Back To Top (87902)
Two caveats: 1)a misspelling in the SP dateline should be detline 2) the html line which contains 9214 has a space before http hence the LTRIM in alternative code to using a cursor shown below. I included a CROSS APPLY to make the WHERE operand look simpler. The WHERE clause eliminates 2 rows that do not seem useful and includes the get-free-sql-server-tips link.

SELECT Link FROM ##htmlTable
CROSS APPLY
(SELECT SUBSTRING(DATELINE,CHARINDEX('href="',dateline,1)+6,
CHARINDEX('"',dateline,CHARINDEX('href="',dateline,1)+6) - (CHARINDEX('href="',dateline,1)+6)) AS Link) AS Links
WHERE dateline LIKE '%href="%' and LTRIM(Link) LIKE 'http%'

Friday, December 11, 2020 - 5:32:36 AM - Einav Hacohen Back To Top (87900)
This good simple solution, but what if the link is spreading over two lines?


download





Recommended Reading

SQL Server Cursor Example

Using MERGE in SQL Server to insert, update and delete at the same time

Rolling up multiple rows into a single row and column for SQL Server data

SQL Server Loop through Table Rows without Cursor

SQL Server DROP TABLE IF EXISTS Examples














get free sql tips
agree to terms