By: Eli Leiba | Comments (3) | Related: > TSQL
Problem
I need to extract all URLs from a source file such as an email, a web page or just text files. In this tip I will present a SQL Server stored procedure you can use to extract all URLs from a file.
Solution
The solution involves creating a T-SQL stored procedure in the SQL Server master database, called dbo.usp_extract_links, that accepts a single file name with the full path parameter and lists the URL links that exists in it.
Here is the stored procedure logic:
- The procedure creates a global temporary table that will store the input file contents.
- The procedure executes a dynamic T-SQL BULK INSERT statement that populates a global temporary table.
- The procedure opens a cursor that loops over the lines of the global temporary table.
- The procedure scans each line for a URL link (a href= tag value).
- If such value exists, it is stored in a table variable. If not, the procedure advances the cursor to the next loop iteration.
- When the loop ends, the table variable holds the URL link values.
- The global table is dropped and the cursor is deallocated.
- The table variable contents are displayed as output.
SQL Server Stored Procedure to Extract URL from Source File
-- ==================================================================================== -- Author: Eli Leiba -- Create date: 2020-10-08 -- Description: extracts the URL Links from a source web (html) file -- Result is the URL list -- ==================================================================================== CREATE PROCEDURE dbo.usp_extract_links (@htmfile nvarchar(3000)) AS BEGIN DECLARE @bulkinstsql varchar (1000), @currline nvarchar(3000), @res nvarchar(2000), @link nvarchar(1000) DECLARE @P1 int, @P2 int DECLARE @RESTABLE TABLE (id INT identity, URLlink nvarchar(1000)) CREATE TABLE ##htmlTable (detline nvarchar(3000)) SET @bulkinstsql = 'BULK INSERT ##htmlTable from ' +'''' + @htmfile + ''''; EXEC (@bulkinstsql) DECLARE c_line cursor for select detLine from ##htmlTable OPEN c_line FETCH NEXT FROM c_line INTO @currline WHILE @@FETCH_STATUS = 0 BEGIN SET @link = NULL; SET @P1 = CHARINDEX ('href=', @currline,1) IF (@p1 = 0) SET @link = null ELSE BEGIN SET @res = SUBSTRING (@currline, @P1 + 6, len (@currline) - @P1 - 7) SET @P2 = CHARINDEX ('>', @RES,1) IF @p2 = 0 SET @link = null ELSE BEGIN SELECT @link = substring (@res, 1, @p2 - 2) END END IF @link is NOT NULL insert into @RESTABLE values (@link) FETCH NEXT FROM c_line INTO @currline END CLOSE c_line DEALLOCATE c_line DROP table ##htmlTable SELECT * from @RESTABLE END GO
Sample Stored Procedure Execution
Below is sample code from a web page. I saved the web page source code in an html file called sample.htm in c:\sqltips directory.
<!DOCTYPE html> <html class="no-js" lang="en"> <head> <meta http-equiv="Content-Language" content="en-us" /> <meta http-equiv="Content-Type" content="text/html; charset=windows-1252" /> <link rel="SHORTCUT ICON" href="/favicon.ico" /> <meta content="width=device-width, initial-scale=1.0" name="viewport" /> <title>Join the MSSQLTips SQL Server Community</title> <meta name="description" content="Signup for the MSSQLTips.com newsletter and get free SQL Server tips delivered to you." /> <link rel="canonical" href="https://www.mssqltips.com/get-free-sql-server-tips/" /> <link rel="stylesheet" href="/css/foundation-minimized.css"> <link rel="stylesheet" href="/css/app.css"> </head> <body style="background-color:white;"> <a name="top"></a> <div class="sticky"> <nav class="top-bar" data-topbar role="navigation" "> <ul class="title-area"> <li class="name"> <a href="https://www.mssqltips.com/"><img alt="mssqltips logo" src="/images/mslogo_bw_89x45-2.png"></a> </li> <li class="toggle-topbar menu-icon"><a href="#"><span>Menu</span></a></li> </ul> <section class="top-bar-section"> <ul class="right"> <li class="has-dropdown"><a onclick="_gaq.push(['_trackEvent', 'hmenu', 'beginner', 'click']);" href="/sql-server-tutorials/">Beginner</a> <ul class="dropdown"> <li><a onclick="_gaq.push(['_trackEvent', 'hmenu', 'beginner', 'click']);" href="https://www.mssqltips.com/sqlservertip/1948/what-is-sql-server/">What is SQL Server?</a></li> <li><a onclick="_gaq.push(['_trackEvent', 'hmenu', 'beginner', 'click']);" href=" https://www.mssqltips.com/sqlservertutorial/9214/sql-server-101-tutorial-outline-and-overview/">SQL Server 101</a></li> <li><a onclick="_gaq.push(['_trackEvent', 'hmenu', 'beginner', 'click']);" href="https://www.mssqltips.com/sqlservertip/6422/sql-server-concepts/">SQL Server Concepts</a></li> <li><a onclick="_gaq.push(['_trackEvent', 'hmenu', 'beginner', 'click']);" href="https://www.mssqltips.com/sqlservertip/6429/sql-server-download-quick-links/">SQL Server Download Links</a></li> <li><a onclick="_gaq.push(['_trackEvent', 'hmenu', 'beginner', 'click']);" href="https://www.mssqltips.com/sqlservertutorial/3700/my-first-sql-server-business-intelligence-project-tutorial/">My First BI Project </a></li> <li><a onclick="_gaq.push(['_trackEvent', 'hmenu', 'beginner', 'click']);" href="https://www.mssqltips.com/sqlservertip/2677/sql-server-career-planning/">SQL Server Career Plan</a></li> <li><a onclick="_gaq.push(['_trackEvent', 'hmenu', 'beginner', 'click']);" href="https://www.mssqltips.com/sql-server-webcasts/">Webinars</a></li> <li><a onclick="_gaq.push(['_trackEvent', 'hmenu', 'beginner', 'click']);" href="https://www.mssqltips.com/sql-server-categories/">All Categories</a></li> </ul> </li> </body> </html>
The procedure is executed as follows:
exec dbo.usp_extract_links 'c:\sqltips\sample.htm'
Here are the results:
Next Steps
- You can create and compile this stored procedure in your master database and use it as a simple TSQL tool for extracting URL links from web page source HTML and JS code.
- The procedure was tested for SQL Server 2019, but should work with most versions of SQL Server.
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips