Powerful Text Extraction with SQL Server REGEXP

Problem

I need to extract text from string stored in a database table. However, they need to match a certain pattern, and the location can change between every row. Furthermore, it’s possible there are multiple matches inside one single string. Is it possible to do this in SQL Server?

Solution

In SQL Server 2025, support for regular expressions (regex) in T-SQL is introduced (as well as in Azure SQL DB and other SQL Server-related cloud products). With regex, more powerful pattern matching is possible than with the previous patterns that was available in SQL Server, such as those in the LIKE operator and the PATINDEX function. There are already quite some tips on the regex abilities in SQL Server 2025:

In this tip, we’ll explore the new REGEX_SUBSTR function and use it to extract parts of text using regex. If you don’t have access to SQL Server 2025 (which was in preview at the time of writing), you can use an instance of Azure SQL DB or Fabric SQL DB as well.

REGEXP_SUBSTR Function Syntax

The new function has the following syntax:

--MSSQLTips.com (TSQL)
REGEXP_SUBSTR(string_expression, regular expression [, start [, occurrence [, flags [, group ]]]])

The following parameters are mandatory:

string expression, which is the text in which you want to search
regular expression, which is the pattern you want to match

And there are several optional parameters:

start – this is the starting point of the search. This allows you to skip several characters in the text. It starts counting from 1, which means the first character is 1. If start is bigger than the length of the string, NULL is returned. The default is 1.
occurrence – if there are multiple matches in the string for the regex, it species which occurrence you want the function to return. The default is 1.
flags – these are the same options as explained in the tip REGEXP_LIKE Function in SQL Server 2025 (in the section “The Optional Flag Parameter”).
group – this parameter is used to specify which capture group (subexpression) to use. The default is 0, which means the entire regex is used.

The REGEXP_SUBSTR function is similar to the SUBSTRING function, but aside from the regex support, the REGEXP_SUBSTR is different because it doesn’t have an explicit start and length like SUBSTRING. Rather, REGEXP_SUBSTR will return an expression of any length and any position, as long as it matches the regex pattern.

Extract Text using REGEXP_SUBSTR

To demonstrate this new function, we’ll use the same sample data as in the tip SQL Server 2025 REGEXP_COUNT Function to Count Occurrences in Text (you can download the SQL script with INSERT statement here). This sample data contains AI-generated restaurant reviews that might contain zip codes, while there are different formats for the zip codes available.

To have bigger chunks of texts with possibly multiple zip codes inside it, we’ll concatenate the rows together into 10 long strings:

--MSSQLTips.com (TSQL)
WITH cte_src AS
(
    SELECT
         grp = ReviewID % 10
        ,ReviewText
    FROM dbo.Reviews
)
, cte_longreviews AS
(
    SELECT
         grp
        ,ReviewText_Long = STRING_AGG(ReviewText,' ')
    FROM cte_src
    GROUP BY grp
)
SELECT * FROM cte_longreviews;

Script

Very similar to how we extracted text in the tip Extract Text using Regular Expressions with SQL Server 2025 Function REGEXP_INSTR, we will use REGEXP_COUNT first to determine how many zip codes we have in a piece of text. Once we know how many occurrences we have for each row, we can use a tally table (also called numbers table) to iterate over each occurrence and extract it with our new function.

--MSSQLTips.com (TSQL)
DECLARE @zipcode_regex VARCHAR(50) = '\d{5}([-]|\s*)?(\d{4})?';
WITH cte_src AS
(
    SELECT
         grp = ReviewID % 10
        ,ReviewText
    FROM dbo.Reviews
)
, cte_longreviews AS
(
    SELECT
         grp
        ,ReviewText_Long = STRING_AGG(ReviewText,' ')
    FROM cte_src
    GROUP BY grp
)
, cte_cnt AS
(
    SELECT
         grp
        ,ReviewText_Long
        ,cnt = REGEXP_COUNT(ReviewText_Long, @zipcode_regex)
    FROM cte_longreviews
)
, cte_tally AS
(
    SELECT iterator = [value] FROM GENERATE_SERIES(1,100)
)
SELECT
     c.grp
    ,c.ReviewText_Long
    ,extracted_zip = REGEXP_SUBSTR(c.ReviewText_Long,@zipcode_regex,1,t.iterator)
FROM cte_cnt    c
JOIN cte_tally  t ON t.iterator <= c.cnt;

Let’s break this script down:

After grouping and concatenating the reviews into 10 long strings, we count the occurrences of zip codes using a regex pattern.
Then we create a tally table using the GENERATE_SERIES table-valued function. This common table expression (CTE) returns all numbers between 1 and 100 (we assume there are no more than 100 zip codes in one single piece of text.
We join the 10 rows with text against this iterator, resulting in 10 x 100 = 1000 rows, but we filter it down to the number of occurrences for that row. For example, suppose row #1 has 3 zip codes, we will have 3 rows from the iterator table (containing 1, 2 and 3).
Finally, we use the iterator value from the join with the tally table to extract that specific occurrence using the REGEXP_SUBSTR function.

This solution is a bit more elegant than the script using REGEXP_INSTR, since REGEXP_SUBSTR finds the boundaries of the matched pattern itself which means one less step is needed.

Using Groups to Extract a Part of a Pattern

The regex_substr documentation gives the following example to extract the last 4 digits of a credit card number:

--MSSQLTips.com (TSQL)
SELECT test = REGEXP_SUBSTR('1234 5678 1234 5678','\d{4}$');

However, this only works when the expression only contains the credit card number:

If it only works when the expression is just the credit card number, you don’t need a regex function, you can use RIGHT(mycolumn,4) to get the exact same result. Let’s modify the regex using groups so we can extract the last 4 digits even if other text is present. The following regex matches a credit card number (with spaces between the groups of digits): (\d{4}) (\d{4}) (\d{4}) (\d{4}). A matching pattern will have 4 groups of each 4 digits. The 5th parameter of REGEXP_SUBSTR allows us to specify which group we want to retrieve. The SQL becomes:

--MSSQLTips.com (TSQL)
SELECT REGEXP_SUBSTR('Hello, 1234 5678 1234 5678 is my credit card number.' --> string to be searched
                     ,'(\d{4}) (\d{4}) (\d{4}) (\d{4})' --> regex pattern contain 4 subexpressions
                     ,1 --> start
                     ,1 --> occurrence
                     ,'' --> flag
                     ,4); --> extract the 4th matched substring

extract last 4 digits of credit card number in text

Since, we need the last optional parameter (groups), we also need to specify the other parameters. We’re only searching for the first occurrence since the start of the string, but in the previous section of this tip we showed you how you can search for multiple occurrences using a tally table. For the flag parameter, an empty string is provided which tells SQL Server to use the default flag ‘c’, which is case-sensitive.

Keep in mind this is just an example, don’t share your credit card number.

Next Steps

Check out the regex_substr documentation, as it has some nice examples like extracting the domain of an e-mail address.
The tips Install SQL Server 2025 Demo Environment in Azure and Install AdventureWorks Database for SQL Server 2025 explain how you can start experimenting with SQL Server 2025.
You can find all SQL Server 2025 related tips in this overview.
Download sample data.

Koen Verbeeck

Koen Verbeeck is a seasoned business intelligence consultant with over a decade of experience with the Microsoft Data Platform. He holds several certifications, including Azure Data Engineer. He’s a prolific writer, with over 375 articles on technologies such as Microsoft Fabric, SSIS, ADF, SSAS, SSRS, MDS, Power BI, Snowflake and Azure services. He has spoken at various events such as PASS, SQLBits, dataMinds Connect and many others. He frequently delivers educational webinars on MSSQLTips.com. For his efforts, Koen has been awarded the Microsoft MVP data platform award for many years.

MSSQLTips Awards:
- Leadership Award (200+ Tips) – 2021
- Author of the Year – 2014/2020/2022
- Author Contender – 2024/2025