SQL Server 2016 Regular Expressions with the R Language

By:   |   Comments (1)   |   Related: > SQL Server 2016


Problem

It is well-known that T-SQL does not have native Regular Expression (RegEx) support for string related queries.  For example, you cannot search records with a regular expression in WHERE clause. I know I can write a CLR function / stored procedure with a RegEx do the work, but is there any other way that I can query a table with regular expressions in T-SQL code?

Solution

SQL Server 2016 has integrated the R language, i.e. T-SQL can call an R script directly and return the result set consumable by the SQL Server database engine directly. With the power of the R language and its native support for RegEx, we can now query a table with RegEx through R.

We will look at a few examples to demonstrate how to use RegEx in processing queries against tables.

Here are a few requirements:

  1. Find records with phone numbers which may have different formats
  2. Extract those phone numbers
  3. De-duplicate words that are in sequence, such as "hello hello world world" should be "hello world" after de-duplication.
  4. Merge multiple new lines to one line

Test Environment Setup

We will create a table with a few sample records as shown below:

use TestDB -- assume this is our testing database
go
if object_id('dbo.tblRegEx', 'U') is not null
 drop table dbo.tblRegEx

create table dbo.tblRegEx (id int identity, a varchar(300), b  varchar(300) );
go
insert into dbo.tblRegEx (a, b)
values ('hello hello hello world',  'my phone number is 321-111-1111')
, ( 'this this is is really fun','this number (604) 991-9111 is my cell phone')
, ( 'no duplicate here', 'no phone number here either, just my license# 111 111 2222')
, ( 'multiple blank lines
  
 
--this is 6th line', '222 333-4444 is my cell phone');

This is a very simple table with 3 columns, 2 of which are of varchar data type and will be used for our RegEx operation.

sample data

Quick Explanation of Test Data

  1. First two records (id=1, 2) have their [a] column with duplicate words.
  2. Three records have a phone number in the [b] column, but with different formats, i.e. 321-111-1111, (604) 991-9111 and 222 333-4444.
  3. For id=3 record, in its [b] column, we have a string of "111 111 2222", which looks like a phone number, but it is not as it misses a hyphen between 111 and 2222.
  4. The [a] column for the id = 4 record, has a value with 4 continuous empty lines, see the below screenshot with "Results to Text" turned on.
two emtpy lines in the data

T-SQL Code with R

RegEx is a powerful tool yet hard to master, when used properly, it provides an elegant and efficient solution. In the R language, RegEx is a maturely implemented feature which we can take advantage of.

We will first list the T-SQL code and the execution results, and then a detailed explanation of each section of the script.

USE TestDB 

--1. Find the records that contains phone number
exec sp_execute_external_script @language=N'R'
, @script = N'
pattern = "\\b\\(?\\d{3}\\)?[-\\s]\\d{3}-\\d{4}\\b"
outData <- subset(inData, grepl(pattern, b, perl = T))'
, @input_data_1 = N'select id, a, b from dbo.tblRegEx'
, @input_data_1_name = N'inData'
, @output_data_1_name=N'outData'
with result sets ( as object dbo.tblRegEx);

--2. Extract the phone number
exec sp_execute_external_script @language=N'R'
, @script = N'
pattern =  "(\\(?\\d{3}\\)?[-\\s]\\d{3}-\\d{4}\\b)";
m <- regexpr(pattern, inData$b, perl = T)
n <- rep(NA,length(inData$b))
n[m != -1] <- regmatches(inData$b, m)
inData$c <- n;
outData <- inData;'
, @input_data_1 = N'select id, a, b from dbo.tblRegEx'
, @input_data_1_name = N'inData'
, @output_data_1_name=N'outData'
with result sets ( (id int, a varchar(300), b varchar(300), phone varchar(20)) );


-- 3. Remove duplicate words
exec sp_execute_external_script @language=N'R'
, @script = N'
pattern <-"\\b(\\w+\\s*)(\\1\\s*)+";
inData$a <- gsub(pattern, "\\1", inData$a, perl = T );
outData <- inData;'
, @input_data_1 = N'select id, a, b from dbo.tblRegEx'
, @input_data_1_name = N'inData'
, @output_data_1_name=N'outData'
with result sets ( as object dbo.tblRegEx);


-- 4. Merge multiple empty lines to one line
exec sp_execute_external_script @language=N'R'
, @script = N'
pattern = "(?m)(\\s*\\r?\\n){2,}"; 
# m <- regexpr(pattern, x , perl = T)
inData$a <- gsub(pattern, "\r\n\r\n", inData$a, perl = T );
outData <- inData;'
, @input_data_1 = N'select id, a, b from dbo.tblRegEx'
, @input_data_1_name = N'inData'
, @output_data_1_name=N'outData'
with result sets ( as object dbo.tblRegEx);

Each T-SQL statement does two things:

  1. Read the table dbo.tblRegEx and pass the result set into R via the inData variable.
  2. Inside the R script, do the RegEx operation on the dataset of inData.

RegEx Explained

The following is a brief explanation of RegEx (i.e. pattern variable) used in the script

Note, in R, for a regular backslash \ in other language, such as C#, it needs to be doubled in R, i.e. \\.

  1. \\b\\(?\\d{3}\\)?[-\\s]\\d{3}-\\d{4}\\b: phone number format
    • \\b: word boundary
    • \\(?: character ( once or none as ? means once or none
    • \\d{3}: followed by 3 digits
    • \\)?: followed by character ) once or none as ? means once or none
    • [-\\s]: followed by either a hyphen - or a blank space, i.e. \\s
    • \\d{3}: followed by 3 digits
    • -: followed by a hyphen -
    • \\d{4}: followed by 4 digits
    • \\b: word boundary
  2. (\\(?\\d{3}\\)?[-\\s]\\d{3}-\\d{4}\\b): extract the phone number. This is almost identical to previous one, the outside () is to capture the match
  3. \\b(\\w+\\s*)(\\1\\s*)+: repeatable words format
    • \\b: word boundary
    • (\\w+\\s*): a word followed by one or more spaces
    • (\\1\\s*): \\1 means the exact copy of previous finding by (\\w+\\s*)
    • +: means repeating one or more times
  4. (?m)(\\s*\\r?\\n){2,}: two or more continuous blank lines
    • (?m): set the string to multiple line mode, so each line is analyzed via the RegEx pattern
    • (\\s*\\r?\\n): \\s*: zero or more blank spaces, \\r is carriage return, ? means 0 or 1 time, \\n is newline
    • {2,}:the previous match will be repeated one or more times.

Query Result

The following is the result of running the four R embedded T-SQL statements

query results

We can see four data results just as expected:

  • First data result, we get exactly 3 records that have phone numbers.
  • Second result, we indeed extract phone numbers out, and we add an additional column [phone] for these phone numbers.
  • Third result, we eliminate the duplicate words.
  • Fourth result, we merge the multiple empty lines to one, and this is not very clear, so we will switch to "Results to Text" to look at it

To look at the result of id=4 row for merge multiple empty lines to one line, we need to first dump the processed records into a temp table and then select the row with id = 4.

USE TestDB
select * into #t from dbo.tblRegEx where 1=0

insert into #t (id, a, b)
exec sp_execute_external_script @language=N'R'
, @script = N'
pattern = "(?m)(\\s*\\r?\\n){2,}"; 
inData$a <- gsub(pattern, "\r\n\r\n", inData$a, perl = T );
outData <- inData;'
, @input_data_1 = N'select id, a, b from dbo.tblRegEx'
, @input_data_1_name = N'inData'
, @output_data_1_name=N'outData'
--with result sets ( as object dbo.tblRegEx);
go
-- turn on "Results to Text" option
select a from #t where id=4
select a from dbo.tblRegEx where id =4

The result is as follows

merged empty lines

Summary

With the help of embedded R in SQL Server 2016, we can now take advantage of regular expressions to process some string related query and manipulation operations. This saves the overhead to write CLR objects to achieve the same result. This can be very handy for some business cases.

In this tip, we look at some examples about using RegEx in R, and hope this will help you to broaden your view on what R integration in SQL Server 2016 can help us in some niche scenarios.

Next Steps

Understanding RegEx can help solve lots of sophisticated issues that otherwise are too complicated to solve. It is worthwhile to spend some time to understand what RegEx is and what it can do.

You may try to do the following exercise:

If I have a column that contains IP address, i.e. in the format like 10.211.19.111, or a column that contains email address, can you validate such column to see whether the value is valid with R regular expressions?



sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Jeffrey Yao Jeffrey Yao is a senior SQL Server consultant, striving to automate DBA work as much as possible to have more time for family, life and more automation.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips



Comments For This Article




Wednesday, March 29, 2017 - 10:24:00 AM - Partha Back To Top (53889)

 

Very nice! Thanks for posting.

Partha















get free sql tips
agree to terms