Learn more about SQL Server tools

mssqltips logo
giveaway
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Whitepapers          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

 

SQL Server 2016 Regular Expressions with the R Language


By:   |   Read Comments (1)   |   Related Tips: More > SQL Server 2016

Attend a SQL Server Conference for FREE >> click to learn more


Problem

It is well-known that T-SQL does not have native Regular Expression (RegEx) support for string related queries.  For example, you cannot search records with a regular expression in WHERE clause. I know I can write a CLR function / stored procedure with a RegEx do the work, but is there any other way that I can query a table with regular expressions in T-SQL code?

Solution

SQL Server 2016 has integrated the R language, i.e. T-SQL can call an R script directly and return the result set consumable by the SQL Server database engine directly. With the power of the R language and its native support for RegEx, we can now query a table with RegEx through R.

We will look at a few examples to demonstrate how to use RegEx in processing queries against tables.

Here are a few requirements:

  1. Find records with phone numbers which may have different formats
  2. Extract those phone numbers
  3. De-duplicate words that are in sequence, such as "hello hello world world" should be "hello world" after de-duplication.
  4. Merge multiple new lines to one line

Test Environment Setup

We will create a table with a few sample records as shown below:

use TestDB -- assume this is our testing database
go
if object_id('dbo.tblRegEx', 'U') is not null
 drop table dbo.tblRegEx

create table dbo.tblRegEx (id int identity, a varchar(300), b  varchar(300) );
go
insert into dbo.tblRegEx (a, b)
values ('hello hello hello world',  'my phone number is 321-111-1111')
, ( 'this this is is really fun','this number (604) 991-9111 is my cell phone')
, ( 'no duplicate here', 'no phone number here either, just my license# 111 111 2222')
, ( 'multiple blank lines
  
 
--this is 6th line', '222 333-4444 is my cell phone');

This is a very simple table with 3 columns, 2 of which are of varchar data type and will be used for our RegEx operation.

sample data

Quick Explanation of Test Data

  1. First two records (id=1, 2) have their [a] column with duplicate words.
  2. Three records have a phone number in the [b] column, but with different formats, i.e. 321-111-1111, (604) 991-9111 and 222 333-4444.
  3. For id=3 record, in its [b] column, we have a string of "111 111 2222", which looks like a phone number, but it is not as it misses a hyphen between 111 and 2222.
  4. The [a] column for the id = 4 record, has a value with 4 continuous empty lines, see the below screenshot with "Results to Text" turned on.
two emtpy lines in the data

T-SQL Code with R

RegEx is a powerful tool yet hard to master, when used properly, it provides an elegant and efficient solution. In the R language, RegEx is a maturely implemented feature which we can take advantage of.

We will first list the T-SQL code and the execution results, and then a detailed explanation of each section of the script.

USE TestDB 

--1. Find the records that contains phone number
exec sp_execute_external_script @language=N'R'
, @script = N'
pattern = "\\b\\(?\\d{3}\\)?[-\\s]\\d{3}-\\d{4}\\b"
outData <- subset(inData, grepl(pattern, b, perl = T))'
, @input_data_1 = N'select id, a, b from dbo.tblRegEx'
, @input_data_1_name = N'inData'
, @output_data_1_name=N'outData'
with result sets ( as object dbo.tblRegEx);

--2. Extract the phone number
exec sp_execute_external_script @language=N'R'
, @script = N'
pattern =  "(\\(?\\d{3}\\)?[-\\s]\\d{3}-\\d{4}\\b)";
m <- regexpr(pattern, inData$b, perl = T)
n <- rep(NA,length(inData$b))
n[m != -1] <- regmatches(inData$b, m)
inData$c <- n;
outData <- inData;'
, @input_data_1 = N'select id, a, b from dbo.tblRegEx'
, @input_data_1_name = N'inData'
, @output_data_1_name=N'outData'
with result sets ( (id int, a varchar(300), b varchar(300), phone varchar(20)) );


-- 3. Remove duplicate words
exec sp_execute_external_script @language=N'R'
, @script = N'
pattern <-"\\b(\\w+\\s*)(\\1\\s*)+";
inData$a <- gsub(pattern, "\\1", inData$a, perl = T );
outData <- inData;'
, @input_data_1 = N'select id, a, b from dbo.tblRegEx'
, @input_data_1_name = N'inData'
, @output_data_1_name=N'outData'
with result sets ( as object dbo.tblRegEx);


-- 4. Merge multiple empty lines to one line
exec sp_execute_external_script @language=N'R'
, @script = N'
pattern = "(?m)(\\s*\\r?\\n){2,}"; 
# m <- regexpr(pattern, x , perl = T)
inData$a <- gsub(pattern, "\r\n\r\n", inData$a, perl = T );
outData <- inData;'
, @input_data_1 = N'select id, a, b from dbo.tblRegEx'
, @input_data_1_name = N'inData'
, @output_data_1_name=N'outData'
with result sets ( as object dbo.tblRegEx);

Each T-SQL statement does two things:

  1. Read the table dbo.tblRegEx and pass the result set into R via the inData variable.
  2. Inside the R script, do the RegEx operation on the dataset of inData.

RegEx Explained

The following is a brief explanation of RegEx (i.e. pattern variable) used in the script

Note, in R, for a regular backslash \ in other language, such as C#, it needs to be doubled in R, i.e. \\.

  1. \\b\\(?\\d{3}\\)?[-\\s]\\d{3}-\\d{4}\\b: phone number format
    • \\b: word boundary
    • \\(?: character ( once or none as ? means once or none
    • \\d{3}: followed by 3 digits
    • \\)?: followed by character ) once or none as ? means once or none
    • [-\\s]: followed by either a hyphen - or a blank space, i.e. \\s
    • \\d{3}: followed by 3 digits
    • -: followed by a hyphen -
    • \\d{4}: followed by 4 digits
    • \\b: word boundary
  2. (\\(?\\d{3}\\)?[-\\s]\\d{3}-\\d{4}\\b): extract the phone number. This is almost identical to previous one, the outside () is to capture the match
  3. \\b(\\w+\\s*)(\\1\\s*)+: repeatable words format
    • \\b: word boundary
    • (\\w+\\s*): a word followed by one or more spaces
    • (\\1\\s*): \\1 means the exact copy of previous finding by (\\w+\\s*)
    • +: means repeating one or more times
  4. (?m)(\\s*\\r?\\n){2,}: two or more continuous blank lines
    • (?m): set the string to multiple line mode, so each line is analyzed via the RegEx pattern
    • (\\s*\\r?\\n): \\s*: zero or more blank spaces, \\r is carriage return, ? means 0 or 1 time, \\n is newline
    • {2,}:the previous match will be repeated one or more times.

Query Result

The following is the result of running the four R embedded T-SQL statements

query results

We can see four data results just as expected:

  • First data result, we get exactly 3 records that have phone numbers.
  • Second result, we indeed extract phone numbers out, and we add an additional column [phone] for these phone numbers.
  • Third result, we eliminate the duplicate words.
  • Fourth result, we merge the multiple empty lines to one, and this is not very clear, so we will switch to "Results to Text" to look at it

To look at the result of id=4 row for merge multiple empty lines to one line, we need to first dump the processed records into a temp table and then select the row with id = 4.

USE TestDB
select * into #t from dbo.tblRegEx where 1=0

insert into #t (id, a, b)
exec sp_execute_external_script @language=N'R'
, @script = N'
pattern = "(?m)(\\s*\\r?\\n){2,}"; 
inData$a <- gsub(pattern, "\r\n\r\n", inData$a, perl = T );
outData <- inData;'
, @input_data_1 = N'select id, a, b from dbo.tblRegEx'
, @input_data_1_name = N'inData'
, @output_data_1_name=N'outData'
--with result sets ( as object dbo.tblRegEx);
go
-- turn on "Results to Text" option
select a from #t where id=4
select a from dbo.tblRegEx where id =4

The result is as follows

merged empty lines

Summary

With the help of embedded R in SQL Server 2016, we can now take advantage of regular expressions to process some string related query and manipulation operations. This saves the overhead to write CLR objects to achieve the same result. This can be very handy for some business cases.

In this tip, we look at some examples about using RegEx in R, and hope this will help you to broaden your view on what R integration in SQL Server 2016 can help us in some niche scenarios.

Next Steps

Understanding RegEx can help solve lots of sophisticated issues that otherwise are too complicated to solve. It is worthwhile to spend some time to understand what RegEx is and what it can do.

You may try to do the following exercise:

If I have a column that contains IP address, i.e. in the format like 10.211.19.111, or a column that contains email address, can you validate such column to see whether the value is valid with R regular expressions?



Last Update:


signup button

next tip button



About the author
MSSQLTips author Jeffrey Yao Jeffrey Yao is a senior SQL Server consultant, striving to automate DBA work as much as possible to have more time for family, life and more automation.

View all my tips
Related Resources





Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Notify for updates 


SQL tips:

*Enter Code refresh code     



Wednesday, March 29, 2017 - 10:24:00 AM - Partha Back To Top

 

Very nice! Thanks for posting.

Partha


Learn more about SQL Server tools