SQL Server Regex CLR Function


By:   |   Updated: 2020-08-20   |   Comments (5)   |   Related: More > User Defined Functions


Problem

You would like to extend the capability of a scalar function to be apply to apply regex expressions in your SQL Server database. Advantages to implementing this in a scalar function can help utilize regex search patterns to identify, clean and parse existing SQL Server data.

Solution

For this solution we will be utilizing C# to build an assembly and load it into our SQL Server database. Once the assembly is loaded into the database, we can create a scalar function that will use the logic we create to apply our regex expressions. We will be utilizing the csc.exe .Net compiler as a lightweight means of converting our source code into dll’s.

Exercise Files

You can download these files here.

  • Compile.bat
  • Regex Evaluate.sql
  • Regex Init.sql
  • RegexEvaluator.cs
  • RegexEvaluator.dll (will be generated)

Application Steps

  1. Check .Net Version
  2. Update Compile.bat
  3. Examine RegexEvaluator
  4. Compile RegexEvaluator
  5. Load and Test Assembly with Regex Init.Sql
  6. Run simple operations
  7. Run complex operations

1 - Check .Net Version

First thing we need to establish is the .Net version running on our workstation. This will be used to compile our C# code into a dll assembly which we will load into our SQL Server database. To do this navigate to the following file path on your windows machine: C:\Windows\Microsoft.NET\Framework. Here we can see the various versions of .NET installed on the work station. For this example, I will use 4.0.30319.

,net library folders

2 - Update Compile.bat

Now that we know exactly which version of .Net we are going to use, we can update the bat file that will be used to compile the assembly. Open the Compile.bat file and update the file path to the folder version of your .Net framework.

compile.bat file

The csc.exe is the c sharp compiler exe that will compile the source code into a .dll, that is what we will load into our SQL Server database in step #4.

3 - Examine the RegexEvaluator.cs script

Before we compile this code let’s take a quick look at the contents.

using System; 
using System.Collections.Generic; 
using System.Text.RegularExpressions; 
  
public class RegexEvaluator 
{ 
   public static string EvaluateRegex(string pattern, string evalString) 
   { 
     Regex rg = new Regex(pattern);  
     string retval = ""; 
     MatchCollection matches = rg.Matches(evalString);  
        for (int count = 0; count < matches.Count; count++) 
        { 
        retval +=  matches[count].Value;  
        } 
     return retval; 
   } 
}

Here we have a very simple C# method EvaulateRegex, which accepts two string parameters a regex pattern (pattern) and a string to evaluate with that pattern (evalString). Using the Regular Expressions Library, we can apply the pattern to the string. The results from that match are then iterated over and each match is appended to the string value "retval". An example executable version of this code can be found here.

4 - Compile RegexEvaluator to RegexEvaulator.dll

Now that we understand the logic inside of our script, we can use the bat file in the exercise files to compile the contents of script. Run "Compile.bat" by double clicking the bat file and you should see RegexEvaluator.dll generate in your working directory. This is what we will load into our SQL server database.

5 - Examine and Run

Next step we will use the Regex Init.sql script to create a database and load the newly generated assembly into. There are some security concepts that need to be taken into account if your loading CLRs into a production environment which you can read more about here.

For a bit more context on these permission levels you can read the Microsoft doc page here. I have summarized the security concerns below.

  • When creating assemblies in EXTERNAL_ACCESS or UNSAFE it is recommended that the assembly be signed in the master database, a login created and external access to the assembly granted to that login. Additionally, that the TRUSTWORTHY database property not be set ON. This does complicate the implementation process a bit but it protects against security context impersonation concerns.
  • When code in an assembly runs under the SAFE permission set, it can only do computation and data access within the server through the in-process managed provider.
CREATE Database RegexCLRExample;
GO

USE [RegexCLRExample];
GO

CREATE ASSEMBLY RegexEvaluator from 'C:\MSSQL Tips\Creating a Regex CLR Scalar Function in SQLServer\RegexEvaluator.dll' WITH PERMISSION_SET = SAFE  --CHANGE THIS TO YOUR DIRECTORY
GO 

CREATE FUNCTION RegexEvaluator(@pattern NVARCHAR(MAX), @evalString NVARCHAR(MAX)) 
RETURNS NVARCHAR(MAX)   
AS EXTERNAL NAME RegexEvaluator.RegexEvaluator.EvaluateRegex; 
--           ASSEMBLY.CLASS___.METHOD
GO  

The first two segments of this code create a test database and assign the context for our session.

Next, we create the Assembly from the dll file we compiled earlier, we set the permission to SAFE, which is the most restrictive permission set possible. Other options include External_Access and UNSAFE, additional details for these parameters can be found here.

Finally, we create a function called RegexEvaluator, which again like our C# implementation accepts two parameters @pattern and @evalString. A quick note that mapping exact data types between C# and SQL Server can seem a bit tedious, for example if you change the parameters to VARCHAR(MAX) you will get an error complaining about data type mismatches between the assembly and function. A guide can be found here which does a good job of explaining the data mapping between the two environments. I made a comment to drive home the naming convention when loading the assembly.

6 - Run simple operations

Now let’s execute some single scalar function executions of our new regex scalar function. See each code execution with commented explanations below.

SELECT DBO.RegexEvaluator('\d','123abc') as EvaluatedString -- ALL DIGITS FROM STRING 
SELECT DBO.RegexEvaluator('\w{3} \d{2} \d{4}','Mon Oct 01 2018 00:00:00 GMT-0400 (Eastern Daylight Time)') as EvaluatedString --HAS 3 WORD CHARS (SPACE) 2 DIGITS (SPACE) 4 DIGITS 
SELECT DBO.RegexEvaluator('(ok){3,}','okokok! cya') as EvaluatedString --HAS 3 OR MORE REPETITIONS OF "ok" 
GO 

7 - Run complex operations

Lastly let’s apply our new scalar function in a more complex manner, in the following example I will load a temp table with the pattern, evalString and an explanation column to return a data set showing more executions of our new function. This is a more traditional usage of a scalar function where you are passing data stored in your database tables to the function.

IF OBJECT_ID('TEMPDB..#REGEX_TESTS') IS NOT NULL DROP TABLE #REGEX_TESTS 
CREATE TABLE #REGEX_TESTS 
( 
   ID INT IDENTITY(1,1), 
   Pattern NVARCHAR(MAX), 
   EvalString NVARCHAR(MAX), 
   Explanation NVARCHAR(MAX) 
) 
  
INSERT INTO #REGEX_TESTS 
VALUES('(?<=The )(.*?)(?=fox)','The red fox','Extract Text between two string values, in this case between The(space) and fox') 
  
INSERT INTO #REGEX_TESTS 
VALUES('[0-9]','1asd2fds','Extract numbers from a string of text') 
  
INSERT INTO #REGEX_TESTS 
VALUES('\d\d\D\d\d\D\d\d\d\d','XX7354734506-11-2015ASXfdsfds','Matches a digit digit -non digit digit digit non digit digit digit digit digit string, in this case a mm-dd-yyyy date field') 
  
INSERT INTO #REGEX_TESTS 
VALUES('[aeiou]','abcdefghij','Extract only vowels from a string of letters') 
  
INSERT INTO #REGEX_TESTS 
VALUES('[a-l]','caayt','Extract only letters that fall between a and l') 
  
INSERT INTO #REGEX_TESTS 
VALUES('^\d\w{4}.$','0qwer.','Find exactly 4 word characters starting with a digit character ending in a period') 
  
SELECT EvalString as OriginalString, Pattern, DBO.RegexEvaluator(Pattern,EvalString) AS Results, Explanation FROM #REGEX_TESTS 
  
DROP TABLE #REGEX_TESTS 
query results

The last execution in Regex Evaluate.sql is an example of parsing phone numbers with the new scalar function to remove punctuation and return strictly the 10-digit phone number.

IF OBJECT_ID('TEMPDB..#REGEX_PHONENUMS') IS NOT NULL DROP TABLE #REGEX_PHONENUMS 
CREATE TABLE #REGEX_PHONENUMS 
( 
 ID INT IDENTITY(1,1), 
 Pattern NVARCHAR(MAX), 
 EvalString NVARCHAR(MAX), 
 Explanation NVARCHAR(MAX) 
) 
  
INSERT INTO #REGEX_PHONENUMS 
VALUES('\d','(203)899-0231','Simple digit extract') 
  
INSERT INTO #REGEX_PHONENUMS 
VALUES('\d','203-899-0231','Simple digit extract') 
  
INSERT INTO #REGEX_PHONENUMS 
VALUES('\d','2038990231','Simple digit extract') 
  
INSERT INTO #REGEX_PHONENUMS 
VALUES('\d','(203) 899-0231','Simple digit extract') 
  
INSERT INTO #REGEX_PHONENUMS 
VALUES('\d','203 899 0231','Simple digit extract') 
  
INSERT INTO #REGEX_PHONENUMS 
VALUES('\d','203.899.0231','Simple digit extract') 
  
INSERT INTO #REGEX_PHONENUMS 
VALUES('((\d{3,}(?=.)))','203.899.0231x123','Digit extract with negative lookahead to exclude periods') 
  
INSERT INTO #REGEX_PHONENUMS 
VALUES('((\d{3,}(?=.)))','203.899.0231 x123','Digit extract with negative lookahead to exclude periods') 
  
INSERT INTO #REGEX_PHONENUMS 
VALUES('((\d{3,}(?=.)))','(203)899-0231x123','Digit extract with negative lookahead to exclude periods') 
  
SELECT EvalString as OriginalString,Pattern,DBO.RegexEvaluator(Pattern,EvalString) AS Results, Explanation FROM #REGEX_PHONENUMS 
query results

That’s it! To recap we wrote some source code in C#, saved it to a .cs file, compiled that .cs file into a .dll utilizing csc.exe. Then we loaded that dll into SQL server and built a scalar function from it.

Next Steps


Last Updated: 2020-08-20


get scripts

next tip button



About the author
MSSQLTips author Ian Fogelman Ian Fogelman is professional DBA with a hunger and interest for data science and application development. Ian enjoys leveraging cloud-based technologies and data to solve and automate tasks. Ian holds a MS in computer science with a concentration in software development. Ian is also a member of his local SQL Server user group and a member of his local data science user group in Columbus Georgia.

View all my tips



Comments For This Article




Tuesday, December 01, 2020 - 9:47:53 PM - Dave Maiden Back To Top (87868)
I've implemented this pattern in both SQL and SSRS numerous times and forgetting the security implications I don't understand why you return a string (also forgetting the fact you don't use the StringBuilder class) when you can return a collection of key value pairs or just values (with an optional index - something the string_split function in t-sql could do with for sorting). If you need any further assistance please feel free to get in touch and I'll be happy to help.

Saturday, August 29, 2020 - 10:25:51 AM - Izhar Azati Back To Top (86389)
Hello Ian,
The SQL 2017/2019 version is not a future version!, if you are not an expert in certificate signed (which cost money) it is impossible to add CLR to the server, in Azure the situation is even tougher.
"Microsoft recommends that all assemblies be signed by a certificate or asymmetric key with a corresponding login that has been granted UNSAFE ASSEMBLY permission in the master database." see: "CLR strict security" in https://docs.microsoft.com/en-us/sql/t-sql/statements/create-assembly-transact-sql?view=sql-server-ver15

Thursday, August 27, 2020 - 12:15:49 PM - Ian Back To Top (86370)
Good Afternoon Izhar Azati,
Are you referring to the security model detailed in the post or the CLR integrations in future releases of SQL server in general?

Saturday, August 22, 2020 - 4:18:11 AM - Izhar Azati Back To Top (86345)
In the latest versions it is no longer practical to use CLR, the security requirements are impractical and Microsoft has actually killed CLR in SQL.

Thursday, August 20, 2020 - 2:30:56 AM - Grzegorz Lyp Back To Top (86333)
Much more useful is CLR function that returns table with all matches and position in match collection.


download





Recommended Reading

Valid Email Address Check with TSQL














get free sql tips
agree to terms