By: Ian Fogelman | Comments (5) | Related: > Functions User Defined UDF
Problem
You would like to extend the capability of a scalar function to be apply to apply regex expressions in your SQL Server database. Advantages to implementing this in a scalar function can help utilize regex search patterns to identify, clean and parse existing SQL Server data.
Solution
For this solution we will be utilizing C# to build an assembly and load it into our SQL Server database. Once the assembly is loaded into the database, we can create a scalar function that will use the logic we create to apply our regex expressions. We will be utilizing the csc.exe .Net compiler as a lightweight means of converting our source code into dll’s.
Exercise Files
You can download these files here.
- Compile.bat
- Regex Evaluate.sql
- Regex Init.sql
- RegexEvaluator.cs
- RegexEvaluator.dll (will be generated)
Application Steps
- Check .Net Version
- Update Compile.bat
- Examine RegexEvaluator
- Compile RegexEvaluator
- Load and Test Assembly with Regex Init.Sql
- Run simple operations
- Run complex operations
1 - Check .Net Version
First thing we need to establish is the .Net version running on our workstation. This will be used to compile our C# code into a dll assembly which we will load into our SQL Server database. To do this navigate to the following file path on your windows machine: C:\Windows\Microsoft.NET\Framework. Here we can see the various versions of .NET installed on the work station. For this example, I will use 4.0.30319.
2 - Update Compile.bat
Now that we know exactly which version of .Net we are going to use, we can update the bat file that will be used to compile the assembly. Open the Compile.bat file and update the file path to the folder version of your .Net framework.
The csc.exe is the c sharp compiler exe that will compile the source code into a .dll, that is what we will load into our SQL Server database in step #4.
3 - Examine the RegexEvaluator.cs script
Before we compile this code let’s take a quick look at the contents.
using System; using System.Collections.Generic; using System.Text.RegularExpressions; public class RegexEvaluator { public static string EvaluateRegex(string pattern, string evalString) { Regex rg = new Regex(pattern); string retval = ""; MatchCollection matches = rg.Matches(evalString); for (int count = 0; count < matches.Count; count++) { retval += matches[count].Value; } return retval; } }
Here we have a very simple C# method EvaulateRegex, which accepts two string parameters a regex pattern (pattern) and a string to evaluate with that pattern (evalString). Using the Regular Expressions Library, we can apply the pattern to the string. The results from that match are then iterated over and each match is appended to the string value "retval". An example executable version of this code can be found here.
4 - Compile RegexEvaluator to RegexEvaulator.dll
Now that we understand the logic inside of our script, we can use the bat file in the exercise files to compile the contents of script. Run "Compile.bat" by double clicking the bat file and you should see RegexEvaluator.dll generate in your working directory. This is what we will load into our SQL server database.
5 - Examine and Run
Next step we will use the Regex Init.sql script to create a database and load the newly generated assembly into. There are some security concepts that need to be taken into account if your loading CLRs into a production environment which you can read more about here.
For a bit more context on these permission levels you can read the Microsoft doc page here. I have summarized the security concerns below.
- When creating assemblies in EXTERNAL_ACCESS or UNSAFE it is recommended that the assembly be signed in the master database, a login created and external access to the assembly granted to that login. Additionally, that the TRUSTWORTHY database property not be set ON. This does complicate the implementation process a bit but it protects against security context impersonation concerns.
- When code in an assembly runs under the SAFE permission set, it can only do computation and data access within the server through the in-process managed provider.
CREATE Database RegexCLRExample; GO USE [RegexCLRExample]; GO CREATE ASSEMBLY RegexEvaluator from 'C:\MSSQL Tips\Creating a Regex CLR Scalar Function in SQLServer\RegexEvaluator.dll' WITH PERMISSION_SET = SAFE --CHANGE THIS TO YOUR DIRECTORY GO CREATE FUNCTION RegexEvaluator(@pattern NVARCHAR(MAX), @evalString NVARCHAR(MAX)) RETURNS NVARCHAR(MAX) AS EXTERNAL NAME RegexEvaluator.RegexEvaluator.EvaluateRegex; -- ASSEMBLY.CLASS___.METHOD GO
The first two segments of this code create a test database and assign the context for our session.
Next, we create the Assembly from the dll file we compiled earlier, we set the permission to SAFE, which is the most restrictive permission set possible. Other options include External_Access and UNSAFE, additional details for these parameters can be found here.
Finally, we create a function called RegexEvaluator, which again like our C# implementation accepts two parameters @pattern and @evalString. A quick note that mapping exact data types between C# and SQL Server can seem a bit tedious, for example if you change the parameters to VARCHAR(MAX) you will get an error complaining about data type mismatches between the assembly and function. A guide can be found here which does a good job of explaining the data mapping between the two environments. I made a comment to drive home the naming convention when loading the assembly.
6 - Run simple operations
Now let’s execute some single scalar function executions of our new regex scalar function. See each code execution with commented explanations below.
SELECT DBO.RegexEvaluator('\d','123abc') as EvaluatedString -- ALL DIGITS FROM STRING SELECT DBO.RegexEvaluator('\w{3} \d{2} \d{4}','Mon Oct 01 2018 00:00:00 GMT-0400 (Eastern Daylight Time)') as EvaluatedString --HAS 3 WORD CHARS (SPACE) 2 DIGITS (SPACE) 4 DIGITS SELECT DBO.RegexEvaluator('(ok){3,}','okokok! cya') as EvaluatedString --HAS 3 OR MORE REPETITIONS OF "ok" GO
7 - Run complex operations
Lastly let’s apply our new scalar function in a more complex manner, in the following example I will load a temp table with the pattern, evalString and an explanation column to return a data set showing more executions of our new function. This is a more traditional usage of a scalar function where you are passing data stored in your database tables to the function.
IF OBJECT_ID('TEMPDB..#REGEX_TESTS') IS NOT NULL DROP TABLE #REGEX_TESTS CREATE TABLE #REGEX_TESTS ( ID INT IDENTITY(1,1), Pattern NVARCHAR(MAX), EvalString NVARCHAR(MAX), Explanation NVARCHAR(MAX) ) INSERT INTO #REGEX_TESTS VALUES('(?<=The )(.*?)(?=fox)','The red fox','Extract Text between two string values, in this case between The(space) and fox') INSERT INTO #REGEX_TESTS VALUES('[0-9]','1asd2fds','Extract numbers from a string of text') INSERT INTO #REGEX_TESTS VALUES('\d\d\D\d\d\D\d\d\d\d','XX7354734506-11-2015ASXfdsfds','Matches a digit digit -non digit digit digit non digit digit digit digit digit string, in this case a mm-dd-yyyy date field') INSERT INTO #REGEX_TESTS VALUES('[aeiou]','abcdefghij','Extract only vowels from a string of letters') INSERT INTO #REGEX_TESTS VALUES('[a-l]','caayt','Extract only letters that fall between a and l') INSERT INTO #REGEX_TESTS VALUES('^\d\w{4}.$','0qwer.','Find exactly 4 word characters starting with a digit character ending in a period') SELECT EvalString as OriginalString, Pattern, DBO.RegexEvaluator(Pattern,EvalString) AS Results, Explanation FROM #REGEX_TESTS DROP TABLE #REGEX_TESTS
The last execution in Regex Evaluate.sql is an example of parsing phone numbers with the new scalar function to remove punctuation and return strictly the 10-digit phone number.
IF OBJECT_ID('TEMPDB..#REGEX_PHONENUMS') IS NOT NULL DROP TABLE #REGEX_PHONENUMS CREATE TABLE #REGEX_PHONENUMS ( ID INT IDENTITY(1,1), Pattern NVARCHAR(MAX), EvalString NVARCHAR(MAX), Explanation NVARCHAR(MAX) ) INSERT INTO #REGEX_PHONENUMS VALUES('\d','(203)899-0231','Simple digit extract') INSERT INTO #REGEX_PHONENUMS VALUES('\d','203-899-0231','Simple digit extract') INSERT INTO #REGEX_PHONENUMS VALUES('\d','2038990231','Simple digit extract') INSERT INTO #REGEX_PHONENUMS VALUES('\d','(203) 899-0231','Simple digit extract') INSERT INTO #REGEX_PHONENUMS VALUES('\d','203 899 0231','Simple digit extract') INSERT INTO #REGEX_PHONENUMS VALUES('\d','203.899.0231','Simple digit extract') INSERT INTO #REGEX_PHONENUMS VALUES('((\d{3,}(?=.)))','203.899.0231x123','Digit extract with negative lookahead to exclude periods') INSERT INTO #REGEX_PHONENUMS VALUES('((\d{3,}(?=.)))','203.899.0231 x123','Digit extract with negative lookahead to exclude periods') INSERT INTO #REGEX_PHONENUMS VALUES('((\d{3,}(?=.)))','(203)899-0231x123','Digit extract with negative lookahead to exclude periods') SELECT EvalString as OriginalString,Pattern,DBO.RegexEvaluator(Pattern,EvalString) AS Results, Explanation FROM #REGEX_PHONENUMS
That’s it! To recap we wrote some source code in C#, saved it to a .cs file, compiled that .cs file into a .dll utilizing csc.exe. Then we loaded that dll into SQL server and built a scalar function from it.
Next Steps
- Check out these other tips regarding CLRs:
- Try some of your favorite regex patterns to apply with this function
- Manipulate the C# code to return different result sets, experiment with different implementations of the basic concept
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips