Learn more about SQL Server tools

mssqltips logo
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Whitepapers          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

Free SQL Server Webcast > Building Really Fast SQL Server VMs
 

SQL Server Regular Expressions for Data Validation and Cleanup


By:   |   Read Comments (10)   |   Related Tips: 1 | 2 | More > Functions - User Defined UDF

Problem

You need to provide data validation at the server level for complex strings like phone numbers, email addresses, etc. You may also need to do data cleanup / standardization before moving it from source to target. Although SQL Server provides a fair number of string functions, the code developed with these built-in functions can become complex and hard to maintain or reuse.

Solution

.NET Base Class Library provides the System.Text.RegularExpressions namespace which contains classes with access to the .NET Framework regular expressions engine. Using these classes you can use .NET languages in addition to the T-SQL programming language to create database objects like stored procedures and functions to operate on strings to retrieve and update data in SQL Server databases.

For example, a regular expression can be applied to a text field, to search for a first pattern, then modify the text field to conform to a second pattern. Regular expressions use a language of their own, specifically designed for string processing, with some of the most popular listed below:

  • Internet Email: \w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*
  • Internet URL: http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?
  • US Phone Number: ((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}
  • US Social Security Number: \d{3}-\d{2}-\d{4}
  • US ZIP Code: \d{5}(-\d{4})?
  • URI: \b(\S+)://([^:]+)(?::(\S+))?\b

Describing how these regular expressions are constructed and work is beyond the scope of this article.

 

In the tip presented here we will show the code necessary to develop a SQL CLR C# function that validates an input string, and in the case of finding a match, outputs a transformed string. (To learn more about creating a CLR function, refer to this tip.) To be more concrete, the function takes an input string believed to contain a character representation of a phone number; if the assumption is true, the function will output a string in a compact form, containing only the digits.

Although pretty straightforward, the code sample below has comments to improve clarity. 

This function takes three parameters:

  • InputString
  • MatchPattern
  • ReplacementPattern
using Microsoft.SqlServer.Server;
using System.Data.SqlTypes;
// namespace to work with regular expressions
using System.Text.RegularExpressions;
public class cls_RegularExpressions
{
  [SqlFunction]
  public static SqlString ReplaceMatch(
    SqlString InputString,
    SqlString MatchPattern,
    SqlString ReplacementPattern)
  {
    try
    {
      // input parameters must not be NULL
      if (!InputString.IsNull &&
          !MatchPattern.IsNull &&
          !ReplacementPattern.IsNull)
      {
      // check for first pattern match
      if (Regex.IsMatch(InputString.Value,
                        MatchPattern.Value))
        // match found, replace using second pattern and return result
        return Regex.Replace(InputString.Value,
                            MatchPattern.Value,
                            ReplacementPattern.Value);
      else
        // match not found, return NULL
        return SqlString.Null;
      }
      else
        // if any input paramater is NULL, return NULL
        return SqlString.Null;
    }
    catch
    {
      // on any error, return NULL
      return SqlString.Null;
    }
  }
};

A sample SQL Server run is shown below, where we pass in the phone number, the match pattern and the replacement pattern.  The first two examples pass the conversion, but the third example fails because of two "))" after the area code.

select dbo.ReplaceMatch
('(129).673-4192', '^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$', '$1$2$3')
go;
returned value: 1296734192
--------------------------
select dbo.ReplaceMatch
('(129.673-4192', '^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$', '$1$2$3')
go;
returned value: 1296734192
--------------------------
select dbo.ReplaceMatch
('(129)).673-4192', '^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$', '$1$2$3')
go;
returned value: NULL
--------------------

The advantages of the ReplaceMatch function are straightforward as it can be used inline of SQL statements and with a whole spectrum of regular expressions within your SQL Server (including Express/Compact) environment. You could also create a second function that will only check for a match and return a boolean value, without doing any pattern replacement.

Regular expressions provide a very powerful text-processing technology that every Developer/DBA should be at least familiar with.

Our tip here just attempts to get you started writing your own SQL CLR objects embedding this technology.

Next Steps
  • Compile, deploy, and use the ReplaceMatch function; enhance it to suit your needs.
  • Explore other uses of the ReplaceMatch function with different regular expressions.
  • Include using regular expressions and CLR in your SQL Server development toolkit.
  • Check other SQL CLR tips on this site.


Last Update:


next webcast button


next tip button



About the author





Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Email me updates 


Signup for our newsletter
 I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.



    



Wednesday, December 30, 2015 - 11:01:13 AM - Oded Dror Back To Top

Hi there,

 I created the DLL but can you send the function dbo.ReplaceMatch you created

Thanks,

Ed Dror

 

 

 


Monday, March 14, 2011 - 4:30:11 PM - Eric Back To Top

Alex,

That is an informative article.  Regular expressions can be very useful for validation and even information extraction.  The nice thing about SQL Server is that the .NET framework can be utilized in so many different places.  I wrote a regular expression component for SQL Server Integration Services (SSIS) that can validate fields and extract backreferences from within a SSIS data flow.   If you are interested check out:

http://regextractor.codeplex.com

The RegExtractor is free and open source.  Enjoy!  But please provide me with feedback!

Eric


Thursday, March 03, 2011 - 4:41:03 AM - amol Back To Top

Nice article ...

Thanks,
-Amol
- http://www.sqlsuperfast.com

 


Thursday, March 03, 2011 - 3:09:28 AM - Dishdy Back To Top

AdamK,
This is a very nice web site regarding regular expressions and it is now in my favorites.
Do you know of any sources which does a similar job for SQL Server's PATINDEX function?
Thnks.


Thursday, March 03, 2011 - 2:56:23 AM - AdamK Back To Top

Regexes for email and URL are wrong. Full email regex is much more complicated (http://www.regular-expressions.info/email.html). URL does not have to start with http.


Thursday, March 03, 2011 - 2:42:48 AM - Dishdy Back To Top

So you're saying there is no equivalent PATINDEX of your list for emails, phone numbers, etc.?


Wednesday, February 09, 2011 - 2:40:23 PM - Alex Tocitu Back To Top

John,

Thanks for your comments.

Do you have any benchmarks and regex patterns that you've used against large sets of data?

What were you trying to accomplish?

Alex


Wednesday, February 09, 2011 - 2:07:14 PM - Alex Tocitu Back To Top

Mbourdon,
Thanks for your question;
Please call the function with parameters below
InputString; your string,
MatchPattern: '[\x00-\x1F]',
ReplacementPattern: ''
The function will return NULL
or a cleaned output if InputString contains ASCII 0-31 or 00-1F (HEX)


Wednesday, February 09, 2011 - 12:08:31 PM - John Fox Back To Top
Regular expressions can be great tools for data validation and manipulation. But they can also be CPU hogs, it takes CPU horsepower to evaluate expressions, especially as they become more and more complex. I'm not suggesting to not use them, but to be careful using them and watch your CPU overhead if they are used against large sets of data. Remember that they will be evaluated on the database machine, not on the application server itself. Consider the resource requirements and the optimal server placement, sometimes it might make more sense to place them on an application or web server instead of in the database. Here is a good starting article on how to use them. Be forewarned, there are many different regex engines, and they don't all work the same. Be sure to check the CLR documentation for specifics. http://www.regular-expressions.info/tutorial.html

Wednesday, February 09, 2011 - 11:17:41 AM - Mbourgon Back To Top

Alex, thanks for that.  How would you change that code (or maybe just how it's called) so that it would strip out, say, any low-ascii characters (under ASCII(31))?  I'd like to take a large string, pass it in, and get out clean data.  Everything but low-ascii would count as good.  Thanks.


Learn more about SQL Server tools