Learn more about SQL Server tools

mssqltips logo
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Whitepapers          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

 

PowerShell and T-SQL Regular Expression Examples for SQL Server


By:   |   Read Comments (1)   |   Related Tips: More > PowerShell

You're invited to this free live MSSQLTips webcast

The Right Database Monitoring Tools Make All the Difference
Tuesday, June 26, 2018 - click here to learn more and to register


Problem

In this tip, I cover some of the basics of RegEx with PowerShell and T-SQL for SQL Server, as well as addressing some of the questions I received for the webcast on protecting confidential data.

Solution

One of my colleagues, who's been coding for more years than I've been alive, once said that every time he had to use RegEx, he had to re-learn RegEx because RegEx was a tool he didn't use consistently. In this tip, I cover some of the basics of RegEx with PowerShell and T-SQL to help readers feel familiar with the basics of this powerful and precise tool when it comes to finding, catching, and sometimes replacing data or patterns within data. Borrowing from the Saxon Math approach to learning, I use many basic queries to help readers see the logic and patterns of how RegEx works with each. Like my colleague, we may have other questions, and with our solid foundation in this tool, we should be able to find our solution.

Since we're concerned about confidential data, RegEx with PowerShell provides us with a tool that can stop confidential data before it enters the database (see article about an application approach to preventing confidential data before entering a database). This solution prevents us from needing to use RegEx in T-SQL, which doesn't offer as much functionality or ease as PowerShell. That being written, in some cases, we may inherit databases that already have confidential data and knowing how to find these data in T-SQL can be useful, in addition that RegEx in T-SQL offers useful data validation - for instance, only accepting U.S. data with a zip code to either be five or nine digits.

Common RegEx Patterns

The below table shows some common PowerShell RegEx patterns with some practical applications of those patterns:

RegEx PatternDescription
[0-9]Any character from 0 through 9.
[^0-9]Any character that is not 0 through 9.
[4-5]Any character from 4 through 5.
[a-z]Any character from a through z.
[^a-z]Any character that is not a through z.
[y-z]Any character from y through z.
[^0-9a-z]Any character that is not a number or letter.
[@!]Either an @ or ! character
\dAny numeric character.
\d{3}Three consecutive numeric characters.
\d{1,3}One to three consecutive numeric characters (would include 1,2 or 3 numeric characters).
\DAny non numeric character.
\D{3}Three consecutive non numeric characters.

RegEx Examples using PowerShell

Let's look at some very simple examples of the above in PowerShell ISE and see how these apply. Keep in mind that when we use -match in PowerShell that it checks for the existence of a match. It only takes one match for something to be true and we should consider that when we're looking for confidential data because when we seek confidential data, we will want to look for items in a pattern.

## Check existence of one character of type ...
"The quick brown fox jumped over the lazy dogs." -match "[a-z]" ## true
"T" -match "[a-z]" ## true
"The quick brown fox jumped over the lazy dogs." -match "[e-g]" ## true
"Let's choose a sentence that doesn't have all letters in the alphabet and return false." -match "[y-z]" ## false
"1 The quick brown fox jumped over the lazy dogs." -match "[0-9]" ## true
"4 The quick brown fox jumped over the lazy dogs." -match "[0-2]" ## false
"4 The quick brown fox jumped over the lazy dogs." -match "[4-5]" ## true
"1freetextcolumn" -match "[^0-9a-z]" ## false
"1 freetextcolumn" -match "[^0-9a-z]" ## true
"1 The quick brown fox jumped over the lazy dogs." -match "[@!]" ## false

## Existence of multiple characters of type ...
"1 The quick brown fox jumped over the lazy dogs." -match "[0-9][0-9][0-9]" ## false
"123 The quick brown fox jumped over the lazy dogs." -match "[0-9][0-9][0-9]" ## true
"<3 TX" -match "[a-z][a-z]" ## true
"<3" -match "[a-z][a-z]" ## false

"1 The quick brown fox jumped over the lazy dogs." -match "\d{3}" ## false
"123 The quick brown fox jumped over the lazy dogs." -match "\d{3}" ## true
"123 The quick brown fox jumped over the lazy dogs." -match "\d{1,3}" ## true
"12 The quick brown fox jumped over the lazy dogs." -match "\d{1,3}" ## true
"<3 TX" -match "[a-z][a-z]" ## true
"<3" -match "[a-z][a-z]" ## false
"The quick brown fox jumped over the lazy dogs" -match "\D" ## true
"It" -match "\D{3}" ## false
"000" -match "\D" ## false

## Let's take what we've learned and see if we can catch (1) an SSN and a (2) credit card
"The quick brown fox 000-00-0000 jumped over the lazy dogs." -match "\d{3}\D\d{2}\D\d{4}" ## true
"The quick brown fox 0000-0000-0000-0000 jumped over the lazy dogs." -match "\d{4}\D\d{4}\D\d{4}\D\d{4}" ## true

In our final two examples, we found both an SSN format and a credit card format, using patterns of consecutive digits with non digits in between them. This will help us especially if we receive text from a form where users can enter in anything and we want to prevent them from entering certain data, or if we're mining files for various patterns.

Since we've determined how to match patterns, we can expand our code to replace these patterns, using -replace. As a note, PowerShell can use .NET's RegEx library (System.Text.RegularExpressions) in an object oriented structure similar to what developers would see if it was coded in C# (see the application RegEx article for an example). In our first examples of replace, we find matches and replace the matches without keeping any of the matches. In the second portion of the example, we keep a portion of the match by wrapping the part of the match we want to keep in parentheses and passing the kept portion as a parameter, demarcated with `$ symbols and wrapped in curly { } brackets. Note that if we want to keep multiple parts of a match (see the final example), each part we keep will be incremented in number.

## Replacing patterns
"123The quick brown fox jumped over the lazy dogs." -replace "[0-9][0-9][0-9]",""
## Result: The quick brown fox jumped over the lazy dogs.
"<3 TX" -replace "[a-z][a-z]","Texas"
## Result: <3 Texas
"Let's remove all the spaces." -replace "[^a-z0-9]",""
## Result: Letsremoveallthespaces
"T1M3R8Y94UII9I7ED5" -replace "\D{3}","_"
## Result: T1M3R8Y94_9I7ED5

## What if we want to keep some of the match?
"It only takes OOO to win tic-tac-toe?  I don't think so!" -replace "[O][O]([O])","XX`${1}"
## Result: It only takes XXO to win tic-tac-toe?
"13 non-numbers!" -replace "\D{9}(\D{4})"," runn`${1}"
## Result: 13 runners!
"AAA11AAAAA1111111111AA" -replace "(\d{2})\d{2}(\d{2})\d{3}(\d{1})","`${1}XX`${2}XXX`${3}"
## Result: AAA11AAAAA11XX11XXX1AA

Finally, we may want to strip certain confidential data from files before we import them. We can use RegEx to find patterns, remove these patterns, and import the remaining data. In the below example, we will take the below four lines, add them to a file called C:\files\ssn.txt and then run code that mines a file for an SSN pattern, line by line, replacing patterns of SSNs in the format of 000-00-0000 with blanks:

ValueOne,ValueTwo,ThreeIsALongerColumn,23.65,8065551212,000-00-0000
ValueOne,ValueTwo,ThreeIsALongerColumn,23.65,8065551212,000-00-0001
ValueOne,ValueTwo,ThreeIsALongerColumn,23.65,8065551212,000-00-0002
ValueOne,ValueTwo,ThreeIsALongerColumn,23.65,8065551212,000-00-0003

Code:

## Code to run after values from above are added to file.txt
$file = "C:\files\ssn.txt"
$ext = $file.Substring($file.LastIndexOf("."))
$loc = $file.Substring(0,($file.LastIndexOf("\")+1))
$name = $file.Substring($file.LastIndexOf("\")+1).Replace($ext,"")

$valid = $loc + $name + "_clean" + $ext

New-Item $valid -ItemType file

$read = New-Object System.IO.StreamReader($file)
$cleaned = New-Object System.IO.StreamWriter($valid)

while (($line = $read.ReadLine()) -ne $null)
{
    $line = $line -replace "\d{3}-\d{2}-\d{4}",""
    $cleaned.WriteLine($line)
    $cleaned.Flush()
}

$read.Close()
$read.Dispose()
$cleaned.Close()
$cleaned.Dispose()

SQL Server T-SQL RegEx Examples

Now that we have some basics of RegEx in PowerShell, let's look at T-SQL and we'll see some similarities. The below table shows some common T-SQL RegEx patterns, as well as some examples with those patterns.

RegEx PatternDescription
[0-9]Any character from 0 through 9.
[^0-9]Any character that is not 0 through 9.
[8-9]Any character from 8 through 9.
[a-z]Any character from a through z.
[^a-z]Any character that is not a through z.
[r-w]Any character from r through w.
[^0-9a-z]Any character that is not a number or letter.
[@-]Either an @ or - character
[!#]Either an ! or # character

Now, let's build a test table in our test environment, populate it with some data, and query these data using RegEx tools. The below queries have commented questions and answers and we can see the patterns of querying with RegEx. For instance, if we use '%[a-z]%, we're looking for any row with any alpha character. Note what returns from the data set and what does not return. The same is true for numbers and symbols.

CREATE TABLE TSQL_RegExTable(
 RegExColumn VARCHAR(100)
)

INSERT INTO TSQL_RegExTable
VALUES ('1 The quick brown fox jumped over the lazy dogs.')
 , ('The quick brown fox jumped over the lazy dogs.')
 , ('This sentence does not have every letter of the alphabet.')
 , ('123777')
 , ('@you')
 , ('[email protected]')
 , ('*and')
 , ('One')
 , ('One1')
 , ('m')
 , ('4')
 , ('-')

-- How many rows have one alpha character?
SELECT RegExColumn AS OneAlpha
FROM TSQL_RegExTable
WHERE RegExColumn LIKE '[a-z]'
-- Answer: 1

-- How many rows have only one alpha character in between h and j?
SELECT RegExColumn HThroughJAlpha
FROM TSQL_RegExTable
WHERE RegExColumn LIKE '[h-j]'
-- Answer: 0

-- How many rows start with any alpha character?
SELECT RegExColumn StartWithAlpha
FROM TSQL_RegExTable
WHERE RegExColumn LIKE '[a-z]%'
-- Answer: 6

-- How many rows have an alpha character somewhere in them?
SELECT RegExColumn AlphaInThem
FROM TSQL_RegExTable
WHERE RegExColumn LIKE '%[a-z]%'
-- Answer: 9

-- How many rows have the alpha character z somewhere in them?
SELECT RegExColumn AlphaHasZ
FROM TSQL_RegExTable
WHERE RegExColumn LIKE '%[z]%'
-- Answer: 2

-- How many rows have do not start with an alpha character?
SELECT RegExColumn NotStartWithAlpha
FROM TSQL_RegExTable
WHERE RegExColumn LIKE '[^a-z]%'
-- Answer: 6

-- How many rows have only one numeric character?
SELECT RegExColumn OneNumber
FROM TSQL_RegExTable
WHERE RegExColumn LIKE '[0-9]'
-- Answer: 1

-- How many rows have only one numeric character between 7 and 8?
SELECT RegExColumn NumberBetweenSevenAndEight
FROM TSQL_RegExTable
WHERE RegExColumn LIKE '[7-8]'
-- Answer: 0

-- How many rows start with any numeric character?
SELECT RegExColumn StartWithNumber
FROM TSQL_RegExTable
WHERE RegExColumn LIKE '[0-9]%'
-- Answer: 3

-- How many rows have any numeric character in them?
SELECT RegExColumn NumberInThem
FROM TSQL_RegExTable
WHERE RegExColumn LIKE '%[0-9]%'
-- Answer: 4

-- How many rows have do not start with a numeric character?
SELECT RegExColumn NotStartWithNumber
FROM TSQL_RegExTable
WHERE RegExColumn LIKE '[^0-9]%'
-- Answer: 9

-- How many rows start with the character @ or -?
SELECT RegExColumn StartWithAtOrMinusCharacter
FROM TSQL_RegExTable
WHERE RegExColumn LIKE '[@-]%'
-- Answer: 2

-- How many rows have an @ or - character in them?
SELECT RegExColumn AtOrMinusCharacterInThem
FROM TSQL_RegExTable
WHERE RegExColumn LIKE '%[@-]%'
-- Answer: 3

-- How many rows have neither one alpha nor one numeric character?
SELECT RegExColumn OneCharacterNotAlphaNotNumeric
FROM TSQL_RegExTable
WHERE RegExColumn LIKE '[^0-9a-z]'
-- Answer: 1

DROP TABLE TSQL_RegExTable

Now that we see the basics, let's apply the above logic to SSNs, as an example of finding confidential data using RegEx. In the below example, we will assume that SSNs are either stored as 9 characters of nine numeric characters in a row, or 11 characters in a pattern of three numbers, a non number, two numbers, a non number, and four numbers:

CREATE TABLE SSNTable(
 SSN VARCHAR(11),
 TextField VARCHAR(500)
)

INSERT INTO SSNTable
VALUES ('000-00-0000','The quick brown fox 000-00-0000 jumped over the lazy dogs.')
 , ('000-00-0001','000-00-0001 The quick brown fox jumped over the lazy dogs.')
 , ('000-00-0002',' The quick brown fox jumped over the lazy dogs. 000-00-0002')
 , ('000000003',' The quick brown fox jumped over the 000.00.0002 lazy dogs.')

-- How many SSN values of nine digits in a row?
SELECT SSN
FROM SSNTable
WHERE SSN LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'
-- Answer: 1

-- How many SSNs with three digits, a dash, two digits, a dash, and four digits?
SELECT *
FROM SSNTable
WHERE SSN LIKE '[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]'
-- Answer: 3

-- How many TextFields with three digits, a dash, two digits, a dash, and four digits only?
SELECT *
FROM SSNTable
WHERE TextField LIKE '[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]'
-- Answer: 0

-- How many TextFields with three digits, a dash, two digits, a dash, and four digits somewhere in them?
SELECT *
FROM SSNTable
WHERE TextField LIKE '%[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]%'
-- Answer: 3

-- How many TextFields with three digits, a non digit, two digits, a non digit, and four digits somewhere in them?
SELECT  *
FROM SSNTable
WHERE TextField LIKE '%[0-9][0-9][0-9][^0-9][0-9][0-9][^0-9][0-9][0-9][0-9][0-9]%'
-- Answer: 4

DROP TABLE SSNTable

At this point, we've been able to identify what values might be (or have a high probability of being) SSNs (and we could adjust for other confidential data). If we wanted to remove these values, we could either delete them or insert the opposite results into a different table and drop the existing table. A few T-SQL developers will tell you that T-SQL isn't a Regex-Replace language, even though this is possible in T-SQL and a little more complex than some developers may want. If you can keep confidential data, like SSNs, from ever reaching your data set, that will save you time. The below code will replace an SSN in the pattern of 000-00-000 or in a pattern of 000.00.0000:

SELECT TextField
 , REPLACE(TextField,SUBSTRING(TextField,PATINDEX('%[0-9][0-9][0-9][^0-9][0-9][0-9][^0-9][0-9][0-9][0-9][0-9]%',TextField),11),'')
FROM SSNTable
WHERE TextField LIKE '%[0-9][0-9][0-9][^0-9][0-9][0-9][^0-9][0-9][0-9][0-9][0-9]%'

We look for the specific pattern, catch it where we find a match, and replace it with nothing. While this works with this pattern, if we were to use similar logic with nine numeric characters in a row, depending on the text field, it may replace a phone number, account number, or some other number exceeding nine digits. In that case, the logic would only need to identify a numeric character nine digits in a row, and in cases like this, we would find we could get higher performance by stopping it on the front end.

What source control tools permit you to use RegEx?

RegEx can be done with most scripting or object oriented languages, for instance, the below Python code will remove 000-00-0000 formatted SSNs from a string:

import re

def stripSSNs(str):
   str = re.sub("\d{3}-\d{2}-\d{4}","",str)
   return str;

print stripSSNs("The quick brown 000-00-0000 fox jumped over the lazy dogs.")

Any source control tool that permits the use of scripting and/or object oriented languages will provide you with access to RegEx. If you want to identify possible variables or stores of confidential data in your actual code (for instance, a stored procedure calling an SSN column), you can loop through all of your local files from source control with code searching for those keywords. The below PowerShell function, as an example, does this with the folder "C:\Path\" and the pattern of "Social" keyword:

Function FindWord ($githubfolder, $pattern)
{
    $cnt = Get-ChildItem $githubfolder -Include @("*.sql", "*.cs", "*.xml", "*.ps1") -Recurse | Select-String -pattern $pattern
    if ($cnt.Count -gt 0)
    {
        $cnt
    }
}

FindWord -githubfolder "C:\Path\" -pattern "Social"

Can you use RegEx with SSIS?

Yes, SSIS allows for both T-SQL (Execute SQL Task) and C# (Script Task).

Have you found a way to check if just one [piece of confidential data] exists and stop instead of it continuing to scan the entire table?

Yes, I'll use SSNs as an example; note how in the first query, I check if one exists, and return SSN Found, and in the other I'm selecting everything (the total is 2001 values in the example - if we keep adding values, we'll really see the performance between the two separate). Turn on the Execution Plan and see the results for yourself:

CREATE TABLE SSNTable(
 TextField VARCHAR(500)
)

INSERT INTO SSNTable
VALUES ('The quick brown fox jumped over the lazy dogs.')
GO 1000

INSERT INTO SSNTable
VALUES ('The quick brown fox 000-00-0000 jumped over the lazy dogs.')
 
INSERT INTO SSNTable
VALUES ('The quick brown fox jumped over the lazy dogs.')
GO 1000

SELECT *
FROM SSNTable

CREATE NONCLUSTERED INDEX IX_TextSSN ON SSNTable (TextField)

SET STATISTICS IO ON
SET STATISTICS TIME ON
SET NOCOUNT ON

PRINT 'Does it exist?'

IF EXISTS(
 SELECT 1
 FROM SSNTable
 WHERE TextField LIKE '%[0-9][0-9][0-9][^0-9][0-9][0-9][^0-9][0-9][0-9][0-9][0-9]%'
)
SELECT 'SSN Found'

PRINT 'Let''s select the whole thing, shall we?'

SELECT TextField
FROM SSNTable
WHERE TextField LIKE '%[0-9][0-9][0-9][^0-9][0-9][0-9][^0-9][0-9][0-9][0-9][0-9]%'


SET STATISTICS IO OFF
SET STATISTICS TIME OFF
SET NOCOUNT OFF
DROP TABLE SSNTable

Results from a query-to-query comparison:

Does it exist? (20% of Actual Execution Plan)

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
Table 'SSNTable'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 69 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

Let's select the whole thing, shall we? (80% of Actual Execution Plan)

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
Table 'SSNTable'. Scan count 1, logical reads 18, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 15 ms,  elapsed time = 60 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
Next Steps
  • Test the PowerShell RegEx.
  • Test the T-SQL RegEx.
  • Consider what part of your applications will RegEx be very useful and how can you re-use the logic.


Last Update:


next webcast button


next tip button



About the author
MSSQLTips author Tim Smith Tim Smith works as a DBA and developer and also teaches Automating ETL on Udemy.

View all my tips
Related Resources





Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Email me updates 


Signup for our newsletter
 I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.



    



Tuesday, January 13, 2015 - 12:32:05 AM - Darek Back To Top

Hi there. Thanks for the article. One remark... If you use StreamReaders, you don't have to call both Close and Dispose. Close calls Dispose under the hood. Cheers.


Learn more about SQL Server tools