By: Tim Smith | Comments (1) | Related: > PowerShell
Problem
In this tip, I cover some of the basics of RegEx with PowerShell and T-SQL for SQL Server, as well as addressing some of the questions I received for the webcast on protecting confidential data.
Solution
One of my colleagues, who's been coding for more years than I've been alive, once said that every time he had to use RegEx, he had to re-learn RegEx because RegEx was a tool he didn't use consistently. In this tip, I cover some of the basics of RegEx with PowerShell and T-SQL to help readers feel familiar with the basics of this powerful and precise tool when it comes to finding, catching, and sometimes replacing data or patterns within data. Borrowing from the Saxon Math approach to learning, I use many basic queries to help readers see the logic and patterns of how RegEx works with each. Like my colleague, we may have other questions, and with our solid foundation in this tool, we should be able to find our solution.
Since we're concerned about confidential data, RegEx with PowerShell provides us with a tool that can stop confidential data before it enters the database (see article about an application approach to preventing confidential data before entering a database). This solution prevents us from needing to use RegEx in T-SQL, which doesn't offer as much functionality or ease as PowerShell. That being written, in some cases, we may inherit databases that already have confidential data and knowing how to find these data in T-SQL can be useful, in addition that RegEx in T-SQL offers useful data validation - for instance, only accepting U.S. data with a zip code to either be five or nine digits.
Common RegEx Patterns
The below table shows some common PowerShell RegEx patterns with some practical applications of those patterns:
RegEx Pattern | Description |
[0-9] | Any character from 0 through 9. |
[^0-9] | Any character that is not 0 through 9. |
[4-5] | Any character from 4 through 5. |
[a-z] | Any character from a through z. |
[^a-z] | Any character that is not a through z. |
[y-z] | Any character from y through z. |
[^0-9a-z] | Any character that is not a number or letter. |
[@!] | Either an @ or ! character |
\d | Any numeric character. |
\d{3} | Three consecutive numeric characters. |
\d{1,3} | One to three consecutive numeric characters (would include 1,2 or 3 numeric characters). |
\D | Any non numeric character. |
\D{3} | Three consecutive non numeric characters. |
RegEx Examples using PowerShell
Let's look at some very simple examples of the above in PowerShell ISE and see how these apply. Keep in mind that when we use -match in PowerShell that it checks for the existence of a match. It only takes one match for something to be true and we should consider that when we're looking for confidential data because when we seek confidential data, we will want to look for items in a pattern.
## Check existence of one character of type ... "The quick brown fox jumped over the lazy dogs." -match "[a-z]" ## true "T" -match "[a-z]" ## true "The quick brown fox jumped over the lazy dogs." -match "[e-g]" ## true "Let's choose a sentence that doesn't have all letters in the alphabet and return false." -match "[y-z]" ## false "1 The quick brown fox jumped over the lazy dogs." -match "[0-9]" ## true "4 The quick brown fox jumped over the lazy dogs." -match "[0-2]" ## false "4 The quick brown fox jumped over the lazy dogs." -match "[4-5]" ## true "1freetextcolumn" -match "[^0-9a-z]" ## false "1 freetextcolumn" -match "[^0-9a-z]" ## true "1 The quick brown fox jumped over the lazy dogs." -match "[@!]" ## false ## Existence of multiple characters of type ... "1 The quick brown fox jumped over the lazy dogs." -match "[0-9][0-9][0-9]" ## false "123 The quick brown fox jumped over the lazy dogs." -match "[0-9][0-9][0-9]" ## true "<3 TX" -match "[a-z][a-z]" ## true "<3" -match "[a-z][a-z]" ## false "1 The quick brown fox jumped over the lazy dogs." -match "\d{3}" ## false "123 The quick brown fox jumped over the lazy dogs." -match "\d{3}" ## true "123 The quick brown fox jumped over the lazy dogs." -match "\d{1,3}" ## true "12 The quick brown fox jumped over the lazy dogs." -match "\d{1,3}" ## true "<3 TX" -match "[a-z][a-z]" ## true "<3" -match "[a-z][a-z]" ## false "The quick brown fox jumped over the lazy dogs" -match "\D" ## true "It" -match "\D{3}" ## false "000" -match "\D" ## false ## Let's take what we've learned and see if we can catch (1) an SSN and a (2) credit card "The quick brown fox 000-00-0000 jumped over the lazy dogs." -match "\d{3}\D\d{2}\D\d{4}" ## true "The quick brown fox 0000-0000-0000-0000 jumped over the lazy dogs." -match "\d{4}\D\d{4}\D\d{4}\D\d{4}" ## true
In our final two examples, we found both an SSN format and a credit card format, using patterns of consecutive digits with non digits in between them. This will help us especially if we receive text from a form where users can enter in anything and we want to prevent them from entering certain data, or if we're mining files for various patterns.
Since we've determined how to match patterns, we can expand our code to replace these patterns, using -replace. As a note, PowerShell can use .NET's RegEx library (System.Text.RegularExpressions) in an object oriented structure similar to what developers would see if it was coded in C# (see the application RegEx article for an example). In our first examples of replace, we find matches and replace the matches without keeping any of the matches. In the second portion of the example, we keep a portion of the match by wrapping the part of the match we want to keep in parentheses and passing the kept portion as a parameter, demarcated with `$ symbols and wrapped in curly { } brackets. Note that if we want to keep multiple parts of a match (see the final example), each part we keep will be incremented in number.
## Replacing patterns "123The quick brown fox jumped over the lazy dogs." -replace "[0-9][0-9][0-9]","" ## Result: The quick brown fox jumped over the lazy dogs. "<3 TX" -replace "[a-z][a-z]","Texas" ## Result: <3 Texas "Let's remove all the spaces." -replace "[^a-z0-9]","" ## Result: Letsremoveallthespaces "T1M3R8Y94UII9I7ED5" -replace "\D{3}","_" ## Result: T1M3R8Y94_9I7ED5 ## What if we want to keep some of the match? "It only takes OOO to win tic-tac-toe? I don't think so!" -replace "[O][O]([O])","XX`${1}" ## Result: It only takes XXO to win tic-tac-toe? "13 non-numbers!" -replace "\D{9}(\D{4})"," runn`${1}" ## Result: 13 runners! "AAA11AAAAA1111111111AA" -replace "(\d{2})\d{2}(\d{2})\d{3}(\d{1})","`${1}XX`${2}XXX`${3}" ## Result: AAA11AAAAA11XX11XXX1AA
Finally, we may want to strip certain confidential data from files before we import them. We can use RegEx to find patterns, remove these patterns, and import the remaining data. In the below example, we will take the below four lines, add them to a file called C:\files\ssn.txt and then run code that mines a file for an SSN pattern, line by line, replacing patterns of SSNs in the format of 000-00-0000 with blanks:
ValueOne,ValueTwo,ThreeIsALongerColumn,23.65,8065551212,000-00-0000 ValueOne,ValueTwo,ThreeIsALongerColumn,23.65,8065551212,000-00-0001 ValueOne,ValueTwo,ThreeIsALongerColumn,23.65,8065551212,000-00-0002 ValueOne,ValueTwo,ThreeIsALongerColumn,23.65,8065551212,000-00-0003
Code:
## Code to run after values from above are added to file.txt $file = "C:\files\ssn.txt" $ext = $file.Substring($file.LastIndexOf(".")) $loc = $file.Substring(0,($file.LastIndexOf("\")+1)) $name = $file.Substring($file.LastIndexOf("\")+1).Replace($ext,"") $valid = $loc + $name + "_clean" + $ext New-Item $valid -ItemType file $read = New-Object System.IO.StreamReader($file) $cleaned = New-Object System.IO.StreamWriter($valid) while (($line = $read.ReadLine()) -ne $null) { $line = $line -replace "\d{3}-\d{2}-\d{4}","" $cleaned.WriteLine($line) $cleaned.Flush() } $read.Close() $read.Dispose() $cleaned.Close() $cleaned.Dispose()
SQL Server T-SQL RegEx Examples
Now that we have some basics of RegEx in PowerShell, let's look at T-SQL and we'll see some similarities. The below table shows some common T-SQL RegEx patterns, as well as some examples with those patterns.
RegEx Pattern | Description |
[0-9] | Any character from 0 through 9. |
[^0-9] | Any character that is not 0 through 9. |
[8-9] | Any character from 8 through 9. |
[a-z] | Any character from a through z. |
[^a-z] | Any character that is not a through z. |
[r-w] | Any character from r through w. |
[^0-9a-z] | Any character that is not a number or letter. |
[@-] | Either an @ or - character |
[!#] | Either an ! or # character |
Now, let's build a test table in our test environment, populate it with some data, and query these data using RegEx tools. The below queries have commented questions and answers and we can see the patterns of querying with RegEx. For instance, if we use '%[a-z]%, we're looking for any row with any alpha character. Note what returns from the data set and what does not return. The same is true for numbers and symbols.
CREATE TABLE TSQL_RegExTable( RegExColumn VARCHAR(100) ) INSERT INTO TSQL_RegExTable VALUES ('1 The quick brown fox jumped over the lazy dogs.') , ('The quick brown fox jumped over the lazy dogs.') , ('This sentence does not have every letter of the alphabet.') , ('123777') , ('@you') , ('[email protected]') , ('*and') , ('One') , ('One1') , ('m') , ('4') , ('-') -- How many rows have one alpha character? SELECT RegExColumn AS OneAlpha FROM TSQL_RegExTable WHERE RegExColumn LIKE '[a-z]' -- Answer: 1 -- How many rows have only one alpha character in between h and j? SELECT RegExColumn HThroughJAlpha FROM TSQL_RegExTable WHERE RegExColumn LIKE '[h-j]' -- Answer: 0 -- How many rows start with any alpha character? SELECT RegExColumn StartWithAlpha FROM TSQL_RegExTable WHERE RegExColumn LIKE '[a-z]%' -- Answer: 6 -- How many rows have an alpha character somewhere in them? SELECT RegExColumn AlphaInThem FROM TSQL_RegExTable WHERE RegExColumn LIKE '%[a-z]%' -- Answer: 9 -- How many rows have the alpha character z somewhere in them? SELECT RegExColumn AlphaHasZ FROM TSQL_RegExTable WHERE RegExColumn LIKE '%[z]%' -- Answer: 2 -- How many rows have do not start with an alpha character? SELECT RegExColumn NotStartWithAlpha FROM TSQL_RegExTable WHERE RegExColumn LIKE '[^a-z]%' -- Answer: 6 -- How many rows have only one numeric character? SELECT RegExColumn OneNumber FROM TSQL_RegExTable WHERE RegExColumn LIKE '[0-9]' -- Answer: 1 -- How many rows have only one numeric character between 7 and 8? SELECT RegExColumn NumberBetweenSevenAndEight FROM TSQL_RegExTable WHERE RegExColumn LIKE '[7-8]' -- Answer: 0 -- How many rows start with any numeric character? SELECT RegExColumn StartWithNumber FROM TSQL_RegExTable WHERE RegExColumn LIKE '[0-9]%' -- Answer: 3 -- How many rows have any numeric character in them? SELECT RegExColumn NumberInThem FROM TSQL_RegExTable WHERE RegExColumn LIKE '%[0-9]%' -- Answer: 4 -- How many rows have do not start with a numeric character? SELECT RegExColumn NotStartWithNumber FROM TSQL_RegExTable WHERE RegExColumn LIKE '[^0-9]%' -- Answer: 9 -- How many rows start with the character @ or -? SELECT RegExColumn StartWithAtOrMinusCharacter FROM TSQL_RegExTable WHERE RegExColumn LIKE '[@-]%' -- Answer: 2 -- How many rows have an @ or - character in them? SELECT RegExColumn AtOrMinusCharacterInThem FROM TSQL_RegExTable WHERE RegExColumn LIKE '%[@-]%' -- Answer: 3 -- How many rows have neither one alpha nor one numeric character? SELECT RegExColumn OneCharacterNotAlphaNotNumeric FROM TSQL_RegExTable WHERE RegExColumn LIKE '[^0-9a-z]' -- Answer: 1 DROP TABLE TSQL_RegExTable
Now that we see the basics, let's apply the above logic to SSNs, as an example of finding confidential data using RegEx. In the below example, we will assume that SSNs are either stored as 9 characters of nine numeric characters in a row, or 11 characters in a pattern of three numbers, a non number, two numbers, a non number, and four numbers:
CREATE TABLE SSNTable( SSN VARCHAR(11), TextField VARCHAR(500) ) INSERT INTO SSNTable VALUES ('000-00-0000','The quick brown fox 000-00-0000 jumped over the lazy dogs.') , ('000-00-0001','000-00-0001 The quick brown fox jumped over the lazy dogs.') , ('000-00-0002',' The quick brown fox jumped over the lazy dogs. 000-00-0002') , ('000000003',' The quick brown fox jumped over the 000.00.0002 lazy dogs.') -- How many SSN values of nine digits in a row? SELECT SSN FROM SSNTable WHERE SSN LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]' -- Answer: 1 -- How many SSNs with three digits, a dash, two digits, a dash, and four digits? SELECT * FROM SSNTable WHERE SSN LIKE '[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]' -- Answer: 3 -- How many TextFields with three digits, a dash, two digits, a dash, and four digits only? SELECT * FROM SSNTable WHERE TextField LIKE '[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]' -- Answer: 0 -- How many TextFields with three digits, a dash, two digits, a dash, and four digits somewhere in them? SELECT * FROM SSNTable WHERE TextField LIKE '%[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]%' -- Answer: 3 -- How many TextFields with three digits, a non digit, two digits, a non digit, and four digits somewhere in them? SELECT * FROM SSNTable WHERE TextField LIKE '%[0-9][0-9][0-9][^0-9][0-9][0-9][^0-9][0-9][0-9][0-9][0-9]%' -- Answer: 4 DROP TABLE SSNTable
At this point, we've been able to identify what values might be (or have a high probability of being) SSNs (and we could adjust for other confidential data). If we wanted to remove these values, we could either delete them or insert the opposite results into a different table and drop the existing table. A few T-SQL developers will tell you that T-SQL isn't a Regex-Replace language, even though this is possible in T-SQL and a little more complex than some developers may want. If you can keep confidential data, like SSNs, from ever reaching your data set, that will save you time. The below code will replace an SSN in the pattern of 000-00-000 or in a pattern of 000.00.0000:
SELECT TextField , REPLACE(TextField,SUBSTRING(TextField,PATINDEX('%[0-9][0-9][0-9][^0-9][0-9][0-9][^0-9][0-9][0-9][0-9][0-9]%',TextField),11),'') FROM SSNTable WHERE TextField LIKE '%[0-9][0-9][0-9][^0-9][0-9][0-9][^0-9][0-9][0-9][0-9][0-9]%'
We look for the specific pattern, catch it where we find a match, and replace it with nothing. While this works with this pattern, if we were to use similar logic with nine numeric characters in a row, depending on the text field, it may replace a phone number, account number, or some other number exceeding nine digits. In that case, the logic would only need to identify a numeric character nine digits in a row, and in cases like this, we would find we could get higher performance by stopping it on the front end.
What source control tools permit you to use RegEx?
RegEx can be done with most scripting or object oriented languages, for instance, the below Python code will remove 000-00-0000 formatted SSNs from a string:
import re def stripSSNs(str): str = re.sub("\d{3}-\d{2}-\d{4}","",str) return str; print stripSSNs("The quick brown 000-00-0000 fox jumped over the lazy dogs.")
Any source control tool that permits the use of scripting and/or object oriented languages will provide you with access to RegEx. If you want to identify possible variables or stores of confidential data in your actual code (for instance, a stored procedure calling an SSN column), you can loop through all of your local files from source control with code searching for those keywords. The below PowerShell function, as an example, does this with the folder "C:\Path\" and the pattern of "Social" keyword:
Function FindWord ($githubfolder, $pattern) { $cnt = Get-ChildItem $githubfolder -Include @("*.sql", "*.cs", "*.xml", "*.ps1") -Recurse | Select-String -pattern $pattern if ($cnt.Count -gt 0) { $cnt } } FindWord -githubfolder "C:\Path\" -pattern "Social"
Can you use RegEx with SSIS?
Yes, SSIS allows for both T-SQL (Execute SQL Task) and C# (Script Task).
Have you found a way to check if just one [piece of confidential data] exists and stop instead of it continuing to scan the entire table?
Yes, I'll use SSNs as an example; note how in the first query, I check if one exists, and return SSN Found, and in the other I'm selecting everything (the total is 2001 values in the example - if we keep adding values, we'll really see the performance between the two separate). Turn on the Execution Plan and see the results for yourself:
CREATE TABLE SSNTable( TextField VARCHAR(500) ) INSERT INTO SSNTable VALUES ('The quick brown fox jumped over the lazy dogs.') GO 1000 INSERT INTO SSNTable VALUES ('The quick brown fox 000-00-0000 jumped over the lazy dogs.') INSERT INTO SSNTable VALUES ('The quick brown fox jumped over the lazy dogs.') GO 1000 SELECT * FROM SSNTable CREATE NONCLUSTERED INDEX IX_TextSSN ON SSNTable (TextField) SET STATISTICS IO ON SET STATISTICS TIME ON SET NOCOUNT ON PRINT 'Does it exist?' IF EXISTS( SELECT 1 FROM SSNTable WHERE TextField LIKE '%[0-9][0-9][0-9][^0-9][0-9][0-9][^0-9][0-9][0-9][0-9][0-9]%' ) SELECT 'SSN Found' PRINT 'Let''s select the whole thing, shall we?' SELECT TextField FROM SSNTable WHERE TextField LIKE '%[0-9][0-9][0-9][^0-9][0-9][0-9][^0-9][0-9][0-9][0-9][0-9]%' SET STATISTICS IO OFF SET STATISTICS TIME OFF SET NOCOUNT OFF DROP TABLE SSNTable
Results from a query-to-query comparison:
Does it exist? (20% of Actual Execution Plan) SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms. Table 'SSNTable'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 0 ms, elapsed time = 69 ms. SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms. Let's select the whole thing, shall we? (80% of Actual Execution Plan) SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms. Table 'SSNTable'. Scan count 1, logical reads 18, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 15 ms, elapsed time = 60 ms. SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 0 ms. SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.
Next Steps
- Test the PowerShell RegEx.
- Test the T-SQL RegEx.
- Consider what part of your applications will RegEx be very useful and how can you re-use the logic.
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips