Regular Expressions Business Case Examples with T-SQL
By: Tim Smith
From credit cards to identifiers, regular expressions can simplify finding information in specific formats. In this final part, we’ll look at a few applied examples of using regular expressions to uncover information within our data. We will combine everything we’ve learned to look at validating date and time information within varchar data, credit card numbers within varchar data, urls and email addresses within varchar data.
Populate Special Use Case Data
We will start by creating a new table and populate it with new data, while also adding some of our previous alphabetic data:
CREATE TABLE tbBizExamples( BizVarchar VARCHAR(1000) ) ---- New data: while this links to do exist now, do not click on any link - these are examples only INSERT INTO tbBizExamples VALUES ('2017-01-01 03:01:30.700') , ('2017-01-01') , ('2017-01-01 03:01') , ('http://dontexist.url/se2639de') , ('http://dontexist.url/dkek2284') , ('http://dontexist.url/82jdj392') , ('[email protected]') , ('[email protected]') , ('[email protected]') , ('4444 4444 4444 4444') , ('4444-4444-4444-4444') , ('4444x4444x4444x4444') , ('4444144441444414444') ---- Existing values from alphareg INSERT INTO tbBizExamples SELECT Alphabetic FROM alphareg
Using Regex to Find Valid Date Patterns with T-SQL
In our first example, we’ll look for dates in the format starting with YYYYXMMXDD where X is a separator of some kind, such as a space, dash, letter, or anything except a number. We’re also excluding dates that either don’t currently exist, or didn’t exist since computers have existed (in other words, 999 A.D. is not a valid date, since computers didn’t exist then).
SELECT * FROM tbBizExamples WHERE BizVarchar LIKE '[1-2][0-9][0-9][0-9][^0-9][0-1][0-9][^0-9][0-3][0-9]%'
Finding dates in the following format of YYYYSeparatorMMSeparatorDDPossibleEnd
Notice some of the patterns we restricted based on the tools we’ve learned so far:
- The year format of YYYY must start with either a 1 or a 2. This is because at this point in human history, this is the only range of dates that computers have existed (1900s, 2000s).
- The month format of MM must start with either a 0 or 1. This is because we have twelve months, so the lowest month is 01 and the highest month is 12.
- The day format of DD must start with a 0 through 3 range. This is because the lowest day possible is 01 and the highest day is 31.
- Since numbers are not valid separators in this example, we exclude them ([^0-9]).
- Finally, we may or may not be seeking the specific time, so we allow any character at the end (%), so that we also get values which may have the exact time included.
Depending on what format we consider valid, we would adjust our query or add to it, as we may be seeking multiple formats. It’s also possible that we want to look at dates such as 900 A.D. and if this was included in our criteria, we would make the appropriate adjustments to include a format such as YYY-DD-MM or another derivative.
Using Regex to Find Valid Credit Card Patterns with T-SQL
In the next example, we’ll apply regular expressions to look for a 16 digit credit card that begins with a number from 1 to 6 as a possibility follows the format of a separator between each set of 4 digits, with the separator not being a number.
SELECT * FROM tbBizExamples WHERE BizVarchar LIKE '[1-6][0-9][0-9][0-9][^0-9][0-9][0-9][0-9][0-9][^0-9][0-9][0-9][0-9][0-9][^0-9][0-9][0-9][0-9][0-9]'
Three results from our query filtering for certain formatted credit cards.
The 19 character limit means that any number larger or smaller than this character limit is excluded. In addition, we exclude any number where a separator is a number, as that could be a 19-digit number rather than a credit card number. Depending on our rules, we can make appropriate adjustments to our query, as some credit cards may follow a specific format (while still being of the same length). We look at another possibility in these cases, by first ensuring the length is 19 using the LEN function and requiring that the first three characters are numbers, with the first being 1 through 6 allowed, with the last three characters also being numbers, and in the middle there is at least three sets of characters that are also numbers.
SELECT * FROM tbBizExamples WHERE LEN(BizVarchar) = 19 AND BizVarchar LIKE '[1-6][0-9][0-9]%[0-9][0-9][0-9]%[0-9][0-9][0-9]'
Four results from query.
Depending on what possibilities are allowed, this could assist us with formats of cards that may start, end, or contain a set of 3 digits somewhere in the card number, while preventing numbers that are too long or too short.
Using Regex to Find Valid SSN Patterns with T-SQL
We could also follow the same pattern as above using something like this to look for SSNs:
SELECT * FROM tbBizExamples WHERE LEN(BizVarchar) = 11 AND BizVarchar LIKE '%[0-9][0-9][0-9][-][0-9][0-9][-][0-9][0-9][0-9][0-9]%'
Using Regex to Find Valid Phone Number Patterns with T-SQL
Something like this could be used to check valid US and Canada phone numbers.
SELECT * FROM tbBizExamples WHERE REPLACE(BizVarchar,' ','') LIKE '%[0-9][0-9][0-9][^0-9][0-9][0-9][0-9][^0-9][0-9][0-9][0-9][0-9]%' OR REPLACE(BizVarchar,' ','') LIKE '%[0-9][0-9][0-9][0-9][0-9][0-9][^0-9][0-9][0-9][0-9][0-9]%' OR REPLACE(BizVarchar,' ','') LIKE '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%'
Using Regex to Find Valid URL Patterns with T-SQL
Suppose that we wanted to block all messages with URLs that follow a format of alphanumericcombodoturlslashalphanumericcombo, which looks like something.url/something. We’re only looking at filtering these combinations here, as people could try other techniques to get around these rules (like using spaces or actually spelling out the words or even misspelling them while knowing users won’t catch it initially).
SELECT * FROM tbBizExamples WHERE BizVarchar LIKE '%[A-Z0-9][.][A-Z0-9]%[A-Z0-9][/][A-Z0-9]%'
Three results from query filtering for URLs.
In combinations with shorter URLs of 2 characters or longer URLs, like 4 characters, this would identify these as well. We’re intentionally not filtering with the http:// because this is often excluded when some try to get people to click on a link. It’s also possible that we exclude some legitimate messages that have this type of a combination for a reason like formatting their phone number a certain way (using a dot for area code and a / for the second separator), but being more strict may be something we prefer, even if excludes a few legitimate values.
Using Regex to Find Valid Email Patterns with T-SQL
Finally, if we want to find emails in the format of alphanumericcomboatsymbolalphanumericcombodoturl, which looks like [email protected], we can use the following filter which allows for specific combination of special characters at specific points and alpha numeric combinations:
SELECT * FROM tbBizExamples WHERE BizVarchar LIKE '%[A-Z0-9][@][A-Z0-9]%[.][A-Z0-9]%'
Three results from query.
Similar to URLs, people may choose to write their email addresses in creative ways, using actual words over characters. If these would be considered “legitimate” for our business rules, we would extend our rules to include this, such as something like the below query to catch an email using something [at] something.com or something (at symbol) something.com.
INSERT INTO tbBizExamples VALUES ('dontexistemail1 [at] emaildomain.url') , ('dontexistemail1 (at symbol) emaildomain.url') SELECT * FROM tbBizExamples WHERE BizVarchar LIKE '%[A-Z0-9][^A-Z0-9]%at%[^A-Z0-9][A-Z0-9]%[.][A-Z0-9]%'
Adding two values that may be additional combinations people use, but might include sentences in some cases
This leads to one forgotten point involving identifying data, such as emails: if the purpose of identifying data causes disruptions, we can expect users to change their behavior to avoid this. The above case might return sentences that aren’t emails and this is exactly what users want – they want us to get bad data that aren’t emails too. Why consumers do this is to protect their email from disruptions, so the reason why we’re finding emails may be important – we may be the direct cause of people using protective techniques.
- When applying regular expressions to business rules, we may have multiple possibilities for each rule so it’s important to detail what we will allow. For an example, if we want to find any possible URL written in any possible way, this could be an enormous effort, as people may spell out certain words, misspell certain words intentionally, use special characters for other things, use secret words that only the participants know (such as aaa for a slash and aab for a dot), etc.
- Remember that if we’re using data, such as emails or identifiers, for purposes of what a user considers disruptive (example: spam), we should expect that users will start adjusting their techniques, like with email, identifiers or url combinations. This is because we’re using tools to disrupt users, which causes them to change their behavior. The point here is that regular expressions can be helpful if serving useful purposes for users, but if the purpose is to disrupt users, they will adjust their technique in ways that invalidate our rules. This is one (of many) reasons people format their email in diverse ways.