Precise Alphabetic Filtering With Regular Expressions for T-SQL
By: Tim Smith
In our first part of alphabetic filtering with regular expressions, we learned that we can specify a range of characters and a specific length of characters which will return either that range and (or) length. We specify the range using brackets with the alphabetic characters of the range on each side and we can specify the length by stacking the ranges side-by-side.
In the below query, as a review, we are only looking for data rows with 2 alphabetic characters in the range of A through D for each character.
SELECT * FROM alphareg WHERE Alphabetic LIKE '[A-D][A-D]'
If a data row has three characters, it will not return. Likewise, if one of the alphabetic characters is an X, U or M, it won’t return, even if it has 2 alphabetic characters. In this next exercise, we’ll look at more precise filtering with alphabetic characters, especially if we need to do combinations that may only involve one character.
Regex to Find Specific Text Pattern
In some cases, we may want to filter on a range after a specific character, such as first character being a T and the second character being either an A or W. If we want the second character to be either an A or W, we don’t want to enter a range as that may return characters we don’t want, so we can specify both of these characters as an option side-by-side without the range using an hyphen; for instance, [AW] means the “second character is either an A or W.”
In the below query we do this and we’re only looking at data rows that start with these, so this query does not ask for a specific length.
SELECT * FROM alphareg WHERE Alphabetic LIKE '[T][AW]%'
Two results from the first query.
Two of our data rows have alphabetic characters that start with a T and have either an A or W that follow. We can also apply this by allowing any starting or ending characters, such as what we do in the below code example.
More Complex Regex to Find Specific Text Patterns
Now, we’re looking for a combination of words – searching for TWO and THREE and similar combinations by stating that:
- Any characters are allowed at first (initial %),
- Then a T is required [T] in the second character ,
- Then a H or W is required [HW] in third character,
- Then a O or R is required [OR] in fourth character,
- Then a E or space is required [E ] in fifth character,
- And then any character is allowed after that (last %).
SELECT * FROM alphareg WHERE Alphabetic LIKE '%[T][HW][OR][E ]%'
One result from the second query.
Since the data row TWO does not have a space following the O, it does not return. This latter query only returns one result, as only one result meets this criterion.
What if we wanted to take the second query and we require that the first character had to be an A, while allowing anything else while also requiring that we find a two or three with anything else also ending the string?
The below third query is almost identical to the second query, except that it requires that our first character is an A (which no data rows meets the criterion).
SELECT * FROM alphareg WHERE Alphabetic LIKE '[A]%[T][HW][OR][E ]%'
No results from the third query.
Like the third query above this, we can specify combinations with regular expressions, such as “give me data rows with the first two characters being alphabetic [A-Z][A-Z], then any combination %, then two bs in a row [B][B], with any other combination %.
In the below query, we run a similar query where we allow:
- Any alphabetic combination in the first character [A-Z]
- With any combination after that %
- And then require a T character [T]
- Followed by either an H or W [HW]
- Followed by any other character %.
SELECT * FROM alphareg WHERE Alphabetic LIKE '[A-Z]%[T][HW]%'
Seven results from the fourth query.
As we can see, any sentence starting with an alphabetic character and containing either the word “the”, “this”, or “two” returns. What we’ve seen is that we can specify exact characters over ranges, if we need to, with combinations of any character following or preceding (%). In the next section, we’ll look at applying case sensitivity with alphabetic characters.
- Two of our data rows from the fourth query have “this” in them with “this” being a word in the sentence, not starting the sentence. How would you write the fourth query to only return “this” as a word in a sentence (not beginning it) using any combination of alphabetic regular expressions we’ve learned so far?