By: Tim Smith | Comments (5) | Related: > Auditing and Compliance
Problem
After viewing your webcast on protecting confidential data, how can we prevent these data from entering our database through the application layer? In this tip we look at different approaches to stop confidential data from getting into the database.
Solution
One of the most effective ways to minimize potential legal issues with confidential data, such as SSNs, is to never allow it to enter our databases. We can request that customers not enter their SSNs, for an example, yet - unless we prevent it on the application side - we cannot control whether they do or not.
For this tip, we will cover the logic behind preventing data, such as SSNs, from entering our database. This same logic can be applied to payment information, tax information, personal information, etc. Per the Social Security Administration, I use all invalid SSNs (any SSN starting with 000 is, by default, invalid). Never use potentially valid SSNs in examples, both inside and outside your company.
To prevent SSNs from entering our database, we can take one of two approaches:
- Alert customers when we detect an SSN that they entered an SSN and prevent them from entering data until they remove it.
- Eliminate the SSN in the application logic from the data without telling the customer.
Regular Expressions to Check for SSNs
Let's do a quick review of regular expressions as they relate to this example, using SSNs:
- \d: A digit.
- \d{3}: Three digits.
- [^\d]: A non digit character, such as "a" or "-".
These are useful to know because SSNs often look like 000-00-000 or 000.00.0000, in other words, a pattern of three digits, a non digit character, two digits, a non digit character and four digits. In some cases, they may be nine digits in a row (i.e.: \d{9}), though as we'll see in this tip, we must be careful before assuming that nine consecutive digits equal an SSN.
For this example, we will be using C# and handle a free text form where users can enter any text, such as 000-00-0000. Since we want to prevent SSNs from entering our database and we want this issue handled on the application layer, we don't have a field asking for an SSN, as that would defeat the purpose. That means we should be wary of fields allowing customers to enter anything they want, such as free text fields. As a demonstration of why we don't need to worry about other fields, such as the phone field - if a customer enters their SSNs in the phone field, how would we know? Our application should only allow a 10 digit number (and nothing less) anyway, so even if a customer entered 0000000000 in a phone field, we wouldn't know that was a social security number because it's 10 digits. The customer would have to tell us in some other way.
Now that we know where we're going to wrap our business logic around, we must identify how an SSN might enter a free text form. Per the Social Security Administration link above this, SSNs are nine digits and are often expressed in a three digit separator two digit separator four digit format (000 separator 00 separator 0000). More often than not, the separator is a -, but we will handle any separator provided it is not a digit. How would we know that 00050050000 was an SSN anyway, unless otherwise stated, since it's an eleven digit number? Therefore, we are seeking either a (1) nine digit number, or (2) eleven characters with nine digits and two separators, one after the first three digits, and the second after the first five digits.
Detecting Confidential Data
Since free text forms allow strings, the following example, using a simple console application for testing using the System.Text.RegularExpression namespace, finds an SSN in a string - one seeking nine consecutive digits and the other 11 characters, with nine digits and two non digit separators in a 000-00-0000 format:
static void Main(string[] args) { List<string> textvalues = new List<string>(); textvalues.Add("The quick brown fox jumped 000-00-0000 over the lazy dogs."); textvalues.Add("The quick brown fox jumped over the lazy dogs 000-00-0000."); textvalues.Add("The quick brown 000000000 fox jumped over the lazy dogs."); textvalues.Add("The quick brown fox 0000000000 fox jumped over the lazy dogs."); textvalues.Add("ValueOne,ValueTwo,000-00-0000,8005551212,Durkah durkah"); foreach (string s in textvalues) { // Do we find a match of formatted SSNs? Match formatted_ssn = Regex.Match(s.ToLower().Trim(), @"\d{3}[^\d]\d{2}[^\d]\d{4}"); // Do we find a match of a straight nine digit SSN? Match straight_ssn = Regex.Match(s.ToLower().Trim(), @"[^\d]\d{9}[^\d]"); Console.WriteLine(s); Console.WriteLine("\t" + "Formatted SSNs: " + formatted_ssn.ToString()); Console.WriteLine("\t" + "Straight SSNs: " + straight_ssn.ToString()); } Console.ReadLine(); textvalues.Clear(); }
In the above example, using Regex we found two types of SSNs. I will note here that in some cases, a nine digit number may not be an SSN - for instance, some companies may use a nine digit number for an account number and routing numbers are generally nine digits long (for instance, see this example of routing numbers). Therefore, depending on our application and business, we may not want to check for a straight nine digit number.
Detecting and Removing Confidential Data from Input
Now that we found a match, our next step will be to remove the match, and (or) alert the customer that we don't accept SSNs. What we do here will depend on our company policy - for instance, we may eliminate the SSNs without alerting the customer, or we may prevent data from entering the database as long as it appears to contain SSNs. To do either, we will obtain the property value of the formatted_ssn and straight_ssn from the above code and either (1) strip the SSN if we find a value, or (2) alert the customer they entered an SSN and we do not take SSNs:
static void Main(string[] args) { List<string> textvalues = new List<string>(); textvalues.Add("The quick brown fox jumped 000-00-0000 over the lazy dogs."); textvalues.Add("The quick brown fox jumped over the lazy dogs 000-00-0000."); textvalues.Add("The quick brown 000000000 fox jumped over the lazy dogs."); textvalues.Add("The quick brown fox 0000000000 fox jumped over the lazy dogs."); textvalues.Add("ValueOne,ValueTwo,000-00-0000,8005551212,Durkah durkah"); foreach (string s in textvalues) { // Replace mathces of either 000[non-digit]00[non-digit]0000 structure or nine consecutive digits string final = Regex.Replace(s, @"\d{3}[^\d]\d{2}[^\d]\d{4}", ""); final = Regex.Replace(final, @"\d{9}", ""); Console.WriteLine("Before: " + s); Console.WriteLine("After: " + final + Environment.NewLine); } Console.ReadLine(); textvalues.Clear(); }
static void Main(string[] args) { List<string> textvalues = new List<string>(); textvalues.Add("The quick brown fox jumped 000-00-0000 over the lazy dogs."); textvalues.Add("The quick brown fox jumped over the lazy dogs 000-00-0000."); textvalues.Add("The quick brown 000000000 fox jumped over the lazy dogs."); textvalues.Add("The quick brown fox 0000000000 fox jumped over the lazy dogs."); textvalues.Add("ValueOne,ValueTwo,000-00-0000,8005551212,Durkah durkah"); foreach (string s in textvalues) { Match formatted_ssn = Regex.Match(s.ToLower().Trim(), @"\d{3}[^\d]\d{2}[^\d]\d{4}"); Match straight_ssn = Regex.Match(s.ToLower().Trim(), @"[^\d]\d{9}[^\d]"); // Did we find a formatted_ssn or straight_ssn value? Then we don't take it. if (formatted_ssn.Value != "" || straight_ssn.Value != "") { Console.WriteLine("We've detected that you entered an SSN. We do not take SSNs; please remove."); } // No SSNs entered, so we're good. else { Console.WriteLine(s); } } Console.ReadLine(); textvalues.Clear(); }
Testing for Confidential Data using a Console Application
Now that we have our tested logic in place - and for convenience, I highly recommend testing on a console application - we will move these logical approaches to their own methods and test it with a Windows form. We would apply this same logic to an ASP.NET form, Ruby on Rails form (logic is the same, the code will differ), etc. The below code is for demo purposes only; most applications have classes to handle data, in general, when it comes to accepting data and the logic would be added to these existing classes.
public class CatchData { public static string stripSSNs(string search) { string final = Regex.Replace(search, @"\d{3}[^\d]\d{2}[^\d]\d{4}", ""); final = Regex.Replace(final, @"\d{9}", ""); return final; } public static string preventSSNs(string entry) { Match formatted_ssn = Regex.Match(entry.ToLower().Trim(), @"\d{3}[^\d]\d{2}[^\d]\d{4}"); Match straight_ssn = Regex.Match(entry.ToLower().Trim(), @"[^\d]\d{9}[^\d]"); if (formatted_ssn.Value != "" || straight_ssn.Value != "") { return "We've detected an SSN. Please do not enter an SSN."; } // This additional else catches a string with a length of nine characters, which is only numbers; this is useful // because in the above if statement, we're looking for a nine digit character between non digit characters. else if (entry.Length == 9 && (Regex.Match(entry.ToLower().Trim(), @"\d{9}").Value != "")) { return "We've detected an SSN. Please do not enter an SSN."; } else { return entry; } } }
In our first example, if we input an SSN, it will simply strip it and return the value without the SSN. In the second example, it will generate an error if we input an SSN. Both examples output the result in a message box - in an actual application, these strings would be part of the parameters added to the database (i.e.: passing a string parameter to a stored procedure accepting VARCHARs using the namespace System.Data.SqlClient)
string c_entry = CatchData.stripSSNs(Convert.ToString(txt1.Text)); // Let's see what we get: MessageBox.Show(c_entry);
string c_entry = CatchData.preventSSNs(Convert.ToString(txt1.Text)); // Let's see what we get: MessageBox.Show(c_entry);
The above examples show a few ways in which we can prevent confidential data from entering our database, using business logic before data ever arrives in the database to either (1) strip the confidential data from the entry, or (2) alert the customer that they've entered confidential data we don't store. While this code functionally works, if you work solely on the database side of applications, I would suggest speaking with your middle-end/front-end developers about this, as you don't want to apply a process on top of their existing process, if they haven't checked that it works with what they have (or that it meets their performance standard). This example shows that we can minimize confidential data headaches with ways to prevent it from entering our database.
Next Steps
- What other data may you want to avoid storing and how would you catch these data?
- Touch up on your familiarity with regular expressions.
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips