Learn more about SQL Server tools

mssqltips logo
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Whitepapers          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

 

Prevent Confidential Data at the Application Layer from getting into SQL Server


By:   |   Read Comments (5)   |   Related Tips: More > Auditing and Compliance

Attend our free MSSQLTips Webcast - How to Simplify Routine SQL Server Administration Tasks


Problem

After viewing your webcast on protecting confidential data, how can we prevent these data from entering our database through the application layer?  In this tip we look at different approaches to stop confidential data from getting into the database.

Solution

One of the most effective ways to minimize potential legal issues with confidential data, such as SSNs, is to never allow it to enter our databases. We can request that customers not enter their SSNs, for an example, yet - unless we prevent it on the application side - we cannot control whether they do or not.

For this tip, we will cover the logic behind preventing data, such as SSNs, from entering our database. This same logic can be applied to payment information, tax information, personal information, etc. Per the Social Security Administration, I use all invalid SSNs (any SSN starting with 000 is, by default, invalid). Never use potentially valid SSNs in examples, both inside and outside your company.

To prevent SSNs from entering our database, we can take one of two approaches:

  1. Alert customers when we detect an SSN that they entered an SSN and prevent them from entering data until they remove it.
  2. Eliminate the SSN in the application logic from the data without telling the customer.
Remember, that we want to stop this on the application end, so the SSN will never be passed into the database, so none of the elimination logic will involve T-SQL. This tip shows some examples of ways to do this, and if you don't have an application development team, or you are a full stack developer and may be unaware of how to do this, this will help. Otherwise, check with the current application team as they may already have classes or business logic that handles confidential data and we want to ensure that whatever we roll out is compatible with their existing architecture.

Regular Expressions to Check for SSNs

Let's do a quick review of regular expressions as they relate to this example, using SSNs:

  • \d: A digit.
  • \d{3}: Three digits.
  • [^\d]: A non digit character, such as "a" or "-".

These are useful to know because SSNs often look like 000-00-000 or 000.00.0000, in other words, a pattern of three digits, a non digit character, two digits, a non digit character and four digits. In some cases, they may be nine digits in a row (i.e.: \d{9}), though as we'll see in this tip, we must be careful before assuming that nine consecutive digits equal an SSN.

For this example, we will be using C# and handle a free text form where users can enter any text, such as 000-00-0000. Since we want to prevent SSNs from entering our database and we want this issue handled on the application layer, we don't have a field asking for an SSN, as that would defeat the purpose. That means we should be wary of fields allowing customers to enter anything they want, such as free text fields. As a demonstration of why we don't need to worry about other fields, such as the phone field - if a customer enters their SSNs in the phone field, how would we know? Our application should only allow a 10 digit number (and nothing less) anyway, so even if a customer entered 0000000000 in a phone field, we wouldn't know that was a social security number because it's 10 digits. The customer would have to tell us in some other way.

Now that we know where we're going to wrap our business logic around, we must identify how an SSN might enter a free text form. Per the Social Security Administration link above this, SSNs are nine digits and are often expressed in a three digit separator two digit separator four digit format (000 separator 00 separator 0000). More often than not, the separator is a -, but we will handle any separator provided it is not a digit. How would we know that 00050050000 was an SSN anyway, unless otherwise stated, since it's an eleven digit number? Therefore, we are seeking either a (1) nine digit number, or (2) eleven characters with nine digits and two separators, one after the first three digits, and the second after the first five digits.

Detecting Confidential Data

Since free text forms allow strings, the following example, using a simple console application for testing using the System.Text.RegularExpression namespace, finds an SSN in a string - one seeking nine consecutive digits and the other 11 characters, with nine digits and two non digit separators in a 000-00-0000 format:

static void Main(string[] args)
{
 List<string> textvalues = new List<string>();
 textvalues.Add("The quick brown fox jumped 000-00-0000 over the lazy dogs.");
 textvalues.Add("The quick brown fox jumped over the lazy dogs 000-00-0000.");
 textvalues.Add("The quick brown 000000000 fox jumped over the lazy dogs.");
 textvalues.Add("The quick brown fox 0000000000 fox jumped over the lazy dogs.");
 textvalues.Add("ValueOne,ValueTwo,000-00-0000,8005551212,Durkah durkah");

 foreach (string s in textvalues)
 {
  //  Do we find a match of formatted SSNs?
  Match formatted_ssn = Regex.Match(s.ToLower().Trim(), @"\d{3}[^\d]\d{2}[^\d]\d{4}");
  //  Do we find a match of a straight nine digit SSN?
  Match straight_ssn = Regex.Match(s.ToLower().Trim(), @"[^\d]\d{9}[^\d]");
  Console.WriteLine(s);
  Console.WriteLine("\t" + "Formatted SSNs: " + formatted_ssn.ToString());
  Console.WriteLine("\t" + "Straight SSNs: " + straight_ssn.ToString());
 }
 Console.ReadLine();
 textvalues.Clear();
}

In the above example, using Regex we found two types of SSNs. I will note here that in some cases, a nine digit number may not be an SSN - for instance, some companies may use a nine digit number for an account number and routing numbers are generally nine digits long (for instance, see this example of routing numbers). Therefore, depending on our application and business, we may not want to check for a straight nine digit number.

Detecting and Removing Confidential Data from Input

Now that we found a match, our next step will be to remove the match, and (or) alert the customer that we don't accept SSNs. What we do here will depend on our company policy - for instance, we may eliminate the SSNs without alerting the customer, or we may prevent data from entering the database as long as it appears to contain SSNs. To do either, we will obtain the property value of the formatted_ssn and straight_ssn from the above code and either (1) strip the SSN if we find a value, or (2) alert the customer they entered an SSN and we do not take SSNs:

static void Main(string[] args)
{
 List<string> textvalues = new List<string>();
 textvalues.Add("The quick brown fox jumped 000-00-0000 over the lazy dogs.");
 textvalues.Add("The quick brown fox jumped over the lazy dogs 000-00-0000.");
 textvalues.Add("The quick brown 000000000 fox jumped over the lazy dogs.");
 textvalues.Add("The quick brown fox 0000000000 fox jumped over the lazy dogs.");
 textvalues.Add("ValueOne,ValueTwo,000-00-0000,8005551212,Durkah durkah");

 foreach (string s in textvalues)
 {
  //  Replace mathces of either 000[non-digit]00[non-digit]0000 structure or nine consecutive digits
  string final = Regex.Replace(s, @"\d{3}[^\d]\d{2}[^\d]\d{4}", "");
  final = Regex.Replace(final, @"\d{9}", "");
  Console.WriteLine("Before: " + s);
  Console.WriteLine("After: " + final + Environment.NewLine);
 }
 Console.ReadLine();
 textvalues.Clear();
}


static void Main(string[] args)
{
 List<string> textvalues = new List<string>();
 textvalues.Add("The quick brown fox jumped 000-00-0000 over the lazy dogs.");
 textvalues.Add("The quick brown fox jumped over the lazy dogs 000-00-0000.");
 textvalues.Add("The quick brown 000000000 fox jumped over the lazy dogs.");
 textvalues.Add("The quick brown fox 0000000000 fox jumped over the lazy dogs.");
 textvalues.Add("ValueOne,ValueTwo,000-00-0000,8005551212,Durkah durkah");

 foreach (string s in textvalues)
 {
  Match formatted_ssn = Regex.Match(s.ToLower().Trim(), @"\d{3}[^\d]\d{2}[^\d]\d{4}");
  Match straight_ssn = Regex.Match(s.ToLower().Trim(), @"[^\d]\d{9}[^\d]");
  //  Did we find a formatted_ssn or straight_ssn value?  Then we don't take it.
  if (formatted_ssn.Value != "" || straight_ssn.Value != "")
  {
   Console.WriteLine("We've detected that you entered an SSN.  We do not take SSNs; please remove.");
  }
  //  No SSNs entered, so we're good.
  else
  {
   Console.WriteLine(s);
  }
  
 }
 Console.ReadLine();
 textvalues.Clear();
}

Testing for Confidential Data using a Console Application

Now that we have our tested logic in place - and for convenience, I highly recommend testing on a console application - we will move these logical approaches to their own methods and test it with a Windows form. We would apply this same logic to an ASP.NET form, Ruby on Rails form (logic is the same, the code will differ), etc. The below code is for demo purposes only; most applications have classes to handle data, in general, when it comes to accepting data and the logic would be added to these existing classes.

public class CatchData
{
 public static string stripSSNs(string search)
 {
  string final = Regex.Replace(search, @"\d{3}[^\d]\d{2}[^\d]\d{4}", "");
  final = Regex.Replace(final, @"\d{9}", "");
  return final;
 }

 public static string preventSSNs(string entry)
 {
  Match formatted_ssn = Regex.Match(entry.ToLower().Trim(), @"\d{3}[^\d]\d{2}[^\d]\d{4}");
  Match straight_ssn = Regex.Match(entry.ToLower().Trim(), @"[^\d]\d{9}[^\d]");

  if (formatted_ssn.Value != "" || straight_ssn.Value != "")
  {
   return "We've detected an SSN.  Please do not enter an SSN.";
  }
  //  This additional else catches a string with a length of nine characters, which is only numbers; this is useful
  //  because in the above if statement, we're looking for a nine digit character between non digit characters.
  else if (entry.Length == 9 && (Regex.Match(entry.ToLower().Trim(), @"\d{9}").Value != ""))
  {
   return "We've detected an SSN.  Please do not enter an SSN.";
  }
  else
  {
   return entry;
  }
 }
}

In our first example, if we input an SSN, it will simply strip it and return the value without the SSN. In the second example, it will generate an error if we input an SSN. Both examples output the result in a message box - in an actual application, these strings would be part of the parameters added to the database (i.e.: passing a string parameter to a stored procedure accepting VARCHARs using the namespace System.Data.SqlClient)

string c_entry = CatchData.stripSSNs(Convert.ToString(txt1.Text));
//  Let's see what we get:
MessageBox.Show(c_entry);


string c_entry = CatchData.preventSSNs(Convert.ToString(txt1.Text));
//  Let's see what we get:
MessageBox.Show(c_entry);



The above examples show a few ways in which we can prevent confidential data from entering our database, using business logic before data ever arrives in the database to either (1) strip the confidential data from the entry, or (2) alert the customer that they've entered confidential data we don't store. While this code functionally works, if you work solely on the database side of applications, I would suggest speaking with your middle-end/front-end developers about this, as you don't want to apply a process on top of their existing process, if they haven't checked that it works with what they have (or that it meets their performance standard). This example shows that we can minimize confidential data headaches with ways to prevent it from entering our database.

Next Steps
  • What other data may you want to avoid storing and how would you catch these data?
  • Touch up on your familiarity with regular expressions.


Last Update:


signup button

next tip button



About the author
MSSQLTips author Tim Smith Tim Smith works as a DBA and developer and also teaches Automating ETL on Udemy.

View all my tips
Related Resources





Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Notify for updates 


SQL tips:

*Enter Code refresh code     



Tuesday, September 23, 2014 - 8:02:33 AM - Tim Back To Top

Thank you Vinny; their Q&A page is also fascinating and pertinent to developers for testing (I wouldn't recommend controlling for certain numbers because it's better to catch attempts even if the attempt is invalid):

"The length of the SSN did not change. The SSN is still comprised of nine numeric digits."

"Randomization also introduced previously unassigned area numbers for assignment excluding area numbers 000, 666 and 900-999."

"SSN randomization will not assign group number 00 or serial number 0000. SSNs containing group number 00 or serial number 0000 will continue to be invalid."

They also include an email, ssn.randomization@ssa.gov, if you have questions.


Monday, September 22, 2014 - 10:20:30 PM - Vinny Back To Top

The Social Security Administration, as of 2011 changed its policy regarding SSN randomization, to include new rules for generating and issuing SSN's.  While this logic in this article may be true and help detect SSNs for input prevention, you may want to take a look at the Social Security Administrations new rules for issuing SSN's to members, and how it applies to the logic behind this code.  :)

http://www.ssa.gov/employer/randomization.html


Monday, September 22, 2014 - 9:24:55 AM - Bill Back To Top

@Rick - Brian White added some useful comments in Nitansh's article. He also mentioned PCI, which is a reference to https://www.pcisecuritystandards.org/


Friday, September 12, 2014 - 6:35:09 PM - Tim Back To Top

@Rick - good question.  Those environments could also look at encrypting their data (articles by K. Kelley here and Nitansh Agarwal here) as well as keeping sensitive data access to the person who absolutely requires it (and this should be a very small category) as determined by legal.  Ideally, the latter should be separate, heavily guarded, and a pain to obtain data.  Anything that can be accessed quickly has a higher probability of being less secure (the inverse isn't always true).


Friday, September 12, 2014 - 7:21:38 AM - Rick Dobson Back To Top

Some businesses, such as debt collection agencies, hospitals, and marketing firms, may require access to personal information that is otherwise confidential.  What do you recommend for those working with SQL Server database servers in these circumstances?


Learn more about SQL Server tools