Protect Confidential SQL Server Data Q and A

Problem

Following the webcast on protecting confidential data, a few viewers asked excellent questions unrelated to RegEx and the below provide some approaches as answers to these questions.

Solution

Are there any options to find SSNs in SQL Server besides RegEx?

Good question, and Microsoft provides us with other tools that we can use to identify what might be confidential data. We have a few tools at our disposal if we don’t want to use RegEx, though these will depend on how we structure our environment. For instance, if we maintain and work in an environment where the names of our columns directly indicate our data, we can use something similar to the below query:

SELECT *
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME LIKE '%SSN%'

We could also pass in other possible column name combinations, like payment, in the case of identifying confidential payment information. In addition to this method, we can use the below example code:

-- Example using SSNs: column name could be TAXID, SOCIALSECURITYNUMBER, SSN, etc. and we should consider that valid SSNs are between 9 
-- characters long (000000000) through 11 characters long (000-00-0000).
SELECT *
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME LIKE '%SSN%'
 OR COLUMN_NAME LIKE '%TAX%'
 OR COLUMN_NAME LIKE '%SOCIAL%'
 OR CHARACTER_MAXIMUM_LENGTH BETWEEN 9 AND 11
 
-- Example using VISA cards: column name could be VISA, PAYMENTINFO, CREDITCARD, etc. and we should consider that valid VISAs are between 16 
-- characters long (0000000000000000) through 19 characters long (0000-0000-0000-0000). 
SELECT *
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME LIKE '%VISA%'
 OR COLUMN_NAME LIKE '%PAYMENT%'
 OR COLUMN_NAME LIKE '%CREDIT%'
 OR CHARACTER_MAXIMUM_LENGTH BETWEEN 16 AND 19

In the example of the maximum length, if we found numerous values in the table where the LEN was greater than 16 to 19 digits, and nothing else indicates it was payment data from a VISA credit card, we would be making the same assumptions as if we used a RegEx pattern for a VISA card number with the format of 0000000000000000 or 0000-0000-0000-0000. If we stored data with separators (i.e.”-“), then we would take the same approach with 19 characters and would be making the same assumption of a RegEx pattern with 0000-0000-0000-0000. Behind RegEx is an assumption – just because we find a match (or don’t) doesn’t mean that we’ve found confidential data, only that we’ve found the possibility of confidential data. The same is true with these other tools – we’re looking for character ranges, data types, maximum lengths, and combinations of how they could possibly be stored. The main difference is that RegEx, once a person becomes familiar with it, will identify matches faster because it can actually look at patterns in the data – these other tools look at what columns could be storing the data.

Some environments may keep the names of columns only in the raw data and for production data sets, change the column names. In this case, we would look through system objects that either (1) track the flow of data, or (2) track if we note what columns might be even when their names don’t indicate anything that appears confidential.

-- SSN or TaxID Example
SELECT *
FROM sys.syscomments
WHERE text LIKE '%SSN%'
 OR text LIKE '%Social%'
 OR text LIKE '%Tax%'
 
-- Credit Card Example
SELECT *
FROM sys.syscomments
WHERE text LIKE '%Payment%'
 OR text LIKE '%Card%'
 OR text LIKE '%VISA%'

We can also expand the above functionality by looking at code from any type of file, whether a SQL, C#, XML or other file. These files may have stored comments, information, or code that might tell us that we’re storing or possibly “handling” confidential data. For instance, the function below looks through SQL, C#, XML, and PowerShell files for the word Social, testing if we possibly have something like “Social Security Number.” We could also look for SSN, TaxID, or Payment if we suspect that these values are stored.

Function FindWord ($githubfolder, $pattern)
{
    $cnt = Get-ChildItem $githubfolder -Include @("*.sql", "*.cs", "*.xml", "*.ps1") -Recurse | Select-String -pattern $pattern
    if ($cnt.Count -gt 0)
    {
        $cnt
    }
}
## We will enter the path where we store our code and then the key word we're seeking:
FindWord -githubfolder "C:\Users\Timothy Smith\Documents\GitHub\" -pattern "Social"

Keep in mind, that if our application is asking for data, like a SSN, when we don’t need it – if this exists on a form in C# or HTML, a tool like this will find it because the text like “Enter Your SSN:” will be somewhere on the form and from there we can find the affiliated text box or entry tool.

Finally, consider tools like GreenSQL for some environments. If your company wants to minimize costs in development, these tools make sense – it costs money to develop solutions and in some companies, they don’t want you to rebuild a wheel if it already exists.

Why would anyone store sensitive data un-encrypted in SQL Server?

Good question and unfortunately I’m unable to answer for those who don’t use it, though I suspect some may look at it more from a performance perspective or under estimate the probability that their environment could be compromised. RegEx and other tools offer useful ways to identify data which may not be encrypted, and once they are identified, if they must be stored (recall that, legally, a lot of companies couldn’t defend storage of confidential data – they just haven’t been contested yet), encryption is a must. After having written that, people will still try to find ways to break encryption, so I would be careful before thinking that it won’t ever be cracked or that it is the only possible solution; it makes a phenomenal addition to security.

Nitansh Agarwal wrote an excellent article on column level encryption for MSSQLTips, and I highly suggest checking it out to get a perspective on how to encrypt column data.

How Do You Determine What Data Are Confidential?

This is an excellent question because it can vary by industry. I’d suggest the following for identifying what you may be storing that could be considered confidential, in order of what is most important and should be done first:

Contact your legal department (if there is one), an attorney, or law professor, if in doubt about data. Some data, like payment information and SSNs are obvious, some may not seem to be obvious. In some cases (i.e.: your company doesn’t have a legal department) this can cost, but not protecting confidential data costs more. In addition, some attorneys who specialize in fields may have written articles or hold local events covering some of these concerns; if available to you, take advantage of these resources as they can be helpful in preventing problems.
Ask fellow developers in the same industry without specific data, yet still covering what areas are concerning, such as “What type of data should we be extra cautious about?” Developers don’t have to (and shouldn’t) say, “Oh, account number 369sxsw658 is …” but they can say that payment information and address information, in certain situations, may be considered confidential. As an example of this, since I spend and have spent most of my time in the financial industry, I’ve cautioned other developers on what they’ve written or said because breaking confidentiality will cost both in the world of finance and the world of personal integrity. Some clients expect confidentiality with every piece of data and they have every right to expect this.
Ask government officials or look at government websites on the industry you’re in. For instance, if we worked in health care, we could search government websites for health care data, security and confidentiality. Remember, if using a site, it must be a .gov site – any other sites should be verified.
Ask yourself what you feel uneasy about companies storing and consider that might be confidential. Always remember, if you don’t need to store it, don’t. One example I use here a lot is hospitals and doctor’s offices asking for SSNs – they don’t need that information and consumers don’t have to provide it, if they don’t want to.
Think like a hacker: suppose that a hacker obtain access to your database – what would he or she go after, outside of extra permissions? As you can see from this tip, a lot of the searching for SSN or other columns to identify possible confidential data come from what many hackers will do – search system objects to identify data they want to steal.

Is It Better To Mask At the Application Level Or The SQL Server Database Level?

Consider this related analogy to your architecture when it comes to security: imagine if we disabled all electronic access to our financial accounts – no online, no credit card, no checks, no bill pay, no mobile. We decided that when we wanted to buy something, we would visit the financial institution face-to-face, withdraw cash that we needed, and use it, never withdrawing more than we needed in the moment. Secure? Definitely! Inconvenient? Definitely! Convenience and security seldom mix together, and as a good economist would tell us, it’s opportunity cost in action.

For maximum security, if a hacker broke through our application and obtained access to our database and the confidential data were masked inside the database, he or she would have obtained little (i.e.: only the non-masked part) regarding that piece of confidential data. In the presentation, I used an example of loan processing where the first server would store some information of a customer for a JOIN taking a loan application, but the SSN data was on a different server and after a loan application was taken by a loan officer, the information was passed to the second server for a decision (which, at that point, would send the customer a notification about loan approval). This offers an extra obstacle for a hacker – he or she now has to get access to server two after obtaining access to server one, or must find a way to access server two, in order to get access to the SSN. Unfortunately, it also creates headaches for DBAs and developers in that we now have two servers to maintain in addition to a process where our applications get written to a file (for instance) that get picked up in a queuing process. What’s faster? Anything less secure. What’s more secure? Generally, something well built with hurdles that might slow processes.

Should I Leave Preventing/Masking Confidential Data To Application Developers or Take Initiative As A DBA?

I’d recommend working with application developers as to what process they currently have in place to (1) avoid storing confidential data, or (2) protect confidential data because it builds team cohesion and also we may end up in an environment where we are responsible for protection of confidential data later and by working with a strong team with solid ideas, develop good ideas from that experience.

Outside of Not Storing/Encrypting/Architectural/Masking Confidential Data, Are There Other Ways To Deter Hackers?

One technique that I’ve discussed with colleagues is data muddying, which we only discuss as useful in text based data and environments willing to spend extra money on protection. An example would be Tweets that are inaccurate, dishonest, or pure spam to confuse marketing bots. Since algorithms make assumptions (for instance, an algorithm assumes that if we post, “I just ate ice cream and it was incredibly good!” one some level, we like the ice cream), we misguide these algorithms with inaccurate data. Unfortunately, this generally has limited effectiveness on numeric data and has large costs associated with it (i.e.: how would you do this with data like SSNs?), as queries consume more time and environments require more storage space. It does have useful applications in certain situations; for an example, if we had two clients who exchanged private messages, we stored the past text of their messages, and 99.9999% of the data stored was intentionally fake. Even after a hacker broke through the architecture, then the encryption, he or she would be left with the monumental task of finding the .0001% of legitimate data.

Next Steps

Look through your environment to see if you spot possible confidential data.
Test and expand on the above code as necessary for your environment.
If unaware of confidential data, outline a few people that may know what to do and get in touch.

Tim Smith

Tim consults for FinTek Development and teaches the course Automating ETL on Udemy. He’s worked with technology since high school, helping his school win its first TCEA award and continues to work in automation, data architecture, back-end development, and smart contract architecture. Tim enjoys testing new technologies early in the diffusion of innovation curve and was an early adopter of NoSQL and smart contract development. He has a blog at http://www.fintekdev.com/ and helps contribute to local technical and financial events in Texas.