Solving forget me requests for GDPR in SQL Server
In my previous GDPR tip, I discussed deleting someone’s information to comply with a “forget me” request, and how you must ensure the data is really gone. As mentioned in that tip, this data can stick around in execution plans and on the data pages themselves. It also sticks around in other places like statistics histograms, the transaction logs, and backups taken prior to the delete, which I will talk about another time.
Today I want to address the perceived inconsistency between being asked to forget someone forever, yet keeping their information to prevent that person from re-entering the system later. People ask, if I have to delete their data today, how do I know to not let them back into the system tomorrow, and start the cycle over again? Must I keep their e-mail address on hand within some sort of blacklist, and then isn’t that (ironically) considered non-compliance?
While the regulation makes it clear that you can keep personal information for the purpose of auditing and validating that you’ve complied with a request, there is a lot of confusion in the industry about where to draw the line. How long do I have to keep that data, how long can I keep the data, how far away from my active marketing system must the data exist, and so on. Auditors, attorneys, and anyone looking to troll or take advantage of this can make your life very difficult. They could try to claim that an e-mail address stored in plain text constitutes their information as still being part of your system, and that it could still be used for the request that you forget them. Proving that the data is still there strictly to continue complying with their request can be tedious and expensive.
Note that I will focus the conversation around e-mail address here, but really the piece of data can be anything in the PII realm – social security number, full name, and IP address. Most of these pieces of data have problems with associating them with one and only one person (SSN can be re-issued, there are lots of John Herbert Smiths, and IP addresses are both shared and redistributed). I want to focus on e-mail address because it is something you can keep for as long as you want, uniquely identifies you, and because it is one piece of data most likely to come under fire in any serious GDPR litigation.
One way to set minds at ease in your organization is to use a one-way hash to store the e-mail address or other personally identifiable data as a part of the removal and documentation process. By hashing the data, you make it impossible for someone to know that a certain value is there, unless (a) they already know the value, and (b) they know which algorithm you used to hash, and any salt that you may have added. But since you have all these pieces of information, you can easily validate at some future time that the e-mail address can only be verified when the value is presented and, further to that, that if your data were breached, all the hackers would get is a meaningless hash.
Think about how passwords work: most systems can validate that the password you entered matches the password that is stored, but they would never dare store your password in plain text. There are exceptions, of course, but let’s stick to the folks who do it right. How do they do it? When you create your password, they hash it, and store the hashed version. When they validate your password, first they hash the string you presented to them, then they compare that to the hashed version they previously stored. There’s often a lot more to it, and I don’t want to suggest that properly securing a password is that simple, but that’s the gist at a high level.
We can apply the same concepts to an e-mail address, for forgetting them, verifying they’ve been forgotten, and preventing them from re-entering the system – without ever storing the plain text e-mail address associated with the request.
Let’s set up a simple example: We have a table with customers’ e-mail address information, and we’ll insert three rows:
CREATE TABLE dbo.Customers ( CustomerID int IDENTITY(1,1), Email nvarchar(320) NULL, CONSTRAINT PK_Customers PRIMARY KEY (CustomerID), INDEX UQ_Email UNIQUE(Email) WHERE Email IS NOT NULL ); INSERT dbo.Customers(Email) VALUES (N'[email protected]'),(N'[email protected]'),(N'[email protected]');
Maybe there is also a flag somewhere that says someone has opted out, but that is unimportant to this example (and a NULL e-mail address could be that flag).
Then, in our auditing database, we create a table to store "forget me" requests:
USE Auditing; GO CREATE TABLE dbo.ForgetMeRequests ( CustomerID int, HashedEmail varbinary(64), RequestTimestamp datetime2(0), OriginalEmail nvarchar(320) );
This table has the CustomerID as optional, since the request could simply be to never let me into your system, in which case there is no customer. The timestamp of the request is there because we're going to use that as part of the hash. The original, non-hashed e-mail address is there in case you want to keep it temporarily on-hand in a more convenient format, as the regulation allows. Consider that optional.
Now, when a forget me request comes in, we look up the customer by that e-mail address, set it to NULL, and then store the hashed version of the e-mail address in our auditing table. How can we create a one-way hashed version of an e-mail address? We can use the HASHBYTES() function in SQL Server. First, a simple table-valued function:
USE Auditing; GO CREATE FUNCTION dbo.HashTheEmail ( @Email nvarchar(320), @Now datetime2(0) ) RETURNS table WITH SCHEMABINDING AS RETURN (SELECT Email = @Email, HashedEmail = HASHBYTES(N'SHA2_512', @Email -- + salt: + CONVERT(nvarchar(30),DATEDIFF_BIG(SECOND, 0, @Now)))); GO
So, a stored procedure in the customer database might look like this (feel free to optionally add the un-hashed e-mail address to the log table if you like):
USE Auditing; GO CREATE PROCEDURE dbo.DudeForgetMePlease @Email nvarchar(320) AS BEGIN SET NOCOUNT ON; DECLARE @Now datetime2(0) = SYSDATETIME(); INSERT dbo.ForgetMeRequests(CustomerID, HashedEmail, RequestTimestamp) --, OriginalEmail SELECT c.CustomerID, h.HashedEmail, @Now --, @Email FROM dbo.HashTheEmail(@Email,@Now) AS h LEFT OUTER JOIN CustomerDB.dbo.Customers AS c ON c.Email = h.Email; UPDATE CustomerDB.dbo.Customers SET Email = NULL WHERE Email = @Email; END GO
First, the procedure hashes the e-mail address and pulls the row from the Customers table, if it exists, that matches that e-mail address. It puts this data into the forget me log and then sets that customer’s e-mail address to NULL (again, if it exists). If we run this for the [email protected] customer, and then another e-mail address who is not a customer:
EXEC dbo.DudeForgetMePlease @Email = N'[email protected]'; EXEC dbo.DudeForgetMePlease @Email = N'[email protected]';
Then look at the tables (Customers and ForgetMeRequests) , we see:
Verifying they’ve been forgotten
Next, we can write a stored procedure that will help validate that a person is no longer in the system. We use a similar technique with our function:
USE Auditing; GO CREATE PROCEDURE dbo.DudeProveIAmNotInYourSystem @Email nvarchar(320) AS BEGIN SET NOCOUNT ON; SELECT N'Checking ' + @Email; -- need to check if they are in there *without* a forget me request: SELECT CustomerID, Email FROM CustomerDB.dbo.Customers AS c WHERE Email = @Email; -- need to also show the hashed version of their e-mail address -- if it was previously involved in a forget me request, so they -- know we can prevent them from re-entering the system: SELECT f.* FROM dbo.ForgetMeRequests AS f CROSS APPLY dbo.HashTheEmail(@Email, f.RequestTimestamp) AS h WHERE f.HashedEmail = h.HashedEmail; END GO
We can call this procedure with the two e-mail addresses we’ve explicitly forgotten, an e-mail address of an existing customer, and an e-mail address we’ve never seen, to demonstrate the output:
EXEC dbo.DudeProveIAmNotInYourSystem @Email = N'[email protected]'; EXEC dbo.DudeProveIAmNotInYourSystem @Email = N'[email protected]'; EXEC dbo.DudeProveIAmNotInYourSystem @Email = N'[email protected]'; EXEC dbo.DudeProveIAmNotInYourSystem @Email = N'[email protected]';
Output is as you would expect. The first two calls only yield data from the forget me log, the third shows the existing customer row, and the last one shows nothing:
Preventing them from getting back in the system
Since we have a common and consistent pattern of generating a one-way hash for an e-mail address, preventing them from coming into the system again simply requires comparing the hash of the e-mail address against those associated with all existing forget me requests. The optional salt, used in the original hash at the time of the request, can be combined with our function and whatever ETL process or stored procedure is being used to introduce new e-mail addresses to the system. If we find a match, then we know that this e-mail address shouldn’t be added. In singleton, stored procedure form:
USE CustomerDB; GO CREATE PROCEDURE dbo.CreateNewCustomer @Email nvarchar(320) AS BEGIN SET NOCOUNT ON; -- first check if they are on the forget me list: IF EXISTS ( SELECT 1 FROM AuditingDB.dbo.ForgetMeRequests AS f CROSS APPLY AuditingDB.dbo.HashTheEmail(@Email, f.RequestTimestamp) AS h WHERE f.HashedEmail = h.HashedEmail ) BEGIN RAISERROR(N'Cannot add %s due to forget me request.', 11, 1, @Email); RETURN; END -- if we got here, it's ok to insert... PRINT N'Adding ' + @Email; END GO
If you get a lot of forget me requests, this may become an expensive thing to check for every new e-mail address you add to the system. On the other hand, it might also be a sign that you don’t have the most honorable business model. For brevity I just created a heap with no indexes, but that probably isn’t going to be how you want to go to production.
In any case, let’s see what happens when we try to add as new customers the two forget me requests we’ve already processed, as well as a new, never-before-seen customer:
EXEC dbo.CreateNewCustomer @Email = N'[email protected]'; EXEC dbo.CreateNewCustomer @Email = N'[email protected]'; EXEC dbo.CreateNewCustomer @Email = N'[email protected]';
Msg 50000, Level 11, State 1, Procedure dbo.CreateNewCustomer Cannot add [email protected] due to forget me request. Msg 50000, Level 11, State 1, Procedure dbo.CreateNewCustomer Cannot add [email protected] due to forget me request. Adding [email protected]
So, we were able to successfully identify the two e-mail addresses that had previously asked to be forgotten, and the new customer was added without any issue.
As a side note, don’t put the hashing mechanism in the same database as the data and, if possible, don’t put the hashed versions of the e-mail addresses in either of those databases either. Separation is key, since if you put everything in one database, all it takes is one unencrypted .bak file to go missing and they’ve got all the info they need. You might even consider a slightly different approach by performing the hashing outside of SQL Server altogether. I use T-SQL here simply for convenience, and not because it’s the most secure approach.
Make no mistake: GDPR is going to be a pain in the rear. The regulation has existed since 2016, but a lot of companies are pulling all the stops at the 11th hour. Be prepared to appease the people driving your business requirements, even they aren’t perfectly in line with the letter of the regulation.
Read on for related tips and other resources:
- Data cleanup in SQL Server becomes more important with GDPR
- Is your SQL Server environment ready for GDPR?
- How does GDPR affect your SQL Server Recovery Plans
- SQL Server Security Checklist
- All SQL Server security-related tips
About the author
View all my tips