Learn more about SQL Server tools

mssqltips logo
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Scripts          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

 

Storing E-mail addresses more efficiently in SQL Server


By:   |   Updated: 2012-03-13   |   Comments (8)   |   Related: 1 | 2 | 3 | 4 | More > Database Design

Problem

A lot of people are storing large quantities of e-mail addresses in their systems. As someone who focuses on squeezing every bit of performance out of the various hardware components in a system, when I look at a table full of e-mail addresses, I can't help but think that it is wasteful. Since a large percentage of any system's user base is going to be supplying e-mail addresses from a few of the vastly popular providers (such as GMail and Hotmail), we are storing the "@gmail.com" and "@hotmail.com" parts of those addresses on disk over and over and over again. Further to that it makes it difficult to search efficiently for all users of GMail, for example, since the domain name is not at the beginning of the column and therefore will require a table or index scan. Even if the e-mail address column is indexed, the index is only useful for exact matches or when trying to find all the users like [email protected]<anydomain>.

Solution

Earlier in my career, I experimented with various ways to store e-mail addresses that would bypass some of these limitations. Initially we had a table like this:

CREATE TABLE dbo.Contacts_Original
(
    EmailAddress VARCHAR(320) PRIMARY KEY
);

(For simplicity's sake, let's ignore that there were obviously other columns involved.)

But we quickly came across some of the problems mentioned above - the table grew quite quickly, and certain types of queries started to take unacceptably long times (such as running reports on the activity for Hotmail.com users). At first we tried splitting the e-mail address up, which would at least buy us the ability to index the domain first, which helped certain queries:

CREATE TABLE dbo.Contacts_Split
(
    LocalPart    VARCHAR(64)  NOT NULL,
    DomainName   VARCHAR(255) NOT NULL,
    PRIMARY KEY  (DomainName, LocalPart)
);

This solution was very short-lived as well, since we still had a tremendous amount of space being wasted on duplicate domain names (and not all of them were nice and short like gmail.com and hotmail.com). So the logical evolution was to store each domain name *once*, and even though it would need more code for maintenance, we would be better off in the end since we were I/O-bound and always strived to find ways to reduce I/O. So the final version of the schema looked something like this:

CREATE TABLE dbo.EmailDomains
(
    DomainID     INT IDENTITY(1,1) PRIMARY KEY,
    DomainName   VARCHAR(255) NOT NULL UNIQUE
);
CREATE TABLE dbo.Contacts_DomainFK
(
    DomainID     INT NOT NULL FOREIGN KEY
                 REFERENCES dbo.EmailDomains(DomainID),
    LocalPart    VARCHAR(64),
    PRIMARY KEY  (DomainID, LocalPart)
);

But aside from the obvious additional complexities in code, how would this new solution perform compared to the previous solutions? I wanted to run it through three tests:

  1. Inserting 10,000 new e-mail addresses
  2. Counting the number of addresses in use by a specific domain
  3. Updating all the addresses for one domain to use a new domain (something like this really did happen)

So first I needed a way to generate 10,000 unique e-mail addresses. There are plenty of ways to generate data; I like to play with the system objects, so I came up with this approach:

;WITH x(n,i,rn) AS
(
    SELECT TOP (10000) LOWER(s1.name),
        ABS(s2.[object_id] % 10),
        ROW_NUMBER() OVER (ORDER BY s1.[object_id])
    FROM sys.all_columns AS s1
    CROSS JOIN sys.all_objects AS s2
),
y(i, localpart, domain) AS
(
    SELECT i, n+RTRIM(rn),
        CASE WHEN i < =2 THEN 'gmail.com'
             WHEN i <= 4 THEN 'hotmail.com'
             WHEN i = 5 THEN 'aol.com'
             WHEN i = 6 THEN 'compuserve.net'
             WHEN i = 7 THEN 'mci.com'
             WHEN i = 8 THEN 'hotmail.co.uk'
             ELSE RTRIM(rn) + '.com'
        END
    FROM x
)
SELECT domain, c = COUNT(*)
FROM y
GROUP BY domain ORDER BY c DESC;

I wanted a lot of duplicate domains, but I also wanted a lot of singletons. This query produced 992 rows on my system, favoring gmail and hotmail domains:

Grouped domain query results

And changing the final SELECT to actually build e-mail addresses:

  ...
)
SELECT DISTINCT y.localpart, y.domain, email = y.localpart + '@' + y.domain
FROM x INNER JOIN y ON x.i = y.i;

Yields these partial results (10,000 total unique e-mail addresses):

E-mail address query results

(In spite of appearances, not all e-mail addresses start with 'grantee'.)

So now that we have a way to easily generate 10,000 unique e-mail addresses, let's proceed with our tests.

Creating 10,000 contact rows

In order to test the creation of 10,000 rows, I created a single stored procedure for each table (since this is how contacts were typically generated in our system). Note that I use MERGE for consistency across methods, not for performance, error handling, isolation or concurrency (those are topics for another tip). You'll notice that the stored procedure code gets a little more complex in each case; hopefully that complexity will pay off.

CREATE PROCEDURE dbo.Contact_Create_Original
 @EmailAddress VARCHAR(320)
AS
BEGIN
 SET NOCOUNT ON;
 MERGE INTO dbo.Contacts_Original AS c
   USING (SELECT EmailAddress = @EmailAddress) AS p
   ON c.EmailAddress = p.EmailAddress
 WHEN NOT MATCHED THEN
   INSERT (EmailAddress) VALUES(p.EmailAddress);
END
GO
CREATE PROCEDURE dbo.Contact_Create_Split
 @LocalPart  VARCHAR(64),
 @DomainName VARCHAR(255)
AS
BEGIN
 SET NOCOUNT ON;
 MERGE INTO dbo.Contacts_Split AS c
   USING 
   (
  SELECT LocalPart  = @LocalPart,
         DomainName = @DomainName
   ) AS p
   ON  c.LocalPart  = p.LocalPart
   AND c.DomainName = p.DomainName
 WHEN NOT MATCHED THEN
   INSERT (LocalPart, DomainName) VALUES(p.LocalPart, p.DomainName);
END
GO
CREATE PROCEDURE dbo.Contact_Create_DomainFK
 @LocalPart  VARCHAR(64),
 @DomainName VARCHAR(255)
AS
BEGIN
 SET NOCOUNT ON;
 DECLARE @DomainID INT;
 SELECT @DomainID = DomainID 
   FROM dbo.EmailDomains
   WHERE DomainName = @DomainName;
 IF @DomainID IS NULL
 BEGIN
   INSERT dbo.EmailDomains(DomainName) SELECT @DomainName;
   SELECT @DomainID = SCOPE_IDENTITY();
 END
 MERGE INTO dbo.Contacts_DomainFK AS c
   USING (SELECT DomainID = @DomainID, LocalPart = @LocalPart) AS p
   ON c.DomainID = p.DomainID
   AND c.LocalPart = p.LocalPart
 WHEN NOT MATCHED THEN
   INSERT (DomainID, LocalPart) VALUES(p.DomainID, p.LocalPart);
END
GO

Then, using the query above, I selected the 10,000 unique e-mail addresses into a #temp table (so I could operate on the same set multiple times):

  ...
)
SELECT DISTINCT EmailAddress = y.localpart + '@' + y.domain, 
  DomainName = y.domain, LocalPart = y.localpart
  INTO #foo FROM x INNER JOIN y ON x.i = y.i;

Now I could create a cursor for each method and time the run:

DECLARE 
 @EmailAddress  VARCHAR(320),
 @DomainName    VARCHAR(255),
 @LocalPart     VARCHAR(64);
SELECT SYSDATETIME();
DECLARE c CURSOR 
  LOCAL STATIC FORWARD_ONLY READ_ONLY
  FOR SELECT EmailAddress, DomainName, LocalPart FROM #foo;
OPEN c;
FETCH NEXT FROM c INTO @EmailAddress, @DomainName, @LocalPart;
WHILE (@@FETCH_STATUS <> -1)
BEGIN
 --EXEC dbo.Contact_Create_Original @EmailAddress = @EmailAddress;
 --EXEC dbo.Contact_Create_Split    @LocalPart    = @LocalPart, @DomainName = @DomainName;
 --EXEC dbo.Contact_Create_DomainFK @LocalPart    = @LocalPart, @DomainName = @DomainName;
 FETCH NEXT FROM c INTO @EmailAddress, @DomainName, @LocalPart;
END
CLOSE c;
DEALLOCATE c;
SELECT SYSDATETIME();

Results:

Timing results for creating 10,000 contacts

I also wanted to check the disk space usage (and I could do a less complex version of a typical usage query because I know the table is not partitioned, there is no LOB or overflow data, and the only index in each table is the clustered index):

SELECT t.name, data_in_kb = 8 * (p.in_row_data_page_count)
  FROM sys.tables AS t
  INNER JOIN sys.dm_db_partition_stats AS p
  ON t.[object_id] = p.[object_id];

Results:

Space usage results

So when splitting the domains out into a separate table, we have a slight increase in cost when performing inserts, and a marginal savings in disk space (about 10%). In our case the savings were a substantially larger percentage because we were talking about extrapolating millions of addresses, and a larger percentage of those addresses were distributed across a handful of the most popular domains.

Counting contacts from a particular domain

I picked a domain with a lot of contacts (gmail.com) and a domain with one (1010.com), and timed counting each of those 10,000 times:

SELECT SYSDATETIME();
GO
DECLARE @c INT;
SELECT @c = COUNT(*) FROM dbo.Contacts_Original WHERE EmailAddress LIKE '%@gmail.com';
SELECT @c = COUNT(*) FROM dbo.Contacts_Original WHERE EmailAddress LIKE '%@1010.com';
GO 10000
SELECT SYSDATETIME();
GO
DECLARE @c INT;
SELECT @c = COUNT(*) FROM dbo.Contacts_Split WHERE DomainName = 'gmail.com';
SELECT @c = COUNT(*) FROM dbo.Contacts_Split WHERE DomainName = '1010.com';
GO 10000
SELECT SYSDATETIME();
GO
DECLARE @c INT;
SELECT @c = COUNT(*) FROM dbo.Contacts_DomainFK AS c 
 INNER JOIN  dbo.EmailDomains AS d ON c.DomainID = d.DomainID
 WHERE d.DomainName = 'gmail.com';
SELECT @c = COUNT(*) FROM dbo.Contacts_DomainFK AS c 
 INNER JOIN  dbo.EmailDomains AS d ON c.DomainID = d.DomainID
 WHERE d.DomainName = '1010.com';
GO 10000
SELECT SYSDATETIME();

Results:

Timing results for counting contacts by domain

Because of the JOIN, the inline split method is a little more efficient than breaking the domain out into its own table, but both are far more efficient than searching domain names within a complete e-mail address. Due to our specific needs, we actually solved this problem using an indexed view with DomainID and COUNT_BIG() - opting to pay the cost of that as up-front index maintenance and with a little additional storage (a little over 12 bytes * the number of distinct domains).

Updating all addresses to use a different domain

My last test was not a common use case but it happened more than once - for example when AT&T purchased MCI, they created a new domain name and simply moved all MCI customers over to the new one. Or at least we'll pretend in this case that the transition was that simple. The code to do this is not overly complicated, but for the purposes of producing measurable activity, I'm going to change them forward and back 500 times:

SELECT SYSDATETIME();
GO
UPDATE dbo.Contacts_Original 
 SET EmailAddress = LEFT(EmailAddress, CHARINDEX('@', EmailAddress))
 + 'att.net' WHERE EmailAddress LIKE '%@mci.com';
UPDATE dbo.Contacts_Original 
 SET EmailAddress = LEFT(EmailAddress, CHARINDEX('@', EmailAddress))
 + 'mci.com' WHERE EmailAddress LIKE '%@att.net';
GO 500
SELECT SYSDATETIME();
GO
UPDATE dbo.Contacts_Split
 SET DomainName = 'att.net' WHERE DomainName = 'mci.com';
UPDATE dbo.Contacts_Split
 SET DomainName = 'mci.com' WHERE DomainName = 'att.net';
GO 500
SELECT SYSDATETIME();
GO
UPDATE dbo.EmailDomains
 SET DomainName = 'att.net' WHERE DomainName = 'mci.com';
UPDATE dbo.EmailDomains
 SET DomainName = 'mci.com' WHERE DomainName = 'att.net';
GO 500
SELECT SYSDATETIME();

Results:

Timing results for updating a domain name

It shouldn't be surprising that substring-based string searching and manipulation will be significantly more expensive than wholesale updating and matching. And of course the most efficient outcome would be when only a single value is changed. Even if this isn't going to be a common use case, I lend some weight to the efficiency here because it is the kind of thing that could potentially take your table offline.

Conclusion

As you can see, it can make a difference in your workload by experimenting with various ways to store certain types of data, especially if a lot of your information is repeated or treated independently. You could extend this to think about other information like URLs - if you store a lot of google or bing search result URLs, for example, then you may be storing a whole lot of redundant information by keeping the URL fully intact.

Keep in mind this solution was initially developed in SQL Server 2005. In SQL Server 2008 and above, we would have experimented with Data Compression to hopefully achieve even greater savings. In a future tip, I'll extend this to see how different the impact is when compression is part of the equation.

Next Steps


Last Updated: 2012-03-13


get scripts

next tip button



About the author
MSSQLTips author Aaron Bertrand Aaron Bertrand (@AaronBertrand) is a Product Manager at SentryOne, with industry experience dating back to Classic ASP and SQL Server 6.5. He is editor-in-chief of the performance-related blog, SQLPerformance.com, and serves as a community moderator for the Database Administrators Stack Exchange.

View all my tips




Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Email me updates 


Signup for our newsletter
 I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.



    



Thursday, March 07, 2013 - 7:02:09 AM - janberjanson Back To Top

 

can i ask for a software?


Thursday, April 26, 2012 - 9:27:15 AM - Rob Back To Top

In DomainSplit have you thought of storing CHECKSUM([Domain]) in order to speed up Domain Lookups? You can even store it in a Indexed view if you don't want the code overhead of giving it a column.

 


Tuesday, March 27, 2012 - 10:13:51 AM - Aaron Bertrand Back To Top

But Duncan, if you have thousands of domains and millions of distinct e-mail addresses, it might be less efficient to do counts for all of them just to throw most of the results away (assuming we're after a single domain or a small subset). It all depends of course but this doesn't seem like the optimal approach at a high level.


Tuesday, March 27, 2012 - 9:17:36 AM - Duncan Back To Top

I have found COUNT() operations to be faster when I do the count, THEN join onto the FK table for reporting.


Thursday, March 15, 2012 - 5:45:47 AM - Usman Butt Back To Top

Thanks for the quick response. "DomainGroups" and country mapping was in my mind. Thanks once again.


Wednesday, March 14, 2012 - 10:48:24 AM - Aaron Bertrand Back To Top

Usman, if you index the DomainName, then you shouldn't have much difficulty searching for hotmail.co.uk vs. hotmail.com. We actually went a step further, and had another set of tables for what we called "DomainGroups" - so we could group all the Hotmail domains,, for example, using a relation table. This was because some domain groups were diverse, e.g. hotmail.com, msn.com, live.com, etc.


Wednesday, March 14, 2012 - 10:44:14 AM - Usman Butt Back To Top

Wonderful as always.

But I had the scenario where had to search hotmail/live id users for UK, USA and some other countries. Looking at this much much better design, just wandering the DomainName could be further divided??? Some other things come to mind as well but I guess that could making things more complex.


Wednesday, March 14, 2012 - 10:07:21 AM - JodyB Back To Top

This is some pretty great analysis with the code examples.

The performance differences in the techniques are something to keep in mind.


Learn more about SQL Server tools