Storing E-mail addresses more efficiently in SQL Server - Part 2



Problem

In my last tip, I shared some ideas for determining if you should consider breaking up the e-mail addresses you're storing, even putting the domain names in a separate table. I performed storage and timing comparisons for working with 10,000 unique e-mail addresses, but I completely ignored data compression. I wanted to revisit the same test case and apply data compression to the tables and see how that impacted the outcome.

Solution

If you remember, we were storing contacts in three different ways; this time, we're simply going to apply page compression to each table.
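
In this tip I just recreate the tables with the compression option in the DDL, but if you wanted to keep the Part 1 tables and their data instead, a rebuild can apply page compression after the fact. A minimal sketch, assuming the same table names from Part 1:

ALTER TABLE dbo.Contacts_Original REBUILD WITH (DATA_COMPRESSION = PAGE);
-- A table rebuild only affects the heap / clustered index; any nonclustered
-- indexes would need their own rebuild, for example:
-- ALTER INDEX ALL ON dbo.Contacts_Original REBUILD WITH (DATA_COMPRESSION = PAGE);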

The original table:

CREATE TABLE dbo.Contacts_Original
(
    EmailAddress VARCHAR(320) PRIMARY KEY
) WITH (DATA_COMPRESSION = PAGE);

A table where we split the local part from the domain name:

CREATE TABLE dbo.Contacts_Split
(
    LocalPart    VARCHAR(64)  NOT NULL,
    DomainName   VARCHAR(255) NOT NULL,
    PRIMARY KEY  (DomainName, LocalPart)
) WITH (DATA_COMPRESSION = PAGE);

And finally, a two-table solution where domain names were stored in their own separate table:

CREATE TABLE dbo.EmailDomains
(
    DomainID     INT IDENTITY(1,1) PRIMARY KEY,
    DomainName   VARCHAR(255) NOT NULL UNIQUE
) WITH (DATA_COMPRESSION = PAGE);
CREATE TABLE dbo.Contacts_DomainFK
(
    DomainID     INT NOT NULL FOREIGN KEY
                 REFERENCES dbo.EmailDomains(DomainID),
    LocalPart    VARCHAR(64)  NOT NULL,
    PRIMARY KEY  (DomainID, LocalPart)
) WITH (DATA_COMPRESSION = PAGE);
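
To make sure the option actually took effect on all four tables, a quick check against sys.partitions shows the compression level recorded for each one:

SELECT t.name, p.index_id, p.data_compression_desc
  FROM sys.tables AS t
  INNER JOIN sys.partitions AS p
  ON t.[object_id] = p.[object_id]
  WHERE t.name IN (N'Contacts_Original', N'Contacts_Split', N'EmailDomains', N'Contacts_DomainFK');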

Again we'll run these tables through the same three tests:

  1. Inserting 10,000 new e-mail addresses
  2. Counting the number of addresses in use by a specific domain
  3. Updating all the addresses for one domain to use a new domain

Again we can use the same query to generate 10,000 unique e-mail addresses:

;WITH x(n,i,rn) AS
(
    SELECT TOP (10000) LOWER(s1.name),
        ABS(s2.[object_id] % 10),
        ROW_NUMBER() OVER (ORDER BY s1.[object_id])
    FROM sys.all_columns AS s1
    CROSS JOIN sys.all_objects AS s2
),
y(i, localpart, domain) AS
(
    SELECT i, n+RTRIM(rn),
        CASE WHEN i <= 2 THEN 'gmail.com'
             WHEN i <= 4 THEN 'hotmail.com'
             WHEN i = 5 THEN 'aol.com'
             WHEN i = 6 THEN 'compuserve.net'
             WHEN i = 7 THEN 'mci.com'
             WHEN i = 8 THEN 'hotmail.co.uk'
             ELSE RTRIM(rn) + '.com'
        END
    FROM x
)
SELECT DISTINCT y.localpart, y.domain, email = y.localpart + '@' + y.domain
  FROM x INNER JOIN y ON x.i = y.i;

So with that in place, we can proceed with our tests.

Creating 10,000 contact rows

Again, we have the same three stored procedures, one to handle each case: the original table, the split table, and the domain FK tables.

CREATE PROCEDURE dbo.Contact_Create_Original
 @EmailAddress VARCHAR(320)
AS
BEGIN
 SET NOCOUNT ON;
 MERGE INTO dbo.Contacts_Original AS c
   USING (SELECT EmailAddress = @EmailAddress) AS p
   ON c.EmailAddress = p.EmailAddress
 WHEN NOT MATCHED THEN
   INSERT (EmailAddress) VALUES(p.EmailAddress);
END
GO
CREATE PROCEDURE dbo.Contact_Create_Split
 @LocalPart  VARCHAR(64),
 @DomainName VARCHAR(255)
AS
BEGIN
 SET NOCOUNT ON;
 MERGE INTO dbo.Contacts_Split AS c
   USING 
   (
  SELECT LocalPart  = @LocalPart,
         DomainName = @DomainName
   ) AS p
   ON  c.LocalPart  = p.LocalPart
   AND c.DomainName = p.DomainName
 WHEN NOT MATCHED THEN
   INSERT (LocalPart, DomainName) VALUES(p.LocalPart, p.DomainName);
END
GO
CREATE PROCEDURE dbo.Contact_Create_DomainFK
 @LocalPart  VARCHAR(64),
 @DomainName VARCHAR(255)
AS
BEGIN
 SET NOCOUNT ON;
 DECLARE @DomainID INT;
 SELECT @DomainID = DomainID 
   FROM dbo.EmailDomains
   WHERE DomainName = @DomainName;
 IF @DomainID IS NULL
 BEGIN
   INSERT dbo.EmailDomains(DomainName) SELECT @DomainName;
   SELECT @DomainID = SCOPE_IDENTITY();
 END
 MERGE INTO dbo.Contacts_DomainFK AS c
   USING (SELECT DomainID = @DomainID, LocalPart = @LocalPart) AS p
   ON c.DomainID = p.DomainID
   AND c.LocalPart = p.LocalPart
 WHEN NOT MATCHED THEN
   INSERT (DomainID, LocalPart) VALUES(p.DomainID, p.LocalPart);
END
GO
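
If you want to sanity-check the procedures before the timed runs, a single call to each works; the sample values below are made up, and you'd want to clear those rows out again before measuring anything.

EXEC dbo.Contact_Create_Original @EmailAddress = 'someone@example.com';
EXEC dbo.Contact_Create_Split    @LocalPart = 'someone', @DomainName = 'example.com';
EXEC dbo.Contact_Create_DomainFK @LocalPart = 'someone', @DomainName = 'example.com';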

And again, we dump the 10,000 unique addresses into a #temp table for re-use:

  ...
)
SELECT DISTINCT EmailAddress = y.localpart + '@' + y.domain, 
  DomainName = y.domain, LocalPart = y.localpart
  INTO #foo FROM x INNER JOIN y ON x.i = y.i;

And once again I could create a cursor for each method and time the run:

DECLARE 
 @EmailAddress  VARCHAR(320),
 @DomainName    VARCHAR(255),
 @LocalPart     VARCHAR(64);
SELECT SYSDATETIME();
DECLARE c CURSOR 
  LOCAL STATIC FORWARD_ONLY READ_ONLY
  FOR SELECT EmailAddress, DomainName, LocalPart FROM #foo;
OPEN c;
FETCH NEXT FROM c INTO @EmailAddress, @DomainName, @LocalPart;
WHILE (@@FETCH_STATUS <> -1)
BEGIN
 --EXEC dbo.Contact_Create_Original @EmailAddress = @EmailAddress;
 --EXEC dbo.Contact_Create_Split    @LocalPart    = @LocalPart, @DomainName = @DomainName;
 --EXEC dbo.Contact_Create_DomainFK @LocalPart    = @LocalPart, @DomainName = @DomainName;
 FETCH NEXT FROM c INTO @EmailAddress, @DomainName, @LocalPart;
END
CLOSE c;
DEALLOCATE c;
SELECT SYSDATETIME();
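
If you'd rather see a single elapsed number than subtract two timestamps by eye, you could capture them in variables and let DATEDIFF do the math; a quick sketch:

DECLARE @start DATETIME2(7) = SYSDATETIME();

-- ... run one of the three cursor loops here ...

DECLARE @end DATETIME2(7) = SYSDATETIME();
SELECT elapsed_ms = DATEDIFF(MILLISECOND, @start, @end);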

The results here are almost identical to the results without compression:

Timing results for creating 10,000 contacts

I also wanted to check the disk space usage again:

SELECT t.name, data_in_kb = 8 * (p.in_row_data_page_count)
  FROM sys.tables AS t
  INNER JOIN sys.dm_db_partition_stats AS p
  ON t.[object_id] = p.[object_id]
  WHERE p.index_id IN (0,1);

Results:

Space usage results

So when splitting the domains out into a separate table, we again see a slight increase in cost when performing inserts, and mixed results in terms of disk space. This could partially be because the "random" domain names in use are all about 8 characters long and probably compress a lot better than the integer values do - I suspect the differences would be much more substantial if the domain names followed a more realistic and volatile pattern. I'll leave that, perhaps, for another discussion.
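
As an aside, if you want a rough idea of how well a table will compress before committing to the rebuild, SQL Server can estimate it for you. For example, for the clustered index of the original table:

EXEC sys.sp_estimate_data_compression_savings
  @schema_name      = N'dbo',
  @object_name      = N'Contacts_Original',
  @index_id         = 1,      -- the clustered index
  @partition_number = NULL,   -- all partitions
  @data_compression = N'PAGE';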

Counting contacts from a particular domain

Again I picked a domain with a lot of contacts (gmail.com) and a domain with one (1010.com), and timed counting each of those 10,000 times:

SELECT SYSDATETIME();
GO
DECLARE @c INT;
SELECT @c = COUNT(*) FROM dbo.Contacts_Original WHERE EmailAddress LIKE '%@gmail.com';
SELECT @c = COUNT(*) FROM dbo.Contacts_Original WHERE EmailAddress LIKE '%@1010.com';
GO 10000
SELECT SYSDATETIME();
GO
DECLARE @c INT;
SELECT @c = COUNT(*) FROM dbo.Contacts_Split WHERE DomainName = 'gmail.com';
SELECT @c = COUNT(*) FROM dbo.Contacts_Split WHERE DomainName = '1010.com';
GO 10000
SELECT SYSDATETIME();
GO
DECLARE @c INT;
SELECT @c = COUNT(*) FROM dbo.Contacts_DomainFK AS c 
 INNER JOIN  dbo.EmailDomains AS d ON c.DomainID = d.DomainID
 WHERE d.DomainName = 'gmail.com';
SELECT @c = COUNT(*) FROM dbo.Contacts_DomainFK AS c 
 INNER JOIN  dbo.EmailDomains AS d ON c.DomainID = d.DomainID
 WHERE d.DomainName = '1010.com';
GO 10000
SELECT SYSDATETIME();

Results:

Timing results for counting contacts by domain

The original method is still by far the least efficient approach for this operation, but now the separate domain table comes into a virtual tie with the split method (probably because significantly fewer pages are required to satisfy these specific queries).
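
One quick way to confirm the page-count theory is to compare logical reads for a single run of each variation with STATISTICS IO turned on:

SET STATISTICS IO ON;
SELECT COUNT(*) FROM dbo.Contacts_Original WHERE EmailAddress LIKE '%@gmail.com';
SELECT COUNT(*) FROM dbo.Contacts_Split    WHERE DomainName = 'gmail.com';
SELECT COUNT(*) FROM dbo.Contacts_DomainFK AS c
  INNER JOIN dbo.EmailDomains AS d ON c.DomainID = d.DomainID
  WHERE d.DomainName = 'gmail.com';
SET STATISTICS IO OFF;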

Updating all addresses to use a different domain

Finally, we test changing a domain name and then changing it back, 500 times:

SELECT SYSDATETIME();
GO
UPDATE dbo.Contacts_Original 
 SET EmailAddress = LEFT(EmailAddress, CHARINDEX('@', EmailAddress))
 + 'att.net' WHERE EmailAddress LIKE '%@mci.com';
UPDATE dbo.Contacts_Original 
 SET EmailAddress = LEFT(EmailAddress, CHARINDEX('@', EmailAddress))
 + 'mci.com' WHERE EmailAddress LIKE '%@att.net';
GO 500
SELECT SYSDATETIME();
GO
UPDATE dbo.Contacts_Split
 SET DomainName = 'att.net' WHERE DomainName = 'mci.com';
UPDATE dbo.Contacts_Split
 SET DomainName = 'mci.com' WHERE DomainName = 'att.net';
GO 500
SELECT SYSDATETIME();
GO
UPDATE dbo.EmailDomains
 SET DomainName = 'att.net' WHERE DomainName = 'mci.com';
UPDATE dbo.EmailDomains
 SET DomainName = 'mci.com' WHERE DomainName = 'att.net';
GO 500
SELECT SYSDATETIME();

Results:

Timing results for updating a domain name

Once again it should come as no surprise that changing the domain embedded in a full e-mail address, as opposed to changing a standalone value, is going to be more expensive; however, the reduced I/O afforded by compression does bring this more in line with the split alternative. And since only a single-value change is required when the domain names are stored separately, that approach still comes in as the clear winner.
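
One practical note on the separate domain table: if the application still expects the full e-mail address as a single string, a simple view can stitch it back together (the view name here is just for illustration):

CREATE VIEW dbo.Contacts_DomainFK_Emails
AS
  SELECT EmailAddress = c.LocalPart + '@' + d.DomainName,
         c.LocalPart, d.DomainName
    FROM dbo.Contacts_DomainFK AS c
    INNER JOIN dbo.EmailDomains AS d
    ON c.DomainID = d.DomainID;
GO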

Conclusion

We can still see that different storage approaches affect both your workload and your storage requirements, with or without data compression. It's worth experimenting with different ways of storing your data and observing how the various components of your typical workload respond to the changes.

Next Steps





About the author
Aaron Bertrand (@AaronBertrand) is a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. He is editor-in-chief of the performance-related blog SQLPerformance.com, and also blogs at sqlblog.org.




Comments For This Article




Monday, July 7, 2014 - 1:24:35 PM - Aaron Bertrand

Bob, maybe - I didn't test that approach, but why store the whole domain not just once for every row, but TWICE for every row? With longer domain names and larger databases this is not going to make the problem any better... that's a lot of storage and I/O to get "the same" performance in one case.


Monday, July 7, 2014 - 11:47:51 AM - Bob B
What about using a computed column to get the domain name and then applying an index to the computed column? The index will add to the storage of the original method, but should make it perform on par with the other 2 solutions on the count-by-domain test (as long as the query is changed to use the computed column instead of LIKE in the WHERE clause).

Friday, April 27, 2012 - 8:38:31 AM - Aaron Bertrand

Charlie, I see the "what does disk space cost?" argument often. Let's forget for a moment the fact that in most systems today I/O is still the biggest bottleneck. Even on systems with SSDs, retrieving data from disk and storing it in memory are still real parts of the process - the more redundant data you are storing on disk, the more space it requires in memory, and the longer it takes to move in and out of memory. This approach provided a real, measurable impact on our entire workload, top to bottom; I just chose some very simple examples to demonstrate here. Would the same change make a huge impact on hardware you went out and bought tomorrow? Probably not as drastic, no. That was x86 class hardware with small amounts of memory, so the difference was noticeable. That said, the above numbers don't lie, either. At scale, acting like disk space is extremely expensive will benefit you in the long run. For the same reason we still don't use GUIDs as primary keys if we don't have to.


Friday, April 27, 2012 - 1:49:22 AM - Dimitrios Staikos

OK, I see that you do it with the primary key on both fields.

I usually write such things as unique check constraints, so my mind didn't register :-)


Friday, April 27, 2012 - 1:47:20 AM - Dimitrios Staikos

That's a nice idea, but how do you prevent duplicate email addresses from being entered, if we suppose you care about this?

Looks like you can't write a simple constraint for that. If you do it manually (with an SP checking things during insert) I am not sure it can work without locking the whole table with the local part.

So you probably need a trigger or something to guard against this. I don't know how that would affect your results, since it will add extra processing in the split tables scenario.

Also it might make sense for some kinds of queries to add a CLUSTERED index on Contacts_DomainFK(DomainID), so that emails of the same domain are physically close to each other. This might delay your random inserts (done once, who cares), but might improve other things like counting emails per domain, retrieving all mails from a domain, etc.

Moreover, in all TABLES that I design, and all tests that I do, all my tables have an int column named ID which is the primary key, rather than using two fields as the primary key. It makes joins clear and all queries more efficient, I believe.


Thursday, April 26, 2012 - 9:20:36 AM - Charlie Harnett

While this is nifty in some ways, it seems to me the practical value is limited.

First, what does disk space cost? Not much. We no longer care about compression for space savings; there must be some other, highly significant performance benefit to justify the added complexity of the split or multi-table solution. Getting apps into service rapidly is generally more valuable here.

Second, what's the real performance benefit here? In this case, the author tested changing domain names 500 times. That test is tailored to a particular solution; is this a realistic test for a typical site? If a user changes his or her domain name, is he or she not almost as likely to change the name? Similarly, is counting addresses by domain name a realistic test? We might look at domain names registered in our client database as some sort of indication of how many different companies are represented, but many of these client addresses will be "Yahoo" or "Gmail", and a simple count is going to be somewhat misleading.


Monday, April 16, 2012 - 8:43:29 AM - Usman Butt

Nice one again. Thanks for keeping it going.

Moreover, with data compression, CPU is always a key factor, so I believe fluctuations in CPU usage should be shown as well (although the above UPDATE example would have the least CPU usage for Domain_FK, some complex filter, join, order by, etc. queries would have been ideal). I may be asking for too much :), but I think there is room to see the performance of row-level compression as well, especially for the Domain_FK case in a DML-heavy environment and/or with low CPU resources. Maybe in another tip?

Furthermore, the 10,000-row creation script would not always yield 10,000 rows (at least it did not for my new db; with DISTINCT that seems to be the case anyhow, unless I missed something).


Thursday, April 12, 2012 - 3:51:13 PM - Aaron Bertrand

Pretty generic hardware in a virtual machine - 4 CPUs + 8 GB RAM. The point of the exercise wasn't to see how fast anything is on any specific hardware, but rather how the different approaches vary on the *same* hardware.


Thursday, April 12, 2012 - 1:52:42 PM - Kev

Good case study; however, I would like to know what hardware you were using for your test?














