SQL Server User Defined Type To Store and Process Email Addresses

By: Andy Novick | Comments (2) | Related: 1 | 2 | 3 | 4 | > User Defined Type UDT

Problem

When stored as strings, email addresses sort by the user portion of the address instead of by the domain or the top level domain (TLD). That works for some applications, but in other applications queries would be faster if emails are stored in order by domain and TLD and easily searchable by those fields. How does one go about creating a structure in SQL Server for indexing email addresses effectively?

Solution

Email addresses are defined in section 3.4.1 of RFC 5322. While the spec allows a very general format the addresses that we'll work with are in the form:

<local-part>@<domain>.<tld>

The <local-part> is usually a user identifier. The <domain> identifies the organization and the <tld> identifies the type of organization with a string such as com, gov, edu and net.

When your application's job is to pick the right set of email addresses for sending bulk messages, it's the domain name and the TLD that are more meaningful than the local-part, or user-id, of the address. While we need the complete address for operations such a sending our message, it's the domain and TLD that are used in searches. These are the parts of the address that should be indexed.

The solution presented in this article is to create a SQL CLR User-Defined Type (UDT) to store email addresses. UDT's are a CLR type that have been available since SQL Server 2005. Using them gives you control over the storage format and the ability to add methods that aren't available with T-SQL types. SQL Server 2008 uses system defined UDT's to implement the hierarchid and spacial types.

A SQLCLR UDT is a .Net type that has two attributes, [Serializable] and [SqlUserDefinedType], and also follows a few rules. SQL Server stores the UDT in binary form and retrieves and reinstantiates the type to allow them to be used in T-SQL. The type can also have methods that enable the inclusion of business logic along with the data.

Visual Studio 2008 creates UDT's the same way that it creates other SQLCLR objects based on the Database\SQLCLR project template. If you're not familiar with the SQLCLR, you may want to refer to earlier artcles such as Writing to an operating system file using the SQL Server SQLCLR and SQL Server CLR function to concatenate values in a column. This article will also borrow the regular expression for validating an email address from SQL Server CLR function to improve performance of validating email addresses.

Once the VS project is created, add a new type with the "Project\Add User-Defined Type..." menu command. The template creates a skeleton type with the required [Serializable] and [SqlUserDefinedType] attributes. It also creates the implementation of the INullable interface that UDT's must implement and adds required ToString and Parse methods.

Here's the code for the UDT:

using System;

using System.Data;

using System.Data.SqlClient;

using System.Data.SqlTypes;

using Microsoft.SqlServer.Server;

using System.Text.RegularExpressions;

[Serializable]

[Microsoft.SqlServer.Server.SqlUserDefinedType(Format.UserDefined,

                                               MaxByteSize=256)]

public struct email : INullable , IBinarySerialize

    // Private members

    private bool m_Null;

    public string m_tld;

    public string m_domain;

    public string m_local_part;

    public override string ToString()

        return m_local_part + "@" + m_domain + '.' + m_tld;

    public bool IsNull

get

            // Put your code here

            return m_Null;

    public static email Null

get

            email h = new email();

            h.m_Null = true;

            return h;

    public static email Parse(SqlString s)

        if (s.IsNull | s.Value.Length == 0) return Null;

        if (!IsValid(s)) return Null;

        email udt = new email();

        int at_pos = s.Value.IndexOf('@');

        int period_pos = s.Value.LastIndexOf('.');

        udt.m_local_part = s.Value.Substring(0, at_pos);

        udt.m_domain = s.Value.Substring(at_pos + 1, period_pos - at_pos - 1);

        udt.m_tld = s.Value.Substring(period_pos + 1,

                             s.Value.Length - period_pos - 1);

        return udt;

    private static bool IsValid (SqlString s)

        return s.Value.Trim().Length<=256

            & Regex.IsMatch(s.Value,

               @"^([\w-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|

                   (([\w-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$");

    // return the tld

    public string tld

    { [SqlMethodAttribute(OnNullCall=false, IsMutator=false,

                   InvokeIfReceiverIsNull=false, IsDeterministic=true)]

        get {return m_tld;}

    public string domain

    {  [SqlMethodAttribute(OnNullCall=false,  IsMutator=false,

                   InvokeIfReceiverIsNull=false,  IsDeterministic=true)]

        get { return m_domain;}

    public string local_part

    {  [SqlMethodAttribute(OnNullCall=false, IsMutator=false,

                   InvokeIfReceiverIsNull=false, IsDeterministic=true)]

        get {return m_local_part;}

    /// <summary>

    /// custom deserialization from the reader

    /// </summary>

    /// <param name="r">BinaryReader.</param>

    void IBinarySerialize.Read(System.IO.BinaryReader r)

        m_Null = r.ReadBoolean();

        if (!m_Null) {

            m_local_part = r.ReadString();

            m_domain = r.ReadString();

            m_tld = r.ReadString();

    /// <summary>

    /// custom serialization to the writer

    /// </summary>

    /// <param name="w">BinaryWriter. </param>

    void IBinarySerialize.Write(System.IO.BinaryWriter w)

        w.Write(m_Null);

        if (!m_Null) {

            w.Write(m_local_part);

            w.Write(m_domain);

            w.Write(m_tld);

Due to the use of strings, email must implement the UserDefined serialization Format and thus must implement the IBinarySerialize interface to write and read the struct's fields. We'll see later that the use of IBinarySerialize limits the options for indexing any field based on the email UDT.

UDT's must implment the INullable interface and the Visual Studio template creates it for us. The required Parse function is the only way that the code provides to bring data into the class. It validates the email and breaks it into the parts, local_part, domain and tld, that are needed for the application. The regular expression is borrowed from the earlier article SQL Server CLR function to improve performance of validating email addresses. The regular expression string was broken into two lines so that the while expression is visible.

There are three read-only properties for the parts of the email address named tld, domain, and local_part. SQL Server allows us to access these properties from T-SQL in a way similar to using them in a .Net program. Other methods, both static and instance, may be added to the class and are exposed by SQL Server.

Build and Deploy with Visual Studio with the "Build\Deploy" menu just like the other SQL CLR projects and the email type can be used. This query uses it and shows how to access the read-only properties:

DECLARE @em dbo.email, @nullem dbo.email
SET @em = '[email protected]'
SELECT @em [@em]
     , @em.ToString() [ToString()]
     , @em.local_part [local part]
     , @em.domain [domain]
     , @em.tld [tld]
     , @nullem [nullem]
GO

@em                           ToString()   local part   domain   tld   nullem
----------------------------- ------------ ------------ -------- ----- -------
0x00036162630364656603636F6D  [email protected]  abc          def      com   NULL

Notice that SQL Server stores the UDT as binary. The first byte is the boolean false for the m_Null field. The second byte is the length of the next string, three characters. That's followed by the ascii representation of the three characters "abc". The remainder of the binary representation follows. To invoke a method, such as ToString the parenthesis must be placed after the method name. To invoke the get method of properties only the property name is required. SQL Server takes care of returning NULL for null email types.

There is a catch when invoking method and property names. They're case sensitive! That holds true even if the default collation of the SQL Server is case insensitive and even if you write in VB.Net, which is normally case insensitive. Here's what happens when you make a mistake in the case:

DECLARE @em dbo.email
SET @em = '[email protected]'
SELECT @em [@em]
, @em.ToString() [ToString()]
, @em.LOCAL_PART [LOCAL PART]
GO

Msg 6592, Level 16, State 3, Line 5
Could not find property or field 'LOCAL_PART' for type 'email' in assembly 'email_type'.

UDTs can be used in tables much like any built in type. The following script creates the table enrollee and populates it with some sample data. At the end there's a select with a where clause based on the primary_email.tld property:

CREATE TABLE enrollee
               (first_name VARCHAR(50) NOT NULL
                ,last_name VARCHAR(50) NOT NULL
                ,primary_email email NOT NULL
                )
GO

INSERT INTO enrollee (first_name, last_name, primary_email)
VALUES ('Moe', 'Howard', '[email protected]')
       ,('Andy','Novick', '[email protected]')
       ,('Shemp','Howard', '[email protected]')
       ,('Curley','Howard', '[email protected]')
       ,('Larry','Fine' , 'Larry@fine_brothers.net')
       ,('Joe', 'Besser', '[email protected]')
       ,('Curley Joe','DeRita', '[email protected]')
GO

SELECT first_name, Last_name, primary_email.ToString()
FROM enrollee WHERE primary_email.tld = 'net'
GO

first_name   Last_name   
------------ ----------- --------------------------
Larry        Fine        Larry@fine_brothers.net
Joe          Besser      [email protected]

The execution plan shows a problem. There is no primary key and as you'd expect the query requires a table scan.

is to create a SQL CLR User-Defined Type (UDT) to store email addresses.

The table scan isn't a performance problem when there are just seven rows, but how about when there's multiple pages of data. This query adds rows using data from the AdventureWorks sample database:

INSERT INTO enrollee (last_name, first_name, primary_email)
SELECT FirstName, LastName, EmailAddress
FROM adventureworks.Person.Contact
GO

UPDATE STATISTICS dbo.enrollee
GO

(19972 row(s) affected)
(1 row(s) affected)

There are just two rows with "net" as the tld, but the query continues to require a table scan. That's slow and will get slower as the number of enrollees grows into the millions. The solution, of course, is better indexing. Let's try and add a primary key based on the primary_email column:

INSERT INTO enrollee (last_name, first_name, primary_email)
SELECT FirstName, LastName, EmailAddress
FROM adventureworks.Person.Contact
GO

UPDATE STATISTICS dbo.enrollee
GO

Msg 1919, Level 16, State 1, Line 1
Column 'primary_email' in table 'enrollee' is of a type that is 
invalid for use as a key column in an index.
Msg 1750, Level 16, State 0, Line 1
Could not create constraint. See previous errors.

SQL Server isn't going to allow the use of a UDT in an index. There was a time before SQL Server 2005 SP1 when it was allowed, but there were enough problems with UDT's in indexes that the feature was removed.

The solution to getting a better index on parts of a UDT is to create persisted computed columns on the parts of the UDT that should be indexed. Then create a nonclusted index with the parts in the index. The UDT can't be part of the index key. but it can be included so that the index will cover important queries. Here's the script that implements the computed columns and the index:

ALTER TABLE enrollee
   ADD tld AS primary_email.tld persisted
ALTER TABLE enrollee
   ADD domain AS primary_email.domain persisted
GO

CREATE INDEX ix_enrollee_covery_by_tld_domain ON dbo.enrollee
   (tld, domain, last_name, first_name)
     include (primary_email)
GO

Warning! The maximum key length is 900 bytes. The index 
'ix_enrollee_covery_by_tld_domain' has maximum length of 
16100 bytes. For some combination of large values, the 
insert/update operation will fail.

Warning: The maximum length of the row exceeds the permissible 
limit of 8060 bytes. For some combination of large values, 
the insert/update operation will fail.

The enrollee table now has tld and domain columns computed from the primary_email column. The persisted clause means that the value of the computed columns are stored again. The computed columns use as much storage as if these columns were added to the table and managed by code, but they have the advantage that they can't get out of sync with the primary_email column. The index has the computed columns and the two name columns in the key, but the key can't include the UDT so it's included. The results of our select are the same:

SELECT first_name, Last_name, primary_email.ToString()
FROM enrollee WHERE primary_email.tld = 'net'
GO

first_name   Last_name   
------------ ----------- --------------------------
Larry        Fine        Larry@fine_brothers.net
Joe          Besser      [email protected]

However, this time we get an index seek on our non-clustered index, which is much quicker than a table scan.

Next Steps

Incorporate UDTs in your next project
Be aware of the need to persist computed columns in order to index data in the UDT

About the author

Andy Novick is a SQL Server Developer in the Boston area with 25 years of database and application development experience.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips

SQL Server User Defined Type To Store and Process Email Addresses

Problem

Solution

Next Steps

About the author

Comments For This Article