Learn more about SQL Server tools

mssqltips logo
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Whitepapers          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

 

SQL Server User Defined Type To Store and Process Email Addresses


By:   |   Read Comments (2)   |   Related Tips: 1 | 2 | 3 | 4 | More > User Defined Type UDT

New Whitepaper > 10 Ways We Can Steal Your Data >> download now


Problem

When stored as strings, email addresses sort by the user portion of the address instead of by the domain or the top level domain (TLD).  That works for some applications, but in other applications queries would be faster if emails are stored in order by domain and TLD and easily searchable by those fields.  How does one go about creating a structure in SQL Server for indexing email addresses effectively?

Solution

Email addresses are defined in section 3.4.1 of RFC 5322.  While the spec allows a very general format the addresses that we'll work with are in the form:

<local-part>@<domain>.<tld>

The <local-part> is usually a user identifier.  The <domain> identifies the organization and the <tld> identifies the type of organization with a string such as com, gov, edu and net. 

When your application's job is to pick the right set of email addresses for sending bulk messages, it's the domain name and the TLD that are more meaningful than the local-part, or user-id, of the address.  While we need the complete address for operations such a sending our message, it's the domain and TLD that are used in searches.  These are the parts of the address that should be indexed.

The solution presented in this article is to create a SQL CLR User-Defined Type (UDT) to store email addresses.  UDT's are a CLR type that have been available since SQL Server 2005.   Using them gives you control over the storage format and the ability to add methods that aren't available with T-SQL types.  SQL Server 2008 uses system defined UDT's to implement the hierarchid and spacial types.

A SQLCLR UDT is a .Net type that has two attributes, [Serializable] and [SqlUserDefinedType], and also follows a few rules.  SQL Server stores the UDT in binary form and retrieves and reinstantiates the type to allow them to be used in T-SQL.  The type can also have methods that enable the inclusion of business logic along with the data.

Visual Studio 2008 creates UDT's the same way that it creates other SQLCLR objects based on the Database\SQLCLR project template.  If you're not familiar with the SQLCLR, you may want to refer to earlier artcles such as Writing to an operating system file using the SQL Server SQLCLR and SQL Server CLR function to concatenate values in a column. This article will also borrow the regular expression for validating an email address from SQL Server CLR function to improve performance of validating email addresses

Once the VS project is created, add a new type with the "Project\Add User-Defined Type..." menu command.  The template creates a skeleton type with the required [Serializable] and [SqlUserDefinedType] attributes.  It also creates the implementation of the INullable interface that UDT's must implement and adds required ToString and Parse methods. 

Here's the code for the UDT:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text.RegularExpressions;
 
[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedType(Format.UserDefined,
                                               MaxByteSize=256)]
public struct email : INullable , IBinarySerialize
{
    // Private members
    private bool m_Null;
    public string m_tld;
    public string m_domain;
    public string m_local_part;
 
    public override string ToString()
    {
        return m_local_part + "@" + m_domain + '.' + m_tld;
    }
 
    public bool IsNull
    {
        get
        {
            // Put your code here
            return m_Null;
        }
    }
 
    public static email Null
    {
        get
        {
            email h = new email();
            h.m_Null = true;
            return h;
        }
    }
 
    public static email Parse(SqlString s)
    {
        if (s.IsNull | s.Value.Length == 0) return Null;
 
        if (!IsValid(s)) return Null;
 
 
        email udt = new email();
 
        int at_pos = s.Value.IndexOf('@');
        int period_pos = s.Value.LastIndexOf('.');        
 
        udt.m_local_part = s.Value.Substring(0, at_pos);
        udt.m_domain = s.Value.Substring(at_pos + 1, period_pos - at_pos - 1);
        udt.m_tld = s.Value.Substring(period_pos + 1,
                             s.Value.Length - period_pos - 1);
 
        return udt;
    }
 
    private static bool IsValid (SqlString s)
    {
        return s.Value.Trim().Length<=256 
            & Regex.IsMatch(s.Value, 
               @"^([\w-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|
                   (([\w-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$");
    }
 
    // return the tld
    public string tld
    { [SqlMethodAttribute(OnNullCall=false, IsMutator=false, 
                   InvokeIfReceiverIsNull=false, IsDeterministic=true)]
        get {return m_tld;}
    }
 
    public string domain
    {  [SqlMethodAttribute(OnNullCall=false,  IsMutator=false, 
                   InvokeIfReceiverIsNull=false,  IsDeterministic=true)]
        get { return m_domain;}
    }
 
    public string local_part
    {  [SqlMethodAttribute(OnNullCall=false, IsMutator=false, 
                   InvokeIfReceiverIsNull=false, IsDeterministic=true)]
        get {return m_local_part;}
    }
 
    /// <summary>
    /// custom deserialization from the reader
    /// </summary>
    /// <param name="r">BinaryReader.</param>
    void IBinarySerialize.Read(System.IO.BinaryReader r)
    {
        m_Null = r.ReadBoolean();
        if (!m_Null) {
            m_local_part = r.ReadString();
            m_domain = r.ReadString();
            m_tld = r.ReadString();
        }
 
    }
 
    /// <summary>
    /// custom serialization to the writer
    /// </summary>
    /// <param name="w">BinaryWriter. </param>
    void IBinarySerialize.Write(System.IO.BinaryWriter w)
    {
        w.Write(m_Null);
        if (!m_Null) {
            w.Write(m_local_part);
            w.Write(m_domain);
            w.Write(m_tld);
        } 
    }
}

Due to the use of strings, email must implement the UserDefined serialization Format and thus must implement the IBinarySerialize interface to write and read the struct's fields.  We'll see later that the use of IBinarySerialize limits the options for indexing any field based on the email UDT.

UDT's must implment the INullable interface and the Visual Studio template creates it for us.  The required Parse function is the only way that the code provides to bring data into the class.  It validates the email and breaks it into the parts, local_part, domain and tld, that are needed for the application.  The regular expression is borrowed from the earlier article  SQL Server CLR function to improve performance of validating email addresses.  The regular expression string was broken into two lines so that the while expression is visible. 

There are three read-only properties for the parts of the email address named tld, domain, and local_part.  SQL Server allows us to access these properties from T-SQL in a way similar to using them in a .Net program.   Other methods, both static and instance, may be added to the class and are exposed by SQL Server.

Build and Deploy with Visual Studio with the "Build\Deploy" menu just like the other SQL CLR projects and the email type can be used.  This query uses it and shows how to access the read-only properties:

DECLARE @em dbo.email@nullem dbo.email
SET @em 'abc@def.com'
SELECT @em [@em]
     
@em.ToString() [ToString()]
     
@em.local_part [local part]
     
@em.domain [domain]
     
@em.tld [tld]
     
@nullem 
[nullem]
GO
@em                           ToString()   local part   domain   tld   nullem
----------------------------- ------------ ------------ -------- ----- -------
0x00036162630364656603636F6D  abc@def.com  abc          def      com   NULL   

Notice that SQL Server stores the UDT as binary. The first byte is the boolean false for the m_Null field.  The second byte is the length of the next string, three characters.  That's followed by the ascii representation of the three characters "abc".  The remainder of the binary representation follows.  To invoke a method, such as ToString the parenthesis must be placed after the method name.  To invoke the get method of properties only the property name is required.  SQL Server takes care of returning NULL for null email types.

There is a catch when invoking method and property names.  They're case sensitive! That holds true even if the default collation of the SQL Server is case insensitive and even if you write in VB.Net, which is normally case insensitive.  Here's what happens when you make a mistake in the case:

DECLARE @em dbo.email
SET @em 'abc@def.com'
SELECT @em [@em]
     
@em.ToString() [ToString()]
     
@em
.LOCAL_PART [LOCAL PART]
GO
Msg 6592, Level 16, State 3, Line 5
Could not find property or field 'LOCAL_PART' for type 'email' in assembly 'email_type'.

UDTs can be used in tables much like any built in type.  The following script creates the table enrollee and populates it with some sample data.  At the end there's a select with a where clause based on the primary_email.tld property:

CREATE TABLE enrollee 
               
(first_name VARCHAR(50) NOT NULL
                ,
last_name VARCHAR(50) NOT NULL
                ,
primary_email email NOT NULL
                )
GO
 
INSERT INTO enrollee (first_namelast_nameprimary_email)
 
VALUES ('Moe''Howard''moe@stooges.com')
       ,(
'Andy','Novick''anovick@novicksoftware.com')
       ,(
'Shemp','Howard''shemp@stooges.com')
       ,(
'Curley','Howard''curley@stooges.com')
       ,(
'Larry','Fine' 'Larry@fine_brothers.net')
       ,(
'Joe''Besser''joe@baldy.net')
       ,(
'Curley Joe','DeRita''cj@wardel.com')
GO
       
SELECT first_nameLast_nameprimary_email.ToString()
 
FROM enrollee WHERE primary_email.tld 
'net'
GO
first_name   Last_name   
------------ ----------- --------------------------
Larry        Fine        Larry@fine_brothers.net
Joe          Besser      joe@baldy.net

The execution plan shows a problem.  There is no primary key and as you'd expect the query requires a table scan. 

 is to create a SQL CLR User-Defined Type (UDT) to store email addresses.

The table scan isn't a performance problem when there are just seven rows, but how about when there's multiple pages of data. This query adds rows using data from the AdventureWorks sample database:

INSERT INTO enrollee (last_namefirst_nameprimary_email)
SELECT FirstNameLastNameEmailAddress
   
FROM adventureworks.Person.Contact
GO

UPDATE STATISTICS 
dbo.enrollee
GO
 
(19972 row(s) affected)
(1 row(s) affected)

There are just two rows with "net" as the tld, but the query continues to require a table scan.  That's slow and will get slower as the number of enrollees grows into the millions.  The solution, of course, is better indexing.  Let's try and add a primary key based on the primary_email column:

INSERT INTO enrollee (last_namefirst_nameprimary_email)
SELECT FirstNameLastNameEmailAddress
   
FROM adventureworks.Person.Contact
GO

UPDATE STATISTICS 
dbo.enrollee
GO
Msg 1919, Level 16, State 1, Line 1
Column 'primary_email' in table 'enrollee' is of a type that is 
invalid for use as a key column in an index.
Msg 1750, Level 16, State 0, Line 1
Could not create constraint. See previous errors.

SQL Server isn't going to allow the use of a UDT in an index.  There was a time before SQL Server 2005 SP1 when it was allowed, but there were enough problems with UDT's in indexes that the feature was removed. 

The solution to getting a better index on parts of a UDT is to create persisted computed columns on the parts of the UDT that should be indexed.  Then create a nonclusted index with the parts in the index.  The UDT can't be part of the index key. but it can be included so that the index will cover important queries.  Here's the script that implements the computed columns and the index:

ALTER TABLE enrollee
   
ADD tld AS primary_email.tld persisted
ALTER TABLE enrollee
   
ADD domain AS primary_email.domain persisted
GO

CREATE INDEX ix_enrollee_covery_by_tld_domain ON dbo.enrollee
   
(tlddomainlast_namefirst_name)
    
include (primary_email)
GO
Warning! The maximum key length is 900 bytes. The index 
'ix_enrollee_covery_by_tld_domain' has maximum length of 
16100 bytes. For some combination of large values, the 
insert/update operation will fail.
Warning: The maximum length of the row exceeds the permissible 
limit of 8060 bytes. For some combination of large values, 
the insert/update operation will fail.

The enrollee table now has tld and domain columns computed from the primary_email column.  The persisted clause means that the value of the computed columns are stored again.  The computed columns use as much storage as if these columns were added to the table and managed by code, but they have the advantage that they can't get out of sync with the primary_email column. The index has the computed columns and the two name columns in the key, but the key can't include the UDT so it's included.  The results of our select are the same:

SELECT first_nameLast_nameprimary_email.ToString()
 
FROM enrollee WHERE primary_email.tld 'net'
GO
first_name   Last_name   
------------ ----------- --------------------------
Larry        Fine        Larry@fine_brothers.net
Joe          Besser      joe@baldy.net

However, this time we get an index seek on our non-clustered index, which is much quicker than a table scan.

 this time we get an index seek on our non-clustered index, which is much quicker than a table scan
Next Steps
  • Incorporate UDTs in your next project
  • Be aware of the need to persist computed columns in order to index data in the UDT


Last Update:


signup button

next tip button



About the author





Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Notify for updates 


SQL tips:

*Enter Code refresh code     



Wednesday, June 21, 2017 - 4:00:32 AM - ema Back To Top

hi,

it s a very usefull post con clr and udt.

so, i have some problem in case hi insert null values in the column where it s definied the custum type.

for ex in:

 CREATE TABLE enrollee  

               (first_name VARCHAR(50) NOT NULL 
                ,
last_name VARCHAR(50) NOT NULL 
                ,
primary_email email NOT NULL 
                ) 
GO 
  
INSERT INTO enrollee (first_namelast_nameprimary_email
 
VALUES ('Moe''Howard''moe@stooges.com'
       ,(
'Andy','Novick'null)
GO 

 

then 

"System.Data.SqlTypes.SqlNullValueException: Data is Null. This method or property cannot be called on Null values."

 

how to fix it???

 

thank you!

 


Tuesday, July 24, 2012 - 1:48:51 PM - kiran Back To Top

hi sir,

i want to insert multiple atrribute into single record in sql..

for ex,a person has multiple phone no

 


Learn more about SQL Server tools