Learn more about SQL Server tools

mssqltips logo
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Scripts          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

 

Name Parsing with SQL Server Functions and T-SQL Programming


By:   |   Updated: 2014-05-22   |   Comments (4)   |   Related: 1 | 2 | 3 | More > Functions - System

Problem

My company receives names in a variety of different formats from business partners. The formatting for names varies substantially across business partners. However, the names need to be parsed into one consistent format to help with tracking details associated with persons. What SQL Server built-in functions and T-SQL programming techniques are especially helpful for parsing names into one format that come from vendors in different formats?

Solution

It is not uncommon to encounter situations where a single agency accepts data on persons from many other business partners. For example, a single agency may attempt to collect previously unpaid loan balances for two or more loans originated by a single lender or different lenders. Another typical example of the need to coordinate data by a person's name involves the collection of patient ratings for multiple treatment episodes at a hospital or other healthcare provider. In both these scenarios, it is not unusual for the same person's name to arrive for processing in different formats. One lender or healthcare provider can enter names so that the first, middle, and last names occur in that order, but another lender can enter names so that the last name is followed by a comma which precedes the first and middle names.

This is the first in a three-part series on how to process names in different formats to one consistent format. The three tips will focus on SQL Server functions that are especially useful for parsing names as well as SQL programming techniques for processing strings in an incoming name string to consistently formatted name parts, such as first name, middle name, and last name. This initial tip introduces the functions, highlights what's special about them for name parsing, and demonstrates basic name parsing programming tips with several examples.

Basic Functions for Parsing Names in SQL Server

In this tip, we are focusing on basic demonstrations for how to use five SQL Server string functions for name parsing. These are CHARINDEX, SUBSTRING, RIGHT, LEN, and REPLACE. You can use these functions to learn about the contents of a name string containing name parts and then extract the parts so that you can enter them into a staging table for tracking persons within and across data feeds. For this introductory tip, it is assumed that name parts are separated by just blanks, and the parts sometimes include a trailing comma. Our main focus will be handling strings that can vary by the number of name parts, such as first and last name only versus first name, middle name, last name and suffix. The goal is to equip you to devise one code base for handling name strings that correctly extracts name parts in as many different formats as your needs dictate.

One place to get started parsing name field values is to segment the values into name parts. Each string in a name field value can represent a part of a name. The CHARINDEX function is especially helpful for finding the location of delimiters for strings in a name field. For a name field value with just two strings for first and last names separated by a space, the CHARINDEX function can find the location of the blank between the name parts. If a name field value contains more than two strings for several different parts, then you can nest CHARINDEX function calls within one another to find the location of delimiters between successive strings within a name field value.

Armed with information about the location of the space between the first and last name parts in a name string, you can use the SUBSTRING function to extract each string part. Then, you can assign the string parts to the first name and last name in a staging table. The SUBSTRING function extracts a sequence of characters from a starting point. When extracting the first string from a name field, you start at position one. When extracting the second or another subsequent string, you start at one position past the space just after the preceding string part. The number of characters to extract can be computed. The computed value can depend on the position of the blank space and/or the length of a name field value. In my experience, it is typical to use an expression to compute the number of characters to extract from a name field value.

The LEN function can report the number of characters in a name field value. When extracting all the characters after some delimiter, such the last name after the first name, you can use the RIGHT function in combination with the LEN function value to specify the number of characters to extract.

It is sometimes handy when name parsing to be able to modify the selected characters within a name field value. For example, you can use the REPLACE function to remove commas from a name field value so that they are not included in a parsed name.

Parsing Just First Name and Last Name from a Name String in SQL

One especially easy name parsing example is to parse name field values when there are just two parts, such as a first name followed by a space and a second name. The only issue for this name parsing task is to dynamically adjust the lengths of strings extracted for the first and second names. You can implement the solution to the issue with information about the location of the blank space between the name parts as well as the length of the name field value.

The following script illustrates the basics of how to extract the first and last names from a name field value. The script commences with a DECLARE statement followed by a SET statement that assigns a string to the @nameString value (Jerome Sardo). The CHARINDEX function in a SELECT statement presents the syntax and lets you examine how it works for finding the position of the first space in the name field. The first character in the name field, J, has position 1 so that the location of the blank space is 7.

The characters for the first name extend from position 1 through position 6, which is one less than the first space's location. The SUBSTRING function takes three arguments. The first argument is for the name field. The second argument is for the location of the first character from which to start extracting. The third argument is for the number of characters to extract. The CHARINDEX function in a simple expression nests within the SUBSTRING function to specify the number of characters to extract.

The following script presents a couple of different approaches to extracting the last name.

  • The first approach uses the SUBSTRING function. This time extraction from the @nameString local variable begins with the first character after the space between the two strings in the local variable. The last argument designates the number of characters to extract as the difference between the length of the string in @nameString less the location of the first space in @nameString.
  • The second approach illustrates the use the RIGHT function instead of the SUBSTRING function to extract the second string from @nameString. The RIGHT function takes two arguments. The first argument is the name field name. The second argument is the number of characters from the end of the name field moving towards the beginning of the name field. In this case, the arithmetic expression for this number of characters is the same as the third argument for the SUBSTRING function approach.

The final SELECT statement in the code sample combines the SUBSTRING function for the first name with the RIGHT function for the last name to display the first name followed by a space and the last name. The point of the last SELECT statement is to remind you that you can combine the first and last names any way that you wish after you extract them from a name field value.

-- Code for parsing first name followed by last name
DECLARE @nameString as varchar(128)
SET @nameString = 'Jerome Sardo'
-- find the delimiter for the end of the first name
SELECT CHARINDEX(' ',@namestring,1) space_location
-- extract the first name
SELECT SUBSTRING(@nameString,1,CHARINDEX(' ',@namestring,1)-1) fname
-- extract the second name one way
SELECT SUBSTRING(@nameString, CHARINDEX(' ',@namestring,1)+1, 
   LEN(@namestring)-CHARINDEX(' ',@namestring,1)) lname_1
-- extract the second name a different way
SELECT RIGHT(@namestring, LEN(@namestring)-CHARINDEX(' ',@namestring,1)) lname_2
-- Bring the parts together as first name space last name
SELECT
SUBSTRING(@nameString,1,CHARINDEX(' ',@namestring,1)-1)
+ ' ' +
RIGHT(@namestring, LEN(@namestring)-CHARINDEX(' ',@namestring,1)) parsed_name_parts

Parsing a Name Field Value with a Suffix in T-SQL

There are numerous approaches to implementing name parsing for name field values in different formats. One strategy is to code parsing schemes for each name representation that you want to be able to parse. You can successively pass a name field value to different name parsing code implementations until you find one that works or you fail to discover a match and leave the name field value unparsed.

The next example is especially designed to parse a name with three parts in which the third part is a suffix, such as Jr. or III. The code sample begins by assigning the value 'Robert Dobson, Jr.' to a string variable. Alternate SET assignment statements that are commented out help you to readily confirm the same code fails to parse some names in other formats. A subsequent example will illustrate an another approach which is more flexible in the types of name formats that it can parse.

After assigning a value to the @nameString variable, the code discovers how many parts are in the @nameString variable. Three SELECT statements track the location of blanks in the @nameString variable.

  • The first SELECT statement looks for the first blank in the variable. If there are no blanks, it returns a value of 0.
  • The second SELECT statement looks for the second blank in the variable by nesting one CHARINDEX function within another. Notice that the search for the second blank begins one space AFTER the first blank.
  • The third SELECT statement looks for the third blank in the variable -- again, by using nested CHARINDEX functions.

You can edit the solution to exit out of processing @nameString values for which the code will generate an error. Aside from being optimized for three-part names with a suffix, the code may be suitable in some cases for parsing strings such as Robert William Dobson, Sr. On the other hand, the code sample will not parse names such as Microsoft or Robert Dobson. The code to track the location of blanks in a name field value can be used to facilitate exiting out of some name parsing code that will generate an error.

The code for extracting the first and second parts of a three-part name is generally similar to code for extracting the first and last name in the preceding code sample. One key distinction introduced with this sample is the optional removal of a comma after the second name part. The REPLACE function enables this feature by swapping the first comma in the second name part with a zero-length string. The change is optional in the sense that if there is no comma in the second name, then nothing gets replaced.

The code for extracting the third part in the @nameString variable begins at one space past the second blank in the nameString variable. Instead of specifying extraction of just the number of characters until the next blank, the LEN function specifies the total length of all the characters in the @nameString variable. This designation for the third argument has two benefits. First, you do not have to write an arithmetic expression for extracting precisely the number of characters in the third part. The SUBSTRING function automatically ceases retrieving characters when there are no more characters to extract. Second, if there are more name parts than just one after the second part, then the LEN function in the SUBSTRING function extracts all the remaining parts.

The closing section of the following code sample re-formats the extracted parts from the @nameString variable and changes the case as well. While the input source is in the format 'Robert Dobson, Jr.', the re-formatted output from the final SELECT statement is in the format 'DOBSON JR., ROBERT'. The case transformation is implemented with the UPPER function. All the other functions are described previously.

-- Code for parsing first and last names with a suffix
DECLARE @nameString as varchar(128),
 @firstSpaceLoc as smallint,
 @secondSpaceLoc as smallint,
 @thirdSpaceLoc as smallint

SET @nameString = 'Robert Dobson, Jr.'
--SET @nameString = 'Robert Dobson'
--SET @nameString = 'Microsoft'

-- How many strings are in the name?
-- Is there one space in the name
SELECT CHARINDEX(' ',@namestring,1) firstStringLoc
-- Is there second space in the name
SELECT CHARINDEX(' ', @namestring, CHARINDEX(' ',@nameString,1)+1) secondStringLoc
-- Is there a third space in the name
SELECT CHARINDEX(' ', @namestring, CHARINDEX(' ', @namestring, 
             CHARINDEX(' ',@nameString,1)+1)+1) thirdStringLoc
-- extract first string
SELECT LEFT(@nameString,CHARINDEX(' ',@namestring,1)-1) firstString
-- extract second string
SELECT
 SUBSTRING (
 @nameString, 
 CHARINDEX(' ',@namestring,1)+1, 
 CHARINDEX(' ', @namestring, CHARINDEX(' ',@nameString,1)+1) 
 - CHARINDEX(' ',@namestring,1)
 ) secondString

-- clean second string of a comma
SELECT 
REPLACE (
 SUBSTRING (
 @nameString, 
 CHARINDEX(' ',@namestring,1)+1, 
 CHARINDEX(' ', @namestring, 
 CHARINDEX(' ',@nameString,1)+1) 
 - CHARINDEX(' ',@namestring,1)
 ),
 ',',
 ''
 ) cleanedSecondString 
-- extract third string
SELECT SUBSTRING (
 @nameString,
 CHARINDEX(' ', @namestring, CHARINDEX(' ',@nameString,1)+1)+1,
 LEN(@nameString)
 ) thirdString
-- original whole name string
SELECT @nameString originalFormattedName
-- put string parts together in different order and case
SELECT
 UPPER
 (
 REPLACE (
 SUBSTRING (
 @nameString, 
 CHARINDEX(' ',@namestring,1)+1, 
 CHARINDEX(' ', @namestring, 
 CHARINDEX(' ',@nameString,1)+1) 
 - CHARINDEX(' ',@namestring,1)
 ),
 ',',
 ''
 ) 
 )
 + ' ' +
 UPPER
 (
 SUBSTRING (
 @nameString,
 CHARINDEX(' ', @namestring, CHARINDEX(' ',@nameString,1)+1)+1,
 LEN(@nameString)
 ) 
 )
 + ', ' +
 UPPER
 (
 LEFT(@nameString,CHARINDEX(' ',@namestring,1)-1)
 ) reformattedName

Parsing Names with One, Two, or Three Strings in SQL Server

The preceding two code samples are each especially designed to parse names arriving in a specific format. The first sample is especially tailored for names with two strings, and the second is tailored for names with three strings. Neither sample will handle a name with just one string, such as Microsoft. In addition, the sample for parsing three-part names does not successfully parse names with just two parts.

The next code sample handles name fields with one, two, or three name parts. The code sample achieves this flexibility by scanning the @nameString variable to keep track of how many spaces are in the name. Then, it uses that information to extract the strings for each name part in the variable. This code sample, and the previous ones, assumes a blank space delimits name parts. After a name part is extracted, the part is stored in a local variable for re-use, such as for re-formatting the names into an output format or assigning a value to a column of a staging table with parsed name parts.

The code relies on six local variables besides the one for the name field value. Three local variables keep track of the position of successive spaces in a name field. The code tracks whether there are 0, 1, 2, or 3 spaces in a name and the location of up to three spaces. If there are no spaces in a name, such as Microsoft, then all three local variables are 0. A name field containing three spaces, such as Robert William Dobson, Sr., assigns a non-zero value to all three local variables. The first local variable has the location of the space between Robert and William. The second local variable has the location of the space between William and Dobson. The third local variable has the location of the space between Dobson, and Sr.

In developing this name parsing solution, it was convenient to echo back the result of how many spaces were in a name. It is easy for name parsing code to become complex, and confirming intermediate results is often useful for unit testing. The code sample confirms how many spaces are detected in a SELECT statement with the original name field (@nameString) and three CASE...END statements. The successive CASE...END statements assess whether there are one, two, or three spaces in a name. If the code discovers no spaces in a name, then the first CASE...END statement indicates there is not one space in the name.

Another round of three CASE...END statements assign values to the local variables for the first, second, and third name parts. The first CASE...END statement assigns the characters before the first space to the first local variable for the first name part, such as Robert for the name Robert Dobson, Jr. If the first CASE...END statement detects that there are no spaces in a name, such as for Microsoft, then it merely assigns the whole name field to the first local variable.

The second CASE...END statement includes three WHEN clauses for assigning a value to the second name part; the values can be an empty string, the name part after the first space, or the name part between the first and second spaces with any commas removed. The values for the local variables tracking spaces in a name determine which WHEN clause makes the assignment for the second name part.

The third CASE...END statement also allows three possible values for the third name part. If the local variable for the first or second space is 0, then the third string is set to an empty string. If the position for the second space is greater than 0 and the position of the third space is 0, then third name part equals the characters from one beyond the second space and the end of the name field value. When the value of the third space position is greater than 0, then the third name part is set to the characters after the second space.

The final block of code reports out the parsed name parts in two formats. The first SELECT statement reports the original name field value followed by its name parts. The second SELECT statement re-formats the original name format to a specific output format. Notice that there is no need to re-specify the original parsing code when re-formatting the original name value. This is one special advantage of assigning parsed name parts to local variables.

-- Code for parsing a name with multiple parts

DECLARE     @nameString as varchar(max),
            @firstSpaceLoc as smallint,
            @secondSpaceLoc as smallint,
            @thirdSpaceLoc as smallint,
            @firstString as varchar(max),
            @secondString as varchar(max),
            @thirdString as varchar(max) 
 
--SET @nameString = 'Microsoft'
--SET @nameString = 'Robert Dobson'
SET @nameString = 'Robert Dobson, Jr.'
--SET @nameString = 'Robert William Dobson, Sr.'
 
-- How many strings are in the name?
 
-- Is there one space in the name
SET @firstSpaceLoc = CHARINDEX(' ',@namestring,1)
 
-- Is there second space in the name
SET @secondSpaceLoc = CHARINDEX(' ', @namestring, CHARINDEX(' ',@nameString,1)+1)
 
-- Is there a third space in the name
SET @thirdSpaceLoc = 
CASE
      WHEN CHARINDEX(' ', 
                              @namestring, 
                              CHARINDEX(' ',@nameString,1)+1) = 0 THEN 0
      WHEN CHARINDEX(' ', 
                              @namestring, 
                              CHARINDEX(' ',@nameString,1)+1) > 0 THEN
                                    CHARINDEX(' ', @namestring, 
                                    CHARINDEX(' ', @namestring, 
                                    CHARINDEX(' ',@nameString,1)+1)+1)
END
 
SELECT
            @nameString sourceString,
            CASE 
                        WHEN @firstSpaceLoc > 0 THEN 'There is one space'
                        ELSE 'There is not one space'
            END [Is there one space],
            CASE 
                        WHEN @secondSpaceLoc > 0 THEN 'There is a second space'
                        ELSE 'There is not a second space'
            END [Is there a second space],
            CASE 
                        WHEN @thirdSpaceLoc > 0 THEN 'There is a third space'
                        ELSE 'There is not a third space'
            END [Is there a third space]
 
 
-- extract and save strings
SELECT

       @firstString = 
       CASE
            WHEN @firstSpaceLoc > 0 THEN LEFT(@nameString,CHARINDEX(' ',@namestring,1)-1)
            ELSE @nameString
       END,
       @secondString =   
       CASE
            WHEN @firstSpaceLoc = 0 THEN ''
            WHEN @secondSpaceLoc = 0 THEN 
                        RIGHT(@namestring, LEN(@namestring)- CHARINDEX(' ',@namestring,1))
            WHEN @secondSpaceLoc > 0 THEN
                        REPLACE     (
                        SUBSTRING   (
                                       @nameString, 
                                       CHARINDEX(' ',@namestring,1)+1, 
                                       CHARINDEX(' ', @namestring, 
                                       CHARINDEX(' ',@nameString,1)+1) 
                                             - CHARINDEX(' ',@namestring,1)
                                     ),
                                     ',',
                                     ''
                                     )
            ELSE ''
      END,
      @thirdString =
      CASE
            WHEN @firstSpaceLoc = 0 OR @secondSpaceLoc = 0  THEN ''
            WHEN @secondSpaceLoc > 0 
                        AND @thirdSpaceLoc = 0 THEN
                        SUBSTRING   (
                                       @nameString,
                                       CHARINDEX(' ', @namestring, 
                                       CHARINDEX(' ',@nameString,1)+1)+1,
                                       LEN(@nameString)
                                     )     
            ELSE RIGHT(@namestring,LEN(@namestring) - @secondSpaceLoc)              
      END
 
 
-- Report names
SELECT
            @nameString sourceString,
            @firstString [First string],
            @secondString [Second string],
            @thirdString [Third string]
 
SELECT
            CASE
                  WHEN @thirdSpaceLoc > 0 THEN 
                                    @thirdString + ', '  + @firstString + ' ' + @secondString
                  WHEN @secondSpaceLoc > 0 AND @thirdSpaceLoc = 0 THEN 
                                    @secondString + ' '  + @thirdString + ', ' + @firstString
                  WHEN @firstSpaceLoc > 0 THEN 
                                    @secondString + ', '  + @firstString 
                  WHEN @firstSpaceLoc = 0 THEN 
                                    @firstString
            END [Reported Name]

Next Steps

If you decide you want to learn more about this topic, one obvious next step is to copy the scripts discussed in this tip and start modifying them based on your own custom requirements and desires to learn more about name parsing. If you are just starting out with SQL programming, you may want to take a little time and dig deeper into the operation and syntax for string functions mentioned and demonstrated in this article. I rank the MSDN site (http://msdn.microsoft.com/en-us/library/ms181984.aspx) as a particularly authoritative source. Also, I remind you about prior MSSQLTips.com coverage of string processing. Here are some prior tips for you to reference.

Much coverage of name parsing tends to focus on relatively narrow topics. This series of tips aims at imparting systematic coverage of best practices for name parsing. For example, this introductory tip on name parsing covers three examples. When implementing name parsing solutions, it is often a good idea to nest one string function within another. You can frequently extract parts from a name field value by finding the location of successive blank spaces (or other delimiters) within a field value. You can typically simplify your code by saving intermediate parsing results in local variables and then operating on local variables instead of string expressions used to assigned values to functions.



Last Updated: 2014-05-22


get scripts

next tip button



About the author




Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Email me updates 


Signup for our newsletter
 I agree by submitting my data to receive communications, account updates and/or special offers about SQL Server from MSSQLTips and/or its Sponsors. I have read the privacy statement and understand I may unsubscribe at any time.



    



Tuesday, May 27, 2014 - 11:23:12 PM - Rick Dobson Back To Top

You have discovered that name parsing is tricky.  That's a significant reason why I am tackling the issue with a three-part set of tips.

I believe that all the questions you pose are resolvable after mastering the basics of name parsing and a sufficient amount of software engineering.  The first tip in my three-part series on name parsing introduces some basic techniques that you can use to start building your T-SQL name parsing skill.  I particularly direct you to the third script which handles name strings having one through four name parts.

My second tip, which I am working on now, will get closer to answering some of your specific question.  However, the point of the series is to equip folks to write their own name parsing solutions.  I urge you to take one of your simpler name parsing issues which is currently not solved and give it a try.  If you get it solved, then you can move on to a more complicated example.  If you cannot get it solved, maybe you can share your progress, and others can help you get to a solution for that problem.


Tuesday, May 27, 2014 - 6:15:39 AM - Satishchandra vishwakarma Back To Top

Hi Rick,

Thanks for sharing this article,we have also same requirement about name parsing but in that we have first_name,middle_name,last_name,prefix and suffix too same as USA naming format.

and I am confused because in some name complete name is given in some names just first_name,and last_name is given or some where only first_name is given so when I start parsing the sata it gets shuffled and the first_name goes into prefix column or many type of error comes.

please let me know how to proceed.

 

Thanks and regards.

Satishchandra Vishwakarma


Saturday, May 24, 2014 - 8:15:59 AM - Rick Dobson Back To Top

In my opinion, automated name parsing is meant to generate solutions that work for most of the data most of the time.  For example, you may not parse correctly persons with two-part first names who have a middle name.  On the other hand, if persons with two-part first names were sufficiently critical to some application, then some pre-processing or extension of the parsing techniques discussed in the first tip may be created to handle this scenario.

The first tip, or even the whole series of tips, is not meant to handle every name parsing requirement that could ever exist.  On the contrary, the name parsing series is to empower you with some basic capabilities so that can go on and build name parsing solutions to any level of complexity that your requirements and resources permit.  Hopefully, some of you may be motivated to include an extension in a reply or even in a separate tip of your own.  I look forward to reading them.


Thursday, May 22, 2014 - 7:39:19 PM - TimothyAWiseman Back To Top

I rather like this article and I am eager to see the other parts of the series.

I think two things are worth noting though.  First it is not uncommon to see first names that have multiple parts.  For instance, I have met a lady named "Mary Anne" who also had a middle name.  This makes separating a name into parts with mere spaces complicated.

 

The other thing is that while this is a great T-SQL Solution, T-SQL is not as good at string manipulation as most other languages, and for truly complex parsing it may be worth either exporting the parsing to the application layer which is written in a more suitable language or implementing it as a CLR.


Learn more about SQL Server tools