By: Rick Dobson | Comments (4) | Related: 1 | 2 | 3 | > Functions System
Problem
My company receives names in a variety of different formats from business partners. The formatting for names varies substantially across business partners. However, the names need to be parsed into one consistent format to help with tracking details associated with persons. What SQL Server built-in functions and T-SQL programming techniques are especially helpful for parsing names into one format that come from vendors in different formats?
Solution
It is not uncommon to encounter situations where a single agency accepts data on persons from many other business partners. For example, a single agency may attempt to collect previously unpaid loan balances for two or more loans originated by a single lender or different lenders. Another typical example of the need to coordinate data by a person's name involves the collection of patient ratings for multiple treatment episodes at a hospital or other healthcare provider. In both these scenarios, it is not unusual for the same person's name to arrive for processing in different formats. One lender or healthcare provider can enter names so that the first, middle, and last names occur in that order, but another lender can enter names so that the last name is followed by a comma which precedes the first and middle names.
This is the first in a three-part series on how to process names in different formats to one consistent format. The three tips will focus on SQL Server functions that are especially useful for parsing names as well as SQL programming techniques for processing strings in an incoming name string to consistently formatted name parts, such as first name, middle name, and last name. This initial tip introduces the functions, highlights what's special about them for name parsing, and demonstrates basic name parsing programming tips with several examples.
Basic Functions for Parsing Names in SQL Server
In this tip, we are focusing on basic demonstrations for how to use five SQL Server string functions for name parsing. These are CHARINDEX, SUBSTRING, RIGHT, LEN, and REPLACE. You can use these functions to learn about the contents of a name string containing name parts and then extract the parts so that you can enter them into a staging table for tracking persons within and across data feeds. For this introductory tip, it is assumed that name parts are separated by just blanks, and the parts sometimes include a trailing comma. Our main focus will be handling strings that can vary by the number of name parts, such as first and last name only versus first name, middle name, last name and suffix. The goal is to equip you to devise one code base for handling name strings that correctly extracts name parts in as many different formats as your needs dictate.
One place to get started parsing name field values is to segment the values into name parts. Each string in a name field value can represent a part of a name. The CHARINDEX function is especially helpful for finding the location of delimiters for strings in a name field. For a name field value with just two strings for first and last names separated by a space, the CHARINDEX function can find the location of the blank between the name parts. If a name field value contains more than two strings for several different parts, then you can nest CHARINDEX function calls within one another to find the location of delimiters between successive strings within a name field value.
Armed with information about the location of the space between the first and last name parts in a name string, you can use the SUBSTRING function to extract each string part. Then, you can assign the string parts to the first name and last name in a staging table. The SUBSTRING function extracts a sequence of characters from a starting point. When extracting the first string from a name field, you start at position one. When extracting the second or another subsequent string, you start at one position past the space just after the preceding string part. The number of characters to extract can be computed. The computed value can depend on the position of the blank space and/or the length of a name field value. In my experience, it is typical to use an expression to compute the number of characters to extract from a name field value.
The LEN function can report the number of characters in a name field value. When extracting all the characters after some delimiter, such the last name after the first name, you can use the RIGHT function in combination with the LEN function value to specify the number of characters to extract.
It is sometimes handy when name parsing to be able to modify the selected characters within a name field value. For example, you can use the REPLACE function to remove commas from a name field value so that they are not included in a parsed name.
Parsing Just First Name and Last Name from a Name String in SQL
One especially easy name parsing example is to parse name field values when there are just two parts, such as a first name followed by a space and a second name. The only issue for this name parsing task is to dynamically adjust the lengths of strings extracted for the first and second names. You can implement the solution to the issue with information about the location of the blank space between the name parts as well as the length of the name field value.
The following script illustrates the basics of how to extract the first and last names from a name field value. The script commences with a DECLARE statement followed by a SET statement that assigns a string to the @nameString value (Jerome Sardo). The CHARINDEX function in a SELECT statement presents the syntax and lets you examine how it works for finding the position of the first space in the name field. The first character in the name field, J, has position 1 so that the location of the blank space is 7.
The characters for the first name extend from position 1 through position 6, which is one less than the first space's location. The SUBSTRING function takes three arguments. The first argument is for the name field. The second argument is for the location of the first character from which to start extracting. The third argument is for the number of characters to extract. The CHARINDEX function in a simple expression nests within the SUBSTRING function to specify the number of characters to extract.
The following script presents a couple of different approaches to extracting the last name.
- The first approach uses the SUBSTRING function. This time extraction from the @nameString local variable begins with the first character after the space between the two strings in the local variable. The last argument designates the number of characters to extract as the difference between the length of the string in @nameString less the location of the first space in @nameString.
- The second approach illustrates the use the RIGHT function instead of the SUBSTRING function to extract the second string from @nameString. The RIGHT function takes two arguments. The first argument is the name field name. The second argument is the number of characters from the end of the name field moving towards the beginning of the name field. In this case, the arithmetic expression for this number of characters is the same as the third argument for the SUBSTRING function approach.
The final SELECT statement in the code sample combines the SUBSTRING function for the first name with the RIGHT function for the last name to display the first name followed by a space and the last name. The point of the last SELECT statement is to remind you that you can combine the first and last names any way that you wish after you extract them from a name field value.
-- Code for parsing first name followed by last name DECLARE @nameString as varchar(128) SET @nameString = 'Jerome Sardo' -- find the delimiter for the end of the first name SELECT CHARINDEX(' ',@namestring,1) space_location -- extract the first name SELECT SUBSTRING(@nameString,1,CHARINDEX(' ',@namestring,1)-1) fname -- extract the second name one way SELECT SUBSTRING(@nameString, CHARINDEX(' ',@namestring,1)+1, LEN(@namestring)-CHARINDEX(' ',@namestring,1)) lname_1 -- extract the second name a different way SELECT RIGHT(@namestring, LEN(@namestring)-CHARINDEX(' ',@namestring,1)) lname_2 -- Bring the parts together as first name space last name SELECT SUBSTRING(@nameString,1,CHARINDEX(' ',@namestring,1)-1) + ' ' + RIGHT(@namestring, LEN(@namestring)-CHARINDEX(' ',@namestring,1)) parsed_name_parts
Parsing a Name Field Value with a Suffix in T-SQL
There are numerous approaches to implementing name parsing for name field values in different formats. One strategy is to code parsing schemes for each name representation that you want to be able to parse. You can successively pass a name field value to different name parsing code implementations until you find one that works or you fail to discover a match and leave the name field value unparsed.
The next example is especially designed to parse a name with three parts in which the third part is a suffix, such as Jr. or III. The code sample begins by assigning the value 'Robert Dobson, Jr.' to a string variable. Alternate SET assignment statements that are commented out help you to readily confirm the same code fails to parse some names in other formats. A subsequent example will illustrate an another approach which is more flexible in the types of name formats that it can parse.
After assigning a value to the @nameString variable, the code discovers how many parts are in the @nameString variable. Three SELECT statements track the location of blanks in the @nameString variable.
- The first SELECT statement looks for the first blank in the variable. If there are no blanks, it returns a value of 0.
- The second SELECT statement looks for the second blank in the variable by nesting one CHARINDEX function within another. Notice that the search for the second blank begins one space AFTER the first blank.
- The third SELECT statement looks for the third blank in the variable -- again, by using nested CHARINDEX functions.
You can edit the solution to exit out of processing @nameString values for which the code will generate an error. Aside from being optimized for three-part names with a suffix, the code may be suitable in some cases for parsing strings such as Robert William Dobson, Sr. On the other hand, the code sample will not parse names such as Microsoft or Robert Dobson. The code to track the location of blanks in a name field value can be used to facilitate exiting out of some name parsing code that will generate an error.
The code for extracting the first and second parts of a three-part name is generally similar to code for extracting the first and last name in the preceding code sample. One key distinction introduced with this sample is the optional removal of a comma after the second name part. The REPLACE function enables this feature by swapping the first comma in the second name part with a zero-length string. The change is optional in the sense that if there is no comma in the second name, then nothing gets replaced.
The code for extracting the third part in the @nameString variable begins at one space past the second blank in the nameString variable. Instead of specifying extraction of just the number of characters until the next blank, the LEN function specifies the total length of all the characters in the @nameString variable. This designation for the third argument has two benefits. First, you do not have to write an arithmetic expression for extracting precisely the number of characters in the third part. The SUBSTRING function automatically ceases retrieving characters when there are no more characters to extract. Second, if there are more name parts than just one after the second part, then the LEN function in the SUBSTRING function extracts all the remaining parts.
The closing section of the following code sample re-formats the extracted parts from the @nameString variable and changes the case as well. While the input source is in the format 'Robert Dobson, Jr.', the re-formatted output from the final SELECT statement is in the format 'DOBSON JR., ROBERT'. The case transformation is implemented with the UPPER function. All the other functions are described previously.
-- Code for parsing first and last names with a suffix DECLARE @nameString as varchar(128), @firstSpaceLoc as smallint, @secondSpaceLoc as smallint, @thirdSpaceLoc as smallint SET @nameString = 'Robert Dobson, Jr.' --SET @nameString = 'Robert Dobson' --SET @nameString = 'Microsoft' -- How many strings are in the name? -- Is there one space in the name SELECT CHARINDEX(' ',@namestring,1) firstStringLoc -- Is there second space in the name SELECT CHARINDEX(' ', @namestring, CHARINDEX(' ',@nameString,1)+1) secondStringLoc -- Is there a third space in the name SELECT CHARINDEX(' ', @namestring, CHARINDEX(' ', @namestring, CHARINDEX(' ',@nameString,1)+1)+1) thirdStringLoc -- extract first string SELECT LEFT(@nameString,CHARINDEX(' ',@namestring,1)-1) firstString -- extract second string SELECT SUBSTRING ( @nameString, CHARINDEX(' ',@namestring,1)+1, CHARINDEX(' ', @namestring, CHARINDEX(' ',@nameString,1)+1) - CHARINDEX(' ',@namestring,1) ) secondString -- clean second string of a comma SELECT REPLACE ( SUBSTRING ( @nameString, CHARINDEX(' ',@namestring,1)+1, CHARINDEX(' ', @namestring, CHARINDEX(' ',@nameString,1)+1) - CHARINDEX(' ',@namestring,1) ), ',', '' ) cleanedSecondString -- extract third string SELECT SUBSTRING ( @nameString, CHARINDEX(' ', @namestring, CHARINDEX(' ',@nameString,1)+1)+1, LEN(@nameString) ) thirdString -- original whole name string SELECT @nameString originalFormattedName -- put string parts together in different order and case SELECT UPPER ( REPLACE ( SUBSTRING ( @nameString, CHARINDEX(' ',@namestring,1)+1, CHARINDEX(' ', @namestring, CHARINDEX(' ',@nameString,1)+1) - CHARINDEX(' ',@namestring,1) ), ',', '' ) ) + ' ' + UPPER ( SUBSTRING ( @nameString, CHARINDEX(' ', @namestring, CHARINDEX(' ',@nameString,1)+1)+1, LEN(@nameString) ) ) + ', ' + UPPER ( LEFT(@nameString,CHARINDEX(' ',@namestring,1)-1) ) reformattedName
Parsing Names with One, Two, or Three Strings in SQL Server
The preceding two code samples are each especially designed to parse names arriving in a specific format. The first sample is especially tailored for names with two strings, and the second is tailored for names with three strings. Neither sample will handle a name with just one string, such as Microsoft. In addition, the sample for parsing three-part names does not successfully parse names with just two parts.
The next code sample handles name fields with one, two, or three name parts. The code sample achieves this flexibility by scanning the @nameString variable to keep track of how many spaces are in the name. Then, it uses that information to extract the strings for each name part in the variable. This code sample, and the previous ones, assumes a blank space delimits name parts. After a name part is extracted, the part is stored in a local variable for re-use, such as for re-formatting the names into an output format or assigning a value to a column of a staging table with parsed name parts.
The code relies on six local variables besides the one for the name field value. Three local variables keep track of the position of successive spaces in a name field. The code tracks whether there are 0, 1, 2, or 3 spaces in a name and the location of up to three spaces. If there are no spaces in a name, such as Microsoft, then all three local variables are 0. A name field containing three spaces, such as Robert William Dobson, Sr., assigns a non-zero value to all three local variables. The first local variable has the location of the space between Robert and William. The second local variable has the location of the space between William and Dobson. The third local variable has the location of the space between Dobson, and Sr.
In developing this name parsing solution, it was convenient to echo back the result of how many spaces were in a name. It is easy for name parsing code to become complex, and confirming intermediate results is often useful for unit testing. The code sample confirms how many spaces are detected in a SELECT statement with the original name field (@nameString) and three CASE...END statements. The successive CASE...END statements assess whether there are one, two, or three spaces in a name. If the code discovers no spaces in a name, then the first CASE...END statement indicates there is not one space in the name.
Another round of three CASE...END statements assign values to the local variables for the first, second, and third name parts. The first CASE...END statement assigns the characters before the first space to the first local variable for the first name part, such as Robert for the name Robert Dobson, Jr. If the first CASE...END statement detects that there are no spaces in a name, such as for Microsoft, then it merely assigns the whole name field to the first local variable.
The second CASE...END statement includes three WHEN clauses for assigning a value to the second name part; the values can be an empty string, the name part after the first space, or the name part between the first and second spaces with any commas removed. The values for the local variables tracking spaces in a name determine which WHEN clause makes the assignment for the second name part.
The third CASE...END statement also allows three possible values for the third name part. If the local variable for the first or second space is 0, then the third string is set to an empty string. If the position for the second space is greater than 0 and the position of the third space is 0, then third name part equals the characters from one beyond the second space and the end of the name field value. When the value of the third space position is greater than 0, then the third name part is set to the characters after the second space.
The final block of code reports out the parsed name parts in two formats. The first SELECT statement reports the original name field value followed by its name parts. The second SELECT statement re-formats the original name format to a specific output format. Notice that there is no need to re-specify the original parsing code when re-formatting the original name value. This is one special advantage of assigning parsed name parts to local variables.
-- Code for parsing a name with multiple parts DECLARE @nameString as varchar(max), @firstSpaceLoc as smallint, @secondSpaceLoc as smallint, @thirdSpaceLoc as smallint, @firstString as varchar(max), @secondString as varchar(max), @thirdString as varchar(max) --SET @nameString = 'Microsoft' --SET @nameString = 'Robert Dobson' SET @nameString = 'Robert Dobson, Jr.' --SET @nameString = 'Robert William Dobson, Sr.' -- How many strings are in the name? -- Is there one space in the name SET @firstSpaceLoc = CHARINDEX(' ',@namestring,1) -- Is there second space in the name SET @secondSpaceLoc = CHARINDEX(' ', @namestring, CHARINDEX(' ',@nameString,1)+1) -- Is there a third space in the name SET @thirdSpaceLoc = CASE WHEN CHARINDEX(' ', @namestring, CHARINDEX(' ',@nameString,1)+1) = 0 THEN 0 WHEN CHARINDEX(' ', @namestring, CHARINDEX(' ',@nameString,1)+1) > 0 THEN CHARINDEX(' ', @namestring, CHARINDEX(' ', @namestring, CHARINDEX(' ',@nameString,1)+1)+1) END SELECT @nameString sourceString, CASE WHEN @firstSpaceLoc > 0 THEN 'There is one space' ELSE 'There is not one space' END [Is there one space], CASE WHEN @secondSpaceLoc > 0 THEN 'There is a second space' ELSE 'There is not a second space' END [Is there a second space], CASE WHEN @thirdSpaceLoc > 0 THEN 'There is a third space' ELSE 'There is not a third space' END [Is there a third space] -- extract and save strings SELECT @firstString = CASE WHEN @firstSpaceLoc > 0 THEN LEFT(@nameString,CHARINDEX(' ',@namestring,1)-1) ELSE @nameString END, @secondString = CASE WHEN @firstSpaceLoc = 0 THEN '' WHEN @secondSpaceLoc = 0 THEN RIGHT(@namestring, LEN(@namestring)- CHARINDEX(' ',@namestring,1)) WHEN @secondSpaceLoc > 0 THEN REPLACE ( SUBSTRING ( @nameString, CHARINDEX(' ',@namestring,1)+1, CHARINDEX(' ', @namestring, CHARINDEX(' ',@nameString,1)+1) - CHARINDEX(' ',@namestring,1) ), ',', '' ) ELSE '' END, @thirdString = CASE WHEN @firstSpaceLoc = 0 OR @secondSpaceLoc = 0 THEN '' WHEN @secondSpaceLoc > 0 AND @thirdSpaceLoc = 0 THEN SUBSTRING ( @nameString, CHARINDEX(' ', @namestring, CHARINDEX(' ',@nameString,1)+1)+1, LEN(@nameString) ) ELSE RIGHT(@namestring,LEN(@namestring) - @secondSpaceLoc) END -- Report names SELECT @nameString sourceString, @firstString [First string], @secondString [Second string], @thirdString [Third string] SELECT CASE WHEN @thirdSpaceLoc > 0 THEN @thirdString + ', ' + @firstString + ' ' + @secondString WHEN @secondSpaceLoc > 0 AND @thirdSpaceLoc = 0 THEN @secondString + ' ' + @thirdString + ', ' + @firstString WHEN @firstSpaceLoc > 0 THEN @secondString + ', ' + @firstString WHEN @firstSpaceLoc = 0 THEN @firstString END [Reported Name]
Next Steps
If you decide you want to learn more about this topic, one obvious next step is to copy the scripts discussed in this tip and start modifying them based on your own custom requirements and desires to learn more about name parsing. If you are just starting out with SQL programming, you may want to take a little time and dig deeper into the operation and syntax for string functions mentioned and demonstrated in this article. I rank the MSDN site (http://msdn.microsoft.com/en-us/library/ms181984.aspx) as a particularly authoritative source. Also, I remind you about prior MSSQLTips.com coverage of string processing. Here are some prior tips for you to reference.
- Parsing a URL with SQL Server Functions
- SQL Server CLR and TSQL functions to parse a delimited string
- New DMF for SQL Server 2008 sys.dm_fts_parser to parse a string
Much coverage of name parsing tends to focus on relatively narrow topics. This series of tips aims at imparting systematic coverage of best practices for name parsing. For example, this introductory tip on name parsing covers three examples. When implementing name parsing solutions, it is often a good idea to nest one string function within another. You can frequently extract parts from a name field value by finding the location of successive blank spaces (or other delimiters) within a field value. You can typically simplify your code by saving intermediate parsing results in local variables and then operating on local variables instead of string expressions used to assigned values to functions.
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips