Using SOUNDEX and DIFFERENCE to Standardize Data in SQL Server


By:   |   Updated: 2010-11-16   |   Comments (4)   |   Related: More > Functions - System

Problem

My client wants to standardize address information for existing and future addresses collected for their customers, particularly the street suffixes. The application used to enter and collect address information has the street suffix separated from the address field, but it is a textbox instead of a drop down list therefore things are not standardized. I know there are some options out there to standarize data, but they would like a less expensive alternative. Are there any functions in SQL Server that I can use to standardized data?

Solution

SQL Server offers two functions that can be used to compare string values: The SOUNDEX and DIFFERENCE functions.

The SOUNDEX function converts a phrase to a four-character code. The first character is the first letter of the phrase. Vowels are omitted unless the first letter of the phrase is a vowel and the other three characters represent the rest of the phrase. Below is a comparison between the SOUNDEX of "MSSQLTIPS" and "MSSQLTPS":

comparison between the SOUNDEX of "MSSQLTIPS" and "MSSQLTPS"

The DIFFERENCE function evaluates two expressions and assigns a value between 0 and 4, with 0 being little to no similarity and 4 representing the same or very similar phrases. This value is derived from the number of characters in the SOUNDEX of each phrase that are the same. Below is an example of the DIFFERENCE function used in two scenarios:

an example of the DIFFERENCE function

For the example of standardizing street suffixes, we can create a new table named ref_RoadTypes with the following structure:

CREATE TABLE dbo.ref_RoadTypes
(
RoadTypeID TINYINT NOT NULL
CONSTRAINT PK_refRoadTypes_RoadTypeID PRIMARY KEY
, RoadTypeAbbr VARCHAR(20) NOT NULL
, RoadType VARCHAR(50) NOT NULL
, CreatedDt DATETIME NOT NULL
CONSTRAINT dft_refRoadTypes_CreatedDt DEFAULT(CURRENT_TIMESTAMP)
, CreatedBy VARCHAR(50) NOT NULL
CONSTRAINT dft_refRoadTypes_CreatedBy DEFAULT(SYSTEM_USER)
, ModifiedDt DATETIME NOT NULL
CONSTRAINT dft_refRoadTypes_ModifiedDt DEFAULT(CURRENT_TIMESTAMP)
, ModifiedBy VARCHAR(50) NOT NULL
CONSTRAINT dft_refRoadTypes_ModifiedBy DEFAULT(SYSTEM_USER)
)

Use this sample script RoadTypeInsert.txt to create a test table and test data. Once the values are in the table you can use either function to retrieve the best match for the street suffix. Below are samples of each function being used to retrieve the best match:

FUNCTION QUERY
SOUNDEX
DECLARE @Suffix VARCHAR(20)  
SELECT @Suffix = 'Bolevard' --(intentionally misspelled)  
SELECT RoadTypeAbbr 
FROM dbo.ref_RoadTypes  
WHERE SOUNDEX(RoadType) = SOUNDEX(@Suffix)  
Results SOUNDEX example


DIFFERENCE
DECLARE @Phrase VARCHAR(20)  
SELECT @Phrase = 'Str'  
SELECT TOP 1 RoadTypeAbbr 
FROM dbo.ref_RoadTypes  
ORDER BY DIFFERENCE(@Phrase, RoadType) DESC  
Results DIFFERENCE example


Next Steps


Last Updated: 2010-11-16


get scripts

next tip button



About the author
MSSQLTips author Tim Cullen Tim Cullen has been working in the IT industry since 2003 and currently works as a SQL Server Reports Developer.

View all my tips





Comments For This Article




Thursday, July 09, 2015 - 1:37:04 PM - Joe Celko Back To Top

I would get a pacakge like Melissa Data, SSA, etc. to do this job. Why re-invent the wheel? 


Wednesday, July 18, 2012 - 9:13:38 PM - Mica Back To Top
Why would you guess' that it would also happen to an image as well? A frenid of mine has pointed out that at some point, someone might have manually edited the field of this row (possible I guess, unlikely but possible). At which point, some how they might have inserted invalid/unknown content into the field.My frenid tells me that with an Oracle date, that they store date plus time and zone information all in one. If you insert a plain date (ie dd/mm/yyyy) it will set all the other information to a default, which would get stripped out again if you selected the field back. That makes sense to me, as every other field was displaying through ASP as a plain date. So, maybe the person that edited this field appended additional information (ie time/timezone), it was recognised by Oracle (hence allowed the update/insert) yet isn't recognised by ASP (hence the variable sub-type coming out as VBObject, instead of VBDate).Have you heard of anyone else having a similar problem?

Tuesday, November 23, 2010 - 10:17:07 AM - Steven Christensen Back To Top

While the soundex and difference functions are useful I found that the generalizations they produce in evaluating address information is too broad to be a dependable method of evaluating addresses.  In many cases the data entry issues are significant and ultimately they are the key factor that you want to address.  The following function that I have placed below is based on a scoring system that has proven to be far more effective in combination with the street types and other parts of addresses.  I have a scoring system set up that actually is 99.6% accurate for the Canadian address information that exists in our system and significantly more accurate than other services that I've found for the US addresses.

The key is to score the addresses and order them by the combined score.  Done in this fashion you can achieve an exceptionally high accuracy rating.

Here is the primary scoring function that drives my system.  The function also takes into account the issue of matching streets with numeric names and transposed characters since a good portion of the time street names are not entered correctly.

CREATE

 

FUNCTION [dbo].[MatchScore](@String VARCHAR(2000),@String2 varchar(2000))

RETURNS

 

decimal(18,4)

AS
BEGIN

-- @string is the input string to compare to the database

-- @string2 is the database search string value to compare to.

 

 

 

DECLARE @Count INT,@prevnums bigint,@prevnums2 bigint

 

 

DECLARE @nums bigint,@chars varchar(1000),@nums2 bigint,@chars2 varchar(1000),@cresults decimal(18,2),@maxstring1 decimal,@maxstring2 decimal

 

 

declare @spacestart int

 

 

declare @tagMatches as varchar(100), @tagMatches2 as varchar(100),@maxlen real

-- replace spaces,periods and hyphens to minimize skewed results since these are irrelevant

 

 

 

set @String=upper(ltrim(rtrim(replace(replace(replace(@String,' ',''),'-',''),'.',''))))
set @String2=upper(ltrim(rtrim(replace(replace(replace(@String2,' ',''),'-',''),'.',''))))
set @maxstring1=len(@String)
set @maxstring2=len(@String2)
set @maxlen =len(@String)

 

 

set @cresults = 0
if not(len(rtrim(@string))=0 or len(rtrim(@string2))=0)
begin
set @spacestart=charindex(' ',@string)
IF @spacestart>0 and CHARINDEX(left(@String,@spacestart),@String2)>0

 

 

BEGIN
set @cresults=1
END
ELSE
BEGIN
set @cresults=0
END

 

 

 

-- set the strings to the same length so we compare apples to apples

 

 

 

if len(@String)< len(@String2)
BEGIN
SET @STRING = @STRING + Replicate('|',len(@String2) -len(@String))
set @maxlen =len(@String2)
END

 

 

 

if len(@String2)< len(@String)
BEGIN
SET @STRING2 = @STRING2 + Replicate('|',len(@String)- len(@String2))
END

 

 

 

begin
SET @Count = 0
set @nums=0
set @nums2=0

 

 

WHILE (@Count <= LEN(@String) or @Count <= LEN(@String2))
begin
set @nums=cast(isnull(ASCII(right(left(@String,@count),1)),0) as varchar(10))
set @chars=cast(right(left(@String,@count),1) as char(1))
set @nums2=cast(isnull(ASCII(right(left(@String2,@count),1)),0) as varchar(10))
set @chars2=cast(right(left(@String2,@count),1) as char(1))

 

 

if (isnull(@nums,0)=isnull(@nums2,0))
begin
set @cresults=@cresults + ((1/isnull(@maxlen,1))*100)

 

 

 

-- if I am checking numbers lets make sure that we are checking the number seperately so we don't confuse 5 and 50

 

 

 

if (isnumeric(isnull(@chars,''))=1)
begin
set @tagMatches = isnull(@tagMatches,'') + cast(isnull(@chars,'') as char(1))
end

 

 

 

if (isnumeric(isnull(@chars2,''))=1)
begin
set @tagMatches2 = isnull(@tagMatches2,'') + cast(isnull(@chars2,'')as char(1))
end
end

 

 

 

else
begin
if @count>0

begin
set @prevnums=cast(isnull(ASCII(right(left(@String,@count-1),1)),0) as varchar(10))
set @prevnums2=cast(isnull(ASCII(right(left(@String2,@count-1),1)),0) as varchar(10))

 

 

if (@nums=@prevnums2) or (@nums2=@prevnums)
begin
set @cresults=@cresults + ((1/isnull(@maxlen,1))*100)
end
else
begin
set @cresults=@cresults - ((1/isnull(@maxlen,1))*100)
end
end
end
set @count=@count+1
end
end
end

else

 


begin
set @cresults=0
end

if

 

len(@tagMatches)>0

 

 

begin

if (@tagMatches=@tagMatches2)
begin
set @cresults=@cresults + len(@tagMatches)
end

 

 

 

else
begin
set @cresults=@cresults - len(@tagMatches)
end

end

if

 

@nums=@nums2
begin
set @cresults=@cresults + 1
end

else

 

begin
set @cresults=@cresults - 1
end

 

matchfound

:

 

 

 

if @cresults<0
begin
set @cresults=0
end

return
@cresults
end


Tuesday, November 23, 2010 - 4:26:28 AM - Ben B Back To Top

Hello,

Your examples worked as expected but I got very odd results with the following:

-- DIFFERENCE test

DECLARE

 

@Phrase VARCHAR(20)

SELECT

 

@Phrase = 'ST'

SELECT

 

TOP 10 RoadTypeAbbr FROM dbo.ref_RoadTypes

ORDER

 

BY DIFFERENCE(@Phrase, RoadType) DESC

gave me..

RD
ALY
BND
BLF
BGS
BG
BCH
AVE
BRK
BRKS

I would expect "ST" to be in my results...  At least in the top 10!

Any ideas what is happening here?



download


Recommended Reading

Concatenate SQL Server Columns into a String with CONCAT()

SQL Server Rounding Functions - Round, Ceiling and Floor

SQL Server 2016 STRING_SPLIT Function

Using FOR XML PATH and STRING_AGG() to denormalize SQL Server data

Different ways to get random data for SQL Server data sampling





get free sql tips
agree to terms


Learn more about SQL Server tools