Using SOUNDEX and DIFFERENCE to Standardize Data in SQL Server

Problem

My client wants to standardize address information for existing and future addresses collected for their customers, particularly the street suffixes. The application used to enter and collect address information has the street suffix separated from the address field, but it is a textbox instead of a drop down list therefore things are not standardized. I know there are some options out there to standarize data, but they would like a less expensive alternative. Are there any functions in SQL Server that I can use to standardized data?

Solution

SQL Server offers two functions that can be used to compare string values: The SOUNDEX and DIFFERENCE functions.

The SOUNDEX function converts a phrase to a four-character code. The first character is the first letter of the phrase. Vowels are omitted unless the first letter of the phrase is a vowel and the other three characters represent the rest of the phrase. Below is a comparison between the SOUNDEX of “MSSQLTIPS” and “MSSQLTPS”:

comparison between the SOUNDEX of "MSSQLTIPS" and "MSSQLTPS"

The DIFFERENCE function evaluates two expressions and assigns a value between 0 and 4, with 0 being little to no similarity and 4 representing the same or very similar phrases. This value is derived from the number of characters in the SOUNDEX of each phrase that are the same. Below is an example of the DIFFERENCE function used in two scenarios:

an example of the DIFFERENCE function

For the example of standardizing street suffixes, we can create a new table named ref_RoadTypes with the following structure:

CREATE TABLE dbo.ref_RoadTypes
(
RoadTypeID TINYINT NOT NULL
CONSTRAINT PK_refRoadTypes_RoadTypeID PRIMARY KEY
, RoadTypeAbbr VARCHAR(20) NOT NULL
, RoadType VARCHAR(50) NOT NULL
, CreatedDt DATETIME NOT NULL
CONSTRAINT dft_refRoadTypes_CreatedDt DEFAULT(CURRENT_TIMESTAMP)
, CreatedBy VARCHAR(50) NOT NULL
CONSTRAINT dft_refRoadTypes_CreatedBy DEFAULT(SYSTEM_USER)
, ModifiedDt DATETIME NOT NULL
CONSTRAINT dft_refRoadTypes_ModifiedDt DEFAULT(CURRENT_TIMESTAMP)
, ModifiedBy VARCHAR(50) NOT NULL
CONSTRAINT dft_refRoadTypes_ModifiedBy DEFAULT(SYSTEM_USER)
)

Use this sample script RoadTypeInsert.txt to create a test table and test data. Once the values are in the table you can use either function to retrieve the best match for the street suffix. Below are samples of each function being used to retrieve the best match:

FUNCTIONQUERY
SOUNDEX
DECLARE @Suffix VARCHAR(20)   SELECT @Suffix = ‘Bolevard’ –(intentionally misspelled)   SELECT RoadTypeAbbr  FROM dbo.ref_RoadTypes   WHERE SOUNDEX(RoadType) = SOUNDEX(@Suffix)  
Results SOUNDEX example


DIFFERENCE
DECLARE @Phrase VARCHAR(20)   SELECT @Phrase = ‘Str’   SELECT TOP 1 RoadTypeAbbr  FROM dbo.ref_RoadTypes   ORDER BY DIFFERENCE(@Phrase, RoadType) DESC  
Results DIFFERENCE example


Next Steps

Leave a Reply

Your email address will not be published. Required fields are marked *