By: Siddharth Mehta | Comments | Related: > SQL Server 2017
Problem
Matching one dataset with another is a very basic and frequent operation while designing query or data processing logic in any database management systems including SQL Server. Mostly database systems support exact joins like inner join, outer join, left join, right join, etc. which depend on an exact match of one value in one or more fields in one dataset with an exact match of the same value in another field of another dataset. While working with a lot of textual data, it is a very normal use-case where one may need to perform an inexact matching or in other words fuzzy / heuristic matching of data values. For example, while entering free text comments or description, it is possible the users may make spelling mistakes and there may be a need to match these words with a dictionary to find the matching words and correct the spelling mistakes. Such data cleansing use-cases heavily rely on inexact matching and the relational joins would not come to much use in this case. The need is to have a joining method which can support inexact / fuzzy matching. In this tip we will learn how to solve this problem.
Solution
Fuzzyjoin R package can be used in SQL Server 2017 and higher to perform different kinds of fuzzy joins of data.
First, let's understand what distinct types of fuzzy joins are supported by this package. Below is a list of distinct types of inexact matching supported by the fuzzyjoin package along with the associated function name. We will be using some of these functions in the upcoming steps to understand the use.
- Numeric values that are within some tolerance (difference_inner_join)
- Strings that are similar in Levenshtein/cosine/Jaccard distance, or other metrics from the stringdist package (stringdist_inner_join)
- A regular expression in one column matching to another (regex_inner_join). Other Regex based join functions include regex_left_join (include all rows of left table), regex_right_join (include all rows of right table), regex_full_join (include all rows in each table), regex_semi_join (filter left table for rows with matches), regex_anti_join (filter left table for rows without matches)
- Euclidean or Manhattan distance across multiple columns (distance_inner_join)
- Geographic distance based on longitude and latitude (geo_inner_join)
- Intervals of (start, end) that overlap (interval_inner_join)
- Genomic intervals, which include both a chromosome ID and (start, end) pairs, that overlap (genome_inner_join)
Example of Fuzzy Matching with SQL Server 2017 and R
Now the we know the inventory of different join functions supported by the fuzzyjoin package, we can start with the exercise to understand the working and usage of some of these functions. Follow the steps as shown below.
Step 1 - It is assumed the SQL Server 2017 as well as R is installed on the development machine. If you are new to R in SQL Server, you can refer the R tutorial to get up to speed on working with R in SQL Server.
Step 2 - Its assumed that SQL Server Management Studio (SSMS) in installed on the development machine. We need to install the "dplyr" package for output redirection (%>%) operator and "fuzzyjoin" package for using the above-mentioned functions. You can read more about installing R packages in SQL Server from here.
Step 3 - The first step is to open a new query window and ensure that the packages are installed successfully and accessible in the SSMS environment. Execute the below code, and if the code executes successfully, that would mean that the library is installed correctly and accessible in SSMS. We are referencing the packages in the library function and executing the script using sp_execute_external_script stored procedure.
EXECUTE sp_execute_external_script @language = N'R', @script = N' library(dplyr) library(fuzzyjoin) '
Step 4 - Now to simulate the problem in question, we need two data frames which consists of misspellings as well as a dictionary list of correct spellings. And then join the two to find inexact matches. In the below code we are creating two data frames named misspelling and words which are the dictionary words, and adding some sample data to demonstrate and test the matching. We are creating a new vector named joined and assigning it the output of the fuzzyjoin. We are using the stringdist_inner_join function to match the two data frames and specifying that the distance between misspellings and dictionary words can be 1 at max by setting value of 1 for the max_dist parameter. Execute the below code.
EXECUTE sp_execute_external_script @language = N'R', @script = N' library(dplyr) library(fuzzyjoin) misspelling <- data.frame("misspelling" = c("sould", "fiels", "conscent", "fleed", "abritrary", "accademic", "charactors", "nomatch")) words <- data.frame("rightwords" = c("arbitrary", "field", "consent", "should", "could", "sold", "sound", "bleed", "academic", "characters", "charactery")) joined <- misspelling %>% stringdist_inner_join(words, by = c(misspelling = "rightwords"), max_dist = 1) print(joined) '
Step 5 - The output of the code would be as shown below. The would "sould" got matched with multiple words where change or addition or removal of one character would make the word match with the word in the dictionary. You can also notice that the word "charactery" from the dictionary, and the word "nomatch" from the misspelling is missing in the output.
In this way, we can a get a tableplot for multi-variate high-voluminous data analysis primarily to understand distribution and relationships between different variables in a dataset.
Step 6 - Now let's change the code a little bit and increase the distance to 2. Execute the below code.
EXECUTE sp_execute_external_script @language = N'R', @script = N' library(dplyr) library(fuzzyjoin) misspelling <- data.frame("misspelling" = c("sould", "fiels", "conscent", "fleed", "abritrary", "accademic", "charactors", "nomatch")) words <- data.frame("rightwords" = c("arbitrary", "field", "consent", "should", "could", "sold", "sound", "bleed", "academic", "characters", "charactery")) joined <- misspelling %>% stringdist_inner_join(words, by = c(misspelling = "rightwords"), max_dist = 2) print(joined) '
Step 7 - This time you would find that the word "charactery" matched the word "charactors", "fleed" matched "field" and is in the output. But the word "nomatch" is still missing from the misspelling list. The reason is that there is no match for this in the dictionary list of words.
Step 8 - You can add an additional parameter distance_col = "distance" to the stringdist_inner_join function, and the distance will be in the output as shown below.
Step 9 - We have been doing inner joins until now. In case we need all the words from one dataset, for example the misspellings data frame in this case, we need to use a left join. Change the code as shown below by using the stringdist_left_join function to get all the values from misspellings irrespective of whether it inexactly matches any words from the dictionary list.
EXECUTE sp_execute_external_script @language = N'R', @script = N' library(dplyr) library(fuzzyjoin) misspelling <- data.frame("misspelling" = c("sould", "fiels", "conscent", "fleed", "abritrary", "accademic", "charactors", "nomatch")) words <- data.frame("rightwords" = c("arbitrary", "field", "consent", "should", "could", "sold", "sound", "bleed", "academic", "characters", "charactery")) leftjoined <- misspelling %>% stringdist_left_join(words, by = c(misspelling = "rightwords"), max_dist = 2) print(leftjoined) '
Step 10 - The output would look as shown below. This time the "nomatch" word in the misspelling is in the output and the corresponding match in the dictionary does not exists, so it shows the value as <NA>.
Step 11 - Let's say we have a situation where we anticipate that we have a high volume of words that won't have a match in the dictionary. In this case instead of doing a left join and filtering the data, we can use a different type of join called an anti-join which returns data that does not match. Execute the antijoin function as shown below.
EXECUTE sp_execute_external_script @language = N'R', @script = N' library(dplyr) library(fuzzyjoin) misspelling <- data.frame("misspelling" = c("sould", "fiels", "conscent", "fleed", "abritrary", "accademic", "charactors", "nomatch")) words <- data.frame("rightwords" = c("arbitrary", "field", "consent", "should", "could", "sold", "sound", "bleed", "academic", "characters", "charactery")) antijoined <- misspelling %>% stringdist_anti_join(words, by = c(misspelling = "rightwords"), max_dist = 1) print(antijoined) '
Step 12 - The output from this code would look as shown below, where the only word in the entire list which does not have an equivalent match in the dictionary is shown.
In this way, distinct functions from the fuzzyjoin package can be used for inexact matching. In a real life scenario, data would come from a table in SQL Server for both datasets that would be joined for the fuzzy matching by using the InputDataSet data frame to pass the data from SQL Server to the R Script. And after the fuzzy join is applied, the output can be extracted from the R script using the OutputDataSet data frame. If you are new to this, you can read more about this in the R tutorial.
Next Steps
- Consider using the regular expression joins which provides the maximum flexibility for fuzzy joining of datasets.
- You can refer to the FuzzyJoin developer reference manual here.
About the author
This author pledges the content of this article is based on professional experience and not AI generated.
View all my tips