Fuzzy Matching
Synopsis
This operator allows you to merge two data sets in a fuzzy way based on two nominal attributes. This means it matches examples whose values are not necessarily equal, but similar.
Description
The operator takes one attribute from the left side example set and one from the right example set to match rows. If you want to perform a multi-attribute match, please check the Cross Distance operator.
A similarity is calculated between the values of the two chosen attributes. For each example of the left side, the operator merges the k most similar examples of the right side. If there are colliding attribute names, "_from_ES2" is appended to the right hand side's attribute, as done by the Join operator.
The similarity method can be defined using the 'distance measure' parameter. Currently all similarity measures are based on the Levenshtein distance. The Levenshtein distance between two strings is the number of single-character changes (insertions, deletions or substitutions) needed to get from one string to the other. The distance measures are taken from the fuzzywuzzy library. For a detailed explanation of the different options, please see: https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
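To make the idea of a Levenshtein based similarity concrete, here is a minimal sketch in plain Python. It is not the fuzzywuzzy implementation and not the operator's code. The function names `levenshtein` and `similarity`, and the scaling to a 0 to 100 score by the longer string's length, are assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca by cb (or keep it)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical 0..100 score: 100 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)
```

For example, turning "kitten" into "sitting" takes three changes (two substitutions and one insertion), so `levenshtein("kitten", "sitting")` is 3.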
Input
left
The first example set used for matching.
right
The second example set used for matching.
Output
matched
The merged example set. This contains the union of the attributes from the left and the right side. If there are colliding attribute names, "_from_ES2" is appended to the right hand side's attribute. For each row of the left side, the resulting table contains up to "number_of_matches" rows, one for each of its closest matches.
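The shape of the output described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the operator's implementation: example sets are modelled as lists of dictionaries, the function name `fuzzy_match` is hypothetical, and `difflib.SequenceMatcher` from the standard library is used as a stand-in for the fuzzywuzzy measures.

```python
from difflib import SequenceMatcher


def fuzzy_match(left, right, left_attr, right_attr, number_of_matches=1):
    """Join each left row with its most similar right rows (illustrative only)."""
    merged = []
    for l_row in left:
        # Rank all right rows by similarity to the current left row.
        # SequenceMatcher.ratio() is a stand-in similarity, not a fuzzywuzzy measure.
        ranked = sorted(
            right,
            key=lambda r_row: SequenceMatcher(
                None, l_row[left_attr], r_row[right_attr]
            ).ratio(),
            reverse=True,
        )
        # Keep at most number_of_matches right rows per left row.
        for r_row in ranked[:number_of_matches]:
            out = dict(l_row)
            for name, value in r_row.items():
                # Colliding attribute names get the "_from_ES2" suffix.
                out[name + "_from_ES2" if name in l_row else name] = value
            merged.append(out)
    return merged


# Made-up sample data for illustration.
left = [{"id": 1, "company": "Acme Corp"}, {"id": 2, "company": "Globex"}]
right = [
    {"id": "a", "name": "ACME Corporation"},
    {"id": "b", "name": "Globex Inc."},
    {"id": "c", "name": "Initech"},
]
rows = fuzzy_match(left, right, "company", "name", number_of_matches=1)
```

Both inputs carry an "id" attribute, so each merged row keeps the left value under "id" and the right value under "id_from_ES2".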
Parameters
Left side attribute
The attribute of the left hand side example set which should be used for merging.
Right side attribute
The attribute of the right hand side example set which should be used for merging.
Number of matches
Defines the maximum number of matches you want to find for each left hand side example.
Similarity measure
Similarity measure which should be used to determine a match.