Fuzzy Matching

Synopsis

This operator merges two data sets in a fuzzy way based on two nominal attributes. It matches examples whose values are similar, not necessarily equal.

Description

The operator takes one attribute from the left side example set and one from the right example set to match rows. If you want to perform a multi-attribute match, please check the Cross Distance operator.

A similarity is calculated between the two chosen attributes. For each example of the left side, the operator merges in the k most similar examples of the right side. If attribute names collide, "_from_ES2" is appended to the right hand side's attribute name, as done by the Join operator.

The similarity method is set with the 'distance measure' parameter (listed under Parameters as "Similarity measure"). Currently all similarity measures are based on the Levenshtein distance, which counts the single-character edits (insertions, deletions and substitutions) needed to turn one string into the other. The available measures are taken from the fuzzywuzzy library. For a detailed explanation of the different options, please see: https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
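
To make the idea of an edit distance concrete, here is a minimal pure-Python sketch of the Levenshtein distance. It is not the operator's or fuzzywuzzy's actual code. The function names and the 0 to 100 normalization in similarity() are assumptions chosen for illustration, and fuzzywuzzy's own measures may scale their scores differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization (an assumption, not fuzzywuzzy's formula):
    100 for identical strings, 0 when every position must be edited."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("kitten", "sitting"))  # → 3
```

A smaller distance means more similar strings, which is why a distance has to be turned into a score before the closest matches can be ranked highest.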

Input

left

The first example set used for matching.

right

The second example set used for matching.

Output

matched

The merged example set. It contains the union of the attributes from the left and the right side. If attribute names collide, "_from_ES2" is appended to the right hand side's attribute name. For each row of the left side, the resulting table contains up to "number_of_matches" rows holding its closest matches.
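
The shape of this output can be illustrated with a small Python sketch over lists of dictionaries. It is an approximation under stated assumptions, not the operator's implementation: fuzzy_match and score are hypothetical names, the sample rows are made up, and difflib.SequenceMatcher from the standard library stands in for the Levenshtein-based measures.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> float:
    # Stand-in similarity in [0, 100]; an assumption, not the operator's measure.
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def fuzzy_match(left, right, left_attr, right_attr, k=1, suffix="_from_ES2"):
    """For each left row, emit up to k merged rows with the most similar right rows."""
    matched = []
    for l_row in left:
        # Rank all right rows by similarity to the current left value, best first.
        ranked = sorted(
            right,
            key=lambda r_row: score(l_row[left_attr], r_row[right_attr]),
            reverse=True,
        )
        for r_row in ranked[:k]:
            merged = dict(l_row)
            for name, value in r_row.items():
                # Colliding attribute names from the right side get the suffix.
                key = name + suffix if name in l_row else name
                merged[key] = value
            matched.append(merged)
    return matched


# Made-up sample data for illustration only.
left = [{"name": "Jon Smith", "city": "Boston"}]
right = [
    {"name": "John Smith", "id": 1},
    {"name": "Jane Smyth", "id": 2},
    {"name": "Bob Jones", "id": 3},
]

for row in fuzzy_match(left, right, "name", "name", k=2):
    print(row)
```

With k=2 the single left row yields two output rows. Both attributes are called "name", so the right side's value appears as "name_from_ES2", while "city" and "id" keep their names.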

Parameters

Left side attribute

The attribute of the left hand side example set which should be used for merging.

Right side attribute

The attribute of the right hand side example set which should be used for merging.

Number of matches

Defines the maximum number of matches to find for each left hand side example.

Similarity measure

The similarity measure used to determine a match.
