k-NN
Synopsis
This Operator generates a k-Nearest Neighbor model, which is used for classification or regression.
Description
The k-Nearest Neighbor algorithm is based on comparing an unknown Example with the k training Examples which are the nearest neighbors of the unknown Example.
The first step of the application of the k-Nearest Neighbor algorithm on a new Example is to find the k closest training Examples. "Closeness" is defined in terms of a distance in the n-dimensional space, defined by the n Attributes in the training ExampleSet.
Different metrices, such as the Euclidean distance, can be used to calculate the distance between the unknown Example and the training Examples. Due to the fact that distances often depends on absolute values, it is recommended to normalize data before training and applying the k-Nearest Neighbor algorithm. The metric used and its exact configuration are defined by the parameters of the Operator.
In the second step, the k-Nearest Neighbor algorithm classify the unknown Example by a majority vote of the found neighbors. In case of a regression, the predicted value is the average of the values of the found neighbors.
It can be useful to weight the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones.
Differentiation
k-Means
This Operator performs clustering on an unlabeled data set. It can be considered as the equivalent of the k-NN algorithm for unsupervised learning. A cluster calculated by the k-Means algorithm consists of similar Examples, where the similarity is calculated by a given measure in the Attribute space.
Input
training set
This input port expects an ExampleSet.
Output
model
The k-Nearest Neighbor model is delivered from this output port. This model can now be applied on unseen data sets for prediction of the label Attribute.
example set
The ExampleSet that was given as input is passed through without changes.
Parameters
K
Finding the k training Examples that are closest to the unknown Example is the first step of the k-NN algorithm. If k = 1, the Example is simply assigned to the class of its nearest neighbor. k is typically a small, positive and odd integer.
Weighted vote
If this parameter is set, the distance values between the Examples are also taken into account for the prediction. It can be useful to weight the contributions of the neighbors, so that nearer neighbors contribute more than more distant ones.
Measure types
This parameter is used for selecting the type of measure to be used for finding the nearest neighbors. The following options are available:
- MixedMeasures: Mixed Measures are used to calculate distances in case of both nominal and numerical Attributes.
- NominalMeasures: In case of only nominal Attributes different distance metrices can be used to calculate distances on this nominal Attributes.
- NumericalMeasures: In case of only numerical Attributes different distance metrices can be used to calculate distances on this numerical Attributes.
- BregmannDivergences: Bregmann divergences are more generic "closeness" measure types with does not satisfy the triangle inequality or symmetry. For more details see the parameter divergence.