k-NN

Synopsis

This Operator generates a k-Nearest Neighbor model, which is used for classification or regression.

Description

The k-Nearest Neighbor algorithm is based on comparing an unknown Example with the k training Examples which are the nearest neighbors of the unknown Example.

The first step of the application of the k-Nearest Neighbor algorithm on a new Example is to find the k closest training Examples. "Closeness" is defined in terms of a distance in the n-dimensional space, defined by the n Attributes in the training ExampleSet.

Different metrics, such as the Euclidean distance, can be used to calculate the distance between the unknown Example and the training Examples. Because distances often depend on absolute values, it is recommended to normalize the data before training and applying the k-Nearest Neighbor algorithm. The metric used and its exact configuration are defined by the parameters of the Operator.

In the second step, the k-Nearest Neighbor algorithm classifies the unknown Example by a majority vote of the found neighbors. In case of regression, the predicted value is the average of the values of the found neighbors.

It can be useful to weight the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones.
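The two steps described above can be sketched in a few lines of plain Python. This is a minimal illustration of the algorithm, not RapidMiner's implementation; the data, function names, and the choice of k below are made up for the example.

```python
import math
from collections import Counter

def euclidean(a, b):
    # distance in the n-dimensional Attribute space
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, labels, query, k=3):
    # step 1: find the k training Examples closest to the unknown Example
    neighbors = sorted(range(len(train)), key=lambda i: euclidean(train[i], query))[:k]
    # step 2: classify by a majority vote of the found neighbors
    votes = Counter(labels[i] for i in neighbors)
    return votes.most_common(1)[0][0]

train = [(1.0, 1.0), (1.2, 0.8), (8.0, 9.0), (9.0, 8.5)]
labels = ["a", "a", "b", "b"]
print(knn_classify(train, labels, (1.1, 0.9), k=3))  # 'a'
```

For regression, the majority vote in step 2 would be replaced by the average of the neighbors' label values.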

Differentiation

k-Means

This Operator performs clustering on an unlabeled data set. It can be considered as the equivalent of the k-NN algorithm for unsupervised learning. A cluster calculated by the k-Means algorithm consists of similar Examples, where the similarity is calculated by a given measure in the Attribute space.

Input

training set

This input port expects an ExampleSet.

Output

model

The k-Nearest Neighbor model is delivered from this output port. This model can now be applied to unseen data sets to predict the label Attribute.

example set

The ExampleSet that was given as input is passed through without changes.

Parameters

K

Finding the k training Examples that are closest to the unknown Example is the first step of the k-NN algorithm. If k = 1, the Example is simply assigned to the class of its nearest neighbor. k is typically a small positive integer; for binary classification an odd value is common, since it avoids ties in the majority vote.

Weighted vote

If this parameter is set, the distance values between the Examples are also taken into account for the prediction. It can be useful to weight the contributions of the neighbors, so that nearer neighbors contribute more than more distant ones.
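The effect of a weighted vote can be illustrated with inverse-distance weighting, one common weighting scheme (the exact scheme used by the Operator is not specified here, so this is an assumption for illustration):

```python
from collections import defaultdict

def weighted_vote(neighbors, eps=1e-9):
    # neighbors: list of (distance, label) pairs for the k found neighbors;
    # nearer neighbors receive a larger weight
    scores = defaultdict(float)
    for dist, label in neighbors:
        scores[label] += 1.0 / (dist + eps)  # inverse-distance weight (one common choice)
    return max(scores, key=scores.get)

# two distant 'b' neighbors are outvoted by one very close 'a' neighbor
print(weighted_vote([(0.1, "a"), (2.0, "b"), (2.5, "b")]))  # 'a'
```

With an unweighted vote the same three neighbors would predict 'b'.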

Measure types

This parameter is used for selecting the type of measure to be used for finding the nearest neighbors. The following options are available:

  • MixedMeasures: Mixed measures are used to calculate distances when the ExampleSet contains both nominal and numerical Attributes.
  • NominalMeasures: If all Attributes are nominal, different distance metrics can be used to calculate distances on these nominal Attributes.
  • NumericalMeasures: If all Attributes are numerical, different distance metrics can be used to calculate distances on these numerical Attributes.
  • BregmanDivergences: Bregman divergences are more generic "closeness" measures, which need not satisfy the triangle inequality or symmetry. For more details see the parameter divergence.

Mixed measure

The only available option for mixed measure is the 'Mixed Euclidean Distance'. For numerical values the Euclidean distance is calculated. For nominal values, a distance of 0 is taken if both values are the same and a distance of 1 otherwise. This parameter is available when the measure type parameter is set to 'mixed measures'.
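The Mixed Euclidean Distance can be sketched as follows; this is an illustrative implementation (the function name and example data are invented), which adds the 0/1 nominal contribution inside the Euclidean sum:

```python
import math

def mixed_euclidean(a, b, nominal):
    # nominal[j] is True if Attribute j is nominal, False if numerical
    total = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if nominal[j]:
            total += 0.0 if x == y else 1.0  # nominal: 0 if equal, 1 otherwise
        else:
            total += (x - y) ** 2            # numerical: squared difference
    return math.sqrt(total)

# one numerical Attribute (difference 3) and one nominal Attribute (unequal)
print(mixed_euclidean((1.0, "red"), (4.0, "blue"), nominal=[False, True]))
```

Here the result is sqrt(3^2 + 1) = sqrt(10). Note that squaring the 0/1 nominal term would make no difference.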

Nominal measure

This parameter defines how to calculate distances if the ExampleSet contains only nominal Attributes, in case the measure type is set to 'nominal measures'. When a similarity is used as a distance measure, the actual distance is calculated as the negative similarity. For the different similarities the following variables are defined:

e: number of Attributes for which both Examples have equal and non-zero values

u: number of Attributes for which the two Examples have unequal values

z: number of Attributes for which both Examples have zero values

  • NominalDistance: Distance of two values is 0 if both values are the same and 1 otherwise.
  • DiceSimilarity: With the above mentioned definitions the DiceSimilarity is: 2e/(2e+u)
  • JaccardSimilarity: With the above mentioned definitions the JaccardSimilarity is: e/(e+u)
  • KulczynskiSimilarity: With the above mentioned definitions the KulczynskiSimilarity is: e/u
  • RogersTanimotoSimilarity: With the above mentioned definitions the RogersTanimotoSimilarity is: (e+z)/(e+2*u+z)
  • RussellRaoSimilarity: With the above mentioned definitions the RussellRaoSimilarity is: e/(e+u+z)
  • SimpleMatchingSimilarity: With the above mentioned definitions the SimpleMatchingSimilarity is: (e+z)/(e+u+z)
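The counts e, u and z, and two of the similarities built from them, can be computed as in this sketch (the binary example vectors are invented for illustration):

```python
def nominal_counts(a, b):
    # e: Attributes where both values are equal and non-zero
    # u: Attributes where the values are unequal
    # z: Attributes where both values are zero
    e = sum(1 for x, y in zip(a, b) if x == y and x != 0)
    u = sum(1 for x, y in zip(a, b) if x != y)
    z = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return e, u, z

a = (1, 0, 1, 1, 0)
b = (1, 1, 1, 0, 0)
e, u, z = nominal_counts(a, b)
print(e, u, z)              # 2 2 1
print(2 * e / (2 * e + u))  # DiceSimilarity    = 4/6
print(e / (e + u))          # JaccardSimilarity = 2/4
```

The corresponding distances would be the negatives of these similarity values.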

Numerical measure

This parameter defines how to calculate distances if the ExampleSet contains only numerical Attributes, in case the measure type is set to 'numerical measures'. For the different distance measures the following variable is defined:

y(i,j): value of the j-th Attribute of the i-th Example. Hence y(1,3) - y(2,3) is the difference of the values of the third Attribute of the first and second Example.

In case of using a similarity as a distance measure, the actual distance is calculated as the negative similarity.

  • EuclideanDistance: Square root of the sum of quadratic differences over all Attributes. Dist = Sqrt ( Sum_(j=1) [y(1,j)-y(2,j)]^2 )
  • CanberraDistance: Sum over all Attributes. The summand is the absolute difference of the values, divided by the sum of the absolute values. Dist = Sum_(j=1) |y(1,j)-y(2,j)| / (|y(1,j)|+|y(2,j)|) The CanberraDistance is often used to compare ranked lists or for intrusion detection in computer security.
  • ChebychevDistance: Maximum of all differences of all Attributes. Dist = max_(j=1) (|y(1,j)-y(2,j)|)
  • CorrelationSimilarity: The similarity is calculated as the correlation between the Attribute vectors of the two Examples.
  • CosineSimilarity: Similarity measure measuring the cosine of the angle between the Attribute vectors of the two Examples.
  • DiceSimilarity: The DiceSimilarity for numerical Attributes is calculated as 2*Y1Y2/(Y1+Y2). Y1Y2 = Sum over product of values = Sum_(j=1) y(1,j)*y(2,j). Y1 = Sum over values of first Example = Sum_(j=1) y(1,j) Y2 = Sum over values of second Example = Sum_(j=1) y(2,j)
  • DynamicTimeWarpingDistance: Dynamic Time Warping is often used in time series analysis for measuring the distance between two temporal sequences. Here the distance on an optimal "warping" path from the Attribute vector of the first Example to that of the second Example is calculated.
  • InnerProductSimilarity: The similarity is calculated as the sum of the product of the Attribute vectors of the two Examples. Dist = -Similarity = -Sum_(j=1) y(1,j)*y(2,j)
  • JaccardSimilarity: The JaccardSimilarity is calculated as Y1Y2/(Y1+Y2-Y1Y2). See DiceSimilarity for the definition of Y1Y2, Y1 and Y2.
  • KernelEuclideanDistance: The distance is calculated by the euclidean distance of the two Examples, in a transformed space. The transformation is defined by the chosen kernel and configured by the parameters kernel type, gamma, sigma1, sigma2, sigma 3, shift, degree, a, b.
  • ManhattanDistance: Sum of the absolute distances of the Attribute values. Dist = Sum_(j=1) |y(1,j)-y(2,j)|
  • MaxProductSimilarity: The similarity is the maximum of all products of all Attribute values. If the maximum is less than or equal to zero the similarity is not defined. Dist = -Similarity = -max_(j=1) (y(1,j)*y(2,j))
  • OverlapSimilarity: The similarity is a variant of simple matching for numerical Attributes and is calculated as minY1Y2 / min(Y1,Y2). See DiceSimilarity for the definition of Y1, Y2. minY1Y2 = Sum over the minimum of values = Sum_(j=1) min [y(1,j),y(2,j)] .
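A few of the simpler distance measures above can be written directly from their definitions; this sketch uses invented example vectors:

```python
import math

def euclidean(a, b):
    # square root of the sum of squared differences over all Attributes
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # sum of the absolute differences of the Attribute values
    return sum(abs(x - y) for x, y in zip(a, b))

def chebychev(a, b):
    # maximum of the absolute differences over all Attributes
    return max(abs(x - y) for x, y in zip(a, b))

def canberra(a, b):
    # sum of |x-y| / (|x|+|y|), skipping Attributes where both values are 0
    return sum(abs(x - y) / (abs(x) + abs(y))
               for x, y in zip(a, b) if abs(x) + abs(y) > 0)

a, b = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
print(chebychev(a, b))  # 4.0
```

The per-Attribute differences here are 3, 4 and 0, which makes the three results easy to verify by hand.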

Divergence

This parameter defines which type of Bregman divergence is used when the measure type parameter is set to 'bregman divergences'. For the different distance measures the following variable is defined:

y(i,j): value of the j-th Attribute of the i-th Example. Hence y(1,3) - y(2,3) is the difference of the values of the third Attribute of the first and second Example.

  • GeneralizedIDivergence: The distance is calculated as Sum1 - Sum2. It is not applicable if any Attribute value is less than or equal to 0. Sum1 = Sum_(j=1) y(1,j)*ln[y(1,j)/y(2,j)] Sum2 = Sum_(j=1) [y(1,j)-y(2,j)]
  • ItakuraSaitoDistance: The ItakuraSaitoDistance can only be calculated for ExampleSets with 1 Attribute and values larger than 0. Dist = y(1,1)/y(2,1)-ln[y(1,1)/y(2,1)]-1
  • KLDivergence: The Kullback-Leibler divergence is a measure of how one probability distribution diverges from a second expected probability distribution. Dist = Sum_(j=1) [y(1,j)*log_2(y(1,j)/y(2,j))]
  • LogarithmicLoss: The LogarithmicLoss can only be calculated for ExampleSets with 1 Attribute and values larger than 0. Dist = y(1,1)*ln[y(1,1)/y(2,1)]-(y(1,1)-y(2,1))
  • LogisticLoss: The LogisticLoss can only be calculated for ExampleSets with 1 Attribute and values larger than 0. Dist = y(1,1)*ln[y(1,1)/y(2,1)]+(1-y(1,1))*ln[(1-y(1,1))/(1-y(2,1))]
  • MahalanobisDistance: The Mahalanobis distance measures the distance between the two Examples under the assumption that they are both random vectors of the same distribution. Therefore the covariance matrix S is calculated on the whole ExampleSet and the distance is calculated as: Dist = Sqrt [ (vecY1-vecY2)^T S^(-1) (vecY1-vecY2) ] vecY1 = Attribute vector of Example 1 vecY2 = Attribute vector of Example 2
  • SquaredEuclideanDistance: Sum of quadratic differences over all Attributes. Dist = Sum_(j=1) [y(1,j)-y(2,j)]^2
  • SquaredLoss: The SquaredLoss can only be calculated for ExampleSets with 1 Attribute. Dist = [y(1,1)-y(2,1)]^2
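Two of these divergences are simple enough to write down directly; this sketch (with invented probability vectors) shows the KLDivergence and the SquaredEuclideanDistance:

```python
import math

def kl_divergence(p, q):
    # Dist = Sum_j p_j * log_2(p_j / q_j); requires positive q values
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

def squared_euclidean(a, b):
    # sum of squared differences over all Attributes
    return sum((x - y) ** 2 for x, y in zip(a, b))

p = (0.5, 0.5)
q = (0.9, 0.1)
print(kl_divergence(p, q))      # > 0, since p and q differ
print(kl_divergence(p, p))      # 0.0, a distribution diverges from itself by 0
print(squared_euclidean(p, q))  # (0.4)^2 + (0.4)^2 = 0.32
```

Note that the KL divergence is asymmetric, consistent with the remark above that Bregman divergences need not satisfy symmetry.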

Kernel type

This parameter is available only when the numerical measure parameter is set to 'Kernel Euclidean Distance'. The type of the kernel function is selected through this parameter. The following kernel types are supported:

  • dot: The dot kernel is defined by k(x,y) = x*y i.e. it is the inner product of x and y.
  • radial: The radial kernel is defined by k(x,y) = exp(-g*||x-y||^2) where g is gamma, specified by the kernel gamma parameter. The adjustable parameter gamma plays a major role in the performance of the kernel, and should be carefully tuned to the problem at hand.
  • polynomial: The polynomial kernel is defined by k(x,y) = (x*y+1)^d where d is the degree of the polynomial and is specified by the kernel degree parameter. Polynomial kernels are well suited for problems where all the training data is normalized.
  • sigmoid: This is a hyperbolic tangent sigmoid kernel. The distance is calculated as tanh[a*Y1Y2+b] where Y1Y2 is the inner product of the Attribute vector of the two Examples. a and b can be adjusted using the kernel a and kernel b parameters. A common value for a is 1/N, where N is the data dimension. Note that not all choices of a and b lead to a valid kernel function.
  • anova: The anova kernel is defined by the summation of exp(-g(x-y)) raised to the power d, where g is gamma and d is degree. The two are adjusted by the kernel gamma and kernel degree parameters respectively.
  • epachnenikov: The Epanechnikov kernel is the function (3/4)(1-u^2) for u between -1 and 1 and zero for u outside that range. It has the two adjustable parameters kernel sigma1 and kernel degree.
  • gaussian_combination: This is the gaussian combination kernel. It has the adjustable parameters kernel sigma1, kernel sigma2 and kernel sigma3.
  • multiquadric: The multiquadric kernel is defined by the square root of ||x-y||^2+c^2. It has the adjustable parameters kernel sigma1 and kernel sigma shift.
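A kernel distance of the kind used by the 'Kernel Euclidean Distance' measure can be sketched via the standard identity d(x,y)^2 = k(x,x) - 2*k(x,y) + k(y,y), here with the radial kernel from the list above (function names and example points are invented; this illustrates the general construction, not the Operator's exact code path):

```python
import math

def rbf(x, y, gamma=1.0):
    # radial kernel: k(x,y) = exp(-g * ||x-y||^2)
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

def kernel_euclidean(x, y, kernel=rbf):
    # Euclidean distance in the feature space induced by the kernel:
    # d(x,y)^2 = k(x,x) - 2*k(x,y) + k(y,y)
    return math.sqrt(kernel(x, x) - 2 * kernel(x, y) + kernel(y, y))

print(kernel_euclidean((0.0, 0.0), (1.0, 1.0)))
```

For the radial kernel k(x,x) = 1, so the distance simplifies to sqrt(2 - 2*k(x,y)) and is bounded above by sqrt(2).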

Kernel gamma

This is the SVM kernel parameter gamma. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to radial or anova.

Kernel sigma1

This is the SVM kernel parameter sigma1. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to epachnenikov, gaussian combination or multiquadric.

Kernel sigma2

This is the SVM kernel parameter sigma2. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination.

Kernel sigma3

This is the SVM kernel parameter sigma3. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination.

Kernel shift

This is the SVM kernel parameter shift. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to multiquadric.

Kernel degree

This is the SVM kernel parameter degree. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to polynomial, anova or epachnenikov.

Kernel a

This is the SVM kernel parameter a. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to sigmoid.

Kernel b

This is the SVM kernel parameter b. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to sigmoid.