Univariate Anomaly Detection

Synopsis

This operator calculates univariate (i.e. one attribute at a time) outlier scores for each attribute in your ExampleSet and provides an aggregated outlier score for each row of data.

Description

This operator calculates univariate (i.e. one attribute at a time) outlier scores for each attribute in your ExampleSet. In a second step, it aggregates the individual outlier scores into one score per row. Note: this method assumes that your attributes are statistically independent of one another. If they are not, the aggregated score will not necessarily reflect how anomalous a row really is.

Input

example set

The input ExampleSet.

Output

example set output

The resulting output ExampleSet with the anomaly score(s).

preprocessing model

A preprocessing model which allows you to apply the same method on a different ExampleSet.
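The idea of reusing a fitted method on new data can be sketched as follows. This is only an illustration, assuming the z-Score method described under Parameters and assuming that the model stores the mean and standard deviation learned from the first data set; the names `ZScoreModel` and `fit_z_score` are hypothetical and not part of the operator.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ZScoreModel:
    """Hypothetical stand-in for a preprocessing model (assumption)."""
    mean: float
    std: float

    def apply(self, values):
        # Score new values with the statistics learned earlier,
        # instead of recomputing them on the new data.
        return (np.asarray(values, dtype=float) - self.mean) / self.std


def fit_z_score(values):
    values = np.asarray(values, dtype=float)
    # Population standard deviation (ddof=0) is an assumption here.
    return ZScoreModel(mean=float(values.mean()), std=float(values.std()))


model = fit_z_score([1.0, 2.0, 3.0])      # "training" data
new_scores = model.apply([2.0, 12.0])     # a different data set
```

Scoring the second data set against the first one's statistics keeps the scores comparable between the two.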

Parameters

Method

This parameter allows you to select the method you want to use to calculate univariate outlier scores.

  • Percentile Distance: The Percentile Distance method scores each value by its distance to the nearest of two percentile bounds. If this method is chosen, you have to provide a threshold that defines these bounds. A threshold of 0.05 means that values are compared against the 5th and 95th percentiles. The score is the distance to the bound if the value is above the upper or below the lower percentile, and zero otherwise. The option scoring_type can be set to take only the upper or only the lower percentile into account, which means you search only for top or only for bottom outliers.
  • Quartiles: The Quartiles method calculates the anomaly score as: score = (value - median)/IQR, where IQR is the interquartile range (the difference between the 75th and the 25th percentile). The Quartiles method can be seen as a more robust version of the z-Score method (see below).
  • Histogram: The Histogram method constructs a histogram for each attribute. The number of bins is determined automatically via the Freedman-Diaconis rule. For each bin in the histogram it calculates the "probability" as: (frequency+1)/size (the +1 avoids division by zero). The anomaly score for a given bin is then calculated as 1/probability.
  • z-Score: The z-Score method calculates the anomaly score as: score = (value - mean)/standard deviation. This can be interpreted as the distance of the current value from the mean, measured in standard deviations.
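The Quartiles, Histogram and z-Score formulas above can be sketched in Python with NumPy. This is a minimal illustration, not the operator's implementation. The function names are hypothetical. The use of the population standard deviation, NumPy's default percentile interpolation, NumPy's `bins="fd"` estimator, and reading "size" as the number of values are all assumptions.

```python
import numpy as np


def z_score(values):
    values = np.asarray(values, dtype=float)
    # Assumption: population standard deviation (ddof=0).
    return (values - values.mean()) / values.std()


def quartile_score(values):
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return (values - median) / (q3 - q1)  # divide by the IQR


def histogram_score(values):
    values = np.asarray(values, dtype=float)
    # Freedman-Diaconis bin count as estimated by NumPy.
    counts, edges = np.histogram(values, bins="fd")
    # Map each value to its bin; only interior edges are used so that
    # the maximum value falls into the last bin.
    bin_index = np.digitize(values, edges[1:-1])
    probability = (counts[bin_index] + 1) / values.size
    return 1.0 / probability


data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(z_score(data))
print(quartile_score(data))
print(histogram_score(data))
```

For this sample, the value 100 receives the largest score under all three methods. The Quartiles score is far less affected by the outlier itself, because the median and IQR barely move when a single extreme value is added.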

Percentile threshold

Only available if the method is percentile distance. Defines the upper and lower percentiles to check against. A threshold of 0.05 means that values are compared against the 5th and 95th percentiles.

Scoring type

Only available if the method is percentile distance.

  • both: Both bottom and top outliers are considered.
  • only_top: Only top outliers are considered.
  • only_bottom: Only bottom outliers are considered.
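How the percentile threshold and the scoring type interact can be sketched as follows. This is a hedged illustration rather than the operator's code: the function name `percentile_distance` is hypothetical, and the percentile interpolation is NumPy's default, which may differ from what the operator uses.

```python
import numpy as np


def percentile_distance(values, threshold=0.05, scoring_type="both"):
    values = np.asarray(values, dtype=float)
    lower = np.quantile(values, threshold)        # e.g. 5th percentile
    upper = np.quantile(values, 1.0 - threshold)  # e.g. 95th percentile
    # Distance beyond each bound; zero for values inside the bounds.
    below = np.clip(lower - values, 0.0, None)
    above = np.clip(values - upper, 0.0, None)
    if scoring_type == "only_top":
        return above
    if scoring_type == "only_bottom":
        return below
    if scoring_type == "both":
        # For thresholds up to 0.5 at most one of the two is non-zero.
        return above + below
    raise ValueError(f"unknown scoring_type: {scoring_type!r}")


data = np.arange(0, 101)
both = percentile_distance(data, threshold=0.05, scoring_type="both")
top = percentile_distance(data, threshold=0.05, scoring_type="only_top")
```

With `only_top`, values below the lower percentile score zero, so only unusually large values are flagged.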

Aggregation method

This parameter allows you to select how you want to aggregate (combine) the different outlier scores from the individual attributes.

  • Average: Calculates the average (arithmetic mean) of all univariate outlier scores for one row of data.
  • Maximum: Finds the maximum of all univariate outlier scores for one row of data.
  • Product: Calculates a normalized product of all univariate outlier scores for one row of data (the product of outlier scores for one row divided by the number of scores). Note: for numerical stability, the operator uses a sum of logarithms internally.
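The three aggregation methods can be sketched as below. This is an illustration under assumptions, not the operator's implementation: the function name `aggregate` is hypothetical, the Product branch follows the literal wording above (product divided by the number of scores), and it assumes strictly positive scores so that the logarithm is defined.

```python
import numpy as np


def aggregate(scores, method="average"):
    # scores: one row per example, one column per attribute score.
    scores = np.asarray(scores, dtype=float)
    if method == "average":
        return scores.mean(axis=1)
    if method == "maximum":
        return scores.max(axis=1)
    if method == "product":
        # Sum of logs instead of a direct product, which avoids
        # overflow or underflow when many scores are multiplied.
        log_product = np.log(scores).sum(axis=1)
        return np.exp(log_product) / scores.shape[1]
    raise ValueError(f"unknown aggregation method: {method!r}")


scores = [[1.0, 2.0, 4.0],
          [2.0, 2.0, 2.0]]
print(aggregate(scores, "average"))
print(aggregate(scores, "maximum"))
print(aggregate(scores, "product"))
```

Maximum flags a row as soon as a single attribute is extreme, while Average and Product require several attributes to be unusual before the row score becomes large.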

Show individual scores

If the show individual scores parameter is set to true, the operator creates a new outlier score attribute for each attribute selected. If set to false, only the aggregated outlier score is shown.

See Also