Detect Outliers (Isolation Forest)

Synopsis

This operator finds outliers using an Isolation Forest.

Description

Isolation Forests are anomaly detection algorithms. Unlike ordinary Random Forests, they do not require any labels to work. Their purpose is to detect whether data points are similar to the training data or not.

In order to do this, an isolation tree picks a random attribute and performs a random split on it. The data is sent to the two child nodes, and the same strategy is applied recursively until either the chosen attribute is constant or there are not enough examples left in the node.
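The recursive splitting described above can be sketched as follows. This is a simplified illustration, not the operator's actual implementation; all function and field names are hypothetical, and each example is assumed to be a plain dict of attribute values.

```python
import random

def build_tree(rows, attributes, max_leaf_size, depth=0):
    """Build one isolation tree by recursive random splitting (illustrative sketch)."""
    # Stopping criterion: not enough examples left in the node.
    if len(rows) <= max_leaf_size:
        return {"leaf": True, "size": len(rows), "depth": depth}
    # Pick a random attribute and a random split point within its range.
    attr = random.choice(attributes)
    lo = min(r[attr] for r in rows)
    hi = max(r[attr] for r in rows)
    if lo == hi:  # chosen attribute is constant in this node: stop splitting
        return {"leaf": True, "size": len(rows), "depth": depth}
    split = random.uniform(lo, hi)
    left = [r for r in rows if r[attr] < split]
    right = [r for r in rows if r[attr] >= split]
    if not left or not right:  # degenerate split: stop
        return {"leaf": True, "size": len(rows), "depth": depth}
    return {"leaf": False, "attr": attr, "split": split,
            "left": build_tree(left, attributes, max_leaf_size, depth + 1),
            "right": build_tree(right, attributes, max_leaf_size, depth + 1)}
```

An isolation forest simply builds many such trees, each on its own random sample of the data.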

The central idea of an isolation forest is that "normal" data points have long paths through a tree, while abnormal data points tend to have short average path lengths across the forest.

There are currently two ways to determine an anomaly score. If you choose average_path, you will receive the average path length until a given example reaches a leaf. If you choose normalized_score, this length is normalized as score = pow(2, -1*l/c), where l is the average path length and c is the theoretical average depth of a tree.
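The normalization can be computed as below. The formula for c follows the original paper referenced further down, where c(n) is the average path length of an unsuccessful binary-search-tree lookup on n examples; whether the operator uses exactly this approximation of the harmonic number is an assumption.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Theoretical average path length of a tree built on n examples
    (normalization constant from the original isolation forest paper)."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def normalized_score(avg_path_length, n):
    """score = 2^(-l / c(n)); values near 1 indicate anomalies,
    values well below 0.5 indicate normal points."""
    return 2.0 ** (-avg_path_length / c(n))
```

Note that an example whose average path length equals c(n) receives a score of exactly 0.5, and shorter paths always yield higher scores.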

For more details, please have a look at the original paper: https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf

Input

example set

The input ExampleSet.

Output

example set

The resulting output ExampleSet with anomaly scores.

model

The isolation forest model, which can be applied to a different data set.

Parameters

Number of trees

The number of trees in the forest.

Max leaf size

Maximal number of examples in a leaf. Used as stopping criterion.

Use feature heuristic

If set to true, max_features is set to the square root of the number of regular attributes in your ExampleSet.
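The square-root heuristic amounts to the following. How the result is rounded (round vs. floor) is an assumption here; the function name is hypothetical.

```python
import math

def heuristic_max_features(num_regular_attributes):
    """Square-root feature heuristic: use about sqrt(#regular attributes)
    features per tree (rounding choice is an assumption)."""
    return max(1, round(math.sqrt(num_regular_attributes)))
```

For instance, an ExampleSet with 16 regular attributes would use 4 features per tree.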

Max features

Defines how many features are considered in every tree. Splits are only performed on those features.

Bootstrap ratio

Each tree is trained on a bootstrapped subset of the original ExampleSet: a sample drawn from the original data in which individual examples may appear more than once. The ratio defined in this setting determines how many rows the bootstrapped sample will have. For example, if your original table has 100 rows and the bootstrap ratio is 0.9, every tree will be trained on 90 rows.
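The bootstrap sampling described above can be sketched like this; the function is an illustrative assumption, not the operator's internal code.

```python
import random

def bootstrap_sample(rows, ratio):
    """Draw round(ratio * len(rows)) rows with replacement, so the same
    example may appear more than once in the result."""
    k = round(ratio * len(rows))
    return [random.choice(rows) for _ in range(k)]
```

With 100 rows and a ratio of 0.9, this yields a 90-row sample, possibly containing duplicates.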

Score calculation

What option to use to generate the score.