Detect Outlier (COF)
Synopsis
This operator identifies outliers in the given ExampleSet based on the Class Outlier Factors (COF).
Description
The ECODB (Enhanced Class Outlier - Distance Based) algorithm ranks each instance in the ExampleSet, given the parameters N (the number of top class outliers to report) and K (the number of nearest neighbors). The rank of each instance is determined by its Class Outlier Factor, computed with the formula:
COF = PCL(T,K) - norm(Deviation(T)) + norm(KDist(T))
- PCL(T,K) is the Probability of the Class Label of the instance T with respect to the class labels of its K nearest neighbors.
- norm(Deviation(T)) and norm(KDist(T)) are the normalized values of Deviation(T) and KDist(T) respectively; both fall in the range [0, 1].
- Deviation(T) is how much the instance T deviates from instances of the same class. It is computed by summing the distances between the instance T and every instance belonging to the same class.
- KDist(T) is the sum of the distances between the instance T and its K nearest neighbors.
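The formula and its terms can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the operator's actual implementation: it assumes Euclidean distance, min-max normalization for norm(), and that an instance is not counted among its own K nearest neighbors. The names cof_scores and min_max are hypothetical.

```python
import math

def min_max(values):
    """Scale values to [0, 1] (assumed normalization); a constant list maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def cof_scores(points, labels, k):
    """Return COF = PCL - norm(Deviation) + norm(KDist) for every instance."""
    n = len(points)
    pcl, deviation, k_dist = [], [], []
    for i in range(n):
        # Distances from instance i to every other instance (Euclidean is assumed).
        dists = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        neighbors = dists[:k]
        # PCL: share of the K nearest neighbors that carry the same class label.
        same = sum(1 for _, j in neighbors if labels[j] == labels[i])
        pcl.append(same / k)
        # Deviation: summed distance to all instances of the same class.
        deviation.append(sum(d for d, j in dists if labels[j] == labels[i]))
        # KDist: summed distance to the K nearest neighbors.
        k_dist.append(sum(d for d, _ in neighbors))
    dev_n, kd_n = min_max(deviation), min_max(k_dist)
    return [p - d + kd for p, d, kd in zip(pcl, dev_n, kd_n)]
```

For a point labeled 'b' that sits in the middle of a cluster of 'a' points, PCL is 0, the normalized deviation is high and the normalized KDist is low, so its COF comes out at the bottom of the ranking.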
This operator adds a new boolean attribute named 'outlier' to the given ExampleSet. If the value of this attribute is true, the example is an outlier; if it is false, the example is not. Another special attribute, 'COF Factor', is also added to the ExampleSet. It measures the degree to which an example is a class outlier.
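How the two attributes relate can be sketched as follows. This is a hedged illustration, not the operator's code: it assumes that smaller COF values indicate stronger class outliers (a low PCL and a high deviation both lower the score in the formula above), and the helper name flag_top_n is hypothetical.

```python
def flag_top_n(cof, n):
    """Mark the n instances with the smallest COF values as outliers.

    Assumption: a lower COF means a stronger class outlier.
    """
    # Indices ordered from smallest to largest COF value.
    order = sorted(range(len(cof)), key=lambda i: cof[i])
    chosen = set(order[:n])
    # One boolean per instance, playing the role of the 'outlier' attribute.
    return [i in chosen for i in range(len(cof))]
```

Calling flag_top_n with a list of COF values and N = 2 returns a boolean list in which exactly the two lowest-scoring instances are true.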
An outlier is an example that is numerically distant from the rest of the examples of the ExampleSet, that is, one that appears to deviate markedly from the other examples. Outliers are often (but not always) indicative of measurement error. When that is the case, such examples should be discarded.
Input
example set input
This input port expects an ExampleSet. It is the output of the Generate Data operator in the attached Example Process. The output of other operators can also be used as input.
Output
example set output
A new boolean attribute 'outlier' and a real attribute 'COF Factor' are added to the given ExampleSet, and the ExampleSet is delivered through this output port.
original
The ExampleSet that was given as input is passed unchanged to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.