Forward Selection
Synopsis
This operator selects the most relevant attributes of the given ExampleSet through a highly efficient implementation of the forward selection scheme.
Description
The Forward Selection operator is a nested operator, i.e. it has a subprocess. The subprocess must always return a performance vector. For more information about subprocesses, see the <reference key="operator.subprocess">Subprocess</reference> operator.
The Forward Selection operator starts with an empty selection of attributes and, in each round, tentatively adds each unused attribute of the given ExampleSet in turn. For each added attribute, the performance is estimated using the inner operators, e.g. a cross-validation. Only the attribute giving the highest increase in performance is added to the selection. Then a new round is started with the modified selection. This implementation needs no memory beyond what is originally used for storing the data and what the inner operators may require.
The stopping behavior parameter specifies when the iteration should be aborted. There are three different options:
- without increase: The iteration runs as long as there is any increase in performance.
- without increase of at least: The iteration runs as long as the increase is at least as high as specified, either relative or absolute. The minimal relative increase parameter is used for specifying the minimal relative increase if the use relative increase parameter is set to true. Otherwise, the minimal absolute increase parameter is used for specifying the minimal absolute increase.
- without significant increase: The iteration stops as soon as the increase is not significant to the level specified by the alpha parameter.
The speculative rounds parameter defines how many further rounds are performed in a row after the stopping criterion is fulfilled for the first time. If the performance increases again during the speculative rounds, the selection continues. Otherwise, all attributes added during the speculative rounds are removed, as if no speculative rounds had been executed. This can help avoid getting stuck in local optima.
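The loop described above can be sketched in Python as follows. This is a minimal illustration, not RapidMiner's implementation: the names `forward_selection`, `evaluate` and `toy_scores` are hypothetical, only the 'without increase' stopping rule is modelled, and comparing each round against the best score seen so far is an assumption about how speculative rounds are judged.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple


def forward_selection(
    attributes: Sequence[str],
    evaluate: Callable[[FrozenSet[str]], float],
    max_attributes: int = 10,
    speculative_rounds: int = 0,
) -> Tuple[List[str], float]:
    """Greedy forward selection with the 'without increase' stopping rule."""
    selected: List[str] = []       # working selection, may hold speculative picks
    best_selected: List[str] = []  # last selection that actually improved
    best_score = float("-inf")
    failed_rounds = 0              # consecutive rounds without an increase

    while len(selected) < min(max_attributes, len(attributes)):
        unused = [a for a in attributes if a not in selected]
        # Tentatively add each unused attribute and score the resulting subset.
        scored = [(evaluate(frozenset(selected + [a])), a) for a in unused]
        round_score, round_attr = max(scored, key=lambda pair: pair[0])
        selected.append(round_attr)

        if round_score > best_score:
            best_score = round_score
            best_selected = list(selected)
            failed_rounds = 0
        else:
            failed_rounds += 1
            # Stop once the speculative rounds are used up.
            if failed_rounds > speculative_rounds:
                break

    # Attributes added only during failed speculative rounds are dropped.
    return best_selected, best_score


# Hypothetical scores: "b" and "c" only help when used together with "a".
toy_scores: Dict[FrozenSet[str], float] = {
    frozenset("a"): 0.6, frozenset("b"): 0.5, frozenset("c"): 0.5,
    frozenset("ab"): 0.6, frozenset("ac"): 0.6,
    frozenset("abc"): 0.9,
}

greedy = forward_selection(["a", "b", "c"], toy_scores.__getitem__)
patient = forward_selection(["a", "b", "c"], toy_scores.__getitem__,
                            speculative_rounds=1)
```

In this toy setup, `greedy` stops after the first attribute because the second round brings no increase, whereas `patient` spends one speculative round and reaches the better three-attribute subset.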
Feature selection, i.e. the search for the most relevant features for a classification or regression problem, is one of the main data mining tasks. A wide range of search methods has been integrated into RapidMiner, including evolutionary algorithms. All search methods need a performance measurement that indicates how well a search point (a feature subset) is likely to perform on the given data set.
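As an illustration of such a performance measurement, the sketch below scores a feature subset by k-fold cross-validated accuracy. It is only an example under stated assumptions: the 1-nearest-neighbour classifier, the fold assignment by row index, and the names `cross_validated_accuracy`, `rows` and `labels` are all hypothetical and stand in for whatever inner operators the subprocess actually contains.

```python
from typing import Sequence


def cross_validated_accuracy(
    rows: Sequence[Sequence[float]],
    labels: Sequence[int],
    feature_idx: Sequence[int],
    folds: int = 3,
) -> float:
    """k-fold accuracy of a 1-nearest-neighbour classifier on the chosen features."""
    n = len(rows)
    correct = 0
    for fold in range(folds):
        # Assumption: rows are assigned to folds by index modulo the fold count.
        test_ids = [i for i in range(n) if i % folds == fold]
        train_ids = [i for i in range(n) if i % folds != fold]
        for i in test_ids:
            # Squared Euclidean distance restricted to the selected features.
            nearest = min(
                train_ids,
                key=lambda j: sum((rows[i][f] - rows[j][f]) ** 2 for f in feature_idx),
            )
            correct += labels[nearest] == labels[i]
    return correct / n


# Made-up data: feature 0 separates the classes, feature 1 is noise.
rows = [(0.0, 5.0), (0.1, 1.0), (0.2, 9.0), (1.0, 5.1), (1.1, 0.9), (1.2, 9.1)]
labels = [0, 0, 0, 1, 1, 1]

score_informative = cross_validated_accuracy(rows, labels, [0])
score_noise = cross_validated_accuracy(rows, labels, [1])
```

A search method would call a function like this once per candidate subset and prefer the subset with the higher score; here the informative feature scores higher than the noise feature.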
Differentiation
Backward Elimination
The Backward Elimination operator starts with the full set of attributes and, in each round, it removes each remaining attribute of the given ExampleSet. For each removed attribute, the performance is estimated using the inner operators, e.g. a cross-validation. Only the attribute giving the least decrease of performance is finally removed from the selection. Then a new round is started with the modified selection.
Input
example set
This input port expects an ExampleSet. This ExampleSet is available at the first port of the nested chain (inside the subprocess) for processing in the subprocess.
Output
example set
The feature selection algorithm is applied on the input ExampleSet. The resultant ExampleSet with reduced attributes is delivered through this port.
attribute weights
The attribute weights are delivered through this port.
performance
This port delivers the Performance Vector for the selected attributes. A Performance Vector is a list of performance criteria values.
Parameters
Maximal number of attributes
This parameter specifies the maximal number of attributes to be selected through Forward Selection.
Speculative rounds
This parameter specifies how many consecutive times the stopping criterion may be ignored before the selection is actually stopped. A number higher than one may help avoid getting stuck in local optima.
Stopping behavior
The stopping behavior parameter specifies when the iteration should be aborted. There are three different options:
- without_increase: The iteration runs as long as there is any increase in performance.
- without_increase_of_at_least: The iteration runs as long as the increase is at least as high as specified, either relative or absolute. The minimal relative increase parameter is used for specifying the minimal relative increase if the use relative increase parameter is set to true. Otherwise, the minimal absolute increase parameter is used for specifying the minimal absolute increase.
- without_significant_increase: The iteration stops as soon as the increase is not significant to the level specified by the alpha parameter.
Use relative increase
This parameter is only available when the stopping behavior parameter is set to 'without increase of at least'. If the use relative increase parameter is set to true, the minimal relative increase parameter is used; otherwise, the minimal absolute increase parameter is used.
Minimal absolute increase
This parameter is only available when the stopping behavior parameter is set to 'without increase of at least' and the use relative increase parameter is set to false. If the absolute performance increase over the previous step drops below this threshold, the selection is stopped.
Minimal relative increase
This parameter is only available when the stopping behavior parameter is set to 'without increase of at least' and the use relative increase parameter is set to true. If the relative performance increase over the previous step drops below this threshold, the selection is stopped.
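The interplay of the use relative increase, minimal absolute increase and minimal relative increase parameters can be sketched as a small Python check. This is an illustration, not the operator's code: the function name `increase_is_sufficient` is hypothetical, and both the definition of the relative increase as the gain divided by the previous performance and the handling of a previous performance of zero are assumptions.

```python
def increase_is_sufficient(
    previous: float,
    current: float,
    use_relative_increase: bool,
    minimal_absolute_increase: float = 0.0,
    minimal_relative_increase: float = 0.0,
) -> bool:
    """Return True if the selection should continue under 'without increase of at least'."""
    absolute = current - previous
    if not use_relative_increase:
        return absolute >= minimal_absolute_increase
    if previous == 0:
        # Assumption: the relative increase is undefined here, so any gain counts.
        return absolute > 0
    # Assumption: relative increase = gain divided by the previous performance.
    return absolute / abs(previous) >= minimal_relative_increase


# Going from 0.80 to 0.82 is a gain of about 0.02 absolute, or about 2.5 % relative.
continue_absolute = increase_is_sufficient(
    0.80, 0.82, use_relative_increase=False, minimal_absolute_increase=0.05)
continue_relative = increase_is_sufficient(
    0.80, 0.82, use_relative_increase=True, minimal_relative_increase=0.02)
```

With these example thresholds, the same step stops the selection under the absolute criterion but lets it continue under the relative one, which is why the two thresholds are configured separately.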
Alpha
This parameter is only available when the stopping behavior parameter is set to 'without significant increase'. It specifies the probability threshold that determines whether differences are considered significant.