X-Validation (Deprecated)
Synopsis
This operator performs a cross-validation in order to estimate the statistical performance of a learning operator (usually on unseen data sets). It is mainly used to estimate how accurately a model (learnt by a particular learning operator) will perform in practice.
Description
The X-Validation operator is a nested operator. It has two subprocesses: a training subprocess and a testing subprocess. The training subprocess is used for training a model. The trained model is then applied in the testing subprocess. The performance of the model is also measured during the testing phase.
The input ExampleSet is partitioned into k subsets of equal size. Of the k subsets, a single subset is retained as the testing data set (i.e. input of the testing subprocess), and the remaining k − 1 subsets are used as training data set (i.e. input of the training subprocess). The cross-validation process is then repeated k times, with each of the k subsets used exactly once as the testing data. The k results from the k iterations then can be averaged (or otherwise combined) to produce a single estimation. The value k can be adjusted using the number of validations parameter.
The learning processes usually optimize the model to make it fit the training data as well as possible. If we test this model on some independent set of data, mostly this model does not perform that well on testing data as it performed on the data that was used to generate it. This is called 'over-fitting'. The Cross-Validation operator predicts the fit of a model to a hypothetical testing data. This can be especially useful when separate testing data is not present.
Differentiation
Explain Predictions
The X-Validation and X-Prediction operators work in the same way. The major difference is the objects returned by these operators. The X-Validation operator returns a performance vector whereas the X-Prediction operator returns labeled ExampleSet.
Input
training example set
This input port expects an ExampleSet for training a model (training data set). The same ExampleSet will be used during the testing subprocess for testing the model.
Output
model
The training subprocess must return a model, which is trained on the input ExampleSet. Please note that model built on the complete input ExampleSet is delivered from this port.
training example set
The ExampleSet that was given as input at the training input port is passed without changing to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
averagable
The testing subprocess must return a Performance Vector. This is usually generated by applying the model and measuring its performance. Two such ports are provided but more can also be used if required. Please note that the statistical performance calculated by this estimation scheme is only an estimate (instead of an exact calculation) of the performance which would be achieved with the model built on the complete delivered data set.