K-Means (H2O)
Synopsis
Performs clustering using the k-means algorithm found in H2O 3.30.0.1.
Description
This Operator performs clustering on the provided ExampleSet and produces a model. The model can be applied later or used for cluster visualization using the Cluster Model Visualizer Operator.
This Operator performs clustering using H2O's k-means algorithm. Clustering groups together Examples that are similar to each other. Because no Label Attribute is necessary, clustering can be used on unlabelled data and is an unsupervised machine learning technique.
The k-means algorithm determines a set of k clusters and assigns each Example to exactly one cluster. The clusters consist of similar Examples. The similarity between Examples is based on the Euclidean distance between them.
A cluster in the k-means algorithm is determined by the position of its center in the n-dimensional space of the n Attributes of the ExampleSet. This position is called the centroid. It can be, but does not have to be, the position of an Example of the ExampleSet.
The k-means algorithm starts with k points which are treated as the centroids of k potential clusters. These start points are determined by one of the following heuristics:
- Random initialization randomly samples k Examples of the ExampleSet as cluster centers.
- PlusPlus initialization chooses one initial center at random and weights the random selection of subsequent centers so that points furthest from the first center are more likely to be chosen.
- Furthest initialization chooses one initial center at random and then chooses the next center to be the point furthest away in terms of Euclidean distance.
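The three heuristics above can be sketched in plain Python. This is an illustrative sketch, not H2O's implementation: the function names are hypothetical, the PlusPlus weighting uses the standard k-means++ rule (squared distance to the nearest center chosen so far), and Furthest is assumed to measure distance against all centers chosen so far.

```python
import math
import random


def init_random(points, k, rng):
    # Random: sample k distinct Examples as start centers.
    return rng.sample(points, k)


def init_furthest(points, k, rng):
    # Furthest: first center at random, then repeatedly take the point
    # whose distance to its nearest chosen center is largest
    # (assumption: distance is measured against all chosen centers).
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def init_plusplus(points, k, rng):
    # PlusPlus: first center at random, then sample each further center
    # with probability proportional to the squared distance to the
    # nearest chosen center (standard k-means++ weighting, assumed here).
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(math.dist(p, c) for c in centers) ** 2 for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers
```

With a fixed `random.Random(seed)` instance the start points are reproducible, which mirrors the role of the local random seed parameter.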
All Examples are assigned to their nearest cluster (in terms of Euclidean distance). Next, the centroids of the clusters are recalculated by averaging over all Examples of each cluster. These two steps are repeated with the new centroids until the centroids no longer move or **max iterations** is reached.
The procedure is repeated max iterations times, each time with a different set of start points. The set of clusters with the minimal sum of squared distances of all Examples to their corresponding centroids is delivered.
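The assignment and update steps, and the selection of the best run by sum of squared distances, can be illustrated with a short plain-Python sketch. It is a simplified illustration rather than H2O's code: the function names are hypothetical, and keeping the old centroid for an empty cluster is an assumption made here.

```python
import math


def assign(points, centroids):
    # Index of the nearest centroid (Euclidean distance) for each point.
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def kmeans(points, start_centroids, max_iterations=10):
    centroids = [tuple(float(x) for x in c) for c in start_centroids]
    for _ in range(max_iterations):
        assignments = assign(points, centroids)
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                # New centroid: per-Attribute average over the members.
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its old centroid.
                new_centroids.append(old)
        if new_centroids == centroids:
            break  # centroids no longer move
        centroids = new_centroids
    assignments = assign(points, centroids)
    sse = sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignments))
    return centroids, assignments, sse


def best_of_restarts(points, list_of_starts, max_iterations=10):
    # Run once per set of start points and keep the run with the
    # smallest sum of squared distances.
    runs = [kmeans(points, s, max_iterations) for s in list_of_starts]
    return min(runs, key=lambda run: run[2])
```

For the four points `(0, 0)`, `(0, 2)`, `(10, 0)`, `(10, 2)`, starting from `(0, 0)` and `(10, 0)` separates the left pair from the right pair, while starting from `(0, 0)` and `(0, 2)` converges to a worse split, so the restart selection keeps the former.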
This Operator supports the automatic inference of k. The estimate k parameter specifies whether to estimate the number of clusters (less than or equal to maximum k) iteratively (independent of the seed) and deterministically (beginning with k=1,2,3...). If enabled, the estimation will run up to max iterations for each value of k. This option is disabled by default.
Standardization is highly recommended; if you do not use standardization, the results can include components that are dominated by variables that appear to have larger variances relative to other attributes as a matter of scale, rather than true contribution. This Operator can take care of standardization as well. This option is enabled by default.
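As a rough illustration of why scale matters, the sketch below standardizes each numeric column to zero mean and unit variance. The function name and the sample data are hypothetical, and the use of the population standard deviation is an assumption; H2O may compute the deviation differently.

```python
import statistics


def standardize(columns):
    """Z-score each column: subtract the mean, divide by the std deviation."""
    result = {}
    for name, values in columns.items():
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)  # population std (assumption)
        # A constant column has std 0; map it to zeros instead of dividing.
        result[name] = [(v - mean) / std if std > 0 else 0.0 for v in values]
    return result


# Hypothetical data: income spans a far larger numeric range than age,
# so it would dominate Euclidean distances on the raw values.
raw = {"income": [20000.0, 40000.0, 60000.0], "age": [20.0, 40.0, 60.0]}
scaled = standardize(raw)
```

After standardization both columns contribute on the same scale, so neither dominates the distance purely because of its unit.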
Differentiation
K-Medoids
In the k-medoids algorithm, the center of a cluster is always one of the points in the cluster. This is the major difference between the k-means and k-medoids algorithms.
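The difference can be seen on a single cluster. In this minimal sketch (hypothetical function names and data), the centroid is the per-Attribute average, while the medoid is taken to be the member with the smallest total Euclidean distance to all other members, which is one common definition.

```python
import math


def centroid(points):
    # Per-Attribute average; need not coincide with any Example.
    return tuple(sum(col) / len(points) for col in zip(*points))


def medoid(points):
    # The member with the smallest total distance to all members.
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))


cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (1.0, 0.5)]
center_mean = centroid(cluster)   # lies between the Examples
center_medoid = medoid(cluster)   # is one of the Examples
```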
k-Means
The k-Means (H2O) Operator uses H2O's k-means implementation. This is generally faster and more flexible (e.g. it provides options for standardization and estimation of k) than the old k-Means Operator. However, the k-Means (H2O) Operator can only use Euclidean distance as the distance measure.
K-Means (Kernel)
Kernel k-means uses kernels to estimate distances between Examples and clusters. Because of the nature of kernels, it is necessary to sum over all Examples of a cluster to calculate one distance. This algorithm is therefore quadratic in the number of Examples and does not return a Centroid Cluster Model (in contrast, the K-Means Operator returns a Centroid Cluster Model).
Input
example set
The input port expects an ExampleSet (labelled or unlabelled). Special Attributes (like label, cluster or id) are ignored during clustering.
Output
cluster model
This port delivers the cluster model. It contains the information which Examples are part of which cluster. It also stores the position of the centroids of the clusters. It can be used by the Apply Model Operator to perform the specified clustering on another ExampleSet. The cluster model can also be grouped together with other clustering models, preprocessing models and learning models by the Group Models Operator.
clustered set
Depending on the add cluster attribute and the add as label parameters an Attribute 'cluster' with special role 'Cluster' or an Attribute 'label' with special role 'Label' is added. The resulting ExampleSet is delivered at this output port.
Parameters
Add cluster attribute
If enabled, a cluster id is generated as a new special Attribute directly by this Operator; otherwise this Operator does not add an id Attribute. In the latter case you have to use the Apply Model Operator to generate the cluster Attribute.
Add as label
If true, the new Attribute with the cluster_id is called 'label' and has the special role 'Label'. If the parameter **add cluster attribute** is false, no new Attribute is created.
Remove unlabeled
If set to true, Examples which cannot be assigned to a cluster are removed from the output ExampleSet.
Estimate k
This parameter specifies whether to estimate the number of clusters (less than or equal to maximum k) iteratively (independent of the seed) and deterministically (beginning with k=1,2,3...). If enabled, the estimation will run up to max iterations for each k value. Note that the algorithm prefers a small number of clusters in the end result, and will likely not try every possible k value up to maximum k. Always check whether your clustering results are in line with your expectations.
K
This parameter specifies the number of clusters to determine.
Maximum k
This parameter specifies the maximum number of clusters to determine. This is a ceiling value for k estimation.
Standardize
Enable this option to standardize the numeric columns to have a mean of zero and unit variance. Standardization is highly recommended; if you do not use standardization, the results can include components that are dominated by variables that appear to have larger variances relative to other attributes as a matter of scale, rather than true contribution.
Initialization mode
This parameter specifies the initialization mode.
- Random: Randomly samples k Examples of the ExampleSet as cluster centers.
- PlusPlus: Chooses one initial center at random and weights the random selection of subsequent centers so that points furthest from the first center are more likely to be chosen.
- Furthest: Chooses one initial center at random and then chooses the next center to be the point furthest away in terms of Euclidean distance.
Use local random seed
Indicates if a local random seed should be used for randomization.
Local random seed
Local random seed for random generation. This parameter is only available if the use local random seed parameter is set to true.
Nominal encoding
Encoding schemes for handling categorical features.
- AUTO: Allow the algorithm to decide. In K-Means, the algorithm will automatically perform Enum encoding.
- Enum: 1 column per categorical feature.
- One-Hot Explicit: N+1 new columns for categorical features with N levels.
- Binary: No more than 32 columns per categorical feature.
- Eigen: K columns per categorical feature, keeping projections of one-hot-encoded matrix onto k-dim eigen space only.
- LabelEncoder: Convert every enum into the integer of its index (for example, level 0 -> 0, level 1 -> 1, etc.).
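Two of the schemes above are easy to show on a small column. The sketch below is illustrative only: the function names are hypothetical, levels are assumed to be ordered alphabetically, and the extra column in the one-hot variant is assumed to act as a missing-value indicator (which would account for the N+1 columns).

```python
def label_encode(values):
    # LabelEncoder: map each level to the integer of its index.
    levels = sorted({v for v in values if v is not None})
    index = {level: i for i, level in enumerate(levels)}
    return levels, [index.get(v) for v in values]  # missing stays None


def one_hot_explicit(values):
    # One-Hot Explicit: one 0/1 column per level plus one extra column,
    # assumed here to flag missing values, giving N+1 columns in total.
    levels = sorted({v for v in values if v is not None})
    rows = []
    for v in values:
        row = [1 if v == level else 0 for level in levels]
        row.append(1 if v is None else 0)
        rows.append(row)
    return levels, rows


colors = ["red", "green", "blue", "green", None]
levels, codes = label_encode(colors)
_, one_hot_rows = one_hot_explicit(colors)
```

Note that integer codes impose an artificial order and spacing on the levels, which matters for a distance-based method such as k-means; one-hot columns avoid that at the cost of more dimensions.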
Max iterations
Specify the maximum number of training iterations.
Expert parameters
These parameters are for fine tuning the algorithm. Usually the default values provide a decent model, but in some cases it may be useful to change them. Please use true/false values for boolean parameters and the exact attribute name for columns. Arrays can be provided by splitting the values with the comma (,) character. More information on the parameters can be found in the H2O documentation.
- export_checkpoints_dir: Specify a directory to which generated models will automatically be exported. Type: String, Default: empty
- cluster_size_constraints: An array specifying the minimum number of points that should be in each cluster. The length of the constraints array has to be the same as the number of clusters. Type: array, Default: empty
- ignore_const_cols: Specify whether to ignore constant training columns, since no information can be gained from them. This option is enabled by default. Type: boolean, Default: false
- max_runtime_secs: Maximum allowed runtime in seconds for model training. Use 0 to disable. Type: double, Default: 0.0
- score_each_iteration: Specify whether to score during each iteration of the model training. Type: boolean, Default: false