Algorithms
K-means
K-means clustering is an unsupervised learning algorithm that groups similar data points into clusters, with the number of clusters set by the variable k. The K-means algorithm uses the scikit-learn K-means implementation. The cluster assigned to each event is written to a new field named cluster. Use the K-means algorithm when you have unlabeled data and at least approximate knowledge of the number of groups into which the data can be divided.
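As a rough illustration of what the algorithm does, the following pure-Python sketch alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its members. It is not the toolkit's code, which relies on scikit-learn. The function name kmeans, the seed argument (which plays a role similar to random_state), and the sample points are assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=None, max_iter=100):
    """Illustrative K-means loop; returns (labels, centroids).

    Assumption: initial centroids are k distinct points drawn at random.
    scikit-learn uses a more careful initialization by default.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical two-dimensional data with two visually separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2, seed=0)
print(labels)
print(centroids)
```

Calling kmeans twice with the same seed returns the same labels. Different seeds start from different centroids and may produce different clusterings.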
Using the K-means algorithm has the following advantages:
- Computationally faster than most other clustering algorithms.
- Simple algorithm to explain and understand.
- Normally produces tighter clusters than hierarchical clustering.
Using the K-means algorithm has the following disadvantages:
- Difficult to determine the optimal or true value of k. See X-means.
- Sensitive to scaling. See StandardScaler.
- Each clustering may be slightly different, unless you specify the random_state parameter.
- Does not work well with clusters of different sizes and density.
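To make the scaling sensitivity concrete, the sketch below shows how a field measured in large units can dominate Euclidean distance, and how z-score standardization changes which event counts as the nearest neighbor. The two fields (bytes and duration in seconds), the sample rows, and the helper names are hypothetical. The standardize function only imitates the idea behind StandardScaler and is not the toolkit's implementation.

```python
import math


def standardize(rows):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0  # guard constant columns
        for col, m in zip(columns, means)
    ]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]


def nearest_neighbor(rows, index):
    """Index of the row closest (Euclidean distance) to rows[index]."""
    target = rows[index]
    others = [i for i in range(len(rows)) if i != index]
    return min(
        others,
        key=lambda i: math.sqrt(sum((a - b) ** 2 for a, b in zip(target, rows[i]))),
    )


# Hypothetical events: (bytes, duration_seconds)
rows = [(1000.0, 1.0), (1020.0, 9.0), (1060.0, 1.5), (2000.0, 5.0)]

print(nearest_neighbor(rows, 0))               # 1: raw distance is driven almost entirely by bytes
print(nearest_neighbor(standardize(rows), 0))  # 2: after scaling, duration matters as well
```

Because K-means assigns events by distance to centroids, the same shift in nearest neighbors can change the resulting clusters.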
For the default value of k, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Parameters
The k parameter specifies the number of clusters to divide the data into. By default, the cluster label field name is cluster. To use a different field name, specify it with the as keyword.
Syntax
fit KMeans <fields> [into <model name>] [k=<int>] [random_state=<int>]
You can save K-means models using the into keyword when using the fit command.
You can apply the model to new data using the apply command.
... | apply cluster_model
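Conceptually, applying a fitted K-means model means assigning each new event to the nearest of the centroids learned during fit. The sketch below illustrates that idea only. The saved_centroids values, the new_events data, and the assign_cluster helper are hypothetical, and the way the toolkit stores and loads models is not shown.

```python
def assign_cluster(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((x - y) ** 2 for x, y in zip(point, centroids[c])),
    )


# Hypothetical centroids standing in for a previously fitted model.
saved_centroids = [(1.0, 1.0), (8.0, 8.0)]

# Hypothetical new events that were not part of the original fit.
new_events = [(0.5, 1.5), (7.0, 9.0), (4.0, 4.0)]

clusters = [assign_cluster(event, saved_centroids) for event in new_events]
print(clusters)  # [0, 1, 0]
```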
You can inspect the model using the summary command.
... | summary cluster_model
Example
The following example uses K-means on a test set.
... | fit KMeans * k=3 | stats count by cluster
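For readers more familiar with Python than with search commands, the counting step of this example (stats count by cluster) can be approximated as follows. The events list and its host field are made-up stand-ins for search results that already carry a cluster field.

```python
from collections import Counter

# Hypothetical events after clustering; each one carries a cluster label.
events = [
    {"host": "web-1", "cluster": 0},
    {"host": "web-2", "cluster": 0},
    {"host": "db-1", "cluster": 1},
    {"host": "db-2", "cluster": 1},
    {"host": "web-3", "cluster": 0},
    {"host": "cache-1", "cluster": 2},
]

# Count how many events fall into each cluster.
counts = Counter(event["cluster"] for event in events)
for cluster, count in sorted(counts.items()):
    print(cluster, count)
```

A very uneven count per cluster can be a hint that the chosen value of k does not match the structure of the data.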
Local availability
- Local class: KMeans
- Source file: Splunk_ML_Toolkit/bin/algos/KMeans.py (in-repo path Splunk_ML_Toolkit/bin/algos/KMeans.py)
- algos.conf stanza: [KMeans]
- Class bases: ClustererMixin, BaseAlgo
Source
Adapted from the Splunk AI Toolkit 5.6.4 documentation at /en/splunk-cloud-platform/apply-machine-learning/use-ai-toolkit/5.6.4/algorithms-and-scoring-metrics-in-the-ai-toolkit/algorithms-in-the-ai-toolkit (section: clusterer).