Algorithms

K-means

K-means clustering is an unsupervised learning algorithm that groups similar data points into k clusters, where k is a number you choose. The K-means algorithm uses the scikit-learn K-means implementation. The cluster assigned to each event is written to a new field named cluster. Use the K-means algorithm when you have unlabeled data and at least an approximate idea of how many groups the data can be divided into.
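
To make the grouping idea concrete, the following is a minimal pure-Python sketch of the classic K-means iteration (Lloyd's algorithm). It is an illustration only, not the scikit-learn or toolkit implementation: the function names, the seed argument (standing in for random_state), and the sample points are all assumptions made for this example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=None, max_iter=100):
    """Return (labels, centroids) for the given points (illustrative sketch)."""
    rng = random.Random(seed)          # seed plays the role of random_state
    centroids = rng.sample(points, k)  # start from k input points, drawn without replacement
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point takes the index of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:       # no assignment changed, so stop
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical events with two numeric fields, forming two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2, seed=42)
```

The labels list corresponds to the values that would appear in the cluster field, one per event. Which group receives label 0 and which receives label 1 depends on the starting centroids, which is why fixing the seed matters for repeatable output.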

Using the K-means algorithm has the following advantages:

  • Computationally faster than most other clustering algorithms.
  • Simple algorithm to explain and understand.
  • Normally produces tighter clusters than hierarchical clustering.

Using the K-means algorithm has the following disadvantages:

  • Difficult to determine the optimal or true value of k. See X-means.
  • Sensitive to scaling. See StandardScaler.
  • Results can differ slightly between runs unless you specify the random_state parameter.
  • Does not work well with clusters of different sizes and densities.
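
The sensitivity to scaling comes from the distance calculation: a field with a large numeric range dominates fields with small ranges. The sketch below shows the kind of standardization that addresses this, rescaling each field to zero mean and unit variance using the population standard deviation (as scikit-learn's StandardScaler does by default). The standardize function and the sample rows are hypothetical and written for this example only.

```python
import statistics

def standardize(rows):
    """Rescale each column to zero mean and unit variance (illustrative sketch)."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]  # population standard deviation
    return [
        # A constant column (stdev of 0) is mapped to 0.0 to avoid dividing by zero.
        tuple((value - m) / s if s else 0.0
              for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]

# Hypothetical events: the first field ranges over 1-4, the second over 1000-4000.
# Unscaled, distances between rows are driven almost entirely by the second field.
rows = [(1.0, 1000.0), (2.0, 3000.0), (3.0, 2000.0), (4.0, 4000.0)]
scaled = standardize(rows)
```

After standardization both fields contribute on a comparable scale, so neither one decides the cluster assignment simply because of its units.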

For the default value of k, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

Parameters

The k parameter specifies the number of clusters to divide the data into. The random_state parameter fixes the random seed so that repeated runs produce the same clustering. By default, the cluster label field name is cluster. To use a different field name, specify it with the as keyword.

Syntax

fit KMeans <fields> [into <model name>]  [k=<int>]  [random_state=<int>]

You can save K-means models using the into keyword when using the fit command.

You can apply the model to new data using the apply command.

... | apply cluster_model
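
Conceptually, applying a saved K-means model means assigning each new event to the nearest centroid learned during fitting. The sketch below illustrates that idea under the assumption that apply behaves like a nearest-centroid prediction. The centroid values, the event values, and the assign_cluster helper are hypothetical and are not taken from the toolkit.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_cluster(point, centroids):
    """Return the index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))

# Hypothetical centroids standing in for a previously saved model.
centroids = [(0.0, 0.0), (10.0, 10.0)]

# Hypothetical new events that were not part of the original fit.
new_events = [(1.0, 2.0), (9.0, 8.5), (4.0, 4.0)]
clusters = [assign_cluster(p, centroids) for p in new_events]
```

The centroids are not recomputed in this step, so new events never change the saved model. They only receive a label from it.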

You can inspect the model using the summary command.

... | summary cluster_model

Example

The following example uses K-means to cluster all fields of a test set into three groups, then counts the events in each cluster.

... | fit KMeans * k=3 | stats count by cluster

Local availability

  • Local class: KMeans
  • Source file: Splunk_ML_Toolkit/bin/algos/KMeans.py
  • algos.conf stanza: [KMeans]
  • Class bases: ClustererMixin, BaseAlgo

Source

Adapted from the Splunk AI Toolkit 5.6.4 documentation at /en/splunk-cloud-platform/apply-machine-learning/use-ai-toolkit/5.6.4/algorithms-and-scoring-metrics-in-the-ai-toolkit/algorithms-in-the-ai-toolkit (section: clusterer).