Algorithms

K-fold cross-validation

In the `kfold_cv` parameter, the training set is randomly partitioned into k equal-sized subsamples. Then, each sub-sample takes a turn at becoming the validation (test) set, predicted by the other k-1 training sets. Each sample is used…

In the kfold_cv parameter, the training set is randomly partitioned into k equal-sized subsamples. Then, each sub-sample takes a turn at becoming the validation (test) set, predicted by the other k-1 training sets. Each sample is used exactly once in the validation set, and the variance of the resulting estimate is reduced as k is increased. The disadvantage of the kfold_cv parameter is that k different models have to be trained, leading to long execution times for large datasets and complex models.

The scores obtained from K-fold cross-validation are generally a less biased and less optimistic estimate of the model performance than a standard training and testing split.

The image is a representative diagram of how the K-fold parameter works. There are 5 rows representing iterations or folds. Each fold contains equal subsamples that each take a turn as testing and training data.

You can obtain k performance metrics, one for each training and testing split. These k performance metrics can then be averaged to obtain a single estimate of how well the model generalizes on unseen data.

Syntax

The kfold_cv parameter is applicable to to all classification and regression algorithms, and you can append the command to the end of an SPL search.

Here kfold_cv=<int> specifies that k=<int> folds is used. When you specify a classification algorithm, stratified k-fold is used instead of k-fold. In stratified k-fold, each fold contains approximately the same percentage of samples for each class.

..| fit <classification | regression algo> <targetVariable> from <featureVariables> [options] kfold_cv=<int>

CAUTION: The kfold_cv parameter cannot be used when saving a model.

Output

The kfold_cv parameter returns performance metrics on each fold using the same model specified in the SPL - including algorithm and hyper parameters. Its only function is to give you insight into how well you model generalizes. It does not perform any model selection or hyper parameter tuning.

Examples

The first example shows the kfold_cv parameter used in classification. Where the output is a set of metrics for each fold including accuracy, f1_weighted, precision_weighted, and recall_weighted.

This is a screen capture of the Statistics tab of the toolkit. There are three rows of results displayed.

This second example shows the kfold_cv parameter used in classification. Where the output is a set of metrics for each the neg_mean_squared_error and r^2 folds.

This is a screen capture of the Statistics tab of the toolkit. There are three rows of results displayed.

Local availability Permalink to this section

  • This algorithm is documented upstream but is not present in this repository's Splunk_ML_Toolkit/bin/algos/. Use it through the AI Toolkit's standard fit/apply SPL surface.

Source Permalink to this section

Adapted from the Splunk AI Toolkit 5.6.4 documentation at /en/splunk-cloud-platform/apply-machine-learning/use-ai-toolkit/5.6.4/algorithms-and-scoring-metrics-in-the-ai-toolkit/algorithms-in-the-ai-toolkit (section: preprocessor).

Press Cmd/Ctrl+K to focus search. Esc to close.

Type to search the portal.