Algorithms

TFIDF

The TFIDF algorithm uses the scikit-learn TfidfVectorizer to convert raw text data into a matrix of TF-IDF features, so that other machine learning estimators can be used on the data. For descriptions of the max_df, min_df, ngram_range, analyzer, norm, and token_pattern parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html.
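To make the conversion concrete, the following is a minimal pure-Python sketch of how a TF-IDF matrix can be computed. It assumes scikit-learn's default weighting (raw term counts, smoothed IDF, L2-normalized rows) and a simple whitespace tokenizer. The function name tfidf_matrix and the sample log lines are hypothetical, and this is not the code that the toolkit itself runs.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of vocab[j] in docs[i]."""
    # Simplified tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        # L2-normalize each row so that documents of different length are comparable.
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


# Hypothetical sample events.
docs = [
    "login failed for admin",
    "login succeeded for guest",
    "disk quota exceeded",
]
vocab, rows = tfidf_matrix(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

Each row of the result corresponds to one event and each column to one term. Terms that occur in many events, such as "login" here, receive a lower weight than terms that occur in only one.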

CAUTION: TFIDF uses memory to build a dictionary of all terms, including words and n-grams, and expands the Splunk search events with additional fields per event. If memory limits are a concern, consider using the HashingVectorizer algorithm instead.

Parameters

To configure the algorithm to ignore common English words (for example, "the", "it", "at", and "that"), set stop_words to english. For other languages (for example, machine language), you can ignore common words by setting max_df to a value greater than or equal to 0.7 and less than 1.0. With a value in that range, terms that appear in more than that fraction of the documents are dropped.
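As an illustration of how a fractional max_df removes common terms without a stop-word list, here is a minimal pure-Python sketch. It assumes scikit-learn's convention that a float threshold is a proportion of documents, with terms above max_df or below min_df discarded. The helper name filter_by_doc_frequency and the assembly-like sample lines are hypothetical.

```python
from collections import Counter


def filter_by_doc_frequency(docs, max_df=1.0, min_df=0.0):
    """Return the sorted terms whose document frequency lies within the thresholds.

    max_df and min_df are proportions of the total number of documents.
    """
    # Use a set per document so each term is counted once per document.
    tokenized = [set(doc.lower().split()) for doc in docs]
    n_docs = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in toks)

    max_count = max_df * n_docs
    min_count = min_df * n_docs
    return sorted(t for t, c in df.items() if min_count <= c <= max_count)


# Hypothetical events in a non-English "language".
docs = [
    "mov eax ebx",
    "mov ecx edx",
    "mov eax ecx",
    "add eax ebx",
]

# "mov" and "eax" each appear in 3 of 4 events (0.75 > 0.7), so they are dropped.
print(filter_by_doc_frequency(docs, max_df=0.7))
```

In this sample, the very frequent tokens play the role that "the" or "at" play in English text, which is why a document-frequency cutoff can stand in for a stop-word list.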

Syntax

fit TFIDF <field_to_convert> [into <model name>] [max_df=<int|float>] [min_df=<int|float>] [ngram_range=<int>-<int>]
[analyzer=<str>] [norm=<str>] [token_pattern=<str>] [stop_words=english]

You can save TFIDF models using the into keyword and apply them to new data later using the apply command.

... | apply user_feedback_model

Syntax constraints

You cannot inspect the model learned by TFIDF with the summary command.

Example

The following example uses TFIDF to convert the text dataset to a matrix of TF-IDF features and then applies KMeans clustering (where k=3) on the matrix.

| inputlookup authorization.csv | fit TFIDF Logs ngram_range=1-2 max_df=0.6 min_df=0.2 stop_words=english | fit KMeans Logs_tfidf* k=3 | fields cluster Logs | sample 6 by cluster | sort by cluster
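The ngram_range=1-2 setting in this example asks for both single words and two-word sequences as terms. The following pure-Python sketch shows what that extraction looks like, assuming word-level analysis and a simple whitespace tokenizer (the scikit-learn default token pattern differs, for instance by dropping single-character tokens). The function name word_ngrams and the sample text are hypothetical.

```python
def word_ngrams(text, min_n=1, max_n=2):
    """Return all word n-grams of length min_n through max_n, shortest first."""
    tokens = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


print(word_ngrams("failed password for root"))
```

Because every n-gram becomes its own term in the dictionary, widening the range increases the number of generated fields and the memory used.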

Source

Adapted from the Splunk AI Toolkit 5.6.4 documentation at /en/splunk-cloud-platform/apply-machine-learning/use-ai-toolkit/5.6.4/algorithms-and-scoring-metrics-in-the-ai-toolkit/algorithms-in-the-ai-toolkit (section: preprocessor).
