Algorithms
TFIDF
The TFIDF algorithm uses the scikit-learn TfidfVectorizer to convert raw text data into a matrix, which makes it possible to use other machine learning estimators on the data. For descriptions of the max_df, min_df, ngram_range, analyzer, norm, and token_pattern parameters, see the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html.
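To make the text-to-matrix step concrete, the following is a minimal standard-library sketch of TF-IDF weighting. It is not the toolkit's or scikit-learn's implementation. It assumes the smoothed IDF formula and L2 row normalization that scikit-learn documents as defaults, and it simplifies tokenization to a lowercase whitespace split. The function name tfidf_matrix and the sample log lines are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the weight of vocab[j] in docs[i]."""
    # Simplified tokenizer (assumption): lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: how many documents contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed IDF (assumed to mirror scikit-learn's documented default).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        # L2-normalize each row so documents of different length are comparable.
        length = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / length for w in row])
    return vocab, rows


# Made-up sample data for illustration only.
docs = ["login failed for admin", "login succeeded for guest", "disk quota exceeded"]
vocab, rows = tfidf_matrix(docs)
```

Each row of the result corresponds to one input document and each column to one term, so terms that appear in fewer documents (such as "admin" here) receive a higher weight than terms shared across documents (such as "login").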
CAUTION: TFIDF keeps a dictionary of all terms (words and ngrams) in memory and expands the Splunk search events with additional fields per event. If you are concerned about memory limits, consider using the HashingVectorizer algorithm.
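The memory trade-off can be illustrated with a small standard-library sketch. It contrasts a term dictionary, which grows with every distinct term, against a hashing approach that maps terms into a fixed number of columns. This is only an illustration of the idea: zlib.crc32 stands in for whatever hash function a real hashing vectorizer uses, and the function names and the n_features value are hypothetical.

```python
import zlib


def vocabulary_features(docs):
    """Dictionary approach: one entry per distinct term, so size grows with the data."""
    vocab = {}
    for doc in docs:
        for tok in doc.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab


def hashed_row(doc, n_features=8):
    """Hashing approach: fixed-width row, no dictionary kept in memory."""
    row = [0] * n_features
    for tok in doc.lower().split():
        # crc32 is a stand-in hash (assumption); collisions are possible by design.
        row[zlib.crc32(tok.encode("utf-8")) % n_features] += 1
    return row


# Made-up sample data for illustration only.
docs = ["login failed for admin", "login failed for guest"]
vocab = vocabulary_features(docs)
row = hashed_row(docs[0])
```

The dictionary holds one entry per distinct term seen, while the hashed row always has n_features columns regardless of how many distinct terms appear. The cost is that different terms can collide in the same column and the original terms cannot be recovered from the column index.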
Parameters
To configure the algorithm to ignore common English words (for example, "the", "it", "at", and "that"), set stop_words to english. For other languages (for example, machine language) you can ignore the common words by setting max_df to a value greater than or equal to 0.7 and less than 1.0.
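As a rough illustration of how a fractional max_df removes common terms without a stop-word list, the sketch below drops every term that appears in more than max_df of the documents. It assumes the scikit-learn convention that a float max_df is a proportion of documents. The function name drop_frequent_terms and the assembly-like sample lines are hypothetical.

```python
from collections import Counter


def drop_frequent_terms(docs, max_df=0.7):
    """Return the sorted terms whose document frequency is at most max_df * len(docs)."""
    # Count each term once per document.
    tokenized = [set(doc.lower().split()) for doc in docs]
    df = Counter(tok for toks in tokenized for tok in toks)
    limit = max_df * len(docs)
    return sorted(term for term, count in df.items() if count <= limit)


# Made-up sample data: "mov" and "eax" each appear in 3 of 4 documents (75%).
docs = ["mov eax ebx", "mov ecx edx", "mov eax ecx", "add eax ebx"]
kept = drop_frequent_terms(docs, max_df=0.7)
```

With max_df=0.7, the terms "mov" and "eax" are dropped because they occur in 75% of the documents, while the rarer terms are kept.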
Syntax
fit TFIDF <field_to_convert> [into <model name>] [max_df=<int|float>] [min_df=<int|float>] [ngram_range=<int>-<int>]
[analyzer=<str>] [norm=<str>] [token_pattern=<str>] [stop_words=english]
You can save TFIDF models using the into keyword and apply them to new data later using the apply command.
... | apply user_feedback_model
Syntax constraints
You cannot inspect the model learned by TFIDF with the summary command.
Example
The following example uses TFIDF to convert the text dataset to a matrix of TF-IDF features and then applies KMeans clustering (where k=3) on the matrix.
| inputlookup authorization.csv | fit TFIDF Logs ngram_range=1-2 max_df=0.6 min_df=0.2 stop_words=english | fit KMeans Logs_tfidf* k=3 | fields cluster Logs | sample 6 by cluster | sort by cluster
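The ngram_range=1-2 setting in the example asks for both single words and two-word sequences as terms. The standard-library sketch below shows what such a term list might look like for one log line. It assumes word-level ngrams and a lowercase whitespace tokenizer, and the function name word_ngrams is hypothetical.

```python
def word_ngrams(text, low=1, high=2):
    """Return all word ngrams of length low..high, shortest first."""
    toks = text.lower().split()
    return [
        " ".join(toks[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(toks) - n + 1)
    ]


# Made-up sample log line for illustration only.
terms = word_ngrams("user login failed")
# terms == ["user", "login", "failed", "user login", "login failed"]
```

Each of these terms would become a candidate column in the TF-IDF matrix, which is why widening ngram_range increases the size of the term dictionary.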
Local availability
- Local class: TFIDF
- Source file (in-repo path): Splunk_ML_Toolkit/bin/algos/TFIDF.py
- algos.conf stanza: [TFIDF]
- Class bases: BaseAlgo
Source
Adapted from the Splunk AI Toolkit 5.6.4 documentation at /en/splunk-cloud-platform/apply-machine-learning/use-ai-toolkit/5.6.4/algorithms-and-scoring-metrics-in-the-ai-toolkit/algorithms-in-the-ai-toolkit (section: preprocessor).