Algorithms
HashingVectorizer
The HashingVectorizer algorithm converts text documents to a matrix of token occurrences. It uses a feature hashing strategy to allow for hash collisions when measuring the occurrence of tokens. It is a stateless transformer, meaning that it does not require building a vocabulary of the seen tokens. This reduces the memory footprint and allows for larger feature spaces.
HashingVectorizer is comparable to the TFIDF algorithm, as the two share many of the same parameters. However, HashingVectorizer is the better option for building models on large text fields, provided you do not need to know term frequencies and only want outcomes.
For descriptions of the ngram_range, analyzer, norm, and token_pattern parameters, see the scikit-learn documentation at https://scikit-learn.org/0.19/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
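The statelessness described above comes from feature hashing: each token is mapped directly to a column index by a hash function, so no vocabulary is ever built or stored. The sketch below uses plain scikit-learn (which the toolkit wraps) rather than the Splunk wrapper itself; treating n_features as the counterpart of max_features is an assumption for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["failed login from host-a", "successful login from host-b"]

# n_features fixes the width of the hashed output space up front
# (assumed here to play the role of the toolkit's max_features).
vec = HashingVectorizer(n_features=1024, ngram_range=(1, 2),
                        stop_words="english", norm="l2")

# Stateless: transform works with no fit step and no stored vocabulary.
X = vec.transform(docs)
print(X.shape)  # (2, 1024)

# Hashing is deterministic, so the same document always maps to the
# same sparse row, even across separate transformer instances.
X_again = HashingVectorizer(n_features=1024, ngram_range=(1, 2),
                            stop_words="english", norm="l2").transform(docs)
print((X - X_again).nnz)  # 0
```

Because the output width is fixed in advance, distinct tokens can collide into the same column; the documentation above notes that the algorithm accepts this trade-off in exchange for the smaller memory footprint.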
Parameters
- The reduce parameter is either True or False and determines whether to reduce the output to a smaller dimension using TruncatedSVD. The default is True.
- The k=<int> parameter sets the number of dimensions to reduce to when the reduce parameter is set to True. The default is 100.
- The default for the max_features parameter is 10,000.
- The n_iters parameter specifies the number of iterations to use when performing dimensionality reduction. This parameter is only used when the reduce parameter is set to True. The default is 5.
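The reduce, k, and n_iters parameters together describe a TruncatedSVD step applied after hashing. The sketch below shows how those parameters plausibly map onto scikit-learn (the mapping of k to n_components and n_iters to n_iter is an assumption, not the toolkit's actual internals):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer

docs = [
    "user alice login success",
    "user bob login failure",
    "user alice logout",
    "service restart on host",
]

# Hash the documents into a wide, sparse matrix first.
X = HashingVectorizer(n_features=4096).transform(docs)

# Assumed parameter mapping: reduce=True -> run TruncatedSVD,
# k -> n_components, n_iters -> n_iter.
svd = TruncatedSVD(n_components=3, n_iter=5, random_state=0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (4, 3)
```

TruncatedSVD operates directly on sparse input, so the hashed matrix never needs to be densified; only the final k-dimensional output is dense.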
Syntax
fit HashingVectorizer <field_to_convert> [max_features=<int>] [n_iters=<int>]
[reduce=<bool>] [k=<int>] [ngram_range=<int>-<int>] [analyzer=<str>]
[norm=<str>] [token_pattern=<str>] [stop_words=english]
Syntax constraints
HashingVectorizer does not support saving models, incremental fit, or K-fold cross validation.
Example
The following example uses HashingVectorizer to hash the text in the Logs field and then applies KMeans clustering (with k=3) to the hashed fields.
| inputlookup authorization.csv | fit HashingVectorizer Logs ngram_range=1-2 k=50 stop_words=english | fit KMeans Logs_hashed* k=3 | fields cluster* Logs | sample 5 by cluster | sort by cluster
Local availability
- Local class: HashingVectorizer
- Source file: Splunk_ML_Toolkit/bin/algos/HashingVectorizer.py
- algos.conf stanza: [HashingVectorizer]
- Class bases: BaseAlgo
Source
Adapted from the Splunk AI Toolkit 5.6.4 documentation at /en/splunk-cloud-platform/apply-machine-learning/use-ai-toolkit/5.6.4/algorithms-and-scoring-metrics-in-the-ai-toolkit/algorithms-in-the-ai-toolkit (section: preprocessor).