Algorithms
HashingVectorizer
The HashingVectorizer algorithm converts text documents to a matrix of token occurrences. It uses a feature hashing strategy to allow for hash collisions when measuring the occurrence of tokens. It is a stateless transformer, meaning that it does not require building a vocabulary of the seen tokens. This reduces the memory footprint and allows for larger feature spaces.
HashingVectorizer is comparable to the TFIDF algorithm, as the two share many of the same parameters. However, HashingVectorizer is the better option for building models on large text fields, provided you do not need to know term frequencies and only want outcomes.
For descriptions of the ngram_range, analyzer, norm, and token_pattern parameters, see the scikit-learn documentation at https://scikit-learn.org/0.19/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
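The statelessness described above comes from feature hashing: each token is mapped directly to a column index by a hash function, so no vocabulary is ever built or stored. The sketch below uses plain scikit-learn (which the toolkit wraps) rather than the Splunk wrapper itself; treating n_features as the counterpart of max_features is an assumption for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["failed login from host-a", "successful login from host-b"]

# n_features fixes the width of the hashed output space up front
# (assumed here to play the role of the toolkit's max_features).
vec = HashingVectorizer(n_features=1024, ngram_range=(1, 2),
                        stop_words="english", norm="l2")

# Stateless: transform works with no fit step and no stored vocabulary.
X = vec.transform(docs)
print(X.shape)  # (2, 1024)

# Hashing is deterministic, so the same document always maps to the
# same sparse row, even across separate transformer instances.
X_again = HashingVectorizer(n_features=1024, ngram_range=(1, 2),
                            stop_words="english", norm="l2").transform(docs)
print((X - X_again).nnz)  # 0
```

Because the output width is fixed in advance, distinct tokens can collide into the same column; the documentation above notes that the algorithm accepts this trade-off in exchange for the smaller memory footprint.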
Parameters
- The reduce parameter is either True or False and determines whether to reduce the output to a smaller dimension using TruncatedSVD. The default is True.
- The k=<int> parameter sets the number of dimensions to reduce to when the reduce parameter is set to True. The default is 100.
- The default for the max_features parameter is 10,000.
- The n_iters parameter specifies the number of iterations to use when performing dimensionality reduction. This parameter is only used when the reduce parameter is set to True. The default is 5.
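The reduce, k, and n_iters parameters together describe a TruncatedSVD step applied after hashing. The sketch below shows how those parameters plausibly map onto scikit-learn (the mapping of k to n_components and n_iters to n_iter is an assumption, not the toolkit's actual internals):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer

docs = [
    "user alice login success",
    "user bob login failure",
    "user alice logout",
    "service restart on host",
]

# Hash the documents into a wide, sparse matrix first.
X = HashingVectorizer(n_features=4096).transform(docs)

# Assumed parameter mapping: reduce=True -> run TruncatedSVD,
# k -> n_components, n_iters -> n_iter.
svd = TruncatedSVD(n_components=3, n_iter=5, random_state=0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (4, 3)
```

TruncatedSVD operates directly on sparse input, so the hashed matrix never needs to be densified; only the final k-dimensional output is dense.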
Syntax
fit HashingVectorizer <field_to_convert> [max_features=<int>] [n_iters=<int>]
[reduce=<bool>] [k=<int>] [ngram_range=<int>-<int>] [analyzer=<str>]
[norm=<str>] [token_pattern=<str>] [stop_words=english]
Syntax constraints
HashingVectorizer does not support saving models, incremental fit, or K-fold cross validation.
Example
The following example uses HashingVectorizer to hash the text in the Logs field and then applies KMeans clustering (with k=3) to the hashed fields.
| inputlookup authorization.csv | fit HashingVectorizer Logs ngram_range=1-2 k=50 stop_words=english | fit KMeans Logs_hashed* k=3 | fields cluster* Logs | sample 5 by cluster | sort by cluster
Local availability
- Local class: HashingVectorizer
- Source file: Splunk_ML_Toolkit/bin/algos/HashingVectorizer.py
- algos.conf stanza: [HashingVectorizer]
- Class bases: BaseAlgo
Source
Adapted from the Splunk AI Toolkit 5.6.4 documentation at /en/splunk-cloud-platform/apply-machine-learning/use-ai-toolkit/5.6.4/algorithms-and-scoring-metrics-in-the-ai-toolkit/algorithms-in-the-ai-toolkit (section: preprocessor).