Algorithms
NPR
The Normalized Perlich Ratio (NPR) algorithm converts high-cardinality categorical field values into numeric field entries while intelligently handling space optimization. NPR performs feature extraction at low computational cost on variables with high cardinalities, such as ZIP codes or IP addresses.
Note: Unlike other algorithms that use the fit and apply commands, NPR does not perform one-hot encoding.
Parameters
- Use the summary command to inspect the variance information of the saved model.
- After running NPR, the transformed dataset has calculated ratios for all feature variables (feature_field). Based on the training data, NPR calculates a variable of X_unobserved, which can be used as a replacement value in the following two scenarios:
  - In conjunction with the fit command, NPR initially replaces missing values in the dataset for feature_field with the keyword unobserved, which is then replaced by the calculated NPR value of X_unobserved.
  - In conjunction with the apply command, X_unobserved replaces any new value for target_field that was not visible during model training but is encountered in the test dataset.
- The number of transformed columns created after running NPR is equal to the number of distinct values for feature_field within the search string.
- From the saved model, use the variance output field to examine the contribution of a particular feature towards the accuracy of the prediction. Higher variance indicates that a categorical value is highly important, whereas low variance indicates that the value is of lower importance towards the target prediction. Variance may assist in the process of discarding irrelevant feature variables.
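The two replacement scenarios above can be sketched in Python. This is only an illustration of the substitution step under stated assumptions: the helper names prepare_for_fit and prepare_for_apply are hypothetical, missing values are assumed to appear as None or an empty string, and the sketch does not compute the NPR ratios or the X_unobserved value itself. It is not the toolkit's implementation.

```python
# Keyword the documentation says NPR substitutes for missing values.
UNOBSERVED = "unobserved"


def _is_missing(value):
    # Assumption: a missing value shows up as None or an empty string.
    return value is None or value == ""


def prepare_for_fit(values):
    """Fit-time step: replace missing values with the 'unobserved' keyword."""
    return [UNOBSERVED if _is_missing(v) else v for v in values]


def prepare_for_apply(values, seen_in_training):
    """Apply-time step: values never seen during training (or missing)
    fall back to the 'unobserved' keyword."""
    return [
        UNOBSERVED if _is_missing(v) or v not in seen_in_training else v
        for v in values
    ]


if __name__ == "__main__":
    train = prepare_for_fit(["ModelA", None, "ModelB", ""])
    print(train)
    print(prepare_for_apply(["ModelA", "ModelC", None], set(train)))
```

In this sketch every value mapped to the keyword would later receive the single learned X_unobserved value, which is why one fallback entry is enough to cover both scenarios.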
Syntax
fit NPR <target_field> from <feature_field> [into <model name>]
You can couple NPR with existing MLTK algorithms to feed the transformed results to the model as a means to enhance predictions.
| fit NPR <target_field> from <feature_field> | fit SGDClassifier <target_field> from NPR
You can save NPR models using the into keyword and apply them to new data later using the apply command.
| inputlookup disk_failures.csv | tail 1000 | apply npr_disk
You can inspect the model learned by NPR with the summary command.
| summary npr_disk
Syntax constraints
- The wildcard (*) character is not supported.
- The maximum matrix size, calculated as |X| * |Y| where X is the set of distinct feature_field values and Y is the set of distinct target_field values, is 10000000. For example, if there are 1000 distinct categorical feature values and 100 distinct categorical target values, then the matrix size is 100000.
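The matrix size constraint can be checked ahead of time with a small Python sketch. The function names here are hypothetical, and treating the limit as inclusive is an assumption, since the text only states the maximum value.

```python
# Maximum matrix size stated in the syntax constraints.
MAX_MATRIX_SIZE = 10_000_000


def npr_matrix_size(feature_values, target_values):
    """Return |X| * |Y|: distinct feature values times distinct target values."""
    return len(set(feature_values)) * len(set(target_values))


def within_npr_limit(feature_values, target_values, limit=MAX_MATRIX_SIZE):
    """Assumption: a matrix size equal to the limit is still allowed."""
    return npr_matrix_size(feature_values, target_values) <= limit


if __name__ == "__main__":
    # The example from the text: 1000 distinct feature values, 100 target values.
    features = [f"zip_{i}" for i in range(1000)]
    targets = [f"class_{i}" for i in range(100)]
    print(npr_matrix_size(features, targets))
    print(within_npr_limit(features, targets))
```

Only distinct values count toward the size, so repeated rows in the dataset do not bring a field any closer to the limit.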
Examples
The following example uses NPR on a test set.
| inputlookup disk_failures.csv | head 5000 | fit NPR DiskFailure from Model into npr_disk
The following example couples NPR with another MLTK algorithm on a test set.
| inputlookup disk_failures.csv | head 5000 | fit NPR DiskFailure from Model | fit SGDClassifier DiskFailure from NPR_* random_state=42 n_iter=2 | score accuracy_score DiskFailure against predicted*
The following example uses NPR over multiple fields with additional uses of the fit command.
| inputlookup disk_failures.csv | head 5000
| fit NPR DiskFailure from Model into npr_disk_1
| fit NPR DiskFailure from SerialNumber into npr_disk_2
Local availability
- Local class: NPR
- Source file: Splunk_ML_Toolkit/bin/algos/NPR.py
- algos.conf stanza: [NPR]
- Class bases: BaseAlgo
Class docstring
Instance of NPR : Normalized Perich Ratio. It maps high cardinality categorical fields into numeric fields
in predictive models
Source
Adapted from the Splunk AI Toolkit 5.6.4 documentation at /en/splunk-cloud-platform/apply-machine-learning/use-ai-toolkit/5.6.4/algorithms-and-scoring-metrics-in-the-ai-toolkit/algorithms-in-the-ai-toolkit (section: preprocessor).