Algorithms

NPR

The Normalized Perlich Ratio (NPR) algorithm converts high cardinality categorical field values into numeric field entries while intelligently handling space optimization. NPR offers low computational costs to perform feature extraction…

The Normalized Perlich Ratio (NPR) algorithm converts high cardinality categorical field values into numeric field entries while intelligently handling space optimization. NPR offers low computational costs to perform feature extraction on variables with high cardinalities such as ZIP codes or IP addresses.

Note: NPR does not perform one-hot encoding unlike other algorithms that leverage the fit and apply commands.

Parameters

  • Use the summary command to inspect the variance information of the saved model.

  • After running NPR the transformed dataset has calculated ratios for all feature variables (feature_field). Based on the training data NPR calculates a variable of X_unobserved which can be used as a replacement value in the following two scenarios:

    • In conjunction with the fit command NPR initially replaces missing values in the dataset for feature_field with the keyword unobserved which is then replaced by the calculated NPR value of X_unobserved.
    • In conjunction with the apply command, any new value for target_field that was not visible during model training but is encountered in the test dataset.
  • The number of transformed columns created after running NPR is equal to the number of distinct values for feature_field within the search string.

  • From the saved model, use the variance output field to examine the contribution of a particular feature towards the accuracy of the prediction. Higher variance indicates highly important categorical values whereas low variance indicates the value being of lower importance towards the target prediction. Variance may assist in the process of discarding irrelevant feature variables.

Syntax

fit NPR <target_field> from <feature_field> [into <model name>]

You can couple NPR with existing MLTK algorithms to feed the transformed results to the model as a means to enhance predictions.

| fit NPR <target_field> from <feature_field> | fit SGDClassifier <target_field> from NPR

You can save NPR models using the into keyword and apply new data later using the apply command.

| input lookup disk_failures.csv | tail 1000 | apply npr_disk

You can inspect the model learned by NPR with the summary command.

| summary npr_disk

Syntax constraints

  • The wildcard (*) character is not supported.
  • The maximum matrix size calculated from |X| * |Y| where X is the feature_field and Y is the target_field is 10000000. For example, if number of distinct categorical feature values are 1000 and distinct categorical target values are 100 then the matrix size is 100000.

Examples

The following example uses NPR on a test set.

| inputlookup disk_failures.csv| head 5000 | fit NPR DiskFailure from Model into npr_disk

The following example couples NPR with another MLTK algorithm on a test set.

| inputlookup disk_failures.csv| head 5000 | fit NPR DiskFailure from Model | fit SGDClassifier DiskFailure from NPR_* random_state=42 n_iter=2 | score accuracy_score DiskFailure against predicted*

The following example uses NPR over multiple fields with additional uses of the fit command.

| inputlookup disk_failures.csv | head 5000
| fit NPR DiskFailure from Model into npr_disk_1
| fit NPR DiskFailure from SerialNumber into npr_disk_2

Local availability Permalink to this section

Class docstring Permalink to this section

Instance of NPR : Normalized Perich Ratio. It maps high cardinality categorical fields into numeric fields
    in predictive models

Source Permalink to this section

Adapted from the Splunk AI Toolkit 5.6.4 documentation at /en/splunk-cloud-platform/apply-machine-learning/use-ai-toolkit/5.6.4/algorithms-and-scoring-metrics-in-the-ai-toolkit/algorithms-in-the-ai-toolkit (section: preprocessor).

Press Cmd/Ctrl+K to focus search. Esc to close.

Type to search the portal.