Discrimination Threshold

Caution

This visualizer only works for binary classification.

A visualization of precision, recall, f1 score, and queue rate with respect to the discrimination threshold of a binary classifier. The discrimination threshold is the probability or score at which the positive class is chosen over the negative class. Generally, this is set to 50%, but the threshold can be adjusted to increase or decrease sensitivity to false positives or to account for other application factors.

from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import DiscriminationThreshold

# Load a binary classification dataset as a pandas DataFrame
data = load_data("spam")
target = "is_spam"
features = [col for col in data.columns if col != target]

# Extract the instances and target from the dataset
X = data[features]
y = data[target]

# Instantiate the classification model and visualizer
logistic = LogisticRegression()
visualizer = DiscriminationThreshold(logistic)

visualizer.fit(X, y)  # Fit the training data to the visualizer
visualizer.poof()     # Draw/show/poof the data
[Figure: discrimination threshold plot for the spam dataset]

One common use of binary classification algorithms is to use the score or probability they produce to determine cases that require special treatment. For example, a fraud prevention application might use a classification algorithm to determine if a transaction is likely fraudulent and needs to be investigated in detail. In the figure above, we present an example where a binary classifier determines if an email is “spam” (the positive case) or “not spam” (the negative case). Emails that are detected as spam are moved to a hidden folder and eventually deleted.

Many classifiers use either a decision_function to score the positive class or a predict_proba function to compute the probability of the positive class. If the score or probability is greater than some discrimination threshold then the positive class is selected, otherwise, the negative class is.
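
The thresholding rule itself is only a couple of lines of code. The sketch below is illustrative rather than part of the Yellowbrick API: it assumes the fitted logistic classifier from the example above and a hypothetical held-out feature matrix X_test.

threshold = 0.43  # any value other than the default 0.5

# For a binary classifier, column 1 of predict_proba holds the probability
# of the positive class
proba = logistic.predict_proba(X_test)[:, 1]

# Choose the positive class whenever the probability exceeds the threshold
y_pred = (proba >= threshold).astype(int)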

Generally speaking, the threshold is balanced between cases and set to 0.5 or 50% probability. However, this threshold may not be the optimal threshold: often there is an inverse relationship between precision and recall with respect to a discrimination threshold. By adjusting the threshold of the classifier, it is possible to tune the F1 score (the harmonic mean of precision and recall) to the best possible fit or to adjust the classifier to behave optimally for the specific application. Classifiers are tuned by considering the following metrics:

  • Precision: An increase in precision is a reduction in the number of false positives; this metric should be optimized when the cost of special treatment is high (e.g. wasted time in fraud prevention or missing an important email).
  • Recall: An increase in recall decreases the likelihood that the positive class is missed; this metric should be optimized when it is vital to catch the case even at the cost of more false positives.
  • F1 Score: The F1 score is the harmonic mean between precision and recall. The fbeta parameter determines the relative weight of precision and recall when computing this metric; by default it is set to 1, i.e. the F1 score. Optimizing this metric produces the best balance between precision and recall.
  • Queue Rate: The “queue” is the spam folder or the inbox of the fraud investigation desk. This metric describes the percentage of instances that must be reviewed. If review has a high cost (e.g. fraud prevention) then this must be minimized with respect to business requirements; if it doesn’t (e.g. spam filter), this could be optimized to ensure the inbox stays clean.

In the figure above we see the visualizer tuned to look for the optimal F1 score, which is annotated as a threshold of 0.43. The model is run multiple times over multiple train/test splits in order to account for the variability of the model with respect to the metrics (shown as the fill area around the median curve).
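
The curves in the figure can be approximated by hand for a single train/test split, which helps clarify what each metric measures. The following sketch is not the visualizer's implementation; X_train, X_test, y_train, and y_test are assumed to come from an ordinary train_test_split of the spam data, and the threshold grid is arbitrary.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

model = LogisticRegression().fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

thresholds = np.linspace(0.05, 0.95, 91)
curves = {"precision": [], "recall": [], "fscore": [], "queue_rate": []}
for t in thresholds:
    y_pred = (proba >= t).astype(int)
    # zero_division=0 guards the degenerate case where nothing is flagged
    curves["precision"].append(precision_score(y_test, y_pred, zero_division=0))
    curves["recall"].append(recall_score(y_test, y_pred, zero_division=0))
    curves["fscore"].append(f1_score(y_test, y_pred, zero_division=0))
    curves["queue_rate"].append(y_pred.mean())  # fraction flagged for review

# Threshold that maximizes F1 on this single split; the visualizer repeats
# the exercise over many shuffled splits and annotates the median result
best_threshold = thresholds[np.argmax(curves["fscore"])]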

API Reference

DiscriminationThreshold visualizer for probabilistic classifiers.

class yellowbrick.classifier.threshold.DiscriminationThreshold(model, ax=None, n_trials=50, cv=0.1, fbeta=1.0, argmax='fscore', exclude=None, quantiles=array([0.1, 0.5, 0.9]), random_state=None, **kwargs)[source]

Bases: yellowbrick.base.ModelVisualizer

Visualizes how precision, recall, f1 score, and queue rate change as the discrimination threshold increases. For probabilistic, binary classifiers, the discrimination threshold is the probability at which you choose the positive class over the negative. Generally this is set to 50%, but adjusting the discrimination threshold will adjust sensitivity to false positives which is described by the inverse relationship of precision and recall with respect to the threshold.

The visualizer also accounts for variability in the model by running multiple trials with different train and test splits of the data. The variability is visualized using a band such that the curve is drawn as the median score of each trial and the band is from the 10th to 90th percentile.

The visualizer is intended to help users determine an appropriate threshold for decision making (e.g. at what threshold do we have a human review the data), given a tolerance for precision and recall or limiting the number of records to check (the queue rate).

Caution

This method only works for binary, probabilistic classifiers.

Parameters:
model : Classification Estimator

A binary classification estimator that implements predict_proba or decision_function methods. Will raise TypeError if the model cannot be used with the visualizer.

ax : matplotlib Axes, default: None

The axis to plot the figure on. If None is passed in the current axes will be used (or generated if required).

n_trials : integer, default: 50

Number of times to shuffle and split the dataset to account for noise in the threshold metric curves. Note that if cv provides more than one split, the number of trials will be n_trials * cv.get_n_splits().

cv : float or cross-validation generator, default: 0.1

Determines the splitting strategy for each trial. Possible inputs are:

  • float, to specify the percent of the test split
  • object to be used as cross-validation generator

This attribute is meant to give flexibility with stratified splitting but if a splitter is provided, it should only return one split and have shuffle set to True.
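
For example, a splitter can be passed directly as cv; the sketch below is illustrative and follows the constraint above by producing a single shuffled, stratified split per trial.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit
from yellowbrick.classifier import DiscriminationThreshold

# One shuffled, stratified split per trial, as required above
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
visualizer = DiscriminationThreshold(LogisticRegression(), n_trials=25, cv=splitter)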

fbeta : float, default: 1.0

The strength of recall versus precision in the F-score.

argmax : str, default: ‘fscore’

Annotate the threshold maximized by the supplied metric (see exclude for the possible metrics to use). If None, will not annotate the graph.

exclude : str or list, optional

Specify metrics to omit from the graph, can include:

  • "precision"
  • "recall"
  • "queue_rate"
  • "fscore"

Excluded metrics will not be displayed in the graph, nor will they be available in thresholds_; however, they will still be computed on fit.
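
As an illustration, the following sketch hides the queue rate curve and annotates the recall-maximizing threshold instead of the default F1 annotation; the parameter values are arbitrary.

from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import DiscriminationThreshold

visualizer = DiscriminationThreshold(
    LogisticRegression(),
    exclude="queue_rate",   # omit the queue rate curve from the plot
    argmax="recall",        # annotate the threshold that maximizes recall
)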

quantiles : sequence, default: np.array([0.1, 0.5, 0.9])

Specify the quantiles to view model variability across a number of trials. Must be monotonic and have three elements such that the first element is the lower bound, the second is the drawn curve, and the third is the upper bound. By default the curve is drawn at the median, and the bounds from the 10th percentile to the 90th percentile.
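
For a narrower band, an interquartile range can be supplied instead of the default 10th/90th percentiles; the values below are only an example.

import numpy as np
from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import DiscriminationThreshold

# Median curve with a band from the 25th to the 75th percentile
visualizer = DiscriminationThreshold(
    LogisticRegression(), quantiles=np.array([0.25, 0.5, 0.75])
)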

random_state : int, optional

Used to seed the random state for shuffling the data while composing different train and test splits. If supplied, the random state is incremented in a deterministic fashion for each split.

Note that if a splitter is provided, its random state will also be updated with this random state, even if it was previously set.

kwargs : dict

Keyword arguments that are passed to the base visualizer class.

Notes

The term “discrimination threshold” is rare in the literature. Here, we use it to mean the probability at which the positive class is selected over the negative class in binary classification.

Classification models must implement either a decision_function or predict_proba method in order to be used with this class. A YellowbrickTypeError is raised otherwise.

See also

For a thorough explanation of discrimination thresholds, see: Visualizing Machine Learning Thresholds to Make Better Business Decisions by Insight Data.

Attributes:
thresholds_ : array

The uniform thresholds identified by each of the trial runs.

cv_scores_ : dict of arrays of len(thresholds_)

The values for all included metrics including the upper and lower bounds of the metrics defined by quantiles.
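
After fitting, both attributes can be inspected directly; the sketch below assumes the X and y built from the spam dataset in the example at the top of this page.

from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import DiscriminationThreshold

visualizer = DiscriminationThreshold(LogisticRegression())
visualizer.fit(X, y)

print(visualizer.thresholds_.shape)        # the uniform threshold grid
for metric, scores in visualizer.cv_scores_.items():
    print(metric, len(scores))             # one curve of len(thresholds_) per key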

draw()[source]

Draws the cv scores as a line chart on the current axes.

finalize(**kwargs)[source]

Finalize executes any subclass-specific axes finalization steps. The user calls poof and poof calls finalize.

Parameters:
kwargs: generic keyword arguments.

fit(X, y, **kwargs)[source]

Fit is the entry point for the visualizer. Given instances described by X and binary classes described in the target y, fit performs n trials by shuffling and splitting the dataset, then computing the precision, recall, f1, and queue rate scores for each trial. The scores are aggregated by the specified quantiles and then drawn.

Parameters:
X : ndarray or DataFrame of shape n x m

A matrix of n instances with m features

y : ndarray or Series of length n

An array or series of target or class values. The target y must be a binary classification target.

kwargs: dict

Keyword arguments passed to the scikit-learn API.

Returns:
self : instance

Returns the instance of the visualizer

Raises: YellowbrickValueError

If the target y is not a binary classification target.