Feature Dropping Curve

Visualizer: DroppingCurve
Quick Method: dropping_curve()
Models: Classification, Regression, Clustering
Workflow: Model Selection

A feature dropping curve (FDC) shows the relationship between the score and the number of features used. This visualizer randomly drops input features, showing how the estimator benefits from additional features of the same type. For example, how many air quality sensors are needed across a city to accurately predict city-wide pollution levels?

Feature dropping curves usefully complement Recursive Feature Elimination (RFECV). In the air quality sensor example, RFECV finds which sensors to keep in that specific city; feature dropping curves estimate how many sensors a similar-sized city might need to track pollution levels.

Feature dropping curves are common in the field of neural decoding, where they are called neuron dropping curves (example, panels C and H). Neural decoding research often quantifies how performance scales with neuron (or electrode) count. Because neurons do not correspond directly between participants, we use random neuron subsets to simulate what performance to expect when recording from other participants.

To show how this works in practice, consider an image classification example using handwritten digits.

from sklearn.svm import SVC
from sklearn.datasets import load_digits

from yellowbrick.model_selection import DroppingCurve

# Load dataset
X, y = load_digits(return_X_y=True)

# Initialize visualizer with estimator
visualizer = DroppingCurve(SVC())

# Fit the data to the visualizer
visualizer.fit(X, y)
# Finalize and render the figure
visualizer.show()

Figure: Dropping Curve on the digits dataset

This figure shows an input feature dropping curve. Since the features are informative, accuracy increases with larger feature subsets. The shaded area represents the variability of cross-validation: one standard deviation above and below the mean accuracy score drawn by the curve.

The visualization can be interpreted as the performance we could expect if some image pixels were corrupted. Alternatively, the dropping curve roughly estimates the accuracy we could expect if the image resolution were downsampled.
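The curve can be customized through the visualizer's parameters. As a quick sketch (the subset sizes and seed below are illustrative, not recommendations), integer feature_sizes request absolute feature counts, random_state makes the random subsets reproducible, and logx switches to a logarithmic x-axis:

from sklearn.svm import SVC
from sklearn.datasets import load_digits

from yellowbrick.model_selection import DroppingCurve

# Load dataset
X, y = load_digits(return_X_y=True)

# Integer feature_sizes are absolute counts; floats are fractions
# of the 64 pixel features in each digit image
visualizer = DroppingCurve(
    SVC(),
    feature_sizes=[8, 16, 32, 64],  # illustrative subset sizes
    random_state=42,                # reproducible random subsets
    logx=True,                      # logarithmic x-axis
)

visualizer.fit(X, y)
visualizer.show()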

Quick Method

The same functionality can be achieved with the associated quick method dropping_curve. This method builds the DroppingCurve object with the associated arguments, fits it, and then (optionally) shows the visualization immediately.

from sklearn.svm import SVC
from sklearn.datasets import load_digits

from yellowbrick.model_selection import dropping_curve

# Load dataset
X, y = load_digits(return_X_y=True)

dropping_curve(SVC(), X, y)

Figure: Dropping Curve Quick Method on the digits dataset
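Since dropping_curve returns the fitted visualizer, rendering can also be deferred with show=False in order to inspect the cross-validation scores programmatically; a minimal sketch using the attributes documented in the API reference below:

from sklearn.svm import SVC
from sklearn.datasets import load_digits

from yellowbrick.model_selection import dropping_curve

# Load dataset
X, y = load_digits(return_X_y=True)

# show=False defers rendering and returns the fitted visualizer
dc = dropping_curve(SVC(), X, y, show=False)

# Feature subset sizes and the mean cross-validation score for each
print(dc.feature_sizes_)
print(dc.valid_scores_mean_)

dc.show()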

API Reference

Implements a random-input-dropout curve visualization for model selection. In neural decoding research, this is also known as a neuron dropping curve (NDC).

class yellowbrick.model_selection.dropping_curve.DroppingCurve(estimator, ax=None, feature_sizes=array([0.1, 0.325, 0.55, 0.775, 1.0]), groups=None, logx=False, cv=None, scoring=None, n_jobs=None, pre_dispatch='all', random_state=None, **kwargs)[source]

Bases: ModelVisualizer

Selects random subsets of features and estimates the training and cross-validation performance. Subset sizes are swept to visualize a feature dropping curve.

The visualization plots the score relative to each subset and shows the number of (randomly selected) features needed to achieve a score. The curve is often shaped like log(1+x). For example, see: https://www.frontiersin.org/articles/10.3389/fnsys.2014.00102/full

Parameters
estimator: a scikit-learn estimator

An object that implements fit and predict, can be a classifier, regressor, or clusterer so long as there is also a valid associated scoring metric.

Note that the object is cloned for each validation.

feature_sizes: array-like, shape (n_values,)

default: np.linspace(0.1,1.0,5)

Relative or absolute numbers of input features that will be used to generate the dropping curve. If the dtype is float, it is regarded as a fraction of the maximum number of features, otherwise it is interpreted as absolute numbers of features.

groups: array-like, with shape (n_samples,)

Optional group labels for the samples used while splitting the dataset into train/test sets.

ax: matplotlib.Axes object, optional

The axes object to plot the figure on.

logx: boolean, optional

If True, plots the x-axis with a logarithmic scale.

cv: int, cross-validation generator or an iterable, optional

Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 3-fold cross-validation;

  • integer, to specify the number of folds;

  • an object to be used as a cross-validation generator;

  • an iterable yielding train/test splits.

See the scikit-learn cross-validation guide for more information on the possible strategies that can be used here.

scoring: string, callable or None, optional, default: None

A string or scorer callable object / function with signature scorer(estimator, X, y). See scikit-learn model evaluation documentation for names of possible metrics.

n_jobs: integer, optional

Number of jobs to run in parallel (default 1).

pre_dispatch: integer or string, optional

Number of pre-dispatched jobs for parallel execution (default is 'all'). This option can reduce the allocated memory. The string can be an expression like '2*n_jobs'.

random_state: int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used to generate feature subsets.

kwargs: dict

Keyword arguments that are passed to the base class and may influence the visualization as defined in other Visualizers.

Notes

This visualizer is based on sklearn.model_selection.validation_curve.

Examples

>>> from yellowbrick.model_selection import DroppingCurve
>>> from sklearn.naive_bayes import GaussianNB
>>> model = DroppingCurve(GaussianNB())
>>> model.fit(X, y)
>>> model.show()
Attributes
feature_sizes_: array, shape (n_unique_ticks,), dtype int

Numbers of features that have been used to generate the dropping curve. Note that the number of ticks might be less than n_ticks because duplicate entries will be removed.

train_scores_: array, shape (n_ticks, n_cv_folds)

Scores on training sets.

train_scores_mean_: array, shape (n_ticks,)

Mean training data scores for each training split.

train_scores_std_: array, shape (n_ticks,)

Standard deviation of training data scores for each training split.

valid_scores_: array, shape (n_ticks, n_cv_folds)

Scores on the validation set.

valid_scores_mean_: array, shape (n_ticks,)

Mean scores for each validation split.

valid_scores_std_: array, shape (n_ticks,)

Standard deviation of scores for each validation split.
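These attributes make it possible to post-process the curve, for example to find the smallest random subset that reaches a target score. A minimal sketch (the 0.8 threshold is illustrative):

from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_digits

from yellowbrick.model_selection import DroppingCurve

X, y = load_digits(return_X_y=True)

model = DroppingCurve(GaussianNB())
model.fit(X, y)

# Smallest subset size whose mean cross-validation score clears the threshold
target = 0.8
enough = model.feature_sizes_[model.valid_scores_mean_ >= target]
if enough.size:
    print(f"{enough[0]} features reach a mean CV score of {target}")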

draw(**kwargs)[source]

Renders the training and validation learning curves.

finalize(**kwargs)[source]

Adds the title, legend, and other final visual touches to the plot.

fit(X, y=None)[source]

Fits the feature dropping curve with the wrapped model to the specified data. Draws training and cross-validation score curves and saves the scores to the estimator.

Parameters
X: array-like, shape (n_samples, n_features)

Input vector, where n_samples is the number of samples and n_features is the number of features.

y: array-like, shape (n_samples,) or (n_samples, n_features), optional

Target relative to X for classification or regression; None for unsupervised learning.

yellowbrick.model_selection.dropping_curve.dropping_curve(estimator, X, y, feature_sizes=array([0.1, 0.325, 0.55, 0.775, 1.0]), groups=None, ax=None, logx=False, cv=None, scoring=None, n_jobs=None, pre_dispatch='all', random_state=None, show=True, **kwargs) DroppingCurve[source]

Displays a random feature dropping curve, comparing the number of features to training and cross-validation scores. The dropping curve aims to show how a model improves with more information.

This helper function wraps the DroppingCurve class for one-off analysis.

Parameters
estimator: a scikit-learn estimator

An object that implements fit and predict, can be a classifier, regressor, or clusterer so long as there is also a valid associated scoring metric.

Note that the object is cloned for each validation.

X: array-like, shape (n_samples, n_features)

Input vector, where n_samples is the number of samples and n_features is the number of features.

y: array-like, shape (n_samples,) or (n_samples, n_features), optional

Target relative to X for classification or regression; None for unsupervised learning.

feature_sizes: array-like, shape (n_values,)

default: np.linspace(0.1,1.0,5)

Relative or absolute numbers of input features that will be used to generate the dropping curve. If the dtype is float, it is regarded as a fraction of the maximum number of features, otherwise it is interpreted as absolute numbers of features.

groups: array-like, with shape (n_samples,)

Optional group labels for the samples used while splitting the dataset into train/test sets.

ax: matplotlib.Axes object, optional

The axes object to plot the figure on.

logx: boolean, optional

If True, plots the x-axis with a logarithmic scale.

cv: int, cross-validation generator or an iterable, optional

Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 3-fold cross-validation;

  • integer, to specify the number of folds;

  • an object to be used as a cross-validation generator;

  • an iterable yielding train/test splits.

See the scikit-learn cross-validation guide for more information on the possible strategies that can be used here.

scoring: string, callable or None, optional, default: None

A string or scorer callable object / function with signature scorer(estimator, X, y). See scikit-learn model evaluation documentation for names of possible metrics.

n_jobs: integer, optional

Number of jobs to run in parallel (default 1).

pre_dispatch: integer or string, optional

Number of pre-dispatched jobs for parallel execution (default is 'all'). This option can reduce the allocated memory. The string can be an expression like '2*n_jobs'.

random_state: int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used to generate feature subsets.

show: bool, default: True

If True, calls show() on the visualizer before returning it; if False, simply calls finalize().

kwargs: dict

Keyword arguments that are passed to the base class and may influence the visualization as defined in other Visualizers.

Returns
dc: DroppingCurve

Returns the fitted visualizer.