Source code for yellowbrick.model_selection.rfecv

# yellowbrick.model_selection.rfecv
# Visualize the number of features selected with recursive feature elimination
#
# Author:  Benjamin Bengfort
# Created: Tue Apr 03 17:31:37 2018 -0400
#
# Copyright (C) 2018 The scikit-yb developers
# For license information, see LICENSE.txt
#
# ID: rfecv.py [a4599db] rebeccabilbro@users.noreply.github.com $

"""
Visualize the number of features selected using recursive feature elimination
"""

##########################################################################
## Imports
##########################################################################

import numpy as np

from yellowbrick.base import ModelVisualizer
from yellowbrick.exceptions import YellowbrickValueError

from sklearn.utils import check_X_y
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score


##########################################################################
## Recursive Feature Elimination
##########################################################################


[docs]class RFECV(ModelVisualizer): """ Recursive Feature Elimination, Cross-Validated (RFECV) feature selection. Selects the best subset of features for the supplied estimator by removing 0 to N features (where N is the number of features) using recursive feature elimination, then selecting the best subset based on the cross-validation score of the model. Recursive feature elimination eliminates n features from a model by fitting the model multiple times and at each step, removing the weakest features, determined by either the ``coef_`` or ``feature_importances_`` attribute of the fitted model. The visualization plots the score relative to each subset and shows trends in feature elimination. If the feature elimination CV score is flat, then potentially there are not enough features in the model. An ideal curve is when the score jumps from low to high as the number of features removed increases, then slowly decreases again from the optimal number of features. Parameters ---------- estimator : a scikit-learn estimator An object that implements ``fit`` and provides information about the relative importance of features with either a ``coef_`` or ``feature_importances_`` attribute. Note that the object is cloned for each validation. ax : matplotlib.Axes object, optional The axes object to plot the figure on. step : int or float, optional (default=1) If greater than or equal to 1, then step corresponds to the (integer) number of features to remove at each iteration. If within (0.0, 1.0), then step corresponds to the percentage (rounded down) of features to remove at each iteration. groups : array-like, with shape (n_samples,), optional Group labels for the samples used while splitting the dataset into train/test set. cv : int, cross-validation generator or an iterable, optional Determines the cross-validation splitting strategy. Possible inputs for cv are: - None, to use the default 3-fold cross-validation, - integer, to specify the number of folds. - An object to be used as a cross-validation generator. - An iterable yielding train/test splits. see the scikit-learn `cross-validation guide <http://scikit-learn.org/stable/modules/cross_validation.html>`_ for more information on the possible strategies that can be used here. scoring : string, callable or None, optional, default: None A string or scorer callable object / function with signature ``scorer(estimator, X, y)``. See scikit-learn model evaluation documentation for names of possible metrics. kwargs : dict Keyword arguments that are passed to the base class and may influence the visualization as defined in other Visualizers. Attributes ---------- n_features_ : int The number of features in the selected subset support_ : array of shape [n_features] A mask of the selected features ranking_ : array of shape [n_features] The feature ranking, such that ``ranking_[i]`` corresponds to the ranked position of feature i. Selected features are assigned rank 1. cv_scores_ : array of shape [n_subsets_of_features, n_splits] The cross-validation scores for each subset of features and splits in the cross-validation strategy. rfe_estimator_ : sklearn.feature_selection.RFE A fitted RFE estimator wrapping the original estimator. All estimator functions such as ``predict()`` and ``score()`` are passed through to this estimator (it rewraps the original model). n_feature_subsets_ : array of shape [n_subsets_of_features] The number of features removed on each iteration of RFE, computed by the number of features in the dataset and the step parameter. Notes ----- This model wraps ``sklearn.feature_selection.RFE`` and not ``sklearn.feature_selection.RFECV`` because access to the internals of the CV and RFE estimators is required for the visualization. The visualizer does take similar arguments, however it does not expose the same internal attributes. Additionally, the RFE model can be accessed via the ``rfe_estimator_`` attribute. Once fitted, the visualizer acts as a wrapper for this estimator and not for the original model passed to the model. This way the visualizer model can be used to make predictions. .. caution:: This visualizer requires a model that has either a ``coef_`` or ``feature_importances_`` attribute when fitted. """ def __init__( self, estimator, ax=None, step=1, groups=None, cv=None, scoring=None, **kwargs ): # Initialize the model visualizer super(RFECV, self).__init__(estimator, ax=ax, **kwargs) # Set parameters self.step = step self.groups = groups self.cv = cv self.scoring = scoring
[docs] def fit(self, X, y=None): """ Fits the RFECV with the wrapped model to the specified data and draws the rfecv curve with the optimal number of features found. Parameters ---------- X : array-like, shape (n_samples, n_features) Training vector, where n_samples is the number of samples and n_features is the number of features. y : array-like, shape (n_samples) or (n_samples, n_features), optional Target relative to X for classification or regression. Returns ------- self : instance Returns the instance of the RFECV visualizer. """ X, y = check_X_y(X, y, "csr") n_features = X.shape[1] # This check is kind of unnecessary since RFE will do it, but it's # nice to get it out of the way ASAP and raise a meaningful error. if 0.0 < self.step < 1.0: step = int(max(1, self.step * n_features)) else: step = int(self.step) if step <= 0: raise YellowbrickValueError("step must be >0") # Create the RFE model rfe = RFE(self.estimator, step=step) self.n_feature_subsets_ = np.arange(1, n_features + step, step) # Create the cross validation params # TODO: handle random state cv_params = {key: self.get_params()[key] for key in ("groups", "cv", "scoring")} # Perform cross-validation for each feature subset scores = [] for n_features_to_select in self.n_feature_subsets_: rfe.set_params(n_features_to_select=n_features_to_select) scores.append(cross_val_score(rfe, X, y, **cv_params)) # Convert scores to array self.cv_scores_ = np.array(scores) # Find the best RFE model bestidx = self.cv_scores_.mean(axis=1).argmax() self.n_features_ = self.n_feature_subsets_[bestidx] # Fit the final RFE model for the number of features self.rfe_estimator_ = rfe self.rfe_estimator_.set_params(n_features_to_select=self.n_features_) self.rfe_estimator_.fit(X, y) # Rewrap the visualizer to use the rfe estimator self._wrapped = self.rfe_estimator_ # Hoist the RFE params to the visualizer self.support_ = self.rfe_estimator_.support_ self.ranking_ = self.rfe_estimator_.ranking_ self.draw() return self
[docs] def draw(self, **kwargs): """ Renders the rfecv curve. """ # Compute the curves x = self.n_feature_subsets_ means = self.cv_scores_.mean(axis=1) sigmas = self.cv_scores_.std(axis=1) # Plot one standard deviation above and below the mean self.ax.fill_between(x, means - sigmas, means + sigmas, alpha=0.25) # Plot the curve self.ax.plot(x, means, "o-") # Plot the maximum number of features self.ax.axvline( self.n_features_, c="k", ls="--", label="n_features = {}\nscore = {:0.3f}".format( self.n_features_, self.cv_scores_.mean(axis=1).max() ), ) return self.ax
[docs] def finalize(self, **kwargs): """ Add the title, legend, and other visual final touches to the plot. """ # Set the title of the figure self.set_title("RFECV for {}".format(self.name)) # Add the legend self.ax.legend(frameon=True, loc="best") # Set the axis labels self.ax.set_xlabel("Number of Features Selected") self.ax.set_ylabel("Score")
########################################################################## ## Quick Methods ##########################################################################
[docs]def rfecv( estimator, X, y, ax=None, step=1, groups=None, cv=None, scoring=None, show=True, **kwargs ): """ Performs recursive feature elimination with cross-validation to determine an optimal number of features for a model. Visualizes the feature subsets with respect to the cross-validation score. This helper function is a quick wrapper to utilize the RFECV visualizer for one-off analysis. Parameters ---------- estimator : a scikit-learn estimator An object that implements ``fit`` and provides information about the relative importance of features with either a ``coef_`` or ``feature_importances_`` attribute. Note that the object is cloned for each validation. X : array-like, shape (n_samples, n_features) Training vector, where n_samples is the number of samples and n_features is the number of features. y : array-like, shape (n_samples) or (n_samples, n_features), optional Target relative to X for classification or regression. ax : matplotlib.Axes object, optional The axes object to plot the figure on. step : int or float, optional (default=1) If greater than or equal to 1, then step corresponds to the (integer) number of features to remove at each iteration. If within (0.0, 1.0), then step corresponds to the percentage (rounded down) of features to remove at each iteration. groups : array-like, with shape (n_samples,), optional Group labels for the samples used while splitting the dataset into train/test set. cv : int, cross-validation generator or an iterable, optional Determines the cross-validation splitting strategy. Possible inputs for cv are: - None, to use the default 3-fold cross-validation, - integer, to specify the number of folds. - An object to be used as a cross-validation generator. - An iterable yielding train/test splits. see the scikit-learn `cross-validation guide <http://scikit-learn.org/stable/modules/cross_validation.html>`_ for more information on the possible strategies that can be used here. scoring : string, callable or None, optional, default: None A string or scorer callable object / function with signature ``scorer(estimator, X, y)``. See scikit-learn model evaluation documentation for names of possible metrics. show: bool, default: True If True, calls ``show()``, which in turn calls ``plt.show()`` however you cannot call ``plt.savefig`` from this signature, nor ``clear_figure``. If False, simply calls ``finalize()`` kwargs : dict Keyword arguments that are passed to the base class and may influence the visualization as defined in other Visualizers. These arguments are also passed to the `show()` method, e.g. can pass a path to save the figure to. Returns ------- viz : RFECV Returns the fitted, finalized visualizer. """ # Initialize the visualizer oz = RFECV( estimator, ax=ax, step=step, groups=groups, cv=cv, scoring=scoring, show=show ) # Fit and show the visualizer oz.fit(X, y) if show: oz.show() else: oz.finalize() # Return the visualizer object return oz