Source code for yellowbrick.model_selection.learning_curve

# yellowbrick.model_selection.learning_curve
# Implements a learning curve visualization for model selection.
#
# Author:   Jason Keung
# Created:  Mon May 22 09:22:00 2017 -0500
#
# Copyright (C) 2017 The scikit-yb developers
# For license information, see LICENSE.txt
#
# ID: learning_curve.py [c5355ee] benjamin@bengfort.com $

"""
Implements a learning curve visualization for model selection.
"""

##########################################################################
## Imports
##########################################################################

import numpy as np

from yellowbrick.base import ModelVisualizer
from yellowbrick.style import resolve_colors
from yellowbrick.exceptions import YellowbrickValueError

from sklearn.model_selection import learning_curve as sk_learning_curve


# Default ticks for the learning curve train sizes
DEFAULT_TRAIN_SIZES = np.linspace(0.1, 1.0, 5)


##########################################################################
# LearningCurve Visualizer
##########################################################################


[docs]class LearningCurve(ModelVisualizer):
    """
    Visualizes the learning curve for both test and training data for
    different training set sizes. These curves can act as a proxy to
    demonstrate the implied learning rate with experience (e.g. how much data
    is required to make an adequate model). They also demonstrate if the model
    is more sensitive to error due to bias vs. error due to variance and can
    be used to quickly check if a model is overfitting.

    The visualizer evaluates cross-validated training and test scores for
    different training set sizes. These curves are plotted so that the x-axis
    is the training set size and the y-axis is the score.

    The cross-validation generator splits the whole dataset k times, scores
    are averaged over all k runs for the training subset. The curve plots the
    mean score for the k splits, and the filled in area suggests the
    variability of the cross-validation by plotting one standard deviation
    above and below the mean for each split.

    Parameters
    ----------
    estimator : a scikit-learn estimator
        An object that implements ``fit`` and ``predict``, can be a
        classifier, regressor, or clusterer so long as there is also a valid
        associated scoring metric.

        Note that the object is cloned for each validation.

    ax : matplotlib.Axes object, optional
        The axes object to plot the figure on.

    groups : array-like, with shape (n_samples,)
        Optional group labels for the samples used while splitting the dataset
        into train/test sets.

    train_sizes : array-like, shape (n_ticks,)
        default: ``np.linspace(0.1,1.0,5)``

        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the dtype is float, it is regarded as
        a fraction of the maximum size of the training set, otherwise it is
        interpreted as absolute sizes of the training sets.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:

          - None, to use the default 3-fold cross-validation,
          - integer, to specify the number of folds.
          - An object to be used as a cross-validation generator.
          - An iterable yielding train/test splits.

        see the scikit-learn
        `cross-validation guide <https://bit.ly/2MMQAI7>`_
        for more information on the possible strategies that can be used here.

    scoring : string, callable or None, optional, default: None
        A string or scorer callable object / function with signature
        ``scorer(estimator, X, y)``. See scikit-learn model evaluation
        documentation for names of possible metrics.

    exploit_incremental_learning : boolean, default: False
        If the estimator supports incremental learning, this will be used to
        speed up fitting for different training set sizes.

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).

    pre_dispatch : integer or string, optional
        Number of predispatched jobs for parallel execution (default is
        all). The option can reduce the allocated memory. The string can
        be an expression like '2*n_jobs'.

    shuffle : boolean, optional
        Whether to shuffle training data before taking prefixes of it
        based on``train_sizes``.

    random_state : int, RandomState instance or None, optional (default=None)
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`. Used when ``shuffle`` is True.

    kwargs : dict
        Keyword arguments that are passed to the base class and may influence
        the visualization as defined in other Visualizers.

    Attributes
    ----------
    train_sizes_ : array, shape = (n_unique_ticks,), dtype int
        Numbers of training examples that has been used to generate the
        learning curve. Note that the number of ticks might be less
        than n_ticks because duplicate entries will be removed.

    train_scores_ : array, shape (n_ticks, n_cv_folds)
        Scores on training sets.

    train_scores_mean_ : array, shape (n_ticks,)
        Mean training data scores for each training split

    train_scores_std_ : array, shape (n_ticks,)
        Standard deviation of training data scores for each training split

    test_scores_ : array, shape (n_ticks, n_cv_folds)
        Scores on test set.

    test_scores_mean_ : array, shape (n_ticks,)
        Mean test data scores for each test split

    test_scores_std_ : array, shape (n_ticks,)
        Standard deviation of test data scores for each test split

    Examples
    --------

    >>> from yellowbrick.model_selection import LearningCurve
    >>> from sklearn.naive_bayes import GaussianNB
    >>> model = LearningCurve(GaussianNB())
    >>> model.fit(X, y)
    >>> model.show()

    Notes
    -----
    This visualizer is essentially a wrapper for the
    ``sklearn.model_selection.learning_curve utility``, discussed in the
    `validation curves <https://bit.ly/2KlumeB>`__
    documentation.

    .. seealso:: The documentation for the
        `learning_curve <https://bit.ly/2Yz9sBB>`__
        function, which this visualizer wraps.
    """

    def __init__(
        self,
        estimator,
        ax=None,
        groups=None,
        train_sizes=DEFAULT_TRAIN_SIZES,
        cv=None,
        scoring=None,
        exploit_incremental_learning=False,
        n_jobs=1,
        pre_dispatch="all",
        shuffle=False,
        random_state=None,
        **kwargs
    ):

        # Initialize the model visualizer
        super(LearningCurve, self).__init__(estimator, ax=ax, **kwargs)

        # Validate the train sizes
        train_sizes = np.asarray(train_sizes)
        if train_sizes.ndim != 1:
            raise YellowbrickValueError(
                "must specify array of train sizes, '{}' is not valid".format(
                    repr(train_sizes)
                )
            )

        # Set the metric parameters to be used later
        self.groups = groups
        self.train_sizes = train_sizes
        self.cv = cv
        self.scoring = scoring
        self.exploit_incremental_learning = exploit_incremental_learning
        self.n_jobs = n_jobs
        self.pre_dispatch = pre_dispatch
        self.shuffle = shuffle
        self.random_state = random_state

[docs]    def fit(self, X, y=None):
        """
        Fits the learning curve with the wrapped model to the specified data.
        Draws training and test score curves and saves the scores to the
        estimator.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            Training vector, where n_samples is the number of samples and
            n_features is the number of features.

        y : array-like, shape (n_samples) or (n_samples, n_features), optional
            Target relative to X for classification or regression;
            None for unsupervised learning.

        Returns
        -------
        self : instance
            Returns the instance of the learning curve visualizer for use in
            pipelines and other sequential transformers.
        """
        # arguments to pass to sk_learning_curve
        sklc_kwargs = {
            key: self.get_params()[key]
            for key in (
                "groups",
                "train_sizes",
                "cv",
                "scoring",
                "exploit_incremental_learning",
                "n_jobs",
                "pre_dispatch",
                "shuffle",
                "random_state",
            )
        }

        # compute the learning curve and store the scores on the estimator
        curve = sk_learning_curve(self.estimator, X, y, **sklc_kwargs)
        self.train_sizes_, self.train_scores_, self.test_scores_ = curve

        # compute the mean and standard deviation of the training data
        self.train_scores_mean_ = np.mean(self.train_scores_, axis=1)
        self.train_scores_std_ = np.std(self.train_scores_, axis=1)

        # compute the mean and standard deviation of the test data
        self.test_scores_mean_ = np.mean(self.test_scores_, axis=1)
        self.test_scores_std_ = np.std(self.test_scores_, axis=1)

        # draw the curves on the current axes
        self.draw()
        return self

[docs]    def draw(self, **kwargs):
        """
        Renders the training and test learning curves.
        """
        # Specify the curves to draw and their labels
        labels = ("Training Score", "Cross Validation Score")
        curves = (
            (self.train_scores_mean_, self.train_scores_std_),
            (self.test_scores_mean_, self.test_scores_std_),
        )

        # Get the colors for the train and test curves
        colors = resolve_colors(n_colors=2)

        # Plot the fill betweens first so they are behind the curves.
        for idx, (mean, std) in enumerate(curves):
            # Plot one standard deviation above and below the mean
            self.ax.fill_between(
                self.train_sizes_, mean - std, mean + std, alpha=0.25, color=colors[idx]
            )

        # Plot the mean curves so they are in front of the variance fill
        for idx, (mean, _) in enumerate(curves):
            self.ax.plot(
                self.train_sizes_, mean, "o-", color=colors[idx], label=labels[idx]
            )

        return self.ax

[docs]    def finalize(self, **kwargs):
        """
        Add the title, legend, and other visual final touches to the plot.
        """
        # Set the title of the figure
        self.set_title("Learning Curve for {}".format(self.name))

        # Add the legend
        self.ax.legend(frameon=True, loc="best")

        # Set the axis labels
        self.ax.set_xlabel("Training Instances")
        self.ax.set_ylabel("Score")


##########################################################################
# Quick Method
##########################################################################


[docs]def learning_curve(
    estimator,
    X,
    y,
    ax=None,
    groups=None,
    train_sizes=DEFAULT_TRAIN_SIZES,
    cv=None,
    scoring=None,
    exploit_incremental_learning=False,
    n_jobs=1,
    pre_dispatch="all",
    shuffle=False,
    random_state=None,
    show=True,
    **kwargs
):
    """
    Displays a learning curve based on number of samples vs training and
    cross validation scores. The learning curve aims to show how a model
    learns and improves with experience.

    This helper function is a quick wrapper to utilize the LearningCurve
    for one-off analysis.

    Parameters
    ----------
    estimator : a scikit-learn estimator
        An object that implements ``fit`` and ``predict``, can be a
        classifier, regressor, or clusterer so long as there is also a valid
        associated scoring metric.

        Note that the object is cloned for each validation.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ax : matplotlib.Axes object, optional
        The axes object to plot the figure on.

    groups : array-like, with shape (n_samples,)
        Optional group labels for the samples used while splitting the dataset
        into train/test sets.

    train_sizes : array-like, shape (n_ticks,)
        default: ``np.linspace(0.1,1.0,5)``

        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the dtype is float, it is regarded as
        a fraction of the maximum size of the training set, otherwise it is
        interpreted as absolute sizes of the training sets.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:

          - None, to use the default 3-fold cross-validation,
          - integer, to specify the number of folds.
          - An object to be used as a cross-validation generator.
          - An iterable yielding train/test splits.

        see the scikit-learn
        `cross-validation guide <https://bit.ly/2MMQAI7>`_
        for more information on the possible strategies that can be used here.

    scoring : string, callable or None, optional, default: None
        A string or scorer callable object / function with signature
        ``scorer(estimator, X, y)``. See scikit-learn model evaluation
        documentation for names of possible metrics.

    exploit_incremental_learning : boolean, default: False
        If the estimator supports incremental learning, this will be used to
        speed up fitting for different training set sizes.

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).

    pre_dispatch : integer or string, optional
        Number of predispatched jobs for parallel execution (default is
        all). The option can reduce the allocated memory. The string can
        be an expression like '2*n_jobs'.

    shuffle : boolean, optional
        Whether to shuffle training data before taking prefixes of it
        based on``train_sizes``.

    random_state : int, RandomState instance or None, optional (default=None)
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`. Used when ``shuffle`` is True.

    show : bool, default: True
        If True, calls ``show()``, which in turn calls ``plt.show()`` however
        you cannot call ``plt.savefig`` from this signature, nor
        ``clear_figure``. If False, simply calls ``finalize()``

    kwargs : dict
        Keyword arguments that are passed to the base class and may influence
        the visualization as defined in other Visualizers. These arguments are
        also passed to the `show()` method, e.g. can pass a path to save the
        figure to.

    Returns
    -------
    visualizer : LearningCurve
        Returns the fitted visualizer.
    """
    # Initialize the visualizer
    oz = LearningCurve(
        estimator,
        ax=ax,
        groups=groups,
        train_sizes=train_sizes,
        cv=cv,
        scoring=scoring,
        n_jobs=n_jobs,
        pre_dispatch=pre_dispatch,
        shuffle=shuffle,
        random_state=random_state,
        exploit_incremental_learning=exploit_incremental_learning,
        **kwargs
    )

    # Fit and show the visualizer
    oz.fit(X, y)
    if show:
        oz.show()
    else:
        oz.finalize()
    return oz