Example Datasets

Yellowbrick hosts several datasets wrangled from the UCI Machine Learning Repository that are used in the examples throughout this documentation. If you haven’t downloaded the data, you can do so by running:

$ python -m yellowbrick.download

This should create a folder named data in your current working directory that contains all of the datasets. You can then load a specific dataset with pandas.read_csv as follows:

import pandas as pd

data = pd.read_csv('data/concrete/concrete.csv')

The following code snippet can be found at the top of the examples/examples.ipynb notebook in Yellowbrick. Please refer to this code when trying to load a specific dataset:

import os

import pandas as pd

from yellowbrick.download import download_all

# The path to the test datasets
FIXTURES = os.path.join(os.getcwd(), "data")

# Dataset loading mechanisms
datasets = {
    "bikeshare": os.path.join(FIXTURES, "bikeshare", "bikeshare.csv"),
    "concrete": os.path.join(FIXTURES, "concrete", "concrete.csv"),
    "credit": os.path.join(FIXTURES, "credit", "credit.csv"),
    "energy": os.path.join(FIXTURES, "energy", "energy.csv"),
    "game": os.path.join(FIXTURES, "game", "game.csv"),
    "mushroom": os.path.join(FIXTURES, "mushroom", "mushroom.csv"),
    "occupancy": os.path.join(FIXTURES, "occupancy", "occupancy.csv"),
    "spam": os.path.join(FIXTURES, "spam", "spam.csv"),
}


def load_data(name, download=True):
    """
    Loads and wrangles the passed in dataset by name.
    If download is specified, this method will download any missing files.
    """

    # Get the path from the datasets
    path = datasets[name]

    # Check if the data exists, otherwise download or raise
    if not os.path.exists(path):
        if download:
            download_all()
        else:
            raise ValueError((
                "'{}' dataset has not been downloaded, "
                "use the download.py module to fetch datasets"
            ).format(name))

    # Return the data frame
    return pd.read_csv(path)
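
Once load_data is defined, pulling one of the bundled datasets into a DataFrame is a single call. A minimal sketch using the concrete dataset, assuming the snippet above has been run (any key from the datasets dictionary works the same way):

# Load the concrete dataset, downloading it first if it is missing
df = load_data("concrete")

# Inspect the shape and the first few rows before modeling
print(df.shape)
print(df.head())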

Unless otherwise specified, most of the examples currently use one or more of the listed datasets. Each dataset has a README.md with detailed information about the data source, attributes, and target. Here is a complete listing of all datasets in Yellowbrick and their associated analytical tasks:

  • bikeshare: suitable for regression
  • concrete: suitable for regression
  • credit: suitable for classification/clustering
  • energy: suitable for regression
  • game: suitable for classification
  • hobbies: suitable for text analysis
  • mushroom: suitable for classification/clustering
  • occupancy: suitable for classification
  • spam: suitable for binary classification
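
Because every dataset directory ships with its README.md, a quick way to review a dataset's attributes and target before modeling is to print that file next to the CSV. A minimal sketch, assuming the data folder created by python -m yellowbrick.download in the current working directory:

import os

name = "occupancy"  # any of the dataset names listed above

# Print the documentation that ships alongside the CSV
with open(os.path.join("data", name, "README.md")) as f:
    print(f.read())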

API Reference

Helper functions for looking up dataset paths.

yellowbrick.datasets.path.cleanup_dataset(dataset, data_home=None, ext='.zip')

Removes the dataset directory and archive file from the data home directory.

Parameters:
dataset : str

The name of the dataset; should either be a folder in data home or be specified in the yellowbrick.datasets.DATASETS variable.

data_home : str, optional

The path on disk where data is stored. If not passed in, it is looked up from YELLOWBRICK_DATA or the default returned by get_data_home.

ext : str, default: “.zip”

The extension of the archive file.

Returns:
removed : int

The number of objects removed from data_home.
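
For example, to reclaim disk space by removing a previously downloaded dataset and its archive (a sketch using the documented defaults; the data can simply be downloaded again later):

from yellowbrick.datasets.path import cleanup_dataset

# Remove the concrete dataset folder and its .zip archive from data home
removed = cleanup_dataset("concrete")
print("removed {} objects".format(removed))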

yellowbrick.datasets.path.dataset_archive(dataset, signature, data_home=None, ext='.zip')

Checks to see if the dataset archive file exists in the data home directory, found with get_data_home. It also checks whether the archive is the latest version by comparing the sha256sum of the local archive against the specified signature.

Parameters:
dataset : str

The name of the dataset; should either be a folder in data home or be specified in the yellowbrick.datasets.DATASETS variable.

signature : str

The SHA 256 signature of the dataset, used to determine if the archive is the latest version of the dataset or not.

data_home : str, optional

The path on disk where data is stored. If not passed in, it is looked up from YELLOWBRICK_DATA or the default returned by get_data_home.

ext : str, default: “.zip”

The extension of the archive file.

Returns:
exists : bool

True if the dataset archive exists and is the latest version.
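
For example, dataset_archive can decide whether a fresh download is needed. The signature below is a placeholder rather than a real checksum; in practice the expected value comes from the dataset metadata (e.g. the yellowbrick.datasets.DATASETS variable mentioned above):

from yellowbrick.datasets.path import dataset_archive

# Placeholder sha256 signature; substitute the published value for the dataset
EXPECTED_SIGNATURE = "0" * 64

if dataset_archive("concrete", EXPECTED_SIGNATURE):
    print("local archive is current")
else:
    print("archive missing or stale; re-download the dataset")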

yellowbrick.datasets.path.dataset_exists(dataset, data_home=None)

Checks to see if a directory with the name of the specified dataset exists in the data home directory, found with get_data_home.

Parameters:
dataset : str

The name of the dataset; should either be a folder in data home or be specified in the yellowbrick.datasets.DATASETS variable.

data_home : str, optional

The path on disk where data is stored. If not passed in, it is looked up from YELLOWBRICK_DATA or the default returned by get_data_home.

Returns:
exists : bool

True if a folder with the dataset name exists in the data home directory.
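
A typical use is guarding a download so the data is only fetched once, as in this sketch:

from yellowbrick.datasets.path import dataset_exists
from yellowbrick.download import download_all

# Only fetch the data if the energy dataset folder is not already in data home
if not dataset_exists("energy"):
    download_all()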

yellowbrick.datasets.path.find_dataset_path(dataset, data_home=None, fname=None, ext='.csv.gz', raises=True)

Looks up the path to the dataset specified in the data home directory, which is found using the get_data_home function. By default, data home is colocated with the code, but it can be modified with the YELLOWBRICK_DATA environment variable or by passing in a different directory.

By default, the file returned is the dataset's compressed CSV file. Other filenames and extensions can be passed in to locate other data types or auxiliary files.

If the dataset is not found, a DatasetsError is raised by default.

Parameters:
dataset : str

The name of the dataset; should either be a folder in data home or be specified in the yellowbrick.datasets.DATASETS variable.

data_home : str, optional

The path on disk where data is stored. If not passed in, it is looked up from YELLOWBRICK_DATA or the default returned by get_data_home.

fname : str, optional

The filename to look up in the dataset path; by default it is the name of the dataset. The fname must include an extension.

ext : str, default: “.csv.gz”

The extension of the data to look up in the dataset path. If fname is specified, the ext parameter is ignored; if ext is None, the directory of the dataset is returned.

raises : bool, default: True

If the path does not exist, a DatasetsError is raised unless this flag is set to False, in which case None is returned (e.g. for checking whether the path exists).

Returns:
path : str or None

A path to the requested file, guaranteed to exist if no exception is raised; None is returned only when raises is False and the path does not exist.

Raises:
DatasetsError

If raises is True and the path does not exist, a DatasetsError is raised.
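
For example, setting raises=False turns the lookup into a soft check that returns None instead of raising; a minimal sketch:

from yellowbrick.datasets.path import find_dataset_path

# Look up the compressed CSV for the bikeshare dataset without raising
path = find_dataset_path("bikeshare", raises=False)

if path is None:
    print("bikeshare data has not been downloaded to data home")
else:
    print("loading data from {}".format(path))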

yellowbrick.datasets.path.get_data_home(path=None)

Return the path of the Yellowbrick data directory. This folder is used by dataset loaders to avoid downloading data several times.

By default, this folder is colocated with the code in the install directory so that data shipped with the package can be easily located. Alternatively it can be set by the YELLOWBRICK_DATA environment variable, or programmatically by giving a folder path. Note that the ‘~’ symbol is expanded to the user home directory, and environment variables are also expanded when resolving the path.