Example Datasets

Yellowbrick hosts several datasets wrangled from the UCI Machine Learning Repository to present the examples used throughout this documentation. If you haven’t downloaded the data, you can do so by running:

$ python -m yellowbrick.download

This should create a folder named data in your current working directory that contains all of the datasets. You can load a specified dataset with pandas.read_csv as follows:

import pandas as pd

data = pd.read_csv('data/concrete/concrete.csv')

The following code snippet can be found at the top of the examples/examples.ipynb notebook in Yellowbrick. Please reference this code when trying to load a specific data set:

import os

from yellowbrick.download import download_all

## The path to the test data sets
FIXTURES  = os.path.join(os.getcwd(), "data")

## Dataset loading mechanisms
datasets = {
    "bikeshare": os.path.join(FIXTURES, "bikeshare", "bikeshare.csv"),
    "concrete": os.path.join(FIXTURES, "concrete", "concrete.csv"),
    "credit": os.path.join(FIXTURES, "credit", "credit.csv"),
    "energy": os.path.join(FIXTURES, "energy", "energy.csv"),
    "game": os.path.join(FIXTURES, "game", "game.csv"),
    "mushroom": os.path.join(FIXTURES, "mushroom", "mushroom.csv"),
    "occupancy": os.path.join(FIXTURES, "occupancy", "occupancy.csv"),
}


def load_data(name, download=True):
    """
    Loads and wrangles the passed in dataset by name.
    If download is specified, this method will download any missing files.
    """

    # Get the path from the datasets
    path = datasets[name]

    # Check if the data exists, otherwise download or raise
    if not os.path.exists(path):
        if download:
            download_all()
        else:
            raise ValueError((
                "'{}' dataset has not been downloaded, "
                "use the download.py module to fetch datasets"
            ).format(name))


    # Return the data frame
    return pd.read_csv(path)

Unless otherwise specified, most of the examples currently use one or more of the listed datasets. Each dataset has a README.md with detailed information about the data source, attributes, and target. Here is a complete listing of all datasets in Yellowbrick and their associated analytical tasks: - bikeshare: suitable for regression - concrete: suitable for regression - credit: suitable for classification/clustering - energy: suitable for regression - game: suitable for classification - hobbies: suitable for text analysis - mushroom: suitable for classification/clustering - occupancy: suitable for classification