Example Datasets

Yellowbrick hosts several datasets wrangled from the UCI Machine Learning Repository to present the examples used throughout this documentation. These datasets are hosted on our CDN and must be downloaded before use. Typically, when a user calls one of the data loader functions, e.g. load_bikeshare(), the data is automatically downloaded if it is not already on the user’s computer. However, for development and testing, or if you know you will be working without internet access, it may be easier to download all the data at once.

The data downloader script can be run as follows:

$ python -m yellowbrick.download

This will download all of the data to the fixtures directory inside the Yellowbrick site packages. You can specify the location of the download either as an argument to the downloader script (use --help for more details) or by setting the $YELLOWBRICK_DATA environment variable. Setting the environment variable is the preferred mechanism because it also influences how data is loaded in Yellowbrick.

Note

Developers who have downloaded data from Yellowbrick versions earlier than v1.0 may experience some problems with the older data format. If this occurs, you can clear out your data cache by running python -m yellowbrick.download --cleanup. This will remove old datasets and download the new ones. You can also use the --no-download flag to clear the cache without re-downloading data. Users who are having difficulty with datasets can also use this command, or they can uninstall and reinstall Yellowbrick using pip.

Once you have downloaded the example datasets, you can load and use them as follows:

from yellowbrick.datasets import load_bikeshare

X, y = load_bikeshare() # returns features and targets for the bikeshare dataset

Unless otherwise specified, most of the examples currently use one or more of the datasets listed below. Each dataset has a README.md with detailed information about the data source, attributes, and target. Here is a complete listing of all datasets in Yellowbrick and the analytical tasks with which they are most commonly associated:

  • bikeshare: suitable for regression
  • concrete: suitable for regression
  • credit: suitable for classification/clustering
  • energy: suitable for regression
  • game: suitable for classification
  • hobbies: suitable for text analysis/classification
  • mushroom: suitable for classification/clustering
  • occupancy: suitable for classification
  • spam: suitable for binary classification
  • walking: suitable for time series analysis/clustering

API Reference

Helper functions for looking up dataset paths.

yellowbrick.datasets.path.cleanup_dataset(dataset, data_home=None, ext='.zip')

Removes the dataset directory and archive file from the data home directory.

Parameters:
dataset : str

The name of the dataset; should either be a folder in data home or specified in the yellowbrick.datasets.DATASETS variable.

data_home : str, optional

The path on disk where data is stored. If not passed in, it is looked up from YELLOWBRICK_DATA or the default returned by get_data_home.

ext : str, default: “.zip”

The extension of the archive file.

Returns:
removed : int

The number of objects removed from data_home.
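The behaviour described for cleanup_dataset can be sketched with the standard library alone. This is a minimal illustration, not Yellowbrick's implementation: remove_dataset_files is a hypothetical name, and the layout (a directory named after the dataset next to an archive named dataset + ext) is an assumption inferred from the parameter descriptions above.

```python
import os
import shutil
import tempfile


def remove_dataset_files(dataset, data_home, ext=".zip"):
    """Hypothetical helper: remove <data_home>/<dataset>/ and <data_home>/<dataset><ext>.

    Returns the number of objects (directory and/or archive) that were removed.
    """
    removed = 0
    datadir = os.path.join(data_home, dataset)        # assumed dataset directory
    archive = os.path.join(data_home, dataset + ext)  # assumed archive location

    if os.path.isdir(datadir):
        shutil.rmtree(datadir)
        removed += 1
    if os.path.isfile(archive):
        os.remove(archive)
        removed += 1
    return removed


# Demonstration in a throwaway directory, so no real data is touched.
with tempfile.TemporaryDirectory() as data_home:
    os.makedirs(os.path.join(data_home, "bikeshare"))
    open(os.path.join(data_home, "bikeshare.zip"), "wb").close()
    print(remove_dataset_files("bikeshare", data_home))  # 2: directory and archive
    print(remove_dataset_files("bikeshare", data_home))  # 0: nothing left to remove
```

Returning a count rather than a boolean lets a caller distinguish "nothing was cached" from "both the extracted data and the archive were cleared".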

yellowbrick.datasets.path.dataset_archive(dataset, signature, data_home=None, ext='.zip')

Checks to see if the dataset archive file exists in the data home directory, found with get_data_home. By specifying the signature, this function also checks to see if the archive is the latest version by comparing the sha256sum of the local archive with the specified signature.

Parameters:
dataset : str

The name of the dataset; should either be a folder in data home or specified in the yellowbrick.datasets.DATASETS variable.

signature : str

The SHA-256 signature of the dataset, used to determine if the archive is the latest version of the dataset or not.

data_home : str, optional

The path on disk where data is stored. If not passed in, it is looked up from YELLOWBRICK_DATA or the default returned by get_data_home.

ext : str, default: “.zip”

The extension of the archive file.

Returns:
exists : bool

True if the dataset archive exists and is the latest version.
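The signature check described above amounts to hashing the local archive and comparing the hex digest with the expected value. The following is a minimal sketch of that idea using hashlib; sha256sum and archive_is_current are hypothetical names chosen for illustration, and the sketch takes a full file path rather than resolving one from data home.

```python
import hashlib
import os
import tempfile


def sha256sum(path, blocksize=65536):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(blocksize), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_current(path, signature):
    """Hypothetical check: True only if the file exists and its hash matches."""
    return os.path.isfile(path) and sha256sum(path) == signature


# Demonstration with a stand-in file; a real archive would be a .zip download.
with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "example.zip")
    with open(archive, "wb") as f:
        f.write(b"example archive bytes")
    expected = hashlib.sha256(b"example archive bytes").hexdigest()
    print(archive_is_current(archive, expected))    # True
    print(archive_is_current(archive, "0" * 64))    # False: signature mismatch
```

Reading in chunks keeps memory use flat for large archives; a mismatch is treated the same as a missing file, which is what prompts a fresh download in a cache of this kind.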

yellowbrick.datasets.path.dataset_exists(dataset, data_home=None)

Checks to see if a directory with the name of the specified dataset exists in the data home directory, found with get_data_home.

Parameters:
dataset : str

The name of the dataset; should either be a folder in data home or specified in the yellowbrick.datasets.DATASETS variable.

data_home : str, optional

The path on disk where data is stored. If not passed in, it is looked up from YELLOWBRICK_DATA or the default returned by get_data_home.

Returns:
exists : bool

True if a folder with the dataset name is in the data home directory.

yellowbrick.datasets.path.find_dataset_path(dataset, data_home=None, fname=None, ext='.csv.gz', raises=True)

Looks up the path to the specified dataset in the data home directory, which is found using the get_data_home function. By default, data home is colocated with the code, but it can be changed with the YELLOWBRICK_DATA environment variable or by passing in a different directory.

By default, the file returned is named after the dataset and is in compressed CSV format. Other filenames and extensions can be passed in to locate other data types or auxiliary files.

If the dataset is not found, a DatasetsError is raised by default.

Parameters:
dataset : str

The name of the dataset; should either be a folder in data home or specified in the yellowbrick.datasets.DATASETS variable.

data_home : str, optional

The path on disk where data is stored. If not passed in, it is looked up from YELLOWBRICK_DATA or the default returned by get_data_home.

fname : str, optional

The filename to look up in the dataset path; by default, it is the name of the dataset. The fname must include an extension.

ext : str, default: “.csv.gz”

The extension of the data to look up in the dataset path. If fname is specified, the ext parameter is ignored. If ext is None, the directory of the dataset is returned.

raises : bool, default: True

If the path does not exist, raises a DatasetsError unless this flag is set to False, in which case None is returned (e.g. for checking whether the path exists).

Returns:
path : str or None

A path to the requested file, guaranteed to exist if an exception is not raised during processing of the request (unless None is returned).

Raises:
DatasetsError

If raises is True and the path does not exist.
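The lookup rules described above (fname takes precedence over ext, ext=None returns the dataset directory, and raises controls the failure mode) can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: locate_dataset_file is a hypothetical name, the layout <data_home>/<dataset>/<dataset><ext> is inferred from the parameter descriptions, and the built-in FileNotFoundError stands in for DatasetsError.

```python
import os
import tempfile


def locate_dataset_file(dataset, data_home, fname=None, ext=".csv.gz", raises=True):
    """Hypothetical lookup mirroring the documented parameter semantics."""
    if fname is not None:
        # An explicit filename wins; ext is ignored in this case.
        path = os.path.join(data_home, dataset, fname)
    elif ext is None:
        # No filename and no extension: return the dataset directory itself.
        path = os.path.join(data_home, dataset)
    else:
        # Default: a file named after the dataset, with the given extension.
        path = os.path.join(data_home, dataset, dataset + ext)

    if not os.path.exists(path):
        if raises:
            # Stand-in for DatasetsError in this sketch.
            raise FileNotFoundError("could not find dataset at {}".format(path))
        return None
    return path


# Demonstration against a temporary data home.
with tempfile.TemporaryDirectory() as data_home:
    os.makedirs(os.path.join(data_home, "game"))
    open(os.path.join(data_home, "game", "game.csv.gz"), "wb").close()

    print(locate_dataset_file("game", data_home))                    # default data file
    print(locate_dataset_file("game", data_home, ext=None))          # dataset directory
    print(locate_dataset_file("game", data_home, fname="README.md",
                              raises=False))                         # None: file absent
```

Passing raises=False turns the function into an existence check, which avoids wrapping every probe in a try/except block.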

yellowbrick.datasets.path.get_data_home(path=None)

Return the path of the Yellowbrick data directory. This folder is used by dataset loaders to avoid downloading data several times.

By default, this folder is colocated with the code in the install directory so that data shipped with the package can be easily located. Alternatively it can be set by the YELLOWBRICK_DATA environment variable, or programmatically by giving a folder path. Note that the ‘~’ symbol is expanded to the user home directory, and environment variables are also expanded when resolving the path.
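The resolution order described above (an explicit path first, then the YELLOWBRICK_DATA environment variable, then a default, with ‘~’ and environment variables expanded) can be sketched as below. The name resolve_data_home and the default value "~/yellowbrick_data" are assumptions made for illustration; the actual default is the fixtures directory inside the installed package, and this sketch does not create the directory.

```python
import os


def resolve_data_home(path=None, default="~/yellowbrick_data"):
    """Hypothetical resolver: explicit path, else $YELLOWBRICK_DATA, else default.

    The default used here is a placeholder, not Yellowbrick's real default.
    """
    if path is None:
        path = os.environ.get("YELLOWBRICK_DATA", default)
    # Expand a leading '~' and then any $VARIABLE references in the path.
    return os.path.expandvars(os.path.expanduser(path))


# An explicit argument takes precedence over the environment variable.
print(resolve_data_home("~/datasets"))
# With no argument, the environment variable (if set) or the default is used.
print(resolve_data_home())
```

Because the environment variable is consulted whenever no path is passed, setting YELLOWBRICK_DATA once affects both where data is downloaded and where loaders look for it, which is why the text above recommends it.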