Example Datasets

Yellowbrick hosts several datasets wrangled from the UCI Machine Learning Repository to present the examples used throughout this documentation. These datasets are hosted in our CDN and must be downloaded for use. Typically, when a user calls one of the data loader functions, e.g. load_bikeshare(), the data is automatically downloaded if it's not already on the user's computer. However, for development and testing, or if you know you will be working without internet access, it might be easier to simply download all the data at once.

The data downloader script can be run as follows:

$ python -m yellowbrick.download

This will download all of the data to the fixtures directory inside the Yellowbrick site packages. You can specify the download location either as an argument to the downloader script (use --help for more details) or by setting the $YELLOWBRICK_DATA environment variable. The environment variable is the preferred mechanism because it also influences how data is loaded by Yellowbrick.
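
For example, to see the available options for the downloader:

$ python -m yellowbrick.download --help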

Note

Developers who have downloaded data from Yellowbrick versions earlier than v1.0 may experience problems with the older data format. If this occurs, you can clear your data cache by running python -m yellowbrick.download --cleanup. This will remove the old datasets and download the new ones. You can also use the --no-download flag to simply clear the cache without re-downloading data. Users who are having difficulty with the datasets can also use this flag, or they can uninstall and reinstall Yellowbrick using pip.
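
For reference, the cleanup invocations mentioned above are:

$ python -m yellowbrick.download --cleanup                  # remove old datasets and download new ones
$ python -m yellowbrick.download --cleanup --no-download    # clear the cache without re-downloading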

Once you have downloaded the example datasets, you can load and use them as follows:

from yellowbrick.datasets import load_bikeshare

X, y = load_bikeshare() # returns features and targets for the bikeshare dataset

Each dataset has a README.md with detailed information about the data source, attributes, and target, as well as other metadata. To access the metadata or to control your data access more precisely, you can return the dataset directly from the loader as follows:

dataset = load_bikeshare(return_dataset=True)
print(dataset.README)

df = dataset.to_dataframe()
df.head()

Datasets

Unless otherwise specified, the examples in this documentation use one or more of the following datasets. Here is a complete listing of all datasets in Yellowbrick and the analytical tasks with which they are most commonly associated:

  • Bikeshare: suitable for regression

  • Concrete: suitable for regression

  • Credit: suitable for classification/clustering

  • Energy: suitable for regression

  • Game: suitable for multi-class classification

  • Hobbies: suitable for text analysis/classification

  • Mushroom: suitable for classification/clustering

  • Occupancy: suitable for classification

  • Spam: suitable for binary classification

  • Walking: suitable for time series analysis/clustering

  • NFL: suitable for clustering
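
Each of the datasets above has a corresponding loader function in yellowbrick.datasets; with the exception of the hobbies corpus, which returns a text corpus (see Text Corpora below), every loader follows the same pattern:

from yellowbrick.datasets import load_concrete, load_game

X, y = load_concrete()  # features and target for a regression task
X, y = load_game()      # features and target for a multi-class classification task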

These datasets are included in the Yellowbrick package for demonstration purposes only. They have been repackaged with the permission of the authors or in accordance with the terms of use of the source material. If you use a Yellowbrick-wrangled dataset, please be sure to cite the original author.

API Reference

By default, the dataset loaders return a features table, X, and a target vector, y, when called. If the user has Pandas installed, the data types will be a pd.DataFrame and pd.Series respectively; otherwise the data will be returned as numpy arrays. This functionality ensures that the primary use of the datasets, following along with the documentation examples, is as simple as possible. However, advanced users should note that an underlying object with additional functionality can be accessed as follows:

from yellowbrick.datasets import load_occupancy

dataset = load_occupancy(return_dataset=True)
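
For comparison, calling the loader without return_dataset shows the default behavior described above and is a quick way to check which types you get on your system:

X, y = load_occupancy()
print(type(X), type(y))  # DataFrame and Series if pandas is installed, otherwise numpy arrays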

There are two basic types of dataset: the Dataset, used for tabular data loaded from a CSV, and the Corpus, used to load text corpora from disk. Both types give access to a README file, a citation in BibTeX format, JSON metadata that describes the fields and target, and the data types associated with the underlying dataset. Both objects are also responsible for locating the dataset on disk and downloading it safely if it doesn't yet exist. For more on how Yellowbrick downloads and stores data, please see Local Storage.
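
For example, the dataset object loaded above exposes these resources directly:

print(dataset.README)      # detailed description and attribution
print(dataset.citation)    # BibTeX citation for the original source
print(dataset.meta)        # JSON metadata describing the fields and target
print(dataset.contents())  # the files stored in the data directory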

Tabular Data

Most example datasets are returned as tabular data structures, loaded either from a .csv file (using Pandas) or from a dtype-encoded .npz file to ensure that correct numpy arrays are returned. The Dataset object loads the data from these stored files, preferring to use Pandas if it is installed. It then uses metadata to slice the DataFrame into a feature matrix and target array. Using the dataset directly provides extra functionality; it can be retrieved as follows:

from yellowbrick.datasets import load_concrete
dataset = load_concrete(return_dataset=True)

For example, if you wish to get the raw DataFrame you can do so as follows:

df = dataset.to_dataframe()
df.head()

There may be additional columns in the DataFrame that were part of the original dataset but were excluded from the feature set. For example, the energy dataset contains two targets, the heating load and the cooling load, but only the heating load is returned by default, as the sketch below illustrates.
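
A sketch of inspecting the energy dataset's full table (the exact column names are defined by the dataset's meta.json):

from yellowbrick.datasets import load_energy

dataset = load_energy(return_dataset=True)
print(dataset.meta)  # shows which columns are features and which is the default target

df = dataset.to_dataframe()
df.head()  # includes all original columns, even those excluded from X and y

The API documentation that follows describes in detail the metadata properties and other functionality associated with the Dataset: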

class yellowbrick.datasets.base.Dataset(name, url=None, signature=None, data_home=None)[source]

Bases: BaseDataset

Datasets contain a reference to data on disk and provide utilities for quickly loading files and objects into a variety of formats. The most common use of the Dataset object is to load example datasets provided by Yellowbrick to run the examples in the documentation.

By default the dataset will return the data as a numpy array; however, if Pandas is installed, it is possible to access the data as DataFrame and Series objects. In either case, the data is represented by a features table, X, and a target vector, y.

Parameters
name : str

The name of the dataset; should either be a folder in data home or specified in the yellowbrick.datasets.DATASETS variable. This name is used to perform all lookups and identify the dataset externally.

data_home : str, optional

The path on disk where data is stored. If not passed in, it is looked up from YELLOWBRICK_DATA or the default returned by get_data_home.

url : str, optional

The web location where the archive file of the dataset can be downloaded from.

signature : str, optional

The signature of the data archive file, used to verify that the latest version of the data has been downloaded and that the download hasn't been corrupted or modified in any way.

property README

Returns the contents of the README.md file that describes the dataset in detail and contains attribution information.

property citation

Returns the contents of the citation.bib file that describes the source and provenance of the dataset and can be cited in academic work.

contents()

Contents returns a list of the files in the data directory.

download(replace=False)

Download the dataset from the hosted Yellowbrick data store and save it to the location specified by get_data_home. The downloader verifies the download completed successfully and safely by comparing the expected signature with the SHA-256 signature of the downloaded archive file.

Parameters
replace : bool, default: False

If the data archive already exists, replace the dataset. If this is False and the dataset exists, an exception is raised.

property meta

Returns the contents of the meta.json file that describes important attributes about the dataset and modifies the behavior of the loader.

to_data()[source]

Returns the data contained in the dataset as X and y where X is the features matrix and y is the target vector. If pandas is installed, the data will be returned as DataFrame and Series objects. Otherwise, the data will be returned as two numpy arrays.

Returns
X : array-like with shape (n_instances, n_features)

A pandas DataFrame or numpy array describing the instance features.

y : array-like with shape (n_instances,)

A pandas Series or numpy array describing the target vector.

to_dataframe()[source]

Returns the entire dataset as a single pandas DataFrame.

Returns
df : DataFrame with shape (n_instances, n_columns)

A pandas DataFrame containing the complete original data table, including all targets (specified by the metadata) and all features (including those that might have been filtered out).

to_numpy()[source]

Returns the dataset as two numpy arrays: X and y.

Returns
X : array-like with shape (n_instances, n_features)

A numpy array describing the instance features.

y : array-like with shape (n_instances,)

A numpy array describing the target vector.

to_pandas()[source]

Returns the dataset as two pandas objects: X and y.

Returns
X : DataFrame with shape (n_instances, n_features)

A pandas DataFrame containing feature data and named columns.

y : Series with shape (n_instances,)

A pandas Series containing target data and an index that matches the feature DataFrame index.
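
Putting these accessors together, a short usage sketch with the concrete dataset:

from yellowbrick.datasets import load_concrete

dataset = load_concrete(return_dataset=True)

X, y = dataset.to_data()     # pandas objects if pandas is installed, numpy arrays otherwise
X, y = dataset.to_numpy()    # always numpy arrays
X, y = dataset.to_pandas()   # always pandas objects (requires pandas)
df = dataset.to_dataframe()  # the complete table, including all targets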

Text Corpora

Yellowbrick supports many text-specific machine learning visualizations in the yellowbrick.text module. To facilitate these examples and show an end-to-end visual diagnostics workflow that includes text preprocessing, Yellowbrick supports a Corpus dataset loader that provides access to raw text data from individual documents. It is most notably used with the hobbies corpus, a collection of blog posts on different topics that can be used for text classification tasks.

A text corpus is composed of individual documents that are stored on disk in a directory structure that also identifies document relationships. The file name of each document is a unique file ID (e.g. the MD5 hash of its contents). For example, the hobbies corpus is structured as follows:

data/hobbies
├── README.md
├── books
│   ├── 56d62a53c1808113ffb87f1f.txt
│   └── 5745a9c7c180810be6efd70b.txt
├── cinema
│   ├── 56d629b5c1808113ffb87d8f.txt
│   └── 57408e5fc180810be6e574c8.txt
├── cooking
│   ├── 56d62b25c1808113ffb8813b.txt
│   └── 573f0728c180810be6e2575c.txt
├── gaming
│   ├── 56d62654c1808113ffb87938.txt
│   └── 574585d7c180810be6ef7ffc.txt
└── sports
    ├── 56d62adec1808113ffb88054.txt
    └── 56d70f17c180810560aec345.txt

Unlike the Dataset, corpus dataset loaders do not return X and y specially prepared for machine learning. Instead, these loaders return a Corpus object, which can be used to get a more detailed view of the dataset. For example, to list the unique categories in the corpus, you would access the labels property as follows:

from yellowbrick.datasets import load_hobbies

corpus = load_hobbies()
corpus.labels

Additionally, you can access the list of absolute paths to each file, which allows you to read individual documents or to use scikit-learn utilities that stream the documents one at a time rather than loading them into memory all at once.

with open(corpus.files[8], 'r') as f:
    print(f.read())
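
For instance, scikit-learn's text vectorizers can consume these paths directly, reading each document from disk as it is vectorized rather than holding the whole corpus in memory. A minimal sketch, assuming scikit-learn is installed:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(input='filename')  # read documents from the given paths
docs = vectorizer.fit_transform(corpus.files)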

To get the raw text data and target labels, use the data and target properties.

X, y = corpus.data, corpus.target
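
From here, an end-to-end text classification sketch is straightforward; the vectorizer and model below are illustrative choices, assuming scikit-learn is installed:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X, y)  # X is the list of raw documents, y the corresponding labels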

For more details on the other metadata properties associated with the Corpus, please refer to the API reference below. For more detail on text analytics and machine learning with scikit-learn, please refer to “Working with Text Data” in the scikit-learn documentation.

class yellowbrick.datasets.base.Corpus(name, url=None, signature=None, data_home=None)[source]

Bases: BaseDataset

Corpus datasets contain a reference to documents on disk and provide utilities for quickly loading text data for use in machine learning workflows. The most common use of the corpus is to load the text analysis examples from the Yellowbrick documentation.

Parameters
name : str

The name of the corpus; should either be a folder in data home or specified in the yellowbrick.datasets.DATASETS variable. This name is used to perform all lookups and identify the corpus externally.

data_home : str, optional

The path on disk where data is stored. If not passed in, it is looked up from YELLOWBRICK_DATA or the default returned by get_data_home.

url : str, optional

The web location where the archive file of the corpus can be downloaded from.

signature : str, optional

The signature of the data archive file, used to verify that the latest version of the data has been downloaded and that the download hasn't been corrupted or modified in any way.

property README

Returns the contents of the README.md file that describes the dataset in detail and contains attribution information.

property citation

Returns the contents of the citation.bib file that describes the source and provenance of the dataset and can be cited in academic work.

contents()

Contents returns a list of the files in the data directory.

property data

Read all of the documents from disk into an in-memory list.

download(replace=False)

Download the dataset from the hosted Yellowbrick data store and save it to the location specified by get_data_home. The downloader verifies the download completed successfully and safely by comparing the expected signature with the SHA-256 signature of the downloaded archive file.

Parameters
replace : bool, default: False

If the data archive already exists, replace the dataset. If this is False and the dataset exists, an exception is raised.

property files

Returns the list of file names for all documents.

property labels

Return the unique labels assigned to the documents.

property meta

Returns the contents of the meta.json file that describes important attributes about the dataset and modifies the behavior of the loader.

property root

Discovers and caches the root directory of the corpus.

property target

Returns the label associated with each item in data.

Local Storage

Yellowbrick datasets are stored in a compressed format in the cloud to ensure that the install process is as streamlined and lightweight as possible. When you request a dataset via the loader module, Yellowbrick checks to see if it has been downloaded already, and if not, it downloads it to your local disk.

By default the dataset is stored, uncompressed, in the site-packages folder of your Python installation, alongside the Yellowbrick code. This means that if you install Yellowbrick in multiple virtual environments, the datasets will be downloaded separately into each environment.

To clean up downloaded datasets, you may use the download module as a command line tool. Note, however, that this will only clean up the datasets in the yellowbrick package that is on the Python path of the environment you're currently in.

$ python -m yellowbrick.download --cleanup --no-download

Alternatively, because the data is stored in the same directory as the code, you can simply pip uninstall yellowbrick to clean up the data.

A better option may be to use a single dataset directory across all virtual environments. To specify this directory, set the $YELLOWBRICK_DATA environment variable, usually by adding it to your bash profile so it is exported every time you open a terminal window. This ensures that the data is downloaded only once.

$ export YELLOWBRICK_DATA="~/.yellowbrick"
$ python -m yellowbrick.download -f
$ ls $YELLOWBRICK_DATA

To identify the location that the Yellowbrick datasets are stored for your installation of Python/Yellowbrick, you can use the get_data_home function:

yellowbrick.datasets.path.get_data_home(path=None)[source]

Return the path of the Yellowbrick data directory. This folder is used by dataset loaders to avoid downloading data several times.

By default, this folder is colocated with the code in the install directory so that data shipped with the package can be easily located. Alternatively, it can be set by the $YELLOWBRICK_DATA environment variable, or programmatically by giving a folder path. Note that the '~' symbol is expanded to the user's home directory, and environment variables are also expanded when resolving the path.
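
For example, to print the resolved data directory for the current environment:

from yellowbrick.datasets.path import get_data_home

print(get_data_home())  # resolved from $YELLOWBRICK_DATA or the package default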