As in the previous sections, Yellowbrick has provided a sample dataset to run the following cells. In particular, we are going to use a text corpus wrangled from the Baleen RSS Corpus to present the following examples. If you haven’t already downloaded the data, you can do so by running:

$ python -m yellowbrick.download


Note that this will create a directory called data in your current working directory that contains subdirectories with the provided datasets.

Note

If you’ve already followed the instructions for downloading the example datasets, you don’t have to repeat these steps here. Simply check that your data directory contains a subdirectory called hobbies.
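The check described in the note can also be done from Python. The sketch below assumes the default layout created by the download step (a data directory in the current working directory). The helper name has_dataset is hypothetical and is not part of Yellowbrick.

```python
import os


def has_dataset(name, data_dir="data"):
    """Return True if data_dir contains a subdirectory called name."""
    # has_dataset is a hypothetical helper for illustration,
    # not a Yellowbrick function.
    return os.path.isdir(os.path.join(data_dir, name))


if has_dataset("hobbies"):
    print("hobbies corpus found, no need to download again")
else:
    print("hobbies corpus missing, run: python -m yellowbrick.download")
```

If the data was downloaded somewhere other than the current working directory, pass that location as data_dir.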

The following code snippet creates a utility that will load the corpus from disk into a scikit-learn Bunch object. The resulting corpus has the same structure as the one used in the “working with text data” example on the scikit-learn website, which should make the examples easier to follow.

import os
from sklearn.utils import Bunch  # sklearn.datasets.base is gone from recent scikit-learn releases

def load_corpus(path):
    """
    Loads and wrangles the passed in text corpus by path.
    """

    # Check that the data exists; otherwise raise with a hint to download it
    if not os.path.exists(path):
        raise ValueError((
            "'{}' dataset has not been downloaded, "
            "use the yellowbrick.download module to fetch datasets"
        ).format(path))

    # Read the directories in the directory as the categories.
    categories = [
        cat for cat in os.listdir(path)
        if os.path.isdir(os.path.join(path, cat))
    ]

    files  = [] # holds the file names relative to the root
    data   = [] # holds the text read from the file
    target = [] # holds the string of the category

    # Load the data from the files in the corpus
    for cat in categories:
        for name in os.listdir(os.path.join(path, cat)):
            files.append(os.path.join(path, cat, name))
            target.append(cat)

            # Read the whole document into memory as a single string
            with open(os.path.join(path, cat, name), 'r') as f:
                data.append(f.read())

    # Return the data bunch for use similar to the newsgroups example
    return Bunch(
        categories=categories,
        files=files,
        data=data,
        target=target,
    )


This is a fairly long bit of code, so let’s walk through it step by step. The data in the corpus directory is stored as follows:

data/hobbies
├── books
│   ├── 56d62a53c1808113ffb87f1f.txt
│   └── 5745a9c7c180810be6efd70b.txt
├── cinema
│   ├── 56d629b5c1808113ffb87d8f.txt
│   └── 57408e5fc180810be6e574c8.txt
├── cooking
│   ├── 56d62b25c1808113ffb8813b.txt
│   └── 573f0728c180810be6e2575c.txt
├── gaming
│   ├── 56d62654c1808113ffb87938.txt
│   └── 574585d7c180810be6ef7ffc.txt
└── sports
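Because each subdirectory name doubles as the category label, the layout can be summarized with a short directory walk. The following is a minimal sketch rather than anything shipped with Yellowbrick: the function name count_documents is hypothetical, and it assumes each category directory holds one .txt file per document, as in the tree above.

```python
import os


def count_documents(path):
    """Map each category directory under path to its number of .txt files."""
    # count_documents is a hypothetical helper, not a Yellowbrick function.
    counts = {}
    for cat in sorted(os.listdir(path)):
        cat_dir = os.path.join(path, cat)
        if not os.path.isdir(cat_dir):
            continue  # skip stray files at the top level of the corpus
        counts[cat] = sum(
            1 for name in os.listdir(cat_dir) if name.endswith(".txt")
        )
    return counts


# Only report on the real corpus if it has been downloaded.
if os.path.isdir("data/hobbies"):
    for cat, n in count_documents("data/hobbies").items():
        print("{}: {} documents".format(cat, n))
```

A category that reports zero documents usually points to an incomplete download.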

To load the corpus into memory, pass the path of the hobbies directory to the utility:

corpus = load_corpus("data/hobbies")
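Once loaded, the corpus can be sanity-checked before it is handed to a visualizer. The sketch below is an illustration under a few assumptions: summarize is a hypothetical helper, and a types.SimpleNamespace with invented values stands in for the real Bunch so the example runs without the downloaded data. Both objects expose files, data, and target as attributes, so the same function should work on the result of load_corpus.

```python
from collections import Counter
from types import SimpleNamespace


def summarize(corpus):
    """Check that the parallel lists line up and count documents per label."""
    # summarize is a hypothetical helper, not part of Yellowbrick.
    if not (len(corpus.files) == len(corpus.data) == len(corpus.target)):
        raise ValueError("files, data, and target must have the same length")
    return Counter(corpus.target)


# Stand-in for the loaded corpus; the values are invented for illustration.
toy = SimpleNamespace(
    categories=["books", "sports"],
    files=["books/a.txt", "books/b.txt", "sports/c.txt"],
    data=["a novel review", "a poetry review", "a match report"],
    target=["books", "books", "sports"],
)

# Print the number of documents per category.
for label, n in sorted(summarize(toy).items()):
    print("{}: {}".format(label, n))
```

A heavily skewed count is worth noticing early, since class imbalance affects how the later text visualizations read.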