The Baleen hobbies corpus contains 448 files in 5 categories.

  • Samples total: 448

  • Features: strings (tokens)

  • Targets: str: {"books", "cinema", "cooking", "gaming", "sports"}

  • Task(s): classification, clustering


The hobbies corpus is a text corpus wrangled from the Baleen RSS Corpus in order to enable students and readers to practice different techniques in Natural Language Processing. For more information see Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning and the associated code repository. It is structured as:

Documents and File Size

  • books: 72 docs (4.1MiB)

  • cinema: 100 docs (9.2MiB)

  • cooking: 30 docs (3.0MiB)

  • gaming: 128 docs (8.8MiB)

  • sports: 118 docs (15.9MiB)

Document Structure


  • 7,420 paragraphs (16.562 mean paragraphs per file)

  • 14,251 sentences (1.921 mean sentences per paragraph)

By Category:

  • books: 844 paragraphs and 2,030 sentences

  • cinema: 1,475 paragraphs and 3,047 sentences

  • cooking: 1,190 paragraphs and 2,425 sentences

  • gaming: 1,802 paragraphs and 3,373 sentences

  • sports: 2,109 paragraphs and 3,376 sentences
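The per-category figures above can be checked against the corpus-wide totals with a few lines of arithmetic. The sketch below only re-adds the numbers printed in this section; the variable names are illustrative and are not part of Yellowbrick.

```python
# Per-category (paragraphs, sentences) counts copied from the list above.
structure = {
    "books": (844, 2030),
    "cinema": (1475, 3047),
    "cooking": (1190, 2425),
    "gaming": (1802, 3373),
    "sports": (2109, 3376),
}
n_files = 448  # total documents in the corpus

# Sum the per-category counts to recover the corpus-wide totals.
total_paragraphs = sum(p for p, _ in structure.values())
total_sentences = sum(s for _, s in structure.values())

# Derive the two means reported under "Document Structure".
paragraphs_per_file = total_paragraphs / n_files
sentences_per_paragraph = total_sentences / total_paragraphs

print(total_paragraphs, total_sentences)
print(f"{paragraphs_per_file:.4f} paragraphs per file")
print(f"{sentences_per_paragraph:.3f} sentences per paragraph")
```

The sums come to 7,420 paragraphs and 14,251 sentences, matching the totals stated above.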

Words and Vocabulary

The corpus contains 288,520 words with a vocabulary of 23,738 distinct terms (12.154 lexical diversity, computed here as words per vocabulary term).

  • books: 41,851 words with a vocabulary size of 7,838

  • cinema: 69,153 words with a vocabulary size of 10,274

  • cooking: 37,854 words with a vocabulary size of 5,038

  • gaming: 70,778 words with a vocabulary size of 9,120

  • sports: 68,884 words with a vocabulary size of 8,028
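As a rough illustration, the lexical diversity figure can be reproduced by dividing the word count by the vocabulary size. The helper below is a hypothetical name used only for this sketch, and the inputs are the counts listed above.

```python
# Per-category (word count, vocabulary size) copied from the list above.
words_and_vocab = {
    "books": (41851, 7838),
    "cinema": (69153, 10274),
    "cooking": (37854, 5038),
    "gaming": (70778, 9120),
    "sports": (68884, 8028),
}


def lexical_diversity(n_words, n_vocab):
    """Words per distinct vocabulary term (hypothetical helper)."""
    return n_words / n_vocab


# Corpus-wide ratio from the totals stated in the text.
corpus_ratio = lexical_diversity(288520, 23738)

# The same ratio for each category.
per_category = {
    name: lexical_diversity(words, vocab)
    for name, (words, vocab) in words_and_vocab.items()
}

total_words = sum(words for words, _ in words_and_vocab.values())
summed_vocab = sum(vocab for _, vocab in words_and_vocab.values())

print(f"corpus: {corpus_ratio:.3f}")
for name, ratio in per_category.items():
    print(f"{name}: {ratio:.3f}")
```

The per-category word counts add up to the corpus total, but the vocabulary sizes do not: categories share many terms, so the summed vocabularies exceed the corpus vocabulary of 23,738.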


The hobbies corpus loader returns a Corpus object containing the raw text of each document. The text must be vectorized into numeric form before it can be used with scikit-learn. For example, you could use the sklearn.feature_extraction.text.TfidfVectorizer as follows:

from yellowbrick.datasets import load_hobbies

from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split as tts

# Load the raw documents and their category labels
corpus = load_hobbies()

# Convert the text to TF-IDF features and the labels to integers
X = TfidfVectorizer().fit_transform(corpus.data)
y = LabelEncoder().fit_transform(corpus.target)

# Hold out 20% of the documents for evaluation
X_train, X_test, y_train, y_test = tts(X, y, test_size=0.2)

# Fit a naive Bayes classifier and report mean accuracy on the held-out set
model = MultinomialNB().fit(X_train, y_train)
model.score(X_test, y_test)

For more detail on text analytics and machine learning with scikit-learn, please refer to “Working with Text Data” in the scikit-learn documentation.
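The example above fits the vectorizer on every document before splitting, so the held-out documents influence the TF-IDF weights. One possible variation, sketched below, wraps the vectorizer and classifier in a scikit-learn pipeline so that both are fit on the training split only. The tiny in-memory corpus is a made-up stand-in so the snippet runs without a download; with the real data you would pass corpus.data and corpus.target instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up stand-in documents; replace with corpus.data / corpus.target.
docs = [
    "the novel has a quiet narrator and a long final chapter",
    "her new book of essays was reviewed this week",
    "the author signed copies of the paperback edition",
    "a library sale offered rare first editions of the novel",
    "the publisher delayed the book until the spring",
    "the team scored twice in the final minutes of the match",
    "the coach praised the goalkeeper after the game",
    "injuries kept two players out of the playoff game",
    "the league announced the schedule for next season",
    "fans filled the stadium for the championship match",
]
labels = ["books"] * 5 + ["sports"] * 5

# Split the raw text first, keeping both categories in each split.
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, stratify=labels, random_state=42
)

# The pipeline fits TF-IDF on the training documents only, then the classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs_train, y_train)

# Mean accuracy on the held-out documents.
score = model.score(docs_test, y_test)
print(score)
```

Because the pipeline accepts raw strings, the same fitted object can later be applied to new documents with model.predict.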


Exported from S3 on: Jan 21, 2017 at 06:42.

Bengfort, Benjamin, Rebecca Bilbro, and Tony Ojeda. Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning. O’Reilly Media, Inc., 2018.



Loads the hobbies text corpus, which is well suited to classification, clustering, and text analysis tasks. The dataset contains 448 documents in 5 categories with 7,420 paragraphs, 14,251 sentences, 288,520 words, and a vocabulary of 23,738.

The Yellowbrick datasets are hosted online; when a dataset is requested, it is downloaded to your local computer for use. If the dataset has not been downloaded before, an Internet connection is required; if it is already cached locally, nothing is downloaded. Yellowbrick checks the downloaded data against the known signature of the dataset to ensure the download completed successfully.
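To make the signature check concrete, here is a minimal sketch of verifying a downloaded file against a known digest. It assumes the signature is a SHA-256 hex digest; the function names are hypothetical and are not Yellowbrick's internal API.

```python
import hashlib


def sha256_signature(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected_signature):
    """True if the file on disk matches the expected signature."""
    return sha256_signature(path) == expected_signature
```

A loader built on this idea would download the archive, call is_intact with the published digest, and raise an error rather than unpack a corrupted or incomplete file.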

Datasets are stored alongside the code, but the location can be specified with the data_home parameter or the $YELLOWBRICK_DATA envvar.

data_home : str, optional

The path on disk where data is stored. If not passed in, it is looked up from $YELLOWBRICK_DATA or the default returned by get_data_home.
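The lookup order described above (explicit argument, then environment variable, then default) can be sketched as follows. This is an illustration only: resolve_data_home and DEFAULT_DATA_HOME are hypothetical names, and the real default is whatever get_data_home returns.

```python
import os

# Hypothetical placeholder; the actual default comes from get_data_home.
DEFAULT_DATA_HOME = os.path.join("~", "yellowbrick_data")


def resolve_data_home(data_home=None, environ=os.environ):
    """Pick the data directory: argument first, then $YELLOWBRICK_DATA, then default."""
    if data_home is None:
        data_home = environ.get("YELLOWBRICK_DATA", DEFAULT_DATA_HOME)
    # Expand a leading "~" so the result is a usable path.
    return os.path.expanduser(data_home)


print(resolve_data_home("/tmp/corpora"))
print(resolve_data_home(None, {"YELLOWBRICK_DATA": "/data/yellowbrick"}))
```

In practice you would pass the directory straight to the loader, for example load_hobbies(data_home="/tmp/corpora"), or export YELLOWBRICK_DATA before starting Python.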


The Yellowbrick Corpus object provides an interface for accessing the text documents and metadata associated with the corpus.
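As a small usage sketch, the labels exposed by the corpus can be tallied to see how many documents fall in each category. The helper name and the sample labels are made up for illustration; the only Corpus attributes relied on are data and target, which the vectorization example above already uses.

```python
from collections import Counter


def documents_per_category(targets):
    """Count how many documents carry each category label (hypothetical helper)."""
    return Counter(targets)


# Stand-in labels so the snippet runs without downloading the corpus.
sample_targets = ["books", "cinema", "cinema", "sports"]
counts = documents_per_category(sample_targets)
print(counts.most_common())

# With the real corpus (requires yellowbrick and a cached or downloadable dataset):
#   from yellowbrick.datasets import load_hobbies
#   corpus = load_hobbies()
#   documents_per_category(corpus.target)
```

Run against the full corpus, the counts should match the per-category document totals listed under "Documents and File Size".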