Spam

Classifying Email as Spam or Non-Spam.

Samples total	4601
Dimensionality	57
Features	real, integer
Targets	int: {1 for spam, 0 for not spam}
Task(s)	classification

Description

The “spam” concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography…

Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word ‘george’ and the area code ‘650’ are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.

Determine whether a given email is spam or not.

Typical performance is ~7% misclassification error. False positives (marking good mail as spam) are very undesirable. If we insist on zero false positives in the training/testing set, 20-25% of the spam passes through the filter.
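The trade-off described above can be illustrated with a minimal sketch. It assumes a classifier that outputs a spam score per message and flags a message when its score is strictly above a threshold. The scores, labels, and helper names below are made up for illustration and are not taken from the dataset or from Yellowbrick.

```python
def zero_false_positive_threshold(scores, labels):
    """Return the lowest threshold at which no non-spam message is flagged.

    A message is flagged as spam when its score is strictly greater than
    the threshold, so the highest non-spam score is the answer.
    """
    ham_scores = [s for s, y in zip(scores, labels) if y == 0]
    return max(ham_scores)


def spam_pass_rate(scores, labels, threshold):
    """Return the fraction of spam messages that are NOT flagged."""
    spam_scores = [s for s, y in zip(scores, labels) if y == 1]
    passed = sum(1 for s in spam_scores if s <= threshold)
    return passed / len(spam_scores)


# Hypothetical classifier scores (1 = spam, 0 = not spam).
scores = [0.05, 0.20, 0.60, 0.40, 0.70, 0.90, 0.95, 0.55]
labels = [0, 0, 0, 1, 1, 1, 1, 1]

threshold = zero_false_positive_threshold(scores, labels)
rate = spam_pass_rate(scores, labels, threshold)
print(f"threshold={threshold}, spam passing the filter={rate:.0%}")
```

Raising the threshold until no good mail is flagged necessarily lets the lower-scoring spam through, which is why a zero-false-positive filter misses a noticeable share of spam.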

Citation

Downloaded from the UCI Machine Learning Repository on March 23, 2018.

Cranor, Lorrie Faith, and Brian A. LaMacchia. “Spam!.” Communications of the ACM 41.8 (1998): 74-83.

Loader

yellowbrick.datasets.loaders.load_spam(data_home=None, return_dataset=False)

Loads the email spam dataset that is well suited to binary classification and threshold tasks. The dataset contains 4601 instances with 57 integer and real valued attributes and a discrete target.

The Yellowbrick datasets are hosted online; when one is requested, it is downloaded to your local computer for use. If the dataset has not been downloaded before, an Internet connection is required; if it is already cached locally, nothing is downloaded. Yellowbrick checks the downloaded data against the dataset's known signature to ensure the download completed successfully.

Datasets are stored alongside the code, but the location can be specified with the data_home parameter or the $YELLOWBRICK_DATA environment variable.
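The lookup order for the data directory (explicit argument first, then the environment variable, then a default) might be sketched as follows. This is not Yellowbrick's implementation: `resolve_data_home` and the `~/yellowbrick_data` fallback are hypothetical names chosen only to illustrate the precedence.

```python
import os


def resolve_data_home(data_home=None, default="~/yellowbrick_data"):
    """Pick a data directory: explicit argument, then env var, then default.

    Hypothetical helper; the default path is an assumption for illustration.
    """
    if data_home is None:
        # Fall back to $YELLOWBRICK_DATA, and to the default if it is unset.
        data_home = os.environ.get("YELLOWBRICK_DATA", default)
    return os.path.abspath(os.path.expanduser(data_home))
```

With this ordering, a path passed directly always wins over the environment variable, so a single call can be redirected without changing the shell environment.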

Parameters

data_home : str, optional
    The path on disk where data is stored. If not passed in, it is looked up from $YELLOWBRICK_DATA or the default returned by get_data_home.
return_dataset : bool, default=False
    Return the raw dataset object instead of X and y numpy arrays to get access to alternative targets, extra features, content and meta.

Returns

X : array-like with shape (n_instances, n_features) if return_dataset=False
    A pandas DataFrame or numpy array describing the instance features.
y : array-like with shape (n_instances,) if return_dataset=False
    A pandas Series or numpy array describing the target vector.
dataset : Dataset instance if return_dataset=True
    The Yellowbrick Dataset object provides an interface to accessing the data in a variety of formats as well as associated metadata and content.
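A short usage sketch of the loader, assuming Yellowbrick is installed and the dataset can be downloaded or is already cached. The wrapper names `load_spam_arrays`, `load_spam_dataset`, and `class_balance` are hypothetical helpers added here for illustration; only `load_spam` and its two parameters come from the documentation above.

```python
from collections import Counter


def load_spam_arrays(data_home=None):
    """Return (X, y) using the default return_dataset=False behaviour."""
    # Imported lazily so the helpers below work without yellowbrick installed.
    from yellowbrick.datasets.loaders import load_spam

    return load_spam(data_home=data_home)


def load_spam_dataset(data_home=None):
    """Return the raw Dataset object instead of X and y."""
    from yellowbrick.datasets.loaders import load_spam

    return load_spam(data_home=data_home, return_dataset=True)


def class_balance(y):
    """Count spam (1) and not-spam (0) labels in a target vector."""
    counts = Counter(int(label) for label in y)
    return {"not_spam": counts.get(0, 0), "spam": counts.get(1, 0)}
```

Calling `X, y = load_spam_arrays()` and then `class_balance(y)` gives a quick check of how the two classes are distributed before fitting a classifier or choosing a decision threshold.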