UMAP Corpus Visualization

UMAP is a nonlinear dimensionality reduction method that is well suited to embedding data in two or three dimensions for visualization as a scatter plot. It is a relatively new technique, but it is very effective for visualizing clusters or groups of data points and their relative proximities. It does a good job of learning the local structure within your data while also attempting to preserve the relationships between groups, as can be seen in its exploration of MNIST. It is fast, scalable, and can be applied directly to sparse matrices, eliminating the need to run TruncatedSVD as a pre-processing step. It also supports a wide variety of distance measures, allowing for easy exploration of your data. For a more detailed explanation of the algorithm, see the UMAP paper.

In this example, we represent each document as a term frequency-inverse document frequency (TF-IDF) vector and then use UMAP to find a low-dimensional representation of these documents. The Yellowbrick visualizer then draws the scatter plot, coloring points by cluster or by class, or leaving them uncolored if only a structural analysis is required.

from yellowbrick.text import UMAPVisualizer
from sklearn.feature_extraction.text import TfidfVectorizer
from yellowbrick.datasets.utils import load_corpus

After importing the required tools, we can load the corpus and vectorize the text using TF-IDF.

# Load the data and create document vectors
corpus = load_corpus('hobbies')
tfidf  = TfidfVectorizer()
docs   = tfidf.fit_transform(corpus.data)
labels = corpus.target
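
To make the TF-IDF representation concrete, the following is a minimal standard-library sketch of the idea. It is not scikit-learn's implementation: TfidfVectorizer applies a smoothed IDF and L2 normalization by default, both of which are omitted here. The toy documents and the helper name tfidf_vectors are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using plain tf * idf."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy corpus, not the hobbies corpus used above.
toy_docs = [
    "the chef cooked pasta",
    "the team won the game",
    "the chef watched the game",
]
toy_vectors = tfidf_vectors(toy_docs)
for doc, vec in zip(toy_docs, toy_vectors):
    print(doc, "->", {t: round(w, 3) for t, w in vec.items()})
```

In this unsmoothed variant, a term that appears in every document (here "the") receives a weight of zero, while terms confined to a single document receive the largest weights.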

Now that the corpus is vectorized we can visualize it, showing the distribution of classes.

umap = UMAPVisualizer()
umap.fit(docs, labels)
umap.poof()

[Figure: UMAP projection of all documents, Euclidean distance]

Alternatively, if we believe that cosine distance is a more appropriate metric for our feature space, we can specify it via the metric parameter, which the UMAPVisualizer passes through to the underlying UMAP function.

umap = UMAPVisualizer(metric='cosine')
umap.fit(docs, labels)
umap.poof()

[Figure: UMAP projection of all documents, cosine distance]
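
One reason the choice of metric can matter for term-weight vectors is that cosine distance ignores vector length, whereas Euclidean distance does not. The standard-library sketch below illustrates this with made-up vectors; the helper names are hypothetical and none of this is part of Yellowbrick or UMAP. It also checks a known identity: on unit-length vectors (TfidfVectorizer L2-normalizes rows by default), squared Euclidean distance equals twice the cosine distance.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Hypothetical term-weight vectors: `long_doc` is `short_doc` scaled by 3,
# so both use the vocabulary in the same proportions.
short_doc = [1.0, 2.0, 0.0]
long_doc = [3.0, 6.0, 0.0]
other_doc = [0.0, 1.0, 2.0]

# Euclidean distance is nonzero because the lengths differ;
# cosine distance is (numerically) zero because the directions agree.
print(euclidean_distance(short_doc, long_doc))
print(cosine_distance(short_doc, long_doc))

# On unit-length vectors the two metrics are tied together:
# squared Euclidean distance equals twice the cosine distance.
u, v = l2_normalize(short_doc), l2_normalize(other_doc)
print(euclidean_distance(u, v) ** 2, 2 * cosine_distance(u, v))
```

On normalized vectors the two metrics therefore rank neighbors identically, although the distance values themselves differ.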

If we omit the target during fit, we can visualize the whole dataset to see if any meaningful patterns are observed.

# Don't color points with their classes
umap = UMAPVisualizer(labels=["documents"], metric='cosine')
umap.fit(docs)
umap.poof()

[Figure: UMAP projection without class labels]

This means we don’t have to use class labels at all. Instead we can use cluster membership from K-Means to label each document. This will allow us to look for clusters of related text by their contents:

# Apply clustering instead of class names.
from sklearn.cluster import KMeans

clusters = KMeans(n_clusters=5)
clusters.fit(docs)

umap = UMAPVisualizer()
umap.fit(docs, ["c{}".format(c) for c in clusters.labels_])
umap.poof()

[Figure: UMAP projection colored by K-Means cluster]

On one hand, these clusters aren’t particularly well concentrated by the two-dimensional UMAP embedding; on the other hand, the true labels for this data are. That is a good indication that your data does indeed live on a manifold in your TF-IDF space and that this structure is being ignored by the K-Means algorithm. Clustering can be quite tricky in high-dimensional spaces, and it is often a good idea to reduce the dimension of your data before running clustering algorithms on it.

UMAP, it should be noted, is a manifold learning technique. As such, it does not seek to preserve the distances between your data points in the high-dimensional space but instead to learn the distances along an underlying manifold on which your data points lie. One shouldn’t be too surprised, then, when it disagrees with a clustering technique that is not manifold-based. A detailed explanation of this phenomenon can be found in the UMAP documentation.
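
The difference between straight-line distance and distance along a manifold can be illustrated with a toy curve. The sketch below is not UMAP's algorithm; it only samples points on a half circle (a hypothetical one-dimensional manifold sitting in two dimensions) and compares the direct distance between the endpoints with the distance accumulated by hopping between neighboring points.

```python
import math

# Sample points along a half circle of radius 1: a one-dimensional
# manifold embedded in two-dimensional space.
n_points = 181
points = [
    (math.cos(math.pi * i / (n_points - 1)),
     math.sin(math.pi * i / (n_points - 1)))
    for i in range(n_points)
]

# The straight-line distance between the two endpoints cuts across the gap.
straight_line = math.dist(points[0], points[-1])

# The distance along the manifold is approximated by summing the hops
# between consecutive neighbors.
along_manifold = sum(
    math.dist(points[i], points[i + 1]) for i in range(n_points - 1)
)

print(round(straight_line, 3))
print(round(along_manifold, 3))
```

The endpoints are about 2 units apart in a straight line but about 3.14 units apart along the curve. A method that measures distance along the curve will therefore place them farther apart than one that measures it directly.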

API Reference

Implements UMAP visualizations of documents in 2D space.

class yellowbrick.text.umap_vis.UMAPVisualizer(ax=None, labels=None, classes=None, colors=None, colormap=None, random_state=None, alpha=0.7, **kwargs)[source]

Bases: yellowbrick.text.base.TextVisualizer

Display a projection of a vectorized corpus in two dimensions using UMAP, a nonlinear dimensionality reduction method that is particularly well suited to embedding in two or three dimensions for visualization as a scatter plot. UMAP is a relatively new technique but is often used to visualize clusters or groups of data points and their relative proximities. It is typically fast and scalable, and it can be applied directly to sparse matrices, eliminating the need to run TruncatedSVD as a pre-processing step.

UMAP will return a scatter plot of the vectorized corpus, such that each point represents a document or utterance. The distance between two points in the visual space reflects the distance between their high-dimensional feature vectors under the chosen metric (Euclidean unless another metric is passed, as noted below). Thus, UMAP shows the clusters of similar documents and the relationships between groups of documents as a scatter plot.

UMAP can be used with either clustering or classification; by specifying the classes argument, points will be colored based on their similar traits. For example, by passing cluster.labels_ as y in fit(), all points in the same cluster will be grouped together. This extends the neighbor embedding with more information about similarity, and can allow better interpretation of both clusters and classes.

The current default for UMAP is Euclidean distance. Hellinger distance would be a more appropriate distance function to use with CountVectorizer data; that will be released in the next version of UMAP. In the meantime, cosine distance is likely a better default for text than Euclidean and can be set using the keyword argument metric='cosine'.

For more, see https://github.com/lmcinnes/umap

Parameters:
ax : matplotlib axes

The axes to plot the figure on.

labels : list of strings

The names of the classes in the target, used to create a legend. Labels must match names of classes in sorted order.

colors : list or tuple of colors

Specify the colors for each individual class

colormap : string or matplotlib cmap

Sequential colormap for continuous target

random_state : int, RandomState instance or None, optional, default: None

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. The random state is applied to the preliminary decomposition as well as UMAP.

alpha : float, default: 0.7

Specify a transparency where 1 is completely opaque and 0 is completely transparent. This property makes densely clustered points more visible.

kwargs : dict

Pass any additional keyword arguments to the UMAP transformer.
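
Because labels must line up with the class names in sorted order, it can help to inspect that order before supplying labels. The sketch below assumes the class names sort as ordinary Python strings, which is an assumption about the ordering rather than something this reference confirms; the cluster ids are made up.

```python
# Hypothetical cluster ids, formatted as "c<id>" strings.
cluster_ids = [0, 2, 10, 1, 10, 2]
names = ["c{}".format(c) for c in cluster_ids]

# Plain string sorting puts "c10" before "c2".
sorted_names = sorted(set(names))
print(sorted_names)

# Zero-padding the id keeps string order and numeric order in agreement.
padded = ["c{:02d}".format(c) for c in cluster_ids]
sorted_padded = sorted(set(padded))
print(sorted_padded)
```

With ten or more clusters, zero-padded names avoid a legend whose entries are out of numeric order.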

Examples

>>> model = UMAPVisualizer(metric='cosine')
>>> model.fit(X)
>>> model.poof()
NULL_CLASS = None
draw(points, target=None, **kwargs)[source]

Called from the fit method, this method draws the UMAP scatter plot, from a set of decomposed points in 2 dimensions. This method also accepts a third dimension, target, which is used to specify the colors of each of the points. If the target is not specified, then the points are plotted as a single cloud to show similar documents.

finalize(**kwargs)[source]

Finalize the drawing by adding a title and legend, and removing the axes objects that do not convey information about UMAP.

fit(X, y=None, **kwargs)[source]

The fit method is the primary drawing input for the UMAP projection since the visualization requires both X and an optional y value. The fit method expects an array of numeric vectors, so text documents must be vectorized before passing them to this method.

Parameters:
X : ndarray or DataFrame of shape n x m

A matrix of n instances with m features representing the corpus of vectorized documents to visualize with UMAP.

y : ndarray or Series of length n

An optional array or series of target or class values for instances. If this is specified, then the points will be colored according to their class. Often cluster labels are passed in to color the documents in cluster space, so this method is used both for classification and clustering methods.

kwargs : dict

Pass generic arguments to the drawing method

Returns:
self : instance

Returns the instance of the transformer/visualizer

make_transformer(umap_kwargs={})[source]

Creates an internal transformer pipeline to project the data set into 2D space using UMAP. This method will reset the transformer on the class.

Returns:
transformer : Pipeline

Pipelined transformer for UMAP projections