UMAP Corpus Visualization
UMAP is a nonlinear
dimensionality reduction method that is well suited to embedding in two
or three dimensions for visualization as a scatter plot. UMAP is a
relatively new technique but is very effective for visualizing clusters or
groups of data points and their relative proximities. It does a good job
of learning the local structure within your data but also attempts to
preserve the relationships between your groups.
It is fast, scalable, and can be applied directly to sparse matrices,
eliminating the need to run
TruncatedSVD as a pre-processing step.
Additionally, it supports a wide variety of distance measures, allowing
for easy exploration of your data. For a more detailed explanation of the algorithm,
see the UMAP paper.
In this example, we represent documents via a term frequency inverse document frequency (TF-IDF) vector, then use UMAP to find a low dimensional representation of these documents. The Yellowbrick visualizer then draws the scatter plot, coloring points by cluster or by class, or not at all if only a structural analysis is required.
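To make the TF-IDF representation concrete, here is a minimal pure-Python sketch of the textbook weighting, tf × log(N / df). It is an illustration only: the tfidf helper and the three-sentence corpus are made up for this example, and scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Return one {term: weight} dict per document using tf * log(N / df).

    Hypothetical helper for illustration; assumes every document is non-empty.
    """
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up corpus: two documents about pets, one about finance.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
]
vectors = tfidf(corpus)
```

A term that occurs in every document, such as "the", receives a weight of zero, while a term confined to one document, such as "market", receives the largest weight. This is what lets the downstream embedding separate documents by their distinctive vocabulary.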
After importing the required tools, we can load the corpus and vectorize the text using TF-IDF. Once the corpus is vectorized we can visualize it, showing the distribution of classes.
    from sklearn.feature_extraction.text import TfidfVectorizer

    from yellowbrick.datasets import load_hobbies
    from yellowbrick.text import UMAPVisualizer

    # Load the text data
    corpus = load_hobbies()

    # Vectorize the documents with TF-IDF
    tfidf = TfidfVectorizer()
    docs = tfidf.fit_transform(corpus.data)
    labels = corpus.target

    # Instantiate the text visualizer, fit it, and draw the plot
    umap = UMAPVisualizer()
    umap.fit(docs, labels)
    umap.poof()
Alternatively, if we believed that cosine distance was a more
appropriate metric on our feature space, we could specify that via the
metric parameter, which is passed through to the underlying UMAP function:
    umap = UMAPVisualizer(metric='cosine')
    umap.fit(docs, labels)
    umap.poof()
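One reason cosine distance can suit text is that it ignores document length. The sketch below compares the two metrics on hypothetical raw term-count vectors; the vocabulary and counts are invented for illustration, and the two distance functions are plain textbook definitions, not UMAP's internal implementations.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical term counts over the vocabulary [ball, game, recipe].
short_sports = [1, 2, 0]
long_sports = [10, 20, 0]   # same topic, ten times longer
short_cooking = [0, 1, 2]

print(euclidean(short_sports, long_sports), euclidean(short_sports, short_cooking))
print(cosine_distance(short_sports, long_sports), cosine_distance(short_sports, short_cooking))
```

Under Euclidean distance the short sports document sits closer to the cooking document than to the longer sports document, whereas cosine distance places the two sports documents at (numerically) zero distance from each other.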
If we omit the target during fit, we can visualize the whole dataset to see if any meaningful patterns are observed.
This means we don’t have to use class labels at all. Instead, we can use cluster membership from K-Means to label each document. This will allow us to look for clusters of related text by their contents:
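A sketch of this clustering workflow follows. It assumes scikit-learn is installed; cluster_documents and plot_clusters are hypothetical helper names introduced here, the six sentences are made up, and the cluster count and random seed are arbitrary choices. The plotting helper uses the same UMAPVisualizer calls as the examples above and is not invoked on the toy data.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_documents(texts, n_clusters=5, random_state=42):
    """Vectorize raw texts with TF-IDF and label each with its K-Means cluster."""
    docs = TfidfVectorizer().fit_transform(texts)
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    clusters.fit(docs)
    # Prefix the integer cluster ids so the legend reads "c0", "c1", ...
    labels = ["c{}".format(c) for c in clusters.labels_]
    return docs, labels


def plot_clusters(docs, labels, metric="cosine"):
    """Project docs with UMAP and color each point by its cluster label."""
    # Imported lazily so cluster_documents works without Yellowbrick installed.
    from yellowbrick.text import UMAPVisualizer

    umap = UMAPVisualizer(metric=metric)
    umap.fit(docs, labels)
    umap.poof()  # later Yellowbrick releases rename this method to show()
    return umap


# Tiny made-up corpus, used only to show the shape of the outputs.
texts = [
    "the team won the football game",
    "the striker scored a late goal",
    "bake the bread in a hot oven",
    "knead the dough and let it rise",
    "the goalkeeper saved the penalty",
    "season the soup with fresh herbs",
]
docs, labels = cluster_documents(texts, n_clusters=2)
```

On the full corpus loaded earlier, calling plot_clusters(*cluster_documents(corpus.data)) would draw the projection colored by cluster membership rather than by class.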
On one hand, these clusters aren’t particularly well concentrated by the two dimensional embedding of UMAP, while on the other hand, the true labels for this data are. That is a good indication that your data does indeed live on a manifold in your TF-IDF space and that this structure is being ignored by the K-Means algorithm. Clustering can be quite tricky in high dimensional spaces, and it is often a good idea to reduce the dimensionality of your data before running clustering algorithms on it.
UMAP, it should be noted, is a manifold learning technique and as such does not seek to preserve the distances between your data points in the high dimensional space but instead to learn the distances along an underlying manifold on which your data points lie. As such, one shouldn’t be too surprised when it disagrees with a non-manifold based clustering technique. A detailed explanation of this phenomenon can be found in the UMAP documentation.
Implements UMAP visualizations of documents in 2D space.
UMAPVisualizer(ax=None, labels=None, classes=None, colors=None, colormap=None, random_state=None, alpha=0.7, **kwargs)
Display a projection of a vectorized corpus in two dimensions using UMAP, a nonlinear dimensionality reduction method that is particularly well suited to embedding in two or three dimensions for visualization as a scatter plot. UMAP is a relatively new technique but is often used to visualize clusters or groups of data points and their relative proximities. It is typically fast and scalable, and can be applied directly to sparse matrices, eliminating the need to run TruncatedSVD as a pre-processing step.
UMAP will return a scatter plot of the vectorized corpus, such that each point represents a document or utterance. The distance between two points in the visual space reflects the distance between their high dimensional feature vectors under the chosen metric, which is Euclidean unless a different metric is specified. Thus, UMAP shows the clusters of similar documents and the relationships between groups of documents as a scatter plot.
UMAP can be used with either clustering or classification; by specifying the
classes argument, points will be colored based on their similar traits. For example, by passing cluster labels as y to
fit(), all points in the same cluster will be grouped together. This extends the neighbor embedding with more information about similarity, and can allow better interpretation of both clusters and classes.
The current default for UMAP is Euclidean distance. Hellinger distance would be a more appropriate distance function to use with CountVectorizer data; support for it will be released in the next version of UMAP. In the meantime, cosine distance is likely a better default for text than Euclidean and can be set using the keyword argument metric='cosine'.
For more, see https://github.com/lmcinnes/umap
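For readers unfamiliar with the Hellinger distance mentioned in the note above, the following sketch implements its standard definition for discrete distributions, H(p, q) = sqrt(sum((sqrt(p_i) - sqrt(q_i))^2) / 2), after normalizing each count vector to sum to one. The hellinger function and the count vectors are hypothetical, and UMAP's own implementation may differ in detail.

```python
import math


def hellinger(counts_a, counts_b):
    """Hellinger distance between two count vectors, treated as distributions.

    Assumes both vectors are the same length, non-negative, and not all zero.
    """
    total_a, total_b = sum(counts_a), sum(counts_b)
    squared = sum(
        (math.sqrt(a / total_a) - math.sqrt(b / total_b)) ** 2
        for a, b in zip(counts_a, counts_b)
    )
    return math.sqrt(squared / 2.0)


# Hypothetical word counts over a shared four-term vocabulary.
doc_a = [4, 2, 0, 0]
doc_b = [8, 4, 0, 0]   # same proportions as doc_a, twice as long
doc_c = [0, 0, 3, 3]   # shares no terms with doc_a

print(hellinger(doc_a, doc_b), hellinger(doc_a, doc_c))
```

Because the counts are normalized first, two documents with the same word proportions are at distance zero regardless of length, and documents with no terms in common are at the maximum distance of one. That behavior is why the measure is a natural fit for raw count data.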
- ax : matplotlib axes
The axes to plot the figure on.
- labels : list of strings
The names of the classes in the target, used to create a legend. Labels must match names of classes in sorted order.
- colors : list or tuple of colors
Specify the colors for each individual class
- colormap : string or matplotlib cmap
Sequential colormap for continuous target
- random_state : int, RandomState instance or None, optional, default: None
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. The random state is applied to the preliminary decomposition as well as UMAP.
- alpha : float, default: 0.7
Specify a transparency where 1 is completely opaque and 0 is completely transparent. This property makes densely clustered points more visible.
- kwargs : dict
Pass any additional keyword arguments to the UMAP transformer.
>>> model = UMAPVisualizer(metric='cosine')
>>> model.fit(X)
>>> model.poof()
draw(self, points, target=None, **kwargs)
Called from the fit method, this method draws the UMAP scatter plot, from a set of decomposed points in 2 dimensions. This method also accepts a third dimension, target, which is used to specify the colors of each of the points. If the target is not specified, then the points are plotted as a single cloud to show similar documents.
Finalize the drawing by adding a title and legend, and removing the axes objects that do not convey information about UMAP.
fit(self, X, y=None, **kwargs)
The fit method is the primary drawing input for the UMAP projection since the visualization requires both X and an optional y value. The fit method expects an array of numeric vectors, so text documents must be vectorized before passing them to this method.
- X : ndarray or DataFrame of shape n x m
A matrix of n instances with m features representing the corpus of vectorized documents to visualize with UMAP.
- y : ndarray or Series of length n
An optional array or series of target or class values for instances. If this is specified, then the points will be colored according to their class. Often cluster labels are passed in to color the documents in cluster space, so this method is used both for classification and clustering methods.
- kwargs : dict
Pass generic arguments to the drawing method
- self : instance
Returns the instance of the transformer/visualizer
Creates an internal transformer pipeline to project the data set into 2D space using UMAP. This method will reset the transformer on the class.
- transformer : Pipeline
Pipelined transformer for UMAP projections