yellowbrick.text package
Submodules
yellowbrick.text.base module
Base classes for text feature visualizers and text feature selection tools.

class yellowbrick.text.base.TextVisualizer(ax=None, **kwargs)
Bases: yellowbrick.base.Visualizer, sklearn.base.TransformerMixin
Base class for text feature visualization to investigate documents individually or as a full corpus.
TextVisualizers are used after a text corpus has been transformed in some way (e.g. normalized through stemming or lemmatization, filtered by stopword removal, or vectorized). A TextVisualizer is therefore itself a transformer and can be used in a Scikit-Learn Pipeline to perform automatic visual analysis during the build.
Accepts as input a DataFrame or Numpy array.

fit(X, y=None, **fit_params)
This method performs preliminary computations in order to set up the figure, compute statistics, or perform other analyses. It can also call drawing methods in order to set up various non-instance-related figure elements.
Parameters: X : ndarray or DataFrame of shape n x m
A matrix of n instances with m features
y : ndarray or Series of length n
An array or series of target or class values
fit_params: dict
keyword arguments for parameter fitting.
Returns: self : instance
Returns the instance of the transformer/visualizer

fit_transform_poof(X, y=None, **kwargs)
Fit to data, transform it, then visualize it.
Fits the text visualizer to X and y with optional parameters by passing in all of kwargs, then calls poof with the same kwargs. This method must return the result of the transform method.
Parameters: X : ndarray or DataFrame of shape n x m
A matrix of n instances with m features
y : ndarray or Series of length n
An array or series of target or class values
kwargs : dict
Pass generic arguments to the drawing method
Returns: X : numpy array
This method must return a numpy array with the same shape as X.

transform(X)
Primarily a passthrough to ensure that the text visualizer will work in a pipeline setting. This method can also call drawing methods in order to ensure that the visualization is constructed.
Returns: X : numpy array
This method must return a numpy array with the same shape as X.
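
The fit/transform/poof contract described above can be illustrated with a minimal sketch. The class below is a hypothetical stand-in written in plain Python, not yellowbrick's implementation; the `drawn` attribute and the stubbed `poof` body are assumptions made only so the flow can be observed.

```python
class PassthroughVisualizer:
    """Hypothetical stand-in illustrating the fit/transform/poof contract."""

    def __init__(self):
        self.drawn = False  # assumption: a flag standing in for a rendered figure

    def fit(self, X, y=None, **fit_params):
        # A real visualizer would compute statistics or draw here.
        return self

    def transform(self, X):
        # Return the data unchanged so later pipeline steps see the same X.
        return X

    def poof(self, **kwargs):
        # Stand-in for rendering the figure.
        self.drawn = True

    def fit_transform_poof(self, X, y=None, **kwargs):
        # Fit, transform, then visualize; the transformed data is returned.
        Xp = self.fit(X, y, **kwargs).transform(X)
        self.poof(**kwargs)
        return Xp


X = [[1, 0, 2], [0, 3, 1]]
viz = PassthroughVisualizer()
result = viz.fit_transform_poof(X)
```

Because `transform` hands back its input untouched, a visualizer of this shape can sit between two pipeline steps without altering the data that flows through.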

yellowbrick.text.tsne module
Implements TSNE visualizations of documents in 2D space.

class yellowbrick.text.tsne.TSNEVisualizer(ax=None, decompose='svd', decompose_by=50, classes=None, colors=None, colormap=None, **kwargs)
Bases: yellowbrick.text.base.TextVisualizer
Display a projection of a vectorized corpus in two dimensions using TSNE, a nonlinear dimensionality reduction method that is particularly well suited to embedding in two or three dimensions for visualization as a scatter plot. TSNE is widely used in text analysis to show clusters or groups of documents or utterances and their relative proximities.
TSNE will return a scatter plot of the vectorized corpus, such that each point represents a document or utterance. The distance between two points in the visual space is embedded using the probability distribution of pairwise similarities in the higher dimensionality; thus TSNE shows clusters of similar documents and the relationships between groups of documents as a scatter plot.
TSNE can be used with either clustering or classification; by specifying the classes argument, points will be colored based on their similar traits. For example, by passing cluster.labels_ as y in fit(), all points in the same cluster will be grouped together. This extends the neighbor embedding with more information about similarity, and can allow better interpretation of both clusters and classes.
For more, see https://lvdmaaten.github.io/tsne/
Parameters: ax : matplotlib axes
The axes to plot the figure on.
decompose : string or None
A preliminary decomposition is often used prior to TSNE to make the projection faster. Specify “svd” for sparse data or “pca” for dense data. If decompose is None, the original data set will be used.
decompose_by : int
Specify the number of components for preliminary decomposition, by default this is 50; the more components, the slower TSNE will be.
classes : list of strings
The names of the classes in the target, used to create a legend.
colors : list or tuple of colors
Specify the colors for each individual class
colormap : string or matplotlib cmap
Sequential colormap for continuous target
kwargs : dict
Pass any additional keyword arguments to the TSNE transformer.
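
As a rough sketch of the clustering case mentioned above, the snippet below produces cluster labels of the kind that would be passed as y. It assumes scikit-learn is available; the toy corpus, the choice of KMeans, and the variable names are illustrative assumptions. The visualizer call itself is shown only as a comment, since it needs yellowbrick installed and a realistically sized corpus.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical); a real corpus would be much larger.
corpus = [
    "the cat sat on the mat",
    "a cat and a kitten played",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
]

# Vectorize first: the visualizer expects numeric vectors, not raw text.
docs = TfidfVectorizer().fit_transform(corpus)

# Cluster the vectors; labels_ holds one integer cluster id per document.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit(docs)
labels = clusters.labels_

# With yellowbrick installed and a larger corpus, the labels would be
# passed as y so that each point is colored by its cluster, e.g.:
#   TSNEVisualizer().fit(docs, labels).poof()
```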

draw(points, target=None, **kwargs)
Called from the fit method, this method draws the TSNE scatter plot, from a set of decomposed points in 2 dimensions. This method also accepts a third dimension, target, which is used to specify the colors of each of the points. If the target is not specified, then the points are plotted as a single cloud to show similar documents.

finalize(**kwargs)
Finalize the drawing by adding a title and legend, and removing the axes objects that do not convey information about TSNE.

fit(X, y=None, **kwargs)
The fit method is the primary drawing input for the TSNE projection since the visualization requires both X and an optional y value. The fit method expects an array of numeric vectors, so text documents must be vectorized before passing them to this method.
Parameters: X : ndarray or DataFrame of shape n x m
A matrix of n instances with m features representing the corpus of vectorized documents to visualize with tsne.
y : ndarray or Series of length n
An optional array or series of target or class values for instances. If this is specified, then the points will be colored according to their class. Often cluster labels are passed in to color the documents in cluster space, so this method is used both for classification and clustering methods.
kwargs : dict
Pass generic arguments to the drawing method
Returns: self : instance
Returns the instance of the transformer/visualizer

make_transformer(decompose='svd', decompose_by=50, tsne_kwargs={})
Creates an internal transformer pipeline to project the data set into 2D space using TSNE, applying a pre-decomposition technique ahead of embedding if necessary. This method will reset the transformer on the class, and can be used to explore different decompositions.
Parameters: decompose : string or None
A preliminary decomposition is often used prior to TSNE to make the projection faster. Specify “svd” for sparse data or “pca” for dense data. If decompose is None, the original data set will be used.
decompose_by : int
Specify the number of components for preliminary decomposition, by default this is 50; the more components, the slower TSNE will be.
Returns: transformer : Pipeline
Pipelined transformer for TSNE projections
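
The idea of a decomposition step followed by a TSNE embedding can be sketched with scikit-learn directly. This is a minimal illustration under stated assumptions, not necessarily how yellowbrick builds its internal pipeline: the helper name `build_projection`, the step names, and the random stand-in data are all hypothetical.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.pipeline import Pipeline


def build_projection(decompose_by=50, **tsne_kwargs):
    """Hypothetical helper: SVD pre-decomposition, then a 2D TSNE embedding."""
    return Pipeline([
        # Reduce the feature space first so TSNE has less work to do.
        ("svd", TruncatedSVD(n_components=decompose_by, random_state=0)),
        # Embed the reduced vectors in two dimensions.
        ("tsne", TSNE(n_components=2, **tsne_kwargs)),
    ])


# Stand-in for a vectorized corpus: 60 documents, 100 features.
rng = np.random.RandomState(0)
X = rng.rand(60, 100)

projection = build_projection(decompose_by=20, perplexity=10, random_state=0)
points = projection.fit_transform(X)  # one (x, y) pair per document
```

Fewer components in the first step make the embedding faster at the cost of discarding more of the original variance, which is the trade-off the decompose_by parameter controls.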


yellowbrick.text.tsne.tsne(X, y=None, ax=None, decompose='svd', decompose_by=50, classes=None, colors=None, colormap=None, **kwargs)
Display a projection of a vectorized corpus in two dimensions using TSNE, a nonlinear dimensionality reduction method that is particularly well suited to embedding in two or three dimensions for visualization as a scatter plot. TSNE is widely used in text analysis to show clusters or groups of documents or utterances and their relative proximities.
Parameters: X : ndarray or DataFrame of shape n x m
A matrix of n instances with m features representing the corpus of vectorized documents to visualize with tsne.
y : ndarray or Series of length n
An optional array or series of target or class values for instances. If this is specified, then the points will be colored according to their class. Often cluster labels are passed in to color the documents in cluster space, so this method is used both for classification and clustering methods.
ax : matplotlib axes
The axes to plot the figure on.
decompose : string or None
A preliminary decomposition is often used prior to TSNE to make the projection faster. Specify “svd” for sparse data or “pca” for dense data. If decompose is None, the original data set will be used.
decompose_by : int
Specify the number of components for preliminary decomposition, by default this is 50; the more components, the slower TSNE will be.
classes : list of strings
The names of the classes in the target, used to create a legend.
colors : list or tuple of colors
Specify the colors for each individual class
colormap : string or matplotlib cmap
Sequential colormap for continuous target
kwargs : dict
Pass any additional keyword arguments to the TSNE transformer.
Returns: ax : matplotlib axes
Returns the axes that the TSNE projection was drawn on.

yellowbrick.text.freqdist module
Implementations of frequency distributions for text visualization

class yellowbrick.text.freqdist.FreqDistVisualizer(ax=None, color=None, N=50, **kwargs)
Bases: yellowbrick.text.base.TextVisualizer
A frequency distribution tells us the frequency of each vocabulary item in the text. In general, it could count any kind of observable event. It is a distribution because it tells us how the total number of word tokens in the text are distributed across the vocabulary items.
Parameters: ax : matplotlib axes
The axes to plot the figure on.
color : list or tuple of colors
Specify color for bars
N: integer
Top N tokens to be plotted.
kwargs : dict
Pass any additional keyword arguments to the super class.
These parameters can be influenced later on in the visualization
process, but can and should be set as early as possible.

draw(**kwargs)
Called from the fit method, this method creates the canvas and draws the distribution plot on it.
Parameters: kwargs: generic keyword arguments.

finalize(**kwargs)
The finalize method executes any subclass-specific axes finalization steps. The user calls poof, and poof calls finalize.
Parameters: kwargs: generic keyword arguments.

fit(docs, features)
The fit method is the primary drawing input for the frequency distribution visualization. It requires vectorized lists of documents and a list of features, which are the actual words from the original corpus (needed to label the x-axis ticks).
Parameters: docs : ndarray or DataFrame of shape n x m
A matrix of n instances with m features representing the corpus of vectorized documents.
features : list
List of corpus vocabulary words
Text documents must be vectorized before passing to fit()
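
The computation behind a top-N frequency distribution can be sketched in plain Python. The helper `top_tokens` and the toy count matrix are hypothetical and only illustrate the idea of summing each vocabulary column and keeping the most frequent tokens; this is not yellowbrick's code.

```python
def top_tokens(docs, features, n=50):
    """Hypothetical helper: total each column of a document-term count
    matrix and return the n most frequent (token, count) pairs."""
    # zip(*docs) iterates over columns, i.e. one vocabulary item at a time.
    totals = [sum(column) for column in zip(*docs)]
    ranked = sorted(zip(features, totals), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Three documents vectorized as counts over a three-word vocabulary.
features = ["cat", "dog", "fish"]
docs = [
    [2, 0, 1],
    [1, 3, 0],
    [4, 1, 0],
]

top = top_tokens(docs, features, n=2)
```

The features list must be in the same column order as the vectorized matrix, which is why fit takes both arguments together.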


yellowbrick.text.freqdist.freqdist(X, y=None, ax=None, color=None, N=50, **kwargs)
Displays frequency distribution plot for text.
This helper function is a quick wrapper to utilize the FreqDistVisualizer (Transformer) for one-off analysis.
Parameters: X: ndarray or DataFrame of shape n x m
A matrix of n instances with m features. In the case of text, X is a list of lists of already preprocessed words.
y: ndarray or Series of length n
An array or series of target or class values
ax: matplotlib axes
The axes to plot the figure on.
color: string
Specify color for barchart
N: integer
Top N tokens to be plotted.
kwargs: dict
Keyword arguments passed to the super class.
Returns: ax: matplotlib axes
Returns the axes that the plot was drawn on.