Word Correlation Plot

Word correlation illustrates the extent to which words or phrases co-occur across the documents in a corpus. This can be useful for understanding the relationships between known text features in a corpus with many documents. WordCorrelationPlot visualizes the document occurrence correlations between selected words in a corpus. For n features, the plot renders an n x n heatmap of correlation values.

The correlation values are computed using the phi coefficient, a measure of the association between two binary variables. A value close to 1 or -1 indicates that the occurrences of the two features are strongly positively or negatively correlated, respectively, while a value close to 0 indicates little or no relationship between them.
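As a rough illustration of the metric, the sketch below computes the phi coefficient for two words from their per-document presence indicators (1 if the word occurs in a document, 0 otherwise). The function name phi_coefficient and the toy indicator lists are assumptions made for this example and are not part of Yellowbrick.

```python
import math


def phi_coefficient(x, y):
    """Phi coefficient between two equal-length sequences of 0/1 indicators."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    # Cells of the 2x2 contingency table over documents.
    n11 = sum(1 for a, b in zip(x, y) if a and b)          # both words present
    n10 = sum(1 for a, b in zip(x, y) if a and not b)      # only the first word
    n01 = sum(1 for a, b in zip(x, y) if not a and b)      # only the second word
    n00 = sum(1 for a, b in zip(x, y) if not a and not b)  # neither word

    # Product of the four marginal totals (row sums and column sums).
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # A word found in every document or in none has no variance.
        return float("nan")
    return (n11 * n00 - n10 * n01) / denom


# Toy presence indicators for three words across six documents.
game = [1, 1, 1, 0, 0, 0]
play = [1, 1, 0, 0, 0, 0]
oil = [0, 0, 0, 1, 1, 1]

print(phi_coefficient(game, play))  # about 0.707: the words tend to co-occur
print(phi_coefficient(game, oil))   # -1.0: the words never share a document
```

In this toy data, "game" and "oil" split the documents exactly between them, which gives the extreme value of -1.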

Visualizer: WordCorrelationPlot
Quick Method: word_correlation()
Models: Text Modeling
Workflow: Feature Engineering

from yellowbrick.datasets import load_hobbies
from yellowbrick.text.correlation import WordCorrelationPlot

# Load the text corpus
corpus = load_hobbies()

# Create the list of words to plot
words = ["Tatsumi Kimishima", "Nintendo", "game", "play", "man", "woman"]

# Instantiate the visualizer and draw the plot
viz = WordCorrelationPlot(words)
viz.fit(corpus.data)
viz.show()

[Figure: Word Correlation Plot]

Quick Method

The same functionality can be achieved with the associated quick method, word_correlation. This method builds the WordCorrelationPlot object with the associated arguments, fits it, then (optionally) immediately shows the visualization.

from yellowbrick.datasets import load_hobbies
from yellowbrick.text.correlation import word_correlation

# Load the text corpus
corpus = load_hobbies()

# Create the list of words to plot
words = ["Game", "player", "score", "oil"]

# Draw the plot
word_correlation(words, corpus.data)

[Figure: Word Correlation Plot]

API Reference

Implementation of word correlation for text visualization.

class yellowbrick.text.correlation.WordCorrelationPlot(words, ignore_case=False, ax=None, cmap='RdYlBu', colorbar=True, fontsize=None, **kwargs)[source]

Bases: TextVisualizer

Word correlation illustrates the extent to which words in a corpus appear in the same documents.

WordCorrelationPlot visualizes the binary correlation between words across documents as a heatmap. The correlation is defined using the mean square contingency coefficient (phi-coefficient) between any two words m and n. The coefficient is a value between -1 and 1, inclusive. A value close to 1 or -1 indicates strong positive or negative correlation between m and n, while a value close to 0 indicates little or no correlation. The constructor takes one required argument, which is the list of words or n-grams to be plotted.

Parameters
words : list of str

The list of words or n-grams to be plotted. The words must be present in the provided corpus on fit().

ignore_case : bool, default: False

If True, all words will be converted to lowercase before processing.

ax : matplotlib Axes, default: None

The axes to plot the figure on.

cmap : str or cmap, default: "RdYlBu"

Colormap to use for the heatmap.

colorbar : bool, default: True

If True, a colorbar will be added to the heatmap.

fontsize : int, default: None

Font size to use for the labels on the axes.

kwargs : dict

Pass any additional keyword arguments to the super class.

Attributes
self.doc_term_matrix_ : array of shape (n_docs, n_features)

The computed sparse document-term matrix containing binary values indicating if a word is present in a document.

self.num_docs_ : int

The number of observed documents in the corpus.

self.vocab_ : dict

A dictionary mapping words to their indices in the document-term matrix.

self.num_features_ : int

The number of features (word labels) in the resulting plot.

self.correlation_matrix_ : ndarray of shape (n_features, n_features)

The computed matrix containing the phi-coefficients between all features.
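To make the relationship between these attributes concrete, here is a minimal pure-Python sketch that builds analogous structures for a tiny corpus. It is not the library's implementation: the helper names, the whitespace tokenization (unigrams only), and the sample documents are all assumptions for illustration. The sketch relies on the fact that the phi coefficient equals the Pearson correlation computed on 0/1 indicator columns.

```python
import math


def binary_doc_term_matrix(words, documents, ignore_case=False):
    """Return (matrix, vocab); matrix[i][j] is 1 if words[j] occurs in document i."""
    if ignore_case:
        words = [w.lower() for w in words]
    vocab = {word: idx for idx, word in enumerate(words)}
    matrix = []
    for doc in documents:
        text = doc.lower() if ignore_case else doc
        tokens = set(text.split())  # assumption: whitespace tokens, unigrams only
        matrix.append([1 if w in tokens else 0 for w in words])
    return matrix, vocab


def pearson(xs, ys):
    """Pearson correlation; on 0/1 columns this is the phi coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # constant column: correlation is undefined
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(matrix):
    """Pairwise correlations between the columns of a document-term matrix."""
    columns = list(zip(*matrix))
    return [[pearson(a, b) for b in columns] for a in columns]


docs = [
    "the game was fun to play",
    "Game night score",
    "oil prices rise",
    "play the game",
]
matrix, vocab = binary_doc_term_matrix(["game", "play", "oil"], docs, ignore_case=True)
corr = correlation_matrix(matrix)

print(len(matrix), len(vocab))    # number of documents and number of features
print(corr[vocab["game"]][vocab["oil"]])
```

Here matrix, vocab, and corr play the roles of doc_term_matrix_, vocab_, and correlation_matrix_, while their dimensions correspond to num_docs_ and num_features_.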

draw(X)[source]

Called from the fit() method, this method draws the heatmap on the figure using the computed correlation matrix.

finalize()[source]

Prepares the figure for rendering by adding the title. This method is usually called from show() and not directly by the user.

fit(X, y=None)[source]

The fit method is the primary drawing input for the word correlation visualization.

Parameters
X : list of str or generator

Should be provided as a list of strings or a generator yielding strings that represent the documents in the corpus.

y : None

Labels are not used for the word correlation visualization.
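For corpora stored on disk, a generator avoids holding every document in a list before fitting. The sketch below shows one possible shape for such a generator; iter_documents, the "*.txt" pattern, and the directory layout are hypothetical and only illustrate the "generator yielding strings" input described above.

```python
from pathlib import Path


def iter_documents(directory, pattern="*.txt", encoding="utf-8"):
    """Yield the text of each matching file, one document at a time."""
    # Sorting keeps the document order stable across runs.
    for path in sorted(Path(directory).glob(pattern)):
        yield path.read_text(encoding=encoding)
```

A call such as viz.fit(iter_documents("corpus_dir")) would then pass the documents lazily, where "corpus_dir" stands in for a real directory. Keep in mind that a generator is exhausted after one pass, so a new one must be created for each fit.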

Returns
self : instance

Returns the instance of the transformer/visualizer.

Attributes
self.doc_term_matrix_ : array of shape (n_docs, n_features)

The computed sparse document-term matrix containing binary values indicating if a word is present in a document.

self.num_docs_ : int

The number of observed documents in the corpus.

self.vocab_ : dict

A dictionary mapping words to their indices in the document-term matrix.

self.num_features_ : int

The number of features (word labels) in the resulting plot.

self.correlation_matrix_ : ndarray of shape (n_features, n_features)

The computed matrix containing the phi-coefficients between all features.

yellowbrick.text.correlation.word_correlation(words, corpus, ignore_case=True, ax=None, cmap='RdYlBu', show=True, colorbar=True, fontsize=None, **kwargs)[source]

Word Correlation

Displays the binary correlation between the given words across the documents in a corpus. For a list of words with length n, this produces an n x n heatmap of correlation values in the range [-1, 1].

Parameters
words : list of str

The corpus words to display in the heatmap.

corpus : list of str or generator

The corpus as a list of documents or a generator yielding documents.

ignore_case : bool, default: True

If True, all words will be converted to lowercase before processing.

ax : matplotlib Axes, default: None

The axes to plot the figure on.

cmap : str, default: "RdYlBu"

Colormap to use for the heatmap.

show : bool, default: True

If True, calls show(), which in turn calls plt.show(). Note that you cannot call plt.savefig or pass clear_figure from this signature. If False, simply calls finalize().

colorbar : bool, default: True

If True, adds a colorbar to the figure.

fontsize : int, default: None

If not None, sets the font size of the labels.
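Because the quick method with show=True displays the figure right away, one way to write the plot to a file is to use the WordCorrelationPlot class directly on an axes you create yourself. The sketch below assumes that the visualizer's show() method accepts an outpath argument, which is part of the base Yellowbrick visualizer rather than anything documented in this section; save_word_correlation is a hypothetical helper name, and the figure size and font size are arbitrary choices.

```python
def save_word_correlation(words, documents, outpath):
    """Fit a WordCorrelationPlot and write it to outpath instead of displaying it."""
    # Imported lazily so that defining the helper does not load the plotting stack.
    import matplotlib.pyplot as plt
    from yellowbrick.text.correlation import WordCorrelationPlot

    # Create the axes explicitly so the figure size is under our control.
    fig, ax = plt.subplots(figsize=(6, 5))

    viz = WordCorrelationPlot(words, ignore_case=True, ax=ax, fontsize=10)
    viz.fit(documents)

    # Assumption: show(outpath=...) saves the figure rather than opening a window.
    viz.show(outpath=outpath)
    return viz
```

With the hobbies corpus from the examples above, this could be called as save_word_correlation(["game", "player", "score", "oil"], corpus.data, "word_correlation.png"), where the output filename is only a placeholder.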