Word Correlation Plot

Word correlation illustrates the extent to which words or phrases co-occur across the documents in a corpus. This is useful for understanding the relationships between known text features in a corpus with many documents. WordCorrelationPlot visualizes the document-occurrence correlations between selected words in a corpus. For n features, the plot renders an n x n heatmap of correlation values.

The correlation values are computed using the phi coefficient, a measure of association between two binary variables (here, whether each word is present in or absent from a document). A value close to 1 or -1 indicates that the occurrences of the two features are strongly positively or negatively correlated, respectively, while a value close to 0 indicates little or no relationship between them.
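To make the metric concrete, the following is a minimal sketch of the phi coefficient for two binary presence vectors, written in plain Python. It is an illustration only and is not taken from the library's implementation; the function name `phi_coefficient`, the sample vectors, and the choice to return NaN when the coefficient is undefined are all assumptions made for this example.

```python
from math import sqrt


def phi_coefficient(x, y):
    """Phi coefficient between two equal-length binary (0/1) sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    # Cells of the 2x2 contingency table
    n11 = sum(1 for a, b in zip(x, y) if a and b)          # both words present
    n10 = sum(1 for a, b in zip(x, y) if a and not b)      # only the first word
    n01 = sum(1 for a, b in zip(x, y) if not a and b)      # only the second word
    n00 = sum(1 for a, b in zip(x, y) if not a and not b)  # neither word

    # Product of the four marginal totals
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # Assumption for this sketch: the coefficient is undefined when a
        # word appears in every document or in none of them.
        return float("nan")
    return (n11 * n00 - n10 * n01) / denom


# Presence (1) or absence (0) of two hypothetical words in six documents
game = [1, 1, 1, 0, 0, 0]
play = [1, 1, 0, 0, 0, 1]
print(phi_coefficient(game, play))
```

Identical presence patterns give 1, exactly opposite patterns give -1, and the two sample vectors above, which agree in four of six documents, give a moderate positive value.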



from yellowbrick.datasets import load_hobbies
from yellowbrick.text.correlation import WordCorrelationPlot

# Load the text corpus
corpus = load_hobbies()

# Create the list of words to plot
words = ["Tatsumi Kimishima", "Nintendo", "game", "play", "man", "woman"]

# Instantiate the visualizer, fit it to the corpus, and draw the plot
viz = WordCorrelationPlot(words)
viz.fit(corpus.data)
viz.show()


Quick Method

The same functionality can be achieved with the associated quick method, word_correlation. This method builds the WordCorrelationPlot object with the given arguments, fits it, then (optionally) immediately shows the visualization.

from yellowbrick.datasets import load_hobbies
from yellowbrick.text.correlation import word_correlation

# Load the text corpus
corpus = load_hobbies()

# Create the list of words to plot
words = ["Game", "player", "score", "oil"]

# Draw the plot
word_correlation(words, corpus.data)


API Reference

Implementation of word correlation for text visualization.

class yellowbrick.text.correlation.WordCorrelationPlot(words, ignore_case=False, ax=None, cmap='RdYlBu', colorbar=True, fontsize=None, **kwargs)

Bases: TextVisualizer

Word correlation illustrates the extent to which words in a corpus appear in the same documents.

WordCorrelationPlot visualizes the binary correlation between words across documents as a heatmap. The correlation is defined using the mean square contingency coefficient (phi-coefficient) between any two words m and n. The coefficient is a value between -1 and 1, inclusive. A value close to 1 or -1 indicates strong positive or negative correlation between m and n, while a value close to 0 indicates little or no correlation. The constructor takes one required argument, which is the list of words or n-grams to be plotted.

words: list of str

The list of words or n-grams to be plotted. The words must be present in the provided corpus on fit().

ignore_case: bool, default: False

If True, all words will be converted to lowercase before processing.

ax: matplotlib Axes, default: None

The axes to plot the figure on.

cmap: str or cmap, default: 'RdYlBu'

Colormap to use for the heatmap.

colorbar: bool, default: True

If True, a colorbar will be added to the heatmap.

fontsize: int, default: None

Font size to use for the labels on the axes.


kwargs: dict

Pass any additional keyword arguments to the super class.

self.doc_term_matrix_: array of shape (n_docs, n_features)

The computed sparse document-term matrix containing binary values indicating if a word is present in a document.


The number of observed documents in the corpus.


A dictionary mapping words to their indices in the document-term matrix.


The number of features (word labels) in the resulting plot.

self.correlation_matrix_: ndarray of shape (n_features, n_features)

The computed matrix containing the phi-coefficients between all features.
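As a rough illustration of the attributes described above, the sketch below builds a small binary document-term matrix and a word-to-index mapping in plain Python. It is not the library's implementation: the helper `binary_doc_term_matrix` is hypothetical, it uses a naive substring test instead of a real tokenizer, and it returns dense nested lists where the visualizer stores a sparse matrix.

```python
def binary_doc_term_matrix(words, docs, ignore_case=False):
    """Return (matrix, vocab); matrix[i][j] is 1 if words[j] occurs in docs[i]."""
    if ignore_case:
        words = [w.lower() for w in words]

    # Map each word to its column index in the matrix
    vocab = {word: j for j, word in enumerate(words)}

    matrix = []
    for doc in docs:  # docs may be a list or a generator of strings
        text = doc.lower() if ignore_case else doc
        # Naive substring test (assumption of this sketch); a real tokenizer
        # would respect word boundaries, so "play" would not match "player".
        matrix.append([1 if word in text else 0 for word in words])
    return matrix, vocab


docs = [
    "Nintendo released a new game",
    "The game was fun to play",
    "Oil prices rose again",
]
matrix, vocab = binary_doc_term_matrix(["game", "play", "Nintendo"], docs)

print(matrix)       # [[1, 0, 1], [1, 1, 0], [0, 0, 0]]
print(vocab)        # {'game': 0, 'play': 1, 'Nintendo': 2}
print(len(matrix))  # 3 documents observed
print(len(vocab))   # 3 features (word labels)
```

Each column of this matrix is a binary presence vector for one word, and the correlation matrix holds the phi coefficient for every pair of columns, which is why its shape is (n_features, n_features).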


Called from the fit() method, this method draws the heatmap on the figure using the computed correlation matrix.


Prepares the figure for rendering by adding the title. This method is usually called from show() and not directly by the user.

fit(X, y=None)

The fit method is the primary drawing input for the word correlation visualization.

X: list of str or generator

Should be provided as a list of strings or a generator yielding strings that represent the documents in the corpus.


y: None

Labels are not used for the word correlation visualization.

self: instance

Returns the instance of the transformer/visualizer.

self.doc_term_matrix_: array of shape (n_docs, n_features)

The computed sparse document-term matrix containing binary values indicating if a word is present in a document.


The number of observed documents in the corpus.


A dictionary mapping words to their indices in the document-term matrix.


The number of features (word labels) in the resulting plot.

self.correlation_matrix_: ndarray of shape (n_features, n_features)

The computed matrix containing the phi-coefficients between all features.

yellowbrick.text.correlation.word_correlation(words, corpus, ignore_case=True, ax=None, cmap='RdYlBu', show=True, colorbar=True, fontsize=None, **kwargs)

Word Correlation

Displays the binary correlation between the given words across the documents in a corpus. For a list of words with length n, this produces an n x n heatmap of correlation values in the range [-1, 1].

words: list of str

The corpus words to display in the heatmap.

corpus: list of str or generator

The corpus as a list of documents or a generator yielding documents.

ignore_case: bool, default: True

If True, all words will be converted to lowercase before processing.

ax: matplotlib axes, default: None

The axes to plot the figure on.

cmap: str, default: 'RdYlBu'

Colormap to use for the heatmap.

show: bool, default: True

If True, calls show(), which in turn calls plt.show(). Note that you cannot call plt.savefig or clear_figure from this signature. If False, simply calls finalize().

colorbar: bool, default: True

If True, adds a colorbar to the figure.

fontsize: int, default: None

If not None, sets the font size of the labels.