Intercluster Distance Maps

Intercluster distance maps display an embedding of the cluster centers in 2 dimensions with the distance to other centers preserved. That is, the closer two centers are in the visualization, the closer they are in the original feature space. The clusters are sized according to a scoring metric. By default, they are sized by membership, i.e., the number of instances that belong to each center. This gives a sense of the relative importance of clusters. Note, however, that two clusters overlapping in the 2D space does not imply that they overlap in the original feature space.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

from yellowbrick.cluster import InterclusterDistance

# Make a 12-blob dataset with 16 features
X, y = make_blobs(centers=12, n_samples=1000, n_features=16, shuffle=True)

# Instantiate the clustering model and visualizer
visualizer = InterclusterDistance(KMeans(9))

visualizer.fit(X)  # Fit the training data to the visualizer
visualizer.poof()  # Draw/show/poof the data
[Figure: the resulting intercluster distance map (icdm.png)]

API Reference

Implements Intercluster Distance Map visualizations.

class yellowbrick.cluster.icdm.InterclusterDistance(model, ax=None, min_size=400, max_size=25000, embedding='mds', scoring='membership', legend=True, legend_loc='lower left', legend_size=1.5, random_state=None, **kwargs)

Bases: yellowbrick.cluster.base.ClusteringScoreVisualizer

Intercluster distance maps display an embedding of the cluster centers in 2 dimensions with the distance to other centers preserved. That is, the closer two centers are in the visualization, the closer they are in the original feature space. The clusters are sized according to a scoring metric. By default, they are sized by membership, i.e., the number of instances that belong to each center. This gives a sense of the relative importance of clusters. Note, however, that two clusters overlapping in the 2D space does not imply that they overlap in the original feature space.

Parameters:
model : a Scikit-Learn clusterer

Should be an instance of a centroidal clustering algorithm (or a hierarchical algorithm with a specified number of clusters). Also accepts some other models like LDA for text clustering. If it is not a clusterer, an exception is raised.

ax : matplotlib Axes, default: None

The axes to plot the figure on. If None is passed in, the current axes will be used (or generated if required).

min_size : int, default: 400

The size, in points, of the smallest cluster drawn on the graph. Cluster sizes will be scaled between the min and max sizes.

max_size : int, default: 25000

The size, in points, of the largest cluster drawn on the graph. Cluster sizes will be scaled between the min and max sizes.

embedding : default: ‘mds’

The algorithm used to embed the cluster centers in 2 dimensional space so that the distance between clusters is represented equivalently to their relationship in feature space. Embedding algorithm options include:

  • mds: multidimensional scaling
  • tsne: stochastic neighbor embedding
scoring : default: ‘membership’

The scoring method used to determine the size of the clusters drawn on the graph so that the relative importance of clusters can be viewed. Scoring method options include:

  • membership: number of instances belonging to each cluster
legend : bool, default: True

Whether or not to draw the size legend onto the graph; omit the legend to more easily see clusters that overlap.

legend_loc : str, default: “lower left”

The location of the legend on the graph, used to move the legend out of the way of clusters into open space. The same legend location options as matplotlib are used here.

legend_size : float, default: 1.5

The size, in inches, of the size legend to inset into the graph.

random_state : int or RandomState, default: None

Fixes the random state for stochastic embedding algorithms.

kwargs : dict

Keyword arguments passed to the base class that may influence the feature visualization properties.
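
For instance, a visualizer configured with non-default options might look like the following (a sketch that uses only the parameters documented above):

from sklearn.cluster import KMeans
from yellowbrick.cluster import InterclusterDistance

# Use the t-SNE embedding instead of the default MDS, omit the size
# legend to reduce clutter, and fix the random state so the stochastic
# embedding is reproducible.
visualizer = InterclusterDistance(
    KMeans(9),
    embedding='tsne',
    legend=False,
    random_state=42,
)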

Notes

Currently the only two embeddings supported are MDS and TSNE. Soon to follow will be PCoA and a customized version of PCoA for LDA. The only supported scoring metric is membership, but in the future, silhouette scores and cluster diameter will be added.

In terms of algorithm support, right now any clustering algorithm that has learned cluster_centers_ and labels_ attributes will work with the visualizer. In the future, we will update this to work with hierarchical clusterers that have n_components and with LDA.
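
For example, MiniBatchKMeans learns both cluster_centers_ and labels_, so it should work with the visualizer as well (a minimal sketch based on the note above):

from sklearn.cluster import MiniBatchKMeans
from yellowbrick.cluster import InterclusterDistance

# MiniBatchKMeans is centroidal and exposes cluster_centers_ and labels_
# after fitting, which is all the visualizer requires.
visualizer = InterclusterDistance(MiniBatchKMeans(6))
visualizer.fit(X)  # X as in the quick-start example above
visualizer.poof()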

Attributes:
cluster_centers_ : array of shape (n_clusters, n_features)

Searches for or creates cluster centers for the specified clustering algorithm.

embedded_centers_ : array of shape (n_clusters, 2)

The positions of all the cluster centers on the graph.

scores_ : array of shape (n_clusters,)

The scores of each cluster that determine its size on the graph.

fit_time_ : Timer

The time it took to fit the clustering model and perform the embedding.
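
Once fit, these attributes can be inspected directly (a short usage sketch, continuing from the quick-start example above):

visualizer.fit(X)

print(visualizer.cluster_centers_.shape)   # (9, 16): centers in feature space
print(visualizer.embedded_centers_.shape)  # (9, 2): centers in the 2D plot space
print(visualizer.scores_)                  # membership counts, one per cluster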

cluster_centers_

Searches for or creates cluster centers for the specified clustering algorithm. This algorithm ensures that the centers are appropriately drawn and scaled so that distances between clusters are maintained.

draw()

Draw the embedded centers with their sizes on the visualization.

finalize()

Finalize the visualization to create an “origin grid” feel instead of the default matplotlib feel. Set the title, remove spines, and label the grid with components. This function also adds a legend from the sizes if required.

fit(X, y=None)

Fit the clustering model, compute the centers, and then embed the centers into 2D space using the specified embedding method.
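
Conceptually, the fit procedure reduces to the following steps (a simplified sketch, not the actual Yellowbrick implementation):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import MDS

# 1. Fit the wrapped clustering model to learn the centers.
model = KMeans(9).fit(X)

# 2. Embed the high-dimensional centers into 2D, preserving their
#    pairwise distances with MDS (the default embedding).
embedded_centers = MDS(n_components=2, random_state=42).fit_transform(
    model.cluster_centers_
)

# 3. Score each cluster by membership, i.e. the number of instances
#    assigned to each label.
scores = np.bincount(model.labels_)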

lax

Returns the legend axes, creating it only on demand by creating a 2” by 2” inset axes that has no grid, ticks, spines or face frame (i.e., it is mostly invisible). The legend can then be drawn on this axes.
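
A rough sketch of how such an inset legend axes could be constructed with matplotlib (illustrative only; the actual internals may differ):

from mpl_toolkits.axes_grid1.inset_locator import inset_axes

# Create a small inset axes in the lower left of the parent axes (ax),
# then strip its grid, ticks, spines, and frame so only the legend
# drawn on it is visible.
lax = inset_axes(ax, width=2, height=2, loc='lower left')
lax.grid(False)
lax.set_xticks([])
lax.set_yticks([])
lax.set_frame_on(False)
for spine in lax.spines.values():
    spine.set_visible(False)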

transformer

Creates the internal transformer that maps the cluster center’s high dimensional space to its two dimensional space.
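
A minimal sketch of how the embedding name might be mapped to a scikit-learn transformer (an assumption for illustration, not the actual internals):

from sklearn.manifold import MDS, TSNE

def make_transformer(embedding, random_state=None):
    # Both embeddings project the centers into 2 dimensions; note that
    # t-SNE's default perplexity may need lowering when there are only
    # a handful of cluster centers.
    transformers = {
        'mds': MDS(n_components=2, random_state=random_state),
        'tsne': TSNE(n_components=2, random_state=random_state),
    }
    return transformers[embedding]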