yellowbrick.cluster package

Submodules

yellowbrick.cluster.base module

Base class for cluster visualizers.

class yellowbrick.cluster.base.ClusteringScoreVisualizer(model, ax=None, **kwargs)[source]

Bases: yellowbrick.base.ScoreVisualizer

Base class for all ScoreVisualizers that evaluate a clustering estimator.

The primary functionality of this class is to check that the wrapped estimator is a clustering estimator; otherwise, a YellowbrickTypeError exception is raised.
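As a rough illustration of the check described above, the sketch below rejects estimators that are not clusterers. It is not Yellowbrick's implementation: the require_clusterer helper is hypothetical, it tests for scikit-learn's ClusterMixin, and it raises a plain TypeError as a stand-in for the Yellowbrick exception.

```python
from sklearn.base import ClusterMixin
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression


def require_clusterer(estimator):
    """Hypothetical helper: return the estimator if it is a clusterer, else raise."""
    if not isinstance(estimator, ClusterMixin):
        # Stand-in for the Yellowbrick-specific exception named above.
        raise TypeError(
            "{} is not a clustering estimator".format(type(estimator).__name__)
        )
    return estimator


require_clusterer(KMeans(n_clusters=3))      # accepted

try:
    require_clusterer(LinearRegression())    # rejected
except TypeError as err:
    print(err)
```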

yellowbrick.cluster.elbow module

Implements the elbow method for determining the optimal number of clusters. https://bl.ocks.org/rpgove/0060ff3b656618e9136b

class yellowbrick.cluster.elbow.KElbowVisualizer(model, ax=None, k=10, metric='distortion', timings=True, **kwargs)[source]

Bases: yellowbrick.cluster.base.ClusteringScoreVisualizer

The K-Elbow Visualizer implements the “elbow” method of selecting the optimal number of clusters for K-means clustering. K-means is a simple unsupervised machine learning algorithm that groups data into a specified number (k) of clusters. Because the user must specify in advance what k to choose, the algorithm is somewhat naive – it assigns all members to k clusters even if that is not the right k for the dataset.

The elbow method runs k-means clustering on the dataset for a range of values of k (say, from 1 to 10) and then, for each value of k, computes an average score for all clusters. By default, the distortion_score is computed: the sum of squared distances from each point to its assigned center. Other metrics can also be used, such as the silhouette_score, the mean silhouette coefficient for all samples, or the calinski_harabaz_score, which computes the ratio of dispersion between and within clusters.

When these overall metrics for each model are plotted, it is possible to visually determine the best value for K. If the line chart looks like an arm, then the “elbow” (the point of inflection on the curve) is the best value of k. The “arm” can be either up or down, but if there is a strong inflection point, it is a good indication that the underlying model fits best at that point.
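The procedure described above can be sketched without the visualizer by fitting KMeans once per candidate k and recording the distortion. This is a minimal sketch, not the visualizer's own code: the synthetic data from make_blobs, the range of k, and the variable names are assumptions made for illustration, and scikit-learn's inertia_ attribute is used as the distortion.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four groups (an assumption for illustration).
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

k_values = list(range(2, 9))
scores = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances from each sample to its
    # closest cluster center, i.e. the distortion described above.
    scores.append(km.inertia_)

for k, score in zip(k_values, scores):
    print("k={}: distortion={:.1f}".format(k, score))
```

Plotting scores against k_values gives the "arm"; the elbow is the k after which the score stops improving sharply.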

Parameters:

model : a Scikit-Learn clusterer

Should be an instance of a clusterer, specifically KMeans or MiniBatchKMeans. If it is not a clusterer, an exception is raised.

ax : matplotlib Axes, default: None

The axes to plot the figure on. If None is passed in, the current axes will be used (or generated if required).

k : integer or tuple

The range of k to compute scores for. If a single integer is specified, scores are computed for the range (2, k); otherwise, the range specified in the tuple is used.

metric : string, default: "distortion"

Select the scoring metric to evaluate the clusters. The default is the mean distortion, defined by the sum of squared distances between each observation and its closest centroid. The available metrics are:

  • distortion: mean sum of squared distances to centers
  • silhouette: mean ratio of intra-cluster and nearest-cluster distance
  • calinski_harabaz: ratio of between-cluster to within-cluster dispersion

timings : bool, default: True

Display the fitting time per k to evaluate the amount of time required to train the clustering model.

kwargs : dict

Keyword arguments that are passed to the base class and may influence the visualization as defined in other Visualizers.
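The metric and timings options above can be approximated directly with scikit-learn, which may help when checking what the visualizer reports. The sketch below is an assumption-laden illustration rather than the visualizer's code: the data, the range of k, and the results structure are invented for the example. Note that recent scikit-learn releases spell the function calinski_harabasz_score, whereas this page uses the older calinski_harabaz spelling.

```python
import time

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic data (an assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

results = []
for k in range(2, 6):
    start = time.perf_counter()
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    elapsed = time.perf_counter() - start  # seconds spent fitting this k

    results.append({
        "k": k,
        "silhouette": silhouette_score(X, labels),
        "calinski_harabasz": calinski_harabasz_score(X, labels),
        "fit_seconds": elapsed,
    })

for row in results:
    print(row)
```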

Notes

If the resulting plot has no elbow or inflection point, the elbow method may not be suitable for the data. It does not work well when the data is not strongly clustered; in that case the curve is smooth and the best value of k is unclear. Other scoring methods, such as BIC or SSE, can also be used to explore whether clustering is an appropriate choice.

For a discussion of the elbow method, see Robert Gove's Block: https://bl.ocks.org/rpgove/0060ff3b656618e9136b

Examples

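>>> from sklearn.cluster import KMeans
>>> from sklearn.datasets import make_blobs
>>> from yellowbrick.cluster import KElbowVisualizer
>>> # X is not defined in the original example; synthetic data is assumed here.
>>> X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
>>> model = KElbowVisualizer(KMeans(), k=10)
>>> model.fit(X)
>>> model.poof()

draw()[source]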

Draw the elbow curve for the specified scores and values of K.

finalize()[source]

Prepare the figure for rendering by setting the title as well as the X and Y axis labels and adding the legend.

fit(X, y=None, **kwargs)[source]

Fits n KMeans models, where n is the length of self.k_values_, storing the score of each model (computed with the selected metric) in the self.k_scores_ attribute. This method finishes by calling draw to create the plot.

yellowbrick.cluster.elbow.distortion_score(X, labels, metric='euclidean')[source]

Compute the mean distortion of all samples.

The distortion is computed as the sum of the squared distances between each observation and its closest centroid. Logically, this is the metric that K-Means attempts to minimize as it fits the model.

Parameters:

X : array, shape = [n_samples, n_features] or [n_samples_a, n_samples_a]

Array of pairwise distances between samples if metric == "precomputed", or a feature array for computing distances against the labels.

labels : array, shape = [n_samples]

Predicted labels for each sample

metric : string

The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by sklearn.metrics.pairwise.pairwise_distances.
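The distortion described above can be illustrated with a small NumPy sketch. It is not the library's implementation: distortion_sketch is a hypothetical name, it supports only Euclidean distance on a feature array (no metric argument and no precomputed distances), it takes each cluster's mean as its center, and it returns the plain sum rather than a mean.

```python
import numpy as np


def distortion_sketch(X, labels):
    """Sum of squared Euclidean distances from each sample to its cluster mean."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for label in np.unique(labels):
        members = X[labels == label]
        center = members.mean(axis=0)            # centroid of this cluster
        total += ((members - center) ** 2).sum()
    return total


X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])
print(distortion_sketch(X, labels))  # 2.0 from cluster 0 plus 8.0 from cluster 1
```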

Todo: add sample_size and random_state keyword arguments, similar to silhouette_score.

yellowbrick.cluster.silhouette module

Implements visualizers that use the silhouette metric for cluster evaluation.

class yellowbrick.cluster.silhouette.SilhouetteVisualizer(model, ax=None, **kwargs)[source]

Bases: yellowbrick.cluster.base.ClusteringScoreVisualizer

TODO: Document this class!

draw(labels)[source]

Draw the silhouettes for each sample and the average score.

Parameters:

labels : array-like

An array with the cluster label for each silhouette sample, usually computed with predict(). Labels are not stored on the visualizer so that the figure can be redrawn with new data.
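The quantities that a silhouette plot displays can be computed with scikit-learn alone. The sketch below is an illustration under assumptions, not the visualizer's drawing code: the data, the choice of three clusters, and the silhouettes dictionary are invented for the example. It computes one silhouette coefficient per sample, groups the coefficients by cluster label, and computes the average score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic data (an assumption for illustration).
X, _ = make_blobs(n_samples=200, centers=3, random_state=7)
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

sample_scores = silhouette_samples(X, labels)   # one coefficient per sample
average = silhouette_score(X, labels)           # mean over all samples

# Group the per-sample coefficients by cluster and sort them, an ordering
# commonly used for each cluster's band in a silhouette plot.
silhouettes = {
    int(label): np.sort(sample_scores[labels == label])
    for label in np.unique(labels)
}

for label, values in silhouettes.items():
    print("cluster {}: n={}, mean={:.3f}".format(label, values.size, values.mean()))
print("average silhouette: {:.3f}".format(average))
```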

finalize()[source]

Prepare the figure for rendering by setting the title and adjusting the limits on the axes, adding labels and a legend.

fit(X, y=None, **kwargs)[source]

Fits the model and generates the silhouette visualization.

TODO: decide to use this method or the score method to draw. NOTE: Probably this would be better in score, but the standard score is a little different and I’m not sure how it’s used.