Elbow Method
The KElbowVisualizer implements the “elbow” method to help data scientists select the optimal number of clusters by fitting the model with a range of values for \(K\). If the line chart resembles an arm, then the “elbow” (the point of inflection on the curve) is a good indication that the underlying model fits best at that point.
To demonstrate, in the following example the KElbowVisualizer fits the KMeans model for a range of \(K\) values from 4 to 11 on a sample dataset with 12 features and 8 random clusters of points. When the model is fit with 8 clusters, we can see an “elbow” in the graph, which in this case we know to be the optimal number.
from sklearn.datasets import make_blobs
# Create synthetic dataset with 8 random clusters
X, y = make_blobs(centers=8, n_features=12, shuffle=True, random_state=42)
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12))
visualizer.fit(X) # Fit the data to the visualizer
visualizer.poof() # Draw/show/poof the data
By default, the scoring parameter metric is set to distortion, which computes the sum of squared distances from each point to its assigned center. However, two other metrics can also be used with the KElbowVisualizer: silhouette and calinski_harabaz. The silhouette score calculates the mean Silhouette Coefficient of all samples, while the calinski_harabaz score computes the ratio of dispersion between and within clusters.
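To make the default metric concrete, here is a minimal pure-Python sketch of the "sum of squared distances from each point to its assigned center" described above. The function name `distortion` and the toy data are illustrative assumptions; the library's own distortion_score may differ in details such as averaging.

```python
def distortion(points, centers, labels):
    """Sum of squared Euclidean distances from each point to its assigned center.

    Hypothetical helper for illustration; not the Yellowbrick implementation.
    """
    total = 0.0
    for point, label in zip(points, labels):
        center = centers[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


# Toy data: two tight groups, each point assigned to its own group's center.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]

two_clusters = distortion(points, centers, labels)

# The same points forced into a single cluster centered at their mean.
one_cluster = distortion(points, [(5.0, 1.0)], [0, 0, 0, 0])

print(two_clusters, one_cluster)
```

With well-separated groups, the score drops sharply once there are enough centers to cover each group, which is what produces the bend in the plotted curve.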
The KElbowVisualizer also displays the amount of time to train the clustering model per \(K\) as a dashed green line, but it can be hidden by setting timings=False. In the following example, we’ll use the calinski_harabaz score and hide the time to fit the model.
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(
    model, k=(4,12), metric='calinski_harabaz', timings=False
)
visualizer.fit(X) # Fit the data to the visualizer
visualizer.poof() # Draw/show/poof the data
It is important to remember that the “elbow” method does not work well if the data is not very clustered. In this case, you might see a smooth curve and the optimal value of \(K\) will be unclear.
API Reference
Implements the elbow method for determining the optimal number of clusters. https://bl.ocks.org/rpgove/0060ff3b656618e9136b

class yellowbrick.cluster.elbow.KElbowVisualizer(model, ax=None, k=10, metric='distortion', timings=True, **kwargs)
Bases: yellowbrick.cluster.base.ClusteringScoreVisualizer
The KElbowVisualizer implements the “elbow” method of selecting the optimal number of clusters for K-means clustering. K-means is a simple unsupervised machine learning algorithm that groups data into a specified number (k) of clusters. Because the user must specify k in advance, the algorithm is somewhat naive: it assigns all members to k clusters even if that is not the right k for the dataset.
The elbow method runs k-means clustering on the dataset for a range of values for k (say, from 1 to 10) and then for each value of k computes an average score for all clusters. By default, the distortion score is computed: the sum of squared distances from each point to its assigned center. Other metrics can also be used, such as the silhouette score, the mean silhouette coefficient for all samples, or the calinski_harabaz score, which computes the ratio of dispersion between and within clusters.
When these overall metrics for each model are plotted, it is possible to visually determine the best value for k. If the line chart looks like an arm, then the “elbow” (the point of inflection on the curve) is the best value of k. The “arm” can be either up or down, but if there is a strong inflection point, it is a good indication that the underlying model fits best at that point.
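The visualizer leaves locating the elbow to the eye. As a rough illustration of what a "strong inflection point" means numerically, the sketch below picks the k where the curve bends most sharply, using the largest absolute second difference of the scores. This is a swapped-in heuristic, not something the visualizer does; the function name `find_elbow` is hypothetical, the scores are made-up toy values rather than real results, and the k values are assumed to be evenly spaced.

```python
def find_elbow(k_values, scores):
    """Return the k with the largest absolute second difference in score.

    Hypothetical heuristic for illustration only. Taking the absolute value
    lets it handle an "arm" that bends either up or down.
    """
    if len(k_values) != len(scores) or len(scores) < 3:
        raise ValueError("need at least three (k, score) pairs of equal length")
    best_k, best_bend = None, -1.0
    for i in range(1, len(scores) - 1):
        bend = abs(scores[i - 1] - 2 * scores[i] + scores[i + 1])
        if bend > best_bend:
            best_k, best_bend = k_values[i], bend
    return best_k


# Made-up scores: a steep, steady drop that flattens out after k=8.
k_values = [4, 5, 6, 7, 8, 9, 10, 11]
scores = [900.0, 700.0, 500.0, 300.0, 100.0, 90.0, 80.0, 70.0]

elbow_k = find_elbow(k_values, scores)
print(elbow_k)
```

On a smooth curve with no clear bend, every second difference is small and the returned k carries little meaning, which mirrors the caveat that the method struggles on data that is not very clustered.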
Parameters:
model : a Scikit-Learn clusterer
    Should be an instance of a clusterer, specifically KMeans or MiniBatchKMeans. If it is not a clusterer, an exception is raised.
ax : matplotlib Axes, default: None
    The axes to plot the figure on. If None is passed in, the current axes will be used (or generated if required).
k : integer, tuple, or iterable
    The k values to compute scores for. If a single integer is specified, scores are computed for the range (2, k). If a tuple of 2 integers is specified, then k will be in np.arange(k[0], k[1]). Otherwise, specify an iterable of integers to use as values for k.
metric : string, default: "distortion"
    Select the scoring metric to evaluate the clusters. The default is the mean distortion, defined by the sum of squared distances between each observation and its closest centroid. Other metrics include:
        distortion: mean sum of squared distances to centers
        silhouette: mean ratio of intra-cluster and nearest-cluster distance
        calinski_harabaz: ratio of between-cluster to within-cluster dispersion
timings : bool, default: True
    Display the fitting time per k to evaluate the amount of time required to train the clustering model.
kwargs : dict
    Keyword arguments that are passed to the base class and may influence the visualization as defined in other Visualizers.
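The three accepted forms of the k parameter can be sketched as a small helper that expands each form into an explicit list of values. The name `resolve_k_values` is hypothetical and this is not the library's code. For the single-integer form, the sketch assumes the upper bound k itself is included, which the description above does not state explicitly.

```python
def resolve_k_values(k):
    """Expand an integer, a 2-tuple of integers, or an iterable into a list of k values.

    Hypothetical helper for illustration; not the Yellowbrick implementation.
    """
    if isinstance(k, int):
        # Assumption: a single integer means 2 through k, inclusive.
        return list(range(2, k + 1))
    if isinstance(k, tuple) and len(k) == 2 and all(isinstance(v, int) for v in k):
        # Half-open interval, matching np.arange(k[0], k[1]).
        return list(range(k[0], k[1]))
    values = list(k)
    if not all(isinstance(v, int) for v in values):
        raise ValueError("k must be an integer, a tuple of 2 integers, or an iterable of integers")
    return values


print(resolve_k_values((4, 12)))
print(resolve_k_values([2, 4, 8, 16]))
```

Because the tuple form is half-open, k=(4,12) covers the values 4 through 11.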
Notes
If you get a visualizer that doesn’t have an elbow or inflection point, then this method may not be working. The elbow method does not work well if the data is not very clustered; in this case, you might see a smooth curve and the value of k is unclear. Other scoring methods, such as BIC or SSE, also can be used to explore if clustering is a correct choice.
For a discussion on the Elbow method, read more at Robert Gove’s Block.
See also
The scikit-learn documentation for the silhouette_score and calinski_harabaz_score. The default, distortion_score, is implemented in yellowbrick.cluster.elbow.
Examples
>>> from yellowbrick.cluster import KElbowVisualizer
>>> from sklearn.cluster import KMeans
>>> model = KElbowVisualizer(KMeans(), k=10)
>>> model.fit(X)
>>> model.poof()