Manifold Visualization

The Manifold visualizer provides high-dimensional visualization by using manifold learning to embed instances described by many dimensions into two, allowing the creation of a scatter plot that shows latent structures in the data. Unlike decomposition methods such as PCA and SVD, manifolds generally use nearest-neighbors approaches to embedding, allowing them to capture non-linear structures that would otherwise be lost. The resulting projections can then be analyzed for noise or separability to determine whether it is possible to create a decision space in the data.

../../_images/concrete_tsne_manifold.png

The Manifold visualizer allows access to all currently available scikit-learn manifold implementations by specifying the manifold as a string to the visualizer. The currently implemented default manifolds are as follows:

Manifold     Description
"lle"        Locally Linear Embedding (LLE) uses many local linear decompositions to preserve globally non-linear structures.
"ltsa"       LTSA LLE: local tangent space alignment is similar to LLE in that it uses locality to preserve neighborhood distances.
"hessian"    Hessian LLE: an LLE regularization method that applies a hessian-based quadratic form at each neighborhood.
"modified"   Modified LLE: applies a regularization parameter to LLE.
"isomap"     Isomap: seeks a lower-dimensional embedding that maintains geometric distances between each instance.
"mds"        MDS: multi-dimensional scaling uses similarity to plot points that are near each other close together in the embedding.
"spectral"   Spectral Embedding: a discrete approximation of the low-dimensional manifold using a graph representation.
"tsne"       t-SNE: converts the similarity of points into probabilities, then uses those probabilities to create the embedding.

Each manifold algorithm produces a different embedding and takes advantage of different properties of the underlying data. Generally speaking, it takes multiple attempts on new data to determine which manifold works best for the structures latent in your data. Note, however, that different manifold algorithms have different time, complexity, and resource requirements.
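
For example, a minimal sketch of comparing several manifolds in turn (assuming X and y are a feature matrix and target vector you have already loaded):

from yellowbrick.features.manifold import Manifold

# Try a few embeddings to see which best reveals latent structure;
# "tsne" and "mds" are typically the slowest on larger datasets
for algorithm in ("lle", "isomap", "mds", "tsne"):
    visualizer = Manifold(manifold=algorithm, target='auto')
    visualizer.fit_transform(X, y)
    visualizer.poof()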

Manifolds can be used on many types of problems, and the color of points in the scatter plot can describe the target for each instance. In an unsupervised or clustering problem, a single color is used to show structure and overlap. In a classification problem, discrete colors are used for each class. In a regression problem, a color map can be used to describe points as a heat map of their regression values.
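
As a sketch of the unsupervised case (again assuming a feature matrix X; no y is passed, so a single color is drawn):

from yellowbrick.features.manifold import Manifold

# With no target, all points share one color, exposing structure and overlap
visualizer = Manifold(manifold='lle')
visualizer.fit_transform(X)
visualizer.poof()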

Discrete Target

In a classification or clustering problem, the instances can be described by discrete labels - the classes or categories in the supervised problem, or the clusters they belong to in the unsupervised version. The Manifold visualizer displays this by assigning a color to each label and showing the labels in a legend.

from yellowbrick.features.manifold import Manifold

# Load the classification dataset
data = load_data('occupancy')

# Specify the features of interest
features = [
    "temperature", "relative humidity", "light", "C02", "humidity"
]

# Extract the instances and target from the data frame
X = data[features]
y = data.occupancy

# Instantiate and fit the visualizer, then show the plot
visualizer = Manifold(manifold='tsne', target='discrete')
visualizer.fit_transform(X, y)
visualizer.poof()

../../_images/occupancy_tsne_manifold.png

The visualization also displays the amount of time it takes to generate the embedding; as you can see, this can take a long time even for relatively small datasets. One tip is to scale your data using the StandardScaler; another is to sample your instances (e.g. using train_test_split to preserve class stratification) or to filter features to decrease sparsity in the dataset.
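
A minimal sketch combining both tips, assuming the X and y defined above (the sample fraction and random_state are illustrative):

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Standardize features to zero mean and unit variance before embedding
X_scaled = StandardScaler().fit_transform(X)

# Keep a stratified 20% sample to speed up the manifold computation
X_sample, _, y_sample, _ = train_test_split(
    X_scaled, y, train_size=0.2, stratify=y, random_state=42
)

visualizer = Manifold(manifold='tsne', target='discrete')
visualizer.fit_transform(X_sample, y_sample)
visualizer.poof()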

One common mechanism is to use SelectKBest to select the features that have a statistical correlation with the target. For example, we can use the f_classif score to find the three best features in our occupancy dataset.

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif

# Select the 3 best features, then pass them on to the visualizer
model = Pipeline([
    ("selectk", SelectKBest(k=3, score_func=f_classif)),
    ("viz", Manifold(manifold='isomap', target='discrete')),
])

# Fit the pipeline on the occupancy X and y loaded above, then show the plot
model.fit(X, y)
model.named_steps['viz'].poof()

../../_images/occupancy_select_k_best_isomap_manifold.png

Continuous Target

For a regression target, or to specify color as a heat map of continuous values, specify target='continuous'. Note that the default is target='auto', which determines whether the target is discrete or continuous by counting the number of unique values in y.

# Load the regression dataset
data = load_data('concrete')

# Specify the features of interest
feature_names = [
    'cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age'
]
target_name = 'strength'

# Get the X and y data from the DataFrame
X = data[feature_names]
y = data[target_name]

# Instantiate and fit the visualizer, then show the plot
visualizer = Manifold(manifold='isomap', target='continuous')
visualizer.fit_transform(X, y)
visualizer.poof()

../../_images/concrete_isomap_manifold.png

API Reference

Use manifold algorithms for high dimensional visualization.

class yellowbrick.features.manifold.Manifold(ax=None, manifold='lle', n_neighbors=10, colors=None, target='auto', alpha=0.7, random_state=None, **kwargs)[source]

Bases: yellowbrick.features.base.FeatureVisualizer

The Manifold visualizer provides high dimensional visualization for feature analysis by embedding data into 2 dimensions using the sklearn.manifold package for manifold learning. In brief, manifold learning algorithms are unsupervised approaches to non-linear dimensionality reduction (unlike PCA or SVD) that help visualize latent structures in data.

The manifold algorithm used to do the embedding in scatter plot space can either be a transformer or a string representing one of the already specified manifolds as follows:

Manifold     Description
"lle"        Locally Linear Embedding
"ltsa"       LTSA LLE
"hessian"    Hessian LLE
"modified"   Modified LLE
"isomap"     Isomap
"mds"        Multi-Dimensional Scaling
"spectral"   Spectral Embedding
"tsne"       t-SNE

Each of these algorithms embeds non-linear relationships in different ways, allowing for an exploration of various structures in the feature space. Note, however, that each of these algorithms has different time, memory, and complexity requirements; take special care when using large datasets!
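
As a sketch of the transformer form (the hyperparameter values are illustrative), any scikit-learn manifold estimator that embeds into two components can be passed directly:

from sklearn.manifold import Isomap
from yellowbrick.features.manifold import Manifold

# Pass a pre-configured transformer instead of a string shortcut
visualizer = Manifold(
    manifold=Isomap(n_components=2, n_neighbors=12), target='auto'
)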

The Manifold visualizer also shows the specified target (if given) as the color of the scatter plot. If a classification or clustering target is given, then discrete colors will be used with a legend. If a regression or continuous target is specified, then a colormap and colorbar will be shown.

Parameters:
ax : matplotlib Axes, default: None

The axes to plot the figure on. If None, the current axes will be used or generated if required.

manifold : str or Transformer, default: “lle”

Specify the manifold algorithm to perform the embedding. Either one of the strings listed in the table above, or an actual scikit-learn transformer. The constructed manifold is accessible with the manifold property, so as to modify hyperparameters before fit.

n_neighbors : int, default: 10

Many manifold algorithms are based on nearest neighbors; for those that are, this parameter specifies the number of neighbors to use in the embedding. If the manifold algorithm doesn’t use nearest neighbors, then this parameter is ignored.

colors : str or list of colors, default: None

Specify the colors used, though note that the specification depends very much on whether the target is continuous or discrete. If continuous, colors must be the name of a colormap. If discrete, then colors can be the name of a palette or a list of colors to use for each class in the target.

target : str, default: “auto”

Specify the type of target as either “discrete” (classes) or “continuous” (real numbers, usually for regression). If “auto”, the Manifold will attempt to determine the type by counting the number of unique values.

If the target is discrete, points will be colored by the target class and a legend will be displayed. If continuous, points will be displayed with a colormap and a color bar will be displayed. In either case, if no target is specified, only a single color will be drawn.

alpha : float, default: 0.7

Specify a transparency where 1 is completely opaque and 0 is completely transparent. This property makes densely clustered points more visible.

random_state : int or RandomState, default: None

Fixes the random state for stochastic manifold algorithms.

kwargs : dict

Keyword arguments that are passed to the base class and may influence the feature visualization properties.
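
A sketch pulling several of these parameters together (the specific values are illustrative only):

from yellowbrick.features.manifold import Manifold

visualizer = Manifold(
    manifold='tsne',      # one of the string shortcuts in the table above
    n_neighbors=10,       # used by neighbor-based manifolds, ignored by t-SNE
    colors='plasma',      # interpreted as a colormap for a continuous target
    target='continuous',
    alpha=0.5,            # extra transparency for densely clustered points
    random_state=42,      # fix the stochastic embedding for reproducibility
)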

Notes

Specifying the target as 'continuous' or 'discrete' will influence how the visualizer is finally displayed; don’t rely on the automatic determination from the Manifold!

Scaling your data with the StandardScaler before applying it to the visualizer is a great way of increasing performance. Additionally, using the SelectKBest transformer may also improve performance and lead to better visualizations.

Warning

Manifold visualizers have extremely varying time, resource, and complexity requirements. Sampling data or features may be necessary in order to finish a manifold computation.

See also

The Scikit-Learn discussion on Manifold Learning.

Examples

>>> viz = Manifold(manifold='isomap', target='discrete')
>>> viz.fit_transform(X, y)
>>> viz.poof()

Attributes:
fit_time_ : float

The amount of time in seconds it took to fit the Manifold.

classes_ : np.ndarray, optional

If discrete, the classes identified in the target y.

range_ : tuple of floats, optional

If continuous, the maximum and minimum values in the target y.

ALGORITHMS = {
    'hessian': LocallyLinearEmbedding(eigen_solver='auto', hessian_tol=0.0001,
        max_iter=100, method='hessian', modified_tol=1e-12, n_components=2,
        n_jobs=1, n_neighbors=5, neighbors_algorithm='auto', random_state=None,
        reg=0.001, tol=1e-06),
    'isomap': Isomap(eigen_solver='auto', max_iter=None, n_components=2,
        n_jobs=1, n_neighbors=5, neighbors_algorithm='auto',
        path_method='auto', tol=0),
    'lle': LocallyLinearEmbedding(eigen_solver='auto', hessian_tol=0.0001,
        max_iter=100, method='standard', modified_tol=1e-12, n_components=2,
        n_jobs=1, n_neighbors=5, neighbors_algorithm='auto', random_state=None,
        reg=0.001, tol=1e-06),
    'ltsa': LocallyLinearEmbedding(eigen_solver='auto', hessian_tol=0.0001,
        max_iter=100, method='ltsa', modified_tol=1e-12, n_components=2,
        n_jobs=1, n_neighbors=5, neighbors_algorithm='auto', random_state=None,
        reg=0.001, tol=1e-06),
    'mds': MDS(dissimilarity='euclidean', eps=0.001, max_iter=300, metric=True,
        n_components=2, n_init=4, n_jobs=1, random_state=None, verbose=0),
    'modified': LocallyLinearEmbedding(eigen_solver='auto', hessian_tol=0.0001,
        max_iter=100, method='modified', modified_tol=1e-12, n_components=2,
        n_jobs=1, n_neighbors=5, neighbors_algorithm='auto', random_state=None,
        reg=0.001, tol=1e-06),
    'spectral': SpectralEmbedding(affinity='nearest_neighbors',
        eigen_solver=None, gamma=None, n_components=2, n_jobs=1,
        n_neighbors=None, random_state=None),
    'tsne': TSNE(angle=0.5, early_exaggeration=12.0, init='pca',
        learning_rate=200.0, method='barnes_hut', metric='euclidean',
        min_grad_norm=1e-07, n_components=2, n_iter=1000,
        n_iter_without_progress=300, perplexity=30.0, random_state=None,
        verbose=0),
}

draw(X, y=None)[source]

Draws the points described by X and colored by the points in y. Can be called multiple times before finalize to add more scatter plots to the axes; however, fit() must be called before use.

Parameters:
X : array-like of shape (n, 2)

The matrix produced by the transform() method.

y : array-like of shape (n,), optional

The target, used to specify the colors of the points.

Returns:
self.ax : matplotlib Axes object

Returns the axes that the scatter plot was drawn on.

finalize()[source]

Add title and modify axes to make the image ready for display.

fit(X, y=None)[source]

Fits the manifold on X and transforms the data to plot it on the axes. If the optional y is specified, it can be used to declare discrete colors. If the target is set to ‘auto’, this method also determines the target type, and therefore what colors will be used.

Note also that fit records the amount of time it takes to fit the manifold and reports that information in the visualization.

Parameters:
X : array-like of shape (n, m)

A matrix or data frame with n instances and m features where m > 2.

y : array-like of shape (n,), optional

A vector or series with target values for each instance in X. This vector is used to determine the color of the points in X.

Returns:
self : Manifold

Returns the visualizer object.

manifold

Property containing the manifold transformer constructed from the supplied hyperparameter. Use this property to modify the manifold before fit with manifold.set_params().
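
A minimal sketch, assuming X and y are already loaded (the t-SNE hyperparameter values are illustrative):

from yellowbrick.features.manifold import Manifold

# Tune the underlying transformer before fitting
visualizer = Manifold(manifold='tsne')
visualizer.manifold.set_params(perplexity=15, learning_rate=100.0)
visualizer.fit(X, y)
visualizer.poof()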

transform(X)[source]

Returns the transformed data points from the manifold embedding.

Parameters:
X : array-like of shape (n, m)

A matrix or data frame with n instances and m features

Returns:
Xprime : array-like of shape (n, 2)

Returns the 2-dimensional embedding of the instances.
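
For example, a minimal sketch of calling fit and transform separately, assuming the X and y used above ('isomap' is chosen because its underlying estimator supports transforming new data):

visualizer = Manifold(manifold='isomap')
visualizer.fit(X, y)                 # fits the manifold and draws the scatter plot
X_prime = visualizer.transform(X)    # array of shape (n, 2)
visualizer.poof()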