Balanced Binning Reference

Frequently, machine learning problems in the real world suffer from the curse of dimensionality; you have fewer training instances than you’d like and the predictive signal is distributed (often unpredictably!) across many different features.

Sometimes, when your target variable is continuously valued, there simply aren’t enough instances to predict these values to the precision of regression. In this case, we can sometimes transform the regression problem into a classification problem by binning the continuous values into makeshift classes.
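As a rough illustration of that transformation, the sketch below maps continuous values to class indices using a list of cut points. The function name `bin_targets`, the sample values, and the cut points are all hypothetical and are not part of the visualizer's API.

```python
from bisect import bisect_right


def bin_targets(values, edges):
    """Map each continuous value to the index of the bin it falls into.

    `edges` holds the interior cut points in ascending order, so
    len(edges) + 1 classes are produced.
    """
    return [bisect_right(edges, v) for v in values]


# Hypothetical target values and hand-picked cut points
strengths = [12.5, 33.0, 47.8, 61.2, 25.4, 40.1]
cut_points = [25.0, 40.0, 55.0]

labels = bin_targets(strengths, cut_points)
print(labels)  # [0, 1, 2, 3, 1, 2]
```

The resulting integer labels can then be used as the target of a classifier in place of the original continuous values.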

To help the user select the optimal number of bins, the BalancedBinningReference visualizer takes the target variable y as input and generates a histogram with vertical lines indicating the recommended value points to ensure that the data is evenly distributed into each bin.

from yellowbrick.target import BalancedBinningReference

# Load a regression data set
data = load_data("concrete")

# Extract the target of interest
y = data["strength"]

# Instantiate the visualizer
visualizer = BalancedBinningReference()

visualizer.fit(y)          # Fit the data to the visualizer
visualizer.poof()          # Draw/show/poof the data
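To make the idea of evenly populated bins concrete, here is a minimal sketch that derives cut points from quantiles using only the standard library. It illustrates balanced binning in general and is not necessarily how the visualizer computes its reference lines. The helper `balanced_cut_points` and the sample data are assumptions made for this example.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles


def balanced_cut_points(values, n_bins=4):
    """Return n_bins - 1 interior cut points that split `values`
    into groups of roughly equal size (quantile binning)."""
    return quantiles(values, n=n_bins, method="inclusive")


# Hypothetical, right-skewed target values
y = [3, 7, 8, 12, 15, 21, 22, 30, 41, 55, 60, 90]

cuts = balanced_cut_points(y, n_bins=4)

# Count how many values land in each bin defined by the cut points
counts = Counter(bisect_right(cuts, v) for v in y)
print(cuts)
print(sorted(counts.items()))
```

With skewed data like this, equal-width bins would leave most values in the first bin, whereas quantile-based cut points keep the class sizes close to one another.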

See also

To learn more, please read Rebecca Bilbro’s article “Creating Categorical Variables from Continuous Data.”

API Reference

Implements histogram with vertical lines to help with balanced binning.

class BalancedBinningReference(ax=None, target=None, bins=4, **kwargs)


BalancedBinningReference generates a histogram with vertical lines showing the recommended value points at which to bin your data so that the instances are evenly distributed across the bins.

ax : matplotlib Axes, default: None

This is inherited from FeatureVisualizer and is defined within BalancedBinningReference.

target : string, default: “Frequency”

The name of the y variable

bins : number of bins to generate the histogram, default: 4

kwargs : dict

Keyword arguments that are passed to the base class and may influence the visualization as defined in other Visualizers.


These parameters can be influenced later on in the visualization process, but can and should be set as early as possible.


>>> visualizer = BalancedBinningReference()
>>> visualizer.fit(y)
>>> visualizer.poof()

bin_edges : binning reference values

draw(y, **kwargs)

Draws a histogram with the reference value for binning as vertical lines.

y : an array of one dimension or a pandas Series

finalize(**kwargs)

Finalize executes any subclass-specific axes finalization steps. The user calls poof and poof calls finalize.

kwargs : generic keyword arguments.

fit(y, **kwargs)

Sets up y for the histogram and checks to ensure that y is of the correct data type. Fit calls draw.

y : an array of one dimension or a pandas Series
kwargs : dict

Keyword arguments passed to the scikit-learn API.
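The kind of type check that fit describes might look something like the sketch below. This is an assumption-laden illustration in plain Python, not the library's actual validation logic. The name `validate_target` is hypothetical.

```python
from numbers import Real


def validate_target(y):
    """Return `y` as a flat list of floats.

    Raises TypeError if any element is not a real number, which also
    rejects nested (multi-dimensional) input such as a list of lists.
    """
    values = list(y)
    for v in values:
        if not isinstance(v, Real):
            raise TypeError(
                "y must be one-dimensional and numeric, "
                f"got element of type {type(v).__name__}"
            )
    return [float(v) for v in values]


# A one-dimensional sequence passes and is normalized to floats
clean = validate_target((30.5, 42, 18.25))
print(clean)  # [30.5, 42.0, 18.25]
```

Validating the input before drawing means a malformed target fails early with a clear message rather than producing a misleading histogram.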


Creates the labels for the feature and target variables.