The dataset was created by Angeliki Xifara (angxifara ‘@’ gmail.com, Civil/Structural Engineer) and was processed by Athanasios Tsanas (tsanasthanasis ‘@’ gmail.com, Oxford Centre for Industrial and Applied Mathematics, University of Oxford, UK).
We perform energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. We simulate various settings as functions of the aforementioned characteristics to obtain 768 building shapes. The dataset comprises 768 samples and 8 features, aiming to predict two real-valued responses. It can also be used as a multi-class classification problem if the response is rounded to the nearest integer.
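As a rough illustration of the rounding step mentioned above, the sketch below turns a few real-valued loads into integer class labels. The values are made up for illustration and are not taken from the dataset, and `to_class` is a hypothetical helper. It rounds halves up, because Python's built-in `round` rounds halves to the nearest even integer.

```python
import math
from collections import Counter


def to_class(load):
    """Round a non-negative load to the nearest integer, with halves rounded up."""
    return math.floor(load + 0.5)


# Made-up heating-load values, for illustration only.
heating_loads = [15.55, 15.2, 20.84, 21.46, 32.5]

# Each distinct integer becomes one class label.
classes = [to_class(value) for value in heating_loads]

# Class frequencies, useful for spotting sparsely populated classes.
class_counts = Counter(classes)
```

Rounding a continuous load this way can produce many classes with few samples each, so checking the class counts before training a classifier is a sensible first step.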
The energy dataset is a multi-target supervised dataset covering both the heating and the cooling load of buildings. By default, only the heating load is returned for most examples. To perform a multi-target regression, access the dataframe and select both the heating and cooling load columns as follows:
    from yellowbrick.datasets import load_energy
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split as tts

    features = [
        "relative compactness",
        "surface area",
        "wall area",
        "roof area",
        "overall height",
        "orientation",
        "glazing area",
        "glazing area distribution",
    ]
    target = ["heating load", "cooling load"]

    df = load_energy(return_dataset=True).to_dataframe()
    X, y = df[features], df[target]

    X_train, X_test, y_train, y_test = tts(X, y, test_size=0.2)

    model = RandomForestRegressor().fit(X_train, y_train)
    model.score(X_test, y_test)
Note that not all regressors support multi-target regression. One simple strategy in this case is to use a sklearn.multioutput.MultiOutputRegressor, which fits one estimator for each target.
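A minimal sketch of that strategy follows. It assumes scikit-learn is installed and uses synthetic data from `make_regression` as a stand-in for the energy features, so that it runs without downloading anything. `SVR` is chosen here only as an example of a regressor that handles a single target.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Synthetic stand-in for the energy data: 8 features, 2 targets.
X, y = make_regression(n_samples=200, n_features=8, n_targets=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# SVR handles a single target only, so the wrapper fits one SVR per target column.
model = MultiOutputRegressor(SVR()).fit(X_train, y_train)

predictions = model.predict(X_test)  # one column of predictions per target
r2 = model.score(X_test, y_test)     # R^2 averaged uniformly over the targets
```

After fitting, the individual per-target estimators are available in `model.estimators_`. To try this on the real data, replace the synthetic `X` and `y` with the feature and target columns selected from the energy dataframe.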
Downloaded from the UCI Machine Learning Repository March 23, 2015.
A. Tsanas, A. Xifara: ‘Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools’, Energy and Buildings, Vol. 49, pp. 560-567, 2012
For further details on the data analysis methodology:
A. Tsanas, ‘Accurate telemonitoring of Parkinson’s disease symptom severity using nonlinear speech signal processing and statistical machine learning’, D.Phil. thesis, University of Oxford, 2012
- yellowbrick.datasets.loaders.load_energy(data_home=None, return_dataset=False)
Loads the energy multivariate dataset that is well suited to multi-output regression and classification tasks. The dataset contains 768 instances and 8 real-valued attributes with two continuous targets.
The Yellowbrick datasets are hosted online; when a dataset is requested, it is downloaded to your local computer for use. Note that if the dataset hasn’t been downloaded before, an Internet connection is required. However, if the data is cached locally, no data will be downloaded. Yellowbrick checks the downloaded data against the known signature of the dataset to ensure the download completed successfully.
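The signature check described above can be pictured as comparing a file digest against a known value. The sketch below is an illustration under stated assumptions, not Yellowbrick's own code: it assumes the signature is a SHA-256 hex digest, and the names `sha256sum` and `verify_download` are hypothetical.

```python
import hashlib


def sha256sum(path, blocksize=65536):
    """Return the hex SHA-256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: f.read(blocksize), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_signature):
    """Return True if the file's digest matches the expected signature."""
    return sha256sum(path) == expected_signature
```

A caller would compute or look up the expected digest ahead of time and discard the downloaded file when `verify_download` returns False.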
Datasets are stored alongside the code, but the location can be specified with the data_home parameter or the $YELLOWBRICK_DATA environment variable.
- data_home: str, optional
The path on disk where data is stored. If not passed in, it is looked up from
$YELLOWBRICK_DATA or the default returned by
- return_dataset: bool, default=False
Return the raw dataset object instead of X and y numpy arrays to get access to alternative targets, extra features, content and meta.
- X: array-like with shape (n_instances, n_features) if return_dataset=False
A pandas DataFrame or numpy array describing the instance features.
- y: array-like with shape (n_instances,) if return_dataset=False
A pandas Series or numpy array describing the target vector.
- dataset: Dataset instance if return_dataset=True
The Yellowbrick Dataset object provides an interface for accessing the data in a variety of formats, as well as the associated metadata and content.
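The lookup order described for data_home (an explicit argument first, then the YELLOWBRICK_DATA environment variable, then a default) can be sketched as follows. This is an illustration only: `resolve_data_home` is a hypothetical helper, and the `fallback` path is a placeholder rather than the library's actual default location.

```python
import os


def resolve_data_home(data_home=None, fallback="~/yellowbrick_data"):
    """Hypothetical helper: choose a data directory in the documented order."""
    if data_home is None:
        # No explicit path given: consult the environment, then the placeholder.
        data_home = os.environ.get("YELLOWBRICK_DATA", fallback)
    # Expand "~" and normalize to an absolute path.
    return os.path.abspath(os.path.expanduser(data_home))
```

Setting the environment variable once is convenient when several scripts should share a single cache, while passing data_home explicitly suits one-off overrides.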