Credit

This research examines customers’ default payments in Taiwan and compares the predictive accuracy of the probability of default among six data mining methods.

Samples total	30000
Dimensionality	24
Features	real, int
Targets	int, 0 or 1
Task(s)	classification

Description

This research examines customers’ default payments in Taiwan and compares the predictive accuracy of the probability of default among six data mining methods. From a risk management perspective, the predictive accuracy of the estimated probability of default is more valuable than the binary classification of clients as credible or not credible. Because the real probability of default is unknown, the study presents the novel “Sorting Smoothing Method” to estimate it. With the real probability of default as the response variable (Y) and the predicted probability of default as the independent variable (X), a simple linear regression (Y = A + BX) shows that the forecasting model produced by the artificial neural network has the highest coefficient of determination; its regression intercept (A) is close to zero and its regression coefficient (B) is close to one. Therefore, among the six data mining techniques, the artificial neural network is the only one that can accurately estimate the real probability of default.
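The regression check described above (Y = A + BX, with A near zero, B near one, and a high coefficient of determination indicating well-calibrated probabilities) can be illustrated with a minimal sketch. The function name `fit_calibration_line` and the sample values are hypothetical; this is an ordinary least-squares fit written for illustration, not the authors' code, and it does not implement the Sorting Smoothing Method itself.

```python
from statistics import fmean


def fit_calibration_line(predicted, actual):
    """Fit actual = A + B * predicted by ordinary least squares.

    Returns the tuple (A, B, r_squared).
    """
    if len(predicted) != len(actual) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")

    mean_x = fmean(predicted)
    mean_y = fmean(actual)

    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in predicted)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predicted, actual))
    if sxx == 0:
        raise ValueError("predicted values must not all be identical")

    slope = sxy / sxx                      # B
    intercept = mean_y - slope * mean_x    # A

    # Coefficient of determination: 1 - residual SS / total SS.
    ss_res = sum((y - (intercept + slope * x)) ** 2
                 for x, y in zip(predicted, actual))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return intercept, slope, r_squared


# Made-up illustrative values, not taken from the paper or the dataset.
predicted = [0.05, 0.20, 0.35, 0.50, 0.80]
actual = [0.04, 0.22, 0.33, 0.52, 0.79]
A, B, r2 = fit_calibration_line(predicted, actual)
print(f"A={A:.3f}, B={B:.3f}, R^2={r2:.3f}")
```

A model whose predicted probabilities track the estimated real probabilities closely would yield A near 0, B near 1, and R^2 near 1 under this fit.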

Citation

Downloaded from the UCI Machine Learning Repository on October 13, 2016.

Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.

Loader

yellowbrick.datasets.loaders.load_credit(data_home=None, return_dataset=False)

Loads the multivariate credit dataset, which is well suited to binary classification tasks. The dataset contains 30000 instances and 23 integer and real-valued attributes with a discrete target.

The Yellowbrick datasets are hosted online, and when one is requested it is downloaded to your local computer for use. If the dataset hasn’t been downloaded before, an Internet connection is required; if the data is already cached locally, nothing is downloaded. Yellowbrick checks the downloaded data against the known signature of the dataset to ensure the download completed successfully.
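The signature check can be pictured with a short sketch. It assumes the signature is a SHA-256 hex digest of the downloaded file, which the text above does not state; the function names `sha256sum` and `verify_download` are used here for illustration only and are not presented as Yellowbrick's internal API.

```python
import hashlib


def sha256sum(path, blocksize=65536):
    """Return the SHA-256 hex digest of the file at `path`.

    The file is read in chunks so large archives do not need to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(blocksize), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_signature):
    """Return True if the file's digest matches the expected signature."""
    # Assumption: the known signature is a lowercase SHA-256 hex string.
    return sha256sum(path) == expected_signature.lower()
```

A mismatch would typically mean the download was truncated or corrupted, in which case the file should be fetched again.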

Datasets are stored alongside the code, but the location can be specified with the data_home parameter or the $YELLOWBRICK_DATA environment variable.

Parameters

data_home : str, optional
    The path on disk where data is stored. If not passed in, it is looked up from $YELLOWBRICK_DATA or the default returned by get_data_home.
return_dataset : bool, default=False
    Return the raw dataset object instead of X and y numpy arrays to get access to alternative targets, extra features, content and meta.

Returns

X : array-like with shape (n_instances, n_features) if return_dataset=False
    A pandas DataFrame or numpy array describing the instance features.
y : array-like with shape (n_instances,) if return_dataset=False
    A pandas Series or numpy array describing the target vector.
dataset : Dataset instance if return_dataset=True
    The Yellowbrick Dataset object provides an interface to accessing the data in a variety of formats as well as associated metadata and content.
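As a rough usage sketch of the two return modes, the helper below calls the loader once for X and y and once for the Dataset object. The wrapper name `load_credit_both` is hypothetical, and the import sits inside the function only so that defining it does not require Yellowbrick to be installed. Running it needs Yellowbrick and, on first use, an Internet connection.

```python
def load_credit_both(data_home=None):
    """Load the credit data as (X, y) and as a Dataset object.

    `data_home` is forwarded to the loader; when None, the loader falls back
    to $YELLOWBRICK_DATA or the default data directory.
    """
    # Import path taken from the loader signature documented above.
    from yellowbrick.datasets.loaders import load_credit

    # Default mode: feature matrix and target vector.
    X, y = load_credit(data_home=data_home)

    # Alternative mode: the Dataset object, which also exposes metadata.
    dataset = load_credit(data_home=data_home, return_dataset=True)

    return X, y, dataset
```

Calling `load_credit_both()` with no arguments uses the default storage location; passing a directory path stores and reads the cached files there instead.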