Results for Standard Classification and Regression Machine Learning Datasets

Author: Jason Brownlee

It is important that beginner machine learning practitioners practice on small real-world datasets.

So-called standard machine learning datasets contain actual observations, fit into memory, and are well studied and well understood. As such, they can be used by beginner practitioners to quickly test, explore, and practice data preparation and modeling techniques.

A practitioner can confirm whether they have the data skills required to achieve a good result on a standard machine learning dataset. A good result is one above the 80th or 90th percentile of what may be technically possible for a given dataset.

The skills developed by practitioners on standard machine learning datasets can provide the foundation for tackling larger, more challenging projects.

In this post, you will discover standard machine learning datasets for classification and regression and the baseline and good results that one may expect to achieve on each.

After reading this post, you will know:

  • The importance of standard machine learning datasets.
  • How to systematically evaluate a model on a standard machine learning dataset.
  • Standard datasets for classification and regression and the baseline and good performance expected on each.

Let’s get started.

Overview

This tutorial is divided into seven parts; they are:

  1. Value of Small Machine Learning Datasets
  2. Definition of a Standard Machine Learning Dataset
  3. Standard Machine Learning Datasets
  4. Good Results for Standard Datasets
  5. Model Evaluation Methodology
  6. Results for Classification Datasets
    1. Binary Classification Datasets
      1. Ionosphere
      2. Pima Indian Diabetes
      3. Sonar
      4. Wisconsin Breast Cancer
      5. Horse Colic
    2. Multiclass Classification Datasets
      1. Iris Flowers
      2. Glass
      3. Wine
      4. Wheat Seeds
  7. Results for Regression Datasets
    1. Housing
    2. Auto Insurance
    3. Abalone
    4. Auto Imports

Value of Small Machine Learning Datasets

There are a number of small machine learning datasets for classification and regression predictive modeling problems that are frequently reused.

Sometimes the datasets are used as the basis for demonstrating a machine learning or data preparation technique. Other times, they are used as a basis for comparing different techniques.

These datasets were collected and made publicly available in the early days of applied machine learning, when data and real-world datasets were scarce. As such, they have become standard or canonical through their wide adoption and reuse alone, not because of any intrinsic interestingness of the problems.

Finding a good model on one of these datasets does not mean you have “solved” the general problem. Also, some of the datasets may contain names or indicators that might be considered questionable or culturally insensitive (which was very likely not the intent when the data was collected). As such, they are also sometimes referred to as “toy” datasets.

Such datasets are of limited use as points of comparison between machine learning algorithms, as most published empirical experiments on them are nearly impossible to reproduce.

Nevertheless, such datasets remain valuable in applied machine learning today, even in the era of standard machine learning libraries, big data, and an abundance of data.

There are three main reasons why they are valuable; they are:

  1. The datasets are real.
  2. The datasets are small.
  3. The datasets are understood.

Real datasets are useful as compared to contrived datasets because they are messy. There may be measurement errors, missing values, mislabeled examples, and more. Some or all of these issues must be searched for and addressed, and they are exactly the kinds of properties we may encounter when working on our own projects.

Small datasets are useful as compared to large datasets that may be many gigabytes in size. Small datasets easily fit into memory and allow many different data visualization, data preparation, and modeling algorithms to be tested and explored quickly. Speed of testing ideas and getting feedback is critical for beginners, and small datasets facilitate exactly this.

Understood datasets are useful as compared to new or newly created datasets. The features are well defined, the units of the features are specified, the source of the data is known, and the dataset has been well studied in tens, hundreds, and in some cases, thousands of research projects and papers. This provides a context in which results can be compared and evaluated, a property not available in entirely new domains.

Given these properties, I strongly advocate that machine learning beginners (and practitioners new to a specific technique) start with standard machine learning datasets.

Definition of a Standard Machine Learning Dataset

I would like to go one step further and define some more specific properties of a “standard” machine learning dataset.

A standard machine learning dataset has the following properties.

  • Less than 10,000 rows (samples).
  • Less than 100 columns (features).
  • Last column is the target variable.
  • Stored in a single file with CSV format and without header line.
  • Missing values marked with a question mark character (‘?’).
  • It is possible to achieve a better than naive result.
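
For example, a dataset with these properties can be loaded and split into input and output columns in a few lines of Python; a minimal sketch (the filename is illustrative):

from pandas import read_csv
# load a standard dataset: CSV format, no header row, '?' marks missing values
dataframe = read_csv('dataset.csv', header=None, na_values='?')
data = dataframe.values
# the last column is the target variable
X, y = data[:, :-1], data[:, -1]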

Now that we have a clear definition of a standard dataset, let’s look at the datasets themselves.

Standard Machine Learning Datasets

A dataset is a standard machine learning dataset if it is frequently used in books, research papers, tutorials, presentations, and more.

The best repository for these so-called classical or standard machine learning datasets is the University of California at Irvine (UCI) machine learning repository. This website categorizes datasets by type, provides downloads of the data and additional information about each dataset, and references relevant papers.

I have chosen five or fewer datasets for each problem type as a starting point.

All standard datasets used in this post are available on GitHub here: https://github.com/jbrownlee/Datasets

Download links for each dataset, and for additional details about the dataset (the so-called “.names” file), are available from this repository.

Each code example will automatically download a given dataset for you. If this is a problem, you can download the CSV file manually, place it in the same directory as the code example, then change the code example to use the filename instead of the URL.

For example:

...
# load dataset
dataframe = read_csv('ionosphere.csv', header=None)

Good Results for Standard Datasets

A challenge for beginners working with standard machine learning datasets is knowing what represents a good result.

In general, a model is skillful if it can demonstrate performance better than a naive method, such as predicting the majority class in classification or the mean value in regression. This is called a baseline model, or a baseline of performance, and it provides a relative measure of performance specific to a dataset.
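
As a concrete illustration, both naive baselines can be defined in scikit-learn using the DummyClassifier and DummyRegressor classes; a minimal sketch (the ‘mean‘ strategy is shown for regression here, although the regression examples later in this post use ‘median‘):

from sklearn.dummy import DummyClassifier
from sklearn.dummy import DummyRegressor
# naive classifier: always predicts the majority class
clf_baseline = DummyClassifier(strategy='most_frequent')
# naive regressor: always predicts the mean of the training targets
reg_baseline = DummyRegressor(strategy='mean')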

Given that we now have a method for determining whether a model has skill on a dataset, beginners are naturally interested in the upper limits of performance for a given dataset. This information is required to know whether you are “getting good” at the process of applied machine learning.

Good does not mean perfect predictions. All models will have prediction errors, and perfect predictions are not possible on real-world datasets.

Defining “good” or “best” results for a dataset is challenging because it depends on the model evaluation methodology generally, and on the versions of the dataset and libraries used in the evaluation specifically.

Good means “good-enough” given available resources. Often, this means a skill score that is above the 80th or 90th percentile of what might be possible for a dataset given unbounded skill, time, and computational resources.

In this tutorial, you will discover the baseline performance and the “good” (near-best) performance that is possible on each dataset, along with the data preparation and model used to achieve that performance.

Rather than explaining how to do this, each section gives a short Python code example that you can use to reproduce the baseline and the good result.

Model Evaluation Methodology

The evaluation methodology is simple and fast, and generally recommended when working with small predictive modeling problems.

Each model is evaluated as follows:

  • A model is evaluated using 10-fold cross-validation.
  • The evaluation procedure is repeated three times.
  • The random seed for the cross-validation procedure is fixed (random_state=1 in the examples below), so every model is evaluated on identical splits.

This results in 30 estimates of model performance from which a mean and standard deviation can be calculated to summarize the performance of a given model.

Fixing the seed for the cross-validation procedure ensures that each algorithm evaluated on the dataset gets the same splits of the data, ensuring a fair, direct comparison.

Using the scikit-learn Python machine learning library, the example below can be used to evaluate a given model (or Pipeline). The RepeatedStratifiedKFold class defines the number of folds and repeats for classification, and the cross_val_score() function specifies the scoring metric, performs the evaluation, and returns a list of scores from which a mean and standard deviation can be calculated.

...
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')

For regression, we can use the RepeatedKFold class and the MAE (mean absolute error) score.

...
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')

The “good” scores reported are the best that I can get out of my own personal set of “get a good result fast on a given dataset” scripts. I believe the scores represent good scores that can be achieved on each dataset, perhaps in the 90th or 95th percentile of what is possible for each dataset, if not better.

That being said, I am not claiming that they are the best possible scores, as I have not performed hyperparameter tuning of the well-performing models; I leave that as an exercise for interested practitioners. Best scores are not required: a top-percentile score is more than sufficient to demonstrate competence on a given dataset.

Note: I will update the results and models as I improve my own personal scripts and achieve better scores.

Can you get a better score for a dataset?
I would love to know. Share your model and score in the comments below and I will try to reproduce it and update the post (and give you full credit!).

Let’s dive in.

Results for Classification Datasets

Classification is a predictive modeling problem that predicts one label given one or more input variables.

The baseline model for classification tasks is a model that predicts the majority label. This can be achieved in scikit-learn using the DummyClassifier class with the ‘most_frequent‘ strategy; for example:

...
model = DummyClassifier(strategy='most_frequent')

The standard evaluation for classification models is classification accuracy, although this is not ideal for imbalanced and some multi-class problems. Nevertheless, for better or worse, this score will be used (for now).

Accuracy is reported as a fraction between 0 (0% or no skill) and 1 (100% or perfect skill).

There are two main types of classification task, binary and multiclass, distinguished by whether the number of labels to be predicted for a given dataset is two or more than two. Given the prevalence of classification tasks in machine learning, we will treat these two subtypes separately.

Binary Classification Datasets

In this section, we will review the baseline and good performance on the following binary classification predictive modeling datasets:

  1. Ionosphere
  2. Pima Indian Diabetes
  3. Sonar
  4. Wisconsin Breast Cancer
  5. Horse Colic

Ionosphere

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Ionosphere
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# evaluate naive
naive = DummyClassifier(strategy='most_frequent')
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = SVC(kernel='rbf', gamma='scale', C=10)
steps = [('s',StandardScaler()), ('n',MinMaxScaler()), ('m',model)]
pipeline = Pipeline(steps=steps)
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (351, 34), (351,)
Baseline: 0.641 (0.006)
Good: 0.948 (0.033)

Pima Indian Diabetes

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Pima Indian Diabetes
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# evaluate naive
naive = DummyClassifier(strategy='most_frequent')
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = LogisticRegression(solver='newton-cg',penalty='l2',C=1)
m_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Note: you may see some warnings, but they can be safely ignored.

Shape: (768, 8), (768,)
Baseline: 0.651 (0.003)
Good: 0.774 (0.055)

Sonar

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Sonar
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# evaluate naive
naive = DummyClassifier(strategy='most_frequent')
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = KNeighborsClassifier(n_neighbors=2, metric='minkowski', weights='distance')
steps = [('p',PowerTransformer()), ('m',model)]
pipeline = Pipeline(steps=steps)
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (208, 60), (208,)
Baseline: 0.534 (0.012)
Good: 0.882 (0.071)

Wisconsin Breast Cancer

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Wisconsin Breast Cancer
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.impute import SimpleImputer
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer-wisconsin.csv'
dataframe = read_csv(url, header=None, na_values='?')
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# evaluate naive
naive = DummyClassifier(strategy='most_frequent')
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = SVC(kernel='sigmoid', gamma='scale', C=0.1)
steps = [('i',SimpleImputer(strategy='median')), ('p',PowerTransformer()), ('m',model)]
pipeline = Pipeline(steps=steps)
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Note: you may see some warnings, but they can be safely ignored.

Shape: (699, 9), (699,)
Baseline: 0.655 (0.003)
Good: 0.973 (0.019)

Horse Colic

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Horse Colic
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.dummy import DummyClassifier
from xgboost import XGBClassifier
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# evaluate naive
naive = DummyClassifier(strategy='most_frequent')
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = XGBClassifier(learning_rate=0.1, n_estimators=100, subsample=1, max_depth=3, colsample_bynode=0.4)
steps = [('i',SimpleImputer(strategy='median')), ('m',model)]
pipeline = Pipeline(steps=steps)
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (300, 27), (300,)
Baseline: 0.670 (0.007)
Good: 0.852 (0.048)

Multiclass Classification Datasets

In this section, we will review the baseline and good performance on the following multiclass classification predictive modeling datasets:

  1. Iris Flowers
  2. Glass
  3. Wine
  4. Wheat Seeds

Iris Flowers

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Iris
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.dummy import DummyClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# evaluate naive
naive = DummyClassifier(strategy='most_frequent')
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = LinearDiscriminantAnalysis()
steps = [('p',PowerTransformer()), ('m',model)]
pipeline = Pipeline(steps=steps)
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (150, 4), (150,)
Baseline: 0.333 (0.000)
Good: 0.980 (0.039)

Glass

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Glass
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/glass.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# evaluate naive
naive = DummyClassifier(strategy='most_frequent')
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = RandomForestClassifier(n_estimators=100,max_features=2)
m_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (214, 9), (214,)
Baseline: 0.356 (0.013)
Good: 0.744 (0.085)

Wine

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Wine
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.dummy import DummyClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# evaluate naive
naive = DummyClassifier(strategy='most_frequent')
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = QuadraticDiscriminantAnalysis()
steps = [('s',StandardScaler()), ('n',MinMaxScaler()), ('m',model)]
pipeline = Pipeline(steps=steps)
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (178, 13), (178,)
Baseline: 0.399 (0.017)
Good: 0.992 (0.020)

Wheat Seeds

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Wheat Seeds
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import RidgeClassifier
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wheat-seeds.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# evaluate naive
naive = DummyClassifier(strategy='most_frequent')
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = RidgeClassifier(alpha=0.2)
steps = [('s',StandardScaler()), ('m',model)]
pipeline = Pipeline(steps=steps)
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (210, 7), (210,)
Baseline: 0.333 (0.000)
Good: 0.973 (0.036)

Results for Regression Datasets

Regression is a predictive modeling problem that predicts a numerical value given one or more input variables.

The baseline model for regression tasks is a model that predicts the mean or median value. This can be achieved in scikit-learn using the DummyRegressor class with the ‘median‘ strategy; for example:

...
model = DummyRegressor(strategy='median')

The standard evaluation for regression models is mean absolute error (MAE), although this is not ideal for all regression problems. Nevertheless, for better or worse, this score will be used (for now).

MAE is reported as an error score with a minimum of 0 (perfect skill) and no upper bound; larger values indicate less skill.
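
For reference, MAE is simply the average of the absolute differences between the predicted and the true values. A minimal sketch of the calculation, using illustrative values:

from numpy import array
from numpy import mean
from numpy import absolute
# illustrative true and predicted target values
y_true = array([3.0, 5.0, 2.5, 7.0])
y_pred = array([2.5, 5.0, 4.0, 8.0])
# mean absolute error: average magnitude of the prediction errors
print(mean(absolute(y_true - y_pred))) # 0.75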

In this section, we will review the baseline and good performance on the following regression predictive modeling datasets:

  1. Housing
  2. Auto Insurance
  3. Abalone
  4. Auto Imports

Housing

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Housing
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.dummy import DummyRegressor
from xgboost import XGBRegressor
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = y.astype('float32')
# evaluate naive
naive = DummyRegressor(strategy='median')
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
n_scores = absolute(n_scores)
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = XGBRegressor(learning_rate=0.1, n_estimators=100, subsample=0.7, max_depth=9, colsample_bynode=0.6, objective='reg:squarederror')
m_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
m_scores = absolute(m_scores)
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (506, 13), (506,)
Baseline: 6.660 (0.706)
Good: 1.955 (0.279)

Auto Insurance

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Auto Insurance
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import PowerTransformer
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import HuberRegressor
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
X = X.astype('float32')
y = y.astype('float32')
# evaluate naive
naive = DummyRegressor(strategy='median')
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(naive, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
n_scores = absolute(n_scores)
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = HuberRegressor(epsilon=1.0, alpha=0.001)
steps = [('p',PowerTransformer()), ('m',model)]
pipeline = Pipeline(steps=steps)
target = TransformedTargetRegressor(regressor=pipeline, transformer=PowerTransformer())
m_scores = cross_val_score(target, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
m_scores = absolute(m_scores)
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (63, 1), (63,)
Baseline: 66.624 (19.303)
Good: 28.358 (9.747)

Abalone

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Abalone
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PowerTransformer
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyRegressor
from sklearn.svm import SVR
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/abalone.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
# minimally prepare dataset
y = y.astype('float32')
# evaluate naive
naive = DummyRegressor(strategy='median')
transform = ColumnTransformer(transformers=[('c', OneHotEncoder(), [0])], remainder='passthrough')
pipeline = Pipeline(steps=[('ColumnTransformer',transform), ('Model',naive)])
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
n_scores = absolute(n_scores)
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = SVR(kernel='rbf',gamma='scale',C=10)
target = TransformedTargetRegressor(regressor=model, transformer=PowerTransformer(), check_inverse=False)
pipeline = Pipeline(steps=[('ColumnTransformer',transform), ('Model',target)])
m_scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
m_scores = absolute(m_scores)
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (4177, 8), (4177,)
Baseline: 2.363 (0.116)
Good: 1.460 (0.075)

Auto Imports

The complete code example for achieving baseline and a good result on this dataset is listed below.

# baseline and good result for Auto Imports
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto_imports.csv'
dataframe = read_csv(url, header=None, na_values='?')
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print('Shape: %s, %s' % (X.shape,y.shape))
y = y.astype('float32')
# evaluate naive
naive = DummyRegressor(strategy='median')
cat_ix = [2,3,4,5,6,7,8,14,15,17]
num_ix = [0,1,9,10,11,12,13,16,18,19,20,21,22,23,24]
steps = [('c', Pipeline(steps=[('s',SimpleImputer(strategy='most_frequent')),('oe',OneHotEncoder(handle_unknown='ignore'))]), cat_ix), ('n', SimpleImputer(strategy='median'), num_ix)]
transform = ColumnTransformer(transformers=steps, remainder='passthrough')
pipeline = Pipeline(steps=[('ColumnTransformer',transform), ('Model',naive)])
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
n_scores = absolute(n_scores)
print('Baseline: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# evaluate model
model = RandomForestRegressor(n_estimators=100,max_features=10)
pipeline = Pipeline(steps=[('ColumnTransformer',transform), ('Model',model)])
m_scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
m_scores = absolute(m_scores)
print('Good: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example, you should see the following results.

Shape: (201, 25), (201,)
Baseline: 5880.718 (1197.967)
Good: 1405.420 (317.683)

Summary

In this post, you discovered standard machine learning datasets for classification and regression and the baseline and good results that one may expect to achieve on each.

Specifically, you learned:

  • The importance of standard machine learning datasets.
  • How to systematically evaluate a model on a standard machine learning dataset.
  • Standard datasets for classification and regression and the baseline and good performance expected on each.

Did I miss your favorite dataset?
Let me know in the comments and I will calculate a score for it, or perhaps even add it to this post.

Can you get a better score for a dataset?
I would love to know; share your model and score in the comments below and I will try to reproduce it and update the post (and give you full credit!).

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
