Author: Jason Brownlee
Light Gradient Boosted Machine, or LightGBM for short, is an open-source library that provides an efficient and effective implementation of the gradient boosting algorithm.
LightGBM extends the gradient boosting algorithm by adding a type of automatic feature bundling as well as focusing on boosting examples with larger gradients. This can result in a dramatic speedup of training and improved predictive performance.
As such, LightGBM has become a de facto standard for machine learning competitions that involve tabular data for regression and classification predictive modeling tasks. Along with Extreme Gradient Boosting (XGBoost), it shares the credit for the increased popularity and wider adoption of gradient boosting methods in general.
In this tutorial, you will discover how to develop Light Gradient Boosted Machine ensembles for classification and regression.
After completing this tutorial, you will know:
- Light Gradient Boosted Machine (LightGBM) is an efficient open-source implementation of the stochastic gradient boosting ensemble algorithm.
- How to develop LightGBM ensembles for classification and regression with the scikit-learn API.
- How to explore the effect of LightGBM model hyperparameters on model performance.
Let’s get started.
Tutorial Overview
This tutorial is divided into three parts; they are:
- Light Gradient Boosted Machine Algorithm
- LightGBM Scikit-Learn API
  - LightGBM Ensemble for Classification
  - LightGBM Ensemble for Regression
- LightGBM Hyperparameters
  - Explore Number of Trees
  - Explore Tree Depth
  - Explore Learning Rate
  - Explore Boosting Type
Light Gradient Boosted Machine Algorithm
Gradient boosting refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems.
Ensembles are constructed from decision tree models. Trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. This is a type of ensemble machine learning model referred to as boosting.
Models are fit using any arbitrary differentiable loss function and a gradient descent optimization algorithm. This gives the technique its name, “gradient boosting,” as the loss is minimized by following its gradient as the model is fit, much like a neural network.
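To make the idea of fitting each new tree to the errors of the prior trees concrete, the sketch below implements a tiny gradient boosting loop in plain Python. It is an illustration only and not how LightGBM is implemented: it assumes a squared error loss (so the negative gradient is simply the residual), a single input feature, and one-split trees (stumps). The function names fit_stump() and mse() and the toy data are made up for this example.

```python
# minimal gradient boosting sketch for squared error loss (illustrative only)
# all names and data here are hypothetical; this is not LightGBM's implementation

def fit_stump(x, residuals):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    # the largest value is skipped so the right-hand side is never empty
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def mse(y, pred):
    """Mean squared error between targets and predictions."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)

# toy one-dimensional regression data
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.1, 3.9, 5.2, 6.1, 6.8, 8.1]

learning_rate = 0.5
# start from a constant prediction (the mean of the target)
pred = [sum(y) / len(y)] * len(y)
losses = [mse(y, pred)]
for _ in range(20):
    # for squared error, the negative gradient is the residual
    residuals = [yi - pi for yi, pi in zip(y, pred)]
    stump = fit_stump(x, residuals)
    # add a shrunken copy of the new stump to the ensemble prediction
    pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    losses.append(mse(y, pred))

print('initial MSE: %.3f, final MSE: %.3f' % (losses[0], losses[-1]))
```

Each round fits a stump to the current residuals and adds a shrunken copy of it to the running prediction, so the training error should not increase from one round to the next.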
For more on gradient boosting, see the related tutorials listed under Further Reading.
Light Gradient Boosted Machine, or LightGBM for short, is an open-source implementation of gradient boosting designed to be efficient and perhaps more effective than other implementations.
As such, LightGBM refers to the open-source project, the software library, and the machine learning algorithm. In this way, it is very similar to the Extreme Gradient Boosting or XGBoost technique.
LightGBM was described by Guolin Ke, et al. in the 2017 paper titled “LightGBM: A Highly Efficient Gradient Boosting Decision Tree.” The implementation introduces two key ideas: GOSS and EFB.
Gradient-based One-Side Sampling, or GOSS for short, is a modification to the gradient boosting method that focuses attention on those training examples that result in a larger gradient, in turn speeding up learning and reducing the computational complexity of the method.
With GOSS, we exclude a significant proportion of data instances with small gradients, and only use the rest to estimate the information gain. We prove that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite accurate estimation of the information gain with a much smaller data size.
— LightGBM: A Highly Efficient Gradient Boosting Decision Tree, 2017.
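The sampling step described in the quote can be sketched in a few lines of plain Python. This is a simplified illustration rather than LightGBM's implementation: the goss_sample() function, its top_rate and other_rate arguments, and the randomly generated gradients are all assumptions made for this sketch. Following the paper, the sampled small-gradient examples are up-weighted by (1 - a) / b, where a is the fraction of large-gradient examples kept and b is the fraction of the data sampled from the rest.

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=1):
    """Return (indices, weights) chosen by a GOSS-style sampling step."""
    n = len(gradients)
    # rank examples by absolute gradient, largest first
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    # always keep the examples with the largest gradients
    top = order[:n_top]
    # randomly sample from the remaining small-gradient examples
    rng = random.Random(seed)
    other = rng.sample(order[n_top:], n_other)
    # up-weight the sampled examples to offset the undersampling
    amplify = (1.0 - top_rate) / other_rate
    indices = top + other
    weights = [1.0] * n_top + [amplify] * n_other
    return indices, weights

# hypothetical gradients for 100 training examples
rng = random.Random(7)
gradients = [rng.gauss(0.0, 1.0) for _ in range(100)]
indices, weights = goss_sample(gradients)
print('kept %d of %d examples' % (len(indices), len(gradients)))
print('weight on sampled small-gradient examples: %.1f' % weights[-1])
```

With these settings, only 30 percent of the examples are used to evaluate candidate splits, which is where the speedup would come from.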
Exclusive Feature Bundling, or EFB for short, is an approach for bundling sparse (mostly zero) mutually exclusive features, such as categorical variable inputs that have been one-hot encoded. As such, it can be thought of as a type of automatic feature reduction: features are merged rather than discarded.
… we bundle mutually exclusive features (i.e., they rarely take nonzero values simultaneously), to reduce the number of features.
— LightGBM: A Highly Efficient Gradient Boosting Decision Tree, 2017.
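The bundling idea can be illustrated with a small plain-Python sketch. It is deliberately simplified and is not LightGBM's implementation: it assumes non-negative feature values, only bundles features that are never nonzero in the same row (the paper tolerates a small rate of conflicts), and merges a bundle by shifting each feature into its own value range. The helper names conflicts(), bundle_features() and merge_bundle() are made up for this example.

```python
# simplified sketch of exclusive feature bundling (illustrative only)
# assumption: feature values are non-negative and bundles allow zero conflicts

def conflicts(col_a, col_b):
    """Count rows where both features are nonzero."""
    return sum(1 for a, b in zip(col_a, col_b) if a != 0 and b != 0)

def bundle_features(columns):
    """Greedily group column indices into bundles with no conflicts."""
    bundles = []
    for j, col in enumerate(columns):
        for bundle in bundles:
            if all(conflicts(col, columns[k]) == 0 for k in bundle):
                bundle.append(j)
                break
        else:
            # no compatible bundle was found, so start a new one
            bundles.append([j])
    return bundles

def merge_bundle(columns, bundle):
    """Merge a bundle into one feature by offsetting each member's values."""
    merged = [0] * len(columns[0])
    offset = 0
    for j in bundle:
        for i, value in enumerate(columns[j]):
            if value != 0:
                merged[i] = value + offset
        # the next feature in the bundle gets its own value range
        offset += max(columns[j])
    return merged

# three one-hot columns for one categorical variable, plus one dense feature
columns = [
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1],
    [5, 3, 4, 2, 6, 1],
]
bundles = bundle_features(columns)
merged = [merge_bundle(columns, b) for b in bundles]
print(bundles)
print(merged)
```

Here the three one-hot columns collapse into a single feature, while the dense feature stays on its own, so four input features become two without discarding any information.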
Together, these two changes can speed up training by up to 20x. As such, LightGBM may be considered gradient boosting decision trees (GBDT) with the addition of GOSS and EFB.
We call our new GBDT implementation with GOSS and EFB LightGBM. Our experiments on multiple public datasets show that, LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy
— LightGBM: A Highly Efficient Gradient Boosting Decision Tree, 2017.
LightGBM Scikit-Learn API
LightGBM can be installed as a standalone library and the LightGBM model can be developed using the scikit-learn API.
The first step is to install the LightGBM library, if it is not already installed. This can be achieved using the pip python package manager on most platforms; for example:
sudo pip install lightgbm
You can then confirm that the LightGBM library was installed correctly and can be used by running the following script.
# check lightgbm version
import lightgbm
print(lightgbm.__version__)
Running the script prints the version of the LightGBM library you have installed.
Your version should be the same as or higher than the version shown below. If not, you must upgrade your version of the LightGBM library.
2.3.1
If you require specific instructions for your development environment, see the LightGBM Installation Guide listed under Further Reading.
The LightGBM library has its own custom API, although we will use the method via the scikit-learn wrapper classes: LGBMRegressor and LGBMClassifier. This will allow us to use the full suite of tools from the scikit-learn machine learning library to prepare data and evaluate models.
Both models operate the same way and take the same arguments that influence how the decision trees are created and added to the ensemble.
Randomness may be used in the construction of the model, for example when rows or features are subsampled. This means that each time the algorithm is run on the same data, it may produce a slightly different model.
When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.
Let’s take a look at how to develop a LightGBM ensemble for both classification and regression.
LightGBM Ensemble for Classification
In this section, we will look at using LightGBM for a classification problem.
First, we can use the make_classification() function to create a synthetic binary classification problem with 1,000 examples and 20 input features.
The complete example is listed below.
# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# summarize the dataset
print(X.shape, y.shape)
Running the example creates the dataset and summarizes the shape of the input and output components.
(1000, 20) (1000,)
Next, we can evaluate a LightGBM algorithm on this dataset.
We will evaluate the model using repeated stratified k-fold cross-validation with three repeats and 10 folds. We will report the mean and standard deviation of the accuracy of the model across all repeats and folds.
# evaluate lightgbm algorithm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from lightgbm import LGBMClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# define the model
model = LGBMClassifier()
# evaluate the model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Running the example reports the mean and standard deviation of the accuracy of the model.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, we can see the LightGBM ensemble with default hyperparameters achieves a classification accuracy of about 92.5 percent on this test dataset.
Accuracy: 0.925 (0.031)
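To make the evaluation procedure more concrete, the sketch below shows in plain Python how three repeats of stratified 10-fold cross-validation produce 30 train/test splits, each of which contributes one accuracy score to the mean and standard deviation reported above. This is not scikit-learn's implementation; the stratified_folds() and repeated_stratified_splits() helpers and the balanced toy labels are assumptions made for this illustration.

```python
import random

def stratified_folds(labels, n_splits, rng):
    """Assign each example to a fold so class proportions are roughly preserved."""
    fold_of = [None] * len(labels)
    for cls in sorted(set(labels)):
        members = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(members)
        # deal the shuffled members of each class across the folds in turn
        for position, i in enumerate(members):
            fold_of[i] = position % n_splits
    return fold_of

def repeated_stratified_splits(labels, n_splits=10, n_repeats=3, seed=1):
    """Yield (train_indices, test_indices) for every fold of every repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        # each repeat reshuffles the data into a new set of folds
        fold_of = stratified_folds(labels, n_splits, rng)
        for fold in range(n_splits):
            test = [i for i, f in enumerate(fold_of) if f == fold]
            train = [i for i, f in enumerate(fold_of) if f != fold]
            yield train, test

# hypothetical balanced binary labels for 1,000 examples
labels = [0] * 500 + [1] * 500
splits = list(repeated_stratified_splits(labels))
print('number of train/test splits: %d' % len(splits))
train, test = splits[0]
n_positive = sum(labels[i] for i in test)
print('first split: %d train, %d test, %d positive in test' % (len(train), len(test), n_positive))
```

Because every test fold keeps the same class balance as the full dataset, the accuracy scores from different folds are directly comparable.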
We can also use the LightGBM model as a final model and make predictions for classification.
First, the LightGBM ensemble is fit on all available data, then the predict() function can be called to make predictions on new data.
The example below demonstrates this on our binary classification dataset.
# make predictions using lightgbm for classification
from sklearn.datasets import make_classification
from lightgbm import LGBMClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# define the model
model = LGBMClassifier()
# fit the model on the whole dataset
model.fit(X, y)
# make a single prediction
row = [0.2929949,-4.21223056,-1.288332,-2.17849815,-0.64527665,2.58097719,0.28422388,-7.1827928,-1.91211104,2.73729512,0.81395695,3.96973717,-2.66939799,3.34692332,4.19791821,0.99990998,-0.30201875,-4.43170633,-2.82646737,0.44916808]
yhat = model.predict([row])
print('Predicted Class: %d' % yhat[0])
Running the example fits the LightGBM ensemble model on the entire dataset. The model is then used to make a prediction on a new row of data, as we might when using the model in an application.
Predicted Class: 1
Now that we are familiar with using LightGBM for classification, let’s look at the API for regression.
LightGBM Ensemble for Regression
In this section, we will look at using LightGBM for a regression problem.
First, we can use the make_regression() function to create a synthetic regression problem with 1,000 examples and 20 input features.
The complete example is listed below.
# test regression dataset
from sklearn.datasets import make_regression
# define dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=7)
# summarize the dataset
print(X.shape, y.shape)
Running the example creates the dataset and summarizes the shape of the input and output components.
(1000, 20) (1000,)
Next, we can evaluate a LightGBM algorithm on this dataset.
As we did in the last section, we will evaluate the model using repeated k-fold cross-validation, with three repeats and 10 folds. We will report the mean absolute error (MAE) of the model across all repeats and folds. The scikit-learn library makes the MAE negative so that it is maximized instead of minimized. This means that negative MAE values closer to zero are better and a perfect model has a MAE of 0.
The complete example is listed below.
# evaluate lightgbm ensemble for regression
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from lightgbm import LGBMRegressor
# define dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=7)
# define the model
model = LGBMRegressor()
# evaluate the model
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
# report performance
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Running the example reports the mean and standard deviation of the MAE of the model.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, we can see the LightGBM ensemble with default hyperparameters achieves a MAE of about 60.
MAE: -60.004 (2.887)
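The sign of the reported score can be confusing, so here is a small plain-Python sketch of why negating the MAE lets a "higher is better" selection rule pick the model with the smaller error. The mean_absolute_error() helper, the model names and all of the values are hypothetical and only serve to illustrate the sign convention.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, -0.5, 2.0, 7.0]
# hypothetical predictions from two candidate models
candidates = {
    'model_a': [2.5, 0.0, 2.0, 8.0],
    'model_b': [0.0, 0.0, 0.0, 0.0],
}
# negate the error so that a larger score means a better model
scores = {name: -mean_absolute_error(y_true, pred) for name, pred in candidates.items()}
# selecting the maximum score now selects the smallest error
best = max(scores, key=scores.get)
for name, score in scores.items():
    print('%s: negative MAE %.3f' % (name, score))
print('best: %s' % best)
```

The model with the score closest to zero is chosen, which is the same model that has the smallest absolute error.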
We can also use the LightGBM model as a final model and make predictions for regression.
First, the LightGBM ensemble is fit on all available data, then the predict() function can be called to make predictions on new data.
The example below demonstrates this on our regression dataset.
# make predictions using lightgbm for regression
from sklearn.datasets import make_regression
from lightgbm import LGBMRegressor
# define dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=7)
# define the model
model = LGBMRegressor()
# fit the model on the whole dataset
model.fit(X, y)
# make a single prediction
row = [0.20543991,-0.97049844,-0.81403429,-0.23842689,-0.60704084,-0.48541492,0.53113006,2.01834338,-0.90745243,-1.85859731,-1.02334791,-0.6877744,0.60984819,-0.70630121,-1.29161497,1.32385441,1.42150747,1.26567231,2.56569098,-0.11154792]
yhat = model.predict([row])
print('Prediction: %d' % yhat[0])
Running the example fits the LightGBM ensemble model on the entire dataset. The model is then used to make a prediction on a new row of data, as we might when using the model in an application.
Prediction: 52
Now that we are familiar with using the scikit-learn API to evaluate and use LightGBM ensembles, let’s look at configuring the model.
LightGBM Hyperparameters
In this section, we will take a closer look at some of the hyperparameters you should consider tuning for the LightGBM ensemble and their effect on model performance.
There are many hyperparameters we can look at for LightGBM, although in this case, we will look at the number of trees and tree depth, the learning rate, and the boosting type.
For good general advice on tuning LightGBM hyperparameters, see the LightGBM Parameters Tuning documentation listed under Further Reading.
Explore Number of Trees
An important hyperparameter for the LightGBM ensemble algorithm is the number of decision trees used in the ensemble.
Recall that decision trees are added to the model sequentially in an effort to correct and improve upon the predictions made by prior trees. As such, more trees are often better.
The number of trees can be set via the “n_estimators” argument and defaults to 100.
The example below explores the effect of the number of trees with values between 10 and 5,000.
# explore lightgbm number of trees effect on performance
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from lightgbm import LGBMClassifier
from matplotlib import pyplot
# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
    return X, y
# get a list of models to evaluate
def get_models():
    models = dict()
    trees = [10, 50, 100, 500, 1000, 5000]
    for n in trees:
        models[str(n)] = LGBMClassifier(n_estimators=n)
    return models
# evaluate a given model using cross-validation
def evaluate_model(model):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores
# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
Running the example first reports the mean accuracy for each configured number of decision trees.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, we can see that performance improves on this dataset until about 500 trees, after which performance appears to level off.
>10 0.857 (0.033)
>50 0.916 (0.032)
>100 0.925 (0.031)
>500 0.938 (0.026)
>1000 0.938 (0.028)
>5000 0.937 (0.028)
A box and whisker plot is created for the distribution of accuracy scores for each configured number of trees.
We can see the general trend of increasing model performance with ensemble size.
Explore Tree Depth
Varying the depth of each tree added to the ensemble is another important hyperparameter for gradient boosting.
The tree depth controls how specialized each tree is to the training dataset: how general or overfit it might be. Trees are preferred that are not too shallow and general (like AdaBoost) and not too deep and specialized (like bootstrap aggregation).
Gradient boosting generally performs well with trees that have a modest depth, finding a balance between skill and generality.
Tree depth is controlled via the “max_depth” argument, which defaults to -1 (no limit), because the default mechanism for controlling how complex trees are is the number of leaf nodes.
There are two main ways to control tree complexity: the maximum depth of the trees and the maximum number of terminal nodes (leaves) in the tree. In this case, we are exploring tree depth, so we also need to increase the number of leaves to support deeper trees by setting the “num_leaves” argument.
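The relationship between the two limits is simple arithmetic: a binary tree of depth d can have at most 2^d leaves. The short sketch below tabulates this for depths 1 to 10 and compares it with 31, the default value of “num_leaves” in the LightGBM scikit-learn wrappers. The variable names are made up for this illustration.

```python
# default "num_leaves" in the LightGBM scikit-learn wrappers
DEFAULT_NUM_LEAVES = 31

rows = []
for depth in range(1, 11):
    # a full binary tree of this depth has 2^depth leaves
    max_leaves = 2 ** depth
    # whichever limit is reached first is the one that constrains the tree
    binding = 'max_depth' if max_leaves <= DEFAULT_NUM_LEAVES else 'num_leaves'
    rows.append((depth, max_leaves, binding))
    print('depth=%2d max leaves=%4d binding limit with default num_leaves: %s' % (depth, max_leaves, binding))
```

From a depth of five onward, the default leaf limit would cap tree growth before the depth limit is reached, which is why the leaf limit is raised along with the depth when exploring deeper trees.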
The example below explores tree depths between 1 and 10 and the effect on model performance.
# explore lightgbm tree depth effect on performance
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from lightgbm import LGBMClassifier
from matplotlib import pyplot
# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
    return X, y
# get a list of models to evaluate
def get_models():
    models = dict()
    for i in range(1,11):
        models[str(i)] = LGBMClassifier(max_depth=i, num_leaves=2**i)
    return models
# evaluate a given model using cross-validation
def evaluate_model(model):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores
# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
Running the example first reports the mean accuracy for each configured tree depth.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, we can see that performance improves with tree depth up to about five levels, with only marginal changes after that (0.925 at a depth of 5 compared to 0.928 at a depth of 10). It might be interesting to explore even deeper trees.
>1 0.833 (0.028)
>2 0.870 (0.033)
>3 0.899 (0.032)
>4 0.912 (0.026)
>5 0.925 (0.031)
>6 0.924 (0.029)
>7 0.922 (0.027)
>8 0.926 (0.027)
>9 0.925 (0.028)
>10 0.928 (0.029)
A box and whisker plot is created for the distribution of accuracy scores for each configured tree depth.
We can see the general trend of increasing model performance with the tree depth to a depth of five levels, after which performance begins to sit reasonably flat.
Explore Learning Rate
The learning rate controls how much each model contributes to the ensemble prediction.
Smaller rates may require more decision trees in the ensemble.
The learning rate can be controlled via the “learning_rate” argument and defaults to 0.1.
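A rough back-of-the-envelope calculation shows why smaller rates may require more trees. The sketch below makes a strong simplifying assumption that does not hold for real trees: each new tree predicts the current error perfectly, so with a learning rate r the remaining error shrinks by a factor of (1 - r) per tree. Under that assumption, it computes how many trees are needed to reduce the error to 1 percent of its starting value. The trees_needed() helper is hypothetical.

```python
import math

def trees_needed(learning_rate, target=0.01):
    """Smallest n with (1 - learning_rate) ** n <= target (idealized assumption)."""
    if learning_rate >= 1.0:
        # a perfect tree with no shrinkage removes the error in one step
        return 1
    return math.ceil(math.log(target) / math.log(1.0 - learning_rate))

rates = [0.0001, 0.001, 0.01, 0.1, 1.0]
needed = {rate: trees_needed(rate) for rate in rates}
for rate, n in needed.items():
    print('learning_rate=%.4f needs about %d trees' % (rate, n))
```

Even in this idealized setting, every rate of 0.01 or smaller needs far more than the default of 100 trees, which is consistent with small learning rates performing poorly unless the number of trees is also increased.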
The example below explores the learning rate and compares the effect of values between 0.0001 and 1.0.
# explore lightgbm learning rate effect on performance
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from lightgbm import LGBMClassifier
from matplotlib import pyplot
# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
    return X, y
# get a list of models to evaluate
def get_models():
    models = dict()
    rates = [0.0001, 0.001, 0.01, 0.1, 1.0]
    for r in rates:
        key = '%.4f' % r
        models[key] = LGBMClassifier(learning_rate=r)
    return models
# evaluate a given model using cross-validation
def evaluate_model(model):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores
# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
Running the example first reports the mean accuracy for each configured learning rate.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, we can see that a larger learning rate results in better performance on this dataset. We would expect that adding more trees to the ensemble for the smaller learning rates would further lift performance.
>0.0001 0.800 (0.038)
>0.0010 0.811 (0.035)
>0.0100 0.859 (0.035)
>0.1000 0.925 (0.031)
>1.0000 0.928 (0.025)
A box and whisker plot is created for the distribution of accuracy scores for each configured learning rate.
We can see the general trend of increasing model performance with the increase in learning rate, all the way to the large value of 1.0.
Explore Boosting Type
A feature of LightGBM is that it supports a number of different boosting algorithms, referred to as boosting types.
The boosting type can be specified via the “boosting_type” argument, which takes a string to specify the type. The options include:
- 'gbdt': Gradient Boosting Decision Tree (GBDT).
- 'dart': Dropouts meet Multiple Additive Regression Trees (DART).
- 'goss': Gradient-based One-Side Sampling (GOSS).
The default is GBDT, which is the classical gradient boosting algorithm.
DART is described in the 2015 paper titled “DART: Dropouts meet Multiple Additive Regression Trees” and, as its name suggests, adds the concept of dropout from deep learning to the Multiple Additive Regression Trees (MART) algorithm, which is another name for gradient boosted decision trees.
This algorithm is known by many names, including Gradient TreeBoost, boosted trees, and Multiple Additive Regression Trees (MART). We use the latter to refer to this algorithm.
— DART: Dropouts meet Multiple Additive Regression Trees, 2015.
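The sketch below illustrates the normalization step that DART applies after a new tree is fit with some existing trees temporarily dropped, as I understand it from the paper: with k dropped trees, the new tree is scaled by 1 / (k + 1) and the dropped trees by k / (k + 1). It is a toy illustration for a single example, not LightGBM's implementation. The dart_rescale() helper and the per-tree output values are hypothetical, and the sketch assumes the new tree roughly reproduces what the dropped trees contributed.

```python
import random

def dart_rescale(tree_outputs, new_tree_output, dropped):
    """Apply DART-style scaling after fitting a new tree with some trees dropped."""
    k = len(dropped)
    scaled = list(tree_outputs)
    for i in dropped:
        # shrink each dropped tree by k / (k + 1)
        scaled[i] *= k / (k + 1.0)
    # shrink the new tree by 1 / (k + 1) before adding it to the ensemble
    scaled.append(new_tree_output / (k + 1.0))
    return scaled

# hypothetical per-tree outputs for one example
tree_outputs = [0.5, 0.3, 0.2, 0.1]
rng = random.Random(3)
# randomly choose two trees to drop for this boosting iteration
dropped = rng.sample(range(len(tree_outputs)), 2)
# assumption: the new tree roughly replaces what the dropped trees contributed
new_tree_output = sum(tree_outputs[i] for i in dropped)
rescaled = dart_rescale(tree_outputs, new_tree_output, dropped)
print('prediction before: %.3f' % sum(tree_outputs))
print('prediction after:  %.3f' % sum(rescaled))
```

Without the scaling, adding the new tree back alongside the dropped trees would roughly double their combined contribution; with it, the ensemble prediction stays at the same magnitude.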
GOSS was introduced with the LightGBM paper and library. The approach keeps the instances that result in a large error gradient and randomly samples from the remaining instances with small gradients, rather than using all of the data to update the model.
… we exclude a significant proportion of data instances with small gradients, and only use the rest to estimate the information gain.
— LightGBM: A Highly Efficient Gradient Boosting Decision Tree, 2017.
The example below compares LightGBM on the synthetic classification dataset with the three key boosting techniques.
# explore lightgbm boosting type effect on performance
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from lightgbm import LGBMClassifier
from matplotlib import pyplot
# get the dataset
def get_dataset():
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
    return X, y
# get a list of models to evaluate
def get_models():
    models = dict()
    types = ['gbdt', 'dart', 'goss']
    for t in types:
        models[t] = LGBMClassifier(boosting_type=t)
    return models
# evaluate a given model using cross-validation
def evaluate_model(model):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores
# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
Running the example first reports the mean accuracy for each configured boosting type.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, we can see that the default boosting method performed better than the other two techniques that were evaluated.
>gbdt 0.925 (0.031)
>dart 0.912 (0.028)
>goss 0.918 (0.027)
A box and whisker plot is created for the distribution of accuracy scores for each configured boosting method, allowing the techniques to be compared directly.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Related Tutorials
- A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning
- Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost
Papers
- LightGBM: A Highly Efficient Gradient Boosting Decision Tree, 2017.
- DART: Dropouts meet Multiple Additive Regression Trees, 2015.
APIs
- LightGBM Project, GitHub
- LightGBM’s Documentation.
- LightGBM Installation Guide
- LightGBM Parameters Tuning.
- lightgbm.LGBMClassifier API.
- lightgbm.LGBMRegressor API.
Summary
In this tutorial, you discovered how to develop Light Gradient Boosted Machine ensembles for classification and regression.
Specifically, you learned:
- Light Gradient Boosted Machine (LightGBM) is an efficient open source implementation of the stochastic gradient boosting ensemble algorithm.
- How to develop LightGBM ensembles for classification and regression with the scikit-learn API.
- How to explore the effect of LightGBM model hyperparameters on model performance.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
The post How to Develop a Light Gradient Boosted Machine (LightGBM) Ensemble appeared first on Machine Learning Mastery.