Ensemble Learning Techniques Demystified

So you came here. Let me guess: either you’re in a data science competition and read somewhere that most winners rely on ensembles, or you’re just a curious data scientist who wants to learn about them.

Either way, understanding how ensembles work is important knowledge, and as a data scientist or machine learning engineer, you should be able to put it to use.

Research has shown that, a majority of the time, an ensemble will outperform a single model, which makes ensembling a recommended technique for maximizing accuracy or reducing error in a machine learning model.

If you’ve done a couple of data science projects, then you have probably used one type of ensemble or another. Popular algorithms like Random Forest, AdaBoost, XGBoost, and CatBoost are all implementations of ensembles.

In this article, we’ll walk through the basic concept of ensembles, and you’ll learn just enough to construct good ones. So let’s begin.

What you’ll learn:

1. Introduction to ensembles
2. A peek at the sample data
3. Simple ensemble techniques

a. Averaging
b. Weighted Average
c. Max Voting

4. Advanced ensemble techniques

a. Bagging
b. Boosting
c. Stacking

Introduction to ensembles

Suppose you want to save money to buy a new laptop, but you don’t know how much it sells for, and as such, you can’t set a savings target. Of course, we assume you’re out of power and access to the internet because of an alien invasion the previous day. Obviously.

One reasonable thing to do in order to know the price is to ask someone, presumably a friend—we don’t want you stopping people on their way to work asking about laptop prices!

You find a friend and ask; he thinks for a moment and mumbles a number around $900. Now you know your friend may love tech gadgets, but he certainly doesn’t know the actual price of that laptop.

You also know he may not be able to give you the actual price, but all you really care about is an estimate close to the true price (your friend is a single model).

Next, you find five of your colleagues from work. Luckily, they’re arguing about Mac vs PC for software development. You jump in and pose the question—the price of your dream laptop. Well, as you guessed, they all had strong opinions.

Person 1 said $1000

Person 2 said $950

Person 3 said $800

Person 4 said $1100

Person 5 said $900

You look at the five prices and notice that they’re all within a certain range ($800–1100). You decide to take the average, which is $950.

Well, I’m happy to inform you that you just created your first ensemble: you combined five predictions by taking their average. Surely you would trust this average more than your one friend, right? I would... well, unless your friend is the dealer.

That is the basic intuition behind ensembles. However, what we illustrated above is just a simple type of ensemble called averaging for regression problems. As we proceed, we’ll see other techniques.
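The laptop example can be reproduced in a few lines. This is only a sketch of the arithmetic; the variable names are made up for illustration.

```python
from statistics import mean

# Price guesses from the five colleagues in the story
guesses = {"Person 1": 1000, "Person 2": 950, "Person 3": 800,
           "Person 4": 1100, "Person 5": 900}
prices = list(guesses.values())

# The "ensemble" prediction is simply the mean of the individual guesses
ensemble_estimate = mean(prices)

print("Range of guesses:", min(prices), "to", max(prices))
print("Ensemble (average) estimate:", ensemble_estimate)
```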

Introduction to the sample data

For this tutorial, we’re going to use German bank credit data. Let’s take a peek at the data:

import pandas as pd
import numpy as np
from statistics import mode

german_cred = pd.read_csv('credit_preped.csv')
german_cred.head()

Since we’ll be explaining ensembles for regression as well as classification tasks, we’ll reframe the problem on the same data. For classification ensembles, we’ll use the feature bad_credit as the target, and for the regression task, we’ll use age_yrs.

First, we’ll drop the customer_id column and import some modules:

german_cred.drop('customer_id', axis=1, inplace=True)

#Metric calculations
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_absolute_error, accuracy_score


from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split

#set seed (call np.random.seed; assigning to it would overwrite the function)
rand_seed = 234
np.random.seed(rand_seed)

Before we start creating our ensembles, let’s train single models and get their performance. For classification, we use three simple models: a support vector classifier (SVC), logistic regression, and a k-nearest neighbor classifier. For the regression task, we’ll use linear regression, a support vector regressor (SVR), and a k-nearest neighbor regressor.

We append these models to a list and loop over each as we train and cross validate.

#Import single models
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVR, SVC
from sklearn.neighbors import KNeighborsRegressor,KNeighborsClassifier

#Classification models
log_cf = LogisticRegression(solver='lbfgs', random_state=rand_seed)
svc_cf = SVC(gamma='scale', random_state=rand_seed)
knn_cf = KNeighborsClassifier()

classification_models = [log_cf, svc_cf, knn_cf]

#Regression models
linear_reg = LinearRegression()
svr_reg = SVR(gamma='scale')
knn_reg = KNeighborsRegressor()

regression_models = [linear_reg, svr_reg, knn_reg]

Next, we define some functions. The first standardizes the dataset with RobustScaler, which centers each feature on its median and scales it by its interquartile range.

The second, to split data into train and validation set.

The third and fourth measure performance for regression and classification respectively.

And the final function is used to train and validate our models.

#Define a function to standardize the data set
def standardize_data(df):
    scaler = RobustScaler()
    data = scaler.fit_transform(df)
    return data

#Create a function to split our data into train and validation sets for both tasks

def get_split_data(features, target_name=None):
    ## Get the target column
    target = features[target_name]
    ## Drop the target from the data
    temp_data = features.drop(target_name, axis=1)
    temp_data = standardize_data(temp_data)
    
    #split data (fixed random_state so the split is reproducible)
    X_train, X_val, y_train, y_val = train_test_split(temp_data, target, test_size=0.1, random_state=rand_seed)
    return (X_train, X_val, y_train, y_val)
    
def get_mae(pred, true_value):
    return mean_absolute_error(true_value, pred)


def get_acc(pred, true_value):
    return accuracy_score(true_value, pred) * 100

# A Function to train and cross validate a model
def model_train(model, features=None, target_name=None, nfolds = 10, task = 'class'):
    ## Get the target column
    target = features[target_name]
    ## Drop the target from the data
    temp_data = features.drop(target_name, axis=1)
    temp_data = standardize_data(temp_data)
    
    if task == 'reg':
        #cross_val_score returns one score per fold; negate to get a positive MAE
        scores = -1 * cross_val_score(model, temp_data, target, cv=nfolds, scoring='neg_mean_absolute_error')
        #report the mean over all folds, not just the first fold
        print("Mean Absolute Error of {} is {}".format(model.__class__.__name__, round(scores.mean(), 4)))
        print("-------------------------------------")

    else:
        scores = cross_val_score(model, temp_data, target, cv=nfolds, scoring='accuracy')
        #report the mean over all folds, not just the first fold
        print("Accuracy of {} is {} %".format(model.__class__.__name__, round(scores.mean() * 100, 2)))
        print("-------------------------------------")

Now that we have our functions, let’s test the base models. Remember, we’ll use the feature bad_credit for classification and the feature age_yrs for regression.

#Classification 
for model in classification_models:
    model_train(model, features=german_cred, target_name='bad_credit')
    
#Regression  
for model in regression_models:
    model_train(model, features=german_cred, target_name='age_yrs', task='reg')
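As a quick aside, the two metrics used throughout are simple enough to compute by hand. The sketch below uses made-up toy values (not taken from the credit data) and hypothetical function names to show what get_mae and get_acc report.

```python
def mean_absolute_error_by_hand(true_values, predictions):
    """Average of the absolute differences between truth and prediction."""
    return sum(abs(t - p) for t, p in zip(true_values, predictions)) / len(true_values)


def accuracy_by_hand(true_labels, predicted_labels):
    """Percentage of labels predicted exactly right."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100 * correct / len(true_labels)


# Toy regression example: ages in years (hypothetical values)
toy_mae = mean_absolute_error_by_hand([25, 23, 24], [24, 23, 26])
# Toy classification example: 1 = bad credit, 0 = good credit (hypothetical values)
toy_acc = accuracy_by_hand([1, 0, 0, 1], [1, 0, 1, 1])

print("MAE:", toy_mae)        # (1 + 0 + 2) / 3 = 1.0
print("Accuracy:", toy_acc)   # 3 of 4 correct = 75.0
```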

Now that we have our base models, let’s learn about ensembles.

Simple Ensemble Techniques

Averaging

Averaging is the simplest and most intuitive type of ensemble technique for regression. Just like the name implies, it combines predictions from different models and takes the average/mean. For example, since we’re predicting age, if our three base models predicted 24, 23, and 26 respectively, we’d take the average as (24 + 23 + 26) / 3 which is approximately 24.3. This becomes our prediction. Let’s see this in code:

#get the data sets
X_train, X_val, y_train, y_val = get_split_data(german_cred, target_name='age_yrs')

#fit base models
linear_reg.fit(X_train, y_train)
knn_reg.fit(X_train, y_train)
svr_reg.fit(X_train, y_train)

#make predictions with trained models
pred1 = linear_reg.predict(X_val)
pred2 = knn_reg.predict(X_val)
pred3 = svr_reg.predict(X_val)

#Take average as final prediction
avgpred = (pred1 + pred2 + pred3) / 3

Now, do you think the average prediction does better than the single model? Well, let’s find out. We’ll calculate the mean absolute error of the individual models and compare with the average.

We’ll see that the average prediction gives us the lowest MAE, and as such does better:

print("Linear Regression Model")
print(get_mae(pred1, y_val))
print("KNN Regression Model")
print(get_mae(pred2, y_val))
print("SVR Regression Model")
print(get_mae(pred3, y_val))
print("Average Model")
print(get_mae(avgpred, y_val))
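One way to see why averaging can help: if the base models make errors that are roughly independent, those errors partly cancel. The simulation below is a sketch under that assumption, using synthetic ages and Gaussian noise (none of it comes from the credit data). Real models trained on the same data tend to make correlated errors, so the gain is usually smaller.

```python
import random

random.seed(234)

n_samples = 5000
true_ages = [random.uniform(20, 70) for _ in range(n_samples)]


def noisy_model(truth, noise_sd=5.0):
    """Stand-in for a trained model: the truth plus independent Gaussian error."""
    return [t + random.gauss(0, noise_sd) for t in truth]


def mae(truth, preds):
    return sum(abs(t - p) for t, p in zip(truth, preds)) / len(truth)


# Three hypothetical models with independent errors
model_preds = [noisy_model(true_ages) for _ in range(3)]
# Element-wise average of the three prediction vectors
avg_preds = [sum(ps) / len(ps) for ps in zip(*model_preds)]

single_maes = [mae(true_ages, p) for p in model_preds]
avg_mae = mae(true_ages, avg_preds)

for i, m in enumerate(single_maes, start=1):
    print("MAE of model {}: {:.2f}".format(i, m))
print("MAE of the average: {:.2f}".format(avg_mae))
```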

Weighted Average

Weighted average is a modification of averaging. The intuition behind this is that some of the base models we want to average may have higher predictive powers than others. As such, taking the average may not really capture this individual predictive power. In cases like this, we assign different weights to different models based on their predictive power.

Looking at the MAEs of our base regression models above, we see that the linear regression model does better than the others, so let’s assign a higher weight to it.

We demonstrate this below by assigning a weight of 0.5 to the linear model and 0.25 to each of the other two, so the weights sum to 1. This can be interpreted as saying “take the linear model twice as seriously as each of the other two.”

We’ll observe that the weighted average does better, even slightly better than the plain average ensemble.

#fit base models
linear_reg.fit(X_train, y_train)
knn_reg.fit(X_train, y_train)
svr_reg.fit(X_train, y_train)

#make predictions with trained models
pred1 = linear_reg.predict(X_val)
pred2 = knn_reg.predict(X_val)
pred3 = svr_reg.predict(X_val)

#Take weighted average as final prediction (weights sum to 1)
w_avgpred = (0.5 * pred1 + 0.25 * pred2 + 0.25* pred3)

print("Linear Regression Model")
print(get_mae(pred1, y_val))
print("KNN Regression Model")
print(get_mae(pred2, y_val))
print("SVR Regression Model")
print(get_mae(pred3, y_val))
print("Weighted Average Model")
print(get_mae(w_avgpred, y_val))
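As a quick worked example, here is the weighted average applied to the three toy predictions from the averaging section (24, 23, and 26) with the same 0.5/0.25/0.25 weights. The helper name is hypothetical, and dividing by the sum of the weights is an addition here: it keeps the result on the right scale even if the weights don’t sum to 1.

```python
def weighted_average(predictions, weights):
    """Weighted mean; normalizes by the total weight."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)


toy_predictions = [24, 23, 26]
plain = weighted_average(toy_predictions, [1, 1, 1])            # ordinary mean
weighted = weighted_average(toy_predictions, [0.5, 0.25, 0.25])

print("Plain average:", round(plain, 2))
print("Weighted average:", weighted)   # 0.5*24 + 0.25*23 + 0.25*26 = 24.25
```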

Max Voting

Max voting is similar to averaging, except it’s used for classification problems. In max voting, as the name implies, we train multiple models, make predictions, and then take the maximum/modal/most popular class as the predicted class.

Let’s return to our laptop example to understand this better. Suppose you’ve found the laptop you want but want to make sure it’s right for you, so you seek the opinion of friends who know a thing or two about it. You ask them to rate the efficiency of the laptop on a scale of 1 to 5. Your friends rate it as follows:

Friend 1 = 3

Friend 2 = 5

Friend 3 = 4

Friend 4 = 3

Friend 5 = 3

Now, looking at the ratings, if we use max voting, we simply pick the rating that occurs most, which is 3.
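The friends’ ratings can be tallied in a couple of lines. This sketch uses collections.Counter from the standard library, which is a choice made here for illustration rather than something the rest of the article relies on.

```python
from collections import Counter

# Ratings given by the five friends in the example
ratings = [3, 5, 4, 3, 3]

# Count how often each rating occurs and pick the most frequent one
votes = Counter(ratings)
winner, n_votes = votes.most_common(1)[0]

print("Votes per rating:", dict(votes))
print("Max-voting result: {} ({} of {} votes)".format(winner, n_votes, len(ratings)))
```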

Below, we apply the same idea to the predictions of our three classifiers.

#get the data sets
X_train, X_val, y_train, y_val = get_split_data(german_cred, target_name='bad_credit')

#fit single models
log_cf.fit(X_train, y_train)
knn_cf.fit(X_train, y_train)
svc_cf.fit(X_train, y_train)

#make predictions with trained models
pred1 = log_cf.predict(X_val)
pred2 = knn_cf.predict(X_val)
pred3 = svc_cf.predict(X_val)

#Take max voting as final prediction
maxpred = []

for i in range(0, len(X_val)):
    #calculate the mode and append to maxpred vector
    maxpred.append(mode([pred1[i], pred2[i], pred3[i]]))
    
    
print("Logistic Regression Model")
print(get_acc(pred1, y_val))
print("KNN Classifier Model")
print(get_acc(pred2, y_val))
print("SVC Classifier Model")
print(get_acc(pred3, y_val))

print("Max Voting Model")
print(get_acc(np.array(maxpred), y_val))

For convenience, the sklearn library has an implementation of max voting (VotingClassifier) that you can use. An example using this module is shown below.

#Import the module
from sklearn.ensemble import VotingClassifier

#Pass the classifiers as a list of tuples with model names and the models themselves
max_model = VotingClassifier(estimators=[('logistic_reg', log_cf), ('KNN Classifier', knn_cf), ("SVC", svc_cf)], voting='hard')
max_model.fit(X_train, y_train)

print("Max Voting in sklearn")
print(get_acc(max_model.predict(X_val), y_val))
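VotingClassifier also accepts voting='soft', which averages the predicted class probabilities instead of counting votes (every estimator must then provide predict_proba; for SVC that means setting probability=True). The pure-Python sketch below, with made-up probabilities for a single customer, shows how the two rules can disagree.

```python
# Hypothetical predicted probabilities of bad_credit = 1 from three classifiers
prob_bad_credit = {"logistic_reg": 0.55, "knn": 0.60, "svc": 0.10}

# Hard voting: each model casts one vote for its most likely class
hard_votes = [1 if p >= 0.5 else 0 for p in prob_bad_credit.values()]
hard_prediction = 1 if sum(hard_votes) > len(hard_votes) / 2 else 0

# Soft voting: average the probabilities first, then threshold
mean_prob = sum(prob_bad_credit.values()) / len(prob_bad_credit)
soft_prediction = 1 if mean_prob >= 0.5 else 0

print("Hard votes:", hard_votes, "->", hard_prediction)
print("Mean probability: {:.3f} -> {}".format(mean_prob, soft_prediction))
```

Two lukewarm “yes” votes outnumber one confident “no” under hard voting, while soft voting lets the confident model pull the average below 0.5.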

Advanced Ensemble Techniques

Bagging

The intuition behind bagging (Bootstrap Aggregating) is quite simple. It’s similar to averaging with the exception of one tiny change, which is made on the dataset used to train the model.

In averaging, we train multiple models on the same dataset and take the average, but in bagging we train multiple models on different sub-samples of the original dataset before taking the combined predictions.

One question you might be asking is if we train on sub-samples of a dataset, are we not still training on the same dataset? Will the result not be similar?

The answer is a qualified no.

We are not training on exactly the same dataset. Each sub-sample is drawn with replacement, so it repeats some examples and leaves others out: on average, a bootstrap sample of the same size as the original contains only about 63.2% of the original examples.
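The 63.2% figure is easy to check empirically. The chance that a given example is never picked in n draws with replacement is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows, leaving about 0.632 of the examples in the sample. Here is a small simulation, assuming a hypothetical dataset of 10,000 row indices:

```python
import math
import random

random.seed(234)

n_rows = 10_000  # hypothetical dataset size
row_ids = range(n_rows)

# One bootstrap sample: draw n_rows indices with replacement
bootstrap_sample = random.choices(row_ids, k=n_rows)
unique_fraction = len(set(bootstrap_sample)) / n_rows

print("Unique fraction in this bootstrap sample: {:.3f}".format(unique_fraction))
print("Theoretical limit 1 - 1/e:               {:.3f}".format(1 - 1 / math.e))
```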

While it’s possible to write your own bagging algorithm, I’d advise using pre-built bagging algorithms, like RandomForest and ExtraTrees.

In the sklearn library, it’s possible to also create your own bagging classifier or regressor from a specified base model.

Let’s see this in code below. First, we import some bagging implementations in sklearn and the bagging meta-estimator; this allows us to choose our own base model.

#Bagging and Boosting models for both classification and regression problems
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, BaggingRegressor
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
from sklearn.ensemble import GradientBoostingRegressor, AdaBoostRegressor
#import xgboost as xgb


#bagging algorithms for regression
rand_forest_reg = RandomForestRegressor(n_estimators=100, random_state=rand_seed)
extra_tree_reg = ExtraTreesRegressor(n_estimators=100,random_state=rand_seed)
#We use support vector regressor as our base model for bagging
bagging_meta_reg = BaggingRegressor(svr_reg, n_estimators=100, random_state=rand_seed)

#bagging algorithms for classification
rand_forest_cf = RandomForestClassifier(n_estimators=100, random_state=rand_seed)
extra_tree_cf = ExtraTreesClassifier(n_estimators=100, random_state=rand_seed)
#We use svc as our base model for bagging
bagging_meta_cf = BaggingClassifier(svc_cf, n_estimators=10, random_state=rand_seed)

We get our train and test sets for regression and train the model:

#get data for regression task
X_train, X_val, y_train, y_val = get_split_data(german_cred, target_name='age_yrs')

#Train and fit these models
rand_forest_reg.fit(X_train, y_train)
extra_tree_reg.fit(X_train, y_train)
bagging_meta_reg.fit(X_train, y_train)

#check their performance
print("MAE of Random Forest is : ", get_mae(rand_forest_reg.predict(X_val), y_val))
print("MAE of Extra Trees is : ", get_mae(extra_tree_reg.predict(X_val), y_val))
print("MAE of Bagging estimator is : ", get_mae(bagging_meta_reg.predict(X_val), y_val))

We do the same for classification:

#get data for classification task
X_train, X_val, y_train, y_val = get_split_data(german_cred, target_name='bad_credit')

#Train and fit these models
rand_forest_cf.fit(X_train, y_train)
extra_tree_cf.fit(X_train, y_train)
bagging_meta_cf.fit(X_train, y_train)

#check their performance
print("ACC of Random Forest is : ", get_acc(rand_forest_cf.predict(X_val), y_val))
print("ACC of Extra Trees is : ", get_acc(extra_tree_cf.predict(X_val), y_val))
print("ACC of Bagging estimator is : ", get_acc(bagging_meta_cf.predict(X_val), y_val))

Boosting

Boosting is another popular and effective ensembling technique. In boosting, multiple models are trained sequentially. The goal is to train models that perform better than their predecessors. This means we have to take into account the areas where the previous models performed poorly and improve on those areas.

If we keep improving on the failures of the predecessors, it’s theoretically possible to achieve a perfect model. But the world is never perfect, and for all intents and purposes, this isn’t achievable in practice.
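To make “improving on the failures of the predecessors” concrete, here is a minimal from-scratch sketch of one flavor of boosting: each new one-split tree (a “stump”) is fit to the residuals of the ensemble so far, as in gradient boosting with squared error. The toy data, function names, and learning rate are all assumptions for illustration; this is not AdaBoost’s reweighting scheme, and it is not how the libraries in this section are implemented internally.

```python
def fit_stump(x, residuals):
    """Find the one-split 'stump' that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


# Toy data (hypothetical): y = x^2 for x = 0..9
x = list(range(10))
y = [xi ** 2 for xi in x]

learning_rate = 0.5
preds = [sum(y) / len(y)] * len(y)      # round 0: predict the mean everywhere
mse_history = [mse(y, preds)]

for _ in range(20):
    # Each new stump is trained on what the current ensemble still gets wrong
    residuals = [yi - pi for yi, pi in zip(y, preds)]
    stump = fit_stump(x, residuals)
    preds = [pi + learning_rate * predict_stump(stump, xi) for xi, pi in zip(x, preds)]
    mse_history.append(mse(y, preds))

print("MSE before boosting: {:.2f}".format(mse_history[0]))
print("MSE after 20 rounds: {:.2f}".format(mse_history[-1]))
```

The training error shrinks round after round, which is also why boosting can overfit if it runs for too long without a validation check.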

Boosting works really well, and its implementations are definitely among the go-to algorithms when working on data science competition problems.

A few of the boosting algorithms include XGBoost, LightGBM, AdaBoost, and CatBoost.

Let’s see some implementations of these boosting algorithms:

#Import boosting regression algorithms
# from xgboost import XGBRegressor
# from lightgbm import LGBMRegressor
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor

#Import boosting classification algorithms
# from xgboost import XGBClassifier
# from lightgbm import LGBMClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

#Regression
#The base model is passed positionally because its keyword was renamed from
#base_estimator to estimator in scikit-learn 1.2
ada_reg = AdaBoostRegressor(svr_reg, n_estimators=100, random_state=rand_seed)
gb_reg = GradientBoostingRegressor(n_estimators=100, random_state=rand_seed)

#Classification
ada_cf = AdaBoostClassifier(log_cf, random_state=rand_seed)
gb_cf = GradientBoostingClassifier(random_state=rand_seed)

Now that we have initialized our boosting algorithms, we simply train and calculate performance on both regression and classification tasks.

#get data for regression task
X_train, X_val, y_train, y_val = get_split_data(german_cred, target_name='age_yrs')

#Train and fit these models
ada_reg.fit(X_train, y_train)
gb_reg.fit(X_train, y_train)

#check their performance
print("MAE of AdaBoost is : ", get_mae(ada_reg.predict(X_val), y_val))
print("MAE of Gradient Boosting is : ", get_mae(gb_reg.predict(X_val), y_val))


#get data for classification task
X_train, X_val, y_train, y_val = get_split_data(german_cred, target_name='bad_credit')

#Train and fit these models
ada_cf.fit(X_train, y_train)
gb_cf.fit(X_train, y_train)

#check their performance
print("ACC of AdaBoost is : ", get_acc(ada_cf.predict(X_val), y_val))
print("ACC of Gradient Boosting is : ", get_acc(gb_cf.predict(X_val), y_val))

Out of the box, without tuning hyperparameters, boosting performed better on the classification task than both bagging and the single models in our run.

Stacking

Stacking is an advanced ensemble technique that has also proven to perform very well. It features in many of the winning solutions to data science competitions on Kaggle.

In stacking, first we train and make predictions on the original dataset using base algorithms, which are called first-level learners. The predictions from these base learners are then combined to make up new training data for another algorithm called a meta-learner. In other words, the output of the first-level learners serves as input for the meta-learner.

The first-level learners are often made up of different, simple, and diverse algorithms, although it’s possible to create stacked ensembles from the same learning algorithms.

The procedure for stacking is as follows:

1. Split the total training set into two disjoint sets (train and test).
2. Train several base models on the train set.
3. Test these base models on the test set.
4. Use the predictions from step 3 as inputs and the correct responses (target) as outputs to train a higher-level learner called a meta-model.

The first three steps are done iteratively. If we take, for example, a 10-fold stacking ensemble, we first split the training data into 10 folds. Then we’ll do 10 iterations. In each iteration, we train every base model on 9 folds and predict on the remaining fold (“the holdout fold”).

This ensures that, after 10 iterations, the entire dataset has been used to get test predictions, which we use as new features to train our meta-model in step 4.
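To see why the holdout folds jointly cover the whole dataset, here is a small sketch that builds the fold indices by hand for a hypothetical dataset of 23 rows, without shuffling. The function name is made up for illustration; in practice scikit-learn’s KFold does this job.

```python
def kfold_indices(n_rows, n_folds):
    """Yield (train_indices, holdout_indices) pairs for consecutive folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        holdout = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, holdout
        start += size


n_rows, n_folds = 23, 10   # hypothetical sizes
times_held_out = [0] * n_rows

for train, holdout in kfold_indices(n_rows, n_folds):
    assert not set(train) & set(holdout)   # train and holdout never overlap
    for i in holdout:
        times_held_out[i] += 1

print("Each row held out exactly once:", all(c == 1 for c in times_held_out))
```

Because every row lands in exactly one holdout fold, every row receives exactly one prediction from a model that never saw it during training.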

For prediction, we average the predictions of all base models on the test data and use them as meta-features on which the final prediction is made with the meta-model.

You don’t need to code a stacking ensemble yourself, as there already exist many efficient implementations of it, including ML-Ensemble and H2O.

In this article, we’ll write our own simple stacking ensemble just to demonstrate the idea:

from sklearn.model_selection import KFold, cross_val_predict

def stackingModel(base_models, meta_model, features, target, nfolds=10):
    #Work with plain arrays so we can index rows by position
    features, target = np.asarray(features), np.asarray(target)
    #Split data into folds
    kfold = KFold(n_splits=nfolds, shuffle=True, random_state=rand_seed)
    #initialize array to hold out-of-fold predictions (one column per base model)
    oof_predictions = np.zeros((features.shape[0], len(base_models)))
    
    # Train base models
    for i, model in enumerate(base_models):
        for train_index, test_index in kfold.split(features, target):
            #Fit the model on the training folds
            model.fit(features[train_index], target[train_index])
            
            #Predict on the holdout fold, which the model has not seen
            oof_predictions[test_index, i] = model.predict(features[test_index])

    #Train the meta-model on the out-of-fold predictions (the new features).
    #cross_val_predict refits a clone of the meta-model for each fold, so every
    #final prediction comes from a meta-model that was not fit on that row.
    final_preds = cross_val_predict(meta_model, oof_predictions, target, cv=kfold)
    
    return final_preds

Now that we have a simple stacking ensemble, let’s train and test on both a regression and classification task.

#get data for regression task
target = german_cred['age_yrs']
data = german_cred.drop('age_yrs', axis=1)
data = standardize_data(data)

#first level learners
base_learners = [linear_reg, svr_reg, knn_reg]
#meta learner
meta_ln = svr_reg

pred = stackingModel(base_learners, meta_ln, data, target)

#check performance
print("MAE of Stacking Model is : ", get_mae(pred ,target))


#get data for classification task
target = german_cred['bad_credit']
data = german_cred.drop('bad_credit', axis=1)
data = standardize_data(data)

#first level learners
base_learners = [log_cf, svc_cf, knn_cf]
#meta learner
meta_ln = svc_cf

pred = stackingModel(base_learners, meta_ln, data, target)

#check performance
print("ACC of Stacking Model is : ", get_acc(pred ,target))

In the stacking ensemble above, we created just two levels: level 1 for the base models and level 2 for the meta-model.

You can create as many levels as you wish, but make sure the models are diverse, as stacking performs better on diverse sets of base learners.
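For comparison with the hand-written function, recent versions of scikit-learn (0.22 and later) also ship a StackingClassifier, which the article’s list of libraries doesn’t mention. The sketch below is only an illustration: it runs on synthetic data from make_classification rather than the German credit data, and the choice of logistic regression as meta-learner is an assumption made here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the German credit data)
X, y = make_classification(n_samples=500, n_features=10, random_state=234)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=234)

# First-level learners, passed as (name, model) tuples like in VotingClassifier
base_learners = [
    ("logistic_reg", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("svc", SVC(gamma="scale", random_state=234)),
]

# cv=5: the meta-learner is trained on 5-fold out-of-fold predictions
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
stack.fit(X_train, y_train)

val_accuracy = stack.score(X_val, y_val) * 100
print("ACC of sklearn StackingClassifier is : ", val_accuracy)
```

Unlike the function above, this version is scored on a validation split that neither the base models nor the meta-learner saw during training.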

Final Thoughts

Ensembles are tried and tested methods for improving the performance of your machine learning models, and they’re often the difference between 1st and 2nd place in a data science competition.

In this article, we’ve covered some of the basic ideas behind ensembles. It’s worth mentioning that we can combine ensembles to create more complex ensembles. While this may help sometimes, performance often drops. Remember the “No Free Lunch” theorem? Well, it applies here.

I’m sure this article has given you a solid background on ensembles—don’t forget to clap and share. If you have any questions, suggestions, or feedback use the comment section below.

Link to full Notebook with explanations and codes on GitHub

Link to my paper “An Empirical Study of Ensemble Techniques (Bagging, Boosting and Stacking).”