# A Guide to Ensemble Learning

## Ensemble learning: Need, benefits, and detailed coverage of bagging, boosting, and stacking

In this post, we’ll have explore the fundamentals of ensemble learning. Specifically, we’ll discuss techniques like bagging, boosting, and stacking in detail and take a brief look at various foundational algorithms that adopt these techniques.

Without further ado, let’s get started!

## What is Ensemble Learning?

This article presents techniques for improving classification accuracy by aggregating various classifiers in different ways. These techniques are known as the ensemble methods.

An ensemble method involves constructing a set or a group of base classifiers or individual learners (that may or may not follow the same algorithm) from training data. The final classification that we predict is, hence, the aggregation of all these classifiers’ outputs.

Error in any machine learning model can be described as

Ensemble techniques are designed to help reduce these errors, and thus enhance the performance of the model.

### Necessary Conditions for Better Performance

There are two necessary conditions for an ensemble classifier to perform better than a single classifier:

1. The base classiﬁers should be independent of each other. If there’s a dependency, then these dependent classifiers will introduce a bias towards their favored class.
2. The base classiﬁers should individually perform better than random guessing. Any model that provides an accuracy of less than 50% is considered worse than random guessing. Thus, accumulating such classifiers will result in even worse performance.

## Methods for Constructing an Ensemble Classifier

1. By manipulating the training set: Various training sets are constructed by sampling data from the original dataset. Example: bagging and boosting.
2. By manipulating the input features: In this approach, a subset of input features is chosen to form each training set. Example: random forest.
3. By manipulating the learning algorithm: In this approach, several models are applied to the same training dataset. For example, an artiﬁcial neural network can produce diﬀerent models by changing its network topology or initial weights. Similarly, an ensemble of decision trees can be constructed by taking different nodes as root or splitting criteria.

## Basic Ensemble Techniques

In this section, we’re going to dive into the details of the three most commonly used ensemble techniques: Bagging, boosting, and stacking.

## Bagging

To help us understand the concept of bagging, let’s use an analogy. Suppose that you’re a patient seeking a diagnosis for a medical condition or illness, given that you’ve recently noticed some symptoms. Instead of asking one doctor, you may choose to take the advice of several. If a certain result occurs more than any of the others, you may choose this as the ﬁnal or best diagnosis.

Thus, it can be inferred that the final outcome will be a combined majority vote, where each doctor gets an equal vote. The intuition for bagging is similar.

Bagging, which is also known as bootstrap aggregating, is a technique that repeatedly samples data (with replacement) from a dataset.

Every individual dataset will have the same size as the original dataset, but the data points are chosen either randomly or following a sampling distribution. Because the sampling is done with replacement, a few data points may appear several times in the same training set, while others may never be a part of that training set.

Think of sampling with replacement as taking out balls of different colors like blue and red from a bag. Let’s assume there are 10 blue and 10 red balls. Initially, the probability of picking a blue or red ball is equal—that is, 10/20= 0.5.

If we take out a blue ball once, we put another blue ball in the bag so that total number of both blue and red balls will be same always, and thus, the probability, too. So one dataset might have 15 blue and 5 red, another might have 13 red and 7 blue, and so on.

Each classifier’s output is aggregated, and the class with the majority of votes is selected as the output class.

### Benefits of Bagging

1. The effect of outliers is minimized by applying bagging. Since every data point has an equal probability of being selected, bagging does not focus on any particular output class of the training data. It’s therefore less susceptible to overﬁtting in cases involving noisy data.
2. Bagging improves generalization error by reducing the variance of the base classiﬁers. The performance of bagging depends on the stability of the base classiﬁer. If a base classiﬁer is unstable, bagging helps to reduce the errors associated with random ﬂuctuations in the training data. For example, choosing a different splitting criteria in decision tree induction can result in entirely different trees.

### Implementation in Scikit-learn

Implementing bagging is fairly simple using the sklearn library. We just need to import the bagging aggregator and the base classifier that we prefer:

## Boosting

Let’s understand the concept of boosting by tweaking the previous analogy a little. Suppose again that you’re a patient seeking a diagnosis for new symptoms. Instead of asking one doctor, you may choose to take the advice of several. Let’s say you weigh the advice of a few doctors more than the others, due to their expertise and experience.

The ﬁnal output will then be a combination of the weighted diagnoses. This is the essence behind boosting. Unlike bagging, every doctor doesn’t get an equal vote. They are all assigned weights.

Boosting is an iterative procedure that changes the training set according to the data points that are falsely classified. It tweaks the weights of the data points in such a way that the base classiﬁers will focus on examples that are hard to classify.

In the above image, as you can see, the weights and parameters are updated in such a manner that the falsely predicted data points are given more importance or weight.

Given above is the algorithm for AdaBoosting. We’re going to walk through it step-by-step. Let’s say we have a dataset whose data points are represented as{(xj, yj) | j =1 ,2,…,N}, where x is the set of features and y is the class label. In the AdaBoost algorithm, every classifier is given importance according to it’s error rate, given by:

where I(p) = 1 if the predicate p is true, and 0 otherwise. The importance of a classiﬁer Ci is given by the following parameter:

The αi parameter is used to update the weights of the data points based on how well they were classified. The weight update mechanism for AdaBoost is given by the equation:

Thus, wi reduces for a correctly classified point and increases for a falsely predicted data point. In addition, if any intermediate rounds produce an error rate higher than 50%, the weights are reverted back to their original uniform values, wi =1/N.

### Implementation in Scikit-learn

We can implement the AdaBoost algorithm in sklearn by using these few lines of code:

## Stacking

With stacking, a learner is used to combine the output from all the base classifiers. All of these are combined via a meta classifier or regressor. In stacking, a single model is used to learn how to best combine the predictions from the contributing models. These models need not be similar (like random forest), and each of them are trained on the complete dataset.

Thus, two things need to be pre-defined while making our stacking model:

1. The k base classifiers that we fit our training data on.
2. The meta classifier to combine the output from these k base classifiers.

For example, if we take 3 models, namely, k-nearest neighbors, an SVM, and a naive Bayes classifier, and we select a neural network as our meta classifier—the neural network will then take the outputs of these 3 models as input and will return final predictions based on their combination.

### Implementation in Scikit-learn

Implementing stacking in sklearn is quite easy. You just have to enlist the models you want to accumulate together, along with their parameters, and then a meta-estimator to combine them all:

## Algorithms Based on Ensemble Techniques

A few famous algorithms based on the ensemble techniques discussed above are listed here:

1. Random Forest: A random forest can be thought of as an ensemble of several decision trees, which are made using either a subset of the training data or a subset of the features. Then bagging is applied to aggregate and find the final output. Refer to this article to learn more.
3. XGBoost: XGBoost is based on a gradient boosting algorithm that works on an ensemble of decision trees. It’s tremendously successful for certain use cases and is very popular these days in the community. Refer to this article to learn more.

## Conclusion

In this post, we explored the whys and hows of ensemble learning. As we can see, ensemble techniques help in preventing overfitting and reducing the bias and variance of the resultant model, while increasing accuracy.

There are a lot of algorithms that employ ensemble learning, and we discussed a few. But there’s much more to learn if you’re so inclined. Hope this post was informative. All feedback is welcome.

Until next time! 🙂

## Fritz

Our team has been at the forefront of Artificial Intelligence and Machine Learning research for more than 15 years and we're using our collective intelligence to help others learn, understand and grow using these new technologies in ethical and sustainable ways.