In machine learning, overfitting occurs when a model studies the training data so thoroughly that it effectively memorizes it, losing the ability to predict correctly in unseen situations.
The irreducible error
I think most of us who are starting out in machine learning fall into the trap of overfitting our models. Exceptionally good accuracy on training data can actually be a bad sign, because the main goal of solving a machine learning problem is a generalized solution that works on unseen data.
Imagine this graph represents a classification problem where, for example, the red dots represent cats and the blue dots represent dogs. We need to classify a dot to be a cat or dog.
The black curve in this graph represents the ideal model. Even this model has an error ratio on the training data, caused by the noise (misplaced dots) in it; this error is called the irreducible error.
Hypothetically, if we used the ultimate model that captures the features perfectly, we would still see some error on the training data because of that noise. This irreducible error is the level of error we want to settle for when pushing our model to production.
Try to push the error below this level, and the test error will go up.
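To make the irreducible error concrete, here is a minimal NumPy sketch. The true function f and the noise level sigma are made up for illustration; the point is that even the perfect model (f itself) scores a training MSE of about sigma², and no model can do better than that on average.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical "ideal model": the true relationship between x and y.
def f(x):
    return np.sin(2 * np.pi * x)

# Training data = true signal + noise with standard deviation sigma.
sigma = 0.3
x = rng.uniform(0, 1, 10_000)
y = f(x) + rng.normal(0, sigma, x.size)

# Even the *perfect* model cannot beat the noise: its MSE on the
# noisy data is approximately sigma**2 — the irreducible error.
mse_of_true_model = np.mean((y - f(x)) ** 2)
print(mse_of_true_model)  # ≈ 0.09, i.e. sigma**2
```

Any model whose training MSE drops well below this value is fitting the noise, not the signal.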
Variance
Variance measures how jumpy our estimator is. The green curve in the previous graph, for example, is a model/estimator with high variance: it jumps around to fit the training data.
In the training process, the model tries to accurately capture the regularities in the training data. By overdoing this, the model ends up learning the noise as well, which makes it flexible enough to overfit the training data. The green curve in our classification example above shows this in action; that extra flexibility un-generalizes the model.
High variance (which mostly shows up as overfitting) can cause an algorithm to model the random noise in the training data rather than the intended (correct) outputs.
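"Jumpy" can be measured directly: refit the same model class on many noisy resamples of the data and watch how much its prediction at one fixed point moves around. This sketch uses made-up data (a sine curve plus noise) and polynomial fits as the estimators; the degrees are arbitrary stand-ins for an inflexible and a flexible model.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return np.sin(2 * np.pi * x)

x = np.linspace(0, 1, 20)
x0 = 0.5  # the point at which we inspect each fitted model

# Refit a polynomial of the given degree on many noisy resamples
# and collect its prediction at x0 each time.
def predictions_at_x0(degree, n_fits=200):
    preds = []
    for _ in range(n_fits):
        y = f(x) + rng.normal(0, 0.3, x.size)
        coefs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coefs, x0))
    return np.array(preds)

var_low_flex = predictions_at_x0(degree=1).var()    # rigid model
var_high_flex = predictions_at_x0(degree=12).var()  # flexible model
print(var_low_flex, var_high_flex)
```

The flexible model's predictions scatter far more from one training set to the next: that spread is exactly the variance of the estimator.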
Bias
When you lower the complexity of your model to make it generalize to unseen data, you can miss important regularities in the data; for example, using a straight line to separate the cats from the dogs in the previous graph.
Models with low bias are usually more complex, enabling them to represent the training set more accurately. Models with higher bias, on the other hand, tend to be relatively simple.
High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
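Bias can be measured the same way variance was: average the model's prediction at one point over many independent training sets and compare it with the true value. This sketch fits a straight line (a high-bias model) to made-up nonlinear data; the gap between the averaged prediction and the truth never closes, no matter how many datasets we average over.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    return np.sin(2 * np.pi * x)

x = np.linspace(0, 1, 20)
x0 = 0.25  # f(x0) = 1.0, the peak of the curve

# Average a straight-line fit's prediction at x0 over many noisy
# datasets.  Bias = (average prediction) - (true value).
preds = []
for _ in range(500):
    y = f(x) + rng.normal(0, 0.3, x.size)
    coefs = np.polyfit(x, y, 1)  # high-bias linear model
    preds.append(np.polyval(coefs, x0))

bias = np.mean(preds) - f(x0)
print(bias)  # far from 0: a line systematically misses the peak
```

This is underfitting in miniature: the model class simply cannot represent the relevant relation, so the error is systematic rather than random.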
The first graph (on the left) shows the data points scattered around the black curve (the true function, with some noise), together with 3 models of different flexibility.
- The Linear model is the least flexible. It has a high bias value and low variance value. As you can see in the middle graph, the linear model’s MSE is high in both the training and the testing data. This is due to underfitting.
- The Spline model is the most flexible. It has a low bias value and a high variance value. As you can see in the middle graph, its MSE is high in the testing data and low in the training data. This is due to overfitting.
- The Quadratic model is balanced in terms of bias and variance. It has both a low bias value and a low variance value, but not the minimum of either. Achieving the minimum of both at once isn't possible: the bias–variance trade-off, together with the irreducible error, makes some error inevitable.
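The three behaviors above can be reproduced in a small NumPy sketch. Everything here is made up for illustration: the "black curve" is a hypothetical true quadratic, and polynomial degrees 1, 2, and 12 stand in for the linear, quadratic, and spline models.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    return 6 * x**2 - 6 * x + 1  # stand-in for the graph's black curve

sigma = 0.3
x_train = np.linspace(0, 1, 20)
y_train = f(x_train) + rng.normal(0, sigma, x_train.size)
x_test = rng.uniform(0, 1, 1000)
y_test = f(x_test) + rng.normal(0, sigma, x_test.size)

# Fit a polynomial of the given degree; return (train MSE, test MSE).
def mse(degree):
    coefs = np.polyfit(x_train, y_train, degree)
    train = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train, test

results = {name: mse(d) for name, d in
           [("linear", 1), ("quadratic", 2), ("spline-like", 12)]}
for name, (train, test) in results.items():
    print(f"{name:12s} train MSE {train:.3f}   test MSE {test:.3f}")
```

The linear model's MSE is high on both sets (underfitting), the spline-like model's training MSE is tiny while its test MSE blows up (overfitting), and the quadratic lands near the irreducible error on both.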
Here’s the primary takeaway:
We can estimate the bias and variance of a model and tune its flexibility to reach a good balance between them.
For more information about the mathematics, check out this article.
Mean Squared Error: MSE(x) = Bias(x)² + Var(x)
Expected Prediction Error: EPE(x) = Bias(x)² + Var(x) + σ², where σ² is the irreducible error.
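The EPE decomposition can be checked numerically. This sketch (all functions and numbers are made up for illustration) fits the same model class on many independent training sets, estimates Bias² and Var of its prediction at one point, and compares Bias² + Var + σ² against the prediction error measured directly on fresh observations.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    return np.sin(2 * np.pi * x)

sigma = 0.3
x = np.linspace(0, 1, 30)
x0 = 0.3  # the point at which we decompose the error

# Fit a cubic on many independent training sets and record its
# prediction at x0 each time.
preds = []
for _ in range(2000):
    y = f(x) + rng.normal(0, sigma, x.size)
    coefs = np.polyfit(x, y, 3)
    preds.append(np.polyval(coefs, x0))
preds = np.array(preds)

bias_sq = (preds.mean() - f(x0)) ** 2
var = preds.var()

# Expected prediction error at x0, estimated directly against
# fresh noisy observations of f(x0).
new_y0 = f(x0) + rng.normal(0, sigma, preds.size)
epe = np.mean((new_y0 - preds) ** 2)

print(bias_sq + var + sigma**2, epe)  # the two sides agree
```

The two numbers match up to simulation noise, which is exactly what EPE(x) = Bias(x)² + Var(x) + σ² promises.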