Machine Learning Model Evaluation & Selection

Introduction

A good data scientist is not one who knows all the fancy algorithms but one who knows when they are overfitting. We have all been through that time when our super awesome, fully tuned model failed to live up to expectations on the Kaggle private leaderboard or after deployment. Knowing how to get an unbiased estimate of the predictive power of our model is important. Validation strategies such as holdout and cross validation are commonly used in practice for this. But which strategy is appropriate in which scenario needs more discussion and thought.

In this series, I’ll share my understanding of this topic, which is derived from my experiences as a master's student in data science and from many amazing blogs on this subject. This series by Sebastian Raschka was a big inspiration for this blog.

In part 1, we’ll introduce different validation strategies, their pros & cons, and how to use them correctly for model evaluation. In part 2, we’ll talk about model selection and how to bring the idea of model selection and evaluation together.

Refer to this notebook for all the code for this post.

Outline

Machine learning — 101
Objectives of validation
Holdout validation
Cross validation

Machine learning — 101 (Back to basics)

Let’s go over some fundamental definitions in machine learning that will be commonly used.

Features & Target

Target (Y) is what we’re trying to predict. Features (X) are factors we think will help us in predicting this target.

Model

This is the manifestation of our estimate of the true relationship (f) between the features and the target.

Learning algorithm

The functional space that we can explore to estimate the true relationship is infinite. The learning algorithm narrows this space. Below are some of the popular learning algorithms.

Loss/Objective function

This is a function of the difference between the estimate (Ŷ) and the ground truth (Y). During training, we minimize the loss function in order to learn the functional relationship.

Parameter & Hyperparameters

Parameters of a model establish the rules of how X will translate into Ŷ. They are learnt during the model training (minimizing loss function). For ex: In regression, coefficients of features are the parameters.

Hyperparameters control how the model learns its parameters. They are fixed before model training starts. For ex: k in the k-nearest neighbours algorithm is a hyperparameter.
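To make the distinction concrete, here is a minimal sketch using scikit-learn's Ridge regression (my choice for illustration, not something the post prescribes). The toy data is made up. The value of alpha is set before training, while the coefficient and intercept are learnt from the data.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Made-up toy data: one feature, four observations.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Hyperparameter: the regularization strength alpha is fixed before training.
model = Ridge(alpha=1.0)

# Parameters: the coefficient and intercept are learnt by minimizing the loss.
model.fit(X, y)

print("hyperparameter alpha:", model.alpha)
print("learnt coefficient:", model.coef_)
print("learnt intercept:", model.intercept_)
```

Changing alpha and refitting gives a different set of learnt parameters, which is exactly why each hyperparameter setting counts as a different model.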

Bias & Variance

Bias and variance are general concepts that apply to the estimation of any non-deterministic parameter. Here, "parameter" can refer to the parameters of f or to the true predictive power of our model. We’ll use these 2 terms in the context of the latter.

High bias implies our estimate based on the observed data is not close to the true parameter (aka underfitting).

High variance implies our estimates are sensitive to sampling. They’ll vary a lot if we compute them with a different sample of data (aka overfitting).

Now we’re ready to roll. So far, we have only a vague idea of why we need validation. Let’s formalize this first.

Objectives of validation

We use validation strategies for 3 broad objectives:

Algorithm selection: Selecting the class of models which is best-suited for the data at hand (tree-based models vs neural networks vs linear models)
Hyperparameter tuning: Tuning the hyperparameters of a model to increase its predictive power
Measure of generalizability: Computing an unbiased estimate of the predictive power of the model

We refer to the 3rd objective as model evaluation. The first 2 objectives come under the task of model selection. We call it model selection because a specific class of model (for ex: Random Forest) with specific values of hyperparameters (for ex: max_depth = 5) gives us a single model that we train and improve. Different combinations of learning algorithms and hyperparameters give us different models, and we have to select the best among them.

In this blog, I’ll be covering model evaluation and the next blog will cover model selection.

With this in mind, let’s understand the nuances of different validation strategies. For each validation strategy, I’ll talk about the following:

Implementation of the strategy
Considerations while using this strategy
Confidence intervals

Validation strategies can be broadly divided into 2 categories: Holdout validation and cross validation.

Holdout validation

Within holdout validation we have 2 choices: Single holdout and repeated holdout.

a) Single Holdout

Implementation

The basic idea is to split our data into a training set and a holdout test set. Train the model on the training set and then evaluate model performance on the test set. We take only a single holdout—hence the name. Let’s walk through the steps:

Step 1: Split the labelled data into 2 subsets (train and test).

Step 2: Choose a learning algorithm. (For ex: Random Forest). Fix values of hyperparameters. Train the model to learn the parameters.

Step 3: Predict on the test data using the trained model. Choose an appropriate metric for performance estimation (ex: accuracy for a classification task). Assess predictive performance by comparing predictions and ground truth.

Step 4: If the performance estimate computed in the previous step is satisfactory, combine the train and test subset to train the model on the full data with the same hyperparameters.
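The four steps above can be sketched as follows. This is a minimal illustration, assuming scikit-learn and a synthetic dataset from make_classification standing in for the labelled data. The 70/30 split and the hyperparameter values are arbitrary choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labelled data (an assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Step 1: split the labelled data into train and test subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Step 2: choose a learning algorithm, fix the hyperparameters, and train.
hyperparams = {"n_estimators": 100, "max_depth": 5, "random_state": 0}
model = RandomForestClassifier(**hyperparams).fit(X_train, y_train)

# Step 3: predict on the test set and compare with the ground truth.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Holdout accuracy: {test_accuracy:.3f}")

# Step 4: if the estimate is satisfactory, retrain on the full data
# with the same hyperparameters.
final_model = RandomForestClassifier(**hyperparams).fit(X, y)
```

Note that test_accuracy describes the model trained on the training subset. The final model is trained on more data, so the estimate is, if anything, slightly pessimistic for it.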

Considerations

Some things that need to be taken into account while using this strategy:

Random splitting or not?

Whether to split the data randomly depends on the kind of data we have. If the observations are independent of each other, random splitting can be used. In cases where this assumption is violated, random splitting should be avoided. A typical example is time series data, where observations depend on each other: today’s stock price will most likely depend on yesterday’s stock price.

For time-ordered data, holding out the most recent observations as the test set is a natural choice, since more recent data is more likely to resemble what we can expect in production.
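A non-random split for time-ordered data could look like the sketch below. The function name chronological_split and the toy observations are hypothetical, made up for illustration. The idea is simply to sort by time and hold out the latest rows.

```python
from datetime import date, timedelta

# Hypothetical daily observations as (date, value) pairs, deliberately unsorted.
start = date(2023, 1, 1)
observations = [(start + timedelta(days=i), 100 + i) for i in range(10)]
observations.reverse()

def chronological_split(rows, test_fraction=0.3):
    """Sort rows by timestamp, then hold out the most recent ones as the test set."""
    ordered = sorted(rows, key=lambda row: row[0])
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]

train, test = chronological_split(observations)
print("train:", train[0][0], "to", train[-1][0])
print("test: ", test[0][0], "to", test[-1][0])
```

Every training observation precedes every test observation, so the model is never evaluated on data older than what it was trained on.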

Stratified sampling

While splitting, we need to ensure that the distributions of the features and the target remain similar in the training and test sets.

For ex: Consider a problem where we’re trying to classify an observation as fraudulent or not. While splitting, if the majority of fraud cases went to the test set, the model won’t be able to learn the fraudulent patterns, as it doesn’t have access to many fraud cases in the training data. In such cases, stratified sampling should be done, as it maintains the proportion of different classes in the train and test set.
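As a sketch of the fraud example, scikit-learn's train_test_split accepts a stratify argument that preserves class proportions. The labels below are made up (5% fraud) and the single feature is a placeholder. Note that this stratifies on the target only; it does not by itself guarantee matching feature distributions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 5% fraud (1), 95% legitimate (0).
y = np.array([1] * 50 + [0] * 950)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# stratify=y keeps the fraud rate the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print("fraud rate in train:", y_train.mean())
print("fraud rate in test: ", y_test.mean())
```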

Choice of test size

Keeping aside a large amount of data for the test set can result in an underestimation of predictive power (high bias**), because the model is trained on less data. But the estimate will be more stable (low variance**), as shown in the figure below. This consideration is more relevant for smaller datasets.

**Note: Here, bias and variance are w.r.t. the estimate of predictive power and not of the model itself.

More training data is generally better

With more training data, the model’s predictive power should improve. This is why, in step 4, we combine the train and test sets to build the final model.

No model selection

We should not do model selection and model evaluation on the same holdout. If we are trying multiple models on the same test set, we’re looking at the test set multiple times. Hence, the estimate of the true predictive power of the best model will be positively biased. More on model selection in part 2.

How confident are we in our estimates?

From the above steps, we’ll get a point estimate of the true predictive power of our model. But this single number doesn’t mean anything unless we know how confident we are in this estimate.

Defining the confidence interval around this point estimate would tell us how much this estimate can vary for a different set of model inputs. Let’s discuss a way of estimating this interval.

Normal approximation interval

Suppose we’re choosing accuracy as the proxy for predictive power of the model.

Let’s look at the calculation for the confidence interval (CI) in this case:
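In its usual form, the normal approximation interval treats each test prediction as an independent Bernoulli trial and assumes the test set is reasonably large. The symbols ACC, n and z below are my notation, not taken from the post.

```latex
\[
\mathrm{CI} = \mathrm{ACC} \pm z \sqrt{\frac{\mathrm{ACC}\,(1 - \mathrm{ACC})}{n}}
\]
```

Here ACC is the accuracy measured on the test set, n is the number of test observations, and z is the standard normal quantile for the chosen confidence level (z ≈ 1.96 for a 95% CI).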

I used a Random Forest classifier on Fashion MNIST data (10000 images). Below is the test accuracy (single holdout) and associated 95% CI using a normal approximation interval for different test sizes.

This method not only provides a way to compute the confidence interval, but it also helps us choose an appropriate test size.
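A minimal sketch of the normal approximation interval, showing how the interval narrows as the test size grows. The accuracy of 0.85 and the test sizes are hypothetical values for illustration, not results from the Fashion MNIST experiment. The function name normal_approx_interval is also my own.

```python
import math

def normal_approx_interval(accuracy, n_test, z=1.96):
    """Return (low, high) for accuracy measured on n_test observations.

    Assumes independent test predictions; z=1.96 corresponds to a 95% CI.
    """
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n_test)
    # Clip to [0, 1], since accuracy cannot fall outside that range.
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)

# Hypothetical accuracy of 0.85 evaluated at different test sizes.
for n_test in (100, 1000, 3000):
    low, high = normal_approx_interval(0.85, n_test)
    print(f"n_test={n_test:5d}  95% CI = [{low:.3f}, {high:.3f}]")
```

The half-width shrinks in proportion to 1/sqrt(n), so quadrupling the test size only halves the interval. That trade-off has to be weighed against the training data given up.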