Classification Model Evaluation

What is Model Evaluation?

Model evaluation is the process of choosing between candidate models: different model types, tuning parameters, and feature sets. A better evaluation process leads to better, more accurate models in your applications.

In this article, we’ll be discussing model evaluation for supervised classification models. We’ll cover evaluation procedures, evaluation metrics, and where to apply them.


Prerequisites

  • Python 3.x
  • Anaconda (Scikit Learn, Numpy, Pandas, Matplotlib, Seaborn)
  • Jupyter Notebook
  • Basic understanding of supervised machine learning methods — specifically classification.


In my previous article series, I talked about how a machine learning workflow can be carried out for a classification task. We’ll take the same example and discuss in detail how model evaluation applies to it.

You can find my previous articles below:

  1. Machine Learning Workflow on Diabetes Data : Part 01
  2. Machine Learning Workflow on Diabetes Data : Part 02

In this section we’ll recap the model selection process.

The complete workflow is explained in detail in the above posts.

First we’ll import the necessary libraries and then read the dataset using the read_csv function of pandas.

Next we go through a quick data cleaning process to remove unusual rows from the dataset. The complete data cleaning process is described in part 1 of the series above.

Next we select the features that best predict the outcome. This step is explained in detail in part 2 of the series. We’ve selected the following features: ‘Pregnancies’, ‘Glucose’, ‘BMI’, ‘DiabetesPedigreeFunction’.

Finally, in part 2 of the series, after the hyper-parameter tuning phase, we’ve selected the logistic regression model with the given hyper-parameters.

Model Evaluation Procedures

Generally, we avoid training and testing a model on the same data because doing so rewards overfitting. Models that overfit the training data tend to perform poorly on out-of-sample data. To guard against this, we can use one of the following evaluation procedures.

  1. Train/Test Split
  2. K-Fold Cross Validation

Train/Test Split

This method splits the data set into two portions: a training set and a testing set. The training set is used to train the model. We can also measure the model’s accuracy on the training set, but we shouldn’t evaluate models based on this metric alone.

The testing set is only used to test the model and evaluate the accuracy after training. Data samples in the test set are never shown to the model during training. Accuracy on the test set provides a better indication of how models will perform on new data.

The scikit-learn library provides a function for dividing the data into train and test sets: train_test_split, from the model_selection module. First we’ll split the data into train and test sets. Then we’ll use the train set to train the logistic regression model. Finally, we’ll predict on the test set.

Finally, we calculate the performance of the model using the evaluation metric Classification Accuracy (which we’ll discuss in detail in an upcoming section). We get an accuracy score of 0.795 or 79.5%.
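The steps above can be sketched roughly as follows. The diabetes data is not loaded here; a synthetic dataset from make_classification stands in for it, and the split ratio, random_state and max_iter values are assumptions chosen for illustration, so the printed accuracy will not match the 0.795 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in for the diabetes data (4 features, binary target).
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Hold out 25% of the rows for testing; stratify keeps the class ratio similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Train on the training set only, then predict on the unseen test set.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.3f}")
```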

K-Fold Cross Validation

This method splits the data set into K equal partitions (“folds”), then uses 1 fold as the testing set and the union of the other folds as the training set.

The process repeats the above steps K times, using a different fold as the testing set each time. The average of the K test accuracies is reported as the model’s testing accuracy.

For cross validation, Scikit Learn provides the function cross_val_score, which is also from the model_selection module. We pass the logistic regression model along with the features X and responses y, set cv=10 to perform a 10-fold cross validation, and use classification accuracy as the scoring method. We get a mean accuracy of 78%.
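A minimal sketch of this call is shown below. As before, a synthetic dataset is an assumption standing in for the diabetes data, so the mean accuracy printed here is not the 78% quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumption: synthetic stand-in for the diabetes features X and responses y.
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

model = LogisticRegression(max_iter=1000)

# cv=10 requests 10 folds; each fold serves as the test set exactly once.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
mean_accuracy = scores.mean()

print("Per-fold accuracy:", scores.round(3))
print(f"Mean accuracy: {mean_accuracy:.3f}")
```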

Model Evaluation Metrics

A model evaluation metric is a criterion by which the performance or the accuracy of a model is measured.

In the upcoming sections we will discuss evaluation metrics in detail.

Classification Accuracy

Classification accuracy is by far the most common model evaluation metric used for classification problems. Classification accuracy is the percentage of correct predictions.

Even though classification accuracy is a good general-purpose metric, it can give a false sense of high performance when the class distribution is imbalanced.
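To see why, consider a small made-up example (the 95/5 class split is an assumption chosen for illustration): a classifier that always predicts the majority class scores high on accuracy while detecting no positive cases at all.

```python
# Hypothetical imbalanced labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5

# A useless classifier that always predicts the majority class.
y_pred = [0] * 100

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Fraction of actual positives that were detected.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"Accuracy: {accuracy:.2f}")  # high, despite the model learning nothing
print(f"Recall:   {recall:.2f}")    # no positive case is ever found
```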

Scikit-learn provides a dedicated function for computing accuracy, accuracy_score, in the metrics module. Accuracy is also available through cross_val_score: passing ‘accuracy’ as the scoring parameter makes it report the classification accuracy of each fold.

We already looked into classification accuracy using Scikit Learn in the Model Evaluation Procedures section.

Confusion Matrix

A confusion matrix can be defined loosely as a table that describes the performance of a classification model on a set of test data for which the true values are known. A confusion matrix is highly interpretable and can be used to derive a number of other metrics.

Scikit-learn provides a function to compute the confusion matrix on the testing data set. The confusion_matrix function takes the actual response class values and the predicted values, and returns the matrix.

Since our problem has only two response classes, it is a binary classification problem, so the confusion matrix is a 2 x 2 grid. The layout of a confusion matrix differs between implementations; in Scikit-learn, rows correspond to actual classes and columns to predicted classes. Scikit-learn’s documentation for confusion_matrix is found here.
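As a small illustration of that layout, the sketch below runs confusion_matrix on ten hand-written labels (made up for this example, not taken from the diabetes data).

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = diabetic, 0 = non-diabetic.
y_true = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[5 1]
#  [1 3]]
```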

The raw matrix above is not easy to interpret at a glance. Therefore we’ll plot the confusion matrix using a sample method found in Scikit-learn examples, which can be found here.

The basic terminology related to the confusion matrix is as follows. We’ll interpret with regards to our problem.

  • True Positives (TP) : Correct prediction as Diabetic
  • True Negatives (TN) : Correct prediction as Non-diabetic
  • False Positives (FP) : Incorrect prediction as Diabetic (‘Type I error’)
  • False Negatives (FN) : Incorrect prediction as Non-diabetic (‘Type II error’)

Metrics computed from the confusion matrix

First we’ll parse the obtained confusion matrix into True Positives(TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
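One way to do this, assuming the Scikit-learn layout with rows as actual classes and columns as predicted classes, is to flatten the 2 x 2 array. The counts below are hypothetical, not the ones obtained on the diabetes data.

```python
import numpy as np

# Hypothetical 2 x 2 confusion matrix in Scikit-learn's layout:
# [[TN, FP],
#  [FN, TP]]
cm = np.array([[5, 1],
               [1, 3]])

# ravel() flattens row by row, so the order is TN, FP, FN, TP.
TN, FP, FN, TP = cm.ravel()
print(TP, TN, FP, FN)
```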

We can calculate the following metrics from the confusion matrix.

Classification accuracy

Classification accuracy is the ratio of correct predictions to the total number of predictions. Put simply, it tells us how often the classifier is correct.

We can calculate the accuracy using the confusion matrix. Following is the equation to calculate the accuracy using the confusion matrix:
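In terms of the four counts defined above, the accuracy can be written as:

```latex
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]
```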

Accuracy can also be calculated using the method accuracy_score. We can observe that the accuracy is 0.795.


Sensitivity (Recall)

Sensitivity or recall is the ratio of correct positive predictions to the total number of actual positive instances. Put simply, it measures how sensitive the classifier is in detecting positive instances. This is also called the True Positive Rate.

Using the confusion matrix recall can be calculated as follows:
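Written with the confusion matrix counts, where the denominator is the number of actual positives:

```latex
\[
\text{Recall} = \frac{TP}{TP + FN}
\]
```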

Also, Scikit-learn provides a method called recall_score to find the recall score. We can observe that the classifier has a recall score of 0.58.


Specificity

Specificity is the ratio of correct negative predictions to the total number of actual negative instances. This determines how specific the classifier is: how well it avoids labeling negative instances as positive.

We can calculate specificity using the confusion matrix as follows.
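Using the same counts, with the number of actual negatives in the denominator:

```latex
\[
\text{Specificity} = \frac{TN}{TN + FP}
\]
```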

False Positive Rate

The false positive rate is the ratio of actual negative instances that were incorrectly predicted as positive to the total number of actual negative instances. Put simply, when the actual value is negative, how often is the prediction incorrect?

This can be calculated using the confusion matrix as follows:
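As a formula, which also shows that the false positive rate is the complement of specificity:

```latex
\[
\text{FPR} = \frac{FP}{FP + TN} = 1 - \text{Specificity}
\]
```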


Precision

Precision is the ratio of correct positive predictions to the total number of positive predictions. This measures how precise the classifier is when predicting positive instances.

This can be calculated from the confusion matrix as follows:
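Here the denominator is the number of predicted positives rather than actual positives:

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}
\]
```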

Scikit-learn provides the function precision_score to calculate precision. We can observe that the precision is 0.76.
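To check that the manual calculations and the Scikit-learn functions agree, here is a small worked sketch on ten hand-written labels. The labels are made up for illustration, so the values differ from the 0.795, 0.58 and 0.76 reported for the diabetes model.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 1 = diabetic, 0 = non-diabetic.
y_true = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]

# Count the four outcomes directly from the label pairs.
pairs = list(zip(y_true, y_pred))
TP = sum(t == 1 and p == 1 for t, p in pairs)
TN = sum(t == 0 and p == 0 for t, p in pairs)
FP = sum(t == 0 and p == 1 for t, p in pairs)
FN = sum(t == 1 and p == 0 for t, p in pairs)

accuracy = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP + FN)               # sensitivity / true positive rate
specificity = TN / (TN + FP)
false_positive_rate = FP / (FP + TN)  # equals 1 - specificity
precision = TP / (TP + FP)

print("Manual :", accuracy, recall, precision)
print("sklearn:", accuracy_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      precision_score(y_true, y_pred))
```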

Confusion matrix advantages:

  • Variety of metrics can be derived.
  • Useful for multi-class problems as well.

NOTE : Choosing which metric to use depends on the business objective or the nature of the problem.

Adjusting Classification Threshold

It’s possible to adjust the logistic regression model’s classification threshold to increase the model’s sensitivity.

After training, the model exposes a method called predict_proba, which returns, for each test sample, the probability of belonging to each response class. From this, we’ll take the probabilities of predicting a diabetic result.
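A rough sketch of this step is shown below. The data is again a synthetic stand-in (an assumption), and the variable name positive_proba is chosen here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in for the diabetes data.
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# One row per test sample, one column per class, ordered as in model.classes_.
proba = model.predict_proba(X_test)

# Column 1 holds the probability of the positive class (label 1).
positive_proba = proba[:, 1]
print(model.classes_, positive_proba[:5].round(3))
```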

Next we’ll plot the predicted probabilities of being diabetic in a histogram.

For a binary classification problem, the default classification threshold is 0.5, which means that if the predicted probability is less than 0.5, the sample is classified as “0 (non-diabetic)”. If the probability is more than 0.5, it’s classified as “1 (diabetic)”.

We can use Scikit-learn’s binarize function to lower the threshold to 0.3, which will classify a sample as ‘0 (non-diabetic)’ if its probability is 0.3 or less, and as ‘1 (diabetic)’ if it is greater.
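The effect of the threshold can be seen on a handful of made-up probabilities (hypothetical values, not model output). Note that binarize expects a 2D array, so the probabilities are reshaped into a single column first.

```python
import numpy as np
from sklearn.preprocessing import binarize

# Hypothetical predicted probabilities of the diabetic class.
positive_proba = np.array([0.10, 0.30, 0.31, 0.50, 0.90])

# Values strictly above the threshold become 1, all others become 0.
default_pred = binarize(positive_proba.reshape(-1, 1), threshold=0.5).ravel()
lowered_pred = binarize(positive_proba.reshape(-1, 1), threshold=0.3).ravel()

print(default_pred)  # only 0.90 is above 0.5
print(lowered_pred)  # 0.31, 0.50 and 0.90 are above 0.3
```

Lowering the threshold turns more samples into positive predictions, which is what raises sensitivity at the cost of specificity.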

Next we’ll print the confusion matrix for the new threshold predictions, and compare with the original.

Next we’ll calculate sensitivity and specificity to observe the changes from the previous confusion matrix calculations.

Previously the sensitivity calculated was 0.58. We can observe that the sensitivity has increased, which means the classifier now detects more of the “positive (diabetic)” instances.

Using the same process, we can calculate the specificity for the new confusion matrix. Previously it was 0.90. We observe that it has decreased.

We adjust the threshold of a classifier in order to suit the problem we’re trying to solve.

ROC curve

An ROC curve is a commonly used way to visualize the performance of a binary classifier, meaning a classifier with two possible output classes. The curve plots the True Positive Rate (Recall) against the False Positive Rate (also interpreted as 1-Specificity).

Scikit-learn provides a method called roc_curve to find the false positive and true positive rates across various thresholds, which we can use to draw the ROC curve. We can plot the curve as follows.
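A minimal sketch of roc_curve on four made-up samples follows (the labels and scores are hypothetical). The function returns one false positive rate and one true positive rate per candidate threshold; plotting them against each other, for example with Matplotlib's plt.plot(fpr, tpr), gives the ROC curve.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Each index i gives the (fpr[i], tpr[i]) point reached at thresholds[i].
fpr, tpr, thresholds = roc_curve(y_true, scores)

print("FPR:", fpr)
print("TPR:", tpr)
```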

The ROC curve itself does not show which threshold produced each point. But we can use the following method to find the specificity and sensitivity across various thresholds.

The following is an example to show how the sensitivity and specificity behave with several thresholds.
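One possible way to do this is sketched below. The helper sensitivity_specificity is hypothetical (it is not a Scikit-learn function), and the labels and probabilities are made up for illustration.

```python
import numpy as np

def sensitivity_specificity(y_true, positive_proba, threshold):
    """Return (sensitivity, specificity) when predicting 1 above `threshold`."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(positive_proba) > threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels and predicted probabilities of the positive class.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
positive_proba = [0.1, 0.2, 0.35, 0.6, 0.25, 0.4, 0.7, 0.9]

for threshold in (0.5, 0.3):
    sens, spec = sensitivity_specificity(y_true, positive_proba, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.3 raises sensitivity from 0.50 to 0.75 while specificity drops from 0.75 to 0.50.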

The ROC curve is a reliable indicator of a classifier’s performance. It can also be extended to classification problems with three or more classes using the “one versus all” approach.

AUC (Area Under the Curve)

AUC, or Area Under the Curve, is the fraction of the ROC plot that lies underneath the curve, a value between 0 and 1. AUC is useful as a single number summary of classifier performance.

In Scikit-learn, we can find the AUC score using the method roc_auc_score.
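For example, on four made-up samples (hypothetical labels and scores), roc_auc_score takes the true labels and the predicted probabilities of the positive class, not the thresholded class predictions:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ranked correctly
```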

Also, the cross_val_score method, which is used to perform the K-fold cross validation method, comes with the option to pass roc_auc as the scoring parameter. Therefore, we can measure the AUC score using the cross validation procedure as well.
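A short sketch of that option, once more on a synthetic stand-in for the diabetes data (an assumption), with cv=10 chosen to match the earlier cross validation example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumption: synthetic stand-in for the diabetes features and responses.
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

model = LogisticRegression(max_iter=1000)

# scoring="roc_auc" computes the AUC on each held-out fold.
auc_scores = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
print(f"Mean cross-validated AUC: {auc_scores.mean():.3f}")
```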

ROC/AUC advantages:

  • Setting a classification threshold is not required.
  • Useful even when there is a high class imbalance.


In this article, we explored the evaluation of classification models. We discussed why a model needs to be evaluated, and the main model evaluation procedures in use, such as “train/test split” and “k-fold cross validation”.

Next we talked about model evaluation metrics in detail along with code samples using Scikit-learn. We discussed, in detail: “classification accuracy”, “confusion matrix”, “ROC curve” and “area under the curve”.

Now you should be able to confidently evaluate a classification model and choose the best performing model for a given dataset using the knowledge gained from this article.

Source code that created this post can be found below.

If you have any problems or questions regarding this article, please don’t hesitate to leave a comment or drop me an email.

Hope you enjoyed the article. Cheers!
