Naive Bayes algorithms are a set of supervised machine learning algorithms based on the Bayes probability theorem, which we’ll discuss in this article. Naive Bayes algorithms assume that there’s no correlation between features in a dataset used to train the model.
In spite of this oversimplified assumption, naive Bayes classifiers work very well in many complex real-world problems. A big advantage of naive Bayes classifiers is that they only require a relatively small number of training data samples to perform classification efficiently, compared to other algorithms like logistic regression, decision trees, and support vector machines.
Before we dive into the implementation, let’s first cover some key terms related to naive Bayes.
The Bayes theorem describes the probability of a feature, based on prior knowledge of situations related to that feature. For example, if the probability of someone having diabetes is related to his or her age, then by using the Bayes theorem, the age can be used to more accurately predict the probability of diabetes.
The word naive implies that every pair of features in the dataset is independent of each other. All naive Bayes classifiers work on the assumption that the value of a particular feature is independent from the value of any other feature for a given the class.
For example, a fruit may be classified as an orange if it’s round, about 8 cm in diameter, and is orange in color. With a naive Bayes classifier, each of these three features (shape, size, and color) contributes independently to the probability that this fruit is an orange. Also, it’s assumed that there is no possible correlation between the shape, size, and color attributes.
Oftentimes, we deal with continuous features, and Gaussian means it’s assumed that the continuous values associated with each class are distributed according to a normal distribution. Suppose the training data contains a continuous feature A—let’s divide the feature data by the class and then find the mean and variance of x in each class—this should end up as a normal distribution.
Let’s start our implementation using Python and a Jupyter Notebook.
Once the Jupyter Notebook is up and running, the first thing we should do is import the necessary libraries.
We need to import:
To actually implement the naive Bayes classifier model, we’re going to use scikit-learn, and we’ll import our GaussianNB from sklearn.naive_bayes.
import numpy as np import pandas as pd from sklearn.model_selection import train_test_split from sklearn.naive_bayes import GaussianNB from sklearn.metrics import accuracy_score import matplotlib.pyplot as plt import seaborn as sns
Load the Data
Once the libraries are imported, our next step is to load the data, stored in the GitHub repository linked here. You can download the data and keep it in your local folder. After that, we can use the read_csv method of Pandas to load the data into a Pandas data frame df, as shown below:
df = pd.read_csv(‘Naive-Bayes-Classifier-Data.csv’)
Also, in the snapshot of the data below, notice that the data frame has two columns, x and y. Here, x is the feature and y is the label. We’re going to predict y using x as an independent variable.
Before feeding the data to the naive Bayes classifier model, we need to do some pre-processing.
Here, we’ll create the x and y variables by taking them from the dataset and using the train_test_split function of scikit-learn to split the data into training and test sets.
Note that the test size of 0.25 indicates we’ve used 25% of the data for testing. random_state ensures reproducibility. For the output of train_test_split, we get x_train, x_test, y_train, and y_test values.
Train the model
We’re going to use x_train and y_train, obtained above, to train our naive Bayes classifier model. We’re using the fit method and passing the parameters as shown below.
Note that the output of this cell is describing a few parameters like priors and var_smoothing for the model. All these parameters are configurable, and you’re free to tune them to match your requirements.
Once the model is trained, it’s ready to make predictions. We can use the predict method on the model and pass x_test as a parameter to get the output as y_pred.
Notice that the prediction output is an array of real numbers corresponding to the input array.
Finally, we need to check to see how well our model is performing on the test data. For this, we evaluate our model by finding the accuracy score produced by the model.
In this article, we discussed how to implement a naive Bayes classifier algorithm. We also looked at how to pre-process and split the data into features as variable x and labels as variable y.
After that, we trained our model and then used it to run predictions. You can find the data used here.