Support Vector Regression in Python Using Scikit-Learn

Introduction

Support vector regression (SVR) is a supervised learning method that models the relationship between one or more input features and a continuous target variable. In its simplest form, that relationship is linear.

In regression problems, we generally try to find the line that best fits the data provided. In its simplest form, the equation of that line is y = mx + c.

In the case of regression using a support vector machine, we do something similar but with a slight change. Here we define a small error tolerance e and treat any error (error = prediction – actual) whose magnitude is at most e as acceptable.

The value of e determines the width of the error tube (also called the insensitive tube). It also affects the number of support vectors: a smaller e means a lower tolerance for error, so more data points tend to fall outside the tube and become support vectors.

Thus, we try to find the best-fit line in such a way that:

(mx+c)-y ≤ e and y-(mx+c) ≤ e

Also, we do not care about errors as long as they are less than e. So in this case, only those data points that lie outside the e error region contribute to the final cost calculation.
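To make the cost calculation concrete, here is a minimal sketch of the epsilon-insensitive loss for a handful of made-up predictions. The function name epsilon_insensitive_loss and the sample values are our own illustrations, not part of scikit-learn or the article's dataset.

```python
import numpy as np


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-point loss: zero inside the tube, growing linearly outside it."""
    residual = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(0.0, residual - epsilon)


# Hypothetical actual values and predictions, chosen only for illustration.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.05, 2.5, 2.9, 2.0])

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.2)

# The first and third points are within 0.2 of the truth, so they cost nothing.
print(losses)
```

Only the second and fourth points, whose errors exceed e = 0.2, add to the total cost.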

For example, if we’re dealing with stock trading, we may want to minimize the trading loss but not care about individual losses as long as they are less than a certain value (e).

Hence, the support vector regression model depends only on a subset of the training data points (the support vectors), as the cost function of the model ignores any training point whose prediction error is less than e.
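One way to see this dependence on a subset of points is to count the support vectors for a few values of e. The sketch below uses synthetic data that we generate ourselves (not the article's dataset) and assumes a linear kernel; the exact counts will vary with the data.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic, noisy linear data (an assumption made for this illustration).
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=200)

sv_counts = {}
for eps in (0.1, 0.5, 2.0):
    model = SVR(kernel="linear", epsilon=eps).fit(X, y)
    # support_ holds the indices of the training points used by the model.
    sv_counts[eps] = len(model.support_)
    print(f"epsilon={eps}: {sv_counts[eps]} of {len(X)} points are support vectors")
```

With a wide tube most points fall inside it and are ignored, so the fitted model relies on far fewer training points.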

In the realm of machine learning, a support vector regression algorithm can, in some cases, be more suitable for regression problems than other common and popular algorithms. Below are the cases where a support vector regression is advantageous over other regression algorithms:

Memory is a concern. The trained model is expressed by means of a small subset of the training points (the support vectors), which makes it memory efficient to store and to use for prediction.
There are non-linear or complex relationships between features and labels. With support vector regression, we have the option of using a kernel to map the data into a higher-dimensional space, where a non-linear relationship can be fit as a linear one.

Before we jump into an implementation of a support vector machine for a regression problem, let’s define some key terms.

Key Terms

Hyperplane

A hyperplane is a subspace that has one less dimension than that of its surrounding space. For example, with a two-dimensional space, the hyperplane will be one dimensional (i.e. a line). For a three-dimensional space, the hyperplane will be two-dimensional (i.e. a plane).

Kernel

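A kernel is a mathematical function that takes two data points as input and returns a measure of their similarity, computed as if the points had been mapped into a higher-dimensional feature space. This lets an SVM fit non-linear relationships without performing that mapping explicitly. A few examples of kernels used in SVM are the linear and radial basis function (RBF) kernels.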
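As a small illustration, the sketch below evaluates the linear and RBF kernels by hand for two made-up points and compares the RBF value with scikit-learn's rbf_kernel helper. The points and the gamma value are arbitrary choices for the example.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two arbitrary points and an arbitrary gamma, chosen only for illustration.
a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])
gamma = 0.5

# Linear kernel: the plain dot product of the two points.
linear_value = np.dot(a, b)

# RBF kernel: exp(-gamma * squared Euclidean distance).
manual_rbf = np.exp(-gamma * np.sum((a - b) ** 2))

# The same RBF value computed by scikit-learn (expects 2-D inputs).
library_rbf = rbf_kernel(a.reshape(1, -1), b.reshape(1, -1), gamma=gamma)[0, 0]

print(linear_value, manual_rbf, library_rbf)
```

The RBF value approaches 1 for points that are close together and 0 for points that are far apart, which is why it can be read as a similarity score.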

Parametric

Parametric models have a fixed, finite number of parameters, whereas the number of parameters in a non-parametric model is not fixed in advance. So for a parametric model, the complexity of the model doesn’t grow with the number of training data points.

In the case of non-parametric models, the complexity of the model grows with the number of training data points. Linear regression, logistic regression, and linear SVMs are parametric models, whereas decision trees (the depth of the tree changes as the training data changes), random forests, and SVMs with radial basis function kernels are non-parametric.
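The difference can be observed directly. In the sketch below, which uses synthetic data of our own making, a linear SVR is summarized by the same number of coefficients regardless of sample size, while the RBF model keeps more support vectors as the training set grows. The helper name fit_and_describe is hypothetical.

```python
import numpy as np
from sklearn.svm import SVR


def fit_and_describe(n_samples):
    """Fit a linear and an RBF SVR; return (coefficient count, RBF support vector count)."""
    rng = np.random.default_rng(42)
    X = rng.uniform(0, 5, size=(n_samples, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.2, size=n_samples)

    linear_model = SVR(kernel="linear").fit(X, y)
    rbf_model = SVR(kernel="rbf").fit(X, y)

    # coef_ is only defined for the linear kernel: one weight per feature.
    return linear_model.coef_.size, len(rbf_model.support_vectors_)


results = {n: fit_and_describe(n) for n in (50, 500)}
for n, (n_coef, n_sv) in results.items():
    print(f"n={n}: linear coefficients={n_coef}, RBF support vectors={n_sv}")
```

The linear model can always be collapsed to one weight per feature plus an intercept, whereas the RBF model has to keep its support vectors around to make predictions.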

Implementation

Let’s start our implementation using Python and a Jupyter Notebook.

Once the Jupyter Notebook is up and running, the first thing we should do is import the necessary libraries.

We need to import:

NumPy
Pandas
SVR
train_test_split
r2_score
mean_squared_error
Seaborn
Matplotlib

To actually implement the support vector regression model, we’re going to use scikit-learn, and we’ll import SVR from sklearn.svm:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
import matplotlib.pyplot as plt

Load the Data

Now that the libraries are imported, our next step is to load the data, which is stored at the GitHub repo linked here. You can download the data and keep it in your local folder. After that, we can use the read_csv function of Pandas to load the data into a Pandas data frame df, as shown below.

df = pd.read_csv('Support-Vector-Regression-Data.csv')

Notice that the data frame has two columns, x and y. Here, x is the feature and y is the label. We’re going to predict y using x as an independent variable.

Data pre-processing

Before feeding the data to the support vector regression model, we need to do some pre-processing.

Here, we’ll create the x and y variables by taking them from the dataset, and then use the train_test_split function of scikit-learn to split the data into training and test sets.

We also need to reshape the feature values using the reshape method so that they have the two-dimensional shape (n_samples, n_features) that scikit-learn estimators expect.

Note that the test size of 0.4 indicates we’ve used 40% of the data for testing. Setting random_state makes the split reproducible. As the output of train_test_split, we get the x_train, x_test, y_train, and y_test values.
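The pre-processing step might look like the sketch below. Because the article's CSV file isn't bundled here, we build a small stand-in data frame with columns x and y. The synthetic values and random_state=42 are our assumptions, and y is kept one-dimensional, which is the shape SVR expects for its targets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the article's CSV (assumption: two numeric columns, x and y).
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": np.linspace(0, 10, 100)})
df["y"] = 3.0 * df["x"] + rng.normal(scale=1.0, size=100)

# Features must be 2-D: one row per sample, one column per feature.
x = df["x"].values.reshape(-1, 1)
# Targets stay 1-D.
y = df["y"].values

# Hold out 40% of the rows for testing; random_state fixes the shuffle.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=42
)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
```

With 100 rows, this yields 60 training samples and 40 test samples.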

Train the model

We’re going to use x_train and y_train, obtained above, to train our support vector regression model. We call the fit method and pass x_train and y_train as arguments.

Note that, depending on your scikit-learn version, the output of this cell may describe a large number of model parameters such as cache_size, degree, and epsilon (newer versions print only the parameters that differ from their defaults). All these parameters are configurable, and you’re free to tune them to match your requirements.
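A training cell could look like the following sketch. It regenerates stand-in data so that it runs on its own, and the choice of a linear kernel with C=1.0 and epsilon=0.1 is our assumption rather than a setting confirmed by the article.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Stand-in data (assumption): a noisy linear relationship between x and y.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3.0 * x.ravel() + rng.normal(scale=1.0, size=100)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=42
)

# Hyperparameters here are illustrative; tune them for your own data.
regressor = SVR(kernel="linear", C=1.0, epsilon=0.1)
regressor.fit(x_train, y_train)

# get_params lists every configurable parameter, whatever the version prints.
params = regressor.get_params()
print(params)
```

Calling get_params is a version-independent way to inspect settings such as cache_size, degree, and epsilon.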

Prediction

Once the model is trained, it’s ready to make predictions. We can use the predict method on the model and pass x_test as a parameter to get the output as y_pred.

Notice that the prediction output is an array of real numbers corresponding to the input array.
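The prediction step can be sketched as follows, again on stand-in data and with an assumed linear kernel so that the snippet runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Stand-in data (assumption): a noisy linear relationship between x and y.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3.0 * x.ravel() + rng.normal(scale=1.0, size=100)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=42
)

regressor = SVR(kernel="linear").fit(x_train, y_train)

# One real-valued prediction per row of x_test.
y_pred = regressor.predict(x_test)
print(y_pred[:5])
```

The result is a one-dimensional NumPy array with as many entries as there are rows in x_test.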

Model Evaluation

Finally, we need to check to see how well our model is performing on the test data. For this, we evaluate our model by finding the root mean squared error produced by the model.

mean_squared_error is a function provided by scikit-learn, and we’re using NumPy’s square root function (np.sqrt) on top of it to find the root mean squared error value.
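The evaluation can be sketched as below. The data and model settings are stand-ins, so the scores printed here will differ from those obtained on the article's dataset. We also compute r2_score, since it was imported earlier.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Stand-in data (assumption): a noisy linear relationship between x and y.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3.0 * x.ravel() + rng.normal(scale=1.0, size=100)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=42
)

regressor = SVR(kernel="linear").fit(x_train, y_train)
y_pred = regressor.predict(x_test)

# RMSE is the square root of the mean squared error, in the units of y.
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

# R^2 measures the share of variance in y_test explained by the model.
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.3f}, R^2: {r2:.3f}")
```

A lower RMSE and an R^2 closer to 1 both indicate a better fit on the held-out data.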

End notes

In this article, we discussed how to implement regression using the support vector machine algorithm in scikit-learn. We also looked at how to pre-process the data into features (variable x) and labels (variable y) and split it into training and test sets.

After that, we trained our model and then used it to run predictions. You can find the data used here.