Implementing ordinary least squares (OLS) using Statsmodels in Python

Are you looking for a computationally cheap, easy-to-explain linear estimator that’s based on simple mathematics? Look no further than OLS!

OLS stands for ordinary least squares. It is heavily used in econometrics, a branch of economics in which statistical methods are used to extract insights from economic data.

Simple linear regression assumes that the relationship between an independent variable (x) and a dependent variable (y) has the form y = mx + c, which is the equation of a line with slope m and intercept c.

In line with that, OLS is an estimator in which the values of m and c (from the above equation) are chosen in such a way as to minimize the sum of the squares of the differences between the observed dependent variable and predicted dependent variable. That’s why it’s named ordinary least squares.

This sum of squared differences is the loss that OLS minimizes, so the fitted line is the one whose predictions are closest to the observed data in the squared-error sense.
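To make the idea concrete, here is a minimal sketch that estimates m and c with the standard closed-form formulas for a single predictor. The data is synthetic and assumed purely for illustration, and the helper name sum_of_squares is hypothetical.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2x + 1 plus small noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Closed-form OLS estimates for one independent variable.
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
c = y_mean - m * x_mean


def sum_of_squares(slope, intercept):
    """Sum of squared differences between observed and predicted y."""
    return np.sum((y - (slope * x + intercept)) ** 2)


best = sum_of_squares(m, c)
print(f"m = {m:.3f}, c = {c:.3f}, sum of squares = {best:.3f}")

# Nudging either parameter away from the estimate can only increase the loss.
print(sum_of_squares(m + 0.1, c) >= best, sum_of_squares(m, c - 0.1) >= best)
```

The estimates should land close to the slope and intercept used to generate the data, and any other choice of m and c gives a larger sum of squares.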


Advantages of OLS

OLS is easier to implement than many comparable econometric techniques, because the theory of least squares is easier for a developer to understand than other common approaches.
OLS rests on a simple mathematical concept, so it is easier to explain at a high level to non-technical stakeholders.

Assumptions of OLS

There should be no perfect multicollinearity among the independent variables.
The error terms should have a mean of zero for any given values of the independent variables.
The sample used for the OLS regression model should be drawn randomly from the population.
All the error terms in the regression should have the same variance, a property known as homoscedasticity.
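One common way to screen for the multicollinearity assumption is the variance inflation factor (VIF), which regresses each independent variable on the others. The sketch below is a hand-rolled NumPy version on synthetic data. The helper name vif and the data are assumptions for illustration, and the often-quoted warning thresholds of 5 or 10 are only rules of thumb.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)


# Synthetic predictors (assumed for illustration).
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)                   # independent of x1
x3 = x1 + rng.normal(scale=0.1, size=200)   # nearly a copy of x1

low = vif(np.column_stack([x1, x2]))
high = vif(np.column_stack([x1, x2, x3]))
print(low)   # values near 1: little shared information
print(high)  # x1 and x3 are strongly inflated, x2 is not
```

A VIF near 1 means a variable shares little information with the others, while a large value signals that its coefficient will be estimated imprecisely.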

OLS using Statsmodels

Statsmodels is a library in the scientific Python ecosystem geared towards data analysis, data science, and statistics. It’s built on top of the numerical library NumPy and the scientific library SciPy.

The Statsmodels package provides several classes for linear regression, including OLS. Running and interpreting a linear regression is straightforward with the OLS class. We can perform regression using sm.OLS, where sm is the conventional alias for statsmodels.api.

OLS method

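sm.OLS takes two array-like objects as input, referred to here as a and b. a is generally a Pandas dataframe or a NumPy array holding the independent variables. Its shape is o × c, where o is the number of observations and c is the number of columns. b is generally a Pandas series of length o or a one-dimensional NumPy array holding the dependent variable. Note that sm.OLS expects the dependent variable first, as in sm.OLS(b, a), and that it does not add an intercept unless a contains a constant column.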

In the code below, OLS is implemented using the Statsmodels package:

OLS regression results

R-squared is also called the coefficient of determination. It’s a statistical measure of how well the regression line fits the data.
Adjusted R-squared adjusts R-squared for the number of independent variables in the model, penalizing variables that do not improve the fit.
The t-statistic is the ratio of the deviation of a parameter’s estimated value from its hypothesized value to its standard error.
The F-statistic is calculated as the ratio of the mean square of the model to the mean square of the residuals.
AIC stands for Akaike Information Criterion, which estimates the relative quality of statistical models for a given dataset.
BIC stands for Bayesian Information Criterion, which is used as a criterion for model selection among a finite set of models. BIC is similar to AIC, but it applies a heavier penalty to models with more parameters once the sample has more than a handful of observations.

Conclusion

Here we worked through a quick overview of OLS using Statsmodels and its implementation in a Jupyter Notebook with sample data. I hope you liked it and will give OLS a try for your regression problems.

You can find the code and the data here.

Happy Machine Learning 🙂