Implementing ARIMA using Statsmodels and Python

ARIMA stands for Auto Regressive Integrated Moving Average. It’s a class of statistical models used for analyzing and forecasting time series data. ARIMA models are more flexible than simpler statistical models such as simple linear regression, because they can capture how a series depends on its own past. Here’s a closer look at the terms that make up this acronym:

Auto regressive indicates a series in which an observation depends on some number of lagged observations. In mathematical terms, it means that any given value v(t) in the series can be explained by a function of its previous value v(t-1) and some random error, say e(t). In some cases, the series could be related to two or more past values as well.
Integrated means that differencing (described below) is applied to the given series in order to make it stationary.
Moving average means that any given value v(t) in the series can be explained by a function of one or more previous errors, like e(t-1), together with the current error e(t). In some cases, the series could be related to two or more past errors as well.
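
To make the auto regressive and moving average definitions above concrete, here is a minimal sketch that simulates one series of each kind with NumPy. The coefficient values phi and theta, the series length, and the random seed are assumptions chosen purely for illustration; they do not come from any dataset used in this tutorial.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 500
e = rng.normal(size=n)      # random errors e(t)

phi = 0.7                   # assumed AR coefficient (illustrative)
theta = 0.5                 # assumed MA coefficient (illustrative)

# AR(1): v(t) = phi * v(t-1) + e(t)
ar = np.zeros(n)
for t in range(1, n):
    ar[t] = phi * ar[t - 1] + e[t]

# MA(1): v(t) = e(t) + theta * e(t-1)
ma = np.zeros(n)
ma[0] = e[0]
ma[1:] = e[1:] + theta * e[:-1]

print("first AR(1) values:", ar[:5])
print("first MA(1) values:", ma[:5])
```

In the AR series each value carries forward part of the previous value, while in the MA series each value only carries forward part of the previous error.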

Statsmodels is part of the scientific Python stack and is oriented towards data analysis, data science, and statistics. It’s built on top of the numerical library NumPy and the scientific library SciPy, and it also integrates well with the Pandas library.

Stationarity

If the statistical properties of a time series don’t depend on the time at which the series is observed, the series is called stationary. Hence, statistical properties like the mean, variance, and covariance of a stationary time series are all constant over time.

Seasonality

Seasonality in a time series is a pattern of changes that repeats over a fixed period of time. For example, ice cream sales increase every summer. Hence, ice cream sales data will exhibit significant seasonality. Time series with trends, or with seasonality, are non-stationary.

If a time series has a trend or seasonality component, it must be made stationary before we can use ARIMA to forecast. The Dickey-Fuller test is one of the ways to determine if the series is stationary or not.

Differencing

If the time series is non-stationary, it needs to be made stationary using a technique called differencing. To do this, we take the first difference, then check for stationarity, and repeat the process until the series is stationary.

Differencing is implemented by subtracting the previous value in the series from the current one. If doing this once eliminates the trend or non-stationarity of the data, we say we’re using differencing of order 1. In this case, the data increases or decreases at a roughly constant rate.

However, if the data is still non-stationary after differencing of order 1, it means the data was changing at a non-constant rate. So we need to apply differencing again to the differenced data, which effectively gives second order differencing.
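
The difference between order 1 and order 2 can be seen on two toy series. This is a minimal sketch using the pandas Series.diff method; the linear and quadratic series are made-up examples, not data from this tutorial.

```python
import pandas as pd

t = pd.Series(range(10))
linear = 3 * t + 2       # changes at a constant rate
quadratic = t ** 2       # the rate of change itself keeps changing

# Order-1 differencing: current value minus previous value
d1_linear = linear.diff().dropna()
d1_quadratic = quadratic.diff().dropna()

# Order-2 differencing: difference the already differenced series
d2_quadratic = quadratic.diff().diff().dropna()

print("linear, order 1:   ", d1_linear.tolist())
print("quadratic, order 1:", d1_quadratic.tolist())
print("quadratic, order 2:", d2_quadratic.tolist())
```

One round of differencing flattens the linear series to a constant, while the quadratic series still trends after one round and only becomes constant after the second.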

Parameters in ARIMA

There are three parameters for ARIMA models, generally denoted by p, d, and q. What follows is a brief overview of these parameters.

Parameter p

p is the parameter associated with the auto-regressive aspect of the model. p represents the number of auto-regressive terms and allows us to incorporate the effect of past values into our model.

In layman’s terms, this parameter says it’s likely to rain tomorrow if it has been raining for the last 5 days.

Parameter d

The next parameter of ARIMA is d. d is the parameter associated with the integrated part of the model. d is the number of non-seasonal differences needed to make the series stationary.

In layman’s terms, this parameter says it’s likely to rain the same amount tomorrow if the difference in rain in the last 5 days has been small.

Parameter q

The last parameter of ARIMA is q. q is the parameter associated with the moving average part of the model. q is the number of moving-average terms, or the size of the moving average window. In other words, q is the number of lagged forecast errors in the prediction equation.

In layman’s terms, this parameter sets the error of the model as a linear combination of the error values observed at previous time points in the series.

ARIMA using Python

Below is the code written in Python using a Jupyter Notebook for ARIMA implementation.

Note that the code below imports ARIMA from the statsmodels library and uses a parsealldate function to parse the date columns in the data.

Note that we’re using the following values: p=0, d=1, and q=1. We’re printing the model result using the summary method. A value of 0 can be used for any of the three parameters, which indicates we aren’t using that element of the model.

Thus, ARIMA can be configured to act like an ARMA model if we make d=0, and similarly, we can configure it to work like a simple AR, I, or MA model.

You can find the data and code on GitHub here.

In this tutorial we learned how to implement an ARIMA model in Python using the statsmodels library. I encourage you to try different values of p, d, and q and see for yourself how they affect the results. I hope you enjoy the experiment.

Happy Machine Learning 🙂