Time Series Part 2: Forecasting with SARIMAX models: An Intro
This tutorial provides the basics of SARIMAX models in a way that helps you understand how the algorithm works, which is also useful if you want to study other forecasting algorithms.
- Python
- Time Series
- Forecasting
- SARIMAX
In our first tutorial we introduced some basics of time series. In this one we will learn about ARIMA models and their variants SARIMA and ARIMAX: statistical models used for forecasting.
The code of this tutorial can be found at 02-Forecasting_with_SARIMAX.ipynb on GitHub.
After completing this tutorial, you will know:
- What an ARIMA model is
- When to use ARIMA, SARIMA, or ARIMAX
- How to fit an (S)ARIMA(X) model
- How to optimize an (S)ARIMA(X) model
- How to make forecasts
- How to evaluate the obtained model
The other tutorials about time series are available at:
- Time Series Part 1: An Introduction to Time Series Analysis
- Time Series Part 3: Forecasting with Facebook Prophet: An Intro
(S)ARIMA(X) models
In this section you will learn about ARIMA models and their variants SARIMA and ARIMAX.
ARIMA model
ARIMA stands for Auto Regressive Integrated Moving Average. It is a combination of two models: an AR (Auto Regressive) model, which uses lagged values of the time series to forecast, and an MA (Moving Average) model, which uses lagged values of the residual errors to forecast. In other words, this model uses dependencies on both past data values and past error values to optimize the predictions.
ARIMA uses three parameters – ARIMA(p,d,q):
- Auto Regressive term p: Number of autoregressive lags.
- Order of differencing term d: Number of times the differencing pre-processing step is applied to make the time series stationary.
- Moving Average term q: Number of moving average lags.
We cannot model a time series with ARIMA if it is not stationary. However, we can apply a transformation such as differencing to make a time series stationary, as seen in the previous tutorial. The parameter d allows us to apply differencing within ARIMA.
It is also possible to apply transformations before applying ARIMA in order to make the time series stationary. However, when applying differencing in this way we are forecasting the transformed data, so we need to reverse the transformation in order to obtain the forecast of the original values.
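To make the last point concrete, here is a minimal sketch of differencing a series by hand and reversing the transformation. The numbers are toy values chosen for illustration, not taken from the dataset used later.

```python
import numpy as np
import pandas as pd

# Toy series with an upward trend (illustrative values only).
y = pd.Series([10.0, 12.0, 15.0, 19.0, 24.0, 30.0])

# First-order differencing: y_diff[t] = y[t] - y[t-1].
# The first value has no predecessor, so it is dropped.
y_diff = y.diff().dropna()

# Pretend these are forecasts made on the differenced scale.
diff_forecast = np.array([7.0, 8.0])

# Reverse the differencing: accumulate the forecast differences
# and add the last observed level of the original series.
level_forecast = y.iloc[-1] + np.cumsum(diff_forecast)

# Sanity check on the history: the cumulative sum of the differences,
# shifted by the first observation, rebuilds the original series.
rebuilt = y.iloc[0] + y_diff.cumsum()

print(level_forecast)
```

When differencing is handled through the parameter d, this reversal is done by the model itself and the forecasts come back on the original scale.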
ARIMAX model
It is also possible to extend the ARIMA model to use exogenous inputs and create an ARIMAX model. In this model the time series is modeled using other independent variables as well as the time series itself.
For example, when modeling the waiting time in an emergency room, the number of nurses available on a certain shift could be considered an external variable, since it may have an impact on the waiting time. If this is indeed the case, we can affect the waiting times by changing the number of nurses.
SARIMA model
If there is seasonality visible in a time series dataset, a SARIMA (Seasonal ARIMA) model should be used. When applying an ARIMA model, we are ignoring seasonality and using only part of the information in the data. As a consequence, we are not making the best predictions possible.
SARIMA models include extra parameters related to the seasonal part. Indeed, we can see a SARIMA model as two ARIMA models combined: one dealing with non-seasonal part and another dealing with the seasonal part.
Therefore, a SARIMA(p,d,q)(P,D,Q,S) model has all the parameters described above (non-seasonal parameters) plus P, D, Q and S, which are the seasonal parameters, i.e.,
Non-seasonal orders
- p: Autoregressive order.
- d: Differencing order.
- q: Moving average order.
Seasonal orders
- P: Seasonal autoregressive order.
- D: Seasonal differencing order.
- Q: Seasonal moving average order.
- S: Length of the seasonal cycle.
Now time for some hands-on examples to illustrate these models.
The Walmart Dataset
To put into practice what we've learnt so far about SARIMAX models, we will use data from the Store Item Demand Forecasting Challenge Kaggle competition.
This data consists of 5 years of store-item sales data split into a training dataset (train.csv) and a test dataset (test.csv). The objective of this competition was to forecast 3 months of sales for 50 different items at 10 different stores using the 5-year sales history.
Our goal here is to use part of this data to apply what we have learnt about time series and the forecast methods introduced during this tutorial.
Both train and test datasets have 10 stores and each store offers 50 unique items, and sales is our target.
The training dataset has data from January 1st, 2013 until December 31st, 2017. The goal is to predict sales of items in all stores from January 1st, 2018 until March 31st, 2018, i.e., 3 months.
As we can see, store 2 has the highest volume of sales. Which products are the most sold there?
The most sold item in store 2 is item 28 with 205677 units sold. Item 15 comes in second with 205569 units sold.
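Rankings like these can be obtained with a couple of pandas group-by operations. The sketch below uses a tiny made-up table in place of train.csv; the column names date, store, item and sales are assumed.

```python
import pandas as pd

# Tiny stand-in for train.csv (values are made up, column names assumed).
train = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-02"] * 4),
    "store": [1, 1, 1, 1, 2, 2, 2, 2],
    "item": [15, 15, 28, 28, 15, 15, 28, 28],
    "sales": [10, 12, 11, 13, 20, 22, 25, 27],
})

# Total sales per store, and the store with the highest volume.
sales_per_store = train.groupby("store")["sales"].sum()
top_store = sales_per_store.idxmax()

# Within that store, total sales per item.
in_top_store = train[train["store"] == top_store]
sales_per_item = in_top_store.groupby("item")["sales"].sum()
top_item = sales_per_item.idxmax()

# One time series per (store, item) pair, indexed by date.
series = in_top_store[in_top_store["item"] == top_item].set_index("date")["sales"]
print(top_store, top_item)
```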
From the box-plots above we can conclude:
- Higher sales occur during the weekend (5=Saturday, 6=Sunday).
- The highest sales volume is achieved in July and the lowest in January.
- This pattern of sales per month seems to be pretty much the same every year.
Box-Jenkins Method
To learn how to apply (S)ARIMA(X) models we will follow some steps based on the Box-Jenkins method. This popular framework provides a systematic approach that involves getting to know your time series data and applying the appropriate methods to choose parameters that will lead to a good model.
STEP 1: Identify
Following the schema above, in this step we use tools to identify characteristics of the time series so we can build an appropriate model.
Here we search for answers for questions such as:
- Is the time series stationary?
- If not stationary, which transformation should we apply to make it stationary?
- Is the time series seasonal?
- If seasonal what is the seasonal period?
- Which orders to use? (p for AR, q for MA)
Visualize time series sales of item 28 at store 2
As we saw previously in this data, we have 500 time series which are defined by the pair store-item. Here we are working with forecasting individual time series. To forecast sales for all stores and all items we need to apply a forecast model to each one of the time series.
For the purpose of demonstrating the use of these models, we will work with only one time series in this tutorial: sales of item 28 (the most sold item) at store 2 (the store with the highest number of sales).
From the decomposition above we can conclude:
- There is an upward trend in sales. Therefore, this time series is not stationary.
- From the seasonal component we can observe that the model is additive, since the seasonal component is similar (not getting multiplied) over the period of time.
- Also, we can observe on the seasonal component seasonality in sales with lower sales in January and higher sales in July.
Since our data is not stationary, we need to answer the question: what differencing will make it stationary? For this we will use our obtain_adf_kpss_results function to find out how many times we need to apply differencing in order to make this time series stationary. This will be our parameter d for the ARIMA model.
Apply Stationarity Tests
The results of the stationarity tests show that applying differencing only once is enough to make our time series stationary, i.e., d=1.
Plot ACF and PACF
Autocorrelation and partial autocorrelation plots are heavily used in time series analysis and forecasting. They can give clues about promising values of the ARIMA parameters. They can also show us if we need to apply differencing or if we have applied it too much.
AutoCorrelation Function – ACF is the plot of the autocorrelation of a time series by lag. This plot is sometimes called a correlogram. It includes direct and indirect dependence information. In simple terms, ACF describes how well the present value of the series is related with its past values. The bars of the ACF plot represent the ACF values at increasing lags. The blue shaded area represents the confidence interval, which is set to 95% by default. If the bars lie inside the blue shaded region, then they are not statistically significant.
Unlike the ACF, the Partial AutoCorrelation Function (PACF) only describes the direct relationship between an observation and its lag. Instead of correlating the present value with its lags, as the ACF does, it correlates the residuals that remain after removing the effects already explained by the earlier lags.
So how can these plots help us find p (AR term) and q (MA term)?
A good intuition on how these plots relate to the parameters of ARIMA is given here. You can read more on how to interpret these plots here.
As said earlier, ACF and PACF can also give tips about differencing.
It is important to make the time series stationary before making these plots. If the ACF values are high and trail off very slowly, this is a sign that the data is non-stationary, so it needs to be differenced.
If the autocorrelation at lag 1 is very negative, this is a sign that we have taken the difference too many times.
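Both signs can be reproduced on simulated data. The sketch below uses a random walk, which needs exactly one difference, and computes the lag-1 autocorrelation by hand with NumPy; the helper lag1_autocorr is a name introduced only for this example.

```python
import numpy as np


def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.sum(x[1:] * x[:-1]) / np.sum(x * x))


rng = np.random.default_rng(0)
random_walk = np.cumsum(rng.normal(size=5000))

# Not differenced: autocorrelation close to 1 (it trails off very slowly).
r_raw = lag1_autocorr(random_walk)

# Differenced once: the increments are white noise, so it is close to 0.
r_once = lag1_autocorr(np.diff(random_walk))

# Differenced twice: over-differenced, so it is strongly negative.
r_twice = lag1_autocorr(np.diff(random_walk, n=2))

print(r_raw, r_once, r_twice)
```

Differencing white noise produces a series whose theoretical lag-1 autocorrelation is -0.5, which is why a value in that neighbourhood hints at over-differencing.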
We know that our time series is not stationary and needs to be differenced. Let's plot the ACF and PACF for the original and differenced time series.
By observing the ACF and PACF plots after making the time series stationary, we can infer from the ACF plot that there is a seasonal behaviour of period 7, which is clear from the peaks at lags 7, 14, 21, etc. (every week). This shows us the need for a seasonal term in our ARIMA model. In other words, we need a SARIMA model.
STEP 2: Estimate Coefficients (p,q)
Although we had clear signs that we need a SARIMA model, we will start by applying an ARIMA model instead. By doing so we can obtain a better understanding of the differences between ARIMA and SARIMA when applying the Box-Jenkins method. In addition, it will make clear what the advantages of choosing the appropriate model are.
ACF and PACF plots can help us find appropriate values for the parameters p and q. However, the interpretation of these plots is not always clear. To gain more confidence in our choices we can apply an empirical method. This method consists of fitting the ARIMA model for different values of p and q, and choosing the best values based on metrics such as AIC and BIC.
AIC (Akaike information criterion) is a metric which tells us how good a model is. The lower the value, the better the model. The AIC also penalizes models which have lots of parameters. This means that if we set the order too high compared to the data, we will get a high AIC value. This stops us from overfitting to the training data.
BIC (Bayesian information criterion) is similar to AIC, so a lower value also means a better model. However, BIC penalizes additional model orders more than AIC. As a consequence, BIC will sometimes suggest a simpler model.
After fitting a model, we can access its summary statistics, and that is where we can find the values of AIC and BIC.
Usually there is agreement between AIC and BIC. If there is no agreement, you should choose a smaller AIC if you prefer a predictive model. Otherwise, choose smaller BIC for an explanatory model.
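To see how the two criteria can disagree, here is a small sketch based on their textbook definitions, AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of estimated parameters, n the number of observations and L the maximized likelihood. The two fits below are invented numbers, and a library may count k or n slightly differently.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood


# Two hypothetical fits on the same n = 1000 observations.
n = 1000
small = {"llf": -3000.0, "k": 3}  # simpler model, slightly worse fit
large = {"llf": -2990.0, "k": 8}  # more parameters, slightly better fit

aic_small, aic_large = aic(small["llf"], small["k"]), aic(large["llf"], large["k"])
bic_small, bic_large = bic(small["llf"], small["k"], n), bic(large["llf"], large["k"], n)

print(aic_small, aic_large)  # AIC prefers the larger model
print(bic_small, bic_large)  # BIC prefers the smaller model
```

Each extra parameter costs 2 in AIC but ln(n) in BIC, so for all but very short series BIC is the stricter of the two.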
See how to obtain the summary statistics in the following example.
Now let's scan different values of p and q and choose the values that point to a smaller AIC and/or BIC.
The results above show that both AIC and BIC agree that the best model in this case should be ARIMA(4,1,5).
STEP 3: Model Evaluation
Before using a model we want to know how accurate it is. Here we present some tools to evaluate the model before considering it the best one and putting it into production.
For this evaluation we focus on the residuals. The residuals are the difference between the model’s one-step-ahead predictions and the real values of the time series.
Mean Absolute Error (MAE)
We start by calculating the Mean Absolute Error (MAE) of the residuals. This will give us an idea of how far, on average, the predictions are from the true values.
MAE = 4.73, i.e., the mean absolute error is about 5 sales per day, while item 28 in store 2 sells on average 28 units per day. Can we do better than this?
For an ideal model the residuals should be uncorrelated Gaussian noise centered on zero. Therefore, continuing our model evaluation, we use tools that allow us to check whether this is true.
Diagnostic Summary Statistics
Another important tool to evaluate the model is the analysis of the residual test statistics in the results summary.
Now we evaluate Prob(Q) and Prob(JB), which are the p-values of, respectively, the following tests:
Ljung-Box test. Null hypothesis: There are no correlations in the residuals.
Jarque-Bera test. Null hypothesis: Residuals are normally distributed.
Prob(Q) = 0.88 > 0.05. We cannot reject the null hypothesis that the residuals are uncorrelated, so there is no evidence of correlation in the residuals.
Prob(JB) = 0.02 < 0.05. We reject the null hypothesis that the residuals are normally distributed. Therefore, the residuals are not normally distributed.
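To show what is behind these two p-values, the sketch below computes a Ljung-Box statistic by hand and uses jarque_bera from SciPy. The helper names, the choice of 10 lags and the simulated inputs are assumptions for illustration; the summary table may use a different number of lags.

```python
import numpy as np
from scipy import stats


def ljung_box_pvalue(resid, lags=10):
    """p-value of the Ljung-Box test on the first `lags` autocorrelations.
    No degrees-of-freedom correction for fitted parameters is applied."""
    x = np.asarray(resid, dtype=float)
    x = x - x.mean()
    n = len(x)
    ks = np.arange(1, lags + 1)
    r = np.array([np.sum(x[k:] * x[:-k]) for k in ks]) / np.sum(x * x)
    q = n * (n + 2) * np.sum(r**2 / (n - ks))
    return float(stats.chi2.sf(q, df=lags))


def jarque_bera_pvalue(resid):
    """p-value of the Jarque-Bera normality test."""
    return float(stats.jarque_bera(resid)[1])


rng = np.random.default_rng(0)

# Strongly autocorrelated "residuals": Ljung-Box should reject.
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.8 * ar1[t - 1] + rng.normal()
p_correlated = ljung_box_pvalue(ar1)

# Skewed "residuals": Jarque-Bera should reject normality.
p_skewed = jarque_bera_pvalue(rng.exponential(size=500))

# Well-behaved residuals: both p-values are typically large.
white = rng.normal(size=500)
p_white_lb, p_white_jb = ljung_box_pvalue(white), jarque_bera_pvalue(white)
print(p_correlated, p_skewed, p_white_lb, p_white_jb)
```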
Plot Diagnostics
In addition, there are 4 common plots that help us decide whether a model is a good fit for the data in question.
For an ideal model the residuals should be uncorrelated Gaussian noise centered on zero. By analyzing the plots above with this in mind we can evaluate whether we have a good model or not.
So, let's analyze those plots (clockwise from the top-left plot):
- Standardized residual: There should be no obvious patterns in the residuals. This is the case here, which points to a good model.
- Histogram plus KDE estimate: The histogram shows the measured distribution of the residuals while the orange line shows the KDE curve (smoothed version of the histogram). The green line shows a normal distribution. For a good model the orange line should be similar to the green line. Here the orange curve is not very similar to the green one.
- Correlogram or ACF plot: 95% of the correlations for lags greater than zero should not be significant (inside the blue area). This is also the case, i.e., good model.
- Normal Q-Q: Most of the data points should lie on the straight line, indicating a normal distribution of the residuals. This happens here.
Therefore, all in all, the model pointed to by our empirical search seems to be a good model.
Final tips:
- If the residuals are not normally distributed, try increasing d.
- If the residuals are correlated, try increasing p or q.
SARIMA model
ACF Plot to Determine Seasonal Period
Previously, we got a hint from the ACF plot that our time series has a seasonal period of 7, i.e., S=7.
Therefore, the ACF plot can help in finding out the time period, especially if it is not clear from the plots of the time series.
Important: Make time series stationary first.
Seasonal Differencing
The non-seasonal ACF and PACF plots (2) above show an MA model pattern with q=6.
The seasonal ACF and PACF plots (3) look like an MA(1) model, i.e., Q=1. We could select the model that combines both of these, i.e., SARIMA(0,1,6)(0,1,1)7.
The following code shows how to fit this model and obtain the MAE of the residuals:
MAE = 4.555, which is a bit lower than that of the ARIMA model. But is this a good model, i.e., are the residuals uncorrelated and normally distributed?
Both the summary and the diagnostic plots agree that the residuals are uncorrelated and normally distributed, which points to a good model. Even the histogram looks better than the previous one, with the orange curve closer to the green one.
Automated Model Selection
pmdarima allows us to automate the search for model orders. We use the information we got from the Box-Jenkins identification step to predefine some of the orders before we fit. Automated model selection can speed up the process of choosing model orders, but it needs to be done with care. Automation can make mistakes, since the input data can be imperfect and affect the test scores in unpredictable ways.
The only non-optional parameter in auto_arima is the data. However, using your knowledge to specify other parameters can help find the best model.
After running the code above, automated model selection pointed out SARIMA(6,1,1)(6,1,0)7 as best model based on AIC.
From now on let's call sarima_01_model the first SARIMA model, obtained by observing the ACF and PACF plots (i.e., SARIMA(0,1,6)(0,1,1)7), and sarima_02_model the second model, obtained using automated model selection (i.e., SARIMA(6,1,1)(6,1,0)7).
Comparing sarima_01_model and sarima_02_model: the KDE curve of sarima_01_model is closer to a normal distribution than the KDE curve of sarima_02_model. In addition, sarima_01_model has a smaller MAE: 4.555 against the 4.788 obtained by sarima_02_model.
This would point us to choose sarima_01_model as the most promising model.
Forecasting with SARIMAX
SARIMA vs ARIMA forecasts
We will continue using both ARIMA and SARIMA models even if we know that SARIMA, in this case, is the most adequate model. The goal here is to show why SARIMA is the most adequate.
Forecast in Sample
To get a feeling for how well the chosen models are doing, we will take the last 90 days of the training dataset as validation data.
Metrics Used to Compare Models
The models presented here and in the third part of this tutorial will be evaluated using MAE (Mean Absolute Error) and MAPE (Mean Absolute Percentage Error). These are popular metrics for evaluating regression models, and you will often come across MAE and MAPE when evaluating forecasting models.
When comparing forecast methods applied to a single time series, or to several time series with the same units, MAE is popular as it is easy to both understand and compute. Percentage error measures such as MAPE have the advantage of being unit-free, and so are frequently used to compare forecast performance between data sets.
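Both metrics are simple enough to write down directly. The sketch below implements them with NumPy on a few made-up values; MAPE is returned as a fraction (0.0625 means 6.25%), which is also how sklearn reports it.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error, in the units of the data."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction.
    Not usable when y_true contains zeros."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))


# Made-up observations and predictions.
observed = [100, 120, 80, 100]
predicted = [110, 114, 84, 95]

# Absolute errors are 10, 6, 4 and 5; relative errors 0.10, 0.05, 0.05, 0.05.
example_mae = mae(observed, predicted)
example_mape = mape(observed, predicted)
print(example_mae, example_mape)
```

Because MAPE divides by the observed value, it becomes unstable for series with values at or near zero, which is worth checking before relying on it for sales data.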
Before comparing our models using those metrics let’s revisit their summary statistics.
The three models seem to present no correlation in the residuals. The ARIMA model is the only one where the residuals are NOT normally distributed according to the Jarque-Bera test (Prob(JB) < 0.05). Looking at the values of AIC and BIC, SARIMA(0,1,6)(0,1,1)7 has the smallest values, which also points to a better model.
Let’s see what forecast in-sample tells us.
For our in-sample forecasting, we use the method get_prediction, taking the last 90 days of the training data as validation data. After that we use mean_absolute_error and mean_absolute_percentage_error from sklearn.metrics to obtain MAE and MAPE for all three models.
MAE and MAPE for these three models are:
When considering MAE and MAPE the SARIMA model chosen by automated selection has the smallest values, i.e., this model has the best accuracy.
Let’s check graphically the observed values against the predictions.
Observe how the result of the SARIMA(6,1,1)(6,1,0)7 model (green line) follows the red curve (observed values) better than the ARIMA model (blue line) and the other SARIMA model (SARIMA(0,1,6)(0,1,1)7, orange line).
Forecast Out of Sample
Let's predict 90 days ahead. For this part we will just use the ARIMA model (ARIMA(4,1,5)) and the SARIMA model chosen by automated model selection: SARIMA(6,1,1)(6,1,0)7.
Notice that now we use get_forecast in place of get_prediction.
The plot below shows again that the result obtained by the SARIMA model follows the observed time series better. Remember, the ARIMA model completely ignores the seasonal information, which explains to a great extent the difference between these two models.
Saving Model
Once the best model is found, it is useful to know how to save it. For this we use the joblib library.
Since the second SARIMA model was the one showing the best results, we choose to save it.
To load the saved model use .load() as follows:
Conclusions about SARIMAX models
In this section we introduced ARIMA models and their variants: Seasonal ARIMA (SARIMA) and ARIMAX, which uses external data (exogenous inputs) to improve the performance of the ARIMA model.
We followed the Box-Jenkins method to find the best model considering a part of our dataset (time series of sales of product 28 of Walmart’s store 2). As first step, we’ve identified important characteristics of our time series such as stationarity and seasonality.
Then, we used graphical and statistical methods such as the following to find the best-fitting model:
- Augmented Dickey-Fuller test,
- Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test,
- ACF and PACF plots analysis,
- Exploring model summary statistics,
- Analyzing plots obtained using the statsmodels method plot_diagnostics.
Once we had found a model considered good, we used it to forecast in sample, i.e., we applied the model to part of the training data as validation data. In this way, we were able to get a feeling for how good the model is. After that we forecast out of sample, i.e., 90 days into the future.
Both the application of the Box-Jenkins method and the use of the chosen model to forecast were applied to ARIMA and SARIMA models.
We knew since the beginning that the most appropriate model would be a SARIMA model since we were dealing with a seasonal time series. However, working with both models gave us the opportunity to see the nuances in the application of the Box-Jenkins for these two types of models. Moreover, we could see clearly that if seasonality is not considered we are not using all information and therefore not making the best predictions possible. This became clear also from the forecasting plots.
We didn't provide an explicit example of (S)ARIMAX, i.e., an (S)ARIMA model using external data. Therefore, I suggest trying to improve the best model by adding holidays as external data. Here you can find an example using an exogenous regressor. The holidays Python library helps you obtain holidays.
Thank you for reading!
Comments and questions are always welcome, feel free to get in touch via d.paes.barretto@tue.nl.
For all the code presented here, as well as in the previous time series tutorial, visit our GitHub repository.