1 Introduction

The example and explanation here are extracted from this book:

Ref Book: “R in Action: Data Analysis and Graphics with R”, by Robert I. Kabacoff

ARIMA stands for Autoregressive Integrated Moving Average Model.

ARIMA models forecast future values as a linear function of recent observations and recent prediction errors (residuals).

In this exercise, we are covering ARIMA for non-seasonal time series.

Terms used in ARIMA models:

* autocorrelation (acf)
* partial autocorrelation (pacf)
* differencing
* stationarity

2 AR vs MA vs ARMA vs ARIMA

2.1 AR

An autoregressive model of order p is a timeseries model in which each value is predicted from a linear combination of the previous p values:

* AR(p):  Y(t) = a + b1*Y(t-1) + b2*Y(t-2) + ... + bp*Y(t-p) + e(t)

where a is the mean of the series, the bi are the weights, and e(t) is the error term.

2.2 MA

A moving average model of order q is a timeseries model in which each value is predicted from a linear combination of the previous q prediction errors.

* MA(q): Y(t) = a - c1*e(t-1) - c2*e(t-2) - ... - cq*e(t-q) + e(t)

where the e’s are the errors of prediction and the ci are the weights.

2.3 ARMA

Combining AR and MA gives us a model that predicts each value of the time series from the past p values and q residuals.

* ARMA(p,q): Y(t) = a + b1*Y(t-1) + b2*Y(t-2) + ... + bp*Y(t-p) - c1*e(t-1) - c2*e(t-2) - ... - cq*e(t-q) + e(t)
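
To see how these coefficients behave, we can simulate short AR, MA, and ARMA series with the arima.sim() function from the base stats package. This is only an illustrative sketch; the coefficients (0.7 and 0.4) and the series length (200) are arbitrary choices, not values from the book.

# Simulate illustrative AR(1), MA(1) and ARMA(1,1) series (arbitrary coefficients)
set.seed(123)
sim_ar   = arima.sim(model = list(ar = 0.7), n = 200)
sim_ma   = arima.sim(model = list(ma = 0.4), n = 200)
sim_arma = arima.sim(model = list(ar = 0.7, ma = 0.4), n = 200)

# Plot the three simulated series for comparison
par(mfrow = c(3, 1))
plot(sim_ar,   main = "Simulated AR(1)")
plot(sim_ma,   main = "Simulated MA(1)")
plot(sim_arma, main = "Simulated ARMA(1,1)")
par(mfrow = c(1, 1))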

2.4 ARIMA

An ARIMA(p,d,q) model is an ARMA(p,q) model in which the time series has been differenced d times (the I stands for Integrated).

Note that ARIMA models are designed to fit stationary timeseries (or timeseries that can be made stationary).

In a stationary timeseries:

  • the mean of Y(t) is constant
  • the variance of Y(t) is constant
  • the autocorrelations for any lag k don’t change with time

3 Stationarity

  1. If the variance of Y(t) is not constant, transform the values using a log transformation or a Box-Cox transformation (see the sketch after this list).

  2. If the mean of Y(t) is not constant, perform differencing. In differencing, each value Y(t) is replaced with {Y(t) - Y(t-1)}. Differencing a time series once removes a linear trend; differencing it a second time removes a quadratic trend; differencing it a third time removes a cubic trend. We usually don't need to difference more than twice.

    • In R, the function ndiffs() of the forecast package can be used to help determine the best value of d. 

    • In R, we can difference a timeseries with diff(ts, differences = d).

  3. Stationarity is often evaluated with a visual inspection of a timeseries plot.

  4. We can also use a statistical procedure, the Augmented Dickey-Fuller (ADF) test, to evaluate the assumption of stationarity.

    • In R, the function adf.test() of the tseries package performs the test. A significant result indicates stationarity.
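
A minimal sketch for item 1 (variance stabilization) is below; it is not needed for the Nile example that follows, and the lambda value is estimated from the data rather than taken from the book. Items 2 to 4 are demonstrated step by step in the Nile walkthrough later in this document.

# Sketch only: stabilize a non-constant variance (not required for the Nile data)
# Option 1: simple log transformation
Nile_log = log(Nile)

# Option 2: Box-Cox transformation with an estimated lambda (forecast package)
lambda  = forecast::BoxCox.lambda(Nile)
Nile_bc = forecast::BoxCox(Nile, lambda)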

4 Steps for ARIMA Modeling

The following are the steps in ARIMA modeling:

  1. Ensure the timeseries is stationary (identify the d)
  
  2. Identify appropriate models (identify the p & q)
  
  3. Fit the model
  
  4. Evaluate the models
  
  5. Make Forecasts
  

We use the Nile dataset (annual flow of the river Nile at Aswan) for illustration below.


4.1 Step 1: Ensure the timeseries is stationary

First, plot the data to assess the stationarity.

# Plot the data
plot(Nile, main = "Time Series plot for Nile Dataset")

The variance appears to be stable, so there is no need for a transformation.

However, there may be a trend. Confirm this observation with the ndiffs() function from the forecast package.

# Check the differencing requirement
forecast::ndiffs(Nile)
## [1] 1

The output indicates that one level of differencing is needed (d = 1), suggesting a linear trend.

Perform level-1 differencing on the time series.

# Perform differencing
dNile = diff(Nile)

Plot the timeseries to check the outcome.

# Plot the data
plot(dNile, main = "Time Series plot for Nile Dataset (Level 1 Difference)")

From the plot, the timeseries looks more stationary.

Apply the ADF test to the differenced series to confirm it is stationary.

# Apply ADF test (adf.test is from the tseries package)
adf.test(dNile)
## Warning in adf.test(dNile): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  dNile
## Dickey-Fuller = -6.5924, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

The p-value is less than 0.05, confirming that the differenced timeseries is stationary.


4.2 Step 2: Identify appropriate models

Possible models are selected based on the ACF and PACF plots.

The goal is to identify the p and q.

Simple guidelines:

  * ARIMA(p,d,0)    ACF trails off to zero      PACF cuts off after lag p
  
  * ARIMA(0,d,q)    ACF cuts off after lag q    PACF trails off to zero
  
  * ARIMA(p,d,q)    ACF trails off to zero      PACF trails off to zero
  

First, check the ACF plot using the Acf() function from the forecast package.

# ACF plot 
forecast::Acf(dNile)

Next, check the PACF plot using the Pacf() function from the forecast package.

# PACF plot 
forecast::Pacf(dNile)

The ACF has a single large spike at lag 1 (it cuts off after lag 1), while the PACF trails off to zero. Following the guidelines above, this suggests an ARIMA(0,1,1) model.
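
As a convenience, the forecast package can also show the series together with its ACF and PACF in a single display via tsdisplay(). This is optional and not part of the book's example; it simply reproduces the two plots above in one call.

# Optional: series, ACF and PACF in one display (not part of the book's example)
forecast::tsdisplay(dNile)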


4.3 Step 3: Fit the model

Fit the ARIMA(0,1,1) model using the arima() function (from the base stats package), passing order = c(p, d, q).

# Fit ARIMA model  
fit = arima(Nile, order = c(0,1,1))
fit
## 
## Call:
## arima(x = Nile, order = c(0, 1, 1))
## 
## Coefficients:
##           ma1
##       -0.7329
## s.e.   0.1143
## 
## sigma^2 estimated as 20600:  log likelihood = -632.55,  aic = 1269.09

If we have more than one candidate model, we can use the AIC statistic to compare them; the smaller the AIC value, the better the model.
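
For example, a competing order such as ARIMA(1,1,1) could be fitted and compared on AIC. This is only a sketch; the candidate order here is an arbitrary illustration, not a recommendation from the book.

# Sketch: compare an alternative candidate model on AIC (arbitrary candidate order)
fit_alt = arima(Nile, order = c(1, 1, 1))
AIC(fit, fit_alt)   # the model with the smaller AIC is preferred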

Check the accuracy measures using the accuracy() function from the forecast package.

# Check Accuracy of the fitted model
accuracy(fit)
##                    ME     RMSE      MAE       MPE     MAPE     MASE      ACF1
## Training set -11.9358 142.8071 112.1752 -3.574702 12.93594 0.841824 0.1153593

The output shows various error measures, such as the mean error (ME), root mean squared error (RMSE), and mean absolute percentage error (MAPE).


4.4 Step 4: Evaluate the model

If the model is a good fit, the residuals should be normally distributed with mean zero, and the autocorrelations should be zero for every possible lag.

In other words, the residuals should be normally and independently distributed (i.e., there should be no relationship between them).

# Q-Q plot to assess normality of the residuals
qqnorm(fit$residuals)
qqline(fit$residuals)

The output of the qqnorm() and qqline() functions shows that the residuals fall close to the line, so they are approximately normally distributed.
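
As an additional numerical check of normality (not used in the book), a Shapiro-Wilk test could be applied to the residuals; a non-significant p-value is consistent with normality.

# Optional numerical normality check of the residuals (not part of the book's example)
shapiro.test(fit$residuals)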

Check that the autocorrelations of the residuals are close to zero using the Ljung-Box test.

# Ljung-Box test on the residuals
Box.test(fit$residuals, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  fit$residuals
## X-squared = 1.3711, df = 1, p-value = 0.2416

The result of the Ljung-Box test is not significant, as the p-value is higher than 0.05. Thus the autocorrelations of the residuals are not distinguishable from zero.

The ARIMA model appears to fit the data well.
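
Recent versions of the forecast package also provide checkresiduals(), which combines a residual time plot, the residual ACF, a histogram, and a Ljung-Box test in one call. This is a convenience sketch and not part of the original example.

# Optional one-call residual diagnostics (requires a recent forecast version)
forecast::checkresiduals(fit)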


4.5 Step 5: Make Forecasts

Use the forecast() function from the forecast package to make a three-year forecast.

# Predict and Forecast the model
forecast(fit,3)
##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 1971       798.3673 614.4307 982.3040 517.0605 1079.674
## 1972       798.3673 607.9845 988.7502 507.2019 1089.533
## 1973       798.3673 601.7495 994.9851 497.6663 1099.068

Plot the forecast.

# Plot the forecast
plot(forecast(fit,3), xlab = "Year", ylab = "Annual Flow", main = "Forecast the Annual flow with timeseries model ARIMA(0,1,1)")

The point estimates are represented by the dots, and the shaded bands represent the 80% and 95% confidence bands.
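
If the forecast values and interval bounds are needed programmatically, the forecast object can be coerced to a data frame. This is a minor convenience sketch, not shown in the book.

# Extract the point forecasts and interval bounds as a data frame
forecast_table = as.data.frame(forecast(fit, 3))
forecast_table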


5 Automate ARIMA Model

We can also use the auto.arima() function from the forecast package to get a recommendation for the best ARIMA model.

# Automated ARIMA model
fit_auto = auto.arima(Nile)
fit_auto
## Series: Nile 
## ARIMA(1,1,1) 
## 
## Coefficients:
##          ar1      ma1
##       0.2544  -0.8741
## s.e.  0.1194   0.0605
## 
## sigma^2 estimated as 20177:  log likelihood=-630.63
## AIC=1267.25   AICc=1267.51   BIC=1275.04

The automated search recommends ARIMA(1,1,1) instead of the ARIMA(0,1,1) model that we fitted manually.
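
By default, auto.arima() uses a stepwise search with approximations to keep the run fast. A slower but more exhaustive search can be requested through its stepwise and approximation arguments; the sketch below is optional and may return the same model.

# Optional: slower, more exhaustive automated search
fit_auto_full = auto.arima(Nile, stepwise = FALSE, approximation = FALSE)
fit_auto_full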

5.0.1 Comparison

Check the accuracy of ARIMA(0,1,1)

# Check Accuracy of the fitted model
accuracy(fit)
##                    ME     RMSE      MAE       MPE     MAPE     MASE      ACF1
## Training set -11.9358 142.8071 112.1752 -3.574702 12.93594 0.841824 0.1153593

Check the accuracy of ARIMA(1,1,1)

# Check Accuracy of the fitted model
accuracy(fit_auto)
##                     ME     RMSE      MAE       MPE     MAPE     MASE
## Training set -16.06603 139.8986 109.9998 -4.005967 12.78745 0.825499
##                     ACF1
## Training set -0.03228482

Comparing ARIMA(0,1,1) vs ARIMA(1,1,1):

  • ARIMA(1,1,1) has a lower AIC

  • Most of the residual measures are smaller for ARIMA(1,1,1)

Thus, ARIMA(1,1,1) is the preferred model between the two.
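
The AIC values can also be pulled directly from the two fitted objects for a side-by-side check; a small sketch:

# Compare the AIC values of the two fitted models directly
c("ARIMA(0,1,1)" = fit$aic, "ARIMA(1,1,1)" = fit_auto$aic)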

