The example and explanation here are extracted from this book:
Ref Book: “R in Action: Data Analysis and Graphics with R”, by Robert I. Kabacoff
ARIMA stands for Autoregressive Integrated Moving Average Model.
ARIMA models forecast future values as a linear function of recent observations and recent prediction errors (residuals).
In this exercise, we are covering ARIMA for non-seasonal time series.
Terms used in ARIMA models:
* autocorrelation (ACF)
* partial autocorrelation (PACF)
* differencing
* stationarity
An autoregressive model of order p is a time series model in which each value is predicted from a linear combination of the previous p values:
* AR(p): Y(t) = a + b1*Y(t-1) + b2*Y(t-2) + ... + bp*Y(t-p) + e(t)
where a is the mean of the series, the bi are the weights, and e(t) is the error term
A moving average model of order q is a time series model in which each value is predicted from a linear combination of the previous q prediction errors:
* MA(q): Y(t) = a - c1*e(t-1) - c2*e(t-2) - ... - cq*e(t-q) + e(t)
where the e’s are the errors of prediction and the ci are the weights
Combining AR and MA gives a model that predicts each value of the time series from the previous p values and the previous q residuals.
* ARMA(p,q): Y(t) = a + b1*Y(t-1) + b2*Y(t-2) + ... + bp*Y(t-p) - c1*e(t-1) - c2*e(t-2) - ... - cq*e(t-q) + e(t)
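The AR, MA, and ARMA definitions above can be illustrated with a short sketch: we simulate a hypothetical ARMA(1,1) process with arima.sim() (the coefficients 0.5 and -0.4 are arbitrary illustration values, and note that R parameterizes the MA weights with a plus sign, unlike the minus-sign convention in the formulas above) and then recover the weights by fitting the model back.

```r
# Simulate a hypothetical ARMA(1,1) series: one AR weight, one MA weight
set.seed(1)
y <- arima.sim(model = list(ar = 0.5, ma = -0.4), n = 300)

# Fit an ARMA(1,1) model back to the simulated data;
# the estimated ar1 and ma1 coefficients should land near 0.5 and -0.4
fit_arma <- arima(y, order = c(1, 0, 1))
coef(fit_arma)
```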
An ARIMA(p,d,q) model is an ARMA(p,q) model in which the time series has been differenced d times (the I stands for integrated).
Note that ARIMA models are designed to fit stationary timeseries (or timeseries that can be made stationary).
In a stationary time series, the mean and variance of Y(t) are constant over time:
* If the variance of Y(t) is not constant, transform the values using a log transformation or a Box-Cox transformation.
* If the mean of Y(t) is not constant, perform differencing. In differencing, each value Y(t) is replaced with {Y(t) - Y(t-1)}. Differencing a time series once removes a linear trend; differencing it a second time removes a quadratic trend; differencing it a third time removes a cubic trend. Usually we don't need to difference more than twice.
In R, the ndiffs() function from the forecast package can be used to help determine the best value of d.
We can difference a time series d times with diff(ts, differences = d).
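As a quick sketch of how differencing removes a linear trend (the series below is constructed purely for illustration, and ndiffs() assumes the forecast package is installed):

```r
# A toy series: a linear trend (slope 0.5) plus stationary noise
set.seed(7)
y <- ts(0.5 * (1:100) + rnorm(100))

# ndiffs() typically reports that one difference is needed here
forecast::ndiffs(y)

# One difference removes the trend; note the argument name:
# in diff(), the second positional argument is lag, not differences
dy <- diff(y, differences = 1)

# The differenced series fluctuates around the slope (about 0.5)
mean(dy)
```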
Stationarity is often evaluated with a visual inspection of a timeseries plot.
We can also use a statistical procedure, the Augmented Dickey-Fuller (ADF) test, to evaluate the stationarity assumption.
The steps in ARIMA modeling are:
1. Ensure the timeseries is stationary (identify the d)
2. Identify appropriate models (identify the p & q)
3. Fit the model
4. Evaluate the models
5. Make Forecasts
We use the Nile dataset (the annual flow of the river Nile, included in base R) for illustration below.
First, plot the data to assess the stationarity.
# Plot the data
plot(Nile, main = "Time Series plot for Nile Dataset")
The variance appears to be stable so there is no need for transformation.
However, there may be a trend; reconfirm this observation with the ndiffs() function from the forecast package.
# Check the differencing requirement
forecast::ndiffs(Nile)
## [1] 1
The output shows that one level of differencing is required, suggesting a linear trend.
Perform first-order differencing on the time series.
# Perform differencing
dNile = diff(Nile)
Plot the timeseries to check the outcome.
# Plot the data
plot(dNile, main = "Time Series plot for Nile Dataset (Level 1 Difference)")
From the plot, the differenced time series looks stationary.
Apply the ADF test to the differenced series to confirm it is stationary.
# Apply ADF test (adf.test() is from the tseries package)
library(tseries)
adf.test(dNile)
## Warning in adf.test(dNile): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: dNile
## Dickey-Fuller = -6.5924, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
The p-value is less than 0.05, confirming that the differenced time series is stationary.
Possible models are selected based on the ACF and PACF plots.
The goal is to identify the p and q.
Simple guidelines:
* ARIMA(p,d,0): ACF trails off to zero; PACF cuts off after lag p
* ARIMA(0,d,q): ACF cuts off after lag q; PACF trails off to zero
* ARIMA(p,d,q): ACF trails off to zero; PACF trails off to zero
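These signatures can be checked on simulated data; for example, an MA(1) process should show an ACF that cuts off after lag 1 while the PACF trails off (the coefficient 0.7 is an arbitrary illustration value):

```r
# Simulate a hypothetical MA(1) process
set.seed(123)
y <- arima.sim(model = list(ma = 0.7), n = 500)

# Expected pattern for MA(1):
#   ACF  -> one significant spike at lag 1, then cuts off
#   PACF -> decays gradually toward zero
forecast::Acf(y)
forecast::Pacf(y)
```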
First, check the ACF plot using the Acf() function from the forecast package.
# ACF plot
forecast::Acf(dNile)
Next, check the PACF plot using the Pacf() function from the forecast package.
# PACF plot
forecast::Pacf(dNile)
There is one large autocorrelation at lag 1, and the partial autocorrelations trail off to zero. This pattern suggests that the model may be ARIMA(0,1,1).
Fit the ARIMA(0,1,1) model using the arima() function (from the base stats package), specifying order = c(0, 1, 1).
# Fit ARIMA model
fit = arima(Nile, order = c(0,1,1))
fit
##
## Call:
## arima(x = Nile, order = c(0, 1, 1))
##
## Coefficients:
## ma1
## -0.7329
## s.e. 0.1143
##
## sigma^2 estimated as 20600: log likelihood = -632.55, aic = 1269.09
If we have more than one candidate model, we can use the AIC statistic to compare them. The smaller the AIC value, the better the model.
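For instance, AIC values for two candidate fits can be tabulated directly with the AIC() function (the second model here, ARIMA(1,1,1), is chosen purely for illustration):

```r
# Fit two candidate models to the Nile series
fit1 <- arima(Nile, order = c(0, 1, 1))
fit2 <- arima(Nile, order = c(1, 1, 1))

# AIC() lists both models; the one with the smaller AIC is preferred
AIC(fit1, fit2)
```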
Check the accuracy measures using the accuracy() function from the forecast package.
# Check Accuracy of the fitted model
accuracy(fit)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -11.9358 142.8071 112.1752 -3.574702 12.93594 0.841824 0.1153593
The output shows the various error measures.
If the model is a good fit, the residuals should be normally distributed with mean zero, and the autocorrelations should be zero for every possible lag.
That is, the residuals should be normally and independently distributed (i.e. no relationship among the residuals).
# Evaluate the model
qqnorm(fit$residuals)
qqline(fit$residuals)
The output of the qqnorm() and qqline() functions shows that the residuals fall along the line, meaning they are approximately normally distributed.
Check that the autocorrelations of the residuals are close to zero using the Box-Ljung test.
# Evaluate the model
Box.test(fit$residuals, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: fit$residuals
## X-squared = 1.3711, df = 1, p-value = 0.2416
The result of the Box-Ljung test is not significant, as the p-value is higher than 0.05. Thus the autocorrelations of the residuals are close to zero.
The ARIMA model appears to fit the data well.
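The diagnostics above can also be run in one step with the checkresiduals() convenience function from the forecast package, which combines a residual time plot, an ACF plot, a histogram, and the Ljung-Box test (the model is refit here so the sketch is self-contained):

```r
# Refit the ARIMA(0,1,1) model on the Nile series
fit <- arima(Nile, order = c(0, 1, 1))

# One call produces the residual plots and the Ljung-Box test result
forecast::checkresiduals(fit)
```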
Use the forecast() function from the forecast package to make a 3-year forecast.
# Predict and Forecast the model
forecast(fit,3)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1971 798.3673 614.4307 982.3040 517.0605 1079.674
## 1972 798.3673 607.9845 988.7502 507.2019 1089.533
## 1973 798.3673 601.7495 994.9851 497.6663 1099.068
Plot the forecast.
# Plot the forecast
plot(forecast(fit,3), xlab = "Year", ylab = "Annual Flow", main = "Forecast the Annual flow with timeseries model ARIMA(0,1,1)")
The point estimates are represented by the dots; the shaded bands represent the 80% and 95% prediction intervals.
We can also use the auto.arima() function from the forecast package to get a recommendation for the best ARIMA model.
# Automated ARIMA model
fit_auto = auto.arima(Nile)
fit_auto
## Series: Nile
## ARIMA(1,1,1)
##
## Coefficients:
## ar1 ma1
## 0.2544 -0.8741
## s.e. 0.1194 0.0605
##
## sigma^2 estimated as 20177: log likelihood=-630.63
## AIC=1267.25 AICc=1267.51 BIC=1275.04
The automated ARIMA recommended ARIMA(1,1,1) instead of the ARIMA(0,1,1) that we tested.
Check the accuracy of ARIMA(0,1,1)
# Check Accuracy of the fitted model
accuracy(fit)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -11.9358 142.8071 112.1752 -3.574702 12.93594 0.841824 0.1153593
Check the accuracy of ARIMA(1,1,1)
# Check Accuracy of the fitted model
accuracy(fit_auto)
## ME RMSE MAE MPE MAPE MASE
## Training set -16.06603 139.8986 109.9998 -4.005967 12.78745 0.825499
## ACF1
## Training set -0.03228482
Comparing ARIMA(0,1,1) vs. ARIMA(1,1,1):
* ARIMA(1,1,1) has a lower AIC (1267.25 vs. 1269.09)
* Most of the residual error measures are smaller for ARIMA(1,1,1)
Thus, ARIMA(1,1,1) is the preferred model of the two.