We will be working with a dataset provided by one of the largest fast-food restaurant chains in the US. The observation window runs from early March 2015 to June 15, 2015, and covers transactional data from two stores in Berkeley, CA and two stores in New York, NY.
As the dataset contains plenty of information, we simplify the task by focusing on a single store (ID 46673) and a single ingredient to forecast: lettuce.
library(forecast)
## Warning: package 'forecast' was built under R version 4.0.5
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(tseries)
## Warning: package 'tseries' was built under R version 4.0.5
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
We will first check the stationarity of our time series and then follow the Box-Jenkins procedure to fit our ARIMA model, which involves three main stages: identification, estimation and verification.
We read the CSV file of daily lettuce demand in store 46673 and convert it to a time series. As the data is daily, we set the frequency to 7 to capture weekly seasonality: we expect similar consumption patterns on the same days of the week.
data <- read.csv("dataset_lettuce.csv")                    # daily lettuce demand for store 46673
data <- subset(data, select = -c(X))                       # drop the exported row-index column
data_ts <- ts(data[, 2], frequency = 7, start = c(3, 5))   # weekly frequency; series starts on day 5 of week 3
data_ts
## Time Series:
## Start = c(3, 5)
## End = c(18, 2)
## Frequency = 7
## [1] 152 100 54 199 166 143 162 116 136 68 176 181 120 141 205 118 62 152
## [19] 135 180 172 180 82 96 193 209 146 189 170 94 67 182 201 186 223 187
## [37] 96 70 156 220 214 160 180 114 76 194 176 141 194 153 102 128 193 185
## [55] 173 169 171 95 80 137 147 163 183 165 78 88 169 172 169 190 123 76
## [73] 50 147 126 159 172 162 92 79 99 125 171 182 185 98 50 151 188 148
## [91] 165 167 96 86 244 173 127 124 151 134 96 138 244
Plot the data
# plot.ts(data_ts)
autoplot(data_ts)
There is no clear evidence of a trend, but the time series does appear seasonal, since its expected value depends on time. In particular, it shows the weekly seasonality we anticipated. We now test whether the time series is stationary.
adf.test(data_ts) # H0: the time series is non-stationary --> we reject null hypothesis so it is stationary
## Warning in adf.test(data_ts): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: data_ts
## Dickey-Fuller = -8.4973, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
pp.test(data_ts) # H0: the time series is non-stationary --> we reject null hypothesis so it is stationary
## Warning in pp.test(data_ts): p-value smaller than printed p-value
##
## Phillips-Perron Unit Root Test
##
## data: data_ts
## Dickey-Fuller Z(alpha) = -55.097, Truncation lag parameter = 4, p-value
## = 0.01
## alternative hypothesis: stationary
kpss.test(data_ts) # H0: the time series is stationary --> we cannot reject null hypothesis so it is stationary
## Warning in kpss.test(data_ts): p-value greater than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: data_ts
## KPSS Level = 0.13004, Truncation lag parameter = 4, p-value = 0.1
# trend stationarity
ndiffs(data_ts) # there is no need to take a first order difference to take the trend out of our time series
## [1] 0
All three tests indicate that the time series is stationary with respect to trend. We now check seasonal stationarity:
# seasonal stationarity
nsdiffs(data_ts) # the result says we should take 1 seasonal difference in order to stationarize our time series. This is consistent with the weekly seasonal pattern we observed before
## [1] 1
data_ts.diff7 <- diff(data_ts, lag = 7)
nsdiffs(data_ts.diff7)
## [1] 0
We confirm there is a seasonal effect at lag 7. It will be handled by the auto.arima() function through the seasonal differencing argument D (D = 1).
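When we run auto.arima() below, the same seasonal difference can also be imposed explicitly through the D argument; a minimal preview of such a call (equivalent in intent to letting nsdiffs() decide inside the search):
# Force exactly one seasonal difference instead of letting auto.arima() estimate it
auto.arima(data_ts, D = 1, ic = 'bic')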
We now start to apply the Box-Jenkins procedure.
Having established that our time series is trend-stationary, we will determine the optimal orders of the AR and MA components. We first plot the ACF and PACF of the time series to get an idea of plausible values of p and q.
ggAcf(data_ts)
ggPacf(data_ts)
ggAcf(data_ts, lag.max = 100)
ggPacf(data_ts, lag.max = 100)
The autocorrelation function (ACF) decays slowly, while the partial autocorrelation function (PACF) cuts off more abruptly. Since the last significant lag in the PACF plot is the seventh, we should expect an AR order of less than 7.
We are going to use the auto.arima() function to evaluate candidate models based on the BIC. We choose the BIC because it is one of the most commonly used information criteria for model selection.
auto.arima(data_ts, trace = TRUE, ic = 'bic')
##
## ARIMA(2,0,2)(1,1,1)[7] with drift : Inf
## ARIMA(0,0,0)(0,1,0)[7] with drift : 964.5852
## ARIMA(1,0,0)(1,1,0)[7] with drift : 952.5099
## ARIMA(0,0,1)(0,1,1)[7] with drift : Inf
## ARIMA(0,0,0)(0,1,0)[7] : 960.033
## ARIMA(1,0,0)(0,1,0)[7] with drift : 969.0862
## ARIMA(1,0,0)(2,1,0)[7] with drift : 948.1129
## ARIMA(1,0,0)(2,1,1)[7] with drift : Inf
## ARIMA(1,0,0)(1,1,1)[7] with drift : Inf
## ARIMA(0,0,0)(2,1,0)[7] with drift : 946.2993
## ARIMA(0,0,0)(1,1,0)[7] with drift : 948.5985
## ARIMA(0,0,0)(2,1,1)[7] with drift : Inf
## ARIMA(0,0,0)(1,1,1)[7] with drift : Inf
## ARIMA(0,0,1)(2,1,0)[7] with drift : 947.2248
## ARIMA(1,0,1)(2,1,0)[7] with drift : 951.27
## ARIMA(0,0,0)(2,1,0)[7] : 941.9237
## ARIMA(0,0,0)(1,1,0)[7] : 944.1557
## ARIMA(0,0,0)(2,1,1)[7] : Inf
## ARIMA(0,0,0)(1,1,1)[7] : Inf
## ARIMA(1,0,0)(2,1,0)[7] : 943.7616
## ARIMA(0,0,1)(2,1,0)[7] : 942.8872
## ARIMA(1,0,1)(2,1,0)[7] : 946.9357
##
## Best model: ARIMA(0,0,0)(2,1,0)[7]
## Series: data_ts
## ARIMA(0,0,0)(2,1,0)[7]
##
## Coefficients:
## sar1 sar2
## -0.5887 -0.3022
## s.e. 0.1038 0.1119
##
## sigma^2 = 917.5: log likelihood = -464.12
## AIC=934.23 AICc=934.49 BIC=941.92
The best model based on the BIC is ARIMA(0,0,0)(2,1,0)[7] without drift, which is consistent with our inspection of the ACF and PACF plots and with the seasonal behavior of the time series: P = 2, D = 1 and Q = 0.
ARIMA(0,0,1)(2,1,0)[7] is the second-best model and ARIMA(1,0,0)(2,1,0)[7] the third-best; we will also evaluate them.
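As a quick sanity check of the trace above, the two runner-up models can be refit directly with Arima(); their printed BIC values should match the trace (a small sketch reusing the same seasonal specification):
# Refit the second- and third-best models; the printed BICs should match the trace above
Arima(data_ts, order = c(0, 0, 1), seasonal = list(order = c(2, 1, 0), period = 7))
Arima(data_ts, order = c(1, 0, 0), seasonal = list(order = c(2, 1, 0), period = 7))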
We can also verify that the second-best model under the BIC is the model selected by both the AIC and AICc criteria.
auto.arima(data_ts, trace = FALSE, ic = 'aic')
## Series: data_ts
## ARIMA(0,0,1)(2,1,0)[7]
##
## Coefficients:
## ma1 sar1 sar2
## 0.2526 -0.6856 -0.3802
## s.e. 0.1279 0.1118 0.1171
##
## sigma^2 = 881.2: log likelihood = -462.31
## AIC=932.63 AICc=933.07 BIC=942.89
auto.arima(data_ts, trace = FALSE, ic = 'aicc')
## Series: data_ts
## ARIMA(0,0,1)(2,1,0)[7]
##
## Coefficients:
## ma1 sar1 sar2
## 0.2526 -0.6856 -0.3802
## s.e. 0.1279 0.1118 0.1171
##
## sigma^2 = 881.2: log likelihood = -462.31
## AIC=932.63 AICc=933.07 BIC=942.89
We estimate our best model:
data_ts.m1 <- Arima(data_ts, order = c(0, 0, 0), seasonal = list(order = c(2, 1, 0), period = 7), include.drift = FALSE)
We check the residual plots, including the residual time series and its ACF, to make sure there are no warning signs. In particular, the residuals should have zero mean and constant variance, be distributed roughly symmetrically around zero, and show no statistically significant autocorrelation at any lag greater than zero.
autoplot(data_ts.m1$residuals)
ggAcf(data_ts.m1$residuals, lag.max = 40)
ggPacf(data_ts.m1$residuals, lag.max = 40)
checkresiduals(data_ts.m1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0)(2,1,0)[7]
## Q* = 19.909, df = 12, p-value = 0.06883
##
## Model df: 2. Total lags used: 14
Residuals appear to be random, with no evident intertemporal correlation. Regarding the ACF plot, the rule of thumb is that no more than two or three out of 40 lags should fall outside the significance thresholds, and we are within that limit; the same holds for the sample PACF of the residuals. The Ljung-Box p-value of 0.069 likewise does not reject the null hypothesis of independent residuals, so there are no warning signs.
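The Ljung-Box statistic reported by checkresiduals() can also be reproduced manually with Box.test(), adjusting the degrees of freedom for the two fitted seasonal AR coefficients; the call below (a small sketch) should return the same Q* = 19.909 on 12 degrees of freedom:
# Manual Ljung-Box test: 14 lags, 2 fitted parameters (sar1, sar2)
Box.test(residuals(data_ts.m1), lag = 14, fitdf = 2, type = "Ljung-Box")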
Now we forecast:
lettuce.f <- forecast(data_ts.m1, h = 14)
autoplot(lettuce.f)
We will check accuracy by refitting our three best models on a subset of the data, so we can evaluate their out-of-sample performance. For the comparison we hold out the last 14 data points as a test set.
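With frequency 7 and start c(3, 5), splitting at c(16, 2) leaves exactly the last 14 observations for testing; a small check of the window() calls used below:
# Sanity check of the train/test split used in the accuracy comparison
train <- window(data_ts, end = c(16, 2))    # first 89 observations
test <- window(data_ts, start = c(16, 3))   # last 14 observations
length(test)                                # should be 14
length(train) + length(test) == length(data_ts)  # should be TRUE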
data_ts.m2 <- Arima(window(data_ts, end = c(16, 2)), order = c(0, 0, 0), seasonal = list(order = c(2, 1, 0), period = 7), include.drift = FALSE)
accuracy(forecast(data_ts.m2, h = 14), window(data_ts, start = c(16, 3)))
## ME RMSE MAE MPE MAPE MASE
## Training set -0.3193726 25.57387 19.32826 -3.094707 14.67793 0.7266928
## Test set 14.9981522 46.57808 35.69009 7.556829 22.97894 1.3418558
## ACF1 Theil's U
## Training set 0.1794441 NA
## Test set 0.2888364 0.7500206
data_ts.m3 <- Arima(window(data_ts, end = c(16, 2)), order = c(0, 0, 1), seasonal = list(order = c(2, 1, 0), period = 7), include.drift = FALSE)
accuracy(forecast(data_ts.m3, h = 14), window(data_ts, start = c(16, 3)))
## ME RMSE MAE MPE MAPE MASE
## Training set -0.2881424 24.96695 18.73205 -2.976765 14.21303 0.7042769
## Test set 14.9970723 47.49726 37.18525 7.464603 23.92493 1.3980701
## ACF1 Theil's U
## Training set 0.0004257993 NA
## Test set 0.2945371561 0.7558137
data_ts.m4 <- Arima(window(data_ts, end = c(16, 2)), order = c(1, 0, 0), seasonal = list(order = c(2, 1, 0), period = 7), include.drift = FALSE)
accuracy(forecast(data_ts.m4, h = 14), window(data_ts, start = c(16, 3)))
## ME RMSE MAE MPE MAPE MASE
## Training set -0.2536805 24.99110 18.77685 -2.990367 14.23154 0.7059615
## Test set 14.7972219 47.61884 37.37253 7.325945 24.03950 1.4051111
## ACF1 Theil's U
## Training set 0.01284025 NA
## Test set 0.29724360 0.7573132
To compare the models we use the root mean squared error (RMSE) as the primary measure. The test-set RMSE values of the three models are close to each other, but ARIMA(0,0,0)(2,1,0)[7] has the lowest one (46.58), so it shows the best out-of-sample performance.
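For convenience, the three test-set RMSEs can be collected into a single sorted vector (a small sketch reusing the accuracy() calls above):
# Gather the test-set RMSEs of the three candidate models
test_window <- window(data_ts, start = c(16, 3))
sort(c("ARIMA(0,0,0)(2,1,0)[7]" = accuracy(forecast(data_ts.m2, h = 14), test_window)["Test set", "RMSE"],
       "ARIMA(0,0,1)(2,1,0)[7]" = accuracy(forecast(data_ts.m3, h = 14), test_window)["Test set", "RMSE"],
       "ARIMA(1,0,0)(2,1,0)[7]" = accuracy(forecast(data_ts.m4, h = 14), test_window)["Test set", "RMSE"]))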
To fit a Holt-Winters (exponential smoothing) model we use the ets() function.
For the model argument, re-inspecting the plot suggests an ANA specification: additive errors (first letter A), no trend (second letter N), and additive seasonality, since the seasonal variations are roughly constant over time (third letter A).
Again, we hold out the last 14 data points for the model comparison.
autoplot(data_ts)
data_ts.ets1 <- ets(window(data_ts, end = c(16, 2)), model = "ANA")
data_ts.ets2 <- ets(window(data_ts, end = c(16, 2)), model = "ZZZ")
data_ts.ets2
## ETS(A,N,A)
##
## Call:
## ets(y = window(data_ts, end = c(16, 2)), model = "ZZZ")
##
## Smoothing parameters:
## alpha = 0.0946
## gamma = 1e-04
##
## Initial states:
## l = 146.0165
## s = 32.6444 19.1329 25.5759 20.8428 -70.5411 -47.1212
## 19.4663
##
## sigma: 24.3637
##
## AIC AICc BIC
## 978.3710 981.1915 1003.2574
Checking the output of the "ZZZ" model, which automatically determines the optimal type of each component, we see that it also arrives at the ANA model we expected.
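The chosen specification can also be read off programmatically; assuming the ets object exposes its method string (as the printed header suggests), a one-line check:
# The selected specification as a string; should match the "ETS(A,N,A)" header above
data_ts.ets2$method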
We will now compare the performance of our best ARIMA model and the Holt-Winters model against the true values of the last 14 data points.
accuracy(forecast(data_ts.m2, h = 14), window(data_ts, start = c(16, 3)))
## ME RMSE MAE MPE MAPE MASE
## Training set -0.3193726 25.57387 19.32826 -3.094707 14.67793 0.7266928
## Test set 14.9981522 46.57808 35.69009 7.556829 22.97894 1.3418558
## ACF1 Theil's U
## Training set 0.1794441 NA
## Test set 0.2888364 0.7500206
accuracy(forecast(data_ts.ets2, h = 14), window(data_ts, start = c(16, 3)))
## ME RMSE MAE MPE MAPE MASE
## Training set -1.014639 23.09898 18.47706 -3.708049 14.26531 0.6946899
## Test set 12.025567 38.53718 28.55525 5.920676 18.50333 1.0736040
## ACF1 Theil's U
## Training set 0.08798640 NA
## Test set 0.07795779 0.608142
The Holt-Winters model has the lower test-set RMSE (38.54 vs. 46.58), so we conclude it has the best performance. The other error measures of the Holt-Winters model are also generally lower.
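The headline comparison can be summarized in one line by extracting the test-set RMSE of each final candidate (again reusing the accuracy() calls above):
# Test-set RMSE: best ARIMA model vs. Holt-Winters (ETS) model
test_window <- window(data_ts, start = c(16, 3))
c(ARIMA = accuracy(forecast(data_ts.m2, h = 14), test_window)["Test set", "RMSE"],
  ETS = accuracy(forecast(data_ts.ets2, h = 14), test_window)["Test set", "RMSE"])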
We will use our best model (Holt-Winters ANA model) to forecast the upcoming two weeks:
data_ts.ets.final <- ets(data_ts, model = "ANA")
lettuce.forecast <- forecast(data_ts.ets.final, h = 14)
autoplot(lettuce.forecast)
Save the point forecast column in a .csv file:
lettuce.forecast
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 18.28571 159.79583 125.66254 193.9291 107.59347 211.9982
## 18.42857 173.08458 138.95130 207.2179 120.88223 225.2869
## 18.57143 164.56924 130.43595 198.7025 112.36688 216.7716
## 18.71429 101.27350 67.14021 135.4068 49.07114 153.4758
## 18.85714 76.91392 42.78064 111.0472 24.71157 129.1163
## 19.00000 170.12068 135.98740 204.2540 117.91833 222.3230
## 19.14286 175.76910 141.63581 209.9024 123.56674 227.9715
## 19.28571 159.79583 125.66254 193.9291 107.59347 211.9982
## 19.42857 173.08458 138.95130 207.2179 120.88223 225.2869
## 19.57143 164.56924 130.43595 198.7025 112.36688 216.7716
## 19.71429 101.27350 67.14021 135.4068 49.07114 153.4759
## 19.85714 76.91392 42.78064 111.0472 24.71156 129.1163
## 20.00000 170.12068 135.98740 204.2540 117.91833 222.3230
## 20.14286 175.76910 141.63581 209.9024 123.56674 227.9715
write.csv(data.frame(point_forecast = lettuce.forecast$mean), "forecast_lettuce.csv")  # $mean holds the point forecasts
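As a quick check, the saved file can be read back to confirm it contains the 14 point forecasts:
# Read the exported file back and inspect the first rows
head(read.csv("forecast_lettuce.csv"))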