Introduction

We will be working with a dataset provided by one of the largest fast-food restaurant chains in the US. The data observation window is from early March, 2015 to 06/15/2015 and includes transactional data from 2 stores in Berkeley, CA and 2 stores in New York, NY.

As this dataset has plenty of information, we will simplify the task and choose only one store (ID 46673) and one ingredient to forecast: lettuce.

Libraries

library(forecast)
## Warning: package 'forecast' was built under R version 4.0.5
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(tseries)
## Warning: package 'tseries' was built under R version 4.0.5
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5

ARIMA model

We will first check stationarity of our time series and follow the Box-Jenkings Procedure to fit our ARIMA model, which involves three main stages: identification, estimation and verification.

We read our csv file of daily demand of lettuce in store 46673 and convert to time series. As it is a daily dataset we input a weekly frequency equal to 7. We will expect similar consuming patterns the same days of the week.

data <- read.csv("dataset_lettuce.csv")

data <- subset(data, select = -c(X))

data_ts <- ts(data[, 2], frequency = 7, start = c(3, 5))
data_ts
## Time Series:
## Start = c(3, 5) 
## End = c(18, 2) 
## Frequency = 7 
##   [1] 152 100  54 199 166 143 162 116 136  68 176 181 120 141 205 118  62 152
##  [19] 135 180 172 180  82  96 193 209 146 189 170  94  67 182 201 186 223 187
##  [37]  96  70 156 220 214 160 180 114  76 194 176 141 194 153 102 128 193 185
##  [55] 173 169 171  95  80 137 147 163 183 165  78  88 169 172 169 190 123  76
##  [73]  50 147 126 159 172 162  92  79  99 125 171 182 185  98  50 151 188 148
##  [91] 165 167  96  86 244 173 127 124 151 134  96 138 244

Plot the data

# plot.ts(data_ts)
autoplot(data_ts)

There is no clear evidence of trend, but the time series seems to be seasonal because the expected value of the time series depends on time. In particular, it seems to be a weekly seasonality as expected. We will test if the time series is stationary

adf.test(data_ts) # H0: the time series is non-stationary --> we reject null hypothesis so it is stationary
## Warning in adf.test(data_ts): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  data_ts
## Dickey-Fuller = -8.4973, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
pp.test(data_ts)  #  H0: the time series is non-stationary --> we reject null hypothesis so it is stationary
## Warning in pp.test(data_ts): p-value smaller than printed p-value
## 
##  Phillips-Perron Unit Root Test
## 
## data:  data_ts
## Dickey-Fuller Z(alpha) = -55.097, Truncation lag parameter = 4, p-value
## = 0.01
## alternative hypothesis: stationary
kpss.test(data_ts) #  H0: the time series is stationary --> we cannot reject null hypothesis so it is stationary
## Warning in kpss.test(data_ts): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  data_ts
## KPSS Level = 0.13004, Truncation lag parameter = 4, p-value = 0.1
# trend stationarity
ndiffs(data_ts) # there is no need to take a first order difference to take the trend out of our time series
## [1] 0

The three different tests points out that the time series is stationary regarding trend stationarity. We now check seasonal stationarity:

# seasonal stationarity
nsdiffs(data_ts) # the result says we should take 1 seasonal difference in order to stationarize our time series. This is consistent whith the weekly seasonal pattern we observed before
## [1] 1
data_ts.diff7 <- diff(data_ts, lag = 7)

nsdiffs(data_ts.diff7) 
## [1] 0

We check there is a seasonal effect with a lag = 7. But it will be handled by the auto.arima function with the D argument (D = 1).

We know start to apply the Box-Jenkings Procedure.

Identification

Once we proved our time series is trend stationary, we will determine the optimal orders of MA and AR components. We first plot the ACF and PACF of the time series to have an idea of the possible values of p and q.

ggAcf(data_ts)

ggPacf(data_ts)

ggAcf(data_ts, lag.max = 100)

ggPacf(data_ts, lag.max = 100)

Auto-correlation function (ACF) shows a slow decrease and partial auto-correlation function (PACF) lags shows a more abrupt decrease. By the PACT plot, as the last significant lag is the seventh one, we should be expecting a < 7 order of the AR component.

We are going to use the auto.arima() function to evaluate the potential models based on BIC criteria. We use BIC criteria because it is the most common used information criteria in order to choose the models.

auto.arima(data_ts, trace = TRUE, ic = 'bic')
## 
##  ARIMA(2,0,2)(1,1,1)[7] with drift         : Inf
##  ARIMA(0,0,0)(0,1,0)[7] with drift         : 964.5852
##  ARIMA(1,0,0)(1,1,0)[7] with drift         : 952.5099
##  ARIMA(0,0,1)(0,1,1)[7] with drift         : Inf
##  ARIMA(0,0,0)(0,1,0)[7]                    : 960.033
##  ARIMA(1,0,0)(0,1,0)[7] with drift         : 969.0862
##  ARIMA(1,0,0)(2,1,0)[7] with drift         : 948.1129
##  ARIMA(1,0,0)(2,1,1)[7] with drift         : Inf
##  ARIMA(1,0,0)(1,1,1)[7] with drift         : Inf
##  ARIMA(0,0,0)(2,1,0)[7] with drift         : 946.2993
##  ARIMA(0,0,0)(1,1,0)[7] with drift         : 948.5985
##  ARIMA(0,0,0)(2,1,1)[7] with drift         : Inf
##  ARIMA(0,0,0)(1,1,1)[7] with drift         : Inf
##  ARIMA(0,0,1)(2,1,0)[7] with drift         : 947.2248
##  ARIMA(1,0,1)(2,1,0)[7] with drift         : 951.27
##  ARIMA(0,0,0)(2,1,0)[7]                    : 941.9237
##  ARIMA(0,0,0)(1,1,0)[7]                    : 944.1557
##  ARIMA(0,0,0)(2,1,1)[7]                    : Inf
##  ARIMA(0,0,0)(1,1,1)[7]                    : Inf
##  ARIMA(1,0,0)(2,1,0)[7]                    : 943.7616
##  ARIMA(0,0,1)(2,1,0)[7]                    : 942.8872
##  ARIMA(1,0,1)(2,1,0)[7]                    : 946.9357
## 
##  Best model: ARIMA(0,0,0)(2,1,0)[7]
## Series: data_ts 
## ARIMA(0,0,0)(2,1,0)[7] 
## 
## Coefficients:
##          sar1     sar2
##       -0.5887  -0.3022
## s.e.   0.1038   0.1119
## 
## sigma^2 = 917.5:  log likelihood = -464.12
## AIC=934.23   AICc=934.49   BIC=941.92

Best model based on BIC criteria is (0,0,0)(2,1,0)[7] without drift, which is consistent with the inspection we made at the ACF and PACF plots, and with the seasonal behavior of the time series. P = 2, D = 1 and Q = 0.

Model (0,0,1)(2,1,0)[7] is a second best and (1,0,0)(2,1,0)[7] third best. We will be also checking them.

We can also check that the second best model with the BIC criteria, is the best model elected by the AIC and AICc criteria.

auto.arima(data_ts, trace = FALSE, ic = 'aic')
## Series: data_ts 
## ARIMA(0,0,1)(2,1,0)[7] 
## 
## Coefficients:
##          ma1     sar1     sar2
##       0.2526  -0.6856  -0.3802
## s.e.  0.1279   0.1118   0.1171
## 
## sigma^2 = 881.2:  log likelihood = -462.31
## AIC=932.63   AICc=933.07   BIC=942.89
auto.arima(data_ts, trace = FALSE, ic = 'aicc')
## Series: data_ts 
## ARIMA(0,0,1)(2,1,0)[7] 
## 
## Coefficients:
##          ma1     sar1     sar2
##       0.2526  -0.6856  -0.3802
## s.e.  0.1279   0.1118   0.1171
## 
## sigma^2 = 881.2:  log likelihood = -462.31
## AIC=932.63   AICc=933.07   BIC=942.89

Estimation

We estimate our best model:

data_ts.m1 <- Arima(data_ts, order = c(0, 0, 0),seasonal = list(order = c(2, 1, 0), period = 7), include.drift = FALSE)

Verification

We check the residual plots, including the time series and ACFs, to make sure that there is no warning message. In particular, residuals should have a zero mean, constant variance, and should be distributed symmetrically around mean zero. ACF of any lag greater than zero is statistically insignificant.

autoplot(data_ts.m1$residuals)

ggAcf(data_ts.m1$residuals, lag.max = 40)

ggPacf(data_ts.m1$residuals, lag.max = 40)

checkresiduals(data_ts.m1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0)(2,1,0)[7]
## Q* = 19.909, df = 12, p-value = 0.06883
## 
## Model df: 2.   Total lags used: 14

Residuals appear to be random, and there exists no inter-temporal correlation. Regarding the ACF plot, the rule of thumb is that no more than two or three out of 40 residuals fall outside the thresholds, so we are within the limit. The same applies to sample PACF of the residuals. So there is no warning message.

Now we forecast:

lettuce.f <- forecast(data_ts.m1, h = 14)
autoplot(lettuce.f)

We will check accuracy by testing out top three best models with part of the data. So we can evaluate out-of-sample performance of the models. For the performance comparison we will use the last 14 data points.

data_ts.m2 <- Arima(window(data_ts, end = c(16, 2)), order = c(0, 0, 0),seasonal = list(order = c(2, 1, 0), period = 7), include.drift = FALSE)
accuracy(forecast(data_ts.m2, h = 14), window(data_ts, start = c(16, 3)))
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.3193726 25.57387 19.32826 -3.094707 14.67793 0.7266928
## Test set     14.9981522 46.57808 35.69009  7.556829 22.97894 1.3418558
##                   ACF1 Theil's U
## Training set 0.1794441        NA
## Test set     0.2888364 0.7500206
data_ts.m3 <- Arima(window(data_ts, end = c(16, 2)), order = c(0, 0, 1),seasonal = list(order = c(2, 1, 0), period = 7), include.drift = FALSE)
accuracy(forecast(data_ts.m3, h = 14), window(data_ts, start = c(16, 3)))
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.2881424 24.96695 18.73205 -2.976765 14.21303 0.7042769
## Test set     14.9970723 47.49726 37.18525  7.464603 23.92493 1.3980701
##                      ACF1 Theil's U
## Training set 0.0004257993        NA
## Test set     0.2945371561 0.7558137
data_ts.m4 <- Arima(window(data_ts, end = c(16, 2)), order = c(1, 0, 0),seasonal = list(order = c(2, 1, 0), period = 7), include.drift = FALSE)
accuracy(forecast(data_ts.m4, h = 14), window(data_ts, start = c(16, 3)))
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.2536805 24.99110 18.77685 -2.990367 14.23154 0.7059615
## Test set     14.7972219 47.61884 37.37253  7.325945 24.03950 1.4051111
##                    ACF1 Theil's U
## Training set 0.01284025        NA
## Test set     0.29724360 0.7573132

When we evaluate the performance of models we will use RMSE as the most appropriate measure. If we analyze the RMSE (root mean squared error) values of the test sets of the three models, the values are close to each other, but the model (0,0,0)(2,1,0)[7] seems to have the best performance.

Holt-Winters model

In order to apply the Holt-Winters model we will use the ets() function.
At the model parameter, by reinspecting the plot, we see we should be expecting an ANA model, because we have no trend (second letter: N) and seasonal variations are constant over time (third letter: A). We also expect an additive error type (first letter: A).
Again, we save the last 14 data points for the model comparison part.

autoplot(data_ts)

data_ts.ets1 <- ets(window(data_ts, end = c(16, 2)), model = "ANA")
data_ts.ets2 <- ets(window(data_ts, end = c(16, 2)), model = "ZZZ")

data_ts.ets2
## ETS(A,N,A) 
## 
## Call:
##  ets(y = window(data_ts, end = c(16, 2)), model = "ZZZ") 
## 
##   Smoothing parameters:
##     alpha = 0.0946 
##     gamma = 1e-04 
## 
##   Initial states:
##     l = 146.0165 
##     s = 32.6444 19.1329 25.5759 20.8428 -70.5411 -47.1212
##            19.4663
## 
##   sigma:  24.3637
## 
##       AIC      AICc       BIC 
##  978.3710  981.1915 1003.2574

We check the output of the “ZZZ” model, which automatically determines the optimal type of each element in the model, also generates the ANA model that was expected.

Model comparison

We will compare the performance of our best ARIMA model and the Holt-Winters model. We compare against the true last 14 data points.

accuracy(forecast(data_ts.m2, h = 14), window(data_ts, start = c(16, 3)))
##                      ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.3193726 25.57387 19.32826 -3.094707 14.67793 0.7266928
## Test set     14.9981522 46.57808 35.69009  7.556829 22.97894 1.3418558
##                   ACF1 Theil's U
## Training set 0.1794441        NA
## Test set     0.2888364 0.7500206
accuracy(forecast(data_ts.ets2, h = 14), window(data_ts, start = c(16, 3)))
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -1.014639 23.09898 18.47706 -3.708049 14.26531 0.6946899
## Test set     12.025567 38.53718 28.55525  5.920676 18.50333 1.0736040
##                    ACF1 Theil's U
## Training set 0.08798640        NA
## Test set     0.07795779  0.608142

We check that the Holt-Winters model has the lower RMSE, so we conclude it has the best performance. We can also see that the other error measures at the Holt-Winters model are also lower in general.

Final forecast for the upcoming two weeks

We will use our best model (Holt-Winters ANA model) to forecast the upcoming two weeks:

data_ts.ets.final <- ets(data_ts, model = "ANA")

lettuce.forecast <- forecast(data_ts.ets.final, h = 14)
autoplot(lettuce.forecast)

Save the point forecast column in a .csv file:

lettuce.forecast
##          Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## 18.28571      159.79583 125.66254 193.9291 107.59347 211.9982
## 18.42857      173.08458 138.95130 207.2179 120.88223 225.2869
## 18.57143      164.56924 130.43595 198.7025 112.36688 216.7716
## 18.71429      101.27350  67.14021 135.4068  49.07114 153.4758
## 18.85714       76.91392  42.78064 111.0472  24.71157 129.1163
## 19.00000      170.12068 135.98740 204.2540 117.91833 222.3230
## 19.14286      175.76910 141.63581 209.9024 123.56674 227.9715
## 19.28571      159.79583 125.66254 193.9291 107.59347 211.9982
## 19.42857      173.08458 138.95130 207.2179 120.88223 225.2869
## 19.57143      164.56924 130.43595 198.7025 112.36688 216.7716
## 19.71429      101.27350  67.14021 135.4068  49.07114 153.4759
## 19.85714       76.91392  42.78064 111.0472  24.71156 129.1163
## 20.00000      170.12068 135.98740 204.2540 117.91833 222.3230
## 20.14286      175.76910 141.63581 209.9024 123.56674 227.9715
write.csv(lettuce.forecast[2], "forecast_lettuce.csv")