Demand forecasting is an essential component of planning and efficiency in the retail industry: it allows us to analyse past data and build a forward-looking view. Learning from the past helps ensure that enough beer is produced to meet demand while avoiding overproduction, resulting in little obsolete stock and low inventory.
In this project, we will analyse a US beer company’s top 3 brands from 14 September 1989 to 19 January 1994. The weekly historical data, spanning a total of 4.4 years, includes the weekly retail and wholesale prices by brand, sales and retail margin by brand, and price and non-price promotions by brand.
We will explore two different methods: a multivariate linear regression (MLR) and neural networks (NN) to estimate and predict each brand’s sales. Throughout the analysis we will also comment on the effects of price vs non-price promotion on each brand.
We begin by loading the dataset and adding the libraries needed to perform this analysis.
library(openxlsx)
## Warning: package 'openxlsx' was built under R version 4.0.5
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
BeerData <- read.xlsx("beer.xlsx")
Below, we present the first rows of the dataset to visualize it and identify the information contained in the different columns.
nrow(BeerData)
## [1] 227
head(BeerData)
The dataset contains data on sales for three different beer brands for 227 weeks.
The Category Sales column displays the total weekly sales resulting from aggregating the sales of the three beer brands (i.e., Sales Brand). The price at which the product is sold each week is shown in the Price Brand column, and the price paid to the manufacturer is expressed in the Wholesale Price column.
Feature relates to price promotions for each brand (e.g., 15% off), and Display relates to the non-price promotions (e.g., buy 1 get 1 free) for each brand. We can also observe that there are multiple scenarios: either only 1 type of promotion is leveraged, or both promotions are leveraged at the same time.
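The co-occurrence of the two promotion types can be checked with a quick cross-tabulation. A minimal sketch on toy 0/1 flags mirroring the dataset’s coding (on the real data, `table(BeerData$FEATUREBRAND1, BeerData$display_brand1)` would be the direct analogue):

```r
# Toy 0/1 promotion flags mirroring the dataset's coding
feature <- c(0, 1, 0, 1, 0, 0)  # price promotion flag (e.g. 15% off)
display <- c(0, 0, 1, 1, 0, 1)  # non-price promotion flag (e.g. buy 1 get 1 free)
# Cells count weeks with neither promotion, one, or both at the same time
table(feature, display)
```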
Correlation Analysis
Having understood each of the variables, we can create correlations for each distinct brand as per the assignment to explore the relationship between variables. However, in a business context, we would also consider the effects of the marketing-mix on sales for the whole company as this will allow us to make decisions for the company as a whole. Observing brands in isolation provides insights into the performance of individual brands; however, a CMO would be interested in the performance of all brands as one brand may be taking market share from another brand.
BeerData1 = data.frame(BeerData$PRICEBRAND1, BeerData$display_brand1, BeerData$FEATUREBRAND1, BeerData$RETAILERMARGINBRAND1, BeerData$SALESBRAND1, BeerData$WHOLESALEPRICE1)
colnames(BeerData1) <- c('pb1','db1','fb1', 'rmb1', 'sb1', 'wp1')
BeerData2 = data.frame(BeerData$PRICEBRAND2, BeerData$display_brand2, BeerData$FEATUREBRAND2, BeerData$RETAILERMARGINBRAND2, BeerData$SALESBRAND2, BeerData$WHOLESALEPRICE2)
colnames(BeerData2) <- c('pb2','db2','fb2', 'rmb2', 'sb2', 'wp2')
BeerData3 = data.frame(BeerData$PRICEBRAND3, BeerData$display_brand3, BeerData$FEATUREBRAND3, BeerData$RETAILERMARGINBRAND3, BeerData$SALESBRAND3, BeerData$WHOLESALEPRICE3)
colnames(BeerData3) <- c('pb3','db3','fb3', 'rmb3', 'sb3', 'wp3')
Below we see the correlation plots for each brand:
BeerData1.cor <- cor(BeerData1)
BeerData2.cor <- cor(BeerData2)
BeerData3.cor <- cor(BeerData3)
#install.packages("corrplot")
library(corrplot)
## corrplot 0.92 loaded
corrplot(BeerData1.cor, type = "upper", addCoef.col = 1,
number.cex = 0.5, order = "hclust", tl.cex = 0.7,
tl.col = "black", tl.srt = 45, main = "Beer Brand 1 - Correlation Plot", mar=c(0,0,1,0))
corrplot(BeerData2.cor, type = "upper", addCoef.col = 1,
number.cex = 0.5, order = "hclust", tl.cex = 0.7,
tl.col = "black", tl.srt = 45, main = "Beer Brand 2 - Correlation Plot", mar=c(0,0,1,0))
corrplot(BeerData3.cor, type = "upper", addCoef.col = 1,
number.cex = 0.5, order = "hclust", tl.cex = 0.7,
tl.col = "black", tl.srt = 45, main = "Beer Brand 3 - Correlation Plot", mar=c(0,0,1,0))
#Correlation Matrices
#round(BeerData1.cor, 2)
#round(BeerData2.cor, 2)
#round(BeerData3.cor, 2)
The positive correlations in the above plots are displayed in blue, and the negative correlations in red. The color saturation, together with the size of the circle, is proportional to the correlation coefficient. The correlation matrix was also re-ordered via hierarchical clustering to better identify patterns in the matrix. In addition, only the upper triangle is shown because the matrix is symmetric.
From the plots above, we can see that common correlation patterns among variables exist across the beer brands; however, we can also spot some differences. For instance, all brands show a negative correlation between sales and retail prices, indicating that beer sales are higher when prices are lower and vice versa. For brands 1 and 3, there is a positive correlation between sales and both non-price (display) and price (feature) promotions, whereas for brand 2 there is a clear positive correlation of sales with non-price promotion only. Additionally, brand 3 seems to be the only brand exhibiting a strong positive correlation between retail and wholesale prices, which also results in a negative correlation between sales and the wholesale price.
Brand 1
To prepare the data for the estimation, we divide the original dataset into two subsets: a training subset and a testing subset. For this, we follow the 80/20 general rule by which 80% of the data is used for training purposes and the remaining 20% for testing our models.
set.seed(123)
# A random sample of 80% of the data for training purposes
train_ind_mlr <- sample(seq_len(nrow(BeerData)), size = floor(nrow(BeerData) * 0.8))
train_mlr <- BeerData[train_ind_mlr,]
test_mlr <- BeerData[-train_ind_mlr,]
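As a quick sanity check on the 80/20 split with 227 weeks, the counts implied by the `floor()` call above can be verified directly:

```r
# Sizes implied by the 80/20 split of the 227-week dataset
n <- 227                       # total weeks in the dataset
train_size <- floor(n * 0.8)   # weeks used for training
test_size <- n - train_size    # remaining weeks used for testing
c(train_size, test_size)       # 181 46
```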
We then proceed to build a regression model and select its explanatory variables (otherwise known as variable selection). For the regression model, we select sales as the dependent variable, and price, display, feature and retail margin as the independent variables. The parameters of the model are estimated using the code below.
lm_estimation1 <- lm(SALESBRAND1 ~ PRICEBRAND1 + display_brand1 + FEATUREBRAND1 + RETAILERMARGINBRAND1, data = train_mlr)
# Summarizing model results using the "summary" function:
summary(lm_estimation1)
##
## Call:
## lm(formula = SALESBRAND1 ~ PRICEBRAND1 + display_brand1 + FEATUREBRAND1 +
## RETAILERMARGINBRAND1, data = train_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15900.5 -5129.1 -635.5 4055.9 29035.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 139650 16406 8.512 7.30e-15 ***
## PRICEBRAND1 -22812 3208 -7.111 2.78e-11 ***
## display_brand1 7269 7727 0.941 0.348
## FEATUREBRAND1 32776 27761 1.181 0.239
## RETAILERMARGINBRAND1 -48562 11740 -4.137 5.45e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8145 on 176 degrees of freedom
## Multiple R-squared: 0.3957, Adjusted R-squared: 0.382
## F-statistic: 28.82 on 4 and 176 DF, p-value: < 2.2e-16
From the results, we can see that in this particular model the retail price and retail margin have negative coefficients and these variables are statistically significant. On the contrary, the feature and display variables both have positive coefficients, denoting a positive relation between both types of promotion and sales for brand 1. The coefficient for feature is greater than that for display, suggesting that the price promotion has a greater impact on sales for this brand. However, neither is statistically significant, indicating that sales of brand 1 are mainly driven by retail price and retail margin; promotions seem to have a lower impact on the sales of this brand.
Forward Stepwise Selection
We use the forward stepwise selection approach to confirm the explanatory variables that should be included in our regression model. This is shown below.
#define intercept-only model
intercept_only <- lm(SALESBRAND1 ~ 1, data=train_mlr)
#define model with all predictors
all <- lm(SALESBRAND1 ~ PRICEBRAND1 + display_brand1 + FEATUREBRAND1 + RETAILERMARGINBRAND1 + WHOLESALEPRICE1, data=train_mlr)
#perform forward stepwise regression
forward <- step(intercept_only, direction='forward', scope=formula(all), trace=1)
## Start: AIC=3347.96
## SALESBRAND1 ~ 1
##
## Df Sum of Sq RSS AIC
## + PRICEBRAND1 1 6189676371 1.3131e+10 3280.1
## + display_brand1 1 2683713194 1.6637e+10 3322.9
## + FEATUREBRAND1 1 1061332526 1.8260e+10 3339.7
## + RETAILERMARGINBRAND1 1 878458559 1.8443e+10 3341.5
## + WHOLESALEPRICE1 1 216268036 1.9105e+10 3347.9
## <none> 1.9321e+10 3348.0
##
## Step: AIC=3280.06
## SALESBRAND1 ~ PRICEBRAND1
##
## Df Sum of Sq RSS AIC
## + RETAILERMARGINBRAND1 1 1329511104 1.1802e+10 3262.7
## + WHOLESALEPRICE1 1 144940026 1.2986e+10 3280.0
## <none> 1.3131e+10 3280.1
## + display_brand1 1 136724268 1.2995e+10 3280.2
## + FEATUREBRAND1 1 125390185 1.3006e+10 3280.3
##
## Step: AIC=3262.74
## SALESBRAND1 ~ PRICEBRAND1 + RETAILERMARGINBRAND1
##
## Df Sum of Sq RSS AIC
## <none> 1.1802e+10 3262.7
## + FEATUREBRAND1 1 68017125 1.1734e+10 3263.7
## + display_brand1 1 34254276 1.1768e+10 3264.2
## + WHOLESALEPRICE1 1 3362242 1.1798e+10 3264.7
#view results of forward stepwise regression
forward$anova
#view final model
forward$coefficients
## (Intercept) PRICEBRAND1 RETAILERMARGINBRAND1
## 153256.64 -25288.70 -51515.65
The forward stepwise selection suggests considering price and retail margin as explanatory variables for sales of brand 1, excluding the display and feature variables from the model. Thus, we propose a second regression model similar to the one above but with the display and feature variables removed.
lm_estimation2 <- lm(SALESBRAND1 ~ PRICEBRAND1 + RETAILERMARGINBRAND1, data = train_mlr)
# Summarizing model results using the "summary" function:
summary(lm_estimation2)
##
## Call:
## lm(formula = SALESBRAND1 ~ PRICEBRAND1 + RETAILERMARGINBRAND1,
## data = train_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16244.2 -5180.5 -371.3 4171.7 29007.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 153257 12155 12.608 < 2e-16 ***
## PRICEBRAND1 -25289 2527 -10.008 < 2e-16 ***
## RETAILERMARGINBRAND1 -51516 11504 -4.478 1.34e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8143 on 178 degrees of freedom
## Multiple R-squared: 0.3892, Adjusted R-squared: 0.3823
## F-statistic: 56.7 on 2 and 178 DF, p-value: < 2.2e-16
Brand 2
When applying the forward stepwise selection method to the data for brand 2, we find that the variables price, display, retail margin and wholesale price are suggested for the regression model. This is shown below.
#define intercept-only model
intercept_only2 <- lm(SALESBRAND2 ~ 1, data=train_mlr)
#define model with all predictors
all2 <- lm(SALESBRAND2~ PRICEBRAND2 + display_brand2 + FEATUREBRAND2 + RETAILERMARGINBRAND2 + WHOLESALEPRICE2, data=train_mlr)
#perform forward stepwise regression
forward2 <- step(intercept_only2, direction='forward', scope=formula(all2), trace=1)
## Start: AIC=3320.05
## SALESBRAND2 ~ 1
##
## Df Sum of Sq RSS AIC
## + PRICEBRAND2 1 6885213786 9.6748e+09 3224.8
## + display_brand2 1 557525377 1.6003e+10 3315.9
## <none> 1.6560e+10 3320.0
## + WHOLESALEPRICE2 1 120942162 1.6439e+10 3320.7
## + RETAILERMARGINBRAND2 1 44641758 1.6515e+10 3321.6
## + FEATUREBRAND2 1 11952682 1.6548e+10 3321.9
##
## Step: AIC=3224.77
## SALESBRAND2 ~ PRICEBRAND2
##
## Df Sum of Sq RSS AIC
## + display_brand2 1 1209084824 8465759199 3202.6
## + RETAILERMARGINBRAND2 1 740965591 8933878432 3212.3
## + WHOLESALEPRICE2 1 124047126 9550796897 3224.4
## <none> 9674844023 3224.8
## + FEATUREBRAND2 1 32188887 9642655136 3226.2
##
## Step: AIC=3202.6
## SALESBRAND2 ~ PRICEBRAND2 + display_brand2
##
## Df Sum of Sq RSS AIC
## + RETAILERMARGINBRAND2 1 527515697 7938243502 3193.0
## + WHOLESALEPRICE2 1 149963784 8315795415 3201.4
## <none> 8465759199 3202.6
## + FEATUREBRAND2 1 18897581 8446861618 3204.2
##
## Step: AIC=3192.96
## SALESBRAND2 ~ PRICEBRAND2 + display_brand2 + RETAILERMARGINBRAND2
##
## Df Sum of Sq RSS AIC
## + WHOLESALEPRICE2 1 118752205 7819491297 3192.2
## <none> 7938243502 3193.0
## + FEATUREBRAND2 1 28463602 7909779900 3194.3
##
## Step: AIC=3192.23
## SALESBRAND2 ~ PRICEBRAND2 + display_brand2 + RETAILERMARGINBRAND2 +
## WHOLESALEPRICE2
##
## Df Sum of Sq RSS AIC
## <none> 7819491297 3192.2
## + FEATUREBRAND2 1 15011995 7804479302 3193.9
#view results of forward stepwise regression
forward2$anova
#view final model
forward2$coefficients
## (Intercept) PRICEBRAND2 display_brand2
## 139780.345 -20916.204 67948.580
## RETAILERMARGINBRAND2 WHOLESALEPRICE2
## -38886.439 -5391.805
Thus, we formulate a regression model with these variables as follows:
lm_estimation3 <- lm(SALESBRAND2 ~ PRICEBRAND2 + display_brand2 + RETAILERMARGINBRAND2 + WHOLESALEPRICE2, data = train_mlr)
# Summarizing model results using the "summary" function:
summary(lm_estimation3)
##
## Call:
## lm(formula = SALESBRAND2 ~ PRICEBRAND2 + display_brand2 + RETAILERMARGINBRAND2 +
## WHOLESALEPRICE2, data = train_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15831 -4186 -272 3531 27778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 139780 15377 9.090 < 2e-16 ***
## PRICEBRAND2 -20916 1555 -13.452 < 2e-16 ***
## display_brand2 67949 14173 4.794 3.46e-06 ***
## RETAILERMARGINBRAND2 -38886 11635 -3.342 0.00102 **
## WHOLESALEPRICE2 -5392 3298 -1.635 0.10386
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6666 on 176 degrees of freedom
## Multiple R-squared: 0.5278, Adjusted R-squared: 0.5171
## F-statistic: 49.18 on 4 and 176 DF, p-value: < 2.2e-16
The regression parameters show that the variable Wholesale Price is not significant for sales of beer brand 2. Thus, we propose another regression model removing this variable, using the code below.
lm_estimation4 <- lm(SALESBRAND2 ~ PRICEBRAND2 + display_brand2 + RETAILERMARGINBRAND2, data = train_mlr)
# Summarizing model results using the "summary" function:
summary(lm_estimation4)
##
## Call:
## lm(formula = SALESBRAND2 ~ PRICEBRAND2 + display_brand2 + RETAILERMARGINBRAND2,
## data = train_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15667.0 -4443.2 -355.1 3942.8 27582.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 117759 7452 15.801 < 2e-16 ***
## PRICEBRAND2 -20936 1562 -13.401 < 2e-16 ***
## display_brand2 67043 14229 4.712 4.95e-06 ***
## RETAILERMARGINBRAND2 -40019 11669 -3.430 0.000752 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6697 on 177 degrees of freedom
## Multiple R-squared: 0.5206, Adjusted R-squared: 0.5125
## F-statistic: 64.08 on 3 and 177 DF, p-value: < 2.2e-16
From the result, we can see that sales in the case of brand 2 are driven by retail price, display and the retail margin. Therefore, for this brand non-price promotions have a greater impact on sales compared to price promotions.
Brand 3
For brand 3, the forward stepwise selection method suggests retail price, wholesale price, retail margin and feature as explanatory variables in a regression model, as indicated below.
#define intercept-only model
intercept_only3 <- lm(SALESBRAND3 ~ 1, data=train_mlr)
#define model with all predictors
all3 <- lm(SALESBRAND3~ PRICEBRAND3 + display_brand3 + FEATUREBRAND3 + RETAILERMARGINBRAND3 + WHOLESALEPRICE3, data=train_mlr)
#perform forward stepwise regression
forward3 <- step(intercept_only3, direction='forward', scope=formula(all3), trace=1)
## Start: AIC=2801.03
## SALESBRAND3 ~ 1
##
## Df Sum of Sq RSS AIC
## + PRICEBRAND3 1 404782261 536474015 2701.3
## + RETAILERMARGINBRAND3 1 208670523 732585752 2757.7
## + WHOLESALEPRICE3 1 207217914 734038361 2758.0
## + display_brand3 1 77808476 863447799 2787.4
## + FEATUREBRAND3 1 45998934 895257342 2794.0
## <none> 941256275 2801.0
##
## Step: AIC=2701.27
## SALESBRAND3 ~ PRICEBRAND3
##
## Df Sum of Sq RSS AIC
## + WHOLESALEPRICE3 1 16924786 519549228 2697.5
## + display_brand3 1 16876467 519597547 2697.5
## + FEATUREBRAND3 1 15592874 520881141 2697.9
## + RETAILERMARGINBRAND3 1 12170185 524303830 2699.1
## <none> 536474015 2701.3
##
## Step: AIC=2697.47
## SALESBRAND3 ~ PRICEBRAND3 + WHOLESALEPRICE3
##
## Df Sum of Sq RSS AIC
## + RETAILERMARGINBRAND3 1 19075776 500473452 2692.7
## + FEATUREBRAND3 1 17356510 502192718 2693.3
## + display_brand3 1 6507916 513041312 2697.2
## <none> 519549228 2697.5
##
## Step: AIC=2692.69
## SALESBRAND3 ~ PRICEBRAND3 + WHOLESALEPRICE3 + RETAILERMARGINBRAND3
##
## Df Sum of Sq RSS AIC
## + FEATUREBRAND3 1 15356926 485116526 2689.1
## <none> 500473452 2692.7
## + display_brand3 1 2501391 497972061 2693.8
##
## Step: AIC=2689.05
## SALESBRAND3 ~ PRICEBRAND3 + WHOLESALEPRICE3 + RETAILERMARGINBRAND3 +
## FEATUREBRAND3
##
## Df Sum of Sq RSS AIC
## <none> 485116526 2689.1
## + display_brand3 1 419057 484697469 2690.9
#view results of forward stepwise regression
forward3$anova
#view final model
forward3$coefficients
## (Intercept) PRICEBRAND3 WHOLESALEPRICE3
## 53346.84 13568.69 -24016.34
## RETAILERMARGINBRAND3 FEATUREBRAND3
## -92644.75 26809.12
We therefore explore a model with these variables.
lm_estimation5 <- lm(SALESBRAND3 ~ PRICEBRAND3 + FEATUREBRAND3 + RETAILERMARGINBRAND3 + WHOLESALEPRICE3, data = train_mlr)
# Summarizing model results using the "summary" function:
summary(lm_estimation5)
##
## Call:
## lm(formula = SALESBRAND3 ~ PRICEBRAND3 + FEATUREBRAND3 + RETAILERMARGINBRAND3 +
## WHOLESALEPRICE3, data = train_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3382.7 -1232.4 -144.4 876.9 7098.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53347 6067 8.793 1.3e-15 ***
## PRICEBRAND3 13569 7426 1.827 0.0693 .
## FEATUREBRAND3 26809 11358 2.360 0.0194 *
## RETAILERMARGINBRAND3 -92645 37221 -2.489 0.0137 *
## WHOLESALEPRICE3 -24016 8526 -2.817 0.0054 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1660 on 176 degrees of freedom
## Multiple R-squared: 0.4846, Adjusted R-squared: 0.4729
## F-statistic: 41.37 on 4 and 176 DF, p-value: < 2.2e-16
The parameters of the regression model indicate that the retail price variable is not as statistically significant as the other explanatory variables. Moreover, we also know that this variable is strongly positively correlated with the wholesale price. Thus, we test a new model that excludes this variable, shown below.
lm_estimation6 <- lm(SALESBRAND3 ~ FEATUREBRAND3 + RETAILERMARGINBRAND3 + WHOLESALEPRICE3, data = train_mlr)
# Summarizing model results using the "summary" function:
summary(lm_estimation6)
##
## Call:
## lm(formula = SALESBRAND3 ~ FEATUREBRAND3 + RETAILERMARGINBRAND3 +
## WHOLESALEPRICE3, data = train_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3122.9 -1220.4 -199.3 756.9 8370.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45046.1 4048.6 11.126 < 2e-16 ***
## FEATUREBRAND3 27772.6 11420.4 2.432 0.016 *
## RETAILERMARGINBRAND3 -24838.7 2933.4 -8.468 9.33e-15 ***
## WHOLESALEPRICE3 -8536.8 969.5 -8.805 1.17e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1671 on 177 degrees of freedom
## Multiple R-squared: 0.4748, Adjusted R-squared: 0.4659
## F-statistic: 53.34 on 3 and 177 DF, p-value: < 2.2e-16
In this case, we can see that sales of brand 3 are driven by wholesale price, feature and the retail margin. Therefore, for this brand price promotions have a greater impact on sales compared to non-price promotions.
Variance
We begin the diagnostics of our regression models by checking the residuals, aiming to determine whether there is homoscedasticity, that is, constant variance. Uneven variances would suggest the possibility of biased or skewed results, so this is an important first step in model diagnosis.
The residuals for the model chosen for each brand are plotted as follows:
# For Brand 1: plot residuals against the fitted values
plot(lm_estimation2$fitted.values, lm_estimation2$residuals, ylab="Residuals", xlab="Fitted Values", main="Residuals vs Fitted Values Plot for Brand 1")
# For Brand 2
plot(lm_estimation4$fitted.values, lm_estimation4$residuals, ylab="Residuals", xlab="Fitted Values", main="Residuals vs Fitted Values Plot for Brand 2")
# For Brand 3
plot(lm_estimation6$fitted.values, lm_estimation6$residuals, ylab="Residuals", xlab="Fitted Values", main="Residuals vs Fitted Values Plot for Brand 3")
From the figures, the residuals obtained from the regression model used to forecast sales for each beer brand appear fairly random in all cases, thus, showcasing homoscedasticity.
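Beyond eyeballing the plots, a Breusch-Pagan-style test offers a more formal check of constant variance. A minimal base-R sketch on simulated data (applying the same auxiliary regression to the residuals and predictors of `lm_estimation2` and the other models would be the direct analogue):

```r
# Breusch-Pagan-style check using only base R: regress the squared residuals
# on the predictor; n * R^2 is approximately chi-squared under homoscedasticity
set.seed(1)
x <- runif(100)
y <- 2 * x + rnorm(100)          # simulated stand-in with homoscedastic errors
fit <- lm(y ~ x)
aux <- lm(I(resid(fit)^2) ~ x)   # auxiliary regression on squared residuals
stat <- length(x) * summary(aux)$r.squared
p_value <- pchisq(stat, df = 1, lower.tail = FALSE)
p_value  # a large p-value gives no evidence against constant variance
```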
Normality
The normality assumption for multivariate regression concerns the distribution of the residuals: the random error must be normally distributed. This is particularly important for significance testing and for small sample sizes. For larger sample sizes (>200), the central limit theorem suggests that the distribution of residuals will approximate a normal distribution. Given we have 227 weeks of data, we are close to a large sample size, but we will nonetheless observe the histogram of residuals for each brand.
#load ggplot2
library(ggplot2)
#create histogram of residuals for Brand 1
ggplot(data = train_mlr, aes(x = lm_estimation2$residuals)) +
geom_histogram(fill = 'steelblue', color = 'black', bins=20) +
labs(title = 'Histogram of Residuals for Brand 1', x = 'Residuals', y = 'Frequency')
#create histogram of residuals for Brand 2
ggplot(data = train_mlr, aes(x = lm_estimation4$residuals)) +
geom_histogram(fill = 'steelblue', color = 'black', bins=20) +
labs(title = 'Histogram of Residuals for Brand 2', x = 'Residuals', y = 'Frequency')
#create histogram of residuals for Brand 3
ggplot(data = train_mlr, aes(x = lm_estimation6$residuals)) +
geom_histogram(fill = 'steelblue', color = 'black', bins=20) +
labs(title = 'Histogram of Residuals for Brand 3', x = 'Residuals', y = 'Frequency')
The residuals seem to follow a fairly normal distribution. If we wanted further evidence of normality, we could apply a formal goodness-of-fit test, for example. However, given our larger sample size, we rely on the central limit theorem and assume approximate normality.
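If a formal check were wanted, base R’s `shapiro.test` is one option. A minimal sketch on simulated residuals (running it on `lm_estimation2$residuals` and the other models would be the direct analogue):

```r
# Shapiro-Wilk test of normality (base R) on simulated stand-in residuals
set.seed(42)
res <- rnorm(181)        # same length as our 181-week training residuals
sw <- shapiro.test(res)
sw$p.value               # a large p-value gives no evidence against normality
```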
After estimating an MLR model using the training set, we forecast sales in both the training and the testing sets using the predict function.
Brand 1
# Generate predictions for the training and testing sets, named lm_predict_train and lm_predict_test. For the testing set, predictions are made on data the model did not see during estimation.
lm_predict_train <- predict(lm_estimation2, newdata = train_mlr)
lm_predict_test <- predict(lm_estimation2, newdata = test_mlr)
# We name the predicted sales using MLR method "Predict_MLR", and add that column into our dataset:
BeerData[train_ind_mlr,'Predict_MLR'] <- lm_predict_train
BeerData[-train_ind_mlr,'Predict_MLR'] <- lm_predict_test
# Take a look at the updated data, which should now have one more column, Predict_MLR. Since we are mostly interested in predicting the testing set, we only check those rows:
head(BeerData[-train_ind_mlr,])
Visualizing Prediction Performance
week <- c(1:46)
## MLR method: extract actual and predicted sales
actual_sales <- BeerData[-train_ind_mlr,'SALESBRAND1']
predicted_sales_mlr <-BeerData[-train_ind_mlr,'Predict_MLR']
# Construct a data frame for this plotting task
plot_data_mlr <-data.frame(week, actual_sales, predicted_sales_mlr)
# Plot dots and lines: actual sales are in red while predicted sales are in green.
plot(plot_data_mlr$week, plot_data_mlr$actual_sales, col="red", main = "Actual vs Predicted Sales for Brand 1", ylab = "Sales", xlab = "Week")
lines(plot_data_mlr$week, plot_data_mlr$actual_sales, col="red")
points(plot_data_mlr$week, plot_data_mlr$predicted_sales_mlr, col="green")
lines(plot_data_mlr$week, plot_data_mlr$predicted_sales_mlr,col="green")
Brand 2
# Generate predictions for the training and testing sets. For the testing set, predictions are made on data the model did not see during estimation.
lm_predict_train4 <- predict(lm_estimation4, newdata = train_mlr)
lm_predict_test4 <- predict(lm_estimation4, newdata = test_mlr)
# We name the predicted sales using MLR method "Predict_MLR", and add that column into our dataset:
BeerData[train_ind_mlr,'Predict_MLR'] <- lm_predict_train4
BeerData[-train_ind_mlr,'Predict_MLR'] <- lm_predict_test4
# Take a look at the updated data, which should now have one more column, Predict_MLR. Since we are mostly interested in predicting the testing set, we only check those rows:
head(BeerData[-train_ind_mlr,])
Visualizing Prediction Performance
week <- c(1:46)
## MLR method: extract actual and predicted sales
actual_sales <- BeerData[-train_ind_mlr,'SALESBRAND2']
predicted_sales_mlr <-BeerData[-train_ind_mlr,'Predict_MLR']
# Construct a data frame for this plotting task
plot_data_mlr <-data.frame(week, actual_sales, predicted_sales_mlr)
# Plot dots and lines: actual sales are in red while predicted sales are in green.
plot(plot_data_mlr$week, plot_data_mlr$actual_sales, col="red", main = "Actual vs Predicted Sales for Brand 2", ylab = "Sales", xlab = "Week")
lines(plot_data_mlr$week, plot_data_mlr$actual_sales, col="red")
points(plot_data_mlr$week, plot_data_mlr$predicted_sales_mlr, col="green")
lines(plot_data_mlr$week, plot_data_mlr$predicted_sales_mlr,col="green")
Brand 3
# Generate predictions for the training and testing sets. For the testing set, predictions are made on data the model did not see during estimation.
lm_predict_train6 <- predict(lm_estimation6, newdata = train_mlr)
lm_predict_test6 <- predict(lm_estimation6, newdata = test_mlr)
# We name the predicted sales using MLR method "Predict_MLR", and add that column into our dataset:
BeerData[train_ind_mlr,'Predict_MLR'] <- lm_predict_train6
BeerData[-train_ind_mlr,'Predict_MLR'] <- lm_predict_test6
# Take a look at the updated data, which should now have one more column, Predict_MLR. Since we are mostly interested in predicting the testing set, we only check those rows:
head(BeerData[-train_ind_mlr,])
Visualizing Prediction Performance
week <- c(1:46)
## MLR method: extract actual and predicted sales
actual_sales <- BeerData[-train_ind_mlr,'SALESBRAND3']
predicted_sales_mlr <-BeerData[-train_ind_mlr,'Predict_MLR']
# Construct a data frame for this plotting task
plot_data_mlr <-data.frame(week, actual_sales, predicted_sales_mlr)
# Plot dots and lines: actual sales are in red while predicted sales are in green.
plot(plot_data_mlr$week, plot_data_mlr$actual_sales, col="red", main = "Actual vs Predicted Sales for Brand 3", ylab = "Sales", xlab = "Week")
lines(plot_data_mlr$week, plot_data_mlr$actual_sales, col="red")
points(plot_data_mlr$week, plot_data_mlr$predicted_sales_mlr, col="green")
lines(plot_data_mlr$week, plot_data_mlr$predicted_sales_mlr,col="green")
From the figures above, we notice that the predictions obtained from the regression models for each beer brand generally follow the trend of the actual data. Nevertheless, there are still significant gaps between the forecast and actual values, consistent with the relatively low R-squared values obtained in the regressions; the accuracy of the regression models is therefore limited. In particular, around week 30 the actual value is much higher than the predicted value for all three brands. One possible explanation is the delayed effect of advertising: there is a lag between advertising and sales as consumers move from the “consideration” stage of the marketing funnel to “conversion”. Advertising can speed up this customer journey, but a delay remains while consumers are exposed to the advertising and consider and evaluate before making a purchase. This complexity in mapping the media marketing mix to sales suggests that advertising has a non-linear effect. In addition, diminishing returns in advertising make it even harder to determine the incremental effects of price and non-price promotions on sales.
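One simple way to probe such a delayed effect would be to add a one-week lag of a promotion flag as an extra regressor. A minimal sketch of the lagging step, on a toy weekly series (the helper `lag1` is illustrative, not part of the original analysis):

```r
# Shift a weekly series by one week: drop the last value, prepend NA
lag1 <- function(x) c(NA, x[-length(x)])
promo <- c(0, 1, 0, 0, 1, 0)  # toy weekly promotion flags
lag1(promo)                   # NA 0 1 0 0 1
```

A lagged column built this way could then join the other regressors in `lm()`, at the cost of losing the first week to the NA.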
Nowadays, we have new and more robust ways to analyse the effects of advertising on sales, some involving AI. In what follows, we apply neural networks and compare the two methods.
As we saw, our regression model has difficulty capturing the non-linear relationships at play, and it also requires assumptions such as normality. The multivariate time series of the US beer company therefore makes a good use case for techniques such as neural networks. A neural network analyses interconnected components; it stems from the mathematical model of a neuron, which takes inputs, processes them, and returns an output. Neural networks have proven successful in time-series prediction; other industries, such as banking, have leveraged this method for stock price prediction, for example.
Data normalization
We will begin by normalizing the data:
maxs <- apply(BeerData, 2, max)
mins <- apply(BeerData, 2, min)
# Now the scaled version of the data is saved as "scaled".
scaled <- as.data.frame(scale(BeerData, center = mins, scale = maxs - mins))
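The `scale()` call above performs min-max scaling, mapping each column to [0, 1]; the inverse transformation is what we will use later to revert predictions to the original scale. A minimal sketch on a toy column:

```r
# Min-max scaling of a toy column and its exact inverse
x <- c(10, 20, 40)
x_scaled <- (x - min(x)) / (max(x) - min(x))     # mapped into [0, 1]
x_back <- x_scaled * (max(x) - min(x)) + min(x)  # reverted to the original scale
```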
Similar to our regression analysis, we leverage the 80/20 principle to split our data into a training and testing set:
# Set the seed to make your partition reproducible
set.seed(123)
# Divide the data into training and testing set following the same 80/20 rule:
train_ind_nn <- sample(seq_len(nrow(BeerData)), size = floor(nrow(BeerData) * 0.8))
train_nn <- scaled[train_ind_nn,]
test_nn <- scaled[-train_ind_nn,]
# install.packages(neuralnet)
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 4.0.5
For our neural network models we will use all our explanatory variables because we expect the model to be able to capture the complex relationships between them and our dependent variable.
Brand 1
nn <- neuralnet(formula = SALESBRAND1 ~ PRICEBRAND1 + display_brand1 + FEATUREBRAND1 + RETAILERMARGINBRAND1 + WHOLESALEPRICE1,
data = train_nn,
hidden = 3,
linear.output=TRUE,
err.fct = 'sse')
Hidden parameter: as this is a simple model, we choose a single hidden layer with 3 neurons (roughly the mean of the number of independent and dependent variables).
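The heuristic behind choosing 3 neurons can be made explicit (5 input variables, 1 output):

```r
n_inputs <- 5   # price, display, feature, retailer margin, wholesale price
n_outputs <- 1  # sales
hidden_neurons <- (n_inputs + n_outputs) / 2  # mean of input and output counts
hidden_neurons  # 3
```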
plot(nn, rep = "best")
nn$result.matrix['error',]
## error
## 1.255035
The error value computed above is not useful for comparing estimation performance between models because it is computed on the scaled data. We therefore revert the fitted values back to the original scale.
#revert the fitted value back to original scale
fitted.train_nn <- nn$net.result[[1]] * (max(BeerData$SALESBRAND1)-min(BeerData$SALESBRAND1))+min(BeerData$SALESBRAND1)
Now we calculate the Root Mean Squared Error (RMSE). We use the RMSE to compare this model’s error with that of our linear regression model, placing both methods side by side.
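The per-brand RMSE computations that follow all use the same arithmetic, which can be wrapped in a small helper. A sketch:

```r
# Root Mean Squared Error between actual and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(c(0, 0), c(3, 4))  # sqrt((9 + 16) / 2)
```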
#use the row index to get the original value of sales in train dataset.
train_nn.r <- BeerData$SALESBRAND1[train_ind_nn]
#calculate the Root Mean Squared Error of train dataset
rmse.train_nn <- (sum((train_nn.r - fitted.train_nn )^2)/nrow(fitted.train_nn))^0.5
rmse.train_nn
## [1] 6736.373
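The RMSE computation above can be packaged as a small helper (a sketch; `rmse_vec` is our own name, not a library function), which captures the same calculation we repeat for the test set and the other brands:

```r
# Helper mirroring the RMSE computation above: square root of the mean squared error
rmse_vec <- function(actual, fitted) sqrt(mean((actual - fitted)^2))
# Toy check with made-up numbers: errors are -10, 10, -10, so RMSE is 10
rmse_vec(c(100, 200, 300), c(110, 190, 310))  # 10
```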
The RMSE is 6736.373, which is lower than the linear regression's 8143 for Brand 1. Note that both RMSEs are calculated from the training set so we can compare in-sample fit; we evaluate prediction performance on the testing data below.
Compute Prediction Error and Compare Model Performance Using MLR vs. NN
#fit model using test dataset
Predict.nn <- compute(nn,test_nn)
#get the predicted sales in original scale
Predict.nn_ <- Predict.nn$net.result*(max(BeerData$SALESBRAND1)-min(BeerData$SALESBRAND1))+min(BeerData$SALESBRAND1)
#gather test data
test.r_nn <- BeerData$SALESBRAND1[-train_ind_nn]
rmse.test_nn <- (sum((test.r_nn - Predict.nn_)^2)/length(test.r_nn))^0.5
rmse.test_nn
## [1] 7566.8
The RMSE for the test data is 7566.8, which is lower than the linear regression's 8165.
Visualizing Prediction Performance
# Below we print the predicted sales from the testing set, together with actual sales data
BeerData[train_ind_nn,'Predict_NN'] <- fitted.train_nn
BeerData[-train_ind_nn,'Predict_NN'] <- Predict.nn_
head(BeerData[-train_ind_nn,])
Plot
week <- c(1:46)
## NN method
actual_sales <- BeerData[-train_ind_nn,'SALESBRAND1']
predicted_sales_nn <-BeerData[-train_ind_nn,'Predict_NN']
plot_data_nn <-data.frame(week, actual_sales, predicted_sales_nn)
# The actual sales are in red while the predicted sales are in green
plot(plot_data_nn$week, plot_data_nn$actual_sales, col="red", main = "Actual vs Predicted Sales for Brand 1", ylab = "Sales", xlab = "Week")
lines(plot_data_nn$week, plot_data_nn$actual_sales, col="red")
points(plot_data_nn$week, plot_data_nn$predicted_sales_nn, col="green")
lines(plot_data_nn$week, plot_data_nn$predicted_sales_nn,col="green")
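Since ggplot2 is already loaded, the same comparison could alternatively be drawn with it; below is a self-contained sketch using made-up values in place of `plot_data_nn`:

```r
library(ggplot2)
# Hypothetical stand-in for plot_data_nn (actual vs predicted sales by week)
toy <- data.frame(week = 1:5,
                  actual = c(30, 32, 28, 35, 31),
                  predicted = c(29, 33, 29, 34, 30))
p <- ggplot(toy, aes(x = week)) +
  geom_line(aes(y = actual, colour = "Actual")) +
  geom_line(aes(y = predicted, colour = "Predicted")) +
  labs(title = "Actual vs Predicted Sales", x = "Week", y = "Sales", colour = NULL)
p
```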
Brand 2
nn2 <- neuralnet(formula = SALESBRAND2 ~ PRICEBRAND2 + display_brand2 + FEATUREBRAND2 + RETAILERMARGINBRAND2 + WHOLESALEPRICE2,
data = train_nn,
hidden = 3,
linear.output=TRUE,
err.fct = 'sse')
Hidden parameter: since this is a simple model, we choose a single hidden layer with 3 neurons (roughly the mean of the number of input features and output variables).
plot(nn2, rep = "best")
#revert the fitted value back to original scale
fitted.train_nn2 <- nn2$net.result[[1]] * (max(BeerData$SALESBRAND2)-min(BeerData$SALESBRAND2))+min(BeerData$SALESBRAND2)
Now we calculate the Root Mean Squared Error (RMSE) on the original scale so that it is comparable to the error generated by our linear regression model.
#use the row index to get the original value of sales in train dataset.
train_nn.r2 <- BeerData$SALESBRAND2[train_ind_nn]
#calculate the Root Mean Squared Error of train dataset
rmse.train_nn2 <- (sum((train_nn.r2 - fitted.train_nn2 )^2)/nrow(fitted.train_nn2))^0.5
rmse.train_nn2
## [1] 4989.762
The RMSE is 4989.762, which is lower than the linear regression's 6697 for Brand 2. Note that both RMSEs are calculated from the training set so we can compare in-sample fit; what we care more about is the prediction performance on the testing data.
Compute Prediction Error and Compare Model Performance Using MLR vs. NN
#fit model using test dataset
Predict.nn2 <- compute(nn2,test_nn)
#get the predicted sales in original scale
Predict.nn_2 <- Predict.nn2$net.result*(max(BeerData$SALESBRAND2)-min(BeerData$SALESBRAND2))+min(BeerData$SALESBRAND2)
#gather test data
test.r_nn2 <- BeerData$SALESBRAND2[-train_ind_nn]
rmse.test_nn2 <- (sum((test.r_nn2 - Predict.nn_2)^2)/length(test.r_nn2))^0.5
rmse.test_nn2
## [1] 6382.776
In this case the RMSE for the test data is 6382.776, which is lower than the linear regression's 6697.
Visualizing Prediction Performance
# Below we print the predicted sales from the testing set, together with actual sales data
BeerData[train_ind_nn,'Predict_NN2'] <- fitted.train_nn2
BeerData[-train_ind_nn,'Predict_NN2'] <- Predict.nn_2
head(BeerData[-train_ind_nn,])
Plot
week <- c(1:46)
## NN method
actual_sales2 <- BeerData[-train_ind_nn,'SALESBRAND2']
predicted_sales_nn2 <-BeerData[-train_ind_nn,'Predict_NN2']
plot_data_nn2 <-data.frame(week, actual_sales2, predicted_sales_nn2)
# The actual sales are in red while the predicted sales are in green
plot(plot_data_nn2$week, plot_data_nn2$actual_sales2, col="red", main = "Actual vs Predicted Sales for Brand 2", ylab = "Sales", xlab = "Week")
lines(plot_data_nn2$week, plot_data_nn2$actual_sales2, col="red")
points(plot_data_nn2$week, plot_data_nn2$predicted_sales_nn2, col="green")
lines(plot_data_nn2$week, plot_data_nn2$predicted_sales_nn2,col="green")
Brand 3
nn3 <- neuralnet(formula = SALESBRAND3 ~ PRICEBRAND3 + display_brand3 + FEATUREBRAND3 + RETAILERMARGINBRAND3 + WHOLESALEPRICE3,
data = train_nn,
hidden = 3,
linear.output=TRUE,
err.fct = 'sse')
Hidden parameter: since this is a simple model, we choose a single hidden layer with 3 neurons (roughly the mean of the number of input features and output variables).
plot(nn3, rep = "best")
#revert the fitted value back to original scale
fitted.train_nn3 <- nn3$net.result[[1]] * (max(BeerData$SALESBRAND3)-min(BeerData$SALESBRAND3))+min(BeerData$SALESBRAND3)
Now we calculate the Root Mean Squared Error (RMSE) on the original scale so that it is comparable to the error generated by our linear regression model.
#use the row index to get the original value of sales in train dataset.
train_nn.r3 <- BeerData$SALESBRAND3[train_ind_nn]
#calculate the Root Mean Squared Error of train dataset
rmse.train_nn3 <- (sum((train_nn.r3 - fitted.train_nn3 )^2)/nrow(fitted.train_nn3))^0.5
rmse.train_nn3
## [1] 1492.631
The RMSE is 1492.631, which is lower than the linear regression's 1671 for Brand 3. Note that both RMSEs are calculated from the training set so we can compare in-sample fit; what we care more about is the prediction performance on the testing data.
Compute Prediction Error and Compare Model Performance Using MLR vs. NN
#fit model using test dataset
Predict.nn3 <- compute(nn3,test_nn)
#get the predicted sales in original scale
Predict.nn_3 <- Predict.nn3$net.result*(max(BeerData$SALESBRAND3)-min(BeerData$SALESBRAND3))+min(BeerData$SALESBRAND3)
#gather test data
test.r_nn3 <- BeerData$SALESBRAND3[-train_ind_nn]
rmse.test_nn3 <- (sum((test.r_nn3 - Predict.nn_3)^2)/length(test.r_nn3))^0.5
rmse.test_nn3
## [1] 1437.692
In this case the RMSE for the test data is 1437.692, which is lower than the linear regression's 1671.
Visualizing Prediction Performance
# Below we print the predicted sales from the testing set, together with actual sales data
BeerData[train_ind_nn,'Predict_NN3'] <- fitted.train_nn3
BeerData[-train_ind_nn,'Predict_NN3'] <- Predict.nn_3
head(BeerData[-train_ind_nn,])
Plot
week <- c(1:46)
## NN method
actual_sales3 <- BeerData[-train_ind_nn,'SALESBRAND3']
predicted_sales_nn3 <-BeerData[-train_ind_nn,'Predict_NN3']
plot_data_nn3 <-data.frame(week, actual_sales3, predicted_sales_nn3)
# The actual sales are in red while the predicted sales are in green
plot(plot_data_nn3$week, plot_data_nn3$actual_sales3, col="red", main = "Actual vs Predicted Sales for Brand 3", ylab = "Sales", xlab = "Week")
lines(plot_data_nn3$week, plot_data_nn3$actual_sales3, col="red")
points(plot_data_nn3$week, plot_data_nn3$predicted_sales_nn3, col="green")
lines(plot_data_nn3$week, plot_data_nn3$predicted_sales_nn3,col="green")
We can see that the neural network (NN) models achieve a lower error (RMSE) than multiple linear regression (MLR) for all three brands. This is likely because a NN can capture complex nonlinear relationships between the independent variables and the dependent variable, whereas MLR captures only linear ones.
Hence, for forecasting purposes a NN could be the more useful tool: we want to predict each brand's sales as precisely as possible so that we can take business actions accordingly (e.g., adjusting inventory) to raise profit or save costs.
On the other hand, since a NN is a "black box", interpreting the model is more complex, so for interpretability purposes MLR models could be more useful. For instance, as we saw before, the MLR showed that for Brand 1 the price promotion has a greater impact on sales than the non-price promotion, and it lets us isolate the individual impact on sales of each of the remaining independent variables.
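That said, the NN "black box" can be partially opened. Assuming the NeuralNetTools package is available (an assumption; it is not used elsewhere in this analysis), Garson's and Olden's algorithms estimate the relative importance of each input. The sketch below fits a tiny neuralnet on synthetic data just to show the idea:

```r
# install.packages("NeuralNetTools")  # assumption: package is available
library(neuralnet)
library(NeuralNetTools)
set.seed(1)
# Synthetic data where y depends more heavily on x1 than on x2
toy <- data.frame(x1 = runif(100), x2 = runif(100))
toy$y <- 0.8 * toy$x1 + 0.2 * toy$x2
fit <- neuralnet(y ~ x1 + x2, data = toy, hidden = 3, linear.output = TRUE)
garson(fit)  # bar plot of relative importance; x1 should dominate
olden(fit)   # signed importance (direction of each input's influence)
```

Applying the same calls to our fitted `nn`, `nn2`, and `nn3` objects would give an NN-side view of the price vs. non-price promotion effects discussed above.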