Preliminaries

The preliminaries below are standard, except that we now load the ISLR2 package, which contains the datasets used in the second edition of Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Rob Tibshirani. This is analogous to loading the wooldridge package to use data from Wooldridge’s book for econometric analysis. All this does is let us use the textbook’s data directly without downloading anything.

# Clear environment, plot pane, and console
rm(list = ls())
graphics.off()
cat("\014")
# If pacman is not already installed, then install it
if (!require(pacman)) install.packages("pacman")

# Load packages
pacman::p_load(ISLR2, data.table, ggplot2)

# Colors used for plot (https://www.rapidtables.com/web/color/RGB_Color.html)
color_1 <- "#00FF00"
color_2 <- "#ADD8E6"

# Set random seed for reproducibility
set.seed(418518)

# Set scipen option to a high value to avoid scientific notation
options(scipen = 999)

If we instead wanted to download the book’s data from the web and use it, we could do so by following these instructions:

  1. Go to the ISL book’s website and click “Data Sets”.
  2. Download the dataset you want and save it to where you plan on setting your working directory.
  3. Open RStudio.
  4. Press Control + Shift + H on Windows (Command + Shift + H on macOS) to begin the process of setting your working directory.
  5. Navigate to where the data is stored on your computer. This will set the working directory.
  6. Use data.table’s fread() function to read in the data. For instance, if your data is stored on your computer as a .csv file named “data”, the code will be dt <- fread("data.csv") to read in the “data.csv” file and save it as a data.table named dt.

Following this process, you will be able to load a dataset from your computer. You could also use R’s built-in read.csv() function, but it struggles with large datasets, which is why I always use fread() (fast read). Remember, after any sort of data preprocessing, you may want to save the processed data to your computer so you don’t have to preprocess it again when you return to work. This is especially relevant with big data, where the code to clean your data can take hours to run. If your data is defined in R as dt, you can use fwrite(dt, "data_new.csv") to write dt to a .csv file named “data_new.csv” in your current working directory.
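As a minimal sketch of that read/write workflow (the file names “data.csv” and “data_new.csv” are hypothetical placeholders for whatever files sit in your working directory):

# Ensure data.table is loaded (already done in the preliminaries above)
pacman::p_load(data.table)

# Fast read of the raw file into a data.table (hypothetical file name)
dt <- fread("data.csv")

# ... any data preprocessing steps go here ...

# Write the cleaned data back out to the current working directory (hypothetical file name)
fwrite(dt, "data_new.csv")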

The above is an aside, but in the vast majority of cases your data will need to be loaded from a file rather than from a package, so you should definitely know how to read data into R and write data out to your computer.

Also, from this point forward we will be using a lot of new R functions that we haven’t touched yet. Remember, if any code doesn’t make sense, you can always ask me via Slack or ask ChatGPT.

Data Analysis with an ISLR2 Dataset

We’ll use the Auto dataset contained in the ISLR2 package for the remainder of this code. Let’s load in the data and do some data analysis.

# Load in Auto data set from the ISLR2 package as a data table
dt <- data.table(ISLR2::Auto)

# Show the first few rows of the data
head(dt)
# Show the last few rows of the data
tail(dt)
# Get the names of the variables in our dataset
names(dt)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"       "name"
# Get summary statistics
summary(dt)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365
# Get the number of observations
paste0("There are ", nrow(dt), " observations in dt.")
## [1] "There are 392 observations in dt."
# Get the number of variables in the dataset 
paste0("There are ", ncol(dt), " variables in dt.")
## [1] "There are 9 variables in dt."
# Get the number of NAs in the dataset 
paste0("There are ", sum(is.na(dt)), " NAs in dt.")
## [1] "There are 0 NAs in dt."

Validation Set Approach

Let’s first explore the validation set approach before getting into cross-validation and bootstrapping. In the following code we will:

  1. Define the training, validation, and test set sizes to be 70%, 15%, and 15%, respectively.
  2. Shuffle the indices of our original data table. This matters in machine learning because the data could have been entered in such a way that observations near the top of the table are more similar to those near the bottom. We don’t want our machine learning algorithm to pick up on where an observation happens to be stored in the table, so we shuffle the rows: the first observation could end up as the 300th, the 300th as the 51st, and so on.
  3. Create the training, validation, and testing sets.

The code below is good practice to follow whenever you first do your train-validation-test split.

# Training, validation, and test set size
train_size <- floor(0.7 * nrow(dt))
val_test_size <- floor(0.15 * nrow(dt))

# Shuffle the data
shuffled_indices <- sample(nrow(dt))

# Split the indices
train_indices <- shuffled_indices[1:train_size]
val_indices <- shuffled_indices[(train_size + 1):(train_size + val_test_size)]
test_indices <- shuffled_indices[(train_size + val_test_size + 1):(train_size + 2 * val_test_size)]

# Create the datasets
dt_train <- dt[train_indices, ]
dt_val <- dt[val_indices, ]
dt_test <- dt[test_indices, ]

# Show first few rows of training and validation sets
head(dt_train)
head(dt_val)

Example 1

Now, we will train a linear regression model using our training data. The model to be estimated is \(mpg_i = \beta_0 + \beta_1 horsepower_i + \epsilon_i\). You’ll see we use R’s predict(reg, dt_val) function. This function predicts the outcome for the observations in dt_val (the validation data) using the parameters estimated by the regression reg fit on the training data. By applying the parameters estimated on the training data to the validation data, we get an estimate of the model’s out-of-sample performance, which is at the crux of machine learning.

# Linear regression
reg <- lm(mpg ~ horsepower, data = dt_train)

# Residuals squared by evaluating the trained model using the validation data
val_resid_sq <- (dt_val[, mpg] - predict(reg, dt_val))^2

# Empirical MSE
val_mse <- mean(val_resid_sq)

# Empirical RMSE
val_rmse <- sqrt(val_mse)

# Show RMSE
val_rmse
## [1] 4.502003

Thus, the empirical root mean squared error (RMSE) of our model is 4.50.

Now, let’s run a polynomial regression of degree two to see if we get a better model fit. This amounts to estimating \(mpg_i = \beta_0 + \beta_1 horsepower_i + \beta_2 horsepower_i^2 + \epsilon_i\).

# Quadratic regression
reg <- lm(mpg ~ poly(horsepower, 2), data = dt_train)

# Residuals squared by evaluating the trained model using the validation data
val_resid_sq <- (dt_val[, mpg] - predict(reg, dt_val))^2

# Empirical MSE
val_mse <- mean(val_resid_sq)

# Empirical RMSE
val_rmse <- sqrt(val_mse)

# Show RMSE
val_rmse
## [1] 3.650075

Thus, the empirical root mean squared error (RMSE) of our degree-two polynomial model is 3.65, a solid improvement over the prior model. This suggests a non-linear relationship between MPG and horsepower.

Now, let’s run a polynomial regression of degree three to see if we get a better model fit. This amounts to estimating \(mpg_i = \beta_0 + \beta_1 horsepower_i + \beta_2 horsepower_i^2 + \beta_3 horsepower_i^3 + \epsilon_i\).

# Cubic regression
reg <- lm(mpg ~ poly(horsepower, 3), data = dt_train)

# Residuals squared by evaluating the trained model using the validation data
val_resid_sq <- (dt_val[, mpg] - predict(reg, dt_val))^2

# Empirical MSE
val_mse <- mean(val_resid_sq)

# Empirical RMSE
val_rmse <- sqrt(val_mse)

# Show RMSE
val_rmse
## [1] 3.656845

Thus, the empirical root mean squared error (RMSE) of our degree-three polynomial model is roughly 3.66, essentially the same as the quadratic model.

Example 2

Now, suppose we use different training and validation sets. Let’s estimate the same models as before.

# Training, validation, and test set size
train_size <- floor(0.7 * nrow(dt))
val_test_size <- floor(0.15 * nrow(dt))

# Shuffle the data
shuffled_indices <- sample(nrow(dt))

# Split the indices
train_indices <- shuffled_indices[1:train_size]
val_indices <- shuffled_indices[(train_size + 1):(train_size + val_test_size)]
test_indices <- shuffled_indices[(train_size + val_test_size + 1):(train_size + 2 * val_test_size)]

# Create the datasets
dt_train <- dt[train_indices, ]
dt_val <- dt[val_indices, ]
dt_test <- dt[test_indices, ]
# Linear regression
reg <- lm(mpg ~ horsepower, data = dt_train)

# Residuals squared by evaluating the trained model using the validation data
val_resid_sq <- (dt_val[, mpg] - predict(reg, dt_val))^2

# Empirical MSE
val_mse <- mean(val_resid_sq)

# Empirical RMSE
val_rmse <- sqrt(val_mse)

# Show RMSE
paste0("RMSE estimate for linear model: ", val_rmse)
## [1] "RMSE estimate for linear model: 5.12737343036503"
# Quadratic regression
reg <- lm(mpg ~ poly(horsepower, 2), data = dt_train)

# Residuals squared by evaluating the trained model using the validation data
val_resid_sq <- (dt_val[, mpg] - predict(reg, dt_val))^2

# Empirical MSE
val_mse <- mean(val_resid_sq)

# Empirical RMSE
val_rmse <- sqrt(val_mse)

# Show RMSE
paste0("RMSE estimate for quadratic model: ", val_rmse)
## [1] "RMSE estimate for quadratic model: 3.9962608243897"
# Cubic regression
reg <- lm(mpg ~ poly(horsepower, 3), data = dt_train)

# Residuals squared by evaluating the trained model using the validation data
val_resid_sq <- (dt_val[, mpg] - predict(reg, dt_val))^2

# Empirical MSE
val_mse <- mean(val_resid_sq)

# Empirical RMSE
val_rmse <- sqrt(val_mse)

# Show RMSE
paste0("RMSE estimate for cubic model: ", val_rmse)
## [1] "RMSE estimate for cubic model: 4.00482530738087"

For the linear, quadratic, and cubic regressions alike, the test error estimates are noticeably different from before (here, all somewhat higher). This is simply because we used a different random split of the data for training and validation. This illustrates the high variability of test error estimates under the validation set approach.
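One way to see this variability directly (an illustrative sketch, not part of the original examples) is to repeat the random split many times and look at the spread of the validation RMSE for, say, the simple linear model:

# Repeat the 70/15 split many times and record the validation RMSE each time
n_splits <- 20
rmse_estimates <- rep(NA_real_, n_splits)

# Loop over random splits
for (s in 1:n_splits) 
{
  # New random shuffle and split of the row indices
  shuffled <- sample(nrow(dt))
  train_idx <- shuffled[1:train_size]
  val_idx <- shuffled[(train_size + 1):(train_size + val_test_size)]
  
  # Fit on the training rows and evaluate on the validation rows
  reg_s <- lm(mpg ~ horsepower, data = dt[train_idx, ])
  rmse_estimates[s] <- sqrt(mean((dt[val_idx, mpg] - predict(reg_s, dt[val_idx, ]))^2))
}

# Spread of the RMSE estimates across the random splits
summary(rmse_estimates)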

Example 3

Now, suppose we again use different training and validation sets. This time we’ll use 75% of the original data for training (and 12.5% each for validation and testing) and estimate the same models as before.

# Training, validation, and test set size
train_size <- floor(0.75 * nrow(dt))
val_test_size <- floor(0.125 * nrow(dt))

# Shuffle the data
shuffled_indices <- sample(nrow(dt))

# Split the indices
train_indices <- shuffled_indices[1:train_size]
val_indices <- shuffled_indices[(train_size + 1):(train_size + val_test_size)]
test_indices <- shuffled_indices[(train_size + val_test_size + 1):(train_size + 2 * val_test_size)]

# Create the datasets
dt_train <- dt[train_indices, ]
dt_val <- dt[val_indices, ]
dt_test <- dt[test_indices, ]
# Linear regression
reg <- lm(mpg ~ horsepower, data = dt_train)

# Residuals squared by evaluating the trained model using the validation data
val_resid_sq <- (dt_val[, mpg] - predict(reg, dt_val))^2

# Empirical MSE
val_mse <- mean(val_resid_sq)

# Empirical RMSE
val_rmse <- sqrt(val_mse)

# Show RMSE
paste0("RMSE estimate for linear model: ", val_rmse)
## [1] "RMSE estimate for linear model: 4.93956951506883"
# Quadratic regression
reg <- lm(mpg ~ poly(horsepower, 2), data = dt_train)

# Residuals squared by evaluating the trained model using the validation data
val_resid_sq <- (dt_val[, mpg] - predict(reg, dt_val))^2

# Empirical MSE
val_mse <- mean(val_resid_sq)

# Empirical RMSE
val_rmse <- sqrt(val_mse)

# Show RMSE
paste0("RMSE estimate for quadratic model: ", val_rmse)
## [1] "RMSE estimate for quadratic model: 4.36793151811555"
# Cubic regression
reg <- lm(mpg ~ poly(horsepower, 3), data = dt_train)

# Residuals squared by evaluating the trained model using the validation data
val_resid_sq <- (dt_val[, mpg] - predict(reg, dt_val))^2

# Empirical MSE
val_mse <- mean(val_resid_sq)

# Empirical RMSE
val_rmse <- sqrt(val_mse)

# Show RMSE
paste0("RMSE estimate for cubic model: ", val_rmse)
## [1] "RMSE estimate for cubic model: 4.45365597968012"

As we can see, the test error estimates for the quadratic and cubic models are higher than in the previous examples, and the cubic model’s estimate is higher still than the quadratic’s. This increase is partly due to the fact that as the polynomial degree increases, the model becomes more prone to overfitting. Overfitting occurs when a model captures noise in the training data rather than the underlying trend, leading to poor generalization on new, unseen data.

Moreover, using more data for training and less for validation also matters: a smaller validation set provides a less reliable estimate of the model’s performance on new data, making the test error estimate more sensitive to exactly which observations happen to land in it.
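To see the overfitting story in the numbers (an illustrative sketch, not part of the original write-up), we can compare the in-sample (training) RMSE with the validation RMSE as the polynomial degree grows, using the split from this example. The training RMSE necessarily falls as the degree increases, while the validation RMSE need not:

# Compare training RMSE and validation RMSE for polynomial degrees one through eight
for (d in 1:8) 
{
  # Fit a degree-d polynomial on the training set
  reg_d <- lm(mpg ~ poly(horsepower, d), data = dt_train)
  
  # In-sample fit versus out-of-sample fit
  train_rmse <- sqrt(mean(residuals(reg_d)^2))
  val_rmse <- sqrt(mean((dt_val[, mpg] - predict(reg_d, dt_val))^2))
  
  # Show both side by side
  print(paste0("Degree ", d, ": training RMSE = ", round(train_rmse, 2),
               ", validation RMSE = ", round(val_rmse, 2)))
}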

Leave-One-Out Cross-Validation (LOOCV)

We can use the cv.glm() function in R’s boot package to perform LOOCV. Recall that for logit and probit regression, we added family = binomial or family = binomial(link = "probit") within the glm() function. However, if we hadn’t added these, glm() would simply give the same result as lm(): a linear regression model.
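A quick sanity check of that claim:

# With no family argument, glm() defaults to the gaussian family and reproduces lm() exactly
coef(lm(mpg ~ horsepower, data = dt))
coef(glm(mpg ~ horsepower, data = dt))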

Now, let’s carry out the LOOCV procedure.

# Load in boot package
pacman::p_load(boot)

# Run linear regression of mpg on horsepower using all data in dt
reg <- glm(mpg ~ horsepower, data = dt)

# Get MSE produced by LOOCV
loocv_error <- cv.glm(dt, reg)

# Show RMSE produced by LOOCV
sqrt(loocv_error$delta[1])
## [1] 4.922552

Thus, the LOOCV estimate of the root mean squared test error is 4.92. Notice this is on the higher end relative to the validation-set estimates in Examples 1, 2, and 3 above. The value produced by cv.glm(dt, reg)$delta[1] is the LOOCV empirical mean squared error defined as \[\begin{align} CV_{n, MSE} = \frac{1}{n} \sum_{i=1}^n \left(y_i - \widehat{f}_{-i}(\boldsymbol{x}_i) \right)^2, \end{align}\] and we take its square root to report the RMSE.
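To make the formula concrete, here is a manual version of the same computation (a sketch of what cv.glm() automates; it refits the model once per observation):

# Manual LOOCV: leave out observation i, fit the model, and record the squared error on i
n <- nrow(dt)
loo_sq_error <- rep(NA_real_, n)

# Loop over every observation
for (i in 1:n) 
{
  # Fit the simple linear model on all observations except i
  reg_i <- lm(mpg ~ horsepower, data = dt[-i, ])
  
  # Squared prediction error on the held-out observation
  loo_sq_error[i] <- (dt[i, mpg] - predict(reg_i, dt[i, ]))^2
}

# LOOCV RMSE; this should match the cv.glm() value above up to numerical precision
sqrt(mean(loo_sq_error))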

Now, let’s obtain the LOOCV estimate of the mean squared testing error for a polynomial model of degrees one through ten. We can do this simply by using a for loop.

# Vector of 10 NA's to store the LOOCV errors
loocv_error <- rep(NA, 10)

# Loop from i = 1, 2, ... where i denotes the polynomial degree
for (i in 1:length(loocv_error)) 
{
  # Fit a polynomial regression model of degree i
  reg <- glm(mpg ~ poly(horsepower, i), data = dt)
  
  # Store the LOOCV for a polynomial regression model of degree i
  loocv_error[i] <- cv.glm(dt, reg)$delta[1]
}

# Show the LOOCV RMSE for each model
sqrt(loocv_error)
##  [1] 4.922552 4.387279 4.397156 4.407316 4.362707 4.356449 4.339706 4.354440
##  [9] 4.366764 4.414854

We see a sharp decrease in empirical RMSE when going from a polynomial regression of degree one (the simple linear regression model) to a quadratic model. However, the RMSE bottoms out around degree seven and begins to rise through degrees eight, nine, and ten. Why is this? Overfitting! The higher the degree of the polynomial, the more likely our model is to overfit the training data and fail to generalize to unseen data. A higher-degree polynomial means a more complex model and, consequently, one more likely to suffer from overfitting. Sure, we may have lower training error, but that’s not nearly as important as the model’s ability to perform well on out-of-sample data! In sum, we need to be wary of how complex we make our models; high model complexity isn’t always a good thing!
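If we want to pick the degree with the smallest LOOCV error programmatically rather than by eye, which.min() does the job (with the output above, that is degree seven):

# Polynomial degree with the smallest LOOCV error estimate
which.min(loocv_error)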

If you run this yourself, you’ll see that it takes a decently long time. This is because each model is refit once for every observation in dt. Since there are ten models and dt has 392 observations, this corresponds to estimating a regression model \(10 \times 392 = 3,920\) times and computing the empirical MSE for each model. That’s quite computationally expensive! k-Fold CV alleviates this problem greatly, as each model is only estimated \(k\) times, where \(k\) is typically five or ten.

k-Fold Cross-Validation (k-Fold CV)

We can use the cv.glm() function in R’s boot package to perform k-Fold CV as in the case of LOOCV, but now simply adding the K = k argument where K is the function parameter and k is whatever number of folds we desire.

Now, let’s carry out the k-Fold CV procedure with k = 5 and k = 10.

# Specify vector of folds
folds <- c(5, 10)

# Specify vector of two NAs to store k-Fold CV test error estimate when k = 5 and k = 10
cv_error <- rep(NA, length(folds))

# Loop over indices of folds vector (so i = 1 and then i = 2)
for (i in seq_along(folds)) 
{
  # Obtain the fold at index i of the folds vector (so k = 5 when i = 1 and then k = 10 when i = 2)
  k <- folds[i]
  
  # Fit a simple linear regression model
  reg <- glm(mpg ~ horsepower, data = dt)
  
  # Store the k-Fold CV estimate for the model
  cv_error[i] <- cv.glm(dt, reg, K = k)$delta[1]
}

# Show the k-Fold CV RMSE for each k
sqrt(cv_error)
## [1] 4.935093 4.952289

Thus, the k-Fold CV estimate of the root mean squared error is 4.94 and 4.95 for \(k = 5\) and \(k = 10\), respectively. Notice that the \(k = 10\) estimate is close to the LOOCV estimate of 4.92. This is because \(CV_k \to CV_n\) as \(k \to n\). The value produced by cv.glm(dt, reg, K = k)$delta[1] is the k-Fold CV empirical mean squared error defined as \[\begin{align} CV_{k, MSE} = \frac{1}{k} \sum_{j=1}^k \left( \frac{1}{n_j} \sum_{i \in D_j} \left(y_i - \widehat{f}_{-j}(\boldsymbol{x}_i) \right)^2 \right), \end{align}\] and again we take its square root to report the RMSE.
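As with LOOCV, it can help to see a manual version of what cv.glm() is doing here (an illustrative sketch; the fold assignment is random, so the result will be close to, but not exactly equal to, the values above):

# Manual 5-fold CV for the simple linear model
k <- 5

# Randomly assign each observation to one of the k folds
fold_id <- sample(rep(1:k, length.out = nrow(dt)))

# Vector to store the validation MSE for each fold
fold_mse <- rep(NA_real_, k)

# Loop over folds
for (j in 1:k) 
{
  # Fit on every fold except fold j
  reg_j <- lm(mpg ~ horsepower, data = dt[fold_id != j, ])
  
  # MSE on the held-out fold j
  fold_mse[j] <- mean((dt[fold_id == j, mpg] - predict(reg_j, dt[fold_id == j, ]))^2)
}

# k-Fold CV RMSE estimate
sqrt(mean(fold_mse))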

Now, let’s obtain the 5-Fold CV estimate of the mean squared testing error for a polynomial model of degrees one through ten. We can do this simply by using a for loop.

# Specify number of folds
k <- 5

# Specify vector of ten NAs to store k-Fold CV test error for polynomial models of degree one through ten
cv_error <- rep(NA, 10)

# Loop from i = 1, 2, ... where i denotes the polynomial degree
for (i in 1:length(cv_error)) 
{
  # Fit a polynomial regression of degree i
  reg <- glm(mpg ~ poly(horsepower, i), data = dt)
  
  # Store the k-Fold CV estimate for the model
  cv_error[i] <- cv.glm(dt, reg, K = k)$delta[1]
}

# Show the k-Fold CV RMSE for each model
sqrt(cv_error)
##  [1] 4.927423 4.370023 4.383468 4.513201 4.350069 4.337293 4.314983 4.335217
##  [9] 4.398627 4.483634

Similar to above, we get a large decrease in RMSE going to a quadratic model, but the RMSE begins to rise again at the higher degrees (past roughly degree seven), adding further evidence to our claim that high polynomial orders lead to overfitting.

Furthermore, if you run this yourself, you’ll find it takes only a few seconds to get these estimates, which is much faster than the LOOCV case. This is another reason to prefer k-Fold CV over LOOCV. You’ll pretty much always use k-Fold CV in practice, and we likely won’t touch LOOCV again.
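If you want to quantify the speed difference on your own machine, system.time() gives a rough comparison (exact timings are machine-dependent; this is just an illustration):

# Fit the quadratic model once, then time LOOCV versus 10-fold CV
reg <- glm(mpg ~ poly(horsepower, 2), data = dt)

# LOOCV: 392 model fits
system.time(cv.glm(dt, reg))

# 10-fold CV: only 10 model fits
system.time(cv.glm(dt, reg, K = 10))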

Bootstrapping

We will now delve into bootstrapping. We will first obtain the bootstrapped estimate of the MSE to assess its variability across samples. The model estimated across each bootstrapped sample will be \(mpg_i = \beta_0 + \beta_1 horsepower_i + \beta_2 horsepower_i^2 + \epsilon_i\).

Then, we will use bootstrapping to assess the variability of the OLS estimate \(\widehat{\beta}_1\) by constructing the bootstrapped standard error and confidence interval, as well as conducting bootstrapped hypothesis tests. Performing statistical inference this way, relative to what we have discussed in the past, is extremely useful when we have a small sample (say, fewer than 50 or 100 observations) so that asymptotic results are less reliable, or when we are estimating a model more complex than OLS for which analytic standard errors are not straightforward.

Bootstrapped MSE

Let’s construct the bootstrapped distribution of the MSE and plot it. This will give a sense of how variable our test error estimate is. In a perfect world, this distribution would be tightly centered close to zero, as this would mean our out-of-sample error estimate is close to zero, on average, with low variance.

# Number of bootstrapped samples
B <- 1000

# Data table of 1000 NA's to store the bootstrapped MSE for each bootstrapped sample
dt_boot_mse <- data.table(MSE = rep(NA_real_, B))

# Training, validation, and test set size
train_size <- floor(0.7 * nrow(dt))
val_test_size <- floor(0.15 * nrow(dt))

# Loop from b = 1, 2, ..., B where b denotes the current bootstrapped sample
for (b in 1:B) 
{
  # Draw bootstrapped sample from dt
  dt_boot <- dt[sample(.N, replace = TRUE)]
  
  # Shuffle the data
  shuffled_indices <- sample(nrow(dt_boot))
  
  # Split the indices
  boot_train_indices <- shuffled_indices[1:train_size]
  boot_val_indices <- shuffled_indices[(train_size + 1):(train_size + val_test_size)]
  
  # Create the datasets
  dt_boot_train <- dt_boot[boot_train_indices, ]
  dt_boot_val <- dt_boot[boot_val_indices, ]

  # Fit a polynomial regression model of degree 2 using the bootstrapped sample
  reg <- lm(mpg ~ poly(horsepower, 2), data = dt_boot_train)

  # Residuals squared by evaluating the trained model using the validation data
  val_resid_sq <- (dt_boot_val$mpg - predict(reg, dt_boot_val))^2
  
  # Empirical MSE
  val_mse <- mean(val_resid_sq)
  
  # Store empirical MSE
  dt_boot_mse[b, MSE := val_mse]
}
# Plot the bootstrapped distribution of the MSE
ggplot(dt_boot_mse, aes(x = MSE)) +
  geom_density(color = color_2, fill = color_2, alpha = 0.7) +
  geom_vline(xintercept = mean(dt_boot_mse[, MSE]), color = color_1, linetype = "dashed", size = 1) +
  labs(title = paste("Bootstrapped Distribution of MSE")) +
  theme(
    panel.background = element_rect(fill = "black", color = "black"),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.grid.major = element_line(color = "gray"),
    panel.grid.minor = element_line(color = "gray"),
    axis.text = element_text(color = color_1, size = 15),
    axis.title = element_text(color = color_2, size = 25),
    plot.title = element_text(hjust = 0.5, color = color_2, size = 25, face = "bold")
  ) +
  xlab("MSE") +
  ylab("Density")

# Bootstrapped SE of the MSE
sd(dt_boot_mse[, MSE])
## [1] 4.628835

We see the bootstrapped distribution of the MSE is fairly normal, with a mean of roughly nineteen and a standard deviation of about \(4.63\). That is both a fair amount of error and a fairly variable estimate of it. This suggests we likely want to alter our model, by adding more covariates, including higher-order polynomial terms, or using something more flexible than linear regression, to improve accuracy and shrink the variance of the out-of-sample error estimate.
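To read off the center of the distribution exactly rather than eyeballing the plot:

# Mean of the bootstrapped MSE distribution (the dashed vertical line in the plot)
mean(dt_boot_mse[, MSE])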

Now, we will use the boot() function contained in R’s boot package. To do so, we have to create our own function fn_mse(), which does much of the same work as above. This function takes two arguments, data and indices (this interface is required by boot(), so don’t worry too much about it), and returns the MSE computed on the validation set.

# Load in boot package
pacman::p_load(boot)

# Number of bootstrapped samples
B <- 1000

# Training, validation, and test set size
train_size <- floor(0.7 * nrow(dt))
val_test_size <- floor(0.15 * nrow(dt))

# Define our own function that is needed for the boot function
fn_mse <- function(data, indices) 
{
  # Draw bootstrapped sample using the indices
  dt_boot <- data[indices, ]
  
  # Shuffle the data
  shuffled_indices <- sample(nrow(dt_boot))
  
  # Split the indices
  boot_train_indices <- shuffled_indices[1:train_size]
  boot_val_indices <- shuffled_indices[(train_size + 1):(train_size + val_test_size)]
  
  # Create the datasets
  dt_boot_train <- dt_boot[boot_train_indices, ]
  dt_boot_val <- dt_boot[boot_val_indices, ]

  # Fit a polynomial regression model of degree 2 using the bootstrapped sample
  reg <- lm(mpg ~ poly(horsepower, 2), data = dt_boot_train)

  # Residuals squared by evaluating the trained model using the validation data
  val_resid_sq <- (dt_boot_val$mpg - predict(reg, dt_boot_val))^2
  
  # Empirical MSE
  val_mse <- mean(val_resid_sq)
  
  # Return the empirical MSE
  return(val_mse)
}

# Show bootstrapped results using the boot function
boot(dt, fn_mse, R = B)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = dt, statistic = fn_mse, R = B)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1* 17.91097 1.24257    4.637709

The “original” value is the validation-set MSE produced by applying fn_mse() to the original (unresampled) data, \(17.91\). The bias indicates that, on average, the bootstrapped MSE is about \(17.91 + 1.24 = 19.15\). The bootstrapped estimate of the standard error is roughly \(4.64\). These results align very well with what we found by doing this process manually above.

Bootstrapped \(\widehat{\boldsymbol{\beta}}\)

Let’s construct the bootstrapped distribution of \(\widehat{\beta}_1\) and plot it. This will give a sense of how variable our parameter estimate is. We will also construct the bootstrapped confidence interval to perform hypothesis tests. There are subtle variations of this approach, such as the percentile-t interval, but to keep things simple we will look at the basic case that simply involves computing quantiles of our bootstrapped distribution. Bootstrapped hypothesis testing is my preferred way of conducting hypothesis tests because it does not rely on asymptotic properties or additional assumptions about the underlying data generating process (DGP); we are actually using the data we gathered, along with an approximation of its distribution, to carry out hypothesis tests, which is a very powerful method. Bootstrapping is used ubiquitously in econometrics.

Let us first estimate the model \(mpg_i = \beta_0 + \beta_1 horsepower_i + \beta_2 horsepower_i^2 + \epsilon_i\) using the original data contained in dt.

# Regression on original data
reg <- lm(mpg ~ horsepower + I(horsepower^2), data = dt)

# Summary of regression
summary(reg)
## 
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2), data = dt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                   Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)     56.9000997  1.8004268   31.60 <0.0000000000000002 ***
## horsepower      -0.4661896  0.0311246  -14.98 <0.0000000000000002 ***
## I(horsepower^2)  0.0012305  0.0001221   10.08 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 0.00000000000000022

Now, we will construct the bootstrapped distribution of \(\widehat{\beta}_1\).

# Number of bootstrapped samples
B <- 1000

# Data table of 1000 NA's to store the bootstrapped beta_1 for each bootstrapped sample
dt_boot_beta <- data.table(beta_1 = rep(NA_real_, B))

# Loop from b = 1, 2, ..., B where b denotes the current bootstrapped sample
for (b in 1:B) 
{
  # Draw bootstrapped sample from dt
  dt_boot <- dt[sample(.N, replace = TRUE)]
  
  # Fit a polynomial regression model of degree 2 using the bootstrapped sample
  reg <- lm(mpg ~ horsepower + I(horsepower^2), data = dt_boot)

  # Extract the coefficient of horsepower (beta_1)
  beta_1 <- coef(reg)["horsepower"]
  
  # Store the bootstrapped beta_1
  dt_boot_beta$beta_1[b] <- beta_1
}
# Plot the bootstrapped distribution of the estimate of beta_1
ggplot(dt_boot_beta, aes(x = beta_1)) +
  geom_density(color = color_2, fill = color_2, alpha = 0.7) +
  geom_vline(xintercept = mean(dt_boot_beta[, beta_1]), color = color_1, linetype = "dashed", size = 1) +
  labs(title = expression(paste("Bootstrapped Distribution of ", hat(beta)[1]))) +
  theme(
    panel.background = element_rect(fill = "black", color = "black"),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.grid.major = element_line(color = "gray"),
    panel.grid.minor = element_line(color = "gray"),
    axis.text = element_text(color = color_1, size = 15),
    axis.title = element_text(color = color_2, size = 25),
    plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
  ) +
  xlab(expression(hat(beta)[1])) +
  ylab("Density")

# Bootstrapped SE of the estimate of beta_1
sd(dt_boot_beta[, beta_1])
## [1] 0.03357908

We see the bootstrapped standard error (about 0.034) is very close to the conventional OLS standard error (about 0.031) from the regression on the original dataset above.
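To line the two up explicitly (note that reg currently holds the fit from the last bootstrapped sample, so we refit on the original data first):

# Refit the quadratic model on the original data
reg_orig <- lm(mpg ~ horsepower + I(horsepower^2), data = dt)

# Conventional OLS standard error of the estimate of beta_1
summary(reg_orig)$coefficients["horsepower", "Std. Error"]

# Bootstrapped standard error of the estimate of beta_1
sd(dt_boot_beta[, beta_1])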

Now, suppose we want to test \(H_0: \beta_1 = 0\) versus \(H_A: \beta_1 \neq 0\) at the 95% confidence (5% significance) level. We can do so by forming the bootstrapped confidence interval, which simply takes the \(\alpha / 2\) and \(1 - \alpha/2\) quantiles of the bootstrapped distribution as its lower and upper endpoints, respectively. Just by looking at the bootstrapped distribution above, we can see we will easily reject this null hypothesis, since the entire distribution lies well below zero, but let us compute these quantiles anyway.

# Significance level
alpha <- 0.05

# Construct a 95% bootstrapped confidence interval
conf_int <- quantile(dt_boot_beta[, beta_1], probs = c(alpha/2, 1 - alpha/2), na.rm = TRUE)

# Show the bootstrapped confidence interval
conf_int
##       2.5%      97.5% 
## -0.5349461 -0.4029471

As noted above, zero lies well outside this interval, so we easily reject the null hypothesis and conclude \(\beta_1 \neq 0\) with 95% confidence.
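As an aside, the boot package can handle both the resampling and the percentile interval for us. Below is a minimal sketch; fn_beta1 is our own helper function, analogous to fn_mse above:

# Statistic function returning the estimate of beta_1 for a given bootstrapped sample
fn_beta1 <- function(data, indices) 
{
  # Fit the quadratic model on the bootstrapped sample
  reg_b <- lm(mpg ~ horsepower + I(horsepower^2), data = data[indices, ])
  
  # Return the coefficient on horsepower
  return(coef(reg_b)["horsepower"])
}

# Run the bootstrap and compute the 95% percentile confidence interval
boot_beta <- boot(dt, fn_beta1, R = 1000)
boot.ci(boot_beta, conf = 0.95, type = "perc")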