The preliminaries below are standard, except that we now load the ISLR2 package, which contains the datasets used in the second edition of An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Rob Tibshirani. This is analogous to loading the wooldridge package to use data from Wooldridge’s book for econometric analysis. All this does is let us use the textbook’s data easily without downloading it ourselves.
# Clear environment, plot pane, and console
rm(list = ls())
graphics.off()
cat("\014")
# If pacman is not already installed, then install it
if (!require(pacman)) install.packages("pacman")
# Load packages
pacman::p_load(ISLR2, data.table, ggplot2)
# Colors used for plot (https://www.rapidtables.com/web/color/RGB_Color.html)
color_1 <- "#00FF00"
color_2 <- "#ADD8E6"
# Set random seed for reproducibility
set.seed(418518)
# Set scipen option to a high value to avoid scientific notation
options(scipen = 999)
If we instead wanted to download the book’s data online and use it, we could do so by following these instructions:
1. Press Control + Shift + H on Windows (Command + Shift + H on macOS) to set your working directory.
2. Use data.table’s fread() function to read in the data. For instance, if your data is stored on your computer as a .csv file named “data”, the code dt <- fread("data.csv") reads in the “data.csv” file and saves it as a data.table named dt.
Following this process, you will be able to load a dataset from your computer. You could also use R’s built-in read.csv() function, but it struggles with large datasets, which is why I always use fread() (fast read).
Remember, after doing any sort of data preprocessing, you may want to save that data to your computer so you don’t have to preprocess it again when you return to work. This is especially relevant with big data, where running the code to clean your data can take hours. If your data is stored in R as dt, you can use fwrite(dt, "data_new.csv") to write dt to a .csv file named “data_new.csv” in your current working directory.
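Putting the two pieces together, here is a minimal sketch of the read/write workflow. The file names "data.csv" and "data_new.csv" and the object name dt_raw are hypothetical choices for illustration.
# Minimal sketch of the fread()/fwrite() workflow
# (file names and object names below are illustrative only)
pacman::p_load(data.table)
# Read the raw file from the working directory into a data.table
dt_raw <- fread("data.csv")
# ... any preprocessing would go here ...
# Write the processed data back to the working directory
fwrite(dt_raw, "data_new.csv")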
The above is an aside, but in the vast majority of cases your data will need to be loaded in rather than bundled with a package, so you should know how to read data into R and write data out to your computer.
Also, from this point forward we will be using a lot of new R functions that we haven’t touched yet. Remember, if any code doesn’t make sense, you can always ask me via Slack or ask ChatGPT.
ISLR2 Dataset
We’ll use the Auto dataset contained in the ISLR2 package for the remainder of this code. Let’s load in the data and do some data analysis.
# Load in Auto data set from the ISLR2 package as a data table
dt <- data.table(ISLR2::Auto)
# Show the first few rows of the data
head(dt)
# Show the last few rows of the data
tail(dt)
# Get the names of the variables in our dataset
names(dt)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin" "name"
# Get summary statistics
summary(dt)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
# Get the number of observations
paste0("There are ", nrow(dt), " observations in dt.")
## [1] "There are 392 observations in dt."
# Get the number of variables in the dataset
paste0("There are ", ncol(dt), " variables in dt.")
## [1] "There are 9 variables in dt."
# Get the number of NAs in the dataset
paste0("There are ", sum(is.na(dt)), " NAs in dt.")
## [1] "There are 0 NAs in dt."
Let’s first explore the validation set approach before getting into cross-validation and bootstrapping. In the following code we will compute the training, validation, and test set sizes, shuffle the data, split the indices, create the three datasets, and take a quick look at the training and validation sets.
The code below is good practice to always follow when first doing your train-validation-test split.
# Training, validation, and test set size
train_size <- floor(0.7 * nrow(dt))
val_test_size <- floor(0.15 * nrow(dt))
# Shuffle the data
shuffled_indices <- sample(nrow(dt))
# Split the indices
train_indices <- shuffled_indices[1:train_size]
val_indices <- shuffled_indices[(train_size + 1):(train_size + val_test_size)]
test_indices <- shuffled_indices[(train_size + val_test_size + 1):(train_size + 2 * val_test_size)]
# Create the datasets
dt_train <- dt[train_indices, ]
dt_val <- dt[val_indices, ]
dt_test <- dt[test_indices, ]
# Show first few rows of training and validation sets
head(dt_train)
head(dt_val)
Now, we will train a linear regression model using our training data. The model to be estimated is \(mpg_i = \beta_0 + \beta_1 horsepower_i + \epsilon_i\). You’ll see we use R’s predict(reg, dt_val) function. This function predicts the outcome for the observations in dt_val (the validation data) using the parameters estimated by running the regression reg on the training data. By applying the parameters estimated on the training data to the validation data, we get an estimate of the model’s out-of-sample performance, which is at the crux of machine learning.
# Linear regression
reg <- lm(mpg ~ horsepower, data = dt_train)
# Residuals squared by evaluating the trained model using the validation data
val_resid_sq <- (dt_val[, mpg] - predict(reg, dt_val))^2
# Empirical MSE
val_mse <- mean(val_resid_sq)
# Empirical RMSE
val_rmse <- sqrt(val_mse)
# Show RMSE
val_rmse
## [1] 4.502003
Thus, the empirical root mean squared error (RMSE) of our model is 4.50.
Now, lets run a polynomial regression of degree two to see if we get a better model fit. This amounts to estimating \(mpg_i = \beta_0 + \beta_1 horsepower_i + \beta_2 horsepower_i^2 + \epsilon_i\).
# Quadratic regression
reg <- lm(mpg ~ poly(horsepower, 2), data = dt_train)
# Residuals squared by evaluating the trained model using the validation data
val_resid_sq <- (dt_val[, mpg] - predict(reg, dt_val))^2
# Empirical MSE
val_mse <- mean(val_resid_sq)
# Empirical RMSE
val_rmse <- sqrt(val_mse)
# Show RMSE
val_rmse
## [1] 3.650075
Thus, the empirical root mean squared error (RMSE) of our degree-two polynomial model is 3.65, a solid improvement over the prior model. This suggests a non-linear relationship between MPG and horsepower for vehicles.
Now, lets run a polynomial regression of degree three to see if we get a better model fit. This amounts to estimating \(mpg_i = \beta_0 + \beta_1 horsepower_i + \beta_2 horsepower_i^2 + \beta_3 horsepower_i^3 + \epsilon_i\).
# Cubic regression
reg <- lm(mpg ~ poly(horsepower, 3), data = dt_train)
# Residuals squared by evaluating the trained model using the validation data
val_resid_sq <- (dt_val[, mpg] - predict(reg, dt_val))^2
# Empirical MSE
val_mse <- mean(val_resid_sq)
# Empirical RMSE
val_rmse <- sqrt(val_mse)
# Show RMSE
val_rmse
## [1] 3.656845
Thus, the empirical root mean squared error (RMSE) of our degree-three polynomial model is 3.66, essentially the same as the quadratic model.
Now, suppose we used different training and validation sets. Let’s estimate the same models as before.
# Training, validation, and test set size
train_size <- floor(0.7 * nrow(dt))
val_test_size <- floor(0.15 * nrow(dt))
# Shuffle the data
shuffled_indices <- sample(nrow(dt))
# Split the indices
train_indices <- shuffled_indices[1:train_size]
val_indices <- shuffled_indices[(train_size + 1):(train_size + val_test_size)]
test_indices <- shuffled_indices[(train_size + val_test_size + 1):(train_size + 2 * val_test_size)]
# Create the datasets
dt_train <- dt[train_indices, ]
dt_val <- dt[val_indices, ]
dt_test <- dt[test_indices, ]
# Linear regression
reg <- lm(mpg ~ horsepower, data = dt_train)
# Residuals squared by evaluating the trained model using the validation data
val_resid_sq <- (dt_val[, mpg] - predict(reg, dt_val))^2
# Empirical MSE
val_mse <- mean(val_resid_sq)
# Empirical RMSE
val_rmse <- sqrt(val_mse)
# Show RMSE
paste0("RMSE estimate for linear model: ", val_rmse)
## [1] "RMSE estimate for linear model: 5.12737343036503"
# Quadratic regression
reg <- lm(mpg ~ poly(horsepower, 2), data = dt_train)
# Residuals squared by evaluating the trained model using the validation data
val_resid_sq <- (dt_val[, mpg] - predict(reg, dt_val))^2
# Empirical MSE
val_mse <- mean(val_resid_sq)
# Empirical RMSE
val_rmse <- sqrt(val_mse)
# Show RMSE
paste0("RMSE estimate for quadratic model: ", val_rmse)
## [1] "RMSE estimate for quadratic model: 3.9962608243897"
# Cubic regression
reg <- lm(mpg ~ poly(horsepower, 3), data = dt_train)
# Residuals squared by evaluating the trained model using the validation data
val_resid_sq <- (dt_val[, mpg] - predict(reg, dt_val))^2
# Empirical MSE
val_mse <- mean(val_resid_sq)
# Empirical RMSE
val_rmse <- sqrt(val_mse)
# Show RMSE
paste0("RMSE estimate for cubic model: ", val_rmse)
## [1] "RMSE estimate for cubic model: 4.00482530738087"
For both the quadratic and cubic regressions, the estimate of the test error is noticeably higher now than before. This is simply due to using a different random split for training and validation. This illustrates the high variability of test error estimates when using the validation set approach.
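To make this variability concrete, here is a small sketch (not part of the original analysis) that repeats the random 70/15 split many times and looks at the spread of the linear model’s validation RMSE. The names n_reps, rmse_reps, train_size_r, and val_size_r are illustrative, and 50 repetitions is an arbitrary choice.
# Sketch: repeat the validation-set approach many times to see how much the
# RMSE estimate for the linear model moves around across random splits
n_reps <- 50
rmse_reps <- rep(NA_real_, n_reps)
train_size_r <- floor(0.7 * nrow(dt))
val_size_r <- floor(0.15 * nrow(dt))
for (r in 1:n_reps)
{
  # New random split each repetition
  shuffled <- sample(nrow(dt))
  train_idx <- shuffled[1:train_size_r]
  val_idx <- shuffled[(train_size_r + 1):(train_size_r + val_size_r)]
  # Fit on the training rows, evaluate on the validation rows
  reg_r <- lm(mpg ~ horsepower, data = dt[train_idx, ])
  rmse_reps[r] <- sqrt(mean((dt[val_idx, mpg] - predict(reg_r, dt[val_idx, ]))^2))
}
# Range and standard deviation of the RMSE estimates across splits
range(rmse_reps)
sd(rmse_reps)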
Now, suppose we used yet another split. This time we’ll use 75% of the original data for training and estimate the same models as before.
# Training, validation, and test set size
train_size <- floor(0.75 * nrow(dt))
val_test_size <- floor(0.125 * nrow(dt))
# Shuffle the data
shuffled_indices <- sample(nrow(dt))
# Split the indices
train_indices <- shuffled_indices[1:train_size]
val_indices <- shuffled_indices[(train_size + 1):(train_size + val_test_size)]
test_indices <- shuffled_indices[(train_size + val_test_size + 1):(train_size + 2 * val_test_size)]
# Create the datasets
dt_train <- dt[train_indices, ]
dt_val <- dt[val_indices, ]
dt_test <- dt[test_indices, ]
# Linear regression
reg <- lm(mpg ~ horsepower, data = dt_train)
# Residuals squared by evaluating the trained model using the validation data
val_resid_sq <- (dt_val[, mpg] - predict(reg, dt_val))^2
# Empirical MSE
val_mse <- mean(val_resid_sq)
# Empirical RMSE
val_rmse <- sqrt(val_mse)
# Show RMSE
paste0("RMSE estimate for linear model: ", val_rmse)
## [1] "RMSE estimate for linear model: 4.93956951506883"
# Quadratic regression
reg <- lm(mpg ~ poly(horsepower, 2), data = dt_train)
# Residuals squared by evaluating the trained model using the validation data
val_resid_sq <- (dt_val[, mpg] - predict(reg, dt_val))^2
# Empirical MSE
val_mse <- mean(val_resid_sq)
# Empirical RMSE
val_rmse <- sqrt(val_mse)
# Show RMSE
paste0("RMSE estimate for quadratic model: ", val_rmse)
## [1] "RMSE estimate for quadratic model: 4.36793151811555"
# Cubic regression
reg <- lm(mpg ~ poly(horsepower, 3), data = dt_train)
# Residuals squared by evaluating the trained model using the validation data
val_resid_sq <- (dt_val[, mpg] - predict(reg, dt_val))^2
# Empirical MSE
val_mse <- mean(val_resid_sq)
# Empirical RMSE
val_rmse <- sqrt(val_mse)
# Show RMSE
paste0("RMSE estimate for cubic model: ", val_rmse)
## [1] "RMSE estimate for cubic model: 4.45365597968012"
As we can see, the test error estimates for the quadratic and cubic models are higher than in the previous split. In particular, the cubic model now has an even higher test error estimate than the quadratic model. This increase is largely due to the fact that as the polynomial degree increases, the model becomes more prone to overfitting. Overfitting occurs when a model captures noise in the training data rather than the underlying trend, leading to poor generalization on new, unseen data.
Moreover, using more data for training and less data for validation can also contribute to a higher test error estimate, especially when the smaller validation set happens to differ from the training data. A smaller validation set provides a less reliable estimate of the model’s performance on new data, making the test error estimate more sensitive to which observations land in it.
We can use the cv.glm() function in R’s boot package to perform LOOCV. Recall that for logit and probit regression, we added family = binomial or family = binomial(link = "probit") within the glm() function. However, if we hadn’t added these, glm() would simply give the same result as lm(), a linear regression model.
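As a quick sanity check (a sketch, not part of the original walkthrough), you can verify that glm() with its default gaussian family reproduces the lm() coefficients exactly.
# Sketch: glm() with the default family matches lm() for linear regression
coef(lm(mpg ~ horsepower, data = dt))
coef(glm(mpg ~ horsepower, data = dt))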
Now, let’s carry out the LOOCV procedure.
# Load in boot package
pacman::p_load(boot)
# Run linear regression of mpg on horsepower using all data in dt
reg <- glm(mpg ~ horsepower, data = dt)
# Get MSE produced by LOOCV
loocv_error <- cv.glm(dt, reg)
# Show RMSE produced by LOOCV
sqrt(loocv_error$delta[1])
## [1] 4.922552
Thus, the LOOCV estimate of the root mean squared test error is 4.92.
Notice this is on the higher end relative to Examples 1, 2, and 3 above.
The value produced by cv.glm(dt, reg)
is the LOOCV
empirical mean squared error defined as \[\begin{align}
CV_{n, MSE} = \frac{1}{n} \sum_{i=1}^n \left(y_i -
\widehat{f}_{-i}(\boldsymbol{x}_i) \right)^2.
\end{align}\]
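To connect the formula to code, here is a hand-rolled sketch of the same LOOCV calculation for the simple linear model. It should reproduce the cv.glm() result above (up to rounding); the names reg_i and loocv_resid_sq are illustrative.
# Sketch: LOOCV by hand for the simple linear model, following the formula above
n <- nrow(dt)
loocv_resid_sq <- rep(NA_real_, n)
for (i in 1:n)
{
  # Fit the model leaving out observation i
  reg_i <- lm(mpg ~ horsepower, data = dt[-i, ])
  # Squared prediction error for the held-out observation i
  loocv_resid_sq[i] <- (dt[i, mpg] - predict(reg_i, dt[i, ]))^2
}
# LOOCV RMSE (should match sqrt(cv.glm(dt, reg)$delta[1]) above)
sqrt(mean(loocv_resid_sq))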
Now, let’s obtain the LOOCV estimate of the mean squared testing error for a polynomial model of degrees one through ten. We can do this simply by using a for loop.
# Vector of 10 NA's to store the LOOCV errors
loocv_error <- rep(NA, 10)
# Loop from i = 1, 2, ... where i denotes the polynomial degree
for (i in 1:length(loocv_error))
{
# Fit a polynomial regression model of degree i
reg <- glm(mpg ~ poly(horsepower, i), data = dt)
# Store the LOOCV for a polynomial regression model of degree i
loocv_error[i] <- cv.glm(dt, reg)$delta[1]
}
# Show the LOOCV RMSE for each model
sqrt(loocv_error)
## [1] 4.922552 4.387279 4.397156 4.407316 4.362707 4.356449 4.339706 4.354440
## [9] 4.366764 4.414854
We see a sharp decrease in empirical RMSE when going from a polynomial of degree one (simple linear regression) to a quadratic model. However, the RMSE bottoms out around degree seven and begins to rise for degrees eight through ten. Why is this? Overfitting! The higher the degree of the polynomial, the more likely our model is to overfit the training data and fail to generalize to unseen data. A higher-degree polynomial means a more complex model and, consequently, one more likely to suffer from overfitting. Sure, we may get lower training error, but that’s not nearly as important as the model’s ability to perform well on out-of-sample data! In sum, we need to be wary of how complex we make our models; high model complexity isn’t always a good thing!
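A quick plot makes the degree-versus-error pattern easier to see. This is a sketch that reuses the document’s plotting theme; the table name dt_loocv is an illustrative choice.
# Sketch: plot the LOOCV RMSE against polynomial degree
dt_loocv <- data.table(degree = 1:10, rmse = sqrt(loocv_error))
ggplot(dt_loocv, aes(x = degree, y = rmse)) +
  geom_line(color = color_2, size = 1) +
  geom_point(color = color_1, size = 3) +
  labs(title = "LOOCV RMSE by Polynomial Degree") +
  theme(
    panel.background = element_rect(fill = "black", color = "black"),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.grid.major = element_line(color = "gray"),
    panel.grid.minor = element_line(color = "gray"),
    axis.text = element_text(color = color_1, size = 15),
    axis.title = element_text(color = color_2, size = 25),
    plot.title = element_text(hjust = 0.5, color = color_2, size = 25, face = "bold")
  ) +
  xlab("Polynomial Degree") +
  ylab("LOOCV RMSE")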
If you run this yourself, you’ll see that it takes a decently long time to run. This is because each model is estimated once for every observation in dt. Since there are ten models and dt has 392 observations, this corresponds to estimating \(10 \times 392 = 3,920\) regression models and computing the empirical MSE for each one. That’s quite computationally expensive! k-Fold CV alleviates this problem greatly, as each model is only estimated \(k\) times, where \(k\) is typically five or ten.
We can use the cv.glm() function in R’s boot package to perform k-Fold CV just as in the LOOCV case, but now adding the K = k argument, where K is the function parameter and k is whatever number of folds we desire.
Now, let’s carry out the k-Fold CV procedure with k = 5 and k = 10.
# Specify vector of folds
folds <- c(5, 10)
# Specify vector of two NAs to store k-Fold CV test error estimate when k = 5 and k = 10
cv_error <- rep(NA, length(folds))
# Loop over indices of folds vector (so i = 1 and then i = 2)
for (i in seq_along(folds))
{
# Obtain the fold at index i of the folds vector (so k = 5 when i = 1 and then k = 10 when i = 2)
k <- folds[i]
# Fit a simple linear regression model
reg <- glm(mpg ~ horsepower, data = dt)
# Store the k-Fold CV estimate for the model
cv_error[i] <- cv.glm(dt, reg, K = k)$delta[1]
}
# Show the k-Fold CV RMSE for k = 5 and k = 10
sqrt(cv_error)
## [1] 4.935093 4.952289
Thus, the k-Fold CV estimate of the root mean squared error is 4.94 and 4.95 for \(k = 5\) and \(k = 10\), respectively. Notice that the \(k = 10\) estimate is close to the LOOCV estimate. This is because \(CV_k \to CV_n\) as \(k \to n\). The value produced by cv.glm(dt, reg, K = k) is the k-Fold CV empirical mean squared error defined as \[\begin{align}
CV_{k, MSE} = \frac{1}{k} \sum_{j=1}^k \left( \frac{1}{n_j} \sum_{i
\in D_j} \left(y_i - \widehat{f}_{-j}(\boldsymbol{x}_i) \right)^2
\right).
\end{align}\]
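To tie this formula to code, here is a hand-rolled sketch of 5-fold CV for the simple linear model. It won’t match cv.glm() exactly because the random fold assignment differs, but it should be close; the names fold_id, fold_mse, and reg_j are illustrative.
# Sketch: 5-fold CV by hand for the simple linear model, following the formula above
k <- 5
# Randomly assign each observation to one of k folds
fold_id <- sample(rep(1:k, length.out = nrow(dt)))
fold_mse <- rep(NA_real_, k)
for (j in 1:k)
{
  # Fit the model on all folds except fold j
  reg_j <- lm(mpg ~ horsepower, data = dt[fold_id != j, ])
  # Empirical MSE on the held-out fold j
  fold_mse[j] <- mean((dt[fold_id == j, mpg] - predict(reg_j, dt[fold_id == j, ]))^2)
}
# k-Fold CV RMSE (average the per-fold MSEs, then take the square root)
sqrt(mean(fold_mse))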
Now, let’s obtain the 5-Fold CV estimate of the mean squared testing error for a polynomial model of degrees one through ten. We can do this simply by using a for loop.
# Specify number of folds
k <- 5
# Specify vector of ten NAs to store k-Fold CV test error for polynomial models of degree one through ten
cv_error <- rep(NA, 10)
# Loop from i = 1, 2, ... where i denotes the polynomial degree
for (i in 1:length(cv_error))
{
# Fit a polynomial regression of degree i
reg <- glm(mpg ~ poly(horsepower, i), data = dt)
# Store the k-Fold CV estimate for the model
cv_error[i] <- cv.glm(dt, reg, K = k)$delta[1]
}
# Show the k-Fold CV RMSE for each model
sqrt(cv_error)
## [1] 4.927423 4.370023 4.383468 4.513201 4.350069 4.337293 4.314983 4.335217
## [9] 4.398627 4.483634
Similar to above, we get a large decrease in RMSE moving to a quadratic model, but the RMSE begins to increase again beyond a polynomial of degree seven, adding further evidence to our claim that high polynomial orders lead to overfitting.
Furthermore, if you run this yourself, you’ll find that it takes only a few seconds to get these estimates, which is much faster than the LOOCV case. This is another reason to use k-Fold CV over LOOCV. You’ll pretty much always use k-Fold CV in practice, and we likely won’t touch LOOCV again.
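If you want to see the speed difference yourself, a rough timing sketch is below; exact times will vary by machine, and reg_lin is an illustrative name.
# Sketch: rough timing comparison of LOOCV versus 5-fold CV for one model
reg_lin <- glm(mpg ~ horsepower, data = dt)
system.time(cv.glm(dt, reg_lin))          # LOOCV: fits the model 392 times
system.time(cv.glm(dt, reg_lin, K = 5))   # 5-fold CV: fits the model 5 times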
We will now delve into bootstrapping. We will first obtain the bootstrapped estimate of the MSE to assess its variability across samples. The model estimated across each bootstrapped sample will be \(mpg_i = \beta_0 + \beta_1 horsepower_i + \beta_2 horsepower_i^2 + \epsilon_i\).
Then, we will use bootstrapping to assess the variability of the OLS estimate \(\widehat{\beta}_1\) by constructing the bootstrapped standard error and confidence intervals, as well as conducting bootstrapped hypothesis tests. Performing statistical inference this way, relative to the approaches we have discussed in the past, is extremely useful when we have a small sample (say, fewer than 50 or 100 observations), so that asymptotic results are less reliable, or when we are estimating a model more complex than OLS for which analytic standard errors are not straightforward.
Let’s construct the bootstrapped distribution of the MSE and plot it. This will give a sense of how variable our test error estimate is. In a perfect world, this distribution would be tightly centered close to zero, as this would mean our out-of-sample error estimate is close to zero on average with low variance.
# Number of bootstrapped samples
B <- 1000
# Data table of 1000 NA's to store the bootstrapped MSE for each bootstrapped sample
dt_boot_mse <- data.table(MSE = rep(NA_real_, B))
# Training, validation, and test set size
train_size <- floor(0.7 * nrow(dt))
val_test_size <- floor(0.15 * nrow(dt))
# Loop from b = 1, 2, ..., B where b denotes the current bootstrapped sample
for (b in 1:B)
{
# Draw bootstrapped sample from dt
dt_boot <- dt[sample(.N, replace = TRUE)]
# Shuffle the data
shuffled_indices <- sample(nrow(dt_boot))
# Split the indices
boot_train_indices <- shuffled_indices[1:train_size]
boot_val_indices <- shuffled_indices[(train_size + 1):(train_size + val_test_size)]
# Create the datasets
dt_boot_train <- dt_boot[boot_train_indices, ]
dt_boot_val <- dt_boot[boot_val_indices, ]
# Fit a polynomial regression model of degree 2 using the bootstrapped sample
reg <- lm(mpg ~ poly(horsepower, 2), data = dt_boot_train)
# Residuals squared by evaluating the trained model using the validation data
val_resid_sq <- (dt_boot_val$mpg - predict(reg, dt_boot_val))^2
# Empirical MSE
val_mse <- mean(val_resid_sq)
# Store empirical MSE
dt_boot_mse[b, MSE := val_mse]
}
# Plot the bootstrapped distribution of the MSE
ggplot(dt_boot_mse, aes(x = MSE)) +
geom_density(color = color_2, fill = color_2, alpha = 0.7) +
geom_vline(xintercept = mean(dt_boot_mse[, MSE]), color = color_1, linetype = "dashed", size = 1) +
labs(title = paste("Bootstrapped Distribution of MSE")) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 25, face = "bold")
) +
xlab("MSE") +
ylab("Density")
# Bootstrapped SE of the MSE
sd(dt_boot_mse[, MSE])
## [1] 4.628835
We see the bootstrapped distribution of the MSE is roughly normal with a mean of about nineteen and a standard deviation of \(4.63\). That is a fair amount of error and a quite variable estimate. This indicates we likely want to alter our model by adding more covariates, including higher-order polynomial terms, or using something more flexible than linear regression to improve model accuracy and shrink the variance of the out-of-sample error estimate.
Now, we will use R’s boot() function contained in the boot package. To do so, we have to create our own function, fn_mse(), which does a lot of the same things as above. This function takes two arguments, data and indices (this signature is required by the boot() function, so don’t worry too much about it), and returns the MSE computed based on the validation set.
# Load in boot package
pacman::p_load(boot)
# Number of bootstrapped samples
B <- 1000
# Training, validation, and test set size
train_size <- floor(0.7 * nrow(dt))
val_test_size <- floor(0.15 * nrow(dt))
# Define our own function that is needed for the boot function
fn_mse <- function(data, indices)
{
# Draw bootstrapped sample using the indices
dt_boot <- data[indices, ]
# Shuffle the data
shuffled_indices <- sample(nrow(dt_boot))
# Split the indices
boot_train_indices <- shuffled_indices[1:train_size]
boot_val_indices <- shuffled_indices[(train_size + 1):(train_size + val_test_size)]
# Create the datasets
dt_boot_train <- dt_boot[boot_train_indices, ]
dt_boot_val <- dt_boot[boot_val_indices, ]
# Fit a polynomial regression model of degree 2 using the bootstrapped sample
reg <- lm(mpg ~ poly(horsepower, 2), data = dt_boot_train)
# Residuals squared by evaluating the trained model using the validation data
val_resid_sq <- (dt_boot_val$mpg - predict(reg, dt_boot_val))^2
# Empirical MSE
val_mse <- mean(val_resid_sq)
# Return the empirical MSE
return(val_mse)
}
# Show bootstrapped results using the boot function
boot(dt, fn_mse, R = B)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = dt, statistic = fn_mse, R = B)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 17.91097 1.24257 4.637709
The above results indicate that the MSE obtained by applying our procedure to the original (unresampled) dataset dt is \(17.91\). The bias indicates that, on average, the bootstrapped MSE is \(17.91 + 1.24 = 19.15\). The bootstrapped estimate of the standard error is \(4.64\). These results align very well with what we found by doing this process manually above.
Let’s construct the bootstrapped distribution of \(\widehat{\beta}_1\) and plot it. This will give a sense of how variable our parameter estimate is. We will also construct the bootstrapped confidence interval to perform hypothesis tests. There are subtle variations of this approach, such as the percentile t-interval, but to keep things simple we will look at the basic case, which simply involves computing quantiles from our bootstrapped distribution. Bootstrapped hypothesis testing is my preferred way of conducting hypothesis tests because it does not rely on asymptotic properties or additional assumptions about the underlying data generating process (DGP); we are actually using the data we gathered, along with an approximation of its distribution, to carry out hypothesis tests, which is a very powerful method. Bootstrapping is used ubiquitously in econometrics.
Let us first estimate the model \(mpg_i =
\beta_0 + \beta_1 horsepower_i + \beta_2 horsepower_i^2 +
\epsilon_i\) using the original data contained in
dt
.
# Regression on original data
reg <- lm(mpg ~ horsepower + I(horsepower^2), data = dt)
# Summary of regression
summary(reg)
##
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2), data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.9000997 1.8004268 31.60 <0.0000000000000002 ***
## horsepower -0.4661896 0.0311246 -14.98 <0.0000000000000002 ***
## I(horsepower^2) 0.0012305 0.0001221 10.08 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 0.00000000000000022
Now, we will construct the bootstrapped distribution of \(\widehat{\beta}_1\).
# Number of bootstrapped samples
B <- 1000
# Data table of 1000 NA's to store the bootstrapped beta_1 for each bootstrapped sample
dt_boot_beta <- data.table(beta_1 = rep(NA_real_, B))
# Loop from b = 1, 2, ..., B where b denotes the current bootstrapped sample
for (b in 1:B)
{
# Draw bootstrapped sample from dt
dt_boot <- dt[sample(.N, replace = TRUE)]
# Fit a polynomial regression model of degree 2 using the bootstrapped sample
reg <- lm(mpg ~ horsepower + I(horsepower^2), data = dt_boot)
# Extract the coefficient of horsepower (beta_1)
beta_1 <- coef(reg)["horsepower"]
# Store the bootstrapped beta_1
dt_boot_beta$beta_1[b] <- beta_1
}
# Plot the bootstrapped distribution of the estimate of beta_1
ggplot(dt_boot_beta, aes(x = beta_1)) +
geom_density(color = color_2, fill = color_2, alpha = 0.7) +
geom_vline(xintercept = mean(dt_boot_beta[, beta_1]), color = color_1, linetype = "dashed", size = 1) +
labs(title = expression(paste("Bootstrapped Distribution of ", hat(beta)[1]))) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
) +
xlab(expression(hat(beta)[1])) +
ylab("Density")
# Bootstrapped SE of the estimate of beta_1
sd(dt_boot_beta[, beta_1])
## [1] 0.03357908
We see the bootstrapped standard error (about 0.034) is very close to the standard error from the regression on the original dataset above (about 0.031).
Now, suppose we want to test the hypothesis of \(H_0: \beta_1 = 0\) versus \(H_A: \beta_1 \neq 0\) at the 95% confidence (5% significance) level. We can do so by obtaining the bootstrapped confidence interval, which simply has the \(\alpha / 2\) and \(1 - \alpha/2\) quantiles as its lower and upper endpoints, respectively. Just by looking at the bootstrapped distribution above, we can see we will easily reject this null hypothesis, as not a single bootstrapped estimate of \(\beta_1\) is anywhere near zero, but let us compute these quantiles anyway.
# Significance level
alpha <- 0.05
# Construct a 95% bootstrapped confidence interval
conf_int <- quantile(dt_boot_beta[, beta_1], probs = c(alpha/2, 1 - alpha/2), na.rm = TRUE)
# Show the bootstrapped confidence interval
conf_int
## 2.5% 97.5%
## -0.5349461 -0.4029471
As noted above, we easily reject this null hypothesis and conclude \(\beta_1 \neq 0\) with 95% confidence.
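As a final sketch, the same inference can be run through the boot package directly: boot() bootstraps the coefficient and boot.ci() with type = "perc" gives a percentile interval analogous to the one we computed by hand. The names fn_beta1 and boot_beta are illustrative choices, not part of the original walkthrough.
# Sketch: bootstrap beta_1 with boot() and get a percentile confidence interval
# fn_beta1 is a hypothetical helper defined here for illustration
fn_beta1 <- function(data, indices)
{
  # Fit the quadratic model on the bootstrapped sample given by indices
  reg_b <- lm(mpg ~ horsepower + I(horsepower^2), data = data[indices, ])
  # Return the coefficient on horsepower (beta_1)
  return(coef(reg_b)["horsepower"])
}
# Run the bootstrap with B replications
boot_beta <- boot(dt, fn_beta1, R = B)
# Percentile 95% confidence interval for beta_1
boot.ci(boot_beta, conf = 0.95, type = "perc")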