Preliminaries

The preliminaries below are standard, except that we now load the glmnet package, which contains the glmnet() function used to estimate ridge, lasso, and elastic net regressions.

# Clear environment, plot pane, and console
rm(list = ls())
graphics.off()
cat("\014")
# If pacman is not already installed, then install it
if (!require(pacman)) install.packages("pacman")

# Load packages
pacman::p_load(ISLR2, glmnet, boot, data.table, ggplot2)

# Colors used for plot (https://www.rapidtables.com/web/color/RGB_Color.html)
color_1 <- "#00FF00"
color_2 <- "#ADD8E6"

# Set random seed for reproducibility
set.seed(418518)

# Set scipen option to a high value to avoid scientific notation
options(scipen = 999)

Data Analysis with an ISLR2 Dataset

We’ll use the Hitters dataset contained in the ISLR2 package. We’ll use this data to predict a baseball player’s salary using various statistics associated with performance in the previous year. Let’s load in the data and do some data analysis.

You’ll see a new function in the below code called sapply(). This function is part of R’s apply family of functions, which is super useful in data analysis. It applies any function to each column of our dataset and returns a vector with one value per column, where that value is the output of the function. For instance, the below code uses sapply(dt, function(x) sum(is.na(x))). Here, x stands in for each column of dt in turn, so sapply() applies our defined function function(x) sum(is.na(x)) to every column of dt and returns a vector with the total number of NAs in each column. Don’t worry too much if this code is confusing; when I first learned the apply functions it was for me as well.
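
If it helps, here is a tiny standalone illustration on a made-up toy data table (hypothetical, not part of our analysis):

# Toy example: count NAs per column of a small data table
toy <- data.table(a = c(1, NA, 3), b = c("x", "y", NA))
sapply(toy, function(x) sum(is.na(x)))
# Returns a named vector: a = 1, b = 1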

# Load in Hitters data set from the ISLR2 package as a data table
dt <- data.table(ISLR2::Hitters)

# Show the first few rows of the data
head(dt)
# Show the last few rows of the data
tail(dt)
# Get the names of the variables in our dataset
names(dt)
##  [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"       "Walks"    
##  [7] "Years"     "CAtBat"    "CHits"     "CHmRun"    "CRuns"     "CRBI"     
## [13] "CWalks"    "League"    "Division"  "PutOuts"   "Assists"   "Errors"   
## [19] "Salary"    "NewLeague"
# Get summary statistics
summary(dt)
##      AtBat            Hits         HmRun            Runs       
##  Min.   : 16.0   Min.   :  1   Min.   : 0.00   Min.   :  0.00  
##  1st Qu.:255.2   1st Qu.: 64   1st Qu.: 4.00   1st Qu.: 30.25  
##  Median :379.5   Median : 96   Median : 8.00   Median : 48.00  
##  Mean   :380.9   Mean   :101   Mean   :10.77   Mean   : 50.91  
##  3rd Qu.:512.0   3rd Qu.:137   3rd Qu.:16.00   3rd Qu.: 69.00  
##  Max.   :687.0   Max.   :238   Max.   :40.00   Max.   :130.00  
##                                                                
##       RBI             Walks            Years            CAtBat       
##  Min.   :  0.00   Min.   :  0.00   Min.   : 1.000   Min.   :   19.0  
##  1st Qu.: 28.00   1st Qu.: 22.00   1st Qu.: 4.000   1st Qu.:  816.8  
##  Median : 44.00   Median : 35.00   Median : 6.000   Median : 1928.0  
##  Mean   : 48.03   Mean   : 38.74   Mean   : 7.444   Mean   : 2648.7  
##  3rd Qu.: 64.75   3rd Qu.: 53.00   3rd Qu.:11.000   3rd Qu.: 3924.2  
##  Max.   :121.00   Max.   :105.00   Max.   :24.000   Max.   :14053.0  
##                                                                      
##      CHits            CHmRun           CRuns             CRBI        
##  Min.   :   4.0   Min.   :  0.00   Min.   :   1.0   Min.   :   0.00  
##  1st Qu.: 209.0   1st Qu.: 14.00   1st Qu.: 100.2   1st Qu.:  88.75  
##  Median : 508.0   Median : 37.50   Median : 247.0   Median : 220.50  
##  Mean   : 717.6   Mean   : 69.49   Mean   : 358.8   Mean   : 330.12  
##  3rd Qu.:1059.2   3rd Qu.: 90.00   3rd Qu.: 526.2   3rd Qu.: 426.25  
##  Max.   :4256.0   Max.   :548.00   Max.   :2165.0   Max.   :1659.00  
##                                                                      
##      CWalks        League  Division    PutOuts          Assists     
##  Min.   :   0.00   A:175   E:157    Min.   :   0.0   Min.   :  0.0  
##  1st Qu.:  67.25   N:147   W:165    1st Qu.: 109.2   1st Qu.:  7.0  
##  Median : 170.50                    Median : 212.0   Median : 39.5  
##  Mean   : 260.24                    Mean   : 288.9   Mean   :106.9  
##  3rd Qu.: 339.25                    3rd Qu.: 325.0   3rd Qu.:166.0  
##  Max.   :1566.00                    Max.   :1378.0   Max.   :492.0  
##                                                                     
##      Errors          Salary       NewLeague
##  Min.   : 0.00   Min.   :  67.5   A:176    
##  1st Qu.: 3.00   1st Qu.: 190.0   N:146    
##  Median : 6.00   Median : 425.0            
##  Mean   : 8.04   Mean   : 535.9            
##  3rd Qu.:11.00   3rd Qu.: 750.0            
##  Max.   :32.00   Max.   :2460.0            
##                  NA's   :59
# Get the number of observations
paste0("There are ", nrow(dt), " observations in dt.")
## [1] "There are 322 observations in dt."
# Get the number of variables in the dataset 
paste0("There are ", ncol(dt), " variables in dt.")
## [1] "There are 20 variables in dt."
# Get the number of NAs in the dataset 
paste0("There are ", sum(is.na(dt)), " NAs in dt.")
## [1] "There are 59 NAs in dt."
# Get NAs by column
sapply(dt, function(x) sum(is.na(x)))
##     AtBat      Hits     HmRun      Runs       RBI     Walks     Years    CAtBat 
##         0         0         0         0         0         0         0         0 
##     CHits    CHmRun     CRuns      CRBI    CWalks    League  Division   PutOuts 
##         0         0         0         0         0         0         0         0 
##   Assists    Errors    Salary NewLeague 
##         0         0        59         0
# First row of dt
dt[1, ]
# Median of dt's Salary column
paste0("The median salary is: ", median(dt[, Salary], na.rm = TRUE))
## [1] "The median salary is: 425"

We see there are 59 NAs in the Hitters dataset and they all come from the dependent variable, Salary. For example, we can see the first observation’s Salary value is NA. We could just drop these observations using the code dt <- na.omit(dt), which drops any observation with a NA in any column. However, we can do better than that. Let’s impute the median value of the Salary column for those observations with a NA. We prefer the so-called median imputation method over mean imputation because the median is more robust to outliers than the mean. The median Salary in our dataset is \(425\).

# Replace NA values in the salary column with the median of the salary column
dt[, Salary := ifelse(is.na(Salary), median(Salary, na.rm = TRUE), Salary)]

# Get the number of NAs in the dataset 
paste0("There are ", sum(is.na(dt)), " NAs in dt.")
## [1] "There are 0 NAs in dt."
# First row of dt
dt[1:2, ]

The above code dt[, Salary := ifelse(is.na(Salary), median(Salary, na.rm = TRUE), Salary)] keeps the current value of the Salary column when it is not NA and replaces it with the median when it is NA. For example, the first observation had a NA for Salary while the second did not. We can see that the first observation’s Salary is now \(425\) (the median of the Salary variable) while the second’s is left at its original value of \(475\).

You’ll deal with NA values all the time in your data science life. Dropping observations is usually a waste and the median imputation method is a powerful way to avoid doing so.
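
As an aside, an equivalent data.table idiom is to sub-assign by reference only to the rows where Salary is NA. The sketch below is purely for illustration, since the column has already been imputed above.

# Equivalent idiom: compute the median once, then assign it only to the NA rows
med_salary <- median(dt$Salary, na.rm = TRUE)
dt[is.na(Salary), Salary := med_salary]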

Lastly, we’ll split dt into training and testing sets using an \(80-20\) split.

# Training and test set sizes
train_size <- floor(0.8 * nrow(dt))
test_size <- floor(0.2 * nrow(dt))

# Shuffle the data
shuffled_indices <- sample(nrow(dt))

# Split the indices
train_indices <- shuffled_indices[1:train_size]
test_indices <- shuffled_indices[(train_size + 1):nrow(dt)]

# Create the datasets
dt_train <- dt[train_indices, ]
dt_test <- dt[test_indices, ]

# Show first few rows of training and test sets
head(dt_train)
head(dt_test)

Ridge Regression

In this section, we’ll use the glmnet() function in the glmnet package to run ridge regressions and show that the ridge regression estimator displays increasing attenuation bias, and simultaneously decreasing variance, as the penalization parameter \(\lambda\) increases. Recall ridge regression solves the problem of \[\begin{align} \underset{\widehat{\boldsymbol{\beta}}_{R}}{\arg \min} \ \left[ \left( \boldsymbol{y} - X \boldsymbol{\widehat{\beta}}_{R} \right)^\prime \left( \boldsymbol{y} - X \boldsymbol{\widehat{\beta}}_{R} \right) + \lambda \boldsymbol{\widehat{\beta}}_R^\prime \boldsymbol{\widehat{\beta}}_R \right] \end{align}\] where \(\boldsymbol{\widehat{\beta}}_R^\prime \boldsymbol{\widehat{\beta}}_R\) is the squared L2-Norm of \(\boldsymbol{\widehat{\beta}}_R\).

Ridge Regression using glmnet()

We’ll use the glmnet() function to estimate a ridge regression model. You’ll see some new code we haven’t used before:

  1. The Salary ~ ., dt syntax tells R we want to run a regression of Salary on all covariates included in the data table dt.
  2. model.matrix(Salary ~ ., dt) creates a matrix from our data table dt including all its covariates but excluding the dependent variable Salary. Moreover, it turns any qualitative variables into numerical indicator (dummy) variables. For instance, we can see above the League column takes values of either A for American League or N for National League. The model.matrix() function will automatically turn this column into ones and zeros for us. In sum, this function gives us our feature matrix X.
  3. model.matrix(Salary ~ ., dt)[, -1] gives us all the columns besides ([, -1]) the first. We are dropping the column of ones from our design matrix because the glmnet() function takes care of the intercept for us.
  4. 10^seq(10, -2, length = 100) creates a vector of one-hundred values of \(\lambda\) ranging from \(10^{10}\) down to \(10^{-2}\), evenly spaced on the log scale. This gives us a range of possible penalization parameters \(\lambda\) to estimate a ridge regression with. \(10^{10}\) is an extremely strong penalization term that will shrink all the slope coefficients essentially to zero, resulting in an intercept-only model, while \(10^{-2}\) is an extremely small penalization factor and should result in something close to the estimates produced by OLS.
  5. glmnet(X, y, alpha = 0, lambda = grid) will estimate one-hundred ridge regressions (because the length of grid is one-hundred) with \(y\) as the outcome and \(X\) as the covariates; alpha = 0 specifies a ridge regression. Had we chosen \(\alpha = 1\) we would have estimated a lasso model, and any \(\alpha \in (0,1)\) gives an elastic net regression (we show a quick elastic net example after the ridge output below).
# Create feature matrix
X_train <- model.matrix(Salary ~ ., dt_train)[, -1]

# Create outcome vector
y_train <- dt_train[, Salary]

# Grid of lambda values
grid <- 10^seq(10, -2, length = 100)

# Run ridge regression
reg_ridge <- glmnet(X_train, y_train, alpha = 0, lambda = grid)

# Obtain ridge regression coefficients for the 1st value of lambda (lambda = 10^10)
coef(reg_ridge)[, 1]
##         (Intercept)               AtBat                Hits               HmRun 
## 514.935784628806459   0.000000041218778   0.000000153846101   0.000000590764829 
##                Runs                 RBI               Walks               Years 
##   0.000000260481137   0.000000280776029   0.000000313499094   0.000001143286057 
##              CAtBat               CHits              CHmRun               CRuns 
##   0.000000003280380   0.000000012254447   0.000000082809535   0.000000024553163 
##                CRBI              CWalks             LeagueN           DivisionW 
##   0.000000024118505   0.000000025688399  -0.000002041338425  -0.000005295016077 
##             PutOuts             Assists              Errors          NewLeagueN 
##   0.000000012821811   0.000000002714155   0.000000015551732  -0.000001037287519
# Obtain ridge regression coefficients for the 50th value of lambda
coef(reg_ridge)[, 50]
##   (Intercept)         AtBat          Hits         HmRun          Runs 
## 422.551546711   0.028418370   0.109312309   0.395441499   0.183250699 
##           RBI         Walks         Years        CAtBat         CHits 
##   0.196145225   0.221622584   0.757267090   0.002255600   0.008535179 
##        CHmRun         CRuns          CRBI        CWalks       LeagueN 
##   0.056142130   0.017066139   0.016638372   0.017348694  -1.258836690 
##     DivisionW       PutOuts       Assists        Errors    NewLeagueN 
##  -4.608499806   0.009379127   0.002023299   0.005797051  -0.495926403
# Obtain ridge regression coefficients for the 100th value of lambda (lambda = 10^-2)
coef(reg_ridge)[, 100]
##  (Intercept)        AtBat         Hits        HmRun         Runs          RBI 
##  326.3124984   -1.6872685    4.5579759    1.4570475   -0.8515717    2.0531080 
##        Walks        Years       CAtBat        CHits       CHmRun        CRuns 
##    5.1312582  -14.7079304   -0.1213412    0.2528487   -0.5781641    1.1802070 
##         CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists 
##    0.5352592   -0.7361541  -31.5608524 -128.9620347    0.1470172    0.3656918 
##       Errors   NewLeagueN 
##   -4.1263295   26.9414634
# OLS regression
lm(Salary ~ ., data = dt_train)
## 
## Call:
## lm(formula = Salary ~ ., data = dt_train)
## 
## Coefficients:
## (Intercept)        AtBat         Hits        HmRun         Runs          RBI  
##    325.3464      -1.6849       4.6348       1.7999      -0.9987       1.9476  
##       Walks        Years       CAtBat        CHits       CHmRun        CRuns  
##      5.1475     -14.4570      -0.1200       0.1967      -0.7615       1.2478  
##        CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists  
##      0.6155      -0.7474     -32.8734    -129.0135       0.1473       0.3649  
##      Errors   NewLeagueN  
##     -4.0849      28.5485

We see that in the case of \(\lambda = 10^{10}\), so there is a large amount of shrinkage, the parameter estimates are very close to zero and much smaller than the other ridge regression estimates shown. On the flip side, when \(\lambda = 10^{-2}\), so there is very little shrinkage, the ridge regression estimates are extremely similar to the OLS estimates.
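
As a quick aside on point 5 above, fitting an elastic net is just a matter of choosing \(\alpha\) strictly between zero and one. A minimal sketch (we don’t use this model anywhere else):

# Elastic net: alpha = 0.5 mixes the ridge and lasso penalties equally
reg_enet <- glmnet(X_train, y_train, alpha = 0.5, lambda = grid)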

Now, we’ll use glmnet’s handy cv.glmnet() function to perform k-Fold CV on our training set to select the optimal \(\lambda\) value. We’ll use \(k = 10\).

# 10 fold CV
cv.glmnet(X_train, y_train, alpha = 0, lambda = grid, nfolds = 10)
## 
## Call:  cv.glmnet(x = X_train, y = y_train, lambda = grid, nfolds = 10,      alpha = 0) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure    SE Nonzero
## min     14    74  109809 21662      19
## 1se   3765    54  130831 28742      19

Looking at the above table, the row corresponding to min gives us the value of \(\lambda\) producing the lowest estimate of the test MSE along with that estimate. We can see that \(\lambda = 14\), which is located at the 74th position of the vector grid, produces the lowest MSE of \(109,809\) with a standard error of \(21,662\). The upshot is that this is evidence in favor of choosing \(\lambda = 14\) for our final ridge regression model.
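
If we store the cv.glmnet() fit in an object, we can pull the selected \(\lambda\) and its CV error out programmatically instead of reading them off the printed table. Note the exact numbers will vary with the random fold assignment.

# Store the CV fit and extract the selected lambda and its CV MSE
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0, lambda = grid, nfolds = 10)

# Lambda minimizing the CV estimate of the test MSE
cv_ridge$lambda.min

# The corresponding CV MSE estimate
min(cv_ridge$cvm)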

Let us now do 10-Fold CV using a standard linear regression model to obtain an estimate of the test MSE.

# Specify number of folds
k <- 10

# Fit a linear regression of Salary on all covariates
reg <- glm(Salary ~ ., data = dt_train)

# Store the k-Fold CV estimate for the model
cv_error <- cv.glm(dt_train, reg, K = k)$delta[1]

# Show the k-Fold CV MSE for the model
cv_error
## [1] 114733.1

We see the estimate of the out-of-sample MSE is higher than that given by ridge regression, suggesting ridge is a better predictor than standard OLS. This is a fairly typical result. In practice, you likely won’t even bother using plain OLS for prediction because there are much better algorithms out there, ridge and lasso being two of them.

Ridge Regression Bias and Variance

Suppose the true underlying model is \(y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + u_i = 1 + 2x_{i1} + 3 x_{i2} + u_i\) and we estimate it via ridge regression. Recall the ridge regression estimator, its bias, and its variance are given by

\[\begin{align} \widehat{\boldsymbol{\beta}}_R &= \left( X^\prime X + \lambda I_k \right)^{-1} X^\prime \boldsymbol{y} \\ \text{bias} \left[ \widehat{\boldsymbol{\beta}}_R \right] &= -\lambda \left( X^\prime X + \lambda I_k \right)^{-1} X^\prime \boldsymbol{\beta} \\ \mathbb{V} \left[ \widehat{\boldsymbol{\beta}}_R \right] &= \sigma^2 \left( X^\prime X + \lambda I_k \right)^{-1} X^\prime X \left( X^\prime X + \lambda I_k \right)^{-1}. \end{align}\]
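
As a sanity check on the formula for \(\widehat{\boldsymbol{\beta}}_R\), here is a minimal sketch that computes the ridge estimator directly from the closed form on a small simulated dataset. Note this raw formula penalizes the intercept and does not standardize the covariates the way glmnet() does internally, so its output will not match glmnet()’s coefficients exactly.

# Closed-form ridge estimator on simulated data (illustration only)
n <- 100
x_1 <- rnorm(n)
x_2 <- rnorm(n)
y <- 1 + 2*x_1 + 3*x_2 + rnorm(n)

# Design matrix with an explicit column of ones for the intercept
X <- cbind(1, x_1, x_2)
lambda <- 5

# beta_hat_R = (X'X + lambda * I)^(-1) X'y
solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% y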

Let’s verify that the ridge regression estimator has increasing downward (attenuation) bias and decreasing variance for increasing values of \(\lambda\). We will generate \(s = 1000\) samples, each with \(n = 200\) data points, and run the regression of \(y\) on an intercept term, \(x_1\), and \(x_2\) using penalization terms of \(\lambda = 1\) and \(\lambda = 20\). If we are correct, then the sampling distribution of \(\widehat{\boldsymbol{\beta}}_R\) should show attenuation bias in both cases, but the bias will be larger when \(\lambda\) is larger. Furthermore, the sampling distribution should be more tightly concentrated (have lower variance) for the higher value of \(\lambda\). Let’s examine this for \(\widehat{\beta}_{1R}\) specifically.

# Number of samples
s <- 1000

# Number of data points to generate per sample
n <- 200

# Lambda values
grid <- c(20, 1)

# Initialize empty list to store estimates of beta_1 depending on the value of lambda
estimates_list <- vector("list", length(grid))
names(estimates_list) <- paste0("lambda = ", grid)

# Initialize empty lists within estimates_list for storing estimates for each lambda
for (lambda in names(estimates_list)) 
{
  estimates_list[[lambda]] <- numeric(s)
}

# Intercept term when y is the outcome
beta_0 <- 1

# Coefficient on x_1 when y is the outcome
beta_1 <- 2

# Coefficient on x_2 when y is the outcome
beta_2 <- 3

# We want to create i = 1, 2, ..., s samples
for (i in 1:s)
{
  # Error term when y is the outcome
  u <- rnorm(n, mean = 0, sd = 2)
  
  # x_1
  x_1 <- rnorm(n, mean = 1, sd = 2)
  
  # x_2 
  x_2 <- rnorm(n, mean = 2, sd = 2)
  
  # Design matrix
  X <- matrix(data = c(x_1, x_2), ncol = 2)
  
  # Outcome
  y <- beta_0 + beta_1*x_1 + beta_2*x_2 + u
  
  # Run ridge regression
  reg_ridge <- glmnet(X, y, alpha = 0, lambda = grid)
  
  # Obtain ridge regression estimates of beta_1
  beta_hat_1_l1 <- coef(reg_ridge)[2, 1]
  beta_hat_1_l2 <- coef(reg_ridge)[2, 2]
  
  # Store estimates of beta_1 for each lambda
  estimates_list[[paste0("lambda = ", grid[1])]][i] <- beta_hat_1_l1
  estimates_list[[paste0("lambda = ", grid[2])]][i] <- beta_hat_1_l2
}

# Combine estimates into a single data.table for plotting
dt_plot <- data.table(
  Estimate = unlist(estimates_list),
  Lambda = rep(names(estimates_list), each = s)
)

# Ensure Lambda is a factor with levels in the desired order
dt_plot[, Lambda := factor(Lambda, levels = paste0("lambda = ", grid))]
# Plot the sampling distributions
ggplot(dt_plot, aes(x = Estimate, fill = Lambda)) + 
  geom_density(alpha = 0.7) +
  geom_vline(xintercept = beta_1, color = color_1, linetype = "dashed", size = 1) +
  labs(title = expression(paste("Sampling Distribution of ", hat(beta)[1*R]))) +
  scale_fill_manual(
    values = c("#FFA07A", "#20B2AA"),
    labels = c(grid[1], grid[2]),
    name = expression(lambda)
  ) +
  theme(
    panel.background = element_rect(fill = "black", color = "black"),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.grid.major = element_line(color = "gray"),
    panel.grid.minor = element_line(color = "gray"),
    axis.text = element_text(color = color_1, size = 15),
    axis.title = element_text(color = color_2, size = 25),
    plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, color = color_1, size = 25),
    legend.background = element_rect(fill = "black"),
    legend.text = element_text(color = color_1, size = 15),
    legend.title = element_text(color = color_2, size = 20)
  ) +
  xlab(expression(hat(beta)[1*R])) + 
  ylab("Density")

Indeed, larger values of \(\lambda\) (more shrinkage) give a more downward-biased parameter estimate. Recall, this is called attenuation bias. Moreover, the sampling distribution of \(\widehat{\beta}_{1R}\) is more tightly centered around its mean (has less variance) in the case of \(\lambda = 20\) versus the case of \(\lambda = 1\).
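
To put numbers on what the densities show, we can compute the empirical mean and standard deviation of the stored estimates for each value of \(\lambda\) (the exact values depend on the simulated draws):

# Empirical mean and standard deviation of beta_hat_1 for each lambda
sapply(estimates_list, function(est) c(mean = mean(est), sd = sd(est)))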

Lasso

Now, we will estimate the lasso regression model. This is extremely similar to estimating a ridge regression model, except now we specify \(\alpha = 1\) within the glmnet() function. Recall lasso regression solves the problem of \[\begin{align} \underset{\widehat{\boldsymbol{\beta}}_{L}}{\arg \min} \ \left[ \left( \boldsymbol{y} - X \boldsymbol{\widehat{\beta}}_{L} \right)^\prime \left( \boldsymbol{y} - X \boldsymbol{\widehat{\beta}}_{L} \right) + \lambda \left \|\widehat{\boldsymbol{\beta}}_{L} \right \|_1 \right] \end{align}\] where \(\left \|\widehat{\boldsymbol{\beta}}_{L} \right \|_1\) is the L1-Norm of \(\widehat{\boldsymbol{\beta}}_{L}\).

Lasso Regression using glmnet()

# Create feature matrix
X_train <- model.matrix(Salary ~ ., dt_train)[, -1]

# Create outcome vector
y_train <- dt_train[, Salary]

# Grid of lambda values
grid <- 10^seq(10, -2, length = 100)

# Run lasso regression
reg_lasso <- glmnet(X_train, y_train, alpha = 1, lambda = grid)

# Obtain lasso regression coefficients for the 1st value of lambda (lambda = 10^10)
coef(reg_lasso)[, 1]
## (Intercept)       AtBat        Hits       HmRun        Runs         RBI 
##    514.9359      0.0000      0.0000      0.0000      0.0000      0.0000 
##       Walks       Years      CAtBat       CHits      CHmRun       CRuns 
##      0.0000      0.0000      0.0000      0.0000      0.0000      0.0000 
##        CRBI      CWalks     LeagueN   DivisionW     PutOuts     Assists 
##      0.0000      0.0000      0.0000      0.0000      0.0000      0.0000 
##      Errors  NewLeagueN 
##      0.0000      0.0000
# Obtain lasso regression coefficients for the 50th value of lambda
coef(reg_lasso)[, 50]
## (Intercept)       AtBat        Hits       HmRun        Runs         RBI 
##    514.9359      0.0000      0.0000      0.0000      0.0000      0.0000 
##       Walks       Years      CAtBat       CHits      CHmRun       CRuns 
##      0.0000      0.0000      0.0000      0.0000      0.0000      0.0000 
##        CRBI      CWalks     LeagueN   DivisionW     PutOuts     Assists 
##      0.0000      0.0000      0.0000      0.0000      0.0000      0.0000 
##      Errors  NewLeagueN 
##      0.0000      0.0000
# Obtain lasso regression coefficients for the 100th value of lambda (lambda = 10^-2)
coef(reg_lasso)[, 100]
##  (Intercept)        AtBat         Hits        HmRun         Runs          RBI 
##  326.2424567   -1.6959250    4.5968967    1.5001535   -0.8761376    2.0479429 
##        Walks        Years       CAtBat        CHits       CHmRun        CRuns 
##    5.1404726  -14.8068162   -0.1169668    0.2299028   -0.6167888    1.1935473 
##         CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists 
##    0.5498061   -0.7408376  -31.7484036 -129.0429670    0.1472011    0.3638426 
##       Errors   NewLeagueN 
##   -4.1016873   27.1372056
# OLS regression
lm(Salary ~ ., data = dt_train)
## 
## Call:
## lm(formula = Salary ~ ., data = dt_train)
## 
## Coefficients:
## (Intercept)        AtBat         Hits        HmRun         Runs          RBI  
##    325.3464      -1.6849       4.6348       1.7999      -0.9987       1.9476  
##       Walks        Years       CAtBat        CHits       CHmRun        CRuns  
##      5.1475     -14.4570      -0.1200       0.1967      -0.7615       1.2478  
##        CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists  
##      0.6155      -0.7474     -32.8734    -129.0135       0.1473       0.3649  
##      Errors   NewLeagueN  
##     -4.0849      28.5485

We see that in the case of \(\lambda = 10^{10}\), so there is a large amount of shrinkage, the slope estimates are all exactly zero. This is the big selling point of lasso regression: it performs a form of model selection, as some of the covariates are effectively removed from the model. Ridge regression does not do this; as we saw in the prior section, even in the case of \(\lambda = 10^{10}\) none of the coefficients were identically zero. On the flip side, when \(\lambda = 10^{-2}\), so there is very little shrinkage, the lasso regression estimates are extremely similar to the OLS estimates.
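
One quick way to see this sparsity along the whole path is to count the nonzero slope coefficients at each value of \(\lambda\); the fitted glmnet object stores this in its df component:

# Number of nonzero slope coefficients at each lambda in the path
# (larger lambda means fewer nonzero coefficients)
reg_lasso$df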

Now, we’ll use glmnet’s handy cv.glmnet() function to perform k-Fold CV on our training set to select the optimal \(\lambda\) value. We’ll use \(k = 10\).

# 10 fold CV
cv.glmnet(X_train, y_train, alpha = 1, lambda = grid, nfolds = 10)
## 
## Call:  cv.glmnet(x = X_train, y = y_train, lambda = grid, nfolds = 10,      alpha = 1) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure    SE Nonzero
## min   2.01    81  109021 14339      14
## 1se  57.22    69  119426 19343       7

We see the chosen value for lasso regression is \(\lambda = 2.01\), giving an estimated test MSE of \(109,021\), very similar to that of the chosen ridge regression model.

Model Evaluation on Test Set

In the prior section, we saw that the “best” lasso model does about as well as the “best” ridge model. We’ll just use the ridge regression model moving forward. So, let us now evaluate the ridge regression model on our test set to see how its test MSE compares to the estimate given by k-Fold CV. Recall that the ridge regression model chosen by 10-Fold CV has \(\lambda = 14\).

# Penalization parameter
lambda <- 14

# Create feature matrix
X_test <- model.matrix(Salary ~ ., dt_test)[, -1]

# Create outcome vector
y_test <- dt_test[, Salary]

# Run ridge regression
reg_ridge <- glmnet(X_train, y_train, alpha = 0, lambda = lambda)

# Predict on the test data
predictions <- predict(reg_ridge, s = lambda, newx = X_test)

# Calculate MSE
mse <- mean((y_test - predictions)^2)

# Show MSE
mse
## [1] 99342.42

We see that the ridge regression with \(\lambda = 14\) produces a test MSE of \(99,342\), roughly \(10,000\) lower than the estimate of this test MSE given by 10-Fold CV. The CV number is itself only an estimate of the test error, and here it turned out to be somewhat pessimistic; discrepancies of this size between the CV estimate and the realized test MSE are not unusual.
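
Since the MSE is in squared units of Salary (which Hitters measures in thousands of dollars), the square root is often easier to interpret:

# RMSE, in the same units as Salary (thousands of dollars)
sqrt(mse)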

Shrinkage with Binary Response Models (BRMs)

We’ll now estimate ridge and lasso logistic regression models. The first thing we need to do is convert our outcome to a binary variable. Recall the median Salary is \(425\). Let’s recode the Salary column to be one if Salary is above \(425\) and zero otherwise.

# Convert Salary to binary variable to training and testing data
dt_train[, Salary := ifelse(Salary > 425, 1, 0)]
dt_test[, Salary := ifelse(Salary > 425, 1, 0)]

Estimating a ridge or lasso logistic regression is very similar to estimating the usual ridge regression except we now specify the family = "binomial" argument within the glmnet() function. This is similar to what we did when estimating a logistic regression model using glm().

library(glmnet)

# Create feature matrix
X_train <- model.matrix(Salary ~ ., dt_train)[, -1]

# Create outcome vector
y_train <- dt_train[, Salary]

# Grid of lambda values
grid <- 10^seq(10, -2, length = 100)

# Perform 10-fold CV for ridge
cv_ridge <- cv.glmnet(X_train, y_train, family = "binomial", alpha = 0, lambda = grid, nfolds = 10)

# Perform 10-fold CV for lasso
cv_lasso <- cv.glmnet(X_train, y_train, family = "binomial", alpha = 1, lambda = grid, nfolds = 10)

# Extract lambda.min for ridge and lasso
best_lambda_ridge <- cv_ridge$lambda.min
best_lambda_lasso <- cv_lasso$lambda.min

# Make predictions using the cross-validated ridge model
pred_probs_ridge <- predict(cv_ridge, s = best_lambda_ridge, newx = X_train, type = "response")

# Convert probabilities to binary predictions for ridge
predictions_ridge <- ifelse(pred_probs_ridge > 0.5, 1, 0)

# Calculate accuracy for ridge
accuracy_ridge <- mean(predictions_ridge == y_train)

# Make predictions using the cross-validated lasso model
pred_probs_lasso <- predict(cv_lasso, s = best_lambda_lasso, newx = X_train, type = "response")

# Convert probabilities to binary predictions for lasso
predictions_lasso <- ifelse(pred_probs_lasso > 0.5, 1, 0)

# Calculate accuracy for lasso
accuracy_lasso <- mean(predictions_lasso == y_train)

# Show results
cat("Best Lambda for Ridge:", best_lambda_ridge, "\n")
## Best Lambda for Ridge: 0.1232847
cat("Accuracy for Ridge:", accuracy_ridge, "\n")
## Accuracy for Ridge: 0.7937743
cat("Best Lambda for Lasso:", best_lambda_lasso, "\n")
## Best Lambda for Lasso: 0.01747528
cat("Accuracy for Lasso:", accuracy_lasso, "\n")
## Accuracy for Lasso: 0.7782101

We see the best ridge model, with \(\lambda = 0.1233\), achieves a higher training accuracy than the best lasso model, with \(\lambda = 0.0175\).

# Penalization parameter
lambda <- best_lambda_ridge

# Create feature matrix
X_test <- model.matrix(Salary ~ ., dt_test)[, -1]

# Create outcome vector
y_test <- dt_test[, Salary]

# Run ridge logistic regression
reg_ridge <- glmnet(X_train, y_train, family = "binomial", alpha = 0, lambda = lambda)

# Predict on the test data
pred_probs <- predict(reg_ridge, s = lambda, newx = X_test, type = "response")

# Convert probabilities to binary predictions
predictions <- ifelse(pred_probs > 0.5, 1, 0)

# Calculate accuracy rate
accuracy <- mean(y_test == predictions)

# Show accuracy rate
accuracy
## [1] 0.7846154

We see our test accuracy is \(78.46\%\), similar to the in-sample accuracy of \(79.38\%\) we computed for the ridge model on the training data above.
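
To see where the misclassifications come from, we can tabulate the test-set predictions against the truth using the objects created above (a small sketch):

# Confusion matrix for the ridge logistic model on the test set
table(Predicted = as.vector(predictions), Actual = y_test)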