The preliminaries below are standard, except that we now also load the glmnet package, which contains the glmnet() function used to run ridge, lasso, and elastic net regressions.
# Clear environment, plot pane, and console
rm(list = ls())
graphics.off()
cat("\014")
# If pacman is not already installed, then install it
if (!require(pacman)) install.packages("pacman")
# Load packages
pacman::p_load(ISLR2, glmnet, boot, data.table, ggplot2)
# Colors used for plot (https://www.rapidtables.com/web/color/RGB_Color.html)
color_1 <- "#00FF00"
color_2 <- "#ADD8E6"
# Set random seed for reproducibility
set.seed(418518)
# Set scipen option to a high value to avoid scientific notation
options(scipen = 999)
ISLR2 Dataset
We'll use the Hitters dataset contained in the ISLR2 package. We'll use this data to predict a baseball player's salary using various statistics associated with his performance in the previous year. Let's load in the data and do some exploratory data analysis.
You'll see a new function in the code below called sapply(). This function is part of R's apply family of functions, which is super useful in data analysis. It lets us apply a function to each column of our dataset and returns a vector with one value per column, where that value is the output of the function. For instance, the code below uses sapply(dt, function(x) sum(is.na(x))). Here, x stands in for each column of dt in turn. So, sapply() applies our defined function function(x) sum(is.na(x)) to each column of dt, returning a vector of the total number of NAs in each column. Don't worry too much if this code is confusing; it was confusing for me too when I first learned the apply functions.
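To make this concrete, here is a quick toy example (not part of the Hitters analysis) of sapply() counting NAs column by column:
# Toy data frame with a few missing values (illustrative only)
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = c(7, 8, 9))
# Apply the anonymous function to each column; the result is a named vector
sapply(toy, function(x) sum(is.na(x)))
# a has 1 NA, b has 2, and c has 0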
# Load in Hitters data set from the ISLR2 package as a data table
dt <- data.table(ISLR2::Hitters)
# Show the first few rows of the data
head(dt)
# Show the last few rows of the data
tail(dt)
# Get the names of the variables in our dataset
names(dt)
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
## [7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
## [13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
## [19] "Salary" "NewLeague"
# Get summary statistics
summary(dt)
## AtBat Hits HmRun Runs
## Min. : 16.0 Min. : 1 Min. : 0.00 Min. : 0.00
## 1st Qu.:255.2 1st Qu.: 64 1st Qu.: 4.00 1st Qu.: 30.25
## Median :379.5 Median : 96 Median : 8.00 Median : 48.00
## Mean :380.9 Mean :101 Mean :10.77 Mean : 50.91
## 3rd Qu.:512.0 3rd Qu.:137 3rd Qu.:16.00 3rd Qu.: 69.00
## Max. :687.0 Max. :238 Max. :40.00 Max. :130.00
##
## RBI Walks Years CAtBat
## Min. : 0.00 Min. : 0.00 Min. : 1.000 Min. : 19.0
## 1st Qu.: 28.00 1st Qu.: 22.00 1st Qu.: 4.000 1st Qu.: 816.8
## Median : 44.00 Median : 35.00 Median : 6.000 Median : 1928.0
## Mean : 48.03 Mean : 38.74 Mean : 7.444 Mean : 2648.7
## 3rd Qu.: 64.75 3rd Qu.: 53.00 3rd Qu.:11.000 3rd Qu.: 3924.2
## Max. :121.00 Max. :105.00 Max. :24.000 Max. :14053.0
##
## CHits CHmRun CRuns CRBI
## Min. : 4.0 Min. : 0.00 Min. : 1.0 Min. : 0.00
## 1st Qu.: 209.0 1st Qu.: 14.00 1st Qu.: 100.2 1st Qu.: 88.75
## Median : 508.0 Median : 37.50 Median : 247.0 Median : 220.50
## Mean : 717.6 Mean : 69.49 Mean : 358.8 Mean : 330.12
## 3rd Qu.:1059.2 3rd Qu.: 90.00 3rd Qu.: 526.2 3rd Qu.: 426.25
## Max. :4256.0 Max. :548.00 Max. :2165.0 Max. :1659.00
##
## CWalks League Division PutOuts Assists
## Min. : 0.00 A:175 E:157 Min. : 0.0 Min. : 0.0
## 1st Qu.: 67.25 N:147 W:165 1st Qu.: 109.2 1st Qu.: 7.0
## Median : 170.50 Median : 212.0 Median : 39.5
## Mean : 260.24 Mean : 288.9 Mean :106.9
## 3rd Qu.: 339.25 3rd Qu.: 325.0 3rd Qu.:166.0
## Max. :1566.00 Max. :1378.0 Max. :492.0
##
## Errors Salary NewLeague
## Min. : 0.00 Min. : 67.5 A:176
## 1st Qu.: 3.00 1st Qu.: 190.0 N:146
## Median : 6.00 Median : 425.0
## Mean : 8.04 Mean : 535.9
## 3rd Qu.:11.00 3rd Qu.: 750.0
## Max. :32.00 Max. :2460.0
## NA's :59
# Get the number of observations
paste0("There are ", nrow(dt), " observations in dt.")
## [1] "There are 322 observations in dt."
# Get the number of variables in the dataset
paste0("There are ", ncol(dt), " variables in dt.")
## [1] "There are 20 variables in dt."
# Get the number of NAs in the dataset
paste0("There are ", sum(is.na(dt)), " NAs in dt.")
## [1] "There are 59 NAs in dt."
# Get NAs by column
sapply(dt, function(x) sum(is.na(x)))
## AtBat Hits HmRun Runs RBI Walks Years CAtBat
## 0 0 0 0 0 0 0 0
## CHits CHmRun CRuns CRBI CWalks League Division PutOuts
## 0 0 0 0 0 0 0 0
## Assists Errors Salary NewLeague
## 0 0 59 0
# First row of dt
dt[1, ]
# Median of dt's Salary column
paste0("The median salary is: ", median(dt[, Salary], na.rm = TRUE))
## [1] "The median salary is: 425"
We see there are 59 NAs in the Hitters dataset, and they all come from the dependent variable, Salary. For example, we can see the first observation's Salary value is NA. We could just drop these observations using the code dt <- na.omit(dt), which drops observations with NAs in any column. However, we can do better than that. Let's impute the median value of the Salary column for those observations with an NA. We prefer the so-called median imputation method over the mean imputation method because the median is more robust to outliers than the mean. The median Salary in our dataset is \(425\).
# Replace NA values in the salary column with the median of the salary column
dt[, Salary := ifelse(is.na(Salary), median(Salary, na.rm = TRUE), Salary)]
# Get the number of NAs in the dataset
paste0("There are ", sum(is.na(dt)), " NAs in dt.")
## [1] "There are 0 NAs in dt."
# First two rows of dt
dt[1:2, ]
The above code dt[, Salary := ifelse(is.na(Salary), median(Salary, na.rm = TRUE), Salary)] keeps each observation's Salary at its current value when it is not NA and replaces it with the median when it is NA. For example, the first observation had an NA for Salary while the second did not. We can see that the first observation's Salary is now \(425\) (the median of the Salary variable) while the second's is left at its original value of \(475\).
You'll deal with NA values all the time in your data science life. Dropping observations is usually a waste, and the median imputation method is a simple, effective way to avoid doing so.
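As an aside, an equivalent and arguably more idiomatic data.table pattern is to subset to the NA rows and assign by reference; a minimal sketch (purely illustrative here, since dt no longer contains any NAs at this point):
# Compute the median once, then overwrite only the rows where Salary is NA
med_salary <- median(dt$Salary, na.rm = TRUE)
dt[is.na(Salary), Salary := med_salary]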
Lastly, we'll split dt into training and testing sets using an \(80\)-\(20\) split.
# Training and test set sizes
train_size <- floor(0.8 * nrow(dt))
test_size <- floor(0.2 * nrow(dt))
# Shuffle the data
shuffled_indices <- sample(nrow(dt))
# Split the indices
train_indices <- shuffled_indices[1:train_size]
test_indices <- shuffled_indices[(train_size + 1):nrow(dt)]
# Create the datasets
dt_train <- dt[train_indices, ]
dt_test <- dt[test_indices, ]
# Show first few rows of training and test sets
head(dt_train)
head(dt_test)
In this section, we'll use the glmnet() function in the glmnet package to run ridge regressions and show that the ridge regression estimator displays increasing attenuation bias, and simultaneously decreasing variance, as the penalization parameter \(\lambda\) increases. Recall ridge regression solves the problem \[\begin{align}
\underset{\widehat{\boldsymbol{\beta}}_{R}}{\arg \min} \ \left[ \left( \boldsymbol{y} - X \boldsymbol{\widehat{\beta}}_{R} \right)^\prime \left( \boldsymbol{y} - X \boldsymbol{\widehat{\beta}}_{R} \right) + \lambda \boldsymbol{\widehat{\beta}}_R^\prime \boldsymbol{\widehat{\beta}}_R \right]
\end{align}\] where \(\boldsymbol{\widehat{\beta}}_R^\prime \boldsymbol{\widehat{\beta}}_R\) is the squared L2-norm of \(\boldsymbol{\widehat{\beta}}_R\).
glmnet()
We'll use the glmnet() function to estimate a ridge regression model. You'll see some new code we haven't used before:
- The Salary ~ ., dt syntax tells R we want to run a regression of Salary on all covariates included in the data table dt.
- model.matrix(Salary ~ ., dt) creates a matrix from our data table dt including all its covariates but excluding the dependent variable Salary. Moreover, it turns any qualitative variables into numerical indicator variables. For instance, we can see above that the League column takes values of either A for American League or N for National League. The model.matrix() function will automatically turn this column into ones and zeros for us. In sum, this function gives us our feature matrix X.
- model.matrix(Salary ~ ., dt)[, -1] gives us all the columns besides ([, -1]) the first. We are dropping the vector of ones from our design matrix because the glmnet() function takes care of that for us.
- 10^seq(10, -2, length = 100) creates a vector of one hundred values running from \(10^{10}\) down to \(10^{-2}\), evenly spaced on the log scale. This gives us a range of possible penalization parameters \(\lambda\) to estimate ridge regressions with. \(10^{10}\) is an extremely strong penalization term that will shrink all the slope coefficients essentially to zero, resulting in (nearly) an intercept-only model, while \(10^{-2}\) is an extremely small penalization term and should produce something close to the estimates produced by OLS.
- glmnet(X, y, alpha = 0, lambda = grid) estimates one hundred ridge regressions (because grid has length one hundred) with \(y\) as the outcome and \(X\) as the covariates; alpha = 0 specifies a ridge regression. Had we chosen \(\alpha = 1\) we would have estimated a lasso model, and any \(\alpha \in (0,1)\) gives an elastic net regression.
# Create feature matrix
X_train <- model.matrix(Salary ~ ., dt_train)[, -1]
# Create outcome vector
y_train <- dt_train[, Salary]
# Grid of lambda values
grid <- 10^seq(10, -2, length = 100)
# Run ridge regression
reg_ridge <- glmnet(X_train, y_train, alpha = 0, lambda = grid)
# Obtain ridge regression coefficients for the 1st value of lambda (lambda = 10^10)
coef(reg_ridge)[, 1]
## (Intercept) AtBat Hits HmRun
## 514.935784628806459 0.000000041218778 0.000000153846101 0.000000590764829
## Runs RBI Walks Years
## 0.000000260481137 0.000000280776029 0.000000313499094 0.000001143286057
## CAtBat CHits CHmRun CRuns
## 0.000000003280380 0.000000012254447 0.000000082809535 0.000000024553163
## CRBI CWalks LeagueN DivisionW
## 0.000000024118505 0.000000025688399 -0.000002041338425 -0.000005295016077
## PutOuts Assists Errors NewLeagueN
## 0.000000012821811 0.000000002714155 0.000000015551732 -0.000001037287519
# Obtain ridge regression coefficients for the 50th value of lambda
coef(reg_ridge)[, 50]
## (Intercept) AtBat Hits HmRun Runs
## 422.551546711 0.028418370 0.109312309 0.395441499 0.183250699
## RBI Walks Years CAtBat CHits
## 0.196145225 0.221622584 0.757267090 0.002255600 0.008535179
## CHmRun CRuns CRBI CWalks LeagueN
## 0.056142130 0.017066139 0.016638372 0.017348694 -1.258836690
## DivisionW PutOuts Assists Errors NewLeagueN
## -4.608499806 0.009379127 0.002023299 0.005797051 -0.495926403
# Obtain ridge regression coefficients for the 100th value of lambda (lambda = 10^-2)
coef(reg_ridge)[, 100]
## (Intercept) AtBat Hits HmRun Runs RBI
## 326.3124984 -1.6872685 4.5579759 1.4570475 -0.8515717 2.0531080
## Walks Years CAtBat CHits CHmRun CRuns
## 5.1312582 -14.7079304 -0.1213412 0.2528487 -0.5781641 1.1802070
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.5352592 -0.7361541 -31.5608524 -128.9620347 0.1470172 0.3656918
## Errors NewLeagueN
## -4.1263295 26.9414634
# OLS regression
lm(Salary ~ ., data = dt_train)
##
## Call:
## lm(formula = Salary ~ ., data = dt_train)
##
## Coefficients:
## (Intercept) AtBat Hits HmRun Runs RBI
## 325.3464 -1.6849 4.6348 1.7999 -0.9987 1.9476
## Walks Years CAtBat CHits CHmRun CRuns
## 5.1475 -14.4570 -0.1200 0.1967 -0.7615 1.2478
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.6155 -0.7474 -32.8734 -129.0135 0.1473 0.3649
## Errors NewLeagueN
## -4.0849 28.5485
We see that in the case of \(\lambda = 10^{10}\), where there is a large amount of shrinkage, the parameter estimates are very close to zero and much smaller than the other ridge regression estimates shown. On the flip side, when \(\lambda = 10^{-2}\), so there is not much shrinkage at all, the ridge regression estimates are extremely similar to the OLS estimates.
Now, we’ll use glmnet
’s handy cv.glmnet()
function to perform k-Fold CV on our training set to select the optimal
\(\lambda\) value. We’ll use \(k = 10\).
# 10 fold CV
cv.glmnet(X_train, y_train, alpha = 0, lambda = grid, nfolds = 10)
##
## Call: cv.glmnet(x = X_train, y = y_train, lambda = grid, nfolds = 10, alpha = 0)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 14 74 109809 21662 19
## 1se 3765 54 130831 28742 19
Looking at the above table, the row labeled min gives the value of \(\lambda\) producing the lowest estimate of the test MSE, along with that estimate. We can see that \(\lambda = 14\), located at the 74th position of the vector grid, produces the lowest CV MSE of \(109,809\) with a standard error of \(21,662\). The upshot is that this is evidence in favor of choosing \(\lambda = 14\) for our final ridge regression model.
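In practice, rather than reading the selected \(\lambda\) off the printed table, you would typically store the cv.glmnet() object and extract it programmatically. A short sketch (cv_ridge_fit is a name introduced here; because the folds are assigned randomly, the numbers can differ slightly from the table above):
# Store the CV fit and pull out the selected lambda and its CV MSE estimate
cv_ridge_fit <- cv.glmnet(X_train, y_train, alpha = 0, lambda = grid, nfolds = 10)
cv_ridge_fit$lambda.min   # lambda with the lowest CV MSE
min(cv_ridge_fit$cvm)     # the corresponding CV estimate of the MSE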
Let us now do 10-Fold CV using a standard linear regression model to obtain an estimate of the test MSE.
# Specify number of folds
k <- 10
# Fit a linear regression of Salary on all covariates in the training data
reg <- glm(Salary ~ ., data = dt_train)
# Store the k-Fold CV estimate for the model
cv_error <- cv.glm(dt_train, reg, K = k)$delta[1]
# Show the k-Fold CV MSE estimate
cv_error
## [1] 114733.1
We see the estimate of the out-of-sample MSE is higher than that given by ridge regression, suggesting ridge is a better predictor than standard OLS here. This is a fairly typical result. In practice, you likely won't even bother using plain OLS for prediction because there are much better algorithms out there, ridge and lasso being two of them.
Suppose the true underlying model is \(y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + u_i = 1 + 2x_{i1} + 3 x_{i2} + u_i\) and we estimate it via ridge regression. Recall the ridge regression estimator, its bias, and its variance are given by
\[\begin{align} \widehat{\boldsymbol{\beta}}_R &= \left( X^\prime X + \lambda I_k \right)^{-1} X^\prime \boldsymbol{y} \\ \text{bias} \left[ \widehat{\boldsymbol{\beta}}_R \right] &= -\lambda \left( X^\prime X + \lambda I_k \right)^{-1} \boldsymbol{\beta} \\ \mathbb{V} \left[ \widehat{\boldsymbol{\beta}}_R \right] &= \sigma^2 \left( X^\prime X + \lambda I_k \right)^{-1} X^\prime X \left( X^\prime X + \lambda I_k \right)^{-1}. \end{align}\]
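As a quick sanity check on the first formula, here is a minimal sketch that computes the closed-form ridge estimator on a single simulated sample from the data-generating process used below. It is shown for illustration only: it penalizes the intercept and uses the raw covariates, whereas glmnet() leaves the intercept unpenalized, standardizes the covariates, and scales \(\lambda\), so its output will not match these numbers exactly.
# One simulated sample from the DGP used below (numbers vary across draws)
n <- 200
x_1 <- rnorm(n, mean = 1, sd = 2)
x_2 <- rnorm(n, mean = 2, sd = 2)
u <- rnorm(n, mean = 0, sd = 2)
y <- 1 + 2*x_1 + 3*x_2 + u
# Closed-form ridge estimator (X'X + lambda I)^{-1} X'y with an explicit intercept column
X <- cbind(1, x_1, x_2)
lambda <- 20
solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% y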
Let's verify the ridge regression estimator has increasing downward (attenuation) bias and decreasing variance for increasing values of \(\lambda\). We will generate \(s = 1000\) samples, each with \(n = 200\) data points, and run the ridge regression of \(y\) on an intercept term, \(x_1\), and \(x_2\) using penalization terms of \(\lambda = 1\) and \(\lambda = 20\). If we are correct, the sampling distributions of \(\widehat{\boldsymbol{\beta}}_R\) should show attenuation bias in both cases, but the bias should be larger when \(\lambda\) is larger. Furthermore, the sampling distribution should be more tightly centered around its mean for the higher value of \(\lambda\). Let's examine this for \(\widehat{\beta}_{1R}\) specifically.
# Number of samples
s <- 1000
# Number of data points to generate per sample
n <- 200
# Lambda values
grid <- c(20, 1)
# Initialize empty list to store estimates of beta_1 depending on the value of lambda
estimates_list <- vector("list", length(grid))
names(estimates_list) <- paste0("lambda = ", grid)
# Initialize empty lists within estimates_list for storing estimates for each lambda
for (lambda in names(estimates_list))
{
estimates_list[[lambda]] <- numeric(s)
}
# Intercept term when y is the outcome
beta_0 <- 1
# Coefficient on x_1 when y is the outcome
beta_1 <- 2
# Coefficient on x_2 when y is the outcome
beta_2 <- 3
# We want to create i = 1, 2, ..., s samples
for (i in 1:s)
{
# Error term when y is the outcome
u <- rnorm(n, mean = 0, sd = 2)
# x_1
x_1 <- rnorm(n, mean = 1, sd = 2)
# x_2
x_2 <- rnorm(n, mean = 2, sd = 2)
# Design matrix
X <- matrix(data = c(x_1, x_2), ncol = 2)
# Outcome
y <- beta_0 + beta_1*x_1 + beta_2*x_2 + u
# Run ridge regression
reg_ridge <- glmnet(X, y, alpha = 0, lambda = grid)
# Obtain ridge regression estimates of beta_1
beta_hat_1_l1 <- coef(reg_ridge)[2, 1]
beta_hat_1_l2 <- coef(reg_ridge)[2, 2]
# Store estimates of beta_1 for each lambda
estimates_list[[paste0("lambda = ", grid[1])]][i] <- beta_hat_1_l1
estimates_list[[paste0("lambda = ", grid[2])]][i] <- beta_hat_1_l2
}
# Combine estimates into a single data.table for plotting
dt_plot <- data.table(
Estimate = unlist(estimates_list),
Lambda = rep(names(estimates_list), each = s)
)
# Ensure Lambda is a factor with levels in the desired order
dt_plot[, Lambda := factor(Lambda, levels = paste0("lambda = ", grid))]
# Plot the sampling distributions
ggplot(dt_plot, aes(x = Estimate, fill = Lambda)) +
geom_density(alpha = 0.7) +
geom_vline(xintercept = beta_1, color = color_1, linetype = "dashed", size = 1) +
labs(title = expression(paste("Sampling Distribution of ", hat(beta)[1*R]))) +
scale_fill_manual(
values = c("#FFA07A", "#20B2AA"),
labels = c(grid[1], grid[2]),
name = expression(lambda)
) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold"),
plot.subtitle = element_text(hjust = 0.5, color = color_1, size = 25),
legend.background = element_rect(fill = "black"),
legend.text = element_text(color = color_1, size = 15),
legend.title = element_text(color = color_2, size = 20)
) +
xlab(expression(hat(beta)[1*R])) +
ylab("Density")
Indeed, larger values of \(\lambda\) (more shrinkage) give a more downward-biased parameter estimate. Recall, this is called attenuation bias. Moreover, the sampling distribution of \(\widehat{\beta}_{1R}\) is more tightly centered around its mean (has less variance) in the case of \(\lambda = 20\) than in the case of \(\lambda = 1\).
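To put rough numbers on this, we can summarize the two simulated sampling distributions directly (a small add-on using the estimates_list object created above; exact values depend on the random draws):
# Mean of the beta_1 estimates: both should sit below the true value of 2,
# and further below it for lambda = 20 (more attenuation bias)
sapply(estimates_list, mean)
# Standard deviation of the estimates: smaller for lambda = 20 (less variance)
sapply(estimates_list, sd)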
Now, we will estimate the lasso regression model. This is extremely similar to estimating a ridge regression model, except now we specify \(\alpha = 1\) within the glmnet() function. Recall lasso regression solves the problem \[\begin{align}
\underset{\widehat{\boldsymbol{\beta}}_{L}}{\arg \min} \ \left[ \left( \boldsymbol{y} - X \boldsymbol{\widehat{\beta}}_{L} \right)^\prime \left( \boldsymbol{y} - X \boldsymbol{\widehat{\beta}}_{L} \right) + \lambda \left\| \widehat{\boldsymbol{\beta}}_{L} \right\|_1 \right]
\end{align}\] where \(\left\| \widehat{\boldsymbol{\beta}}_{L} \right\|_1\) is the L1-norm of \(\widehat{\boldsymbol{\beta}}_{L}\).
glmnet()
# Create feature matrix
X_train <- model.matrix(Salary ~ ., dt_train)[, -1]
# Create outcome vector
y_train <- dt_train[, Salary]
# Grid of lambda values
grid <- 10^seq(10, -2, length = 100)
# Run lasso regression
reg_lasso <- glmnet(X_train, y_train, alpha = 1, lambda = grid)
# Obtain lasso regression coefficients for the 1st value of lambda (lambda = 10^10)
coef(reg_lasso)[, 1]
## (Intercept) AtBat Hits HmRun Runs RBI
## 514.9359 0.0000 0.0000 0.0000 0.0000 0.0000
## Walks Years CAtBat CHits CHmRun CRuns
## 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## Errors NewLeagueN
## 0.0000 0.0000
# Obtain lasso regression coefficients for the 50th value of lambda
coef(reg_lasso)[, 50]
## (Intercept) AtBat Hits HmRun Runs RBI
## 514.9359 0.0000 0.0000 0.0000 0.0000 0.0000
## Walks Years CAtBat CHits CHmRun CRuns
## 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## Errors NewLeagueN
## 0.0000 0.0000
# Obtain lasso regression coefficients for the 100th value of lambda (lambda = 10^-2)
coef(reg_lasso)[, 100]
## (Intercept) AtBat Hits HmRun Runs RBI
## 326.2424567 -1.6959250 4.5968967 1.5001535 -0.8761376 2.0479429
## Walks Years CAtBat CHits CHmRun CRuns
## 5.1404726 -14.8068162 -0.1169668 0.2299028 -0.6167888 1.1935473
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.5498061 -0.7408376 -31.7484036 -129.0429670 0.1472011 0.3638426
## Errors NewLeagueN
## -4.1016873 27.1372056
# OLS regression
lm(Salary ~ ., data = dt_train)
##
## Call:
## lm(formula = Salary ~ ., data = dt_train)
##
## Coefficients:
## (Intercept) AtBat Hits HmRun Runs RBI
## 325.3464 -1.6849 4.6348 1.7999 -0.9987 1.9476
## Walks Years CAtBat CHits CHmRun CRuns
## 5.1475 -14.4570 -0.1200 0.1967 -0.7615 1.2478
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 0.6155 -0.7474 -32.8734 -129.0135 0.1473 0.3649
## Errors NewLeagueN
## -4.0849 28.5485
We see that in the case of \(\lambda = 10^{10}\), where there is a large amount of shrinkage, the slope estimates are all exactly zero. This is the big selling point of lasso regression: it performs a form of model selection because some of the covariates are effectively removed from the model. Ridge regression does not do this; as we saw in the prior section, even in the case of \(\lambda = 10^{10}\) none of the coefficients were identically zero. On the flip side, when \(\lambda = 10^{-2}\), so there is not much shrinkage at all, the lasso regression estimates are extremely similar to the OLS estimates.
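One quick way to see this variable-selection behavior across the whole path is the df component of the fitted glmnet object, which records the number of nonzero coefficients at each \(\lambda\) (a small sketch using the reg_lasso fit from above):
# Number of nonzero coefficients at each of the 100 lambda values,
# ordered from the largest lambda (heavy shrinkage) to the smallest
reg_lasso$df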
Now, we’ll use glmnet
’s handy cv.glmnet()
function to perform k-Fold CV on our training set to select the optimal
\(\lambda\) value. We’ll use \(k = 10\).
# 10 fold CV
cv.glmnet(X_train, y_train, alpha = 1, lambda = grid, nfolds = 10)
##
## Call: cv.glmnet(x = X_train, y = y_train, lambda = grid, nfolds = 10, alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 2.01 81 109021 14339 14
## 1se 57.22 69 119426 19343 7
We see the chosen value for lasso is \(\lambda = 2.01\), giving an estimated test MSE of \(109,021\), very similar to that of the chosen ridge regression model.
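If we also wanted to see which covariates the selected lasso model actually keeps, we could store the CV fit and extract its coefficients at lambda.min; a brief sketch (cv_lasso_fit is a name introduced here, and refitting re-randomizes the folds, so the selected \(\lambda\) may differ slightly):
# Refit the CV lasso, then inspect the coefficients at the lambda minimizing CV error
cv_lasso_fit <- cv.glmnet(X_train, y_train, alpha = 1, lambda = grid, nfolds = 10)
coef(cv_lasso_fit, s = "lambda.min")   # many entries will be exactly zero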
In the prior section, we saw that the “best” lasso model performs similarly to the “best” ridge model. We'll just use the ridge regression model moving forward. So, let us now evaluate the ridge regression model on our test set to see how its test MSE compares to the estimate given by k-Fold CV. Recall that the ridge regression model chosen by 10-Fold CV has \(\lambda = 14\).
# Penalization parameter
lambda <- 14
# Create feature matrix
X_test <- model.matrix(Salary ~ ., dt_test)[, -1]
# Create outcome vector
y_test <- dt_test[, Salary]
# Run ridge regression
reg_ridge <- glmnet(X_train, y_train, alpha = 0, lambda = lambda)
# Predict on the test data
predictions <- predict(reg_ridge, s = lambda, newx = X_test)
# Calculate MSE
mse <- mean((y_test - predictions)^2)
# Show MSE
mse
## [1] 99342.42
We see that the ridge regression with \(\lambda = 14\) produces a test MSE of \(99,342\), roughly \(10,000\) lower than the estimate of the test MSE given by 10-Fold CV. The two won't match exactly: the CV estimate is itself noisy, and our test set is fairly small, so some discrepancy between the two numbers is to be expected.
We’ll now estimate a ridge and lasso logistic regression model. The
first thing we need to do is convert our outcome to a binary variable.
Recall the median Salary
is \(425\). Let’s reconstruct this
Salary
column to be one if Salary
is above
\(425\) and zero otherwise.
# Convert Salary to a binary variable in the training and testing data
dt_train[, Salary := ifelse(Salary > 425, 1, 0)]
dt_test[, Salary := ifelse(Salary > 425, 1, 0)]
Estimating a ridge or lasso logistic regression is very similar to
estimating the usual ridge regression except we now specify the
family = "binomial"
argument within the
glmnet()
function. This is similar to what we did when
estimating a logistic regression model using glm()
.
library(glmnet)
# Create feature matrix
X_train <- model.matrix(Salary ~ ., dt_train)[, -1]
# Create outcome vector
y_train <- dt_train[, Salary]
# Grid of lambda values
grid <- 10^seq(10, -2, length = 100)
# Perform 10-fold CV for ridge
cv_ridge <- cv.glmnet(X_train, y_train, family = "binomial", alpha = 0, lambda = grid, nfolds = 10)
# Perform 10-fold CV for lasso
cv_lasso <- cv.glmnet(X_train, y_train, family = "binomial", alpha = 1, lambda = grid, nfolds = 10)
# Extract lambda.min for ridge and lasso
best_lambda_ridge <- cv_ridge$lambda.min
best_lambda_lasso <- cv_lasso$lambda.min
# Make predictions using the cross-validated ridge model
pred_probs_ridge <- predict(cv_ridge, s = best_lambda_ridge, newx = X_train, type = "response")
# Convert probabilities to binary predictions for ridge
predictions_ridge <- ifelse(pred_probs_ridge > 0.5, 1, 0)
# Calculate accuracy for ridge
accuracy_ridge <- mean(predictions_ridge == y_train)
# Make predictions using the cross-validated lasso model
pred_probs_lasso <- predict(cv_lasso, s = best_lambda_lasso, newx = X_train, type = "response")
# Convert probabilities to binary predictions for lasso
predictions_lasso <- ifelse(pred_probs_lasso > 0.5, 1, 0)
# Calculate accuracy for lasso
accuracy_lasso <- mean(predictions_lasso == y_train)
# Show results
cat("Best Lambda for Ridge:", best_lambda_ridge, "\n")
## Best Lambda for Ridge: 0.1232847
cat("Accuracy for Ridge:", accuracy_ridge, "\n")
## Accuracy for Ridge: 0.7937743
cat("Best Lambda for Lasso:", best_lambda_lasso, "\n")
## Best Lambda for Lasso: 0.01747528
cat("Accuracy for Lasso:", accuracy_lasso, "\n")
## Accuracy for Lasso: 0.7782101
We see the best ridge model, with \(\lambda = 0.1233\), achieves a higher training accuracy than the best lasso model, with \(\lambda = 0.0175\). Let's now evaluate the ridge logistic regression model on the test set.
# Penalization parameter
lambda <- best_lambda_ridge
# Create feature matrix
X_test <- model.matrix(Salary ~ ., dt_test)[, -1]
# Create outcome vector
y_test <- dt_test[, Salary]
# Run ridge logistic regression
reg_ridge_logit <- glmnet(X_train, y_train, family = "binomial", alpha = 0, lambda = lambda)
# Predict on the test data
pred_probs <- predict(reg_ridge_logit, s = lambda, newx = X_test, type = "response")
# Convert probabilities to binary predictions
predictions <- ifelse(pred_probs > 0.5, 1, 0)
# Calculate accuracy rate
accuracy <- mean(y_test == predictions)
# Show accuracy rate
accuracy
## [1] 0.7846154
We see our test accuracy is \(78.46\%\), similar to the ridge model's accuracy on the training data.
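Accuracy alone can hide an imbalance between false positives and false negatives, so a natural follow-up (a small sketch using the y_test and predictions objects from above) is a confusion matrix:
# Cross-tabulate actual versus predicted Salary classes on the test set
table(Actual = y_test, Predicted = as.vector(predictions))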