The code below is nothing we haven’t seen before except that it includes the package margins in pacman::p_load(). The margins package contains the margins() function, which is an easy way to compute the average marginal effect (AME).
# Clear environment, plot pane, and console
rm(list = ls())
graphics.off()
cat("\014")
# Set working directory
setwd("C:/Users/wbras/OneDrive/Documents/Desktop/UA/Fall_2024/ECON_418-518/ECON_418-518_Lecture_Slides/ECON_418-518_Lecture_Slides_E/ECON_418-518_E_17_BRM/ECON_418-518_E_17_BRM_Code")
# Install pacman
if (!require(pacman)) install.packages("pacman")
# Load packages
pacman::p_load(data.table, sandwich, lmtest, margins, ggplot2, gridExtra)
# Colors used for plot (https://www.rapidtables.com/web/color/RGB_Color.html)
color_1 <- "#00FF00"
color_2 <- "#ADD8E6"
# Disable scientific notation
options(scipen = 999)
wooldridge Dataset
We will use the mroz dataset in the wooldridge package, which contains data on 753 women. This dataset has the binary outcome variable inlf, which is equal to one if the individual is in the labor force and zero otherwise. We also have the covariates educ for education level, exper for labor market experience, age for the individual’s age, husage for her husband’s age, kidslt6 for the number of kids she has aged less than six, and kidsge6 for the number of kids she has between ages six and eighteen. Let’s do some basic data exploration as usual.
# Load in mroz dataset from wooldridge as a data table
dt <- data.table(wooldridge::mroz)
# Show the first few rows of the data
head(dt)
# Show the last few rows of the data
tail(dt)
# Filter out unwanted columns
dt <- dt[, .(inlf, educ, exper, age, husage, kidslt6, kidsge6)]
# Show how many NA values there are in the data
paste("There are", sum(is.na(dt)), "NAs in the dataset")
## [1] "There are 0 NAs in the dataset"
# Get summary statistics
summary(dt)
## inlf educ exper age
## Min. :0.0000 Min. : 5.00 Min. : 0.00 Min. :30.00
## 1st Qu.:0.0000 1st Qu.:12.00 1st Qu.: 4.00 1st Qu.:36.00
## Median :1.0000 Median :12.00 Median : 9.00 Median :43.00
## Mean :0.5684 Mean :12.29 Mean :10.63 Mean :42.54
## 3rd Qu.:1.0000 3rd Qu.:13.00 3rd Qu.:15.00 3rd Qu.:49.00
## Max. :1.0000 Max. :17.00 Max. :45.00 Max. :60.00
## husage kidslt6 kidsge6
## Min. :30.00 Min. :0.0000 Min. :0.000
## 1st Qu.:38.00 1st Qu.:0.0000 1st Qu.:0.000
## Median :46.00 Median :0.0000 Median :1.000
## Mean :45.12 Mean :0.2377 Mean :1.353
## 3rd Qu.:52.00 3rd Qu.:0.0000 3rd Qu.:2.000
## Max. :60.00 Max. :3.0000 Max. :8.000
# Get the number of observations
paste("There are", nrow(dt), "obsrevations in the dataset")
## [1] "There are 753 obsrevations in the dataset"
Let’s convert kidslt6 to a binary explanatory variable which is one if the woman has kids aged less than six and zero otherwise. We’ll follow a similar process for kidsge6.
# Convert kidslt6 to indicator variable
dt[, kidslt6 := ifelse(kidslt6 > 0, 1, 0)]
# Convert kidsge6 to indicator variable
dt[, kidsge6 := ifelse(kidsge6 > 0, 1, 0)]
Now we’ll estimate an LPM (linear probability model) using the outcome inlf and the covariates included in dt.
# LPM
reg <- lm(inlf ~ educ + exper + I(exper^2) + age + husage + kidslt6 + kidsge6, dt)
# Obtain HC1 standard errors
HC1 <- vcovHC(reg, type = "HC1")
# Show regression summary using HC1 standard errors
coeftest(reg, vcov = HC1)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.67290283 0.16221018 4.1483 0.000037348692269 ***
## educ 0.03064770 0.00694784 4.4111 0.000011800458838 ***
## exper 0.03912839 0.00590745 6.6236 0.000000000067160 ***
## I(exper^2) -0.00055846 0.00019485 -2.8661 0.0042721 **
## age -0.01632657 0.00432044 -3.7789 0.0001701 ***
## husage -0.00107646 0.00423600 -0.2541 0.7994714
## kidslt6 -0.33046516 0.04738533 -6.9740 0.000000000006803 ***
## kidsge6 0.01516378 0.03678687 0.4122 0.6803069
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that an additional year of education increases the probability that a woman will be in the labor force by 0.03, i.e., increasing a woman’s education level by a year is associated with a 3 percentage point increase in the probability of a woman entering the labor force. This effect is significant at all conventional levels as indicated by its p-value.
Moreover, women with children aged less than six are more likely to not be in the labor force. We see that having a child aged less than six decreases the likelihood that a woman will be in the labor force by roughly 33 percentage points. This effect is extremely large and statistically significant at all conventional levels.
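Because educ enters the LPM linearly, its estimated effect is the same for every individual: the fitted probability changes by exactly the educ coefficient for a one-year change in education, holding everything else fixed. Here is a quick sanity check (a sketch using the reg object above; the two profiles are hypothetical and chosen purely for illustration).
# Two hypothetical profiles that differ only by one year of education
profiles <- data.table(educ = c(12, 13), exper = 10, age = 43, husage = 45, kidslt6 = 0, kidsge6 = 1)
# Predicted probabilities from the LPM
p_hat <- predict(reg, newdata = profiles)
# The difference equals the educ coefficient (about 0.03)
p_hat[2] - p_hat[1]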
Recall that the logit model uses the logistic CDF while the probit model uses the standard normal CDF. Let’s graph these distribution functions to see what they look like.
# Generate dot product of x and theta values
x_theta <- seq(-10, 10, by = 0.1)
# Generate G(x) where G is the logistic CDF
G_x_theta <- exp(x_theta) / (1 + exp(x_theta))
# Create data table for x_theta and G(x_theta)
dt_logit_cdf <- data.table(x_theta = x_theta, G_x_theta = G_x_theta)
# Plot the logistic CDF using ggplot2
logistic_cdf <- ggplot(dt_logit_cdf, aes(x = x_theta, y = G_x_theta)) +
geom_line(color = color_1, size = 2) +
labs(title = "Logistic CDF",
x = expression(x * minute * theta),
y = expression(Lambda(x * minute * theta))) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 8),
axis.title = element_text(color = color_2, size = 10),
plot.title = element_text(hjust = 0.5, color = color_2, size = 15, face = "bold")
) +
scale_y_continuous(breaks = c(0, 0.5, 1)) +
scale_x_continuous(breaks = c(-10, 0, 10))
# Generate dot product of x and theta values
x_theta <- seq(-10, 10, by = 0.1)
# Generate G(x) where G is the standard normal CDF
G_x_theta <- pnorm(x_theta)
# Create data table for x_theta and G(x_theta)
dt_sn_cdf <- data.table(x_theta = x_theta, G_x_theta = G_x_theta)
# Plot the standard normal CDF using ggplot2
sn_cdf <- ggplot(dt_sn_cdf, aes(x = x_theta, y = G_x_theta)) +
geom_line(color = color_1, size = 2) +
labs(title = "Standard Normal CDF",
x = expression(x * minute * theta),
y = expression(Phi(x * minute * theta))) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 8),
axis.title = element_text(color = color_2, size = 10),
plot.title = element_text(hjust = 0.5, color = color_2, size = 15, face = "bold")
) +
scale_y_continuous(breaks = c(0, 0.5, 1)) +
scale_x_continuous(breaks = c(-10, 0, 10))
# Combine the plots side by side
grid.arrange(logistic_cdf, sn_cdf, nrow = 1)
We see that the y-axis ranges from 0 to 1, as we would expect, because a CDF outputs a probability which must lie in the unit interval. Also, notice that for the logistic CDF y-axis label I replaced \(G\) with \(\Lambda\) and for the standard normal CDF y-axis label I replaced \(G\) with \(\Phi\); this is the standard notation for these respective distribution functions. Moreover, as an example, for \(\boldsymbol{x}^\prime \boldsymbol{\theta} = 0\) we have \(\Lambda (\boldsymbol{x}^\prime \boldsymbol{\theta}) = \Phi(\boldsymbol{x}^\prime \boldsymbol{\theta}) = 0.5\), meaning 50% of each distribution’s mass lies below zero. The reason for this is that the PDFs of both the logistic and standard normal distributions are symmetric around zero.
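We can confirm this point directly in R: both distribution functions evaluate to 0.5 at zero.
# Logistic CDF evaluated at zero
exp(0) / (1 + exp(0))
# Standard normal CDF evaluated at zero
pnorm(0)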
To estimate a logistic regression model in R, we use the glm() (Generalized Linear Model) function. One parameter of this function is called family, and we set it equal to binomial to let R know that we wish to estimate a logistic regression model. Let’s use this function with the same outcome and covariates as the LPM above.
# Estimate logistic regression model
logit <- glm(inlf ~ educ + exper + I(exper^2) + age + husage + kidslt6 + kidsge6, data = dt, family = binomial)
# Summary of logit model
summary(logit)
##
## Call:
## glm(formula = inlf ~ educ + exper + I(exper^2) + age + husage +
## kidslt6 + kidsge6, family = binomial, data = dt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.903828 0.893678 1.011 0.311846
## educ 0.170704 0.039647 4.306 0.0000166513189 ***
## exper 0.201650 0.031824 6.336 0.0000000002351 ***
## I(exper^2) -0.002905 0.001015 -2.861 0.004229 **
## age -0.086845 0.024211 -3.587 0.000335 ***
## husage -0.006720 0.023306 -0.288 0.773078
## kidslt6 -1.710658 0.252273 -6.781 0.0000000000119 ***
## kidsge6 0.057484 0.208894 0.275 0.783176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1029.75 on 752 degrees of freedom
## Residual deviance: 820.39 on 745 degrees of freedom
## AIC: 836.39
##
## Number of Fisher Scoring iterations: 4
# Get the log-likelihood
logit_log_likelihood <- logLik(logit)
# Show the log-likelihood
logit_log_likelihood
## 'log Lik.' -410.1935 (df=8)
# Number of estimated parameters
k <- length(coef(logit))
# AIC
aic <- round(2 * k - 2 * as.numeric(logit_log_likelihood), 2)
# Show the AIC
aic
## [1] 836.39
The summary output looks fairly similar to when we estimate a model via OLS, but with a few important differences:
The null deviance represents how well an intercept only model predicts the outcome variable. It is given by \(-2(\text{Log-likelihood of intercept only model})\).
The residual deviance represents how well our estimated model predicts the outcome variable. It is given by \(-2( \text{Log-likelihood of estimated model})\). A smaller residual deviance than the null deviance means our estimated model predicts our outcome better than an intercept only model.
The Akaike Information Criterion (AIC) is a measure of how good our model is. It is given by \(2k - 2(\text{Log-likelihood of estimated model})\), where \(k\) is the number of estimated parameters. The \(2k\) term penalizes us for adding more covariates (it guards against overfitting, which we will discuss when we get to machine learning), and the latter term rewards model fit: it becomes smaller (better) when the log-likelihood is higher. In sum, we want a model with a low AIC. Often in practice you will estimate multiple models and compare their AICs to select the one to actually use. This is called model selection, which is a salient topic in machine learning. A quick cross-check of these quantities is sketched after this list.
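To make these definitions concrete, here is the promised cross-check (a sketch assuming the dt and logit objects created above): recompute the null deviance, residual deviance, and AIC from the log-likelihoods and compare them to the summary(logit) output.
# Intercept-only (null) logit model
logit_null <- glm(inlf ~ 1, data = dt, family = binomial)
# Null deviance: -2 * (log-likelihood of the intercept-only model)
-2 * as.numeric(logLik(logit_null))
# Residual deviance: -2 * (log-likelihood of the estimated model)
-2 * as.numeric(logLik(logit))
# AIC: 2k - 2 * (log-likelihood of the estimated model)
2 * length(coef(logit)) - 2 * as.numeric(logLik(logit))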
We can also get HCSE as usual. Let’s do that for robustness now.
# Obtain HC1 standard errors
HC1 <- vcovHC(logit, type = "HC1")
# Show regression summary using HC1 standard errors
coeftest(logit, vcov = HC1)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.9038278 0.8815851 1.0252 0.3052545
## educ 0.1707040 0.0402956 4.2363 0.0000227234313 ***
## exper 0.2016500 0.0327302 6.1610 0.0000000007230 ***
## I(exper^2) -0.0029048 0.0010431 -2.7846 0.0053591 **
## age -0.0868450 0.0235296 -3.6909 0.0002235 ***
## husage -0.0067203 0.0229643 -0.2926 0.7697977
## kidslt6 -1.7106582 0.2648639 -6.4586 0.0000000001057 ***
## kidsge6 0.0574842 0.2077620 0.2767 0.7820237
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficients here CANNOT be interpreted as marginal effects. Rather, they represent changes in the log-odds (raw logits), which are not directly meaningful to us. Instead, we can use the margins() function contained in the margins package to get the average marginal effect (AME).
# Get the marginal effects
me <- margins(logit)
# Show the marginal effects
summary(me)
We see these AMEs are actually quite similar to those from the LPM. Moreover, this output allows us to construct confidence intervals and, consequently, conduct hypothesis tests of whether the expected marginal effect is equal to zero. As an example, on average across all individuals in our sample, having a child aged less than six decreases the likelihood a woman will be in the labor force by roughly 31 percentage points.
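To see roughly what margins() is doing for a continuous covariate, here is a minimal sketch (assuming the logit object above): evaluate the logistic density at each observation’s fitted index, average it, and multiply by the coefficient. For educ, which enters the model only linearly, this should reproduce the AME reported by margins() up to numerical precision (covariates such as exper, which also appear squared, additionally require the chain rule, which margins() handles for us).
# Fitted index x'theta_hat for every observation
x_theta_hat <- predict(logit, type = "link")
# AME of educ: average logistic density at the index times the educ coefficient
mean(dlogis(x_theta_hat)) * coef(logit)["educ"]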
To estimate a probit regression model in R, we again use the glm() (Generalized Linear Model) function, but now we set the family parameter equal to binomial(link = "probit") to let R know that we wish to estimate a probit regression model. Let’s use this function with the same outcome and covariates as the LPM above.
# Estimate probit regression model
probit <- glm(inlf ~ educ + exper + I(exper^2) + age + husage + kidslt6 + kidsge6, data = dt, family = binomial(link = "probit"))
# Summary of probit model
summary(probit)
##
## Call:
## glm(formula = inlf ~ educ + exper + I(exper^2) + age + husage +
## kidslt6 + kidsge6, family = binomial(link = "probit"), data = dt)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5197537 0.5305941 0.980 0.327299
## educ 0.1023297 0.0233373 4.385 0.00001160837605 ***
## exper 0.1207894 0.0186502 6.477 0.00000000009383 ***
## I(exper^2) -0.0017284 0.0006001 -2.880 0.003973 **
## age -0.0527930 0.0143210 -3.686 0.000227 ***
## husage -0.0029324 0.0138234 -0.212 0.832005
## kidslt6 -1.0186390 0.1480194 -6.882 0.00000000000591 ***
## kidsge6 0.0469525 0.1234680 0.380 0.703737
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1029.7 on 752 degrees of freedom
## Residual deviance: 819.6 on 745 degrees of freedom
## AIC: 835.6
##
## Number of Fisher Scoring iterations: 4
# Get the log-likelihood
probit_log_likelihood <- logLik(probit)
# Show the log-likelihood
probit_log_likelihood
## 'log Lik.' -409.8005 (df=8)
# Number of estimated parameters
k <- length(coef(probit))
# AIC
aic <- round(2 * k - 2 * as.numeric(probit_log_likelihood), 2)
# Show the AIC
aic
## [1] 835.6
We can also get HCSE as usual. Let’s do that for robustness now.
# Obtain HC1 standard errors
HC1 <- vcovHC(probit, type = "HC1")
# Show regression summary using HC1 standard errors
coeftest(probit, vcov = HC1)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.51975374 0.52070625 0.9982 0.3181966
## educ 0.10232972 0.02356009 4.3434 0.00001403262649 ***
## exper 0.12078939 0.01922635 6.2825 0.00000000033319 ***
## I(exper^2) -0.00172843 0.00061785 -2.7975 0.0051505 **
## age -0.05279304 0.01378123 -3.8308 0.0001277 ***
## husage -0.00293238 0.01349105 -0.2174 0.8279298
## kidslt6 -1.01863895 0.15526448 -6.5607 0.00000000005357 ***
## kidsge6 0.04695248 0.12257206 0.3831 0.7016751
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficients here CANNOT be interpreted as marginal effects. Rather, they represent changes in the latent index (z-scores), which are not directly meaningful to us. Instead, we can use the margins() function contained in the margins package to get the average marginal effect (AME).
# Get the marginal effects
me <- margins(probit)
# Show the marginal effects
summary(me)
We see these AMEs are almost identical to those from the logistic regression model. Moreover, this output allows us to construct confidence intervals and, consequently, conduct hypothesis tests of whether the expected marginal effect is equal to zero.
As an example, on average across all individuals in our sample, having a child aged less than six decreases the likelihood a woman will be in the labor force by roughly 31 percentage points.
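To make that comparison concrete, one option (a sketch assuming the logit and probit objects above, and that summary() of a margins object returns a data frame with factor and AME columns) is to put the two sets of AMEs side by side.
# AME summaries from both models
ame_logit <- summary(margins(logit))
ame_probit <- summary(margins(probit))
# Side-by-side comparison of the average marginal effects
data.table(variable = ame_logit$factor, AME_logit = round(ame_logit$AME, 4), AME_probit = round(ame_probit$AME, 4))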
Let’s manually set up the log-likelihood function and use R’s optim() function to numerically solve for the probit parameter estimates.
# Log-likelihood function for the probit model
probit_log_likelihood <- function(params)
{
# Extract parameters for each variable
theta_0 <- params[1]
theta_educ <- params[2]
theta_exper <- params[3]
theta_exper2 <- params[4]
theta_age <- params[5]
theta_husage <- params[6]
theta_kidslt6 <- params[7]
theta_kidsge6 <- params[8]
# Dot product of data and parameters
x_prime_theta <- theta_0 +
theta_educ * dt[, educ] +
theta_exper * dt[, exper] +
theta_exper2 * (dt[, exper]^2) +
theta_age * dt[, age] +
theta_husage * dt[, husage] +
theta_kidslt6 * dt[, kidslt6] +
theta_kidsge6 * dt[, kidsge6]
# Phi(x_prime_theta)
p <- pnorm(x_prime_theta)
# Log-likelihood
log_likelihood <- sum(dt[, inlf] * log(p) + (1 - dt[, inlf]) * log(1 - p))
# Return the negative log-likelihood (optim() minimizes, which is equivalent to maximizing the log-likelihood)
return(-log_likelihood)
}
# Initial parameter guesses
initial_params <- rep(0, 8)
# Optimize the log-likelihood function
result <- optim(initial_params, probit_log_likelihood, method = "BFGS", control = list(maxit = 10000, reltol = 1e-8))
# Output parameter estimates
probit_estimates <- result$par
names(probit_estimates) <- c("Intercept", "educ", "exper", "exper^2", "age", "husage", "kidslt6", "kidsge6")
print(probit_estimates)
## Intercept educ exper exper^2 age husage
## 0.600605073 0.103568580 0.104215338 -0.001186213 -0.054157720 -0.002021734
## kidslt6 kidsge6
## -1.023348574 0.051761890
# Minimized negative log-likelihood (the maximized log-likelihood is its negative)
result$value
## [1] 410.2277
# Number of iterations to converge
result$counts
## function gradient
## 171 15
# Did optim converge (0 means yes)
result$convergence
## [1] 0
We see the parameter estimates are very similar to those from R’s glm() function, but not exactly the same. This is because optim() (using BFGS here) and glm() (which uses Fisher scoring, i.e., iteratively reweighted least squares) rely on different underlying numerical solvers, so they stop at slightly different points.
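If we want optim() to land closer to the glm() solution, one option (a sketch reusing the probit_log_likelihood function and probit object defined above) is to start the search at the glm() coefficients and tighten the convergence tolerance; the minimized negative log-likelihood should then essentially match the one implied by glm().
# Re-run the optimizer starting from the glm() probit estimates with a tighter tolerance
result_2 <- optim(unname(coef(probit)), probit_log_likelihood, method = "BFGS", control = list(maxit = 10000, reltol = 1e-12))
# Compare the minimized negative log-likelihood to the one from glm()
c(optim = result_2$value, glm = -as.numeric(logLik(probit)))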
Let’s now use the probit model to carry out a likelihood ratio test. We will test the hypothesis \(H_0: \theta_7 = \theta_8 = 0\) versus \(H_A: \theta_7 \neq 0 \ \text{or} \ \theta_8 \neq 0\) at the \(\alpha = 0.05\) level. Essentially, this tests whether having children of any age is a statistically significant determinant of a woman being in the labor force.
To carry out this test we estimate the unrestricted and restricted models, compute the likelihood ratio statistic \(LR = 2(\ell_{ur} - \ell_{r})\), where \(\ell_{ur}\) and \(\ell_{r}\) are the maximized log-likelihoods of the unrestricted and restricted models, and compare it to the chi-squared critical value with \(q = 2\) degrees of freedom:
# Number of restrictions
q <- 2
# Estimate unrestricted probit regression model
probit_ur <- glm(inlf ~ educ + exper + I(exper^2) + age + husage + kidslt6 + kidsge6, data = dt, family = binomial(link = "probit"))
# Get the log-likelihood of the unrestricted probit regression model
probit_ur_log_likelihood <- logLik(probit_ur)
# Estimate restricted probit regression model
probit_r <- glm(inlf ~ educ + exper + I(exper^2) + age + husage, data = dt, family = binomial(link = "probit"))
# Get the log-likelihood of the restricted probit regression model
probit_r_log_likelihood <- logLik(probit_r)
# Compute the likelihood ratio test statistic
lr <- 2 * (probit_ur_log_likelihood - probit_r_log_likelihood)
# Show the likelihood ratio test statistic
paste("Likelihood ratio test statistic:", round(lr, 2))
## [1] "Likelihood ratio test statistic: 51.56"
# Calculate the chi-squared critical value with q degrees of freedom
crit <- qchisq(0.95, df = q)
# Show the critical value
paste("Chi-squared critical value with q =", q, "degrees of freedom:", round(crit, 2))
## [1] "Chi-squared critical value with q = 2 degrees of freedom: 5.99"
# Print results
paste("Reject Null Hypothesis:", lr > crit)
## [1] "Reject Null Hypothesis: TRUE"
Since we reject this null hypothesis, at least one of \(\theta_7\) or \(\theta_8\) is statistically different from zero at the 5% significance level.
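As a cross-check, the lmtest package we loaded at the top provides lrtest(), which carries out the same likelihood ratio test and reports a p-value directly; we can also compute the p-value of our hand-calculated statistic ourselves.
# Built-in likelihood ratio test from the lmtest package
lrtest(probit_r, probit_ur)
# p-value of the hand-computed test statistic
pchisq(as.numeric(lr), df = q, lower.tail = FALSE)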