Preliminaries

The code below is nothing we haven’t seen before except for including the margins package in pacman::p_load(). The margins package contains the margins() function, which gives an easy way to compute average marginal effects (AMEs).

# Clear environment, plot pane, and console
rm(list = ls())
graphics.off()
cat("\014")
# Set working directory
setwd("C:/Users/wbras/OneDrive/Documents/Desktop/UA/Fall_2024/ECON_418-518/ECON_418-518_Lecture_Slides/ECON_418-518_Lecture_Slides_E/ECON_418-518_E_17_BRM/ECON_418-518_E_17_BRM_Code")

# Install pacman
if (!require(pacman)) install.packages("pacman")

# Load packages
pacman::p_load(data.table, sandwich, lmtest, margins, ggplot2, gridExtra, wooldridge)

# Colors used for plot (https://www.rapidtables.com/web/color/RGB_Color.html)
color_1 <- "#00FF00"
color_2 <- "#ADD8E6"

# Disable scientific notation
options(scipen = 999)

Data Analysis with a wooldridge Dataset

We will use the mroz dataset in the wooldridge package, which contains data on 753 women. This dataset has the binary outcome variable inlf, which equals one if the individual is in the labor force and zero otherwise. We also have the covariates educ for education level, exper for labor market experience, age for the woman’s age, husage for her husband’s age, kidslt6 for the number of kids she has aged less than six, and kidsge6 for the number of kids she has between ages six and eighteen. Let’s do some basic data exploration as usual.

# Load in mroz dataset from wooldridge as a data table
dt <- data.table(wooldridge::mroz)

# Show the first few rows of the data
head(dt)
# Show the last few rows of the data
tail(dt)
# Filter out unwanted columns
dt <- dt[, .(inlf, educ, exper, age, husage, kidslt6, kidsge6)]

# Show how many NA values there are in the data
paste("There are", sum(is.na(dt)), "NAs in the dataset")
## [1] "There are 0 NAs in the dataset"
# Get summary statistics
summary(dt)
##       inlf             educ           exper            age       
##  Min.   :0.0000   Min.   : 5.00   Min.   : 0.00   Min.   :30.00  
##  1st Qu.:0.0000   1st Qu.:12.00   1st Qu.: 4.00   1st Qu.:36.00  
##  Median :1.0000   Median :12.00   Median : 9.00   Median :43.00  
##  Mean   :0.5684   Mean   :12.29   Mean   :10.63   Mean   :42.54  
##  3rd Qu.:1.0000   3rd Qu.:13.00   3rd Qu.:15.00   3rd Qu.:49.00  
##  Max.   :1.0000   Max.   :17.00   Max.   :45.00   Max.   :60.00  
##      husage         kidslt6          kidsge6     
##  Min.   :30.00   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:38.00   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :46.00   Median :0.0000   Median :1.000  
##  Mean   :45.12   Mean   :0.2377   Mean   :1.353  
##  3rd Qu.:52.00   3rd Qu.:0.0000   3rd Qu.:2.000  
##  Max.   :60.00   Max.   :3.0000   Max.   :8.000
# Get the number of observations
paste("There are", nrow(dt), "obsrevations in the dataset")
## [1] "There are 753 obsrevations in the dataset"

Let’s convert kidslt6 to a binary explanatory variable which is one if the woman has kids aged less than six and zero otherwise. We’ll follow a similar process for kidsge6.

# Convert kidslt6 to indicator variable
dt[, kidslt6 := ifelse(kidslt6 > 0, 1, 0)]

# Convert kidsge6 to indicator variable 
dt[, kidsge6 := ifelse(kidsge6 > 0, 1, 0)]

Binary Response Models (BRM)

Linear Probability Model (LPM)

Now we’ll estimate a LPM using outcome inlf and the covariates included in dt.

# LPM
reg <- lm(inlf ~ educ + exper + I(exper^2) + age + husage + kidslt6 + kidsge6, dt)

# Obtain HC1 standard errors
HC1 <- vcovHC(reg, type = "HC1")

# Show regression summary using HC1 standard errors
coeftest(reg, vcov = HC1)
## 
## t test of coefficients:
## 
##                Estimate  Std. Error t value          Pr(>|t|)    
## (Intercept)  0.67290283  0.16221018  4.1483 0.000037348692269 ***
## educ         0.03064770  0.00694784  4.4111 0.000011800458838 ***
## exper        0.03912839  0.00590745  6.6236 0.000000000067160 ***
## I(exper^2)  -0.00055846  0.00019485 -2.8661         0.0042721 ** 
## age         -0.01632657  0.00432044 -3.7789         0.0001701 ***
## husage      -0.00107646  0.00423600 -0.2541         0.7994714    
## kidslt6     -0.33046516  0.04738533 -6.9740 0.000000000006803 ***
## kidsge6      0.01516378  0.03678687  0.4122         0.6803069    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that an additional year of education increases the estimated probability that a woman is in the labor force by about 0.03, i.e., one more year of education is associated with roughly a 3 percentage point increase in the probability of labor force participation. This effect is significant at all conventional levels, as indicated by its p-value.

Moreover, women with children aged less than six are less likely to be in the labor force: having a child aged less than six decreases the estimated probability of being in the labor force by roughly 33 percentage points. This effect is extremely large and statistically significant at all conventional levels.
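To make that interpretation concrete, here is a small sketch that predicts labor force participation probabilities from the LPM for two hypothetical women who are identical except for having a child aged less than six. The covariate values are purely illustrative, chosen near the sample medians. Because the LPM is linear, the difference between the two predicted probabilities is exactly the kidslt6 coefficient of about -0.33.

# Two hypothetical women, identical except for kidslt6 (illustrative values only)
new_dt <- data.table(educ = 12, exper = 9, age = 43, husage = 46,
                     kidslt6 = c(0, 1), kidsge6 = 1)

# Predicted probabilities from the LPM; the second is about 0.33 lower than the first
predict(reg, newdata = new_dt)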

Logistic and Standard Normal CDFs

Recall that the logit model uses the logistic CDF while the probit model uses the standard normal CDF. Let’s graph these distribution functions to see what they look like.

# Generate dot product of x and theta values
x_theta <- seq(-10, 10, by = 0.1)

# Generate G(x'theta) where G is the logistic CDF
G_x_theta <- exp(x_theta) / (1 + exp(x_theta))

# Create data table for x_theta and G(x_theta)
dt_logit_cdf <- data.table(x_theta = x_theta, G_x_theta = G_x_theta)

# Plot the logistic CDF using ggplot2
logistic_cdf <- ggplot(dt_logit_cdf, aes(x = x_theta, y = G_x_theta)) +
  geom_line(color = color_1, size = 2) +
  labs(title = "Logistic CDF",
       x = expression(x * minute * theta),
       y = expression(Lambda(x * minute * theta))) +
  theme(
    panel.background = element_rect(fill = "black", color = "black"),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.grid.major = element_line(color = "gray"),
    axis.text = element_text(color = color_1, size = 8),
    axis.title = element_text(color = color_2, size = 10),
    plot.title = element_text(hjust = 0.5, color = color_2, size = 15, face = "bold")
  ) + 
  scale_y_continuous(breaks = c(0, 0.5, 1)) +
  scale_x_continuous(breaks = c(-10, 0, 10))

# Generate dot product of x and theta values
x_theta <- seq(-10, 10, by = 0.1)

# Generate G(x'theta) where G is the standard normal CDF
G_x_theta <- pnorm(x_theta)

# Create data table for x_theta and G(x_theta)
dt_sn_cdf <- data.table(x_theta = x_theta, G_x_theta = G_x_theta)

# Plot the standard normal CDF using ggplot2
sn_cdf <- ggplot(dt_sn_cdf, aes(x = x_theta, y = G_x_theta)) +
  geom_line(color = color_1, size = 2) +
  labs(title = "Standard Normal CDF",
       x = expression(x * minute * theta),
       y = expression(Phi(x * minute * theta))) +
  theme(
    panel.background = element_rect(fill = "black", color = "black"),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.grid.major = element_line(color = "gray"),
    panel.grid.minor = element_line(color = "gray"),
    axis.text = element_text(color = color_1, size = 8),
    axis.title = element_text(color = color_2, size = 10),
    plot.title = element_text(hjust = 0.5, color = color_2, size = 15, face = "bold")
  ) + 
  scale_y_continuous(breaks = c(0, 0.5, 1)) +
  scale_x_continuous(breaks = c(-10, 0, 10))

# Combine the plots side by side
grid.arrange(logistic_cdf, sn_cdf, nrow = 1)

We see that the y-axis ranges from 0 to 1 as we would expect, because a CDF outputs a probability, which must lie in the unit interval. Also, notice that for the logistic CDF y-axis label I replaced \(G\) with \(\Lambda\) and for the standard normal CDF y-axis label I replaced \(G\) with \(\Phi\); this is the standard notation for these respective distribution functions. Moreover, as an example, for \(\boldsymbol{x}^\prime \boldsymbol{\theta} = 0\), \(\Lambda (\boldsymbol{x}^\prime \boldsymbol{\theta}) = \Phi(\boldsymbol{x}^\prime \boldsymbol{\theta}) = 0.5\), meaning half of the probability mass lies below zero. The reason is that the PDFs of both the logistic and standard normal distributions are symmetric around zero.
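As a quick check on the 0.5 claim, we can evaluate both CDFs at zero directly:

# Both CDFs evaluate to 0.5 at zero
exp(0) / (1 + exp(0))
## [1] 0.5
pnorm(0)
## [1] 0.5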

Logit Model

To estimate a logistic regression model in R, we use the glm() (Generalized Linear Model) function. One argument of this function is called family, and we set it equal to binomial to let R know that we wish to estimate a logistic regression model. Let’s use this function with the same outcome and covariates as in the LPM above.

# Estimate logistic regression model
logit <- glm(inlf ~ educ + exper + I(exper^2) + age + husage + kidslt6 + kidsge6, data = dt, family = binomial)

# Summary of logit model
summary(logit)
## 
## Call:
## glm(formula = inlf ~ educ + exper + I(exper^2) + age + husage + 
##     kidslt6 + kidsge6, family = binomial, data = dt)
## 
## Coefficients:
##              Estimate Std. Error z value        Pr(>|z|)    
## (Intercept)  0.903828   0.893678   1.011        0.311846    
## educ         0.170704   0.039647   4.306 0.0000166513189 ***
## exper        0.201650   0.031824   6.336 0.0000000002351 ***
## I(exper^2)  -0.002905   0.001015  -2.861        0.004229 ** 
## age         -0.086845   0.024211  -3.587        0.000335 ***
## husage      -0.006720   0.023306  -0.288        0.773078    
## kidslt6     -1.710658   0.252273  -6.781 0.0000000000119 ***
## kidsge6      0.057484   0.208894   0.275        0.783176    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1029.75  on 752  degrees of freedom
## Residual deviance:  820.39  on 745  degrees of freedom
## AIC: 836.39
## 
## Number of Fisher Scoring iterations: 4
# Get the log-likelihood
logit_log_likelihood <- logLik(logit)

# Show the log-likelihood
logit_log_likelihood
## 'log Lik.' -410.1935 (df=8)
# Number of estimated parameters
k <- length(coef(logit))

# AIC
aic <- round(2 * k - 2 * as.numeric(logit_log_likelihood), 2)

# Show the aic
aic
## [1] 836.39

The summary output looks fairly similar to when we estimate a model via OLS, but with a few important differences:

  1. The null deviance represents how well an intercept-only model predicts the outcome variable. It is given by \(-2(\text{Log-likelihood of intercept-only model})\).

  2. The residual deviance represents how well our estimated model predicts the outcome variable. It is given by \(-2(\text{Log-likelihood of estimated model})\). A residual deviance smaller than the null deviance means our estimated model predicts the outcome better than an intercept-only model.

  3. The Akaike Information Criterion (AIC) is a measure of model quality. It is given by \(2k - 2(\text{Log-likelihood of estimated model})\). The \(2k\) term penalizes adding more covariates (which helps prevent overfitting, a topic we will cover when we get to machine learning), while the second term rewards model fit, shrinking as the log-likelihood grows. In sum, we want a model with a low AIC. In practice you will often estimate multiple models and compare their AICs to choose among them; this is called model selection, a salient topic in machine learning. Both deviances are verified in the sketch just after this list.
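As a check, we can reproduce the null and residual deviance from summary(logit) ourselves. The sketch below fits an intercept-only logit (logit_null) purely for this comparison.

# Fit an intercept-only logit as a comparison model (a verification sketch)
logit_null <- glm(inlf ~ 1, data = dt, family = binomial)

# Null deviance: -2 times the log-likelihood of the intercept-only model
# (should match the 1029.75 reported by summary(logit))
-2 * as.numeric(logLik(logit_null))

# Residual deviance: -2 times the log-likelihood of our estimated model
# (should match the 820.39 reported by summary(logit))
-2 * as.numeric(logLik(logit))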

We can also get heteroskedasticity-consistent standard errors (HCSE) as usual. Let’s do that now for robustness.

# Obtain HC1 standard errors
HC1 <- vcovHC(logit, type = "HC1")

# Show regression summary using HC1 standard errors
coeftest(logit, vcov = HC1)
## 
## z test of coefficients:
## 
##               Estimate Std. Error z value        Pr(>|z|)    
## (Intercept)  0.9038278  0.8815851  1.0252       0.3052545    
## educ         0.1707040  0.0402956  4.2363 0.0000227234313 ***
## exper        0.2016500  0.0327302  6.1610 0.0000000007230 ***
## I(exper^2)  -0.0029048  0.0010431 -2.7846       0.0053591 ** 
## age         -0.0868450  0.0235296 -3.6909       0.0002235 ***
## husage      -0.0067203  0.0229643 -0.2926       0.7697977    
## kidslt6     -1.7106582  0.2648639 -6.4586 0.0000000001057 ***
## kidsge6      0.0574842  0.2077620  0.2767       0.7820237    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The coefficients here CANNOT be interpreted as marginal effects. Rather, they represent changes in the log-odds (raw logits), which are not directly meaningful for us. Instead, we can use the margins() function contained in the margins package to get the average marginal effect (AME).

# Get the marginal effects
me <- margins(logit)

# Show the marginal effects
summary(me)

We see these AMEs are actually quite similar to those from the LPM. Moreover, this output allows us to construct confidence intervals and, consequently, conduct hypothesis tests of whether an expected marginal effect equals zero. As an example, on average across all individuals in our sample, having a child aged less than six decreases the probability that a woman is in the labor force by roughly 31 percentage points.
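As a rough sketch of that comparison, we can print the LPM coefficients next to the logit AMEs. Note that margins() folds exper and I(exper^2) into a single AME for exper, so those entries are not directly comparable.

# LPM coefficients for reference
round(coef(reg), 4)

# Logit AMEs from margins(); summary(me) is a data frame, so we can pull out columns
as.data.frame(summary(me))[, c("factor", "AME")]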

Probit Model

To estimate a probit regression model in R, we again use the glm() (Generalized Linear Model) function, but now we set the family argument equal to binomial(link = "probit") to let R know that we wish to estimate a probit regression model. Let’s use this function with the same outcome and covariates as in the LPM above.

# Estimate probit regression model
probit <- glm(inlf ~ educ + exper + I(exper^2) + age + husage + kidslt6 + kidsge6, data = dt, family = binomial(link = "probit"))

# Summary of probit model
summary(probit)
## 
## Call:
## glm(formula = inlf ~ educ + exper + I(exper^2) + age + husage + 
##     kidslt6 + kidsge6, family = binomial(link = "probit"), data = dt)
## 
## Coefficients:
##               Estimate Std. Error z value         Pr(>|z|)    
## (Intercept)  0.5197537  0.5305941   0.980         0.327299    
## educ         0.1023297  0.0233373   4.385 0.00001160837605 ***
## exper        0.1207894  0.0186502   6.477 0.00000000009383 ***
## I(exper^2)  -0.0017284  0.0006001  -2.880         0.003973 ** 
## age         -0.0527930  0.0143210  -3.686         0.000227 ***
## husage      -0.0029324  0.0138234  -0.212         0.832005    
## kidslt6     -1.0186390  0.1480194  -6.882 0.00000000000591 ***
## kidsge6      0.0469525  0.1234680   0.380         0.703737    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1029.7  on 752  degrees of freedom
## Residual deviance:  819.6  on 745  degrees of freedom
## AIC: 835.6
## 
## Number of Fisher Scoring iterations: 4
# Get the log-likelihood
probit_log_likelihood <- logLik(probit)

# Show the log-likelihood
probit_log_likelihood
## 'log Lik.' -409.8005 (df=8)
# Number of estimated parameters
k <- length(coef(probit))

# AIC
aic <- round(2 * k - 2 * as.numeric(probit_log_likelihood), 2)

# Show the AIC
aic
## [1] 835.6

We can also get HCSE as usual. Let’s do that for robustness now.

# Obtain HC1 standard errors
HC1 <- vcovHC(probit, type = "HC1")

# Show regression summary using HC1 standard errors
coeftest(probit, vcov = HC1)
## 
## z test of coefficients:
## 
##                Estimate  Std. Error z value         Pr(>|z|)    
## (Intercept)  0.51975374  0.52070625  0.9982        0.3181966    
## educ         0.10232972  0.02356009  4.3434 0.00001403262649 ***
## exper        0.12078939  0.01922635  6.2825 0.00000000033319 ***
## I(exper^2)  -0.00172843  0.00061785 -2.7975        0.0051505 ** 
## age         -0.05279304  0.01378123 -3.8308        0.0001277 ***
## husage      -0.00293238  0.01349105 -0.2174        0.8279298    
## kidslt6     -1.01863895  0.15526448 -6.5607 0.00000000005357 ***
## kidsge6      0.04695248  0.12257206  0.3831        0.7016751    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The coefficients here CANNOT be interpreted as marginal effects. Rather, they represent changes in the latent index (a z-score), which is not directly meaningful for us. Instead, we can use the margins() function contained in the margins package to get the average marginal effect (AME).

# Get the marginal effects
me <- margins(probit)

# Show the marginal effects
summary(me)

We see these AMEs are almost identical to those from the logistic regression model. Moreover, this output allows us to construct confidence intervals and, consequently, conduct hypothesis tests of whether an expected marginal effect equals zero.

As an example, on average across all individuals in our sample, having a child aged less than six decreases the likelihood a woman will be in the labor force by roughly 31 percentage points.
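For a continuous covariate such as educ, the probit AME that margins() reports can be computed by hand as the average of the scale factor \(\phi(\boldsymbol{x}^\prime \widehat{\boldsymbol{\theta}})\) across the sample multiplied by the coefficient. Here is a sketch of that calculation, which should be very close to the educ AME reported by summary(me) above.

# Linear predictor x'theta_hat for each observation
x_theta_hat <- predict(probit, type = "link")

# AME of educ by hand: average scale factor times the educ coefficient
mean(dnorm(x_theta_hat)) * coef(probit)["educ"]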

Probit Model Manually

Let’s manually set up the log-likelihood function and use R’s optim() function to numerically solve for the probit parameter estimates.

# Log-likelihood function for the probit model
probit_log_likelihood <- function(params) 
  {
  # Extract parameters for each variable
  theta_0 <- params[1]
  theta_educ <- params[2]
  theta_exper <- params[3]
  theta_exper2 <- params[4]
  theta_age <- params[5]
  theta_husage <- params[6]
  theta_kidslt6 <- params[7]
  theta_kidsge6 <- params[8]
  
  # Dot product of data and parameters
  x_prime_theta <- theta_0 + 
    theta_educ * dt[, educ] + 
    theta_exper * dt[, exper] + 
    theta_exper2 * (dt[, exper]^2) + 
    theta_age * dt[, age] + 
    theta_husage * dt[, husage] + 
    theta_kidslt6 * dt[, kidslt6] + 
    theta_kidsge6 * dt[, kidsge6]
  
  # Phi(x_prime_theta)
  p <- pnorm(x_prime_theta)
  
  # Log-likelihood
  log_likelihood <- sum(dt[, inlf] * log(p) + (1 - dt[, inlf]) * log(1 - p))
  
  # Minimize negative of log-likelihood (same as maximizing the log-likelihood itself)
  return(-log_likelihood)
}

# Initial parameter guesses
initial_params <- rep(0, 8) 

# Optimize the log-likelihood function
result <- optim(initial_params, probit_log_likelihood, method = "BFGS", control = list(maxit = 10000, reltol = 1e-8))

# Output parameter estimates
probit_estimates <- result$par
names(probit_estimates) <- c("Intercept", "educ", "exper", "exper^2", "age", "husage", "kidslt6", "kidsge6")
print(probit_estimates)
##    Intercept         educ        exper      exper^2          age       husage 
##  0.600605073  0.103568580  0.104215338 -0.001186213 -0.054157720 -0.002021734 
##      kidslt6      kidsge6 
## -1.023348574  0.051761890
# Minimized negative log-likelihood (the maximized log-likelihood is the negative of this value)
result$value
## [1] 410.2277
# Number of iterations to converge
result$counts
## function gradient 
##      171       15
# Did optim converge (0 means yes)
result$convergence
## [1] 0

We see the parameter estimates are very similar to those from R’s glm() function, but not exactly the same. This is because optim() relies on a general-purpose numerical optimizer (BFGS here), whereas glm() fits the model by Fisher scoring (iteratively reweighted least squares), so the two routines stop at slightly different points.
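One way to narrow the gap, sketched below, is to ask optim() for the Hessian and start from glm()’s estimates; the inverse Hessian of the negative log-likelihood then also gives approximate standard errors.

# Re-run the optimizer starting from glm()'s estimates and request the Hessian
result_h <- optim(coef(probit), probit_log_likelihood, method = "BFGS",
                  hessian = TRUE, control = list(maxit = 10000, reltol = 1e-10))

# Parameter estimates (should now be nearly identical to glm()'s)
result_h$par

# Approximate standard errors from the inverse Hessian of the negative log-likelihood
sqrt(diag(solve(result_h$hessian)))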

Likelihood Ratio Test

Let’s now use the probit model to carry out a likelihood ratio test. We will test the hypothesis \(H_0: \theta_7 = \theta_8 = 0\) versus \(H_A: \theta_7 \neq 0 \ \text{or} \ \theta_8 \neq 0\) at the \(\alpha = 0.05\) level. Essentially, this tests whether having children of any age is a statistically significant determinant of a woman entering the labor force.

To carry out this test we:

  1. Estimate the unrestricted model and obtain the log-likelihood
  2. Estimate the restricted model and obtain the log-likelihood
  3. Compute the likelihood ratio test statistic \(LR = 2\left(\mathscr{L}_n\left(\widehat{\boldsymbol{\theta}}_{UR}\right) - \mathscr{L}_n\left(\widehat{\boldsymbol{\theta}}_{R}\right)\right)\). Remember this test statistic asymptotically follows the \(\chi^2_q\) distribution, where \(q\) is the number of restrictions (equal signs) listed in the null hypothesis. This is because the F-distribution converges to a \(\chi^2\) distribution as its denominator degrees of freedom, \(n - k\), grows large.
  4. Compute the \(\chi^2_q\) critical value.
  5. Reject the null hypothesis if and only if the likelihood ratio test statistic is greater than the \(\chi^2_q\) critical value.
# Number of restrictions
q <- 2

# Estimate unrestricted probit regression model
probit_ur <- glm(inlf ~ educ + exper + I(exper^2) + age + husage + kidslt6 + kidsge6, data = dt, family = binomial(link = "probit"))

# Get the log-likelihood of the unrestricted probit regression model 
probit_ur_log_likelihood <- logLik(probit_ur)

# Estimate restricted probit regression model
probit_r <- glm(inlf ~ educ + exper + I(exper^2) + age + husage, data = dt, family = binomial(link = "probit"))

# Get the log-likelihood of the restricted probit regression model 
probit_r_log_likelihood <- logLik(probit_r)

# Compute the likelihood ratio test statistic
lr <- 2 * (probit_ur_log_likelihood - probit_r_log_likelihood)

# Show the likelihood ratio test statistic
paste("Likelihood ratio test statistic:", round(lr, 2))
## [1] "Likelihood ratio test statistic: 51.56"
# Calculate the chi-squared critical value with q degrees of freedom
crit <- qchisq(0.95, df = q)

# Show the critical value
paste("Chi-squared critical value with q =", q,  "degrees of freedom:", round(crit, 2))
## [1] "Chi-squared critical value with q = 2 degrees of freedom: 5.99"
# Print results
paste("Reject Null Hypothesis:", lr > crit)
## [1] "Reject Null Hypothesis: TRUE"

Since we reject this null hypothesis, at least one of \(\theta_7\) or \(\theta_8\) is statistically different from zero at the 5% significance level.
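Equivalently, we can compute the p-value of the LR statistic directly with pchisq(), or let lrtest() from the lmtest package (already loaded above) run the same test; a quick sketch:

# p-value of the LR statistic (reject when it is below alpha = 0.05)
pchisq(as.numeric(lr), df = q, lower.tail = FALSE)

# The same test carried out by lmtest's lrtest()
lrtest(probit_r, probit_ur)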