Preliminaries

# Clear environment, plot pane, and console
rm(list = ls())
graphics.off()
cat("\014")
# If pacman is not already installed, then install it
if (!require(pacman)) install.packages("pacman")

# Load packages
pacman::p_load(ggplot2, data.table)

# Colors used for plot (https://www.rapidtables.com/web/color/RGB_Color.html)
color_1 <- "#00FF00"
color_2 <- "#ADD8E6"

# Set random seed for reproducibility
set.seed(418518)

# Set scipen option to a high value to avoid scientific notation
options(scipen = 999)

Data Analysis with a wooldridge Dataset

Any computer-based homework assignment from the book should have a dataset that MindTap will ask you to download as a .RData file. There is no need to download it, because you can use the wooldridge package to load the dataset directly into R. Let’s load the econmath dataset from the wooldridge package and do some basic data analysis.

# Load in econmath dataset from wooldridge as a data table
dt <- data.table(wooldridge::econmath)

# Show the first few rows of the data
head(dt)
# Show the last few rows of the data
tail(dt)
# Get the names of the variables in our dataset
names(dt)
##  [1] "age"      "work"     "study"    "econhs"   "colgpa"   "hsgpa"   
##  [7] "acteng"   "actmth"   "act"      "mathscr"  "male"     "calculus"
## [13] "attexc"   "attgood"  "fathcoll" "mothcoll" "score"
# Get summary statistics
summary(dt)
##       age             work            study           econhs      
##  Min.   :18.00   Min.   : 0.000   Min.   : 0.00   Min.   :0.0000  
##  1st Qu.:19.00   1st Qu.: 0.000   1st Qu.: 8.50   1st Qu.:0.0000  
##  Median :19.00   Median : 8.000   Median :12.00   Median :0.0000  
##  Mean   :19.41   Mean   : 8.626   Mean   :13.92   Mean   :0.3703  
##  3rd Qu.:20.00   3rd Qu.:15.000   3rd Qu.:18.00   3rd Qu.:1.0000  
##  Max.   :29.00   Max.   :37.500   Max.   :50.00   Max.   :1.0000  
##                                                                   
##      colgpa          hsgpa           acteng          actmth     
##  Min.   :0.875   Min.   :2.355   Min.   :12.00   Min.   :12.00  
##  1st Qu.:2.446   1st Qu.:3.110   1st Qu.:20.00   1st Qu.:20.00  
##  Median :2.813   Median :3.321   Median :23.00   Median :23.00  
##  Mean   :2.815   Mean   :3.342   Mean   :22.59   Mean   :23.21  
##  3rd Qu.:3.207   3rd Qu.:3.589   3rd Qu.:25.00   3rd Qu.:26.00  
##  Max.   :4.000   Max.   :4.260   Max.   :34.00   Max.   :36.00  
##                                  NA's   :42      NA's   :42     
##       act           mathscr            male        calculus     
##  Min.   :13.00   Min.   : 0.000   Min.   :0.0   Min.   :0.0000  
##  1st Qu.:21.00   1st Qu.: 7.000   1st Qu.:0.0   1st Qu.:0.0000  
##  Median :23.00   Median : 8.000   Median :0.5   Median :1.0000  
##  Mean   :23.12   Mean   : 7.875   Mean   :0.5   Mean   :0.6764  
##  3rd Qu.:25.00   3rd Qu.: 9.000   3rd Qu.:1.0   3rd Qu.:1.0000  
##  Max.   :33.00   Max.   :10.000   Max.   :1.0   Max.   :1.0000  
##  NA's   :42                                                     
##      attexc          attgood          fathcoll         mothcoll     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :1.0000   Median :1.0000   Median :1.0000  
##  Mean   :0.2967   Mean   :0.5864   Mean   :0.5245   Mean   :0.6285  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                     
##      score      
##  Min.   :19.53  
##  1st Qu.:64.06  
##  Median :74.22  
##  Mean   :72.60  
##  3rd Qu.:82.79  
##  Max.   :98.44  
## 
# Get the number of observations
nrow(dt)
## [1] 856
# Get the number of variables in the dataset 
ncol(dt)
## [1] 17

Multiple Linear Regression (MLR)

Running a MLR using lm()

Let’s study determinants of college GPA, colgpa. I include some indicator/binary/dummy variables in the model, which we haven’t discussed yet. Indicator variables take on only two values: zero and one. The covariates in the regression below are:

  1. High school GPA, hsgpa
  2. ACT score, act
  3. Hours per week spent studying, study
  4. Squared hours per week spent studying, sq_study. The rationale is that, beyond some point, additional studying may actually start to hurt performance.
  5. Indicator variable for if a student took calculus in high school, calculus
  6. Indicator variable for if a student studied economics in high school, econhs
  7. Indicator variable for if a student’s mother has a bachelors degree, mothcoll
  8. Indicator variable for if a student’s father has a bachelors degree, fathcoll
  9. Interaction effect between the variables fathcoll and mothcoll, fathcoll * mothcoll. We include interaction effects when we think the effect of one predictor may depend on the value of another predictor.
# Create variable for the square of total hours per week spent studying
dt[, sq_study := study^2]

# Run regression of colgpa on covariates
reg <- lm(colgpa ~ hsgpa + act + study + sq_study + calculus + econhs + fathcoll + mothcoll + fathcoll*mothcoll, data = dt)

# Print the summary of reg
summary(reg)
## 
## Call:
## lm(formula = colgpa ~ hsgpa + act + study + sq_study + calculus + 
##     econhs + fathcoll + mothcoll + fathcoll * mothcoll, data = dt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.82242 -0.27923  0.02023  0.33206  1.30971 
## 
## Coefficients:
##                       Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)       -0.015799153  0.174786049  -0.090               0.9280    
## hsgpa              0.653619400  0.053371308  12.247 < 0.0000000000000002 ***
## act                0.022209043  0.005623968   3.949            0.0000854 ***
## study              0.003385113  0.006576197   0.515               0.6069    
## sq_study          -0.000001781  0.000168130  -0.011               0.9916    
## calculus           0.051133944  0.035352278   1.446               0.1485    
## econhs            -0.058266376  0.034038218  -1.712               0.0873 .  
## fathcoll           0.025177473  0.057065827   0.441               0.6592    
## mothcoll           0.116296128  0.047784650   2.434               0.0152 *  
## fathcoll:mothcoll -0.043977057  0.071666463  -0.614               0.5396    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4628 on 804 degrees of freedom
##   (42 observations deleted due to missingness)
## Multiple R-squared:  0.2715, Adjusted R-squared:  0.2633 
## F-statistic: 33.29 on 9 and 804 DF,  p-value: < 0.00000000000000022

Our regression model is \[\begin{align} colgpa &= \beta_0 + \beta_1 hsgpa + \beta_2 act + \beta_3 study + \beta_4 study^2 + \beta_5 calculus \\ &+ \beta_6 econhs + \beta_7 fathcoll + \beta_8 mothcoll + \beta_9 (fathcoll \times mothcoll)\\ &+ u. \end{align}\]

Upon estimation, our regression equation is \[\begin{align} \widehat{colgpa} &= \widehat{\beta}_0 + \widehat{\beta}_1 hsgpa + \widehat{\beta}_2 act + \widehat{\beta}_3 study + \widehat{\beta}_4 study^2 + \widehat{\beta}_5 calculus \\ &+ \widehat{\beta}_6 econhs + \widehat{\beta}_7 fathcoll + \widehat{\beta}_8 mothcoll + \widehat{\beta}_9 (fathcoll \times mothcoll) \\ &= -0.015 + 0.6536 hsgpa + 0.022 act + 0.0034 study - 0.000002 study^2 + 0.0511 calculus \\ &- 0.0582 econhs + 0.0252 fathcoll + 0.1163 mothcoll \\ &- 0.044 (fathcoll \times mothcoll). \end{align}\]

The partial derivative of colgpa with respect to act is \(\frac{\partial colgpa}{\partial act} = \beta_2\), giving the effect act has on colgpa. This estimated effect is \(\widehat{\beta}_2 = 0.022\). This means for every additional point gained on a student’s ACT score, their college GPA is predicted to increase by \(0.022\). Similarly, for every ten additional points gained on a student’s ACT score, their college GPA is predicted to increase by \(0.22\).
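
As a quick sanity check, we can pull this estimate straight from the fitted model object and scale it (a minimal sketch using the reg object estimated above; beta_hat_act is just an illustrative name).

# Extract the estimated coefficient on act from the fitted model
beta_hat_act <- coef(reg)["act"]

# Predicted change in college GPA from one and from ten additional ACT points
beta_hat_act * c(1, 10)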

The partial derivative of colgpa with respect to study is \(\frac{\partial colgpa}{\partial study} = \beta_3 + 2 \beta_4 study\), giving the effect study has on colgpa. This estimated effect is \(\widehat{\beta}_3 + 2\widehat{\beta}_4 study = 0.0034 + 2(-0.000002) study\), which depends on how many hours the student already studies. For example, for a student currently studying one hour per week, an additional hour of studying is predicted to increase college GPA by \(0.0034 + 2(-0.000002)(1) = 0.003396\).
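
Because of the quadratic term, the marginal effect varies with study. Here is a quick sketch evaluating it at a few illustrative weekly study levels using the fitted coefficients from reg.

# Evaluate the estimated marginal effect of an additional study hour at
# 5, 15, and 30 hours per week of current studying
study_hours <- c(5, 15, 30)
coef(reg)["study"] + 2 * coef(reg)["sq_study"] * study_hours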

The partial derivative of colgpa with respect to mothcoll is \(\frac{\partial colgpa}{\partial mothcoll} = \beta_8 + \beta_9 fathcoll\), giving the effect mothcoll has on colgpa. This estimated effect is \(\widehat{\beta}_8 + \widehat{\beta}_9 fathcoll = 0.1163 - 0.044 fathcoll\). This means if a father does have a bachelor’s degree (fathcoll = 1), then the effect of a mother having a college degree on their child’s college GPA is an increase of \(0.1163 - 0.044(1) = 0.0723\). If a father does not have a bachelor’s degree (fathcoll = 0), then the effect is an increase of \(0.1163 - 0.044(0) = 0.1163\). Although we would expect the estimate \(\widehat{\beta}_9\) to be positive, it is statistically insignificant at any practical level of \(\alpha\), so we would likely drop the interaction from our model anyway before estimating causal effects.
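
The same calculation can be done directly with the fitted coefficients; a small sketch evaluating the effect of mothcoll at both values of fathcoll.

# Estimated effect of mothcoll when fathcoll = 0 and when fathcoll = 1
coef(reg)["mothcoll"] + coef(reg)["fathcoll:mothcoll"] * c(0, 1)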

Computing the OLS Estimates Manually

Let’s use the OLS solution \(\widehat{\boldsymbol{\beta}} = (X^\prime X)^{-1} X^\prime \boldsymbol{y}\) to replicate the parameter estimates produced by lm(). If you haven’t looked at the file ECON_418-518_Matrix_Algebra_Code.html on D2L, I replicate the entire summary() table there and detail the functions used below.

# Drop rows with any NA values
dt <- na.omit(dt)

# Make our y vector
y <- dt[, colgpa]

# Make a vector of 1's to add as the first column of our X matrix for the intercept
Intercept <- rep(1, nrow(dt))

# Create a column in dt for the mothcoll * fathcoll interaction
dt[, mothcoll_fathcoll := mothcoll * fathcoll]

# Construct our X matrix
X <- as.matrix(cbind(Intercept, dt[, .(hsgpa, act, study, sq_study, calculus, econhs, mothcoll, fathcoll, mothcoll_fathcoll)]))

# Show the first 5 rows and all columns of X
X[1:5, ]
##      Intercept hsgpa act study sq_study calculus econhs mothcoll fathcoll
## [1,]         1 3.355  27  10.0   100.00        1      0        1        1
## [2,]         1 3.219  24  22.5   506.25        0      1        1        0
## [3,]         1 3.306  21  12.0   144.00        1      0        1        0
## [4,]         1 3.977  31  40.0  1600.00        1      0        1        1
## [5,]         1 3.890  32  15.0   225.00        1      1        1        0
##      mothcoll_fathcoll
## [1,]                 1
## [2,]                 0
## [3,]                 0
## [4,]                 1
## [5,]                 0
# Transpose of X
Xprime <- t(X)

# Compute the OLS solution for beta_hat 
beta_hat <- solve(Xprime %*% X) %*% Xprime %*% y

# Show beta_hat
beta_hat
##                              [,1]
## Intercept         -0.015799153368
## hsgpa              0.653619400480
## act                0.022209042854
## study              0.003385113176
## sq_study          -0.000001780703
## calculus           0.051133943678
## econhs            -0.058266376397
## mothcoll           0.116296127712
## fathcoll           0.025177473186
## mothcoll_fathcoll -0.043977056506

Notice that the coefficient estimates are exactly the same as those produced by the lm() function.
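
If you prefer a numerical check over an eyeball comparison, one way is to line the two sets of estimates up by name (a sketch; manual_names and lm_names are just helper vectors, and note our manual X matrix orders mothcoll before fathcoll while lm() uses the order given in the formula).

# Names of the manual estimates and the corresponding lm() coefficient names
manual_names <- c("Intercept", "hsgpa", "act", "study", "sq_study", "calculus",
                  "econhs", "fathcoll", "mothcoll", "mothcoll_fathcoll")
lm_names <- c("(Intercept)", "hsgpa", "act", "study", "sq_study", "calculus",
              "econhs", "fathcoll", "mothcoll", "fathcoll:mothcoll")

# Largest absolute difference between the manual and lm() estimates
max(abs(beta_hat[manual_names, 1] - coef(reg)[lm_names]))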

Computing the Adjusted R-squared Manually

Recall the adjusted R-squared is given by \(\widetilde{R}^2 = 1 - \frac{SSR / (n-k)}{SST / (n - 1)}\), where \(n\) is the number of observations and \(k\) is the number of estimated parameters (intercept included), so \(n - k\) is the residual degrees of freedom. Let’s compute this manually.

# Get the number of observations
n <- nrow(X)

# Number of estimated parameters
k <- ncol(X)

# Residuals
u_hat <- y - X %*% beta_hat

# Dot product of residuals gives the SSR
ssr <- t(u_hat) %*% u_hat

# Get the mean of outcome
y_bar <- mean(y)

# Dot product of the difference between outcomes and their mean gives the SST
sst <- t(y - y_bar) %*% (y - y_bar)

# Compute the numerator for the adjusted R-squared
r_sq_adj_num <- ssr / (n - k)

# Compute the denominator for the adjusted R-squared
r_sq_adj_denom <- sst / (n - 1)

# Compute the adjusted R-squared
r_sq_adj <- 1 - (r_sq_adj_num / r_sq_adj_denom)

# Show the adjusted R-squared
r_sq_adj
##           [,1]
## [1,] 0.2633391
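
As a quick check, the same quantity is stored in the lm() summary object and should match the value above.

# Adjusted R-squared reported by summary()
summary(reg)$adj.r.squared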

Computing the Standard Errors Manually

Recall the estimated variance-covariance matrix of the OLS estimators under Assumptions 1-5 is given by \(\widehat{\mathbb{V}} \left[ \widehat{\boldsymbol{\beta}} \right] = \widehat{\sigma}^2 (X^\prime X)^{-1}\), where \(\widehat{\sigma}^2 = \frac{\widehat{\boldsymbol{u}}^\prime \widehat{\boldsymbol{u}}}{n-k} = \frac{SSR}{n-k}\) is the estimated error variance, \(n\) is the number of observations, and \(k\) is the number of estimated parameters (intercept included). Consequently, the standard errors of the OLS estimators are the square roots of the diagonal elements of the estimated variance-covariance matrix. Let’s compute these standard errors manually.

# Obtain the estimated variance of u
sigma_hat_sq <- ssr / (n - k)

# Get inverse of X_prime %*% X
Xprime_X_inverse <- solve(Xprime %*% X)

# Obtain the estimated variance-covariance matrix
var_cov <- as.numeric(sigma_hat_sq) * Xprime_X_inverse

# Get the diagonal elements of the estimated variance-covariance matrix (these are the estimated variances of the coefficient estimators)
beta_hat_var <- diag(var_cov)

# Take the square root of the estimated variance of the estimated parameters to get the standard error
beta_hat_se <- sqrt(beta_hat_var)

# Show the standard errors
beta_hat_se
##         Intercept             hsgpa               act             study 
##      0.1747860488      0.0533713079      0.0056239677      0.0065761969 
##          sq_study          calculus            econhs          mothcoll 
##      0.0001681297      0.0353522777      0.0340382178      0.0477846496 
##          fathcoll mothcoll_fathcoll 
##      0.0570658274      0.0716664631

Notice that the standard errors are exactly the same as those produced by the lm() function.
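
Again, a quick check is against the standard errors stored in the lm() summary object (keeping in mind the mothcoll/fathcoll ordering differs from our manual X matrix).

# Standard errors reported by summary()
summary(reg)$coefficients[, "Std. Error"]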

Omitted Variable Bias (OVB)

We discussed scenarios under which we have an omitted variable bias (OVB) issue. Let’s construct two situations with OVB to understand this better.

Upward Bias

The true underlying model is \(y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + u_i = 1 + 2 x_{i1} + 3x_{i2} + u_i\), but suppose we don’t include \(x_2\) (perhaps we don’t have data on it, which is often an issue when doing empirical work). We know \(x_2\) is positively related to \(y\), as indicated by its positive coefficient in the model \(\left(\beta_2 = 3 \right)\). Moreover, suppose \(x_{i1} = \widetilde{\beta}_0 + \widetilde{\beta}_1 x_{i2} + v_i = 4 + 5 x_{i2} + v_i\), so \(x_{2}\) is also positively related to \(x_1\), as indicated by its positive coefficient in that model \(\left(\widetilde{\beta}_1 = 5 \right)\). Consequently, \(x_2\) is positively related to both \(y\) and \(x_1\), meaning that when we estimate the model \(y_i = \beta_0 + \beta_1 x_{i1} + w_i\), \(\beta_1\) should be overestimated (think positive times positive gives positive bias). Let’s verify this by generating \(s = 1000\) samples, each with \(n = 200\) data points, running the regression of \(y\) on \(x_1\) in each sample, and plotting the sampling distribution of \(\widehat{\beta}_1\). If we are correct, the sampling distribution should be centered at some value to the right of \(\beta_1 = 2\) because \(\widehat{\beta}_1\) overestimates this parameter.
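
In fact, given the variances used in the simulation below (\(x_2\) has variance \(1\) and \(v\) has variance \(3^2 = 9\)), the standard omitted variable bias formula tells us roughly where this sampling distribution should center: \[\begin{align} \text{plim} \; \widehat{\beta}_1 &= \beta_1 + \beta_2 \frac{\text{Cov}(x_1, x_2)}{\mathbb{V}[x_1]} = 2 + 3 \cdot \frac{5 \cdot 1}{5^2 \cdot 1 + 3^2} = 2 + \frac{15}{34} \approx 2.44. \end{align}\]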

# Number of samples
s <- 1000

# Number of data points to generate per sample
n <- 200

# Create a data table with empty values to store the estimates of beta_hat_1
dt_estimates <- data.table(beta_hat_1 = rep(NA, s))

# Intercept term when y is the outcome
beta_0 <- 1

# Coefficient on x_1 when y is the outcome
beta_1 <- 2

# Coefficient on x_2 when y is the outcome
beta_2 <- 3

# Intercept term when x_1 is the outcome
beta_tilde_0 <- 4

# Coefficient on x_2 when x_1 is the outcome
beta_tilde_1 <- 5

# We want to create i = 1, 2, ..., s samples
for (i in 1:s)
{
  
  # Error term when y is the outcome
  u <- rnorm(n, mean = 0, sd = 2)
  
  # Error term when x_1 is the outcome
  v <- rnorm(n, mean = 0, sd = 3)
  
  # x_2
  x_2 <- rnorm(n, mean = 0, sd = 1)
  
  # x_1
  x_1 <- beta_tilde_0 + beta_tilde_1 * x_2 + v
  
  # Outcome
  y <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + u
  
  # Create a data table
  dt <- data.table(y = y, x_1 = x_1, x_2 = x_2)
  
  # Run regression of y on x_1
  reg <- lm(y ~ x_1, data = dt)
  
  # Place the current slope estimate in dt_estimates
  dt_estimates$beta_hat_1[i] <- reg$coefficients["x_1"]
}
# Plot the density of the slope estimates
ggplot(dt_estimates, aes(x = beta_hat_1)) +
  geom_density(color = color_2, fill = color_2, alpha = 0.7) +
  geom_vline(xintercept = beta_1, color = color_1, linetype = "dashed", size = 1) +
  labs(title = expression(paste("Sampling Distribution of ", hat(beta)[1]))) +
  theme(
    panel.background = element_rect(fill = "black", color = "black"),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.grid.major = element_line(color = "gray"),
    panel.grid.minor = element_line(color = "gray"),
    axis.text = element_text(color = color_1, size = 15),
    axis.title = element_text(color = color_2, size = 25),
    plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
  ) +
  xlab(expression(hat(beta)[1])) +
  ylab("Density")

Indeed, the sampling distribution of \(\widehat{\beta}_1\) is centered to the right of the parameter it is estimating (\(\beta_1 = 2\)), implying the estimator is biased upward.
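
A quick numerical check: the average of the simulated slope estimates should sit near the value implied by the omitted variable bias formula above (about 2.44), not near \(\beta_1 = 2\).

# Average of the simulated slope estimates across the s samples
mean(dt_estimates$beta_hat_1)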

Downward Bias

The true underlying model is \(y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + u_i = 1 + 2 x_{i1} + 3x_{i2} + u_i\), but suppose we don’t include \(x_2\) (perhaps we don’t have data on it, which is often an issue when doing empirical work). We know \(x_2\) is positively related to \(y\), as indicated by its positive coefficient in the model \(\left(\beta_2 = 3 \right)\). Moreover, suppose \(x_{i1} = \widetilde{\beta}_0 + \widetilde{\beta}_1 x_{i2} + v_i = 4 - 5 x_{i2} + v_i\), so \(x_{2}\) is negatively related to \(x_1\), as indicated by its negative coefficient in that model \(\left(\widetilde{\beta}_1 = -5 \right)\). Consequently, \(x_2\) is positively related to \(y\) and negatively related to \(x_1\), meaning that when we estimate the model \(y_i = \beta_0 + \beta_1 x_{i1} + w_i\), \(\beta_1\) should be underestimated (think positive times negative gives negative bias). Let’s verify this by generating \(s = 1000\) samples, each with \(n = 200\) data points, running the regression of \(y\) on \(x_1\) in each sample, and plotting the sampling distribution of \(\widehat{\beta}_1\). If we are correct, the sampling distribution should be centered at some value to the left of \(\beta_1 = 2\) because \(\widehat{\beta}_1\) underestimates this parameter.
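
Again, with the variances used in the simulation below (\(x_2\) has variance \(1\) and \(v\) has variance \(3^2 = 9\)), the omitted variable bias formula tells us roughly where the sampling distribution should center: \[\begin{align} \text{plim} \; \widehat{\beta}_1 &= \beta_1 + \beta_2 \frac{\text{Cov}(x_1, x_2)}{\mathbb{V}[x_1]} = 2 + 3 \cdot \frac{-5 \cdot 1}{(-5)^2 \cdot 1 + 3^2} = 2 - \frac{15}{34} \approx 1.56. \end{align}\]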

# Number of samples
s <- 1000

# Number of data points to generate per sample
n <- 200

# Create a data table with empty values to store the estimates of beta_hat_1
dt_estimates <- data.table(beta_hat_1 = rep(NA, s))

# Intercept term when y is the outcome
beta_0 <- 1

# Coefficient on x_1 when y is the outcome
beta_1 <- 2

# Coefficient on x_2 when y is the outcome
beta_2 <- 3

# Intercept term when x_1 is the outcome
beta_tilde_0 <- 4

# Magnitude of the coefficient on x_2 when x_1 is the outcome (the negative sign is applied when constructing x_1 below)
beta_tilde_1 <- 5

# We want to create i = 1, 2, ..., s samples
for (i in 1:s)
{
  
  # Error term when y is the outcome
  u <- rnorm(n, mean = 0, sd = 2)
  
  # Error term when x_1 is the outcome
  v <- rnorm(n, mean = 0, sd = 3)
  
  # x_2
  x_2 <- rnorm(n, mean = 0, sd = 1)
  
  # x_1
  x_1 <- beta_tilde_0 - beta_tilde_1 * x_2 + v
  
  # Outcome
  y <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + u
  
  # Create a data table
  dt <- data.table(y = y, x_1 = x_1, x_2 = x_2)
  
  # Run regression of y on x_1
  reg <- lm(y ~ x_1, data = dt)
  
  # Place the current slope estimate in dt_estimates
  dt_estimates$beta_hat_1[i] <- reg$coefficients["x_1"]
}
# Plot the density of the slope estimates
ggplot(dt_estimates, aes(x = beta_hat_1)) +
  geom_density(color = color_2, fill = color_2, alpha = 0.7) +
  geom_vline(xintercept = beta_1, color = color_1, linetype = "dashed", size = 1) +
  labs(title = expression(paste("Sampling Distribution of ", hat(beta)[1]))) +
  theme(
    panel.background = element_rect(fill = "black", color = "black"),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.grid.major = element_line(color = "gray"),
    panel.grid.minor = element_line(color = "gray"),
    axis.text = element_text(color = color_1, size = 15),
    axis.title = element_text(color = color_2, size = 25),
    plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
  ) +
  xlab(expression(hat(beta)[1])) +
  ylab("Density")

Indeed, the sampling distribution of \(\widehat{\beta}_1\) is centered to the left of the parameter it is estimating (\(\beta_1 = 2\)), implying the estimator is biased downward.
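
As before, a quick numerical check: the average of the simulated slope estimates should sit near the value implied by the omitted variable bias formula above (about 1.56), not near \(\beta_1 = 2\).

# Average of the simulated slope estimates across the s samples
mean(dt_estimates$beta_hat_1)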