# Clear environment, plot pane, and console
rm(list = ls())
graphics.off()
cat("\014")
# If pacman is not already installed, then install it
if (!require(pacman)) install.packages("pacman")
# Load packages
pacman::p_load(ggplot2, data.table)
# Colors used for plot (https://www.rapidtables.com/web/color/RGB_Color.html)
color_1 <- "#00FF00"
color_2 <- "#ADD8E6"
# Set random seed for reproducibility
set.seed(418518)
# Set scipen option to a high value to avoid scientific notation
options(scipen = 999)
wooldridge Dataset

Any computer-based homework assignment from the book should have a dataset that MindTap will ask you to download as a .RData file. There is no need to download it because you can use the wooldridge package to load the dataset directly into R. Let’s load the econmath dataset from the wooldridge package and do some basic data analysis.
# Load in econmath dataset from wooldridge as a data table
dt <- data.table(wooldridge::econmath)
# Show the first few rows of the data
head(dt)
# Show the last few rows of the data
tail(dt)
# Get the names of the variables in our dataset
names(dt)
## [1] "age" "work" "study" "econhs" "colgpa" "hsgpa"
## [7] "acteng" "actmth" "act" "mathscr" "male" "calculus"
## [13] "attexc" "attgood" "fathcoll" "mothcoll" "score"
# Get summary statistics
summary(dt)
## age work study econhs
## Min. :18.00 Min. : 0.000 Min. : 0.00 Min. :0.0000
## 1st Qu.:19.00 1st Qu.: 0.000 1st Qu.: 8.50 1st Qu.:0.0000
## Median :19.00 Median : 8.000 Median :12.00 Median :0.0000
## Mean :19.41 Mean : 8.626 Mean :13.92 Mean :0.3703
## 3rd Qu.:20.00 3rd Qu.:15.000 3rd Qu.:18.00 3rd Qu.:1.0000
## Max. :29.00 Max. :37.500 Max. :50.00 Max. :1.0000
##
## colgpa hsgpa acteng actmth
## Min. :0.875 Min. :2.355 Min. :12.00 Min. :12.00
## 1st Qu.:2.446 1st Qu.:3.110 1st Qu.:20.00 1st Qu.:20.00
## Median :2.813 Median :3.321 Median :23.00 Median :23.00
## Mean :2.815 Mean :3.342 Mean :22.59 Mean :23.21
## 3rd Qu.:3.207 3rd Qu.:3.589 3rd Qu.:25.00 3rd Qu.:26.00
## Max. :4.000 Max. :4.260 Max. :34.00 Max. :36.00
## NA's :42 NA's :42
## act mathscr male calculus
## Min. :13.00 Min. : 0.000 Min. :0.0 Min. :0.0000
## 1st Qu.:21.00 1st Qu.: 7.000 1st Qu.:0.0 1st Qu.:0.0000
## Median :23.00 Median : 8.000 Median :0.5 Median :1.0000
## Mean :23.12 Mean : 7.875 Mean :0.5 Mean :0.6764
## 3rd Qu.:25.00 3rd Qu.: 9.000 3rd Qu.:1.0 3rd Qu.:1.0000
## Max. :33.00 Max. :10.000 Max. :1.0 Max. :1.0000
## NA's :42
## attexc attgood fathcoll mothcoll
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :1.0000 Median :1.0000 Median :1.0000
## Mean :0.2967 Mean :0.5864 Mean :0.5245 Mean :0.6285
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## score
## Min. :19.53
## 1st Qu.:64.06
## Median :74.22
## Mean :72.60
## 3rd Qu.:82.79
## Max. :98.44
##
# Get the number of observations
nrow(dt)
## [1] 856
# Get the number of variables in the dataset
ncol(dt)
## [1] 17
lm()

Let’s study determinants of college GPA, colgpa. I do include some indicator/binary/dummy variables in the model which we haven’t discussed yet; indicator variables only take on two values, zero and one. The covariates in the regression below include:

- hsgpa
- act
- study
- sq_study, the square of study. The rationale is that at some point studying so much may start to actually hurt your performance.
- calculus
- econhs
- mothcoll
- fathcoll
- fathcoll * mothcoll, the interaction between fathcoll and mothcoll. We include interaction effects when we think the effect of one predictor may depend on the value of another predictor.

# Create variable for the square of total hours per week spent studying
dt[, sq_study := study^2]
# Run regression of colgpa on covariates
reg <- lm(colgpa ~ hsgpa + act + study + sq_study + calculus + econhs + fathcoll + mothcoll + fathcoll*mothcoll, data = dt)
# Print the summary of reg
summary(reg)
##
## Call:
## lm(formula = colgpa ~ hsgpa + act + study + sq_study + calculus +
## econhs + fathcoll + mothcoll + fathcoll * mothcoll, data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.82242 -0.27923 0.02023 0.33206 1.30971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.015799153 0.174786049 -0.090 0.9280
## hsgpa 0.653619400 0.053371308 12.247 < 0.0000000000000002 ***
## act 0.022209043 0.005623968 3.949 0.0000854 ***
## study 0.003385113 0.006576197 0.515 0.6069
## sq_study -0.000001781 0.000168130 -0.011 0.9916
## calculus 0.051133944 0.035352278 1.446 0.1485
## econhs -0.058266376 0.034038218 -1.712 0.0873 .
## fathcoll 0.025177473 0.057065827 0.441 0.6592
## mothcoll 0.116296128 0.047784650 2.434 0.0152 *
## fathcoll:mothcoll -0.043977057 0.071666463 -0.614 0.5396
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4628 on 804 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.2715, Adjusted R-squared: 0.2633
## F-statistic: 33.29 on 9 and 804 DF, p-value: < 0.00000000000000022
Our regression model is \[\begin{align} colgpa &= \beta_0 + \beta_1 hsgpa + \beta_2 act + \beta_3 study + \beta_4 study^2 + \beta_5 calculus \\ &+ \beta_6 econhs + \beta_7 fathcoll + \beta_8 mothcoll + \beta_9 (fathcoll \times mothcoll)\\ &+ u. \end{align}\]
Upon estimation, our regression equation is \[\begin{align} \widehat{colgpa} &= \widehat{\beta}_0 + \widehat{\beta}_1 hsgpa + \widehat{\beta}_2 act + \widehat{\beta}_3 study + \widehat{\beta}_4 study^2 + \widehat{\beta}_5 calculus \\ &+ \widehat{\beta}_6 econhs + \widehat{\beta}_7 fathcoll + \widehat{\beta}_8 mothcoll + \widehat{\beta}_9 (fathcoll \times mothcoll) \\ &= -0.015 + 0.6536 hsgpa + 0.022 act + 0.0034 study - 0.000002 study^2 + 0.0511 calculus \\ &- 0.0582 econhs + 0.0252 fathcoll + 0.1163 mothcoll \\ &- 0.044 (fathcoll \times mothcoll). \end{align}\]
The partial derivative of colgpa with respect to act is \(\frac{\partial colgpa}{\partial act} = \beta_2\), giving the effect act has on colgpa. This estimated effect is \(\widehat{\beta}_2 = 0.022\). This means for every additional point gained on a student’s ACT score, their college GPA is predicted to increase by \(0.022\). Similarly, for every ten additional points gained on a student’s ACT score, their college GPA is predicted to increase by \(0.22\).
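As a quick sketch of pulling this marginal effect straight from the fitted model (act_effect is just an illustrative name, not an object created earlier):

# Extract the estimated coefficient on act from the fitted model
act_effect <- coef(reg)["act"]
# Predicted change in college GPA from 1 and 10 additional ACT points
act_effect * c(1, 10)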
The partial derivative of colgpa with respect to study is \(\frac{\partial colgpa}{\partial study} = \beta_3 + 2 \beta_4 study\), giving the effect study has on colgpa. This estimated effect is \(\widehat{\beta}_3 + 2\widehat{\beta}_4 study = 0.0034 + 2(-0.000002) study\), which depends on the level of study. Evaluated at study = 1, for example, an additional hour spent studying is predicted to increase a student’s college GPA by \(0.0034 + 2(-0.000002)(1) = 0.003396\).
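Because of the quadratic term, this marginal effect varies with the level of study. Here is a minimal sketch of evaluating it at a few values (study_vals and study_effect are illustrative names):

# Marginal effect of study evaluated at a few different levels of study
study_vals <- c(5, 15, 30)
study_effect <- coef(reg)["study"] + 2 * coef(reg)["sq_study"] * study_vals
# Show the marginal effects alongside the study levels they are evaluated at
data.table(study = study_vals, marginal_effect = study_effect)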
The partial derivative of colgpa with respect to mothcoll is \(\frac{\partial colgpa}{\partial mothcoll} = \beta_8 + \beta_9 fathcoll\), giving the effect mothcoll has on colgpa. This estimated effect is \(\widehat{\beta}_8 + \widehat{\beta}_9 fathcoll = 0.1163 - 0.044 fathcoll\). This means if a father does have a bachelor’s degree (fathcoll = 1), then the effect of a mother having a college degree on their child’s college GPA is an increase of \(0.1163 - 0.044(1) = 0.0723\). If a father does not have a bachelor’s degree (fathcoll = 0), then the effect of a mother having a college degree on their child’s college GPA is an increase of \(0.1163 - 0.044(0) = 0.1163\). Although we might expect the estimate \(\widehat{\beta}_9\) to be positive, it is statistically insignificant at any practical level of \(\alpha\), so we would likely drop it from our model before estimating causal effects.
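A quick sketch of computing this effect at both values of fathcoll directly from the fitted model (fathcoll_vals and mothcoll_effect are illustrative names):

# Effect of mothcoll on colgpa when fathcoll = 0 versus fathcoll = 1
fathcoll_vals <- c(0, 1)
mothcoll_effect <- coef(reg)["mothcoll"] + coef(reg)["fathcoll:mothcoll"] * fathcoll_vals
# Show the effect of mothcoll at each value of fathcoll
data.table(fathcoll = fathcoll_vals, effect_of_mothcoll = mothcoll_effect)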
Let’s use the OLS solution \((X^\prime X)^{-1} X^\prime \boldsymbol{y}\) to replicate the parameter estimates produced by lm(). If you haven’t looked at the file ECON_418-518_Matrix_Algebra_Code.html on D2L, I replicate the entire summary() table there and detail the functions used below.
# Drop rows with any NA values
dt <- na.omit(dt)
# Make our y vector
y <- dt[, colgpa]
# Make a vector of 1's to add as the first column of our X matrix for the intercept
Intercept <- rep(1, nrow(dt))
# Create a column in dt for the mothcoll * fathcoll interaction
dt[, mothcoll_fathcoll := mothcoll * fathcoll]
# Construct our X matrix
X <- as.matrix(cbind(Intercept, dt[, .(hsgpa, act, study, sq_study, calculus, econhs, mothcoll, fathcoll, mothcoll_fathcoll)]))
# Show the first 5 rows and all columns of X
X[1:5, ]
## Intercept hsgpa act study sq_study calculus econhs mothcoll fathcoll
## [1,] 1 3.355 27 10.0 100.00 1 0 1 1
## [2,] 1 3.219 24 22.5 506.25 0 1 1 0
## [3,] 1 3.306 21 12.0 144.00 1 0 1 0
## [4,] 1 3.977 31 40.0 1600.00 1 0 1 1
## [5,] 1 3.890 32 15.0 225.00 1 1 1 0
## mothcoll_fathcoll
## [1,] 1
## [2,] 0
## [3,] 0
## [4,] 1
## [5,] 0
# Transpose of X
Xprime <- t(X)
# Compute the OLS solution for beta_hat
beta_hat <- solve(Xprime %*% X) %*% Xprime %*% y
# Show beta_hat
beta_hat
## [,1]
## Intercept -0.015799153368
## hsgpa 0.653619400480
## act 0.022209042854
## study 0.003385113176
## sq_study -0.000001780703
## calculus 0.051133943678
## econhs -0.058266376397
## mothcoll 0.116296127712
## fathcoll 0.025177473186
## mothcoll_fathcoll -0.043977056506
Notice that the coefficient estimates are exactly the same as those produced by the lm() function.
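As a quick sanity check (a minimal sketch reusing the reg and beta_hat objects above; lm_coefs is just an illustrative name), we can line the two sets of estimates up side by side, reordering lm()'s coefficients to match the column order of X:

# Compare the manual OLS estimates to lm()'s; note lm() orders fathcoll before mothcoll
# and names the interaction term "fathcoll:mothcoll"
lm_coefs <- coef(reg)[c("(Intercept)", "hsgpa", "act", "study", "sq_study", "calculus", "econhs", "mothcoll", "fathcoll", "fathcoll:mothcoll")]
data.table(term = rownames(beta_hat), manual = as.numeric(beta_hat), lm = lm_coefs)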
Recall the Adjusted R-squared is given by \(\widetilde{R}^2 = 1 - \frac{SSR / (n-k)}{SST / (n - 1)}\), where \(n\) is the number of observations and \(k\) is the number of estimated parameters (intercept included). Let’s compute this manually.
# Get the number of observations
n <- nrow(X)
# Number of estimated parameters
k <- ncol(X)
# Residuals
u_hat <- y - X %*% beta_hat
# Dot product of residuals gives the SSR
ssr <- t(u_hat) %*% u_hat
# Get the mean of outcome
y_bar <- mean(y)
# Dot product of the difference between outcomes and their mean gives the SST
sst <- t(y - y_bar) %*% (y - y_bar)
# Compute the numerator for the adjusted R-squared
r_sq_adj_num <- ssr / (n - k)
# Compute the denominator for the adjusted R-squared
r_sq_adj_denom <- sst / (n - 1)
# Compute the adjusted R-squared
r_sq_adj <- 1 - (r_sq_adj_num / r_sq_adj_denom)
# Show the adjusted R-squared
r_sq_adj
## [,1]
## [1,] 0.2633391
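This matches the Adjusted R-squared of 0.2633 reported in the summary(reg) output; as a quick check, we can pull that value directly:

# Adjusted R-squared reported by lm() for comparison
summary(reg)$adj.r.squared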
Recall the estimated variance-covariance matrix of the OLS estimators under Assumptions 1-5 is given by \(\widehat{\mathbb{V}} \left[ \widehat{\boldsymbol{\beta}} \right] = \widehat{\sigma}^2 (X^\prime X)^{-1}\), where \(\widehat{\sigma}^2 = \frac{\widehat{\boldsymbol{u}}^\prime \widehat{\boldsymbol{u}}}{n-k} = \frac{SSR}{n-k}\) is the estimated error variance, \(n\) is the number of observations, and \(k\) is the number of estimated parameters (intercept included). Consequently, the standard errors of the OLS estimators are given by the square roots of the elements along the diagonal of the estimated variance-covariance matrix. Let’s compute these standard errors manually.
# Obtain the estimated variance of u
sigma_hat_sq <- ssr / (n - k)
# Get inverse of X_prime %*% X
Xprime_X_inverse <- solve(Xprime %*% X)
# Obtain the estimated variance-covariance matrix
var_cov <- as.numeric(sigma_hat_sq) * Xprime_X_inverse
# Get the diagonal elements of the estimated variance-covariance matrix (these are the estimated variances of the coefficient estimates)
beta_hat_var <- diag(var_cov)
# Take the square root of the estimated variance of the estimated parameters to get the standard error
beta_hat_se <- sqrt(beta_hat_var)
# Show the standard errors
beta_hat_se
## Intercept hsgpa act study
## 0.1747860488 0.0533713079 0.0056239677 0.0065761969
## sq_study calculus econhs mothcoll
## 0.0001681297 0.0353522777 0.0340382178 0.0477846496
## fathcoll mothcoll_fathcoll
## 0.0570658274 0.0716664631
Notice that the standard errors are exactly the same as those produced by the lm() function.
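As with the coefficients, here is a minimal side-by-side check (lm_se is just an illustrative name):

# Standard errors reported by summary(reg), reordered to match the column order of X
lm_se <- summary(reg)$coefficients[, "Std. Error"]
cbind(manual = beta_hat_se, lm = lm_se[c("(Intercept)", "hsgpa", "act", "study", "sq_study", "calculus", "econhs", "mothcoll", "fathcoll", "fathcoll:mothcoll")])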
We discussed scenarios under which we have an omitted variable bias (OVB) issue. Let’s construct two situations with OVB to understand this better.
The true underlying model is \(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i = 1 + 2 x_{1i} + 3x_{2i} + u_i\), but suppose we don’t include \(x_2\) (perhaps we don’t have data on it, which is often an issue when doing empirical work). We know \(x_2\) is positively related to \(y\), as indicated by its positive coefficient in the model \(\left(\beta_2 = 3 \right)\). Moreover, suppose \(x_{1i} = \widetilde{\beta}_0 + \widetilde{\beta}_1 x_{2i} + v_i = 4 + 5 x_{2i} + v_i\), so \(x_{2}\) is also positively related to \(x_1\), as indicated by its positive coefficient in the model \(\left(\widetilde{\beta}_1 = 5 \right)\). Consequently, \(x_2\) is positively related to both \(y\) and \(x_1\), meaning that when we estimate the model \(y_i = \beta_0 + \beta_1 x_{1i} + w_i\), \(\beta_1\) should be overestimated (think positive times positive gives positive bias). Let’s verify this by generating \(s = 1000\) samples each with \(n = 200\) data points, running the regression of \(y\) on \(x_1\) in each sample, and plotting the sampling distribution of \(\widehat{\beta}_1\). If we are correct, then the sampling distribution should be centered around some value to the right of \(\beta_1 = 2\) because \(\widehat{\beta}_1\) is overestimating this parameter.
# Number of samples
s <- 1000
# Number of data points to generate per sample
n <- 200
# Create a data table with empty values to store the estimates of beta_hat_1
dt_estimates <- data.table(beta_hat_1 = rep(NA, s))
# Intercept term when y is the outcome
beta_0 <- 1
# Coefficient on x_1 when y is the outcome
beta_1 <- 2
# Coefficient on x_2 when y is the outcome
beta_2 <- 3
# Intercept term when x_1 is the outcome
beta_tilde_0 <- 4
# Coefficient on x_2 when x_1 is the outcome
beta_tilde_1 <- 5
# We want to create i = 1, 2, ..., s samples
for (i in 1:s)
{
# Error term when y is the outcome
u <- rnorm(n, mean = 0, sd = 2)
# Error term when x_1 is the outcome
v <- rnorm(n, mean = 0, sd = 3)
# x_2
x_2 <- rnorm(n, mean = 0, sd = 1)
# x_1
x_1 <- beta_tilde_0 + beta_tilde_1 * x_2 + v
# Outcome
y <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + u
# Create a data table
dt <- data.table(y = y, x_1 = x_1, x_2 = x_2)
# Run regression of y on x_1
reg <- lm(y ~ x_1, data = dt)
# Place the current slope estimate in dt_estimates
dt_estimates$beta_hat_1[i] <- reg$coefficients["x_1"]
}
# Plot the density of the slope estimates
ggplot(dt_estimates, aes(x = beta_hat_1)) +
geom_density(color = color_2, fill = color_2, alpha = 0.7) +
geom_vline(xintercept = beta_1, color = color_1, linetype = "dashed", size = 1) +
labs(title = expression(paste("Sampling Distribution of ", hat(beta)[1]))) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
) +
xlab(expression(hat(beta)[1])) +
ylab("Density")
Indeed, the sampling distribution of \(\widehat{\beta}_1\) is centered to the right of the parameter it is estimating (\(\beta_1 = 2\)), implying the estimator is biased upward.
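We can also check how far to the right it should be centered. This is a minimal sketch, assuming the simulation objects above (beta_1, beta_2, beta_tilde_1, dt_estimates) are still in memory, and using the standard omitted variable bias formula with the variances set in the data-generating code:

# OVB formula: plim(beta_hat_1) = beta_1 + beta_2 * Cov(x_1, x_2) / Var(x_1)
# With Var(x_2) = 1 and Var(v) = 3^2 as set above, Cov(x_1, x_2) = beta_tilde_1 and Var(x_1) = beta_tilde_1^2 + 3^2
theoretical_mean <- beta_1 + beta_2 * beta_tilde_1 / (beta_tilde_1^2 + 3^2)
# Compare the average of the simulated estimates to the theoretical value
c(simulated = mean(dt_estimates$beta_hat_1), theoretical = theoretical_mean)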
The true underlying model is \(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i = 1 + 2 x_{1i} + 3x_{2i} + u_i\), but suppose we don’t include \(x_2\) (perhaps we don’t have data on it, which is often an issue when doing empirical work). We know \(x_2\) is positively related to \(y\), as indicated by its positive coefficient in the model \(\left(\beta_2 = 3 \right)\). Moreover, suppose \(x_{1i} = \widetilde{\beta}_0 + \widetilde{\beta}_1 x_{2i} + v_i = 4 - 5 x_{2i} + v_i\), so \(x_{2}\) is negatively related to \(x_1\), as indicated by its negative coefficient in the model \(\left(\widetilde{\beta}_1 = -5 \right)\). Consequently, \(x_2\) is positively related to \(y\) and negatively related to \(x_1\), meaning that when we estimate the model \(y_i = \beta_0 + \beta_1 x_{1i} + w_i\), \(\beta_1\) should be underestimated (think positive times negative gives negative bias). Let’s verify this by generating \(s = 1000\) samples each with \(n = 200\) data points, running the regression of \(y\) on \(x_1\) in each sample, and plotting the sampling distribution of \(\widehat{\beta}_1\). If we are correct, then the sampling distribution should be centered around some value to the left of \(\beta_1 = 2\) because \(\widehat{\beta}_1\) is underestimating this parameter.
# Number of samples
s <- 1000
# Number of data points to generate per sample
n <- 200
# Create a data table with empty values to store the estimates of beta_hat_1
dt_estimates <- data.table(beta_hat_1 = rep(NA, s))
# Intercept term when y is the outcome
beta_0 <- 1
# Coefficient on x_1 when y is the outcome
beta_1 <- 2
# Coefficient on x_2 when y is the outcome
beta_2 <- 3
# Intercept term when x_1 is the outcome
beta_tilde_0 <- 4
# Coefficient on x_2 when x_1 is the outcome
beta_tilde_1 <- 5
# We want to create i = 1, 2, ..., s samples
for (i in 1:s)
{
# Error term when y is the outcome
u <- rnorm(n, mean = 0, sd = 2)
# Error term when x_1 is the outcome
v <- rnorm(n, mean = 0, sd = 3)
# x_2
x_2 <- rnorm(n, mean = 0, sd = 1)
# x_1
x_1 <- beta_tilde_0 - beta_tilde_1 * x_2 + v
# Outcome
y <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + u
# Create a data table
dt <- data.table(y = y, x_1 = x_1, x_2 = x_2)
# Run regression of y on x_1
reg <- lm(y ~ x_1, data = dt)
# Place the current slope estimate in dt_estimates
dt_estimates$beta_hat_1[i] <- reg$coefficients["x_1"]
}
# Plot the density of the slope estimates
ggplot(dt_estimates, aes(x = beta_hat_1)) +
geom_density(color = color_2, fill = color_2, alpha = 0.7) +
geom_vline(xintercept = beta_1, color = color_1, linetype = "dashed", size = 1) +
labs(title = expression(paste("Sampling Distribution of ", hat(beta)[1]))) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
) +
xlab(expression(hat(beta)[1])) +
ylab("Density")
Indeed, the sampling distribution of \(\widehat{\beta}_1\) is centered to the left of the parameter it is estimating (\(\beta_1 = 2\)), implying the estimator is biased downward.