# Clear environment, plot pane, and console
rm(list = ls())
graphics.off()
cat("\014")
# If pacman is not already installed, then install it
if (!require(pacman)) install.packages("pacman")
# Load packages
pacman::p_load(ggplot2, data.table, wooldridge)
# Colors used for plot (https://www.rapidtables.com/web/color/RGB_Color.html)
color_1 <- "#00FF00"
color_2 <- "#ADD8E6"
# Set random seed for reproducibility
set.seed(418518)
wooldridge Dataset
Let’s load in the wage1 dataset from the wooldridge package and do some basic data analysis.
# Load in wage1 dataset from wooldridge as a data table
dt <- data.table(wooldridge::wage1)
# Show the first few rows of the data
head(dt)
# Show the last few rows of the data
tail(dt)
# Filter out unwanted columns
dt <- dt[, .(wage, educ, exper, tenure, female, married)]
# Create experience and tenure squared variables as well as the natural log of wage
dt[, `:=` (exper_sq = exper^2, tenure_sq = tenure^2, ln_wage = log(wage))]
# Get summary statistics for female variable
summary(dt[, female])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4791 1.0000 1.0000
Notice in the summary statistics for female that the mean of this covariate is \(0.4791\). This means \(47.91\%\) of our sample consists of females. Consequently, \(52.09\%\) of our sample consists of males.
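Because female is a 0/1 dummy variable, its sample mean is simply the share of ones. As a quick check (a small sketch, not part of the original analysis), we can count observations by gender directly:
# Count observations and compute the share of the sample in each gender group
dt[, .(count = .N, share = .N / nrow(dt)), by = female]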
Suppose we also wanted to create indicators for gender and marital status simultaneously. For example, we wish to create the variables \(\mathbb{1}(female \ and \ single)\), \(\mathbb{1}(female \ and \ married)\), and \(\mathbb{1}(male \ and \ married)\). So, \(\mathbb{1}(male \ and \ single)\) will be our base case that we must exclude from our regression in order for MLR Assumption 3 of No Perfect Multicollinearity to be satisfied. Remember, female = 1 when the individual is a female and female = 0 when the individual is a male. Similarly, married = 1 when the individual is married and married = 0 when the individual is single.
R has a nice ifelse() function that allows us to easily create these variables. For instance, dt[, female_single := ifelse(female == 1 & married == 0, 1, 0)] means create a variable female_single in our data table that equals 1 when female = 1 and married = 0 while equaling 0 otherwise. Notice that in the code we use a double equal sign for female == 1 and married == 0. This is because == checks if a condition is true while = is an assignment operator.
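To see the difference concretely, here is a minimal throwaway example (the object x below is purely illustrative and not part of the wage1 analysis):
# Assignment: store the value 0 in x
x = 0
# Comparison: check whether x equals 0 (returns TRUE or FALSE)
x == 0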
Let us now generate these variables.
# Create indicator for when an individual is both female and single
dt[, female_single := ifelse(female == 1 & married == 0, 1, 0)]
# Create indicator for when an individual is both female and married
dt[, female_married := ifelse(female == 1 & married == 1, 1, 0)]
# Create indicator for when an individual is both male and married
dt[, male_married := ifelse(female == 0 & married == 0, 1, 0)]
# Show the summary statistics for these variables
summary(dt[, .(female_single, female_married, male_married)])
## female_single female_married male_married
## Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
## Median :0.0000 Median :0.000 Median :0.0000
## Mean :0.2281 Mean :0.251 Mean :0.1635
## 3rd Qu.:0.0000 3rd Qu.:0.750 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000
So, \(22.81\%\) of our sample are single females, \(25.1\%\) are married females, \(16.35\%\) are married males, and \(35.74\%\) are single males.
Suppose we are interested in estimating the model
\[\begin{align} \ln(wage_i) &= \beta_0 + \beta_1 female_i + \beta_2 educ_i + u_i. \end{align}\]
Let’s estimate this model.
# Run regression
reg <- lm(ln_wage ~ female + educ, data = dt)
# Show summary of regression
summary(reg)
##
## Call:
## lm(formula = ln_wage ~ female + educ, data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.02672 -0.27470 -0.03731 0.26219 1.34738
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.826269 0.094054 8.785 <2e-16 ***
## female -0.360865 0.039024 -9.247 <2e-16 ***
## educ 0.077203 0.007047 10.955 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4455 on 523 degrees of freedom
## Multiple R-squared: 0.3002, Adjusted R-squared: 0.2975
## F-statistic: 112.2 on 2 and 523 DF, p-value: < 2.2e-16
This means a female is estimated to earn approximately \(36.09\%\) less than a male with the same level of education, on average.
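Because the outcome is in logs, multiplying the female coefficient by 100 gives this approximate percentage difference, while the exact percentage difference uses the exponential transformation. A minimal sketch using the reg object estimated above (not part of the original output):
# Approximate percentage wage difference between females and males
100 * coef(reg)["female"]
# Exact percentage difference implied by the log specification
100 * (exp(coef(reg)["female"]) - 1)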
Let’s show the regression lines for males and females when ln_wage is the outcome variable.
# Run separate regressions for males and females
reg_male <- lm(ln_wage ~ educ, data = dt[female == 0])
reg_female <- lm(ln_wage ~ educ, data = dt[female == 1])
# Create a data.table for predictions
dt_predict <- data.table(
educ = seq(min(dt[, educ]), max(dt[, educ]), length.out = 100)
)
# Add predicted values to the data.table
dt_predict[, pred_ln_wage_male := predict(reg_male, newdata = .SD), .SDcols = "educ"]
dt_predict[, pred_ln_wage_female := predict(reg_female, newdata = .SD), .SDcols = "educ"]
# Line plot of education versus log wage with predictions
ggplot() +
geom_line(data = dt_predict, aes(x = educ, y = pred_ln_wage_female, color = "Female"), linewidth = 2) +
geom_line(data = dt_predict, aes(x = educ, y = pred_ln_wage_male, color = "Male"), linewidth = 2) +
labs(title = "Log Wage Versus Education", x = "Education", y = "Log Wage", color = "Gender") +
scale_color_manual(
values = c("Female" = "#20B2AA", "Male" = "#FFA07A"),
labels = c("Female", "Male")
) +
guides(color = guide_legend(order = 1)) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold"),
legend.background = element_rect(fill = "black"),
legend.text = element_text(color = color_1, size = 15),
legend.title = element_text(color = color_2, size = 20)
)
As we can see, the slope for both females and males is \(\widehat{\beta}_2 = 0.08\). However, the intercepts are different! The intercept for males is \(\widehat{\beta}_0 = 0.826\) while the intercept for females is \(\widehat{\beta}_0 + \widehat{\beta}_1 = 0.826 - 0.361 = 0.465\).
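These intercepts and the common slope can be pulled directly from the fitted dummy-variable model; a minimal sketch using the reg object from the regression of ln_wage on female and educ:
# Common slope on education
coef(reg)["educ"]
# Intercept for males (female = 0)
coef(reg)["(Intercept)"]
# Intercept for females (female = 1)
coef(reg)["(Intercept)"] + coef(reg)["female"]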
Let’s create a male variable and include it in our above regression to see what happens. This amounts to estimating the model
\[\begin{align} \ln(wage_i) &= \beta_0 + \beta_1 female_i + \beta_2 male_i + \beta_3 educ_i + u_i. \end{align}\]
# Create male variable
dt[, male := ifelse(female == 0, 1, 0)]
# Run regression
reg <- lm(ln_wage ~ female + male + educ, data = dt)
# Show regression summary
summary(reg)
##
## Call:
## lm(formula = ln_wage ~ female + male + educ, data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.02672 -0.27470 -0.03731 0.26219 1.34738
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.826269 0.094054 8.785 <2e-16 ***
## female -0.360865 0.039024 -9.247 <2e-16 ***
## male NA NA NA NA
## educ 0.077203 0.007047 10.955 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4455 on 523 degrees of freedom
## Multiple R-squared: 0.3002, Adjusted R-squared: 0.2975
## F-statistic: 112.2 on 2 and 523 DF, p-value: < 2.2e-16
Notice how there are NA’s in the male row of the regression summary. This is because R can tell that including male in the regression creates a perfect multicollinearity issue: the columns of our feature matrix \(X\) become linearly dependent, implying \((X^\prime X)^{-1}\) does not exist. Thus, R kicks male out of the regression automatically, treating it as the base case, so we don’t have to worry about this issue.
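As a quick sanity check (not part of the original analysis), we can confirm the exact linear dependence R is detecting: female and male sum to one for every observation, so male is a linear combination of the intercept column and female.
# female + male equals 1 for every row, so male = 1 - female
dt[, all(female + male == 1)]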
Suppose we are interested in estimating the model \[\begin{align} \ln(wage_i) &= \beta_0 + \beta_1 female_i + \beta_2 exper_i + \beta_3 female_i \times exper_i + u_i. \end{align}\]
# Run regression
reg <- lm(ln_wage ~ female + exper + female*exper, data = dt)
# Show summary of regression
summary(reg)
##
## Call:
## lm(formula = ln_wage ~ female + exper + female * exper, data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.04133 -0.32424 -0.05026 0.35537 1.66394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.697672 0.048639 34.903 < 2e-16 ***
## female -0.293432 0.068596 -4.278 2.25e-05 ***
## exper 0.006601 0.002198 3.004 0.0028 **
## female:exper -0.005863 0.003157 -1.857 0.0638 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4902 on 522 degrees of freedom
## Multiple R-squared: 0.1544, Adjusted R-squared: 0.1496
## F-statistic: 31.78 on 3 and 522 DF, p-value: < 2.2e-16
So, the estimated impact of experience on log wage is \[\begin{align} \frac{\partial \ln \left(\widehat{wage} \right)}{\partial exper} &= 0.007 - 0.006 female. \end{align}\]
This means the estimated effect of two years of experience for male individuals is an increase in wage of approximately \(100 \cdot 2(0.007) = 1.4\%\). On the other hand, the estimated effect of two years of experience for female individuals is an increase in wage of approximately \(100 \cdot 2(0.007 - 0.006) = 0.2\%\).
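These back-of-the-envelope effects can be computed directly from the fitted coefficients; a small sketch using the reg object from the interaction regression above:
# Approximate percentage effect of two additional years of experience, by gender
b <- coef(reg)
100 * 2 * b["exper"]                         # males
100 * 2 * (b["exper"] + b["female:exper"])   # females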
Let’s show the regression lines for males and females when ln_wage is the outcome variable, this time with experience on the horizontal axis.
# Create a data table for predictions
dt_predict <- data.table(
exper = seq(-1 + min(dt[, exper]), 1 + max(dt[, exper]), length.out = 100),
female = rep(c(0, 1), each = 100)
)
# Add predicted values to the data table
dt_predict[, pred_ln_wage := predict(reg, newdata = dt_predict)]
# Line plot of education versus log wage with predictions
ggplot(dt_predict, aes(x = exper, y = pred_ln_wage, color = factor(female))) +
geom_line(linewidth = 2) +
labs(title = "Log Wage Versus Experience", x = "Experience", y = "Log Wage", color = "Gender") +
scale_color_manual(
values = c("0" = "#FFA07A", "1" = "#20B2AA"),
labels = c("Male", "Female")
) +
guides(color = guide_legend(order = 1)) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold"),
legend.background = element_rect(fill = "black"),
legend.text = element_text(color = color_1, size = 15),
legend.title = element_text(color = color_2, size = 20)
)
As we can see, the intercept for males is \(\widehat{\beta}_0 = 1.69\) while the intercept for females is \(\widehat{\beta}_0 + \widehat{\beta}_1 = 1.40\). These intercepts are different just like in the figure before. However, including the interaction term now changes the slope as well! The slope for males is \(\widehat{\beta}_2 = 0.006\) while the slope for females is only \(\widehat{\beta}_2 + \widehat{\beta}_3 = 0.0007\).
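As before, the intercepts and slopes in the figure can be recovered directly from the fitted coefficients; a short sketch using the same reg object:
# Intercepts for males and females
b <- coef(reg)
b[["(Intercept)"]]
b[["(Intercept)"]] + b[["female"]]
# Slopes on experience for males and females
b[["exper"]]
b[["exper"]] + b[["female:exper"]]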
Let’s test the hypothesis that the effect of experience on wage is different for females than males at the \(\alpha = 0.05\) significance level. This amounts to testing \(H_0: \beta_3 = 0\) versus \(H_A: \beta_3 \neq 0\). Since the p-value corresponding to \(\widehat{\beta}_3\) is \(0.0638 > 0.05\), we fail to reject this null hypothesis and conclude that the effect of experience on wage does not differ across gender with \(95\%\) confidence.
Let’s test the hypothesis that the effect of experience on wage is less for females than males at the \(\alpha = 0.05\) significance level. This amounts to testing \(H_0: \beta_3 \geq 0\) versus \(H_A: \beta_3 < 0\). Since the p-value corresponding to \(\widehat{\beta}_3\) is \(\frac{0.0638}{2} = 0.0319 < 0.05\), we reject this null hypothesis and conclude that the effect of experience on wage is less for females than for males.
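The one-tailed p-value can be computed straight from the regression summary; a minimal sketch (since the estimate of \(\widehat{\beta}_3\) is negative, i.e., in the direction of the alternative, the two-sided p-value is simply halved):
# Two-sided p-value on the interaction term
p_two_sided <- summary(reg)$coefficients["female:exper", "Pr(>|t|)"]
# One-sided p-value for H_A: beta_3 < 0
p_two_sided / 2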
The above is an example of what is commonly called “p-hacking.” p-hacking refers to manipulating a statistical analysis until statistically significant results are achieved. In the first test we concluded that \(\beta_3 = 0\) because the p-value was marginally greater than the significance level. Then, we conducted a one-tailed test, which involves dividing the prior test’s p-value by two, resulting in a one-tailed p-value less than the significance level. Essentially, this can be seen as first failing to get a statistically significant result and then manipulating our hypothesis test so that we do. p-hacking diminishes the integrity of your results and you should never do it. I bring it to your attention because it is a common term in econometrics and statistics that you should be aware of.
Suppose we are interested in estimating the model
\[\begin{align} \ln(wage_i) &= \beta_0 + \beta_1 educ_i + \beta_2 exper_i + \beta_3 exper^2_i + \beta_4 tenure_i \\ &+ \beta_5 tenure^2_i + \beta_6 female_i + \beta_7 female_i \times educ_i \\ &+ \beta_8 female\_single_i + \beta_9 female\_married_i + \beta_{10} male\_married_i \\ &+ u_i. \end{align}\]
# Run regression
reg <- lm(ln_wage ~ educ + exper + I(exper^2) + tenure + I(tenure^2) + female + I(female * educ)
+ female_single + female_married + male_married, data = dt)
# Show summary of regression
summary(reg)
##
## Call:
## lm(formula = ln_wage ~ educ + exper + I(exper^2) + tenure + I(tenure^2) +
## female + I(female * educ) + female_single + female_married +
## male_married, data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.89717 -0.23921 -0.02731 0.23064 1.09619
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5229468 0.1251879 4.177 3.46e-05 ***
## educ 0.0797647 0.0083781 9.521 < 2e-16 ***
## exper 0.0267623 0.0052526 5.095 4.90e-07 ***
## I(exper^2) -0.0005343 0.0001107 -4.829 1.82e-06 ***
## tenure 0.0291650 0.0067837 4.299 2.05e-05 ***
## I(tenure^2) -0.0005351 0.0002318 -2.309 0.021337 *
## female -0.3832114 0.1695317 -2.260 0.024212 *
## I(female * educ) -0.0021880 0.0128780 -0.170 0.865155
## female_single 0.0875291 0.0524472 1.669 0.095744 .
## female_married NA NA NA NA
## male_married -0.2122818 0.0554577 -3.828 0.000145 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3937 on 516 degrees of freedom
## Multiple R-squared: 0.4609, Adjusted R-squared: 0.4515
## F-statistic: 49.02 on 9 and 516 DF, p-value: < 2.2e-16
Two things:
1. The I() function in the regression is useful when including things like the squaring of a variable in a regression. This permits us to not need to first create the squared variable and add it to our data table before running the regression.
2. female_married got kicked out of the regression. This is because female = female_single + female_married, so we have a perfect multicollinearity issue which R kindly takes care of for us (see the quick check below).
In the above model, we see the estimated effect of being a married male is an approximate decrease of \(21.23\%\) in wage relative to a single male. Similarly, the estimated effect of being a single female is an approximate increase of \(8.75\%\) in wage relative to a married female. As our intuition would lead us to believe, married individuals seem to have lower wages relative to single individuals.
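As a quick sanity check of the two points above (a verification sketch, not part of the original analysis), we can confirm both identities directly in the data table:
# Point 1: the I() terms equal the squared variables we created earlier
dt[, all(exper^2 == exper_sq & tenure^2 == tenure_sq)]
# Point 2: female is exactly female_single + female_married, which is why one indicator is dropped
dt[, all(female == female_single + female_married)]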