Preliminaries

# Clear environment, plot pane, and console
rm(list = ls())
graphics.off()
cat("\014")
# If pacman is not already installed, then install it
if (!require(pacman)) install.packages("pacman")

# Load packages
pacman::p_load(ggplot2, data.table)

# Colors used for plot (https://www.rapidtables.com/web/color/RGB_Color.html)
color_1 <- "#00FF00"
color_2 <- "#ADD8E6"

# Set random seed for reproducibility
set.seed(418518)

Data Analysis with a wooldridge Dataset

Let’s load in the wage1 dataset from the wooldridge package and do some basic data analysis.

# Load in wage1 dataset from wooldridge as a data table
dt <- data.table(wooldridge::wage1)

# Show the first few rows of the data
head(dt)
# Show the last few rows of the data
tail(dt)
# Filter out unwanted columns
dt <- dt[, .(wage, educ, exper, tenure, female, married)]

# Create experience and tenure squared variables as well as the natural log of wage
dt[, `:=` (exper_sq = exper^2, tenure_sq = tenure^2, ln_wage = log(wage))]

# Get summary statistics for female variable
summary(dt[, female])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4791  1.0000  1.0000

Notice in the summary statistics for female that the mean of this covariate is \(0.4791\). Because female is a 0/1 indicator, its mean is simply the share of ones: \(47.91\%\) of our sample consists of females and, consequently, \(52.09\%\) consists of males.
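
We can compute these shares directly from the data table (a quick sketch using the dt object created above):

# The mean of a 0/1 indicator is the sample share of ones
dt[, mean(female)]        # share of females
dt[, 1 - mean(female)]    # share of males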

Suppose we also wanted to create indicators for gender and marital status simultaneously. For example, we wish to create the variables \(\mathbb{1}(female \ and \ single)\), \(\mathbb{1}(female \ and \ married)\), and \(\mathbb{1}(male \ and \ married)\). So, \(\mathbb{1}(male \ and \ single)\) will be our base case that we must exclude from our regression in order for MLR Assumption 3 of No Perfect Multicollinearity to be satisfied. Remember, female = 1 when the individual is a female and female = 0 when the individual is a male. Similarly, married = 1 when the individual is married and married = 0 when the individual is single.

R has a handy ifelse() function that makes it easy to create these variables. For instance, dt[, female_single := ifelse(female == 1 & married == 0, 1, 0)] creates a variable female_single in our data table that equals 1 when female = 1 and married = 0 and equals 0 otherwise. Notice that the code uses a double equal sign in female == 1 and married == 0. This is because == checks whether a condition is true, while = is an assignment operator.
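
As a quick illustration of the difference (a minimal sketch with a throwaway variable x):

# = (or <-) assigns a value; == compares and returns TRUE or FALSE
x <- 3
x == 3                 # TRUE, because the comparison holds
ifelse(x == 3, 1, 0)   # ifelse() maps the logical result to 1 or 0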

Let us now generate these variables.

# Create indicator for when an individual is both female and single
dt[, female_single := ifelse(female == 1 & married == 0, 1, 0)]

# Create indicator for when an individual is both female and married
dt[, female_married := ifelse(female == 1 & married == 1, 1, 0)]

# Create indicator for when an individual is both male and married
dt[, male_married := ifelse(female == 0 & married == 1, 1, 0)]

# Show the summary statistics for these variables
summary(dt[, .(female_single, female_married, male_married)])
##  female_single    female_married   male_married   
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.000   Median :0.0000  
##  Mean   :0.2281   Mean   :0.251   Mean   :0.1635  
##  3rd Qu.:0.0000   3rd Qu.:0.750   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.000   Max.   :1.0000

So, \(22.81\%\) of our sample are single females, \(25.1\%\) are married females, \(16.35\%\) are married males, and \(35.74\%\) are single males.
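
As a cross-check, we can get the same shares from a cross-tabulation of gender and marital status (a quick sketch using data.table's .N):

# Count each gender-by-marital-status cell and convert the counts to sample shares
dt[, .N, by = .(female, married)][, share := N / sum(N)][]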

Indicators and Regressions

Interpretation and Visualization

Suppose we are interested in estimating the model

\[\begin{align} \ln(wage_i) &= \beta_0 + \beta_1 female_i + \beta_2 educ_i + u_i. \end{align}\]

Let’s estimate this model.

# Run regression
reg <- lm(ln_wage ~ female + educ, data = dt)

# Show summary of regression
summary(reg)
## 
## Call:
## lm(formula = ln_wage ~ female + educ, data = dt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.02672 -0.27470 -0.03731  0.26219  1.34738 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.826269   0.094054   8.785   <2e-16 ***
## female      -0.360865   0.039024  -9.247   <2e-16 ***
## educ         0.077203   0.007047  10.955   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4455 on 523 degrees of freedom
## Multiple R-squared:  0.3002, Adjusted R-squared:  0.2975 
## F-statistic: 112.2 on 2 and 523 DF,  p-value: < 2.2e-16

This means that, holding education fixed, a female is estimated to earn approximately \(36.09\%\) less than a male, on average. Let’s show the regression lines for males and females when ln_wage is the outcome variable.
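
Before plotting, a quick side note: because the outcome is in logs, the \(-0.3609\) coefficient is only an approximate percentage difference. If we want the exact figure implied by the model, we can exponentiate the coefficient (a short sketch using the fitted reg object):

# Exact percentage difference in wage for females implied by the log-level model
100 * (exp(coef(reg)["female"]) - 1)   # roughly -30.3, i.e., about a 30% gap rather than 36%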

# Create a data table of education values for predictions
dt_predict <- data.table(
  educ = seq(min(dt[, educ]), max(dt[, educ]), length.out = 100)
)

# Add predicted values from the fitted model, setting female = 0 (males) and female = 1 (females)
dt_predict[, pred_ln_wage_male := predict(reg, newdata = data.table(educ = educ, female = 0))]
dt_predict[, pred_ln_wage_female := predict(reg, newdata = data.table(educ = educ, female = 1))]

# Line plot of education versus log wage with predictions
ggplot() +
  geom_line(data = dt_predict, aes(x = educ, y = pred_ln_wage_female, color = "Female"), linewidth = 2) +
  geom_line(data = dt_predict, aes(x = educ, y = pred_ln_wage_male, color = "Male"), linewidth = 2) +
  labs(title = "Log Wage Versus Education", x = "Education", y = "Log Wage", color = "Gender") +
  scale_color_manual(
    values = c("Female" = "#20B2AA", "Male" = "#FFA07A"),
    labels = c("Female", "Male")
  ) +
  guides(color = guide_legend(order = 1)) +
  theme(
    panel.background = element_rect(fill = "black", color = "black"),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.grid.major = element_line(color = "gray"),
    panel.grid.minor = element_line(color = "gray"),
    axis.text = element_text(color = color_1, size = 15),
    axis.title = element_text(color = color_2, size = 25),
    plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold"),
    legend.background = element_rect(fill = "black"),
    legend.text = element_text(color = color_1, size = 15),
    legend.title = element_text(color = color_2, size = 20)
  )

As we can see, the slope is the same for females and males, \(\widehat{\beta}_2 = 0.077\). However, the intercepts are different! The intercept for males is \(\widehat{\beta}_0 = 0.826\) while the intercept for females is \(\widehat{\beta}_0 + \widehat{\beta}_1 = 0.826 - 0.361 = 0.465\).
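
These intercepts and the common slope can be read straight off the fitted coefficients (a quick check using the reg object estimated above):

# Male intercept, female intercept, and the common education slope implied by the model
b <- coef(reg)
b["(Intercept)"]                # male intercept
b["(Intercept)"] + b["female"]  # female intercept
b["educ"]                       # slope for both groups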

Perfect Multicollinearity

Let’s create a male variable and include it in our above regression to see what happens. This amounts to estimating the model \[\begin{align} \ln(wage_i) &= \beta_0 + \beta_1 female_i + \beta_2 male_i + \beta_3 educ_i + u_i. \end{align}\]

# Create male variable
dt[, male := ifelse(female == 0, 1, 0)]

# Run regression
reg <- lm(ln_wage ~ female + male + educ, data = dt)

# Show regression summary
summary(reg)
## 
## Call:
## lm(formula = ln_wage ~ female + male + educ, data = dt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.02672 -0.27470 -0.03731  0.26219  1.34738 
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.826269   0.094054   8.785   <2e-16 ***
## female      -0.360865   0.039024  -9.247   <2e-16 ***
## male               NA         NA      NA       NA    
## educ         0.077203   0.007047  10.955   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4455 on 523 degrees of freedom
## Multiple R-squared:  0.3002, Adjusted R-squared:  0.2975 
## F-statistic: 112.2 on 2 and 523 DF,  p-value: < 2.2e-16

Notice the NA’s in the male row of the regression summary. This is because including both female and male (along with the intercept) makes the columns of our feature matrix \(X\) linearly dependent, which implies \((X^\prime X)^{-1}\) does not exist. R detects this and automatically drops male from the regression, effectively treating it as the base case, so we don’t have to worry about the issue ourselves.
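
One way to see the linear dependence directly is to check the rank of the design matrix (a small sketch using base R's model.matrix() and qr()):

# Build the design matrix with an intercept, female, male, and educ
X <- model.matrix(~ female + male + educ, data = dt)

# The matrix has 4 columns but rank 3, because the intercept column equals female + male
ncol(X)
qr(X)$rank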

Interactions

Suppose we are interested in estimating the model \[\begin{align} \ln(wage_i) &= \beta_0 + \beta_1 female_i + \beta_2 exper_i + \beta_3 female_i \times exper_i + u_i. \end{align}\]

# Run regression
reg <- lm(ln_wage ~ female + exper + female*exper, data = dt)

# Show summary of regression
summary(reg)
## 
## Call:
## lm(formula = ln_wage ~ female + exper + female * exper, data = dt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.04133 -0.32424 -0.05026  0.35537  1.66394 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.697672   0.048639  34.903  < 2e-16 ***
## female       -0.293432   0.068596  -4.278 2.25e-05 ***
## exper         0.006601   0.002198   3.004   0.0028 ** 
## female:exper -0.005863   0.003157  -1.857   0.0638 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4902 on 522 degrees of freedom
## Multiple R-squared:  0.1544, Adjusted R-squared:  0.1496 
## F-statistic: 31.78 on 3 and 522 DF,  p-value: < 2.2e-16

So, the estimated impact of experience on log wage is \[\begin{align} \frac{\partial \ln \left(\widehat{wage} \right)}{\partial exper} &= 0.007 - 0.006 female. \end{align}\]

This means the estimated effect of two years of experience for male individuals is an increase in wage of approximately \(100 \cdot 2(0.007) = 1.4\%\). On the other hand, the estimated effect of two years of experience for female individuals is an increase in wage of approximately \(100 \cdot 2(0.007 - 0.006) = 0.2\%\).
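
These back-of-the-envelope numbers can be reproduced from the fitted coefficients (a short sketch using the interaction model stored in reg):

# Approximate percent change in wage from two additional years of experience
b <- coef(reg)
100 * 2 * b["exper"]                        # males
100 * 2 * (b["exper"] + b["female:exper"])  # females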

Let’s show the regression line for males and females when ln_wage is the outcome variable.

# Create a data table for predictions
dt_predict <- data.table(
  exper = seq(-1 + min(dt[, exper]), 1 + max(dt[, exper]), length.out = 100),
  female = rep(c(0, 1), each = 100)
)

# Add predicted values to the data table
dt_predict[, pred_ln_wage := predict(reg, newdata = dt_predict)]

# Line plot of education versus log wage with predictions
ggplot(dt_predict, aes(x = exper, y = pred_ln_wage, color = factor(female))) +
  geom_line(linewidth = 2) +
  labs(title = "Log Wage Versus Experience", x = "Experience", y = "Log Wage", color = "Gender") +
  scale_color_manual(
    values = c("0" = "#FFA07A", "1" = "#20B2AA"),
    labels = c("Male", "Female")
  ) +
  guides(color = guide_legend(order = 1)) +
  theme(
    panel.background = element_rect(fill = "black", color = "black"),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.grid.major = element_line(color = "gray"),
    panel.grid.minor = element_line(color = "gray"),
    axis.text = element_text(color = color_1, size = 15),
    axis.title = element_text(color = color_2, size = 25),
    plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold"),
    legend.background = element_rect(fill = "black"),
    legend.text = element_text(color = color_1, size = 15),
    legend.title = element_text(color = color_2, size = 20)
  )

As we can see, the intercept for males is \(\widehat{\beta}_0 = 1.70\) while the intercept for females is \(\widehat{\beta}_0 + \widehat{\beta}_1 = 1.40\). These intercepts are different just like in the figure before. However, including the interaction term now changes the slope as well! The slope for males is \(\widehat{\beta}_2 = 0.0066\) while the slope for females is only \(\widehat{\beta}_2 + \widehat{\beta}_3 = 0.0007\).

Let’s test the hypothesis that the effect of experience on wage is different for females than for males at the \(\alpha = 0.05\) significance level. This amounts to testing \(H_0: \beta_3 = 0\) versus \(H_A: \beta_3 \neq 0\). Since the p-value corresponding to \(\widehat{\beta}_3\) is \(0.0638 > 0.05\), we fail to reject the null hypothesis: at the \(5\%\) level, we do not find statistically significant evidence that the effect of experience on wage differs by gender.

Let’s instead test the hypothesis that the effect of experience on wage is smaller for females than for males at the \(\alpha = 0.05\) significance level. This amounts to testing \(H_0: \beta_3 \geq 0\) versus \(H_A: \beta_3 < 0\). Since the one-sided p-value corresponding to \(\widehat{\beta}_3\) is \(\frac{0.0638}{2} = 0.0319 < 0.05\), we reject this null hypothesis and conclude that the effect of experience on wage is smaller for females than for males.
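
For reference, this one-sided p-value can also be computed directly from the fitted model rather than by halving the printed two-sided one (a short sketch using the reg object and base R's pt()):

# t statistic for the interaction term from the coefficient table
t_stat <- summary(reg)$coefficients["female:exper", "t value"]

# One-sided p-value for H_A: beta_3 < 0 (lower tail of the t distribution)
pt(t_stat, df = reg$df.residual)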

The two tests above are an example of what is commonly called “p-hacking.” p-hacking refers to manipulating a statistical analysis until statistically significant results are achieved. In the first test we failed to reject \(H_0: \beta_3 = 0\) because the p-value was marginally greater than the significance level. We then switched to a one-tailed test, which halves the prior test’s p-value and pushes it below the significance level. Essentially, after not getting a statistically significant result, we changed the hypothesis test so that we would. p-hacking undermines the integrity of your results and you should never do it. I bring it to your attention because it is a common term in econometrics and statistics that you should be aware of.

Multiple Categories

Suppose we are interested in estimating the model

\[\begin{align} \ln(wage_i) &= \beta_0 + \beta_1 educ_i + \beta_2 exper_i + \beta_3 exper^2_i + \beta_4 tenure_i \\ &+ \beta_5 tenure^2_i + \beta_6 female_i + \beta_7 female_i \times educ_i \\ &+ \beta_8 female\_single_i + \beta_9 female\_married_i + \beta_{10} male\_married_i \\ &+ u_i. \end{align}\]

# Run regression
reg <- lm(ln_wage ~ educ + exper + I(exper^2) + tenure + I(tenure^2) + female + I(female * educ)
          + female_single + female_married + male_married, data = dt)

# Show summary of regression
summary(reg)
## 
## Call:
## lm(formula = ln_wage ~ educ + exper + I(exper^2) + tenure + I(tenure^2) + 
##     female + I(female * educ) + female_single + female_married + 
##     male_married, data = dt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.89717 -0.23921 -0.02731  0.23064  1.09619 
## 
## Coefficients: (1 not defined because of singularities)
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.5229468  0.1251879   4.177 3.46e-05 ***
## educ              0.0797647  0.0083781   9.521  < 2e-16 ***
## exper             0.0267623  0.0052526   5.095 4.90e-07 ***
## I(exper^2)       -0.0005343  0.0001107  -4.829 1.82e-06 ***
## tenure            0.0291650  0.0067837   4.299 2.05e-05 ***
## I(tenure^2)      -0.0005351  0.0002318  -2.309 0.021337 *  
## female           -0.3832114  0.1695317  -2.260 0.024212 *  
## I(female * educ) -0.0021880  0.0128780  -0.170 0.865155    
## female_single     0.0875291  0.0524472   1.669 0.095744 .  
## female_married           NA         NA      NA       NA    
## male_married     -0.2122818  0.0554577  -3.828 0.000145 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3937 on 516 degrees of freedom
## Multiple R-squared:  0.4609, Adjusted R-squared:  0.4515 
## F-statistic: 49.02 on 9 and 516 DF,  p-value: < 2.2e-16

Two things:

  1. The I() function is useful for including transformations such as the square of a variable directly in the regression formula. This saves us from having to create the squared variable in our data table before running the regression (see the short sketch after this list).
  2. Notice how female_married got kicked out of the regression. This is because female = female_single + female_married so we have a perfect multicollinearity issue which R kindly takes care of for us.
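
Both points can be verified directly (a small sketch reusing the dt columns created earlier; reg_a and reg_b are throwaway names):

# Item 1: squaring inside I() gives the same fit as using the pre-created exper_sq column
reg_a <- lm(ln_wage ~ exper + I(exper^2), data = dt)
reg_b <- lm(ln_wage ~ exper + exper_sq, data = dt)
all.equal(unname(coef(reg_a)), unname(coef(reg_b)))

# Item 2: the identity behind the dropped column
dt[, all(female == female_single + female_married)]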

In the above model, we see the estimated effect of being a married male is an approximate decrease of \(21.23\%\) in wage relative to a single male. Similarly, the estimated effect of being a single female is an approximate increase of \(8.75\%\) in wage relative to a married female. In this specification, married individuals appear to have lower wages than comparable single individuals.