# Clear environment, plot pane, and console
rm(list = ls())
graphics.off()
cat("\014")
# If pacman is not already installed, then install it
if (!require(pacman)) install.packages("pacman")
# Load packages
pacman::p_load(ggplot2, data.table)
# Colors used for plot (https://www.rapidtables.com/web/color/RGB_Color.html)
color_1 <- "#00FF00"
color_2 <- "#ADD8E6"
# Set random seed for reproducibility
set.seed(418518)
Wooldridge Dataset

Let’s load in the wage1 dataset from the wooldridge package and do some basic data analysis.
# Load in wage1 dataset from wooldridge as a data table
dt <- data.table(wooldridge::wage1)
# Show the first few rows of the data
head(dt)
# Show the last few rows of the data
tail(dt)
# Get the names of the variables in our dataset
names(dt)
## [1] "wage" "educ" "exper" "tenure" "nonwhite" "female"
## [7] "married" "numdep" "smsa" "northcen" "south" "west"
## [13] "construc" "ndurman" "trcommpu" "trade" "services" "profserv"
## [19] "profocc" "clerocc" "servocc" "lwage" "expersq" "tenursq"
# Filter out unwanted columns
dt <- dt[, .(wage, educ, exper, tenure)]
# Get summary statistics
summary(dt)
## wage educ exper tenure
## Min. : 0.530 Min. : 0.00 Min. : 1.00 Min. : 0.000
## 1st Qu.: 3.330 1st Qu.:12.00 1st Qu.: 5.00 1st Qu.: 0.000
## Median : 4.650 Median :12.00 Median :13.50 Median : 2.000
## Mean : 5.896 Mean :12.56 Mean :17.02 Mean : 5.105
## 3rd Qu.: 6.880 3rd Qu.:14.00 3rd Qu.:26.00 3rd Qu.: 7.000
## Max. :24.980 Max. :18.00 Max. :51.00 Max. :44.000
# Get the number of observations
nrow(dt)
## [1] 526
# Show how many NA values there are
sum(is.na(dt))
## [1] 0
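Beyond a single total, it can help to know which columns any NA values live in. A minimal base-R sketch on a toy data frame (a stand-in so it runs on its own; on the real data you would pass `dt` itself):

```r
# Toy data frame with one missing value (stands in for dt)
toy <- data.frame(wage = c(3.1, NA, 4.5), educ = c(12, 12, 16))

# Count NA values column by column
na_counts <- colSums(is.na(toy))
na_counts
```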
We talked about including a squared term for one of our model’s covariates if we believe there exists a quadratic relationship between the outcome and that covariate. Suppose we think wage is quadratically related to experience. Let’s test this hypothesis by creating a simple scatter plot with a fitted curve generated by running this polynomial regression.
# Scatter plot of wage vs experience with a parabola
ggplot(dt, aes(x = exper, y = wage)) +
geom_point(color = "white") +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), color = color_1) +
labs(title = "Wage Versus Experience", x = "Experience", y = "Wage") +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
)
The above plot fits a polynomial regression model of degree two, i.e., \(wage_i = \beta_0 + \beta_1 exper_i + \beta_2 exper_i^2 + u_i\). As we can see, there does seem to be a quadratic relationship between wage and experience. So, let’s add \(exper^2\) to our data table and run a regression of wage on education, experience, and experience squared.
# Add experience squared column to data table
dt[, exper_sq := exper^2]
# Run regression
reg <- lm(wage ~ educ + exper + exper_sq, data = dt)
# Show regression summary
summary(reg)
##
## Call:
## lm(formula = wage ~ educ + exper + exper_sq, data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0692 -2.0837 -0.5417 1.2860 15.1363
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.964890 0.752153 -5.271 1.99e-07 ***
## educ 0.595343 0.053025 11.228 < 2e-16 ***
## exper 0.268287 0.036897 7.271 1.31e-12 ***
## exper_sq -0.004612 0.000822 -5.611 3.26e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.166 on 522 degrees of freedom
## Multiple R-squared: 0.2692, Adjusted R-squared: 0.265
## F-statistic: 64.11 on 3 and 522 DF, p-value: < 2.2e-16
Wage is measured in dollars per hour. We can easily see that the effect of an additional two years of education on wage is an increase of \(2(0.5953) = \$1.19\) per hour.
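Rather than retyping estimates by hand, you can pull them out of the fitted object. A small sketch using the rounded coefficient reported above (with the model object in hand, `coef(reg)["educ"]` would give the same number):

```r
# Coefficient on educ from the regression summary above (rounded)
b_educ <- 0.595343

# Estimated effect of two additional years of education on hourly wage
effect_2yr <- 2 * b_educ
round(effect_2yr, 2)
```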
What about experience though? We can see that the parameter estimate for squared experience is negative, indicating the diminishing impact of experience on wage, as we would expect. Let’s differentiate with respect to experience and see what we get for its marginal effect.
\[\begin{align} \frac{\partial \widehat{wage}}{\partial exper} &= \frac{\partial}{\partial exper} \left( -3.96 + 0.5953 educ + 0.2683 exper - 0.0046 exper^2 \right) \\ &= 0.2683 - 0.0092 exper. \end{align}\]
Consequently, the first year of experience leads to an estimated wage increase of \(0.2683 - 0.0092 = \$0.26\) per hour, while the tenth year of experience leads to an increase of only \(0.2683 - 0.0092(10) = \$0.18\) per hour.
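These calculations are less error-prone if we wrap the derivative in a small function. A sketch using the rounded coefficients from the regression above:

```r
# Estimated marginal effect of experience at a given experience level,
# using the rounded coefficients from the regression summary above
marginal_effect <- function(exper) {
  0.2683 - 2 * 0.0046 * exper
}

# Effect at the first and at the tenth year of experience
round(marginal_effect(1), 2)   # 0.26
round(marginal_effect(10), 2)  # 0.18
```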
Let’s now compute the average partial effect (APE) and the median partial effect (MPE). These values should be similar if the distribution of experience is roughly symmetric. If they are quite different, we likely want to go with the MPE because the median is more robust to outliers, which skew the distribution to the left or to the right.
# Obtain the mean and median of experience
avg_exper <- mean(dt[, exper], na.rm = TRUE)
med_exper <- median(dt[, exper], na.rm = TRUE)
# Show the APE
paste0("The average partial effect (APE) is ", round(0.2683 - (0.0092 * avg_exper), 4))
## [1] "The average partial effect (APE) is 0.1117"
# Show the MPE
paste0("The median partial effect (MPE) is ", 0.2683 - (0.0092 * med_exper))
## [1] "The median partial effect (MPE) is 0.1441"
This means the average effect of an additional year of experience is an increase in wage of \(\$0.11\) per hour, while the median effect is an increase of \(\$0.14\) per hour. We can similarly make conclusions about, for example, the average/median effect of ten additional years of experience on hourly wage.
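A useful check: because the marginal effect of a quadratic term is linear in \(exper\), averaging the observation-level effects gives exactly the effect evaluated at mean experience, so plugging the mean into the derivative is not an approximation here. A self-contained sketch (toy experience values standing in for `dt$exper`, rounded coefficients from above):

```r
# Rounded coefficients from the regression summary above
b_exper    <- 0.2683
b_exper_sq <- -0.0046

# Toy experience values standing in for dt$exper
exper <- c(2, 5, 13, 26, 40)

# Observation-level marginal effects
me_i <- b_exper + 2 * b_exper_sq * exper

# APE two ways: average the effects, or evaluate at mean experience
ape_avg     <- mean(me_i)
ape_at_mean <- b_exper + 2 * b_exper_sq * mean(exper)
all.equal(ape_avg, ape_at_mean)  # identical when the effect is linear
```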
Suppose we are also interested in the model \(wage_i = \beta_0 + \beta_1 educ_i + \beta_2 tenure_i + \beta_3 tenure_i^2 + u_i\). Is this model a better predictor of wage, or is the previous one? These two models are non-nested, so to make a determination we also estimate this one and obtain its adjusted R-squared. Recall that the adjusted R-squared measures how well our model explains our outcome while penalizing additional covariates: a higher adjusted R-squared means a higher proportion of the variance (uncertainty) of our outcome is explained.
# Add tenure squared column to data table
dt[, tenure_sq := tenure^2]
# Run regression
reg <- lm(wage ~ educ + tenure + tenure_sq, data = dt)
# Show regression summary
summary(reg)
##
## Call:
## lm(formula = wage ~ educ + tenure + tenure_sq, data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1685 -1.7149 -0.5627 1.1531 14.4415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.419542 0.637615 -3.795 0.000165 ***
## educ 0.562224 0.048430 11.609 < 2e-16 ***
## tenure 0.329938 0.047700 6.917 1.35e-11 ***
## tenure_sq -0.005523 0.001729 -3.194 0.001489 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.065 on 522 degrees of freedom
## Multiple R-squared: 0.3152, Adjusted R-squared: 0.3113
## F-statistic: 80.1 on 3 and 522 DF, p-value: < 2.2e-16
The previous model’s adjusted R-squared is \(0.265\) while this model’s is \(0.3113\). That means this model explains \(100(0.3113 - 0.265) = 4.63\) percentage points more of the variance in wage. Consequently, based on this measure, we would choose the model using tenure as a covariate over the model using experience.
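The adjusted R-squared values can be pulled straight from the fitted objects instead of read off the printouts. A self-contained sketch on simulated data (since the wage1 models themselves are not refit here), comparing two non-nested models via `summary(reg)$adj.r.squared`:

```r
set.seed(1)

# Simulated data: y truly depends on x2, not x1
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x2 + rnorm(n)

# Two non-nested candidate models
reg_a <- lm(y ~ x1)
reg_b <- lm(y ~ x2)

# Compare adjusted R-squared; prefer the model with the larger value
adj_a <- summary(reg_a)$adj.r.squared
adj_b <- summary(reg_b)$adj.r.squared
adj_b > adj_a
```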
Suppose now we hypothesize that the effect of an additional year of experience on hourly wage is less for individuals with more education. Stated differently, experience is more valuable towards your hourly wage when you have less education. To test this hypothesis we could estimate the model \(wage_i = \beta_0 + \beta_1 educ_i + \beta_2 exper_i + \beta_3 educ_i \times exper_i + u_i\). If our hypothesis is correct, we would expect the estimate of \(\beta_3\) to be negative.
# Run regression
reg <- lm(wage ~ educ + exper + educ*exper, data = dt)
# Show regression summary
summary(reg)
##
## Call:
## lm(formula = wage ~ educ + exper + educ * exper, data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6747 -1.9683 -0.6991 1.2803 15.8067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.859916 1.181080 -2.421 0.0158 *
## educ 0.601735 0.089900 6.693 5.64e-11 ***
## exper 0.045769 0.042614 1.074 0.2833
## educ:exper 0.002062 0.003491 0.591 0.5549
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 522 degrees of freedom
## Multiple R-squared: 0.2257, Adjusted R-squared: 0.2212
## F-statistic: 50.71 on 3 and 522 DF, p-value: < 2.2e-16
So, the partial effect of an additional year of experience on hourly wage is \[\begin{align} \frac{\partial \widehat{wage}}{\partial exper} &= \frac{\partial}{\partial exper} \left( -2.86 + 0.6017 educ + 0.0458 exper + 0.0021 educ \times exper \right) \\ &= 0.0458 + 0.0021 educ. \end{align}\]
Interestingly, the marginal impact of an additional year of experience on hourly wage is greater for individuals with higher education levels, as indicated by \(\widehat{\beta}_3 = 0.0021\). Specifically, for individuals with one year of education, an additional year of experience is estimated to increase hourly wage by \(0.0458 + 0.0021 = \$0.05\), while for individuals with ten years of education the estimated increase is \(0.0458 + 10(0.0021) = \$0.07\). However, the parameter estimates on both \(exper\) and \(educ \times exper\) are insignificant at any conventional level, so these effects are likely not too telling.
We can get the APE and MPE for an interaction-effect model in an identical manner to the quadratic model above.
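For the interaction model the marginal effect of experience is \(0.0458 + 0.0021\,educ\), so the APE and MPE simply plug in the mean and median of education. A sketch with the rounded estimates from above and toy education values standing in for `dt$educ`:

```r
# Rounded coefficients from the interaction regression above
b_exper      <- 0.0458
b_educ_exper <- 0.0021

# Toy education values standing in for dt$educ
educ <- c(8, 12, 12, 14, 16, 18)

# APE and MPE of an additional year of experience
ape <- b_exper + b_educ_exper * mean(educ)
mpe <- b_exper + b_educ_exper * median(educ)
round(c(APE = ape, MPE = mpe), 4)
```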