Preliminaries

In the below code you will see something new that I usually don’t list in the preliminaries section: set.seed(). Throughout this code we will be generating random numbers and running experiments. That means that each time we run our .R script, the data we generate will be completely different from the time before. However, when we set a seed we ensure that our results are reproducible, meaning the results I get from running my .R script today will be the same as the results I get when I run it tomorrow. The number 418518 I am passing into the set.seed() function is called the seed. Every time I run my experiments with this seed I will get identical results. However, if I changed this seed to, say, 418519, then I would get a different set of random numbers for my experiment. Setting seeds is extremely common in practical settings, so please do take the time to understand this concept.
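To see what reproducibility means in practice, here is a minimal sketch (separate from the script below): resetting the seed to the same value reproduces the exact same draws.

# Set the seed and draw three normal random variables
set.seed(418518)
rnorm(3)

# Reset the same seed; the next three draws are identical to the ones above
set.seed(418518)
rnorm(3)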

# Clear environment, plot pane, and console
rm(list = ls())
graphics.off()
cat("\014")
# If pacman is not already installed, then install it
if (!require(pacman)) install.packages("pacman")

# Load packages
pacman::p_load(ggplot2, data.table)

# Colors used for plot (https://www.rapidtables.com/web/color/RGB_Color.html)
color_1 <- "#00FF00"
color_2 <- "#ADD8E6"

# Set random seed for reproducibility
set.seed(418518)

Population Versus Estimated Regression Functions

Generate Data

In the below code, we will generate some data ourselves and place it into a data table. Some notes:

  1. The function rnorm(n, mean = 5, sd = 2) will generate n random draws from a normal distribution with mean = 5 and sd = 2.
  2. dt <- data.table(x = x, y = y) creates a data table object via the data.table() function and assigns it to dt. The inputs x = x and y = y are somewhat confusing at first glance. To the left of each equal sign is the column name we are using in dt and to the right of the equal sign is the data that goes in that column.
# Number of data points to generate
n <- 200

# Intercept term
beta_0 <- 1

# Slope term
beta_1 <- 2

# Covariate (rnorm() will generate n normal random variables with a mean of 5 and standard deviation of 2)
x <- rnorm(n, mean = 5, sd = 2)

# Error term (its mean of 1 will be absorbed into the estimated intercept, shifting the estimated line above the population line)
u <- rnorm(n, mean = 1, sd = 2)

# Outcome
y <- beta_0 + beta_1 * x + u

# Create a data table
dt <- data.table(x = x, y = y)

Plot the True Line and the Estimated Line

# Create the plot
ggplot(dt, aes(x = x, y = y)) + 
  geom_point(color = color_1) + 
  geom_abline(aes(intercept = beta_0, slope = beta_1, color = "Population"), linetype = "dashed", size = 2) +  # True line
  geom_smooth(aes(color = "Estimated"), method = "lm", se = FALSE, size = 2) +  # Fitted line
  labs(
    title = "Regression Functions",
    color = "Regression Function" 
  ) +
  scale_color_manual(
    values = c("Population" = "#9370DB", "Estimated" = "#FFA07A"),
    breaks = c("Population", "Estimated")
  ) +
  theme(
    panel.background = element_rect(fill = "black", color = "black"),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.grid.major = element_line(color = "gray"),
    panel.grid.minor = element_line(color = "gray"),
    axis.text = element_text(color = color_1, size = 15),
    axis.title = element_text(color = color_2, size = 25),
    plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold"),
    legend.background = element_rect(fill = "black"),
    legend.text = element_text(color = color_1, size = 15),
    legend.title = element_text(color = color_2, size = 20)
  ) +
  expand_limits(y = 0) +
  xlab("x") + 
  ylab("y")

It’s likely you looked at the above code and had no idea what was going on. That’s totally fine! As I said in the R tutorial document, which is located on D2L under Content>R and RStudio>ECON_418-518_R_Tutorial, when you read some code you can’t understand or need to figure out how to write some code, use ChatGPT! I always have ChatGPT up when I code and it has increased my coding efficiency tenfold. In fact, I used ChatGPT to create that plot!

Please do the following:

  1. Go to ChatGPT.
  2. If you don’t already have an account, then create one (ChatGPT 3.5 is currently free). Otherwise, log into your account.
  3. On the bottom part of your screen you should see Message ChatGPT. In that box, type What is ggplot2? Also, give me an example of how to use it. Then, press Enter.

You should get a fairly detailed definition of the ggplot2 package as well as an example of how to use it. You can do this for essentially any coding question you may have. Now,

  1. Copy and paste the entire block of code from above in the Message ChatGPT box. Press the space bar and type Explain this R code and press Enter.

Again, you should get a fairly detailed description of the code and what each aspect of it does.

I highly encourage you to take advantage of ChatGPT throughout this course as well as when you get into the professional world!

Data Analysis with a wooldridge Dataset

Any computer-based homework assignment from the book should have a dataset that MindTap will ask you to download as a .RData file. There is no need to download it because you can use the wooldridge package to load the dataset directly into R. Let’s load in the gpa2 dataset from the wooldridge package and do some basic data analysis.

The syntax below is fairly straightforward. I would like to make two points:

  1. When I use dt[, colgpa], the part to the right of the comma means we want the colgpa column. The part to the left of the comma is empty. This tells R to get all observations (all rows). Had we used dt[1, colgpa] this would have given us the first observation’s college GPA. Had we used dt[1:5, colgpa] this would have given us the first five observations’ college GPAs.

  2. When I use na.rm = TRUE, this tells R to exclude missing data in the column of interest when calculating a statistic. For instance, when I run the code mean(dt[, colgpa], na.rm = TRUE), this tells R to calculate the mean of the colgpa column over all observations and to exclude any missing values when making this calculation. Had we not included na.rm = TRUE and there were missing values in the colgpa column, R would be unable to compute the mean and would return NA instead, as the short sketch below shows.
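To make this concrete, here is a minimal sketch with a toy vector (v is purely illustrative and not part of any dataset we use):

# A toy vector with one missing value
v <- c(2.1, 3.4, NA)

# Returns NA because the missing value propagates
mean(v)

# Returns 2.75 because the NA is dropped before the mean is computed
mean(v, na.rm = TRUE)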

# Load in gpa2 dataset from wooldridge as a data table
dt <- data.table(wooldridge::gpa2)

# Show the first few rows of the data
head(dt)
# Show the last few rows of the data
tail(dt)
# Get the names of the variables in our dataset
names(dt)
##  [1] "sat"      "tothrs"   "colgpa"   "athlete"  "verbmath" "hsize"   
##  [7] "hsrank"   "hsperc"   "female"   "white"    "black"    "hsizesq"
# Get summary statistics
summary(dt)
##       sat           tothrs           colgpa         athlete       
##  Min.   : 470   Min.   :  6.00   Min.   :0.000   Min.   :0.00000  
##  1st Qu.: 940   1st Qu.: 17.00   1st Qu.:2.210   1st Qu.:0.00000  
##  Median :1030   Median : 47.00   Median :2.660   Median :0.00000  
##  Mean   :1030   Mean   : 52.83   Mean   :2.653   Mean   :0.04689  
##  3rd Qu.:1120   3rd Qu.: 80.00   3rd Qu.:3.120   3rd Qu.:0.00000  
##  Max.   :1540   Max.   :137.00   Max.   :4.000   Max.   :1.00000  
##     verbmath          hsize          hsrank           hsperc       
##  Min.   :0.2597   Min.   :0.03   Min.   :  1.00   Min.   : 0.1667  
##  1st Qu.:0.7759   1st Qu.:1.65   1st Qu.: 11.00   1st Qu.: 6.4328  
##  Median :0.8667   Median :2.51   Median : 30.00   Median :14.5833  
##  Mean   :0.8805   Mean   :2.80   Mean   : 52.83   Mean   :19.2371  
##  3rd Qu.:0.9649   3rd Qu.:3.68   3rd Qu.: 70.00   3rd Qu.:27.7108  
##  Max.   :1.6667   Max.   :9.40   Max.   :634.00   Max.   :92.0000  
##      female           white            black            hsizesq       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   : 0.0009  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.00000   1st Qu.: 2.7225  
##  Median :0.0000   Median :1.0000   Median :0.00000   Median : 6.3001  
##  Mean   :0.4496   Mean   :0.9255   Mean   :0.05535   Mean   :10.8535  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:13.5424  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :88.3600
# Get the number of observations
nrow(dt)
## [1] 4137
# Get the number of variables in the dataset 
ncol(dt)
## [1] 12
# Get the mean of college GPA
mean(dt[, colgpa], na.rm = TRUE)
## [1] 2.652686
# Get the mean of SAT
mean(dt[, sat], na.rm = TRUE)
## [1] 1030.331
# Get the standard deviation of SAT 
sd(dt[, sat], na.rm = TRUE)
## [1] 139.4014

Simple Linear Regression (SLR)

SLR using the lm() Function

To estimate the simple linear regression model \(colgpa = \beta_0 + \beta_1 sat + u\), we say we are running a regression of colgpa on sat or regressing colgpa on sat. R can easily do this with its built-in lm() function. The syntax for the lm() function in this specific case is lm(colgpa ~ sat, data = dt). This means colgpa is our y variable, the ~ can be thought of as an equal sign, and sat is our x variable. An intercept is automatically included in this estimation (we could exclude it by using colgpa ~ -1 + sat instead; see the short sketch below). Then, data = dt means the variables colgpa and sat come from our data contained in the data table dt.
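For illustration only, this is what the no-intercept variant mentioned above would look like (reg_no_intercept is just an illustrative name; we keep the intercept throughout this course):

# Regress college GPA on SAT without an intercept (illustrative only)
reg_no_intercept <- lm(colgpa ~ -1 + sat, data = dt)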

# Run a regression of college GPA on SAT scores
reg <- lm(colgpa ~ sat, data = dt)

# Show what reg is
reg
## 
## Call:
## lm(formula = colgpa ~ sat, data = dt)
## 
## Coefficients:
## (Intercept)          sat  
##    0.663057     0.001931
# Get a detailed summary of the regression
summary(reg)
## 
## Call:
## lm(formula = colgpa ~ sat, data = dt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.84515 -0.38205  0.02968  0.42623  1.77382 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.631e-01  6.972e-02    9.51   <2e-16 ***
## sat         1.931e-03  6.706e-05   28.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6012 on 4135 degrees of freedom
## Multiple R-squared:  0.167,  Adjusted R-squared:  0.1668 
## F-statistic: 829.3 on 1 and 4135 DF,  p-value: < 2.2e-16

Let’s interpret the above results. If a student’s SAT score increases by one point, then their college GPA would be predicted to increase by roughly \(0.002\). This doesn’t seem that meaningful on its own. However, if a student’s SAT score increased by 100 points then their college GPA would be predicted to increase by \(0.002 * 100 = 0.2\).

Moreover, SAT scores seem to be a salient determinant of college GPA, as indicated by the *** to the right of the p-value for sat in the above regression summary. The triple star indicates that this parameter is statistically different from zero at the \(\alpha = 0.001\) significance level (99.9% confidence level). A double star ** would have meant this parameter is statistically significant at the \(\alpha = 0.01\) significance level (99% confidence level), while a single star * would have meant this parameter is statistically different from zero at the \(\alpha = 0.05\) significance level (95% confidence level).

We also talked about the R-squared. The adjusted R-squared is only \(0.1668\), indicating we have some work to do in order to explain our outcome better. Perhaps using multiple regressors will help us out there.

We can use our regression reg to get specific parts of the regression as follows:

# Extract the coefficients
reg$coefficients
## (Intercept)         sat 
## 0.663056766 0.001931058
# Extract the mean of the fitted values
mean(reg$fitted.values)
## [1] 2.652686
# Extract the mean of the residuals (notice this is approximately zero like we talked about in class)
mean(reg$residuals)
## [1] -3.455781e-17
# Extract the sum of the residuals (notice this is approximately zero like we talked about in class)
sum(reg$residuals)
## [1] -1.432136e-13

Lastly, we can predict a person’s college GPA with any value of SAT score. Let’s see what the predicted college GPA is when we use the mean SAT score as the predictor.

# Assign the mean SAT to a variable 
mean_sat <- mean(dt[, sat], na.rm = TRUE)

# Place mean_sat into a data table having its column name as sat
dt_mean_sat <- data.table(sat = mean_sat)

# Predict college GPA using the mean SAT
predict(reg, dt_mean_sat)
##        1 
## 2.652686
# Predict college GPA using mean SAT manually
reg$coefficients["(Intercept)"] + reg$coefficients["sat"] * mean_sat
## (Intercept) 
##    2.652686

Notice that the above prediction is exactly the same as the mean of college GPA for our sample which we computed above. This is one of the OLS properties we discussed in class; the sample mean of our outcome lies on the predicted regression line. Stated another way, if we try to predict our outcome with the average value of the regressor, the prediction will be the average value of the outcome.

SLR Done Manually

We derived the SLR parameter estimates \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) in class. Let’s use those formulas to compute the estimates rather than using lm().
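As a reminder, those formulas are \(\displaystyle \widehat{\beta}_1 = \frac{\sum_{i=1}^n \left( x_i - \overline{x} \right) \left( y_i - \overline{y} \right)}{\sum_{i=1}^n \left( x_i - \overline{x} \right)^2}\), which is the sample covariance of \(x\) and \(y\) divided by the sample variance of \(x\), and \(\widehat{\beta}_0 = \overline{y} - \widehat{\beta}_1 \overline{x}\). This is exactly what cov(x, y) / var(x) computes below (the \(n - 1\) denominators inside cov() and var() cancel).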

# Create our outcome variable y as colgpa
y <- dt[, colgpa]

# Get the mean of y
y_bar <- mean(y, na.rm = TRUE)

# Create our regressor variable x as sat
x <- dt[, sat]

# Get the mean of x
x_bar <- mean(x, na.rm = TRUE)

# Compute slope estimate manually
beta_hat_1 <- cov(x, y) / var(x)

# Show the slope estimate computed manually
beta_hat_1
## [1] 0.001931058
# Compute intercept estimate manually
beta_hat_0 <- y_bar - beta_hat_1 * x_bar

# Show the intercept estimate computed manually
beta_hat_0 
## [1] 0.6630568

Notice, these parameter estimates are identical to those obtained using the lm() function.

Let’s also do this for the standard error of \(\widehat{\beta}_1\).
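Recall the formula we will implement: \(se(\widehat{\beta}_1) = \sqrt{\widehat{\sigma}^2 / SST_x}\), where \(\displaystyle \widehat{\sigma}^2 = \frac{1}{n - 2} \sum_{i=1}^n \widehat{u}_i^2\) is the estimated error variance and \(\displaystyle SST_x = \sum_{i=1}^n \left( x_i - \overline{x} \right)^2\).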

# Number of observations
n <- length(x)

# Number of estimated parameters
k <- 2

# Degrees of freedom
df <- n - k

# Obtain the residuals
resid <- y - beta_hat_0 - beta_hat_1 * x

# Obtain the estimate of the variance of the error term
var_u_hat <- sum(resid^2) / df

# Compute the SST of x
sst_x <- sum( (x - x_bar)^2  )

# Calculate the standard error of the slope estimate
se_beta_hat_1 <- sqrt(var_u_hat / sst_x)

# Show the standard error of the slope estimate
se_beta_hat_1
## [1] 6.705796e-05

Notice, this standard error is identical to that observed when using the lm() function.

Regression with Logs

Log-level

We talked about using log-level regressions to get approximate percentage changes in our outcome variable. Let’s create a new variable as the natural log of college GPA and regress this on SAT score.

# Keep only observations with college GPA greater than zero because the natural log of any non-positive number is undefined
dt_log_level <- dt[colgpa > 0]

# Make a new column for the natural log of college GPA
dt_log_level[, lcolgpa := log(colgpa)]

# Regress lcolgpa on sat
reg_log_level <- lm(lcolgpa ~ sat, data = dt_log_level)

# Extract the coefficients of this regression
reg_log_level$coefficients
## (Intercept)         sat 
## 0.168093176 0.000748578

So, for a one unit increase in SAT score, our model predicts that college GPA will increase by approximately \(0.0007 * 100\% = 0.07\%\). For a one-hundred unit increase in SAT score, our model predicts that college GPA will increase by approximately \(0.0007 * 100 * 100\% = 7\%\).
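As a quick sanity check, we can compare this approximation with the exact percent change implied by the log-level model (a short illustrative snippet; the exact change for an increase of \(\Delta sat\) is \(100 \cdot \left( e^{\widehat{\beta}_1 \Delta sat} - 1 \right)\%\)):

# Extract the slope estimate
b <- reg_log_level$coefficients["sat"]

# Approximate percent change in college GPA for a 100-point SAT increase
100 * b * 100

# Exact percent change implied by the log-level model
(exp(b * 100) - 1) * 100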

Log-Log

We talked about using log-log regressions to get elasticities of our outcome variable with respect to our regressor. Let’s create the natural log of college GPA and regress this on the natural log of SAT score.

# Filter out observations with colgpa or sat at or below zero because the natural log is only defined for positive inputs
dt_log_log <- dt[colgpa > 0 & sat > 0]

# Make a new column for the natural log of college GPA
dt_log_log[, lcolgpa := log(colgpa)]

# Make a new column for the natural log of SAT
dt_log_log[, lsat := log(sat)]

# Regress lcolgpa on lsat
reg_log_log <- lm(lcolgpa ~ lsat, data = dt_log_log)

# Extract the coefficients of this regression
reg_log_log$coefficients
## (Intercept)        lsat 
##  -4.1958457   0.7411999

So, for a one percent increase in SAT score, our model predicts that college GPA will increase by approximately \(0.74\%\). For a 50 percent increase in SAT score, our model predicts that college GPA will increase by approximately \(0.74 * 50\% = 37\%\).
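The general rule being applied is the standard log-log interpretation \(\%\Delta y \approx \widehat{\beta}_1 \cdot \%\Delta x\), so \(\widehat{\beta}_1\) is read directly as an elasticity: the approximate percent change in the outcome for a one percent change in the regressor.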

Sampling Distribution of the OLS Estimates

Let’s see how the sampling distribution of OLS estimates look under different scenarios.

When Assumptions 1-4 Hold

Under Assumptions 1-4, we know the sampling distribution of each parameter estimate should be centered around the truth. As shown below, this holds true. A few notes:

  1. beta_hat_0 = rep(NA, s) creates s = 1000 NA values and stores them in the column named beta_hat_0. We’re essentially creating placeholders for our parameter estimates, and we want s of them because we will estimate \(\widehat{\beta}_0\) once for each generated sample.

  2. for (i in 1:s) is an example of a for loop that will execute the code inside the braces once for each i = 1, 2, ..., 1000 because s = 1000. For loops are extremely handy when we have a task that needs to be repeated many times.

  3. Within the loop, the code dt_estimates$beta_hat_0[i] <- reg$coefficients["(Intercept)"] means extract the estimated intercept term and place it in the dt_estimates data table in column beta_hat_0 at row i. Remember, i starts at one and goes to one thousand. So, over each iteration of the loop we replace the NA values we started with by the parameter estimates for the current sample, which is indexed by i.

# Number of samples
s <- 1000

# Number of data points to generate per sample
n <- 200

# Create a data table with empty values to store the estimates of beta_hat_0 and beta_hat_1
dt_estimates <- data.table(beta_hat_0 = rep(NA, s), beta_hat_1 = rep(NA, s))

# Intercept term
beta_0 <- 1

# Slope term
beta_1 <- 2

# We want to create i = 1, 2, ..., s samples
for (i in 1:s)
{
  # Error term
  u <- rnorm(n, mean = 0, sd = 2)
  
  # Covariate
  x <- rnorm(n, mean = 5, sd = 2)
  
  # Outcome
  y <- beta_0 + beta_1 * x + u
  
  # Create a data table
  dt <- data.table(x = x, y = y)
  
  # Run regression of y on x
  reg <- lm(y ~ x, data = dt)
  
  # Place the current intercept estimate in dt_estimates
  dt_estimates$beta_hat_0[i] <- reg$coefficients["(Intercept)"]
  
  # Place the current slope estimate in dt_estimates
  dt_estimates$beta_hat_1[i] <- reg$coefficients["x"]
}
# Plot the density of the intercept estimates
ggplot(dt_estimates, aes(x = beta_hat_0)) +
  geom_density(color = color_2, fill = color_2, alpha = 0.7) +
  geom_vline(xintercept = beta_0, color = color_1, linetype = "dashed", size = 1) +
  labs(title = expression(paste("Sampling Distribution of ", hat(beta)[0]))) +
  theme(
    panel.background = element_rect(fill = "black", color = "black"),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.grid.major = element_line(color = "gray"),
    panel.grid.minor = element_line(color = "gray"),
    axis.text = element_text(color = color_1, size = 15),
    axis.title = element_text(color = color_2, size = 25),
    plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
  ) +
  xlab(expression(hat(beta)[0])) +
  ylab("Density")

# Plot the density of the slope estimates
ggplot(dt_estimates, aes(x = beta_hat_1)) +
  geom_density(color = color_2, fill = color_2, alpha = 0.7) +
  geom_vline(xintercept = beta_1, color = color_1, linetype = "dashed", size = 1) +
  labs(title = expression(paste("Sampling Distribution of ", hat(beta)[1]))) +
  theme(
    panel.background = element_rect(fill = "black", color = "black"),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.grid.major = element_line(color = "gray"),
    panel.grid.minor = element_line(color = "gray"),
    axis.text = element_text(color = color_1, size = 15),
    axis.title = element_text(color = color_2, size = 25),
    plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
  ) +
  xlab(expression(hat(beta)[1])) +
  ylab("Density")

When Assumption 4 Does Not Hold

If Assumption 4 does not hold, our parameter estimates become biased. However, in this specific case only \(\widehat{\beta}_1\) is biased: the part of the error that depends on \(x\) is absorbed into the slope estimate, and the noise that remains has mean zero, so the intercept estimate is still centered around the truth.
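To see why, substitute the error from the simulation below, \(u = 0.5x + e\) with \(e \sim N(0, 2^2)\), into the model: \(y = \beta_0 + \beta_1 x + 0.5x + e = \beta_0 + \left( \beta_1 + 0.5 \right) x + e\). The slope estimates will center around \(2.5\) rather than \(2\), while the intercept estimates remain centered around \(1\).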

# Number of samples
s <- 1000

# Number of data points to generate per sample
n <- 200

# Create a data table with empty values to store the estimates of beta_hat_0 and beta_hat_1
dt_estimates <- data.table(beta_hat_0 = rep(NA, s), beta_hat_1 = rep(NA, s))

# Intercept term
beta_0 <- 1

# Slope term
beta_1 <- 2

# We want to create i = 1, 2, ..., s samples
for (i in 1:s)
{
 
  # Covariate
  x <- rnorm(n, mean = 5, sd = 2)
  
  # Error term that depends on x (introducing bias in beta_hat_1)
  u <- 0.5 * x + rnorm(n, mean = 0, sd = 2)
  
  # Outcome
  y <- beta_0 + beta_1 * x + u
  
  # Create a data table
  dt <- data.table(x = x, y = y)
  
  # Run regression of y on x
  reg <- lm(y ~ x, data = dt)
  
  # Place the current intercept estimate in dt_estimates
  dt_estimates$beta_hat_0[i] <- reg$coefficients["(Intercept)"]
  
  # Place the current slope estimate in dt_estimates
  dt_estimates$beta_hat_1[i] <- reg$coefficients["x"]
}
# Plot the density of the intercept estimates
ggplot(dt_estimates, aes(x = beta_hat_0)) +
  geom_density(color = color_2, fill = color_2, alpha = 0.7) +
  geom_vline(xintercept = beta_0, color = color_1, linetype = "dashed", size = 1) +
  labs(title = expression(paste("Sampling Distribution of ", hat(beta)[0]))) +
  theme(
    panel.background = element_rect(fill = "black", color = "black"),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.grid.major = element_line(color = "gray"),
    panel.grid.minor = element_line(color = "gray"),
    axis.text = element_text(color = color_1, size = 15),
    axis.title = element_text(color = color_2, size = 25),
    plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
  ) +
  xlab(expression(hat(beta)[0])) +
  ylab("Density")

# Plot the density of the slope estimates
ggplot(dt_estimates, aes(x = beta_hat_1)) +
  geom_density(color = color_2, fill = color_2, alpha = 0.7) +
  geom_vline(xintercept = beta_1, color = color_1, linetype = "dashed", size = 1) +
  labs(title = expression(paste("Sampling Distribution of ", hat(beta)[1]))) +
  theme(
    panel.background = element_rect(fill = "black", color = "black"),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.grid.major = element_line(color = "gray"),
    panel.grid.minor = element_line(color = "gray"),
    axis.text = element_text(color = color_1, size = 15),
    axis.title = element_text(color = color_2, size = 25),
    plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
  ) +
  xlab(expression(hat(beta)[1])) +
  ylab("Density")

In the below simulation, u is no longer mean zero and is a function of x. Consequently, both parameter estimates will be biased.
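The same substitution shows where each estimate will center. With \(u = 0.5x + e\) and \(e \sim N(1, 2^2)\), we have \(y = \left( \beta_0 + 1 \right) + \left( \beta_1 + 0.5 \right) x + \left( e - 1 \right)\), so the intercept estimates center around \(2\) and the slope estimates center around \(2.5\).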

# Number of samples
s <- 1000

# Number of data points to generate per sample
n <- 200

# Create a data table with empty values to store the estimates of beta_hat_0 and beta_hat_1
dt_estimates <- data.table(beta_hat_0 = rep(NA, s), beta_hat_1 = rep(NA, s))

# Intercept term
beta_0 <- 1

# Slope term
beta_1 <- 2

# We want to create i = 1, 2, ..., s samples
for (i in 1:s)
{
  # Covariate
  x <- rnorm(n, mean = 5, sd = 2)
  
  # Error term that depends on x and is no longer mean zero
  u <- 0.5 * x + rnorm(n, mean = 1, sd = 2)
  
  # Outcome
  y <- beta_0 + beta_1 * x + u
  
  # Create a data table
  dt <- data.table(x = x, y = y)
  
  # Run regression of y on x
  reg <- lm(y ~ x, data = dt)
  
  # Place the current intercept estimate in dt_estimates
  dt_estimates$beta_hat_0[i] <- reg$coefficients["(Intercept)"]
  
  # Place the current slope estimate in dt_estimates
  dt_estimates$beta_hat_1[i] <- reg$coefficients["x"]
}
# Plot the density of the intercept estimates
ggplot(dt_estimates, aes(x = beta_hat_0)) +
  geom_density(color = color_2, fill = color_2, alpha = 0.7) +
  geom_vline(xintercept = beta_0, color = color_1, linetype = "dashed", size = 1) +
  labs(title = expression(paste("Sampling Distribution of ", hat(beta)[0]))) +
  theme(
    panel.background = element_rect(fill = "black", color = "black"),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.grid.major = element_line(color = "gray"),
    panel.grid.minor = element_line(color = "gray"),
    axis.text = element_text(color = color_1, size = 15),
    axis.title = element_text(color = color_2, size = 25),
    plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
  ) +
  xlab(expression(hat(beta)[0])) +
  ylab("Density")

# Plot the density of the slope estimates
ggplot(dt_estimates, aes(x = beta_hat_1)) +
  geom_density(color = color_2, fill = color_2, alpha = 0.7) +
  geom_vline(xintercept = beta_1, color = color_1, linetype = "dashed", size = 1) +
  labs(title = expression(paste("Sampling Distribution of ", hat(beta)[1]))) +
  theme(
    panel.background = element_rect(fill = "black", color = "black"),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.grid.major = element_line(color = "gray"),
    panel.grid.minor = element_line(color = "gray"),
    axis.text = element_text(color = color_1, size = 15),
    axis.title = element_text(color = color_2, size = 25),
    plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
  ) +
  xlab(expression(hat(beta)[1])) +
  ylab("Density")

In the below simulation, u is not mean zero but is no longer a function of \(x\). So, our constant term will effectively absorb the mean of u. This means that whatever the mean of u and the true value of \(\beta_0\) happen to be, the intercept estimates will center around the new "true" intercept \(\widetilde{\beta}_0 = \beta_0 + \overline{u}\), while the slope estimates remain unbiased.
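Writing \(u = \overline{u} + e\) with \(e\) mean zero makes this clear: \(y = \left( \beta_0 + \overline{u} \right) + \beta_1 x + e\). With \(\beta_0 = 1\) and \(\overline{u} = 2\) in the simulation below, the intercept estimates center around \(3\).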

# Number of samples
s <- 1000

# Number of data points to generate per sample
n <- 200

# Create a data table with empty values to store the estimates of beta_hat_0 and beta_hat_1
dt_estimates <- data.table(beta_hat_0 = rep(NA, s), beta_hat_1 = rep(NA, s))

# Mean of error term
u_bar <- 2

# Intercept term
beta_0 <- 1

# Slope term
beta_1 <- 2

# We want to create i = 1, 2, ..., s samples
for (i in 1:s)
{
  # Covariate
  x <- rnorm(n, mean = 5, sd = 2)
  
  # Error term that is no longer mean zero
  u <- rnorm(n, mean = u_bar, sd = 2)
  
  # Outcome
  y <- beta_0 + beta_1 * x + u
  
  # Create a data table
  dt <- data.table(x = x, y = y)
  
  # Run regression of y on x
  reg <- lm(y ~ x, data = dt)
  
  # Place the current intercept estimate in dt_estimates
  dt_estimates$beta_hat_0[i] <- reg$coefficients["(Intercept)"]
  
  # Place the current slope estimate in dt_estimates
  dt_estimates$beta_hat_1[i] <- reg$coefficients["x"]
}
# Plot the density of the intercept estimates
ggplot(dt_estimates, aes(x = beta_hat_0)) +
  geom_density(color = color_2, fill = color_2, alpha = 0.7) +
  geom_vline(xintercept = beta_0 + u_bar, color = color_1, linetype = "dashed", size = 1) +
  labs(title = expression(paste("Sampling Distribution of ", hat(beta)[0]))) +
  theme(
    panel.background = element_rect(fill = "black", color = "black"),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.grid.major = element_line(color = "gray"),
    panel.grid.minor = element_line(color = "gray"),
    axis.text = element_text(color = color_1, size = 15),
    axis.title = element_text(color = color_2, size = 25),
    plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
  ) +
  xlab(expression(hat(beta)[0])) +
  ylab("Density")

# Plot the density of the slope estimates
ggplot(dt_estimates, aes(x = beta_hat_1)) +
  geom_density(color = color_2, fill = color_2, alpha = 0.7) +
  geom_vline(xintercept = beta_1, color = color_1, linetype = "dashed", size = 1) +
  labs(title = expression(paste("Sampling Distribution of ", hat(beta)[1]))) +
  theme(
    panel.background = element_rect(fill = "black", color = "black"),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.grid.major = element_line(color = "gray"),
    panel.grid.minor = element_line(color = "gray"),
    axis.text = element_text(color = color_1, size = 15),
    axis.title = element_text(color = color_2, size = 25),
    plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
  ) +
  xlab(expression(hat(beta)[1])) +
  ylab("Density")

The Variability of the OLS Estimates Diminishes as Regressor Variation Increases

We want variation in our regressor! We know that the SST of our regressor increases as the number of observations increases (recall the SST of x is \(\displaystyle \sum_{i=1}^n \left( x_i - \overline{x} \right)^2\)). Let’s create one hundred samples of size \(n_1 = 100\) and another one hundred samples of size \(n_2 = 1000\). Then we’ll plot the sampling distribution of \(\widehat{\beta}_1\) for both scenarios. If regressor variation is truly a good thing, then the sampling distribution should be more tightly centered around the true parameter value we are estimating (under Assumptions 1-4) for the samples of size \(n_2\) relative to the samples of size \(n_1\).
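Before running the simulation, here is a quick illustrative check that the SST of a regressor grows with the sample size (for draws with a standard deviation of 2, the expected SST is roughly \(\left( n - 1 \right) \cdot 4\)):

# SST of a regressor with 100 draws (roughly 99 * 4 = 396)
x_small <- rnorm(100, mean = 5, sd = 2)
sum((x_small - mean(x_small))^2)

# SST of a regressor with 1000 draws (roughly 999 * 4 = 3996)
x_large <- rnorm(1000, mean = 5, sd = 2)
sum((x_large - mean(x_large))^2)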

# Number of samples
s <- 100

# Number of data points to generate per sample
n_1 <- 100
n_2 <- 1000

# Create data tables with empty values to store the estimates of beta_hat_1
dt_estimates_1 <- data.table(beta_hat_1 = rep(NA, s), sample_size = rep(n_1, s))
dt_estimates_2 <- data.table(beta_hat_1 = rep(NA, s), sample_size = rep(n_2, s))

# Intercept term
beta_0 <- 1

# Slope term
beta_1 <- 2

# We want to create i = 1, 2, ..., s samples
for (i in 1:s)
{
  # Covariates
  x_1 <- rnorm(n_1, mean = 5, sd = 2)
  x_2 <- rnorm(n_2, mean = 5, sd = 2)
  
  # Error terms
  u_1 <- rnorm(n_1, mean = 0, sd = 2)
  u_2 <- rnorm(n_2, mean = 0, sd = 2)
  
  # Outcomes
  y_1 <- beta_0 + beta_1 * x_1 + u_1
  y_2 <- beta_0 + beta_1 * x_2 + u_2
  
  # Create data tables
  dt_1 <- data.table(x = x_1, y = y_1)
  dt_2 <- data.table(x = x_2, y = y_2)
  
  # Run regressions
  reg_1 <- lm(y ~ x, data = dt_1)
  reg_2 <- lm(y ~ x, data = dt_2)
  
  # Place the current slope estimate in dt_estimates_1 and dt_estimates_2
  dt_estimates_1$beta_hat_1[i] <- reg_1$coefficients["x"]
  dt_estimates_2$beta_hat_1[i] <- reg_2$coefficients["x"]
}

# Combine the two data tables for plotting
dt_combined <- rbind(dt_estimates_1, dt_estimates_2)
# Plot the density of the slope estimates for both sample sizes
ggplot(dt_combined, aes(x = beta_hat_1, fill = factor(sample_size))) +
  geom_density(alpha = 0.7) +
  geom_vline(xintercept = beta_1, color = color_1, linetype = "dashed", size = 1) +
  labs(title = expression(paste("Sampling Distribution of ", hat(beta)[1])),
       fill = "Sample Size") +
  scale_fill_manual(
    values = c("100" = "#9370DB", "1000" = "#FFA07A"),
    labels = c("100", "1000")
  ) +
  theme(
    panel.background = element_rect(fill = "black", color = "black"),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.grid.major = element_line(color = "gray"),
    panel.grid.minor = element_line(color = "gray"),
    axis.text = element_text(color = color_1, size = 15),
    axis.title = element_text(color = color_2, size = 25),
    plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold"),
    legend.background = element_rect(fill = "black"),
    legend.text = element_text(color = color_1, size = 15),
    legend.title = element_text(color = color_2, size = 20)
  ) +
  xlab(expression(hat(beta)[1])) +
  ylab("Density")

Indeed, for larger samples the sampling distribution of our estimator is more tightly centered around the true value (under Assumptions 1-4 needed for unbiasedness). This means that for larger values of \(n\) and, consequently, a larger regressor SST, we are more certain about our estimates.