# Clear environment, plot pane, and console
rm(list = ls())
graphics.off()
cat("\014")
# If pacman is not already installed, then install it
if (!require(pacman)) install.packages("pacman")
# Load packages
pacman::p_load(ggplot2, data.table)
# Colors used for plot (https://www.rapidtables.com/web/color/RGB_Color.html)
color_1 <- "#00FF00"
color_2 <- "#ADD8E6"
# Set random seed for reproducibility
set.seed(418518)
We’ll create various data generating processes and show how our estimator behaves as \(n \to \infty\). Remember, if our estimator is consistent, then as \(n\) gets large its sampling distribution has a mean that gets closer to the parameter it is estimating while also becoming more tightly concentrated around that mean.
Let’s construct a scenario under which MLR Assumptions 1-4 hold and show what happens to our estimator’s sampling distribution as \(n \to \infty\).
The true underlying model is \(y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + u_i = 1 + 2 x_{i1} + 3 x_{i2} + u_i\) and suppose this is the model we estimate. Let’s verify OLS is consistent by generating \(s = 1000\) samples each with \(n_1 = 50\), \(n_2 = 200\), and \(n_3 = 1000\) data points. Then, we’ll run the regression of \(y\) on \(x_1\) and \(x_2\) using each sample and plot the sampling distribution of \(\widehat{\beta}_1\). If we are correct, then the sampling distribution should always be centered around \(\beta_1\) for any value of \(n\) since MLR Assumptions 1-4 hold, but it should get tighter (the variance of \(\widehat{\beta}_1\) shrinks) as \(n \to \infty\). Moreover, this sampling distribution should look more normally distributed as \(n\) gets large, with a mean equal to the true \(\beta_1\). Let’s test this hypothesis.
# Number of samples
s <- 1000
# Number of data points to generate per sample
n <- c(50, 200, 1000)
# Initialize empty list to store estimates of beta_1 depending on the number of observations
estimates_list <- vector("list", length(n))
names(estimates_list) <- paste0("n = ", n)
# True parameters
beta_0 <- 1
beta_1 <- 2
beta_2 <- 3
# We want to create i = 1, 2, ..., s samples
for (i in 1:s)
{
# We want a differing number of observations per sample
for (j in 1:length(n))
{
# Generate data
x_1 <- rnorm(n[j], mean = 1, sd = 2)
x_2 <- rnorm(n[j], mean = 0, sd = 1)
u <- rnorm(n[j], mean = 0, sd = 2)
y <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + u
# Create data table
dt <- data.table(y = y, x_1 = x_1, x_2 = x_2)
# Run regression
reg <- lm(y ~ x_1 + x_2, data = dt)
# Store estimate of beta_1
estimates_list[[j]] <- c(estimates_list[[j]], reg$coefficients["x_1"])
}
}
# Combine estimates into a single data.table for plotting
dt_plot <- data.table(
Estimate = unlist(estimates_list),
Sample_Size = rep(names(estimates_list), times = sapply(estimates_list, length))
)
# Ensure Sample_Size is a factor with levels in the desired order
dt_plot[, Sample_Size := factor(Sample_Size, levels = paste0("n = ", n))]
# Plot the sampling distributions
ggplot(dt_plot, aes(x = Estimate, fill = Sample_Size)) +
geom_density(alpha = 0.7) +
geom_vline(xintercept = beta_1, color = color_1, linetype = "dashed", linewidth = 1) +
labs(title = expression(paste("Sampling Distribution of ", hat(beta)[1]))) +
scale_fill_manual(
values = c("#FFA07A", "#20B2AA", "#778899"),
labels = c(paste0("n = ", n[1]), paste0("n = ", n[2]), paste0("n = ", n[3])),
name = "Sample Size"
) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold"),
plot.subtitle = element_text(hjust = 0.5, color = color_1, size = 25),
legend.background = element_rect(fill = "black"),
legend.text = element_text(color = color_1, size = 15),
legend.title = element_text(color = color_2, size = 20)
) +
xlab(expression(hat(beta)[1])) +
ylab("Density")
Indeed, in all cases the sampling distributions are centered around \(\beta_1\) since MLR Assumptions 1-4 hold, the variance of our estimator \(\widehat{\beta}_1\) shrinks as \(n \to \infty\), and the distributions look increasingly normal.
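To quantify this, here is a quick numeric check (not part of the original output) summarizing the simulated estimates by sample size; the means should all be near \(\beta_1 = 2\), while the standard deviations should fall as \(n\) grows.
# Summarize the simulated sampling distributions by sample size
dt_plot[, .(Mean = mean(Estimate), SD = sd(Estimate)), by = Sample_Size]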
Let’s construct a scenario under which MLR Assumptions 1-3 hold but \(\text{Cov}\left[X, \boldsymbol{u} \right] \neq \boldsymbol{0}\). Remember that MLR Assumption 4 of zero conditional mean implies zero covariance, so non-zero covariance implies that MLR Assumption 4 fails. Then, we’ll show what happens to our estimator’s sampling distribution as \(n \to \infty\).
The true underlying model is \(y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + u_i = 1 + 2 x_{i1} + 3 x_{i2} + u_i\), but suppose we don’t include \(x_2\) (perhaps we don’t have data on it, which is often an issue when doing empirical work). We know \(x_2\) is positively related to \(y\), as indicated by its positive coefficient in the model \(\left(\beta_2 = 3 \right)\). Moreover, suppose \(x_{i1} = \widetilde{\beta}_0 + \left( \frac{\widetilde{\beta}_1}{\sqrt{n}} \right) x_{i2} + v_i = 4 + \left( \frac{5}{\sqrt{n}} \right) x_{i2} + v_i\), so \(x_{2}\) is also positively related to \(x_1\), as indicated by its positive coefficient in the model \(\left(\widetilde{\beta}_1 = 5 \right)\), but this positive covariance shrinks to zero as \(n\) grows. Consequently, \(x_2\) is positively related to both \(y\) and \(x_1\), meaning when we estimate the model \(y_i = \beta_0 + \beta_1 x_{i1} + w_i\), \(\beta_1\) should be overestimated (think positive times positive gives positive bias), but this overestimation should diminish as \(n\) gets large because \(\frac{\widetilde{\beta}_1}{\sqrt{n}} \to 0\) as \(n \to \infty\).
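To see why, recall the standard omitted variable bias result: omitting \(x_2\) gives, heuristically, \(\text{plim} \, \widehat{\beta}_1 = \beta_1 + \beta_2 \frac{\text{Cov}\left[x_1, x_2\right]}{\text{Var}\left[x_1\right]}\). Under our design, \(\text{Cov}\left[x_1, x_2\right] = \frac{\widetilde{\beta}_1}{\sqrt{n}} \text{Var}\left[x_2\right]\), so the bias term is positive (\(\beta_2 > 0\) and \(\widetilde{\beta}_1 > 0\)) but vanishes as \(n \to \infty\).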
Consider estimating the impact of class size (\(x_1\)) on test scores (\(y\)) when we lack direct data on teacher quality (\(x_2\)), which is correlated with both test scores and, potentially, class size. In a small sample of schools, the best teachers may be systematically assigned to smaller classes, perhaps because the sampled schools have more resources, producing a correlation between teacher quality and class size. This systematic assignment introduces omitted variable bias (OVB), biasing our estimate of the effect of class size on test scores. However, as the sample size increases and we collect data from a more diverse and representative set of schools, the assignment of teachers to classes is more likely to be random or varied, so the correlation between teacher quality and class size weakens. Consequently, the bias in estimating the impact of class size on test scores diminishes, and our estimates approach an unbiased result asymptotically.
Let’s verify OLS is consistent by generating \(s = 1000\) samples each with \(n_1 = 50\), \(n_2 = 1000\), and \(n_3 = 10000\) data points. Then, we’ll run the regression of \(y\) on \(x_1\) using each sample and plot the sampling distribution of \(\widehat{\beta}_1\). If we are correct, then \(\widehat{\beta}_1\) should be asymptotically unbiased (consistent), meaning its bias should disappear as \(n\) gets large, and its sampling distribution should get tighter (the variance of \(\widehat{\beta}_1\) shrinks) as well. Moreover, this sampling distribution should look more normally distributed as \(n\) gets large, with a mean converging to the true value of \(\beta_1\). Let’s test this hypothesis.
# Number of samples
s <- 1000
# Number of data points to generate per sample
n <- c(50, 1000, 10000)
# Initialize empty list to store estimates of beta_1 depending on the number of observations
estimates_list <- vector("list", length(n))
names(estimates_list) <- paste0("n = ", n)
# True parameters when y is the outcome
beta_0 <- 1
beta_1 <- 2
beta_2 <- 3
# True parameters when x_1 is the outcome
beta_tilde_0 <- 4
beta_tilde_1 <- 5
# We want to create i = 1, 2, ..., s samples
for (i in 1:s)
{
# We want a differing number of observations per sample
for (j in 1:length(n))
{
# Generate errors
v <- rnorm(n[j], mean = 0, sd = 1)
u <- rnorm(n[j], mean = 0, sd = 2)
# Generate covariates
x_2 <- rnorm(n[j], mean = 1, sd = 2)
x_1 <- beta_tilde_0 + (beta_tilde_1 / sqrt(n[j])) * x_2 + v
# Generate outcome
y <- beta_0 + beta_1 * x_1 + beta_2 * x_2 + u
# Create data table
dt <- data.table(y = y, x_1 = x_1)
# Run regression
reg <- lm(y ~ x_1, data = dt)
# Store estimate of beta_1
estimates_list[[j]] <- c(estimates_list[[j]], reg$coefficients["x_1"])
}
}
# Combine estimates into a single data.table for plotting
dt_plot <- data.table(
Estimate = unlist(estimates_list),
Sample_Size = rep(names(estimates_list), times = sapply(estimates_list, length))
)
# Ensure Sample_Size is a factor with levels in the desired order
dt_plot[, Sample_Size := factor(Sample_Size, levels = paste0("n = ", n))]
# Plot the sampling distributions
ggplot(dt_plot, aes(x = Estimate, fill = Sample_Size)) +
geom_density(alpha = 0.7) +
geom_vline(xintercept = beta_1, color = color_1, linetype = "dashed", linewidth = 1) +
labs(title = expression(paste("Sampling Distribution of ", hat(beta)[1]))) +
scale_fill_manual(
values = c("#FFA07A", "#20B2AA", "#778899"),
labels = c(paste0("n = ", n[1]), paste0("n = ", n[2]), paste0("n = ", n[3])),
name = "Sample Size"
) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold"),
plot.subtitle = element_text(hjust = 0.5, color = color_1, size = 25),
legend.background = element_rect(fill = "black"),
legend.text = element_text(color = color_1, size = 15),
legend.title = element_text(color = color_2, size = 20)
) +
xlab(expression(hat(beta)[1])) +
ylab("Density")
Indeed, the sampling distribution continually moves leftward toward the true value of \(\beta_1\) and looks increasingly normal as \(n \to \infty\). Moreover, the variance of our estimator shrinks as we would expect.
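Again, a quick numeric check (not part of the original output): the mean estimate should drift down toward \(\beta_1 = 2\) and the standard deviation should shrink as \(n\) grows.
# Summarize bias and spread of the simulated estimates by sample size
dt_plot[, .(Mean = mean(Estimate), Bias = mean(Estimate) - beta_1, SD = sd(Estimate)), by = Sample_Size]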
It is important to note that the t-distribution converges to the standard normal asymptotically as its degrees of freedom \(n - k\) get large. This means we need this difference to be large, not just \(n\). For instance, if \(n = 1000\) and \(k = 970\), then our \(t_{30}\)-distribution may not be as close to the standard normal as we would like even though we have a hefty sample size.
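To make this concrete, here is a quick comparison (not in the original code) of the two-tailed 5% critical values for \(n - k = 30\) degrees of freedom versus the standard normal:
# Two-tailed 5% critical values: t with 30 degrees of freedom versus standard normal
qt(0.975, df = 30)   # roughly 2.04
qnorm(0.975)         # roughly 1.96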
Let’s graph t-distributions for varying degrees of freedom to show that as the degrees of freedom get large, the t-distribution converges to a standard normal distribution.
# Define a sequence of x values for the t-distributions
x <- seq(-4, 4, length.out = 1000)
# Create a data table for plotting
dt_plot <- data.table(x = rep(x, 3),
df = factor(rep(c(1, 5, 50), each = 1000)),
density = c(dt(x, df = 1), dt(x, df = 5), dt(x, df = 50)))
# Add a standard normal distribution for comparison
dt_plot <- rbind(dt_plot, data.table(x = x, df = factor("Normal"), density = dnorm(x)))
# Plot the t-distributions and the standard normal distribution
ggplot(dt_plot, aes(x = x, y = density, fill = df)) +
geom_area(alpha = 0.5, position = 'identity') +
geom_line(linewidth = 0.4) +
labs(title = "t-Distribution Convergence",
x = "x",
y = "Density",
fill = "Distribution") +
scale_fill_manual(
values = c("1" = "#FFA07A", "5" = "#20B2AA", "50" = "#778899", "Normal" = "white"),
labels = c("1" = "df = 1", "5" = "df = 5", "50" = "df = 50", "Normal" = "Standard Normal"),
name = "Distribution"
) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold"),
plot.subtitle = element_text(hjust = 0.5, color = color_1, size = 25),
legend.background = element_rect(fill = "black"),
legend.text = element_text(color = color_1, size = 15),
legend.title = element_text(color = color_2, size = 20)
)
As we can see, the tails of the t-distribution get thinner as the degrees of freedom increase. By fifty degrees of freedom, the t-distribution is virtually indistinguishable from the standard normal distribution. The mass in the tails shifts toward the center of the distribution, so the t-distribution transitions from being wider to being taller. The tails are thicker, and the distribution wider, for lower degrees of freedom because using the t-distribution reflects having to estimate the variance, which adds uncertainty relative to the normal distribution. As the degrees of freedom get large, this uncertainty decreases because our sample size is increasing, causing the distribution to become more concentrated around its center and its variance to shrink.
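One way to see the shrinking variance directly (a small check not in the original code): a t-distribution with \(\nu > 2\) degrees of freedom has variance \(\frac{\nu}{\nu - 2}\), which falls toward 1, the variance of the standard normal.
# Variance of the t-distribution, nu / (nu - 2), approaches 1 as df grows
nu <- c(3, 5, 50, 500)
nu / (nu - 2)   # roughly 3, 1.67, 1.04, 1.004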
qt() versus qnorm()
Let’s compare the critical values produced by the inverse CDF of the t-distribution, qt(), for varying degrees of freedom to those produced by the inverse CDF of the standard normal distribution, qnorm(). We’ll consider a two-tailed test at the \(\alpha = 0.05\) significance level.
# Create a vector of degrees of freedom
df <- c(1, 10, 50, 100)
# Compute the critical values for t-distribution
crit_t <- qt(0.975, df = df)
# Name the elements of the critical values vector
names(crit_t) <- paste("df =", df)
# Show the critical values
crit_t
## df = 1 df = 10 df = 50 df = 100
## 12.706205 2.228139 2.008559 1.983972
# Compute the critical value for standard normal distribution
crit_n <- qnorm(0.975)
# Name the critical value
names(crit_n) <- "Standard Normal"
# Show the critical values
crit_n
## Standard Normal
## 1.959964
As we can see, the critical value for a t-distribution with only one degree of freedom is very different from that of the standard normal. Using the standard normal in place of this t-distribution would therefore produce very poor inference. However, as the degrees of freedom rise to one hundred, the critical value is quite similar to that given by the inverse CDF of the standard normal.
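Using the critical values printed above, we can also look directly at the gap (a small addition to the original code):
# Gap between t and standard normal critical values shrinks with df
crit_t - crit_n   # roughly 10.75, 0.27, 0.05, and 0.02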
pt() versus pnorm()
The heavier tails of a t-distribution with few degrees of freedom result in larger p-values compared to the standard normal distribution for a given test statistic value. This means a null hypothesis is harder to reject when using the t-distribution than when using the standard normal distribution. The t-distribution’s heavier tails account for the additional uncertainty from estimating the population standard deviation, leading to higher critical values for the same significance level. Therefore, for a given test statistic, the t-distribution makes it more challenging to reject the null hypothesis. One can think of using the t-distribution as a more conservative means of conducting a hypothesis test.
Let’s now compare the p-values produced by the CDF of the t-distribution, pt(), for varying degrees of freedom to those produced by the CDF of the standard normal distribution, pnorm(). We’ll consider a two-tailed test of \(H_0: \beta = 0\) versus \(H_A: \beta \neq 0\) at the \(\alpha = 0.05\) significance level. Assume that \(\widehat{\beta} = 1\) and \(\text{se}\left[ \widehat{\beta} \right] = 0.5\).
# Degrees of freedom
df <- c(1, 10, 50, 100)
# Estimate of beta
beta_hat <- 1
# Standard error of estimate of beta
se_beta_hat <- 0.5
# Create the test statistic
test_stat <- beta_hat / se_beta_hat
# Compute the p-values when using t-distribution
p_t <- 2 * pt(abs(test_stat), df = df, lower.tail = FALSE)
# Name the elements of the p-values vector
names(p_t) <- paste("df =", df)
# Show the p-values
p_t
## df = 1 df = 10 df = 50 df = 100
## 0.29516724 0.07338803 0.05094707 0.04821218
# Compute the p-value for standard normal distribution
p_n <- 2 * pnorm(abs(test_stat), lower.tail = FALSE)
# Name the p-value
names(p_n) <- "Standard Normal"
# Show the p-value
p_n
## Standard Normal
## 0.04550026
As we can see, the null hypothesis is rejected when using the CDF of the standard normal distribution since \(p = 0.046 < 0.05 = \alpha\). However, among the t-distributions we only reject the null hypothesis at one hundred degrees of freedom. Consequently, this supports the claim above that it is harder to reject the null hypothesis when using the t-distribution relative to the standard normal. The rationale is that the variance estimation embedded in the t-distribution adds uncertainty, so we need stronger evidence to reject the null hypothesis.
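As a final sanity check (not in the original code), we can compare each t-based p-value to \(\alpha = 0.05\) directly:
# Which t-based p-values lead to rejection at the 5% level?
p_t < 0.05   # FALSE for df = 1, 10, 50; TRUE only for df = 100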