In the below code you will see something new that I usually don't list in the preliminaries section: set.seed(). Throughout this code we will be generating random numbers and running experiments. That means that each time we run our .R script, the data we generate will be different from the time before. However, when we set a seed we ensure that our results are reproducible, meaning the results I get from running my .R script today will be the same as when I run it tomorrow. The number 418518 I am passing into the set.seed() function is called the seed. Every time I run my experiments with this seed I will get identical results. However, if I changed this seed to, say, 418519, then I would get a different set of random numbers for my experiment. Setting seeds is extremely common in practical settings, so please do take the time to understand this concept.
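Here's a minimal sketch of the idea (throwaway code, not used anywhere else in this document):
set.seed(418518)
rnorm(3) # three random draws; rerunning these two lines reproduces the exact same numbers
set.seed(418519)
rnorm(3) # a different seed yields a different set of three draws
With that understood, on to the preliminaries.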
# Clear environment, plot pane, and console
rm(list = ls())
graphics.off()
cat("\014")
# If pacman is not already installed, then install it
if (!require(pacman)) install.packages("pacman")
# Load packages
pacman::p_load(ggplot2, data.table)
# Colors used for plot (https://www.rapidtables.com/web/color/RGB_Color.html)
color_1 <- "#00FF00"
color_2 <- "#ADD8E6"
# Set random seed for reproducibility
set.seed(418518)
In the below code, we will generate some data ourselves and place this data into a data table. Some notes regarding the below code:
rnorm(n, mean = 5, sd = 2) will generate n random draws from a normal distribution with mean = 5 and sd = 2.
dt <- data.table(x = x, y = y) creates a data table object via the data.table() function and assigns it to dt. The input x = x and y = y is somewhat confusing at first glance. To the left of the equal sign is what we are using as the column name in dt, and to the right of the equal sign is the data that goes in that column (a short toy illustration appears within the code below).
# Number of data points to generate
n <- 200
# Intercept term
beta_0 <- 1
# Slope term
beta_1 <- 2
# Covariate (rnorm() will generate n normal random variables with a mean of 5 and standard deviation of 2)
x <- rnorm(n, mean = 5, sd = 2)
# Error term
u <- rnorm(n, mean = 1, sd = 2)
# Outcome
y <- beta_0 + beta_1 * x + u
# Create a data table
dt <- data.table(x = x, y = y)
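# Toy illustration of the naming convention (dt_toy and its values are hypothetical and unused below):
# the name to the left of "=" becomes the column name, the object to the right supplies that column's data
dt_toy <- data.table(id = 1:3, score = c(90, 85, 77))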
# Create the plot
ggplot(dt, aes(x = x, y = y)) +
geom_point(color = color_1) +
geom_abline(aes(intercept = beta_0, slope = beta_1, color = "Population"), linetype = "dashed", size = 2) + # True line
geom_smooth(aes(color = "Estimated"), method = "lm", se = FALSE, size = 2) + # Fitted line
labs(
title = "Regression Functions",
color = "Regression Function"
) +
scale_color_manual(
values = c("Population" = "#9370DB", "Estimated" = "#FFA07A"),
labels = c("Population", "Estimated")
) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold"),
legend.background = element_rect(fill = "black"),
legend.text = element_text(color = color_1, size = 15),
legend.title = element_text(color = color_2, size = 20)
) +
expand_limits(y = 0) +
xlab("x") +
ylab("y")
It's likely you looked at the above code and had no idea what is going on. That's totally fine! As I said in the R tutorial document, which is located on D2L under Content > R and RStudio > ECON_418-518_R_Tutorial, when you read some code you can't understand or need to figure out how to write some code, use ChatGPT! I always have ChatGPT up when I code and it has increased my coding efficiency tenfold. In fact, I used ChatGPT to create that plot!
Please do the following:
1. Create a ChatGPT account if you don't already have one (ChatGPT 3.5 is currently free). Otherwise, log into your account.
2. Find the box that says Message ChatGPT. In that box, type What is ggplot2? Also, give me an example of how to use it. Then, press Enter.
3. You should get a fairly detailed definition of the ggplot2 package as well as an example of how to use it. You can do this for essentially any coding question you may have.
4. Now, copy the plotting code above and paste it into the Message ChatGPT box. Press the space bar, type Explain this R code, and press Enter.
5. Again, you should get a fairly detailed description of the code and what each aspect of it does.
I highly encourage you to take advantage of ChatGPT
throughout this course as well as when you get into the professional
world!
The wooldridge Dataset
Any computer-based homework assignment from the book should have a dataset that MindTap will ask you to download as a .RData file. There is no need to download it because you can use the wooldridge package to load the dataset directly into R. Let's load in the gpa2 dataset from the wooldridge package and do some basic data analysis.
The syntax below is fairly straightforward. I would like to make two points:
When I use dt[, colgpa], the part to the right of the comma means we want the colgpa column. The part to the left of the comma is empty. This tells R to get all observations (all rows). Had we used dt[1, colgpa], this would have given us the first observation's college GPA. Had we used dt[1:5, colgpa], this would have given us the first five observations' college GPAs.
When I use na.rm = TRUE, this tells R to exclude missing data in the column of interest when calculating a statistic. For instance, when I run the code mean(dt[, colgpa], na.rm = TRUE), this tells R to calculate the mean of the colgpa column for all observations and exclude any missing values when making this calculation. Had we not included na.rm = TRUE and there were missing values in the colgpa column, R would be unable to compute the mean and would return NA instead.
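Here's a quick toy sketch of both points (dt_demo is a hypothetical throwaway table, not used elsewhere):
# Hypothetical three-row data table with one missing value
dt_demo <- data.table(g = c(3.2, NA, 2.8))
# All rows of column g
dt_demo[, g]
# The first observation's value of g
dt_demo[1, g]
# Returns NA because of the missing value
mean(dt_demo[, g])
# Returns 3 because the missing value is excluded
mean(dt_demo[, g], na.rm = TRUE)
Now let's turn to the real dataset.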
# Load in gpa2 dataset from wooldridge as a data table
dt <- data.table(wooldridge::gpa2)
# Show the first few rows of the data
head(dt)
# Show the last few rows of the data
tail(dt)
# Get the names of the variables in our dataset
names(dt)
## [1] "sat" "tothrs" "colgpa" "athlete" "verbmath" "hsize"
## [7] "hsrank" "hsperc" "female" "white" "black" "hsizesq"
# Get summary statistics
summary(dt)
## sat tothrs colgpa athlete
## Min. : 470 Min. : 6.00 Min. :0.000 Min. :0.00000
## 1st Qu.: 940 1st Qu.: 17.00 1st Qu.:2.210 1st Qu.:0.00000
## Median :1030 Median : 47.00 Median :2.660 Median :0.00000
## Mean :1030 Mean : 52.83 Mean :2.653 Mean :0.04689
## 3rd Qu.:1120 3rd Qu.: 80.00 3rd Qu.:3.120 3rd Qu.:0.00000
## Max. :1540 Max. :137.00 Max. :4.000 Max. :1.00000
## verbmath hsize hsrank hsperc
## Min. :0.2597 Min. :0.03 Min. : 1.00 Min. : 0.1667
## 1st Qu.:0.7759 1st Qu.:1.65 1st Qu.: 11.00 1st Qu.: 6.4328
## Median :0.8667 Median :2.51 Median : 30.00 Median :14.5833
## Mean :0.8805 Mean :2.80 Mean : 52.83 Mean :19.2371
## 3rd Qu.:0.9649 3rd Qu.:3.68 3rd Qu.: 70.00 3rd Qu.:27.7108
## Max. :1.6667 Max. :9.40 Max. :634.00 Max. :92.0000
## female white black hsizesq
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. : 0.0009
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.: 2.7225
## Median :0.0000 Median :1.0000 Median :0.00000 Median : 6.3001
## Mean :0.4496 Mean :0.9255 Mean :0.05535 Mean :10.8535
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:13.5424
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :88.3600
# Get the number of observations
nrow(dt)
## [1] 4137
# Get the number of variables in the dataset
ncol(dt)
## [1] 12
# Get the mean of college GPA
mean(dt[, colgpa], na.rm = TRUE)
## [1] 2.652686
# Get the mean of SAT
mean(dt[, sat], na.rm = TRUE)
## [1] 1030.331
# Get the standard deviation of SAT
sd(dt[, sat], na.rm = TRUE)
## [1] 139.4014
The lm() Function
To estimate the simple linear regression model \(colgpa = \beta_0 + \beta_1 sat + u\), we say we are running a regression of colgpa on sat, or regressing colgpa on sat. R can easily do this with its built-in lm() function. The syntax for the lm() function in this specific case is lm(colgpa ~ sat, data = dt). This means colgpa is our y variable, the ~ can be thought of as an equal sign, and sat is our x variable. An intercept is automatically included in this estimation (we could exclude it by using colgpa ~ -1 + sat instead). Then, data = dt means the variables colgpa and sat come from our data contained in the data table dt.
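As a quick sketch of those formula variants (illustrative only; reg_with_int and reg_no_int are hypothetical names and their output is not shown):
# Same regression with the intercept written explicitly (the default behavior)
reg_with_int <- lm(colgpa ~ 1 + sat, data = dt)
# Regression through the origin (no intercept)
reg_no_int <- lm(colgpa ~ -1 + sat, data = dt)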
# Run a regression of college GPA on SAT scores
reg <- lm(colgpa ~ sat, data = dt)
# Show what reg is
reg
##
## Call:
## lm(formula = colgpa ~ sat, data = dt)
##
## Coefficients:
## (Intercept) sat
## 0.663057 0.001931
# Get a detailed summary of the regression
summary(reg)
##
## Call:
## lm(formula = colgpa ~ sat, data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.84515 -0.38205 0.02968 0.42623 1.77382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.631e-01 6.972e-02 9.51 <2e-16 ***
## sat 1.931e-03 6.706e-05 28.80 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6012 on 4135 degrees of freedom
## Multiple R-squared: 0.167, Adjusted R-squared: 0.1668
## F-statistic: 829.3 on 1 and 4135 DF, p-value: < 2.2e-16
Let’s interpret the above results. If a student’s SAT score increases by one point, then their college GPA would be predicted to increase by roughly \(0.002\). This doesn’t seem that meaningful on its own. However, if a student’s SAT score increased by 100 points then their college GPA would be predicted to increase by \(0.002 * 100 = 0.2\).
Moreover, SAT scores seem to be a salient determinant of college GPA as indicated by the *** to the right of the p-value for sat in the above regression summary. The triple star indicates that this parameter is statistically different from zero at the \(\alpha = 0.001\) significance level (99.9% confidence level). A double star ** would have meant this parameter is statistically significant at the \(\alpha = 0.01\) significance level (99% confidence level), while a single star * would have meant this parameter is statistically different from zero at the \(\alpha = 0.05\) significance level (95% confidence level).
We also talked about the R-squared. The adjusted R-squared is only \(0.1668\), indicating we have some work to do in order to explain our outcome better. Perhaps using multiple regressors will help us out there.
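To connect the summary output to these numbers, here's a small sketch extracting them programmatically (coefs is just a hypothetical name for the coefficient matrix):
# The t value for sat is its estimate divided by its standard error
coefs <- summary(reg)$coefficients
coefs["sat", "Estimate"] / coefs["sat", "Std. Error"]
# Extract the R-squared and adjusted R-squared directly
summary(reg)$r.squared
summary(reg)$adj.r.squared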
We can use our fitted regression object reg to extract specific parts of the regression as follows:
# Extract the coefficients
reg$coefficients
## (Intercept) sat
## 0.663056766 0.001931058
# Extract the mean of the fitted values
mean(reg$fitted.values)
## [1] 2.652686
# Extract the mean of the residuals (notice this is approximately zero like we talked about in class)
mean(reg$residuals)
## [1] -3.455781e-17
# Extract the sum of the residuals (notice this is approximately zero like we talked about in class)
sum(reg$residuals)
## [1] -1.432136e-13
Lastly, we can predict a person’s college GPA with any value of SAT score. Let’s see what the predicted college GPA is when we use the mean SAT score as the predictor.
# Assign the mean SAT to a variable
mean_sat <- mean(dt[, sat], na.rm = TRUE)
# Place mean_sat into a data table having its column name as sat
dt_mean_sat <- data.table(sat = mean_sat)
# Predict college GPA using the mean SAT
predict(reg, dt_mean_sat)
## 1
## 2.652686
# Predict college GPA using mean SAT manually
reg$coefficients["(Intercept)"] + reg$coefficients["sat"] * mean_sat
## (Intercept)
## 2.652686
Notice that the above prediction is exactly the same as the mean of college GPA for our sample which we computed above. This is one of the OLS properties we discussed in class; the sample mean of our outcome lies on the predicted regression line. Stated another way, if we try to predict our outcome with the average value of the regressor, the prediction will be the average value of the outcome.
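We can verify this property programmatically (a quick sketch; all.equal checks equality up to floating-point tolerance):
# The prediction at the mean SAT equals the sample mean of college GPA
all.equal(as.numeric(predict(reg, dt_mean_sat)), mean(dt[, colgpa], na.rm = TRUE))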
We derived the SLR parameter estimates \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) in class. Let's use those formulas to compute the estimates rather than using lm().
# Create our outcome variable y as colgpa
y <- dt[, colgpa]
# Get the mean of y
y_bar <- mean(y, na.rm = TRUE)
# Create our regressor variable x as sat
x <- dt[, sat]
# Get the mean of x
x_bar <- mean(x, na.rm = TRUE)
# Compute slope estimate manually
beta_hat_1 <- cov(x, y) / var(x)
# Show the slope estimate computed manually
beta_hat_1
## [1] 0.001931058
# Compute intercept estimate manually
beta_hat_0 <- y_bar - beta_hat_1 * x_bar
# Show the intercept estimate computed manually
beta_hat_0
## [1] 0.6630568
Notice that these parameter estimates are identical to those obtained when using the lm() function.
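A quick programmatic check of that claim (a sketch; unname() just drops the coefficient names so the numeric values can be compared):
# Compare the manual estimates to lm()'s coefficients
all.equal(unname(reg$coefficients), c(beta_hat_0, beta_hat_1))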
Let’s also do this for the standard error of \(\widehat{\beta}_1\).
# Number of observations
n <- length(x)
# Number of estimated parameters
k <- 2
# Degrees of freedom
df <- n - k
# Obtain the residuals
resid <- y - beta_hat_0 - beta_hat_1 * x
# Obtain the estimate of the variance of the error term
var_u_hat <- sum(resid^2) / df
# Compute the SST of x
sst_x <- sum( (x - x_bar)^2 )
# Calculate the standard error of the slope estimate
se_beta_hat_1 <- sqrt(var_u_hat / sst_x)
# Show the standard error of the slope estimate
se_beta_hat_1
## [1] 6.705796e-05
Notice that this standard error is identical to the one reported by the lm() function.
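And the corresponding check in code (a sketch using the coefficient matrix from summary()):
# Compare the manual standard error to lm()'s reported standard error
all.equal(se_beta_hat_1, summary(reg)$coefficients["sat", "Std. Error"])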
We talked about using log-level regressions to get approximate percentage changes in our outcome variable. Let’s create a new variable as the natural log of college GPA and regress this on SAT score.
# Keep only observations with college GPA greater than zero because the natural log of any non-positive number is undefined
dt_log_level <- dt[colgpa > 0]
# Make a new column for the natural log of college GPA
dt_log_level[, lcolgpa := log(colgpa)]
# Regress lcolgpa on sat
reg_log_level <- lm(lcolgpa ~ sat, data = dt_log_level)
# Extract the coefficients of this regression
reg_log_level$coefficients
## (Intercept) sat
## 0.168093176 0.000748578
So, for a one unit increase in SAT score, our model predicts that college GPA will increase by approximately \(0.0007 * 100\% = 0.07\%\). For a one-hundred unit increase in SAT score, our model predicts that college GPA will increase by approximately \(0.0007 * 100 * 100\% = 7\%\).
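The same arithmetic can be done with the unrounded coefficient (a small sketch; the values differ slightly from the rounded ones above):
# Approximate % change in college GPA from a one-point increase in SAT
100 * unname(reg_log_level$coefficients["sat"])
# Approximate % change from a 100-point increase in SAT (about 7.5% unrounded)
100 * unname(reg_log_level$coefficients["sat"]) * 100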
We talked about using log-log regressions to get elasticities of our outcome variable with respect to our regressor. Let’s create the natural log of college GPA and regress this on the natural log of SAT score.
# Filter out observations with colgpa or sat at or below zero because the natural log is only defined for positive inputs
dt_log_log <- dt[colgpa > 0 & sat > 0]
# Make a new column for the natural log of college GPA
dt_log_log[, lcolgpa := log(colgpa)]
# Make a new column for the natural log of SAT
dt_log_log[, lsat := log(sat)]
# Regress lcolgpa on lsat
reg_log_log <- lm(lcolgpa ~ lsat, data = dt_log_log)
# Extract the coefficients of this regression
reg_log_log$coefficients
## (Intercept) lsat
## -4.1958457 0.7411999
So, for a one percent change in SAT score, our model predicts that college GPA will increase by approximately \(0.74\%\). For a 50 percent change in SAT score, our model predicts that college GPA will increase by approximately \(0.74 * 50\% = 37\%\).
Let's see how the sampling distributions of the OLS estimates look under different scenarios.
Under Assumptions 1-4, we know the sampling distribution of each parameter estimate should be centered around the truth. As shown below, this holds true. A few notes (a toy sketch of the looping pattern follows these notes):
beta_hat_0 = rep(NA, s) creates s = 1000 NA values and stores them in the column named beta_hat_0. We're essentially creating placeholders for our parameter estimates, and we want s of them because we will estimate \(\widehat{\beta}_0\) for each generated sample.
for (i in 1:s) is an example of a for loop that will execute the code in brackets 1, 2, ..., 1000 times because s = 1000. For loops are extremely handy when we have a task that needs to be repeated many times.
Within the loop, the code dt_estimates$beta_hat_0[i] <- reg$coefficients["(Intercept)"] means extract the estimated intercept term and place it in the dt_estimates data table in column beta_hat_0 at row i. Remember, i starts at one and goes to one thousand. So, over each iteration of the loop we replace the NA values we started with by the parameter estimate for the current sample, which is indexed by i.
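Here's the preallocate-and-fill pattern in miniature (a hypothetical toy, separate from the simulation):
# Placeholders for five results
squares <- rep(NA, 5)
# Fill each placeholder in turn
for (i in 1:5)
{
  squares[i] <- i^2
}
# squares now holds 1, 4, 9, 16, 25
squares
The simulation below applies exactly this pattern, just with a regression inside the loop.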
# Number of samples
s <- 1000
# Number of data points to generate per sample
n <- 200
# Create a data table with empty values to store the estimates of beta_hat_0 and beta_hat_1
dt_estimates <- data.table(beta_hat_0 = rep(NA, s), beta_hat_1 = rep(NA, s))
# Intercept term
beta_0 <- 1
# Slope term
beta_1 <- 2
# We want to create i = 1, 2, ..., s samples
for (i in 1:s)
{
# Error term
u <- rnorm(n, mean = 0, sd = 2)
# Covariate
x <- rnorm(n, mean = 5, sd = 2)
# Outcome
y <- beta_0 + beta_1 * x + u
# Create a data table
dt <- data.table(x = x, y = y)
# Run regression of y on x
reg <- lm(y ~ x, data = dt)
# Place the current intercept estimate in dt_estimates
dt_estimates$beta_hat_0[i] <- reg$coefficients["(Intercept)"]
# Place the current slope estimate in dt_estimates
dt_estimates$beta_hat_1[i] <- reg$coefficients["x"]
}
# Plot the density of the intercept estimates
ggplot(dt_estimates, aes(x = beta_hat_0)) +
geom_density(color = color_2, fill = color_2, alpha = 0.7) +
geom_vline(xintercept = beta_0, color = color_1, linetype = "dashed", size = 1) +
labs(title = expression(paste("Sampling Distribution of ", hat(beta)[0]))) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
) +
xlab(expression(hat(beta)[0])) +
ylab("Density")
# Plot the density of the slope estimates
ggplot(dt_estimates, aes(x = beta_hat_1)) +
geom_density(color = color_2, fill = color_2, alpha = 0.7) +
geom_vline(xintercept = beta_1, color = color_1, linetype = "dashed", size = 1) +
labs(title = expression(paste("Sampling Distribution of ", hat(beta)[1]))) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
) +
xlab(expression(hat(beta)[1])) +
ylab("Density")
If Assumption 4 does not hold, our parameter estimates become biased. However, in this specific case only \(\widehat{\beta}_1\) is biased: the error is \(u = 0.5x + \varepsilon\) with \(\varepsilon\) mean zero, so the outcome is effectively \(y = \beta_0 + (\beta_1 + 0.5)x + \varepsilon\) and the slope estimates center around \(2.5\) while the intercept estimates remain centered around the truth.
# Number of samples
s <- 1000
# Number of data points to generate per sample
n <- 200
# Create a data table with empty values to store the estimates of beta_hat_0 and beta_hat_1
dt_estimates <- data.table(beta_hat_0 = rep(NA, s), beta_hat_1 = rep(NA, s))
# Intercept term
beta_0 <- 1
# Slope term
beta_1 <- 2
# We want to create i = 1, 2, ..., s samples
for (i in 1:s)
{
# Covariate
x <- rnorm(n, mean = 5, sd = 2)
# Error term that depends on x (introducing bias in beta_hat_1)
u <- 0.5 * x + rnorm(n, mean = 0, sd = 2)
# Outcome
y <- beta_0 + beta_1 * x + u
# Create a data table
dt <- data.table(x = x, y = y)
# Run regression of y on x
reg <- lm(y ~ x, data = dt)
# Place the current intercept estimate in dt_estimates
dt_estimates$beta_hat_0[i] <- reg$coefficients["(Intercept)"]
# Place the current slope estimate in dt_estimates
dt_estimates$beta_hat_1[i] <- reg$coefficients["x"]
}
# Plot the density of the intercept estimates
ggplot(dt_estimates, aes(x = beta_hat_0)) +
geom_density(color = color_2, fill = color_2, alpha = 0.7) +
geom_vline(xintercept = beta_0, color = color_1, linetype = "dashed", size = 1) +
labs(title = expression(paste("Sampling Distribution of ", hat(beta)[0]))) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
) +
xlab(expression(hat(beta)[0])) +
ylab("Density")
# Plot the density of the slope estimates
ggplot(dt_estimates, aes(x = beta_hat_1)) +
geom_density(color = color_2, fill = color_2, alpha = 0.7) +
geom_vline(xintercept = beta_1, color = color_1, linetype = "dashed", size = 1) +
labs(title = expression(paste("Sampling Distribution of ", hat(beta)[1]))) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
) +
xlab(expression(hat(beta)[1])) +
ylab("Density")
In the below simulation, u is not mean zero and is a function of x. Consequently, both parameter estimates will be biased: the outcome is effectively \(y = (\beta_0 + 1) + (\beta_1 + 0.5)x + \varepsilon\) with \(\varepsilon\) mean zero, so the slope estimates center around \(2.5\) and the intercept estimates center around \(2\).
# Number of samples
s <- 1000
# Number of data points to generate per sample
n <- 200
# Create a data table with empty values to store the estimates of beta_hat_0 and beta_hat_1
dt_estimates <- data.table(beta_hat_0 = rep(NA, s), beta_hat_1 = rep(NA, s))
# Intercept term
beta_0 <- 1
# Slope term
beta_1 <- 2
# We want to create i = 1, 2, ..., s samples
for (i in 1:s)
{
# Covariate
x <- rnorm(n, mean = 5, sd = 2)
# Error term that depends on x and is no longer mean zero
u <- 0.5 * x + rnorm(n, mean = 1, sd = 2)
# Outcome
y <- beta_0 + beta_1 * x + u
# Create a data table
dt <- data.table(x = x, y = y)
# Run regression of y on x
reg <- lm(y ~ x, data = dt)
# Place the current intercept estimate in dt_estimates
dt_estimates$beta_hat_0[i] <- reg$coefficients["(Intercept)"]
# Place the current slope estimate in dt_estimates
dt_estimates$beta_hat_1[i] <- reg$coefficients["x"]
}
# Plot the density of the intercept estimates
ggplot(dt_estimates, aes(x = beta_hat_0)) +
geom_density(color = color_2, fill = color_2, alpha = 0.7) +
geom_vline(xintercept = beta_0, color = color_1, linetype = "dashed", size = 1) +
labs(title = expression(paste("Sampling Distribution of ", hat(beta)[0]))) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
) +
xlab(expression(hat(beta)[0])) +
ylab("Density")
# Plot the density of the slope estimates
ggplot(dt_estimates, aes(x = beta_hat_1)) +
geom_density(color = color_2, fill = color_2, alpha = 0.7) +
geom_vline(xintercept = beta_1, color = color_1, linetype = "dashed", size = 1) +
labs(title = expression(paste("Sampling Distribution of ", hat(beta)[1]))) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
) +
xlab(expression(hat(beta)[1])) +
ylab("Density")
In the below simulation, u is not mean zero, but it is no longer a function of \(x\). So, our constant term will effectively absorb the mean of u. This means that whatever the mean of u is and whatever the true value of \(\beta_0\) is, the new true value of the intercept will be \(\widetilde{\beta}_0 = \beta_0 + \overline{u}\). Here \(\beta_0 = 1\) and \(\overline{u} = 2\), so the intercept estimates should center around \(3\); accordingly, the dashed line in the intercept plot below is drawn at beta_0 + u_bar.
# Number of samples
s <- 1000
# Number of data points to generate per sample
n <- 200
# Create a data table with empty values to store the estimates of beta_hat_0 and beta_hat_1
dt_estimates <- data.table(beta_hat_0 = rep(NA, s), beta_hat_1 = rep(NA, s))
# Mean of error term
u_bar <- 2
# Intercept term
beta_0 <- 1
# Slope term
beta_1 <- 2
# We want to create i = 1, 2, ..., s samples
for (i in 1:s)
{
# Covariate
x <- rnorm(n, mean = 5, sd = 2)
# Error term that is no longer mean zero
u <- rnorm(n, mean = u_bar, sd = 2)
# Outcome
y <- beta_0 + beta_1 * x + u
# Create a data table
dt <- data.table(x = x, y = y)
# Run regression of y on x
reg <- lm(y ~ x, data = dt)
# Place the current intercept estimate in dt_estimates
dt_estimates$beta_hat_0[i] <- reg$coefficients["(Intercept)"]
# Place the current slope estimate in dt_estimates
dt_estimates$beta_hat_1[i] <- reg$coefficients["x"]
}
# Plot the density of the intercept estimates
ggplot(dt_estimates, aes(x = beta_hat_0)) +
geom_density(color = color_2, fill = color_2, alpha = 0.7) +
geom_vline(xintercept = beta_0 + u_bar, color = color_1, linetype = "dashed", size = 1) +
labs(title = expression(paste("Sampling Distribution of ", hat(beta)[0]))) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
) +
xlab(expression(hat(beta)[0])) +
ylab("Density")
# Plot the density of the slope estimates
ggplot(dt_estimates, aes(x = beta_hat_1)) +
geom_density(color = color_2, fill = color_2, alpha = 0.7) +
geom_vline(xintercept = beta_1, color = color_1, linetype = "dashed", size = 1) +
labs(title = expression(paste("Sampling Distribution of ", hat(beta)[1]))) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
) +
xlab(expression(hat(beta)[1])) +
ylab("Density")
We want variation in our regressor! We know that the SST of our regressor increases as the number of observations increases (recall the SST of x is \(\displaystyle \sum_{i=1}^n \left( x_i - \overline{x} \right)^2\)). Let's create one hundred samples at each of two sample sizes, \(n_1 = 100\) observations and \(n_2 = 1000\) observations. Then we'll plot the sampling distribution of \(\widehat{\beta}_1\) for both scenarios. If regressor variation is truly a good thing, then our sampling distribution should be more tightly centered around the true parameter value we are estimating (under Assumptions 1-4) for the samples of size \(n_2\) relative to the samples of size \(n_1\).
# Number of samples
s <- 100
# Number of data points to generate per sample
n_1 <- 100
n_2 <- 1000
# Create data tables with empty values to store the estimates of beta_hat_1
dt_estimates_1 <- data.table(beta_hat_1 = rep(NA, s), sample_size = rep(n_1, s))
dt_estimates_2 <- data.table(beta_hat_1 = rep(NA, s), sample_size = rep(n_2, s))
# Intercept term
beta_0 <- 1
# Slope term
beta_1 <- 2
# We want to create i = 1, 2, ..., s samples
for (i in 1:s)
{
# Covariates
x_1 <- rnorm(n_1, mean = 5, sd = 2)
x_2 <- rnorm(n_2, mean = 5, sd = 2)
# Error terms
u_1 <- rnorm(n_1, mean = 0, sd = 2)
u_2 <- rnorm(n_2, mean = 0, sd = 2)
# Outcomes
y_1 <- beta_0 + beta_1 * x_1 + u_1
y_2 <- beta_0 + beta_1 * x_2 + u_2
# Create data tables
dt_1 <- data.table(x = x_1, y = y_1)
dt_2 <- data.table(x = x_2, y = y_2)
# Run regressions
reg_1 <- lm(y ~ x, data = dt_1)
reg_2 <- lm(y ~ x, data = dt_2)
# Place the current slope estimate in dt_estimates_1 and dt_estimates_2
dt_estimates_1$beta_hat_1[i] <- reg_1$coefficients["x"]
dt_estimates_2$beta_hat_1[i] <- reg_2$coefficients["x"]
}
# Combine the two data tables for plotting
dt_combined <- rbind(dt_estimates_1, dt_estimates_2)
# Plot the density of the slope estimates for both sample sizes
ggplot(dt_combined, aes(x = beta_hat_1, fill = factor(sample_size))) +
geom_density(alpha = 0.7) +
geom_vline(xintercept = beta_1, color = color_1, linetype = "dashed", size = 1) +
labs(title = expression(paste("Sampling Distribution of ", hat(beta)[1])),
fill = "Sample Size") +
scale_fill_manual(
values = c("100" = "#9370DB", "1000" = "#FFA07A"),
labels = c("100", "1000")
) +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold"),
legend.background = element_rect(fill = "black"),
legend.text = element_text(color = color_1, size = 15),
legend.title = element_text(color = color_2, size = 20)
) +
xlab(expression(hat(beta)[1])) +
ylab("Density")
Indeed, for larger samples the sampling distribution of our estimator is more tightly centered around the true value (under Assumptions 1-4 needed for unbiasedness). This means that for larger values of \(n\) and, consequently, a larger regressor SST, we are more certain about our estimates.