Preliminaries

# Clear environment, plot pane, and console
rm(list = ls())
graphics.off()
cat("\014")
# Set working directory
setwd("C:/Users/wbras/OneDrive/Desktop/UA/Fall_2024/ECON_418-518/ECON_418-518_Lecture_Slides/ECON_418-518_Lecture_Slides_E/ECON_418-518_Matrix_Algebra/ECON_418-518_Matrix_Algebra_Code")

# Install pacman
if (!require(pacman)) install.packages("pacman")

# Load packages
pacman::p_load(data.table, wooldridge)

Regression Summary Using lm()

Regression Using the lm() Function

# Load in wage1 data from wooldridge
dt <- as.data.table(wooldridge::wage1)

# Show first few rows of our data
head(dt)
# Make experience squared variable
dt[, exper2 := exper^2]

# Let's run a regression of wage on education, experience, and experience squared
reg <- lm(data = dt, wage ~ educ + exper + exper2)

# Show summary of regression
summary(reg)
## 
## Call:
## lm(formula = wage ~ educ + exper + exper2, data = dt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0692 -2.0837 -0.5417  1.2860 15.1363 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.964890   0.752153  -5.271 1.99e-07 ***
## educ         0.595343   0.053025  11.228  < 2e-16 ***
## exper        0.268287   0.036897   7.271 1.31e-12 ***
## exper2      -0.004612   0.000822  -5.611 3.26e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.166 on 522 degrees of freedom
## Multiple R-squared:  0.2692, Adjusted R-squared:  0.265 
## F-statistic: 64.11 on 3 and 522 DF,  p-value: < 2.2e-16
# Show the 95% confidence interval for each estimated parameter
confint(reg)
##                    2.5 %       97.5 %
## (Intercept) -5.442507629 -2.487271759
## educ         0.491174081  0.699511772
## exper        0.195802286  0.340771681
## exper2      -0.006227036 -0.002997513

Regression Summary Output Done Manually

Obtain \(\widehat{\beta}\) Manually

Some notes on the code below:

  1. The function cbind(Intercept, dt[, .(educ, exper, exper2)]) creates a data.table whose first column is the Intercept vector, followed by the columns educ, exper, and exper2. The as.matrix() function then turns this data.table into a matrix.
  2. The function t() gives the transpose of a vector or matrix.
  3. The function solve() computes the inverse of a matrix.
  4. The operator %*% performs matrix multiplication: matrix by matrix, vector by vector, and matrix by vector. A small illustration of these functions follows this list.
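
As a small, self-contained sketch (the 2 x 2 matrix A below is made up purely for illustration and is not part of the wage data), here is how t(), solve(), and %*% behave:

# Small made-up 2 x 2 matrix to illustrate t(), solve(), and %*%
A <- matrix(c(2, 1, 1, 3), nrow = 2)

# Transpose of A
t(A)

# solve(A) is the inverse of A, so solve(A) %*% A returns the 2 x 2 identity matrix (up to rounding)
solve(A) %*% A
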
# Make our y vector
y <- dt[, wage]

# Make a vector of 1's to add as the first column of our X matrix for the intercept
Intercept <- rep(1, nrow(dt))

# Now lets construct our X matrix
X <- as.matrix(cbind(Intercept, dt[, .(educ, exper, exper2)]))

# Show the first 5 rows and all columns of X
X[1:5, ]
##      Intercept educ exper exper2
## [1,]         1   11     2      4
## [2,]         1   12    22    484
## [3,]         1   11     2      4
## [4,]         1    8    44   1936
## [5,]         1   12     7     49
# Transpose of X
Xprime <- t(X)

# Compute the OLS solution for beta_hat 
beta_hat <- solve(Xprime %*% X) %*% Xprime %*% y

# Show beta_hat
beta_hat
##                   [,1]
## Intercept -3.964889694
## educ       0.595342926
## exper      0.268286984
## exper2    -0.004612274

Obtain the Standard Errors Manually

The standard error of an estimate tells us how uncertain we are about how close that estimate is to the parameter it is estimating. A larger standard error means we are less certain, while a smaller standard error means we are more certain.
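
Concretely, the code below forms the estimated variance-covariance matrix \(\widehat{\text{Var}}\left( \widehat{\beta} \right) = \widehat{\sigma}^2 \left( X^\prime X \right)^{-1}\), where \(\widehat{\sigma}^2 = \frac{1}{n - k} \sum_{i=1}^{n} \widehat{u}_i^2\), and the standard errors are the square roots of its diagonal elements.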

# Number of observations
n <- nrow(X)

# Number of estimated parameters (including intercept)
k <- ncol(X)

# Obtain the estimated residuals
u_hat <- y - X %*% beta_hat

# Obtain the estimated variance of u (dividing by n - k applies the degrees of freedom correction)
sigma_hat_sq <- sum(u_hat^2) / (n - k)

# Get inverse of X_prime %*% X
Xprime_X_inverse <- solve(Xprime %*% X)

# Obtain the estimated variance-covariance matrix
var_cov <- sigma_hat_sq * Xprime_X_inverse

# Get the diagonal elements of the estimated variance-covariance matrix (the elements along the diagonal are the variances of the estimated parameters)
beta_hat_var <- diag(var_cov)

# Take the square root of the estimated variance of the estimated parameters to get the standard error
beta_hat_se <- sqrt(beta_hat_var)

# Show the standard errors
beta_hat_se
##   Intercept        educ       exper      exper2 
## 0.752152552 0.053025116 0.036896920 0.000821963
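
As a quick check (this comparison is an addition, not part of the original output), the same standard errors can be pulled straight from the fitted lm object:

# Standard errors from the fitted model's variance-covariance matrix; these should match beta_hat_se
sqrt(diag(vcov(reg)))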

Obtain the T-Statistics Manually

The T-statistics produced in R’s regression output are given by \(T = \frac{\widehat{\beta}}{\text{se}\left( \widehat{\beta} \right)}\). The T-statistic allows us to conduct hypothesis tests of many kinds, but this particular one is built for the test of \(H_0: \beta = 0\) versus \(H_A: \beta \neq 0\).

# Divide beta_hat by its standard error to get its T-statistic
beta_hat_t <- beta_hat / beta_hat_se

# Show the T-statistics
beta_hat_t
##                [,1]
## Intercept -5.271390
## educ      11.227565
## exper      7.271257
## exper2    -5.611292

Obtain the p-Values Manually

The p-value is the probability of observing a T-statistic at least as extreme as the one computed from the sample data, assuming that the null hypothesis (that the coefficient is equal to zero) is true. It is the smallest level of \(\alpha\) such that the null hypothesis of \(\beta = 0\) can be rejected.

The call pt(abs(beta_hat_t), n - k, lower.tail = FALSE) uses the CDF of the T-distribution with n - k degrees of freedom to obtain the probability that a T-distributed random variable with n - k degrees of freedom exceeds the absolute value of our T-statistic (exceeds, because we set lower.tail = FALSE). We then multiply by two because we need the area in both the lower and upper tails; since the T-distribution is symmetric around zero, multiplying by two gives exactly this, i.e., \(p = 2 \cdot \Pr\left( T_{n-k} > \left| T \right| \right)\).

# Use pt (CDF of t-distribution) to get the p-value
beta_hat_p <- 2 * pt(abs(beta_hat_t), n - k, lower.tail = FALSE)

# Show the p-values
beta_hat_p
##                   [,1]
## Intercept 1.985286e-07
## educ      2.375714e-26
## exper     1.310188e-12
## exper2    3.263148e-08
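
As another added check, the p-values stored in the regression summary's coefficient table should agree with these up to rounding:

# p-values from the coefficient table of the fitted model
coef(summary(reg))[, "Pr(>|t|)"]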

Obtain the Confidence Interval Manually

Confidence intervals are my preferred way to conduct hypothesis tests. For instance, suppose we wish to test whether the coefficient on education is equal to zero at the 5% significance level (i.e., test whether this parameter is statistically significant at the 5% level). This corresponds to \(H_0: \beta_1 = 0\) versus \(H_A: \beta_1 \neq 0\). Upon generating the confidence interval, if zero lies inside it, then we fail to reject the null. Otherwise, we reject the null and conclude that education is a statistically significant determinant of wage at the 5% significance level (95% confidence level). Let’s create the confidence interval and carry out this test.

We use qt(0.025, df = n - k, lower.tail = FALSE) because we want 2.5% of the area under the T-distribution’s PDF to be in the upper tail and 2.5% of the area to be in the lower tail; \(2.5\% + 2.5\% = 5\%\), which is exactly what we need for a test at the 5% significance level. Setting lower.tail = FALSE gives the T-critical value that leaves 0.025 of the area in the upper tail.

# Let's get the T-critical value for a 95% confidence level test
t_crit <- qt(0.025, df = n - k, lower.tail = FALSE)

# Create a vector that holds the left and right endpoints of the confidence interval
ci <- c(beta_hat[2] - t_crit * beta_hat_se[2], beta_hat[2] + t_crit * beta_hat_se[2])

# Show the vector
ci
##      educ      educ 
## 0.4911741 0.6995118

Notice this is exactly what we got from calling confint(reg) above. Because zero is excluded from this interval, we reject the null hypothesis at the 5% significance level and conclude that education is a statistically significant determinant of a person’s wage.
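
The same interval can also be requested directly from the fitted model (this call is an addition for comparison):

# Ask confint() for just the education coefficient; it should match the manual interval
confint(reg, parm = "educ")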

Obtain the Residual Standard Error Manually

The residual standard error tells us roughly how large the residuals are on average, i.e., how far the fitted values typically are from the observed values of wage.

# Take the square root of the residual variance to get its standard error
u_hat_se <- sqrt(sigma_hat_sq)

# Show the residual standard error
u_hat_se
## [1] 3.166073
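
For comparison (an addition to the original), the same quantity is stored in the regression summary:

# Residual standard error reported by summary()
summary(reg)$sigma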

Obtain the R-squared Manually

There are two measures of fit to know: the R-squared and the adjusted R-squared. A high R-squared means our model explains a large share of the variation in the outcome. However, including more covariates doesn’t necessarily mean our model does a better job: the flaw in the R-squared is that it never decreases as more covariates are added, even irrelevant ones. The adjusted R-squared corrects for this, so it is the measure of model fit you should mainly look at.

# Obtain the total sum of squares (SST)
SST <- sum((y - mean(y))^2)

# Get the residual sum of squares (SSR)
SSR <- sum((u_hat)^2)

# Use the SST and SSR to get the R-squared
r_squared <- 1 - (SSR / SST)

# Show the R-squared
r_squared
## [1] 0.2692409

Obtain the Adjusted R-squared Manually
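
The adjusted R-squared applies a degrees of freedom penalty to the R-squared: \(\bar{R}^2 = 1 - \frac{\left( 1 - R^2 \right) \left( n - 1 \right)}{n - k}\). An added covariate raises the adjusted R-squared only if it improves the fit by more than this penalty.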

# Calculate the adjusted R-squared
r_squared_adjusted <- 1 - ((1 - r_squared) * (n - 1) / (n - k))

# Show the adjusted R-squared
r_squared_adjusted
## [1] 0.2650412
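
As an added check, both measures of fit are stored in the regression summary:

# R-squared and adjusted R-squared reported by summary()
c(summary(reg)$r.squared, summary(reg)$adj.r.squared)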

Obtain the F-statistic and its p-value Manually

The F-statistic is used for the hypothesis test of \(H_0: \beta_1 = \beta_2 = \beta_3 = 0\) versus \(H_A: \text{At least one of } \beta_1, \ \beta_2, \ \text{or} \ \beta_3 \ \text{is not equal to 0}\). In words, we are testing whether we should include only an intercept in our model. A larger F-statistic makes us more likely to reject this null hypothesis. However, to determine which of the parameters is not equal to \(0\), we must carry out individual T-tests. The p-value associated with this F-statistic gives the smallest significance level \(\alpha\) such that the null hypothesis can be rejected. We compute the p-value almost exactly as above, except we now use the function pf() because we need the CDF of the F-distribution.
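
The code below uses the R-squared form of the F-statistic, which is algebraically equivalent to the sums-of-squares form: \(F = \frac{R^2 / (k - 1)}{\left( 1 - R^2 \right) / (n - k)} = \frac{SSE / (k - 1)}{SSR / (n - k)}\).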

# Get the explained sum of squares (SSE)
SSE <- SST - SSR

# Calculate the numerator of the F-statistic (its degrees of freedom is the number of estimated parameters excluding the intercept)
F_stat_num <- r_squared / (k - 1)

# Calculate the denominator of the F-statistic (degrees of freedom is the number of observations minus the number of estimated parameters)
F_stat_denom <- (1 - r_squared) / (n - k)

# Calculate the F-statistic
F_stat <- F_stat_num / F_stat_denom

# Show the F-statistic
F_stat
## [1] 64.10858
# Use pf (CDF of F-distribution) to get the p-value
F_stat_p <- pf(F_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# Show the p-value on the F-statistic
F_stat_p
## [1] 2.653192e-35
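
As a final added check, the regression summary stores the F-statistic together with its numerator and denominator degrees of freedom:

# F-statistic (value, numdf, dendf) reported by summary()
summary(reg)$fstatistic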

Solving Non-Linear Problems with optim()

Because the least squares problem of solving for \(\widehat{\beta}\) is linear in the parameters, it has a closed form solution given by \(\widehat{\beta} = \left ( X^\prime X \right)^{-1} X^\prime y\). However, many problems in econometrics, statistics, and machine learning are non-linear and, consequently, do not have such a simple solution. In such cases, we rely on more advanced methods such as numerical optimization (e.g., gradient descent/ascent in machine learning) or automatic differentiation (e.g., backpropagation in deep learning). optim() is a numerical optimization solver built into R that is useful when we face a non-linear problem.

In this section, we will define a function objective_function that computes the sum of the squared residuals which is the function OLS attempts to minimize. We will then use optim() to minimize objective_function. optim() requires you to pass initial guesses of the parameters, which we typically set to zero unless we have some intuition on what those estimated parameters might be. We will see that optim() provides a solution very close to the closed form solution \(\left ( X^\prime X \right)^{-1} X^\prime y\), though not exactly the same.

The key takeaway is that many real-world problems are non-linear, and optim() offers a robust starting point for tackling such challenges.

Here is an example of the implementation:

# Define the objective function to minimize (sum of squared residuals)
objective_function <- function(params) 
{
  # This function calculates the sum of squared residuals (SSR) for a given set of parameters.
  # Recall, this is exactly the least squares problem: find the set of parameters that minimizes the SSR (i.e., gives the least squares).
  #
  # Args:
  #   params: A numeric vector of regression coefficients (including the intercept).
  #
  # Returns:
  #   The sum of squared residuals for the given parameters.
  
  # Calculate the predicted values
  y_hat <- X %*% params
  
  # Calculate the residuals
  u_hat <- y - y_hat
  
  # Calculate the sum of squared residuals
  SSR <- sum(u_hat^2)
  
  # The function will return (output) the computed SSR
  return(SSR)
}

# Initial guess for the parameters
initial_params <- rep(0, ncol(X))

# Use optim to minimize the objective function
result <- optim(initial_params, objective_function)

# Extract the estimated parameters
estimated_params <- result$par

# Show the estimated parameters that minimize the SSR
estimated_params
## [1] -3.947029757  0.594304342  0.267882070 -0.004608709
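
One refinement worth noting (an addition to the code above): by default optim() uses the derivative-free Nelder-Mead method. For a smooth objective like the SSR, a gradient-based method such as BFGS will typically land even closer to the closed form solution. A minimal sketch:

# Re-run the minimization with the gradient-based BFGS method
result_bfgs <- optim(initial_params, objective_function, method = "BFGS")

# Compare the closed form solution with both optim() solutions
cbind(closed_form = as.vector(beta_hat),
      nelder_mead = result$par,
      bfgs        = result_bfgs$par)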