Preliminaries

The preliminaries below are fairly standard, except we now load the randomForest package, which allows us to estimate random forest models. We also load the caret package, which contains a variety of powerful functions for ML.

# Clear environment, plot pane, and console
rm(list = ls())
graphics.off()
cat("\014")
# If pacman is not already installed, then install it
if (!require(pacman)) install.packages("pacman")

# Load packages
pacman::p_load(ISLR2, caret, randomForest, rpart.plot, data.table, ggplot2)

# Colors used for plot (https://www.rapidtables.com/web/color/RGB_Color.html)
color_1 <- "#00FF00"
color_2 <- "#ADD8E6"

# Set random seed for reproducibility
set.seed(418518)

# Set scipen option to a high value to avoid scientific notation
options(scipen = 999)

Data Analysis with an ISLR2 Dataset

We first use classification trees to analyze the Carseats data set, a simulated data set containing sales of child car seats at 400 different stores. In these data, Sales is a continuous variable, so we begin by recoding it as a binary variable using the ifelse() function: if Sales exceeds 8, the new variable equals one, and zero otherwise. You’ll see we use the as.factor() function so R knows our outcome variable is categorical.

# Read in data
dt <- data.table(ISLR2::Carseats)

# Convert Sales variable to a binary indicator
dt[, Sales := as.factor(ifelse(Sales > 8, 1, 0))]

# Show the first few rows of the data
head(dt)
# Show the last few rows of the data
tail(dt)
# Get the names of the variables in our dataset
names(dt)
##  [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
##  [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"      
## [11] "US"
# Get summary statistics
summary(dt)
##  Sales     CompPrice       Income        Advertising       Population   
##  0:236   Min.   : 77   Min.   : 21.00   Min.   : 0.000   Min.   : 10.0  
##  1:164   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000   1st Qu.:139.0  
##          Median :125   Median : 69.00   Median : 5.000   Median :272.0  
##          Mean   :125   Mean   : 68.66   Mean   : 6.635   Mean   :264.8  
##          3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000   3rd Qu.:398.5  
##          Max.   :175   Max.   :120.00   Max.   :29.000   Max.   :509.0  
##      Price        ShelveLoc        Age          Education    Urban    
##  Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.0   No :118  
##  1st Qu.:100.0   Good  : 85   1st Qu.:39.75   1st Qu.:12.0   Yes:282  
##  Median :117.0   Medium:219   Median :54.50   Median :14.0            
##  Mean   :115.8                Mean   :53.32   Mean   :13.9            
##  3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.0            
##  Max.   :191.0                Max.   :80.00   Max.   :18.0            
##    US     
##  No :142  
##  Yes:258  
##           
##           
##           
## 
# Get the number of observations
paste0("There are ", nrow(dt), " observations in dt.")
## [1] "There are 400 observations in dt."
# Get the number of variables in the dataset 
paste0("There are ", ncol(dt), " variables in dt.")
## [1] "There are 11 variables in dt."
# Get the number of NAs in the dataset 
paste0("There are ", sum(is.na(dt)), " NAs in dt.")
## [1] "There are 0 NAs in dt."

Remember, trees and forests are among the few ML algorithms that do not require feature normalization, so we can skip that step in our data preprocessing.
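
For reference, if we were fitting a model that does require feature normalization (e.g., KNN or a penalized regression), caret's preProcess() function can center and scale the numeric predictors. Below is a minimal sketch of the step we are skipping here; the variable list is simply the numeric predictors in Carseats.

# Not needed for trees/forests; shown only for reference
numeric_vars <- c("CompPrice", "Income", "Advertising", "Population", "Price", "Age", "Education")
pp <- preProcess(dt[, ..numeric_vars], method = c("center", "scale"))
head(predict(pp, dt[, ..numeric_vars]))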

Now, we will do an 80-20 training-testing split.

# Training and test set sizes
train_size <- floor(0.8 * nrow(dt))
test_size <- floor(0.2 * nrow(dt))

# Shuffle the data
shuffled_indices <- sample(nrow(dt))

# Split the indices
train_indices <- shuffled_indices[1:train_size]
test_indices <- shuffled_indices[(train_size + 1):nrow(dt)]

# Create the datasets
dt_train <- dt[train_indices, ]
dt_test <- dt[test_indices, ]

# Show first few rows of training and testing sets
head(dt_train)
head(dt_test)
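
As an aside, caret's createDataPartition() can produce a split that preserves the class balance of Sales in both pieces. A minimal sketch of that alternative (we stick with the simple shuffled split above):

# Stratified 80-20 split (alternative to the manual shuffle above)
strat_idx <- createDataPartition(dt$Sales, p = 0.8, list = FALSE)
dt_train_strat <- dt[as.vector(strat_idx), ]
dt_test_strat <- dt[-as.vector(strat_idx), ]
# Check that the class proportions are preserved in the training piece
table(dt_train_strat$Sales) / nrow(dt_train_strat)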

You’ll notice some categorical variables in our data table currently take on values as words (factor levels) rather than numbers. R’s ML routines will handle this for us when estimating our models, so we don’t have to convert them to numerical values beforehand.
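
For the curious, caret's formula interface builds a model matrix behind the scenes, which encodes factors such as ShelveLoc, Urban, and US as dummy columns. A quick way to see roughly what that design matrix looks like:

# Dummy-encoded design matrix created by the formula interface
head(model.matrix(Sales ~ ., data = dt_train))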

Classification Tree

Let us now use a single tree to try to classify our outcome variable using all the covariates in our data table. We’ll use the train() function in the caret package to estimate this classification tree model. The syntax will be fairly standard, except we will now use two new things:

  1. The method = "rpart" argument within the train() function. This tells train() to use the CART algorithm for building the decision tree.
  2. The alpha_grid <- expand.grid(cp = seq(0.001, 0.1, length = 50)) command, which tells R to create a grid of 50 cost-complexity tuning parameter values ranging from \(0.001\) to \(0.1\). Recall that a higher value of the cost-complexity tuning parameter means the tree will make fewer splits in the hope of better generalization (the pruning criterion is written out just below this list).
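
For reference, CART's cost-complexity pruning selects the subtree \(T\) that minimizes \(R(T) + \alpha|T|\), where \(R(T)\) is the tree's training misclassification error and \(|T|\) is its number of terminal nodes (leaves). A larger \(\alpha\) penalizes additional leaves more heavily, which is why it produces a smaller tree; in the rpart/caret output this tuning parameter is labeled cp.
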
# Define a grid of cp values
alpha_grid <- expand.grid(cp = seq(0.001, 0.1, length = 50))

# Define the model type
model_tree <- train(
  Sales ~ .,        
  data = dt_train,        
  method = "rpart",
  tuneGrid = alpha_grid,  
  trControl = trainControl(method = "cv", number = 10)
)

# Show model
model_tree
## CART 
## 
## 320 samples
##  10 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 288, 288, 289, 287, 288, 288, ... 
## Resampling results across tuning parameters:
## 
##   cp           Accuracy   Kappa    
##   0.001000000  0.7376802  0.4675273
##   0.003020408  0.7376802  0.4675273
##   0.005040816  0.7344544  0.4598905
##   0.007061224  0.7344544  0.4598905
##   0.009081633  0.7439241  0.4740864
##   0.011102041  0.7376741  0.4601633
##   0.013122449  0.7314180  0.4452576
##   0.015142857  0.7314180  0.4438589
##   0.017163265  0.7314180  0.4438589
##   0.019183673  0.7314180  0.4438589
##   0.021204082  0.7375733  0.4534793
##   0.023224490  0.7375733  0.4534793
##   0.025244898  0.7345491  0.4458989
##   0.027265306  0.7345491  0.4458989
##   0.029285714  0.7345491  0.4458989
##   0.031306122  0.7345491  0.4458989
##   0.033326531  0.7345491  0.4458989
##   0.035346939  0.7312408  0.4450940
##   0.037367347  0.7312408  0.4450940
##   0.039387755  0.7312408  0.4450940
##   0.041408163  0.7312408  0.4450940
##   0.043428571  0.7341764  0.4544241
##   0.045448980  0.7341764  0.4544241
##   0.047469388  0.7310514  0.4361451
##   0.049489796  0.7310514  0.4361451
##   0.051510204  0.7311461  0.4334246
##   0.053530612  0.7311461  0.4334246
##   0.055551020  0.7311461  0.4334246
##   0.057571429  0.7311461  0.4334246
##   0.059591837  0.7407228  0.4533073
##   0.061612245  0.7407228  0.4533073
##   0.063632653  0.7407228  0.4533073
##   0.065653061  0.7407228  0.4533073
##   0.067673469  0.7407228  0.4533073
##   0.069693878  0.7407228  0.4533073
##   0.071714286  0.7407228  0.4533073
##   0.073734694  0.7407228  0.4533073
##   0.075755102  0.7407228  0.4533073
##   0.077775510  0.7407228  0.4533073
##   0.079795918  0.7407228  0.4533073
##   0.081816327  0.7407228  0.4533073
##   0.083836735  0.7407228  0.4533073
##   0.085857143  0.7407228  0.4533073
##   0.087877551  0.7407228  0.4533073
##   0.089897959  0.7407228  0.4533073
##   0.091918367  0.7407228  0.4533073
##   0.093938776  0.7286015  0.4241021
##   0.095959184  0.7286015  0.4241021
##   0.097979592  0.7286015  0.4241021
##   0.100000000  0.7098515  0.3780069
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.009081633.

You can see the chosen cost-complexity tuning parameter based on 10-Fold CV is \(\alpha = 0.0091\). The Kappa statistic just measures the agreement between observed and predicted classifications, adjusted for the agreement that could happen by chance. It ranges from -1 to 1:

  1. Kappa = 1: Perfect agreement between observed and predicted values.
  2. Kappa = 0: Agreement is no better than random chance.
  3. Kappa < 0: Agreement is worse than what would be expected by chance.

A higher Kappa value indicates better performance of the classification model, as it suggests that the model is making more accurate predictions than would be expected by random guessing.
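
To make this concrete, Kappa is computed as \(\kappa = (p_o - p_e) / (1 - p_e)\), where \(p_o\) is the observed agreement and \(p_e\) is the agreement expected by chance. Here is a minimal sketch of the calculation on a small confusion matrix whose counts are made up purely for illustration:

# Hypothetical 2x2 confusion matrix (rows = predicted class, columns = observed class)
tab <- matrix(c(40, 10, 5, 45), nrow = 2, byrow = TRUE)
p_o <- sum(diag(tab)) / sum(tab)                     # observed agreement
p_e <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2 # agreement expected by chance
(p_o - p_e) / (1 - p_e)                              # Kappa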

Let us plot the structure of the tree using the rpart.plot() function from the rpart.plot package.

# Extract the underlying rpart object from the caret model
rpart_model <- model_tree$finalModel

# Plot the tree
rpart.plot::rpart.plot(rpart_model, type = 3, extra = 101, fallen.leaves = TRUE)

This is what our tree looks like! In the far left leaf node we see \(0\) on top, \(76 \ 3\) underneath it, and \(25\%\) at the bottom. This means every observation in that leaf is predicted (classified) as class \(0\); \(76\) of those observations truly are class zero, while the other \(3\) are actually class one and are therefore misclassified; and the \(79\) observations in this leaf account for roughly \(25\%\) of the training set.
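
If you would like to check these counts against the fitted object rather than the plot, the underlying rpart model stores per-node information in its frame. A quick sketch (n is the number of observations in each leaf and yval is the predicted class index):

# Per-leaf observation counts and predicted classes from the fitted rpart object
leaves <- rpart_model$frame[rpart_model$frame$var == "<leaf>", ]
leaves[, c("n", "yval")]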

Now, we will plot the accuracy rate against the value of \(\alpha\) to visualize our findings.

# Extract the results from the model
results <- model_tree$results

# Plot the effect of cp on accuracy
ggplot(results, aes(x = cp, y = Accuracy)) +
  geom_line(color = color_2, size = 1) +
  geom_point(color = color_1, size = 3) +
  labs(title = expression("Effect of" ~ alpha ~ "on Accuracy"),
       x = expression(alpha),
       y = "Accuracy") +
  theme(
    panel.background = element_rect(fill = "black", color = "black"),
    plot.background = element_rect(fill = "black", color = "black"),
    panel.grid.major = element_line(color = "gray"),
    panel.grid.minor = element_line(color = "gray"),
    axis.text = element_text(color = color_1, size = 15),
    axis.title = element_text(color = color_2, size = 25),
    plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
  )

We see the results are fairly stable, but once we increase the complexity parameter \(\alpha\) too much, the tree makes too few splits and the bias (underfitting) of our model increases.
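
If you just want the selected value of \(\alpha\) rather than reading it off the plot, caret stores it in the fitted object. For example:

# The cp value chosen by 10-fold CV and its resampled performance
model_tree$bestTune
results[results$cp == model_tree$bestTune$cp, c("cp", "Accuracy", "Kappa")]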

Classification Random Forests

Let us now build three different random forest models: one with one hundred trees in the forest, another with three hundred trees, and a final one with five hundred trees. To do so, we will use the ntree = t argument, which the train() function passes through to randomForest() so that t trees are grown in the forest.

Recall that random forests consider only a random subset of the predictors as candidate splitting variables at each split. Below you will see the command mtry_grid <- expand.grid(mtry = c(1, 3, 5, 7, ncol(dt_train) - 1)), which gives a grid of possible values for the number of candidate splitting variables. For instance, when we build a random forest with mtry = 1, each split in each tree considers only one randomly chosen variable, and that variable can differ from split to split and from tree to tree.
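
As a point of reference, randomForest()'s default for classification problems is mtry = floor(sqrt(p)), where p is the number of predictors. With our ten predictors that works out to three, which is one of the values in our grid:

# Default number of candidate splitting variables for classification
floor(sqrt(ncol(dt_train) - 1))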

# Define the grid of mtry values (number of variables randomly sampled as candidates at each split)
mtry_grid <- expand.grid(mtry = c(1, 3, 5, 7, ncol(dt_train) - 1))

# Initialize a list to store models
models_rf <- list()

# Create three random forest models with a different number of trees in each forest
for (t in c(100, 300, 500))
{
  # Print current number of trees in the forest
  print(paste0(t, " trees in the forest."))
  
  # Define the model type
  model_rf <- train(
  Sales ~ .,        
  data = dt_train,        
  method = "rf",  
  tuneGrid = mtry_grid,  
  trControl = trainControl(method = "cv", number = 10),
  ntree = t
  )
  
  # Store the model in the list
  models_rf[[paste0("ntree_", t)]] <- model_rf

  # Show model
  print(model_rf)
  
  print("---------------------------------------------------------------------")
}
## [1] "100 trees in the forest."
## Random Forest 
## 
## 320 samples
##  10 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 288, 287, 288, 289, 289, 287, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    1    0.7629399  0.4757522
##    3    0.8254704  0.6332526
##    5    0.7973332  0.5735819
##    7    0.8069159  0.5964886
##   10    0.8072061  0.5955218
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
## [1] "---------------------------------------------------------------------"
## [1] "300 trees in the forest."
## Random Forest 
## 
## 320 samples
##  10 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 288, 287, 288, 289, 287, 287, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    1    0.7497526  0.4404999
##    3    0.8055413  0.5903836
##    5    0.8085777  0.5966051
##    7    0.8024163  0.5844394
##   10    0.8056421  0.5914451
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.
## [1] "---------------------------------------------------------------------"
## [1] "500 trees in the forest."
## Random Forest 
## 
## 320 samples
##  10 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 288, 288, 288, 288, 288, 288, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    1    0.7499633  0.4428954
##    3    0.8063080  0.5915123
##    5    0.8126588  0.6039848
##    7    0.8093445  0.5951872
##   10    0.8095400  0.5953774
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.
## [1] "---------------------------------------------------------------------"

We immediately see the power of random forests relative to a single decision tree: they deliver roughly a 6 to 8 percentage point increase in cross-validated accuracy.
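
To make this comparison concrete, we can pull the best cross-validated accuracy from each fitted object. A quick sketch using the objects defined above:

# Best CV accuracy: single tree versus each forest
max(model_tree$results$Accuracy)
sapply(models_rf, function(m) max(m$results$Accuracy))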

The best-performing configurations are quite close: the 100-tree forest with mtry = 3 posted the highest CV accuracy (about 0.825), with the 500-tree forest and its best value of mtry = 5 just behind (about 0.813). We will use the 500-tree forest on our testing data set, since averaging over more trees generally gives more stable predictions.

# Access the model with 500 trees
model_rf <- models_rf[["ntree_500"]]

# Make predictions on the testing data
predictions <- predict(model_rf, newdata = dt_test)

# Create confusion matrix
confusion_matrix <- confusionMatrix(predictions, dt_test[, Sales])

# Print the confusion matrix and other performance metrics
print(confusion_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 44  7
##          1  5 24
##                                         
##                Accuracy : 0.85          
##                  95% CI : (0.7526, 0.92)
##     No Information Rate : 0.6125        
##     P-Value [Acc > NIR] : 0.000003137   
##                                         
##                   Kappa : 0.6802        
##                                         
##  Mcnemar's Test P-Value : 0.7728        
##                                         
##             Sensitivity : 0.8980        
##             Specificity : 0.7742        
##          Pos Pred Value : 0.8627        
##          Neg Pred Value : 0.8276        
##              Prevalence : 0.6125        
##          Detection Rate : 0.5500        
##    Detection Prevalence : 0.6375        
##       Balanced Accuracy : 0.8361        
##                                         
##        'Positive' Class : 0             
## 

We see our model has an 85% accuracy rate, which is pretty good and higher than the accuracy predicted by 10-Fold CV! This shows the power of the bagging inherent to random forests, which improves generalization. Also, note that caret treats class 0 as the 'positive' class here: the model misclassifies 7 high-sales (class 1) stores as low-sales and 5 low-sales (class 0) stores as high-sales. Since these counts are fairly balanced, neither class of Sales is being misclassified disproportionately, which is a good thing. Overall, these are great results and show why random forests are my favorite ML algorithm.
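
As a final sanity check, the test-set accuracy reported by confusionMatrix() can be recomputed directly from the predictions:

# Share of test observations classified correctly
mean(predictions == dt_test$Sales)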