The preliminaries below are fairly standard, except we now load the randomForest package, which allows us to estimate random forest models. We also load the caret package, which contains a variety of powerful functions for ML.
# Clear environment, plot pane, and console
rm(list = ls())
graphics.off()
cat("\014")
# If pacman is not already installed, then install it
if (!require(pacman)) install.packages("pacman")
# Load packages
pacman::p_load(ISLR2, caret, randomForest, rpart.plot, data.table, ggplot2)
# Colors used for plot (https://www.rapidtables.com/web/color/RGB_Color.html)
color_1 <- "#00FF00"
color_2 <- "#ADD8E6"
# Set random seed for reproducibility
set.seed(418518)
# Set scipen option to a high value to avoid scientific notation
options(scipen = 999)
ISLR2 Dataset
We first use classification trees to analyze the Carseats data set, a simulated data set containing sales of child car seats at 400 different stores. In these data, Sales is a continuous variable, so we begin by recoding it as a binary variable using the ifelse() function: if Sales exceeds 8, we set the new variable to one, and to zero otherwise. You'll see we wrap this in the as.factor() function so R knows our outcome variable is categorical.
# Read in data
dt <- data.table(ISLR2::Carseats)
# Convert Sales variable to a binary indicator
dt[, Sales := as.factor(ifelse(Sales > 8, 1, 0))]
# Show the first few rows of the data
head(dt)
# Show the last few rows of the data
tail(dt)
# Get the names of the variables in our dataset
names(dt)
## [1] "Sales" "CompPrice" "Income" "Advertising" "Population"
## [6] "Price" "ShelveLoc" "Age" "Education" "Urban"
## [11] "US"
# Get summary statistics
summary(dt)
## Sales CompPrice Income Advertising Population
## 0:236 Min. : 77 Min. : 21.00 Min. : 0.000 Min. : 10.0
## 1:164 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000 1st Qu.:139.0
## Median :125 Median : 69.00 Median : 5.000 Median :272.0
## Mean :125 Mean : 68.66 Mean : 6.635 Mean :264.8
## 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000 3rd Qu.:398.5
## Max. :175 Max. :120.00 Max. :29.000 Max. :509.0
## Price ShelveLoc Age Education Urban
## Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.0 No :118
## 1st Qu.:100.0 Good : 85 1st Qu.:39.75 1st Qu.:12.0 Yes:282
## Median :117.0 Medium:219 Median :54.50 Median :14.0
## Mean :115.8 Mean :53.32 Mean :13.9
## 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :191.0 Max. :80.00 Max. :18.0
## US
## No :142
## Yes:258
##
##
##
##
# Get the number of observations
paste0("There are ", nrow(dt), " observations in dt.")
## [1] "There are 400 observations in dt."
# Get the number of variables in the dataset
paste0("There are ", ncol(dt), " variables in dt.")
## [1] "There are 11 variables in dt."
# Get the number of NAs in the dataset
paste0("There are ", sum(is.na(dt)), " NAs in dt.")
## [1] "There are 0 NAs in dt."
Remember, trees and forests are among the few ML algorithms that do not require feature normalization, so we can skip that step in our data preprocessing.
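As an aside, if we were instead using a scale-sensitive learner (say, KNN or an SVM), the normalization step we are skipping would look roughly like the sketch below, built on caret's preProcess(). The object names (num_cols, norm_params, dt_scaled_numeric) are illustrative, and nothing created here is used later in this walkthrough.
# For comparison only: center and scale the numeric predictors, as we would
# for a scale-sensitive algorithm (trees and forests do not need this)
num_cols <- c("CompPrice", "Income", "Advertising", "Population",
              "Price", "Age", "Education")
norm_params <- preProcess(dt[, ..num_cols], method = c("center", "scale"))
dt_scaled_numeric <- predict(norm_params, dt[, ..num_cols])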
Now, we will do an 80-20 training-testing split.
# Training and test set sizes
train_size <- floor(0.8 * nrow(dt))
test_size <- floor(0.2 * nrow(dt))
# Shuffle the data
shuffled_indices <- sample(nrow(dt))
# Split the indices
train_indices <- shuffled_indices[1:train_size]
test_indices <- shuffled_indices[(train_size + 1):nrow(dt)]
# Create the datasets
dt_train <- dt[train_indices, ]
dt_test <- dt[test_indices, ]
# Show first few rows of training and testing sets
head(dt_train)
head(dt_test)
You'll notice some categorical variables in our data table whose values are stored as words rather than numbers. R's modeling functions handle this for us when estimating the models, so we don't have to convert them to numeric values beforehand.
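As a quick sanity check, you can ask R which columns are already stored as factors. The snippet below is illustrative only; it should show that our recoded Sales plus ShelveLoc, Urban, and US are factors while the remaining predictors are numeric, and train() takes care of encoding the factor predictors when we fit the models below.
# Check which columns are stored as factors (categorical) versus numeric
sapply(dt_train, class)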
Let us now use a single tree to try to classify our outcome variable using all the covariates in our data table. We'll use the train() function in the caret package to estimate this classification tree model. The syntax will be fairly standard, except we will now use two new things. First, the method = "rpart" argument within the train() function tells it to use the CART algorithm for building the decision tree. Second, the alpha_grid <- expand.grid(cp = seq(0.001, 0.1, length = 50)) command tells R to create a grid of 50 cost-complexity tuning parameters evenly spaced between \(0.001\) and \(0.1\). Recall that a higher value of the cost-complexity tuning parameter means the tree will make fewer splits in the hope of better generalization.
# Define a grid of cp values
alpha_grid <- expand.grid(cp = seq(0.001, 0.1, length = 50))
# Define the model type
model_tree <- train(
Sales ~ .,
data = dt_train,
method = "rpart",
tuneGrid = alpha_grid,
trControl = trainControl(method = "cv", number = 10)
)
# Show model
model_tree
## CART
##
## 320 samples
## 10 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 288, 288, 289, 287, 288, 288, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.001000000 0.7376802 0.4675273
## 0.003020408 0.7376802 0.4675273
## 0.005040816 0.7344544 0.4598905
## 0.007061224 0.7344544 0.4598905
## 0.009081633 0.7439241 0.4740864
## 0.011102041 0.7376741 0.4601633
## 0.013122449 0.7314180 0.4452576
## 0.015142857 0.7314180 0.4438589
## 0.017163265 0.7314180 0.4438589
## 0.019183673 0.7314180 0.4438589
## 0.021204082 0.7375733 0.4534793
## 0.023224490 0.7375733 0.4534793
## 0.025244898 0.7345491 0.4458989
## 0.027265306 0.7345491 0.4458989
## 0.029285714 0.7345491 0.4458989
## 0.031306122 0.7345491 0.4458989
## 0.033326531 0.7345491 0.4458989
## 0.035346939 0.7312408 0.4450940
## 0.037367347 0.7312408 0.4450940
## 0.039387755 0.7312408 0.4450940
## 0.041408163 0.7312408 0.4450940
## 0.043428571 0.7341764 0.4544241
## 0.045448980 0.7341764 0.4544241
## 0.047469388 0.7310514 0.4361451
## 0.049489796 0.7310514 0.4361451
## 0.051510204 0.7311461 0.4334246
## 0.053530612 0.7311461 0.4334246
## 0.055551020 0.7311461 0.4334246
## 0.057571429 0.7311461 0.4334246
## 0.059591837 0.7407228 0.4533073
## 0.061612245 0.7407228 0.4533073
## 0.063632653 0.7407228 0.4533073
## 0.065653061 0.7407228 0.4533073
## 0.067673469 0.7407228 0.4533073
## 0.069693878 0.7407228 0.4533073
## 0.071714286 0.7407228 0.4533073
## 0.073734694 0.7407228 0.4533073
## 0.075755102 0.7407228 0.4533073
## 0.077775510 0.7407228 0.4533073
## 0.079795918 0.7407228 0.4533073
## 0.081816327 0.7407228 0.4533073
## 0.083836735 0.7407228 0.4533073
## 0.085857143 0.7407228 0.4533073
## 0.087877551 0.7407228 0.4533073
## 0.089897959 0.7407228 0.4533073
## 0.091918367 0.7407228 0.4533073
## 0.093938776 0.7286015 0.4241021
## 0.095959184 0.7286015 0.4241021
## 0.097979592 0.7286015 0.4241021
## 0.100000000 0.7098515 0.3780069
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.009081633.
You can see the chosen cost-complexity tuning parameter based on 10-Fold CV is \(\alpha = 0.0091\). The Kappa statistic measures the agreement between observed and predicted classifications, adjusted for the agreement that could happen by chance. It ranges from \(-1\) to \(1\): a value of \(1\) indicates perfect agreement, \(0\) indicates agreement no better than chance, and negative values indicate agreement worse than chance.
A higher Kappa value indicates better performance of the classification model, as it suggests that the model is making more accurate predictions than would be expected by random guessing.
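To make the definition concrete, Kappa is \(\kappa = (p_o - p_e) / (1 - p_e)\), where \(p_o\) is the observed agreement rate and \(p_e\) is the agreement rate expected by chance given the marginal totals. Here is a hand computation on a small made-up confusion matrix; the numbers are purely illustrative and do not come from our model.
# Cohen's Kappa by hand on a made-up 2x2 confusion matrix (illustrative only)
tab <- matrix(c(40, 10, 5, 45), nrow = 2,
              dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))
p_o <- sum(diag(tab)) / sum(tab)                       # observed agreement
p_e <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2   # chance agreement
(p_o - p_e) / (1 - p_e)                                # Kappa = 0.7 here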
Let us plot the structure of the tree using the rpart.plot() function contained in the rpart.plot package.
# Extract the underlying rpart object from the caret model
rpart_model <- model_tree$finalModel
# Plot the tree
rpart.plot::rpart.plot(rpart_model, type = 3, extra = 101, fallen.leaves = TRUE)
This is what our tree looks like! In the far-left leaf node we see \(0\) at the top, \(76 \ 3\) underneath it, and \(25\%\) at the bottom. This means every observation in that leaf is predicted (classified) as class \(0\); \(76\) of those observations are truly class zero, while the other \(3\) are truly class one and hence misclassified; and the observations landing in this leaf account for \(25\%\) of the training data.
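If you would rather see those per-node numbers as data than read them off the plot, they live in the rpart object's frame. The sketch below just peeks at a few of its columns: var is the splitting variable (with <leaf> marking terminal nodes), n is the number of training observations in the node, and yval is the predicted class stored as a factor-level index.
# Peek at the per-node information behind the plot
head(rpart_model$frame[, c("var", "n", "yval")])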
Now, we will plot the accuracy rate against the value of \(\alpha\) to visualize our findings.
# Extract the results from the model
results <- model_tree$results
# Plot the effect of cp on accuracy
ggplot(results, aes(x = cp, y = Accuracy)) +
geom_line(color = color_2, size = 1) +
geom_point(color = color_1, size = 3) +
labs(title = expression("Effect of" ~ alpha ~ "on Accuracy"),
x = expression(alpha),
y = "Accuracy") +
theme(
panel.background = element_rect(fill = "black", color = "black"),
plot.background = element_rect(fill = "black", color = "black"),
panel.grid.major = element_line(color = "gray"),
panel.grid.minor = element_line(color = "gray"),
axis.text = element_text(color = color_1, size = 15),
axis.title = element_text(color = color_2, size = 25),
plot.title = element_text(hjust = 0.5, color = color_2, size = 30, face = "bold")
)
We see the results are fairly stable, except that when we increase the complexity parameter \(\alpha\) too much, we begin to increase the bias of our tree model (underfitting).
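Rather than reading the selected \(\alpha\) off the printout, you can also pull it, along with the cross-validated accuracy it achieved, straight from the fitted caret object. A quick sketch:
# Selected cost-complexity parameter and its cross-validated accuracy
model_tree$bestTune
max(model_tree$results$Accuracy)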
Let us now build three different random forest models: one with one-hundred trees in the forest, another with three-hundred trees, and a final one with five-hundred trees. To do so, we will use the ntree = t argument, which tells the train() function to use t trees in the forest.
Recall that random forests only consider a random subset of the candidate variables at each split within each tree. Below you will see the command mtry_grid <- expand.grid(mtry = c(1, 3, 5, 7, ncol(dt_train) - 1)), which gives a grid of possible values for the number of variables considered at each split. For instance, when we build a random forest with mtry = 1, each split in every tree considers only one randomly chosen variable, and that variable can differ from split to split and from tree to tree.
# Define the grid of mtry values (number of variables randomly sampled as candidates at each split)
mtry_grid <- expand.grid(mtry = c(1, 3, 5, 7, ncol(dt_train) - 1))
# Initialize a list to store models
models_rf <- list()
# Create three random forest models with a different number of trees in each forest
for (t in c(100, 300, 500))
{
# Print current number of trees in the forest
print(paste0(t, " trees in the forest."))
# Define the model type
model_rf <- train(
Sales ~ .,
data = dt_train,
method = "rf",
tuneGrid = mtry_grid,
trControl = trainControl(method = "cv", number = 10),
ntree = t
)
# Store the model in the list
models_rf[[paste0("ntree_", t)]] <- model_rf
# Show model
print(model_rf)
print("---------------------------------------------------------------------")
}
## [1] "100 trees in the forest."
## Random Forest
##
## 320 samples
## 10 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 288, 287, 288, 289, 289, 287, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.7629399 0.4757522
## 3 0.8254704 0.6332526
## 5 0.7973332 0.5735819
## 7 0.8069159 0.5964886
## 10 0.8072061 0.5955218
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
## [1] "---------------------------------------------------------------------"
## [1] "300 trees in the forest."
## Random Forest
##
## 320 samples
## 10 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 288, 287, 288, 289, 287, 287, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.7497526 0.4404999
## 3 0.8055413 0.5903836
## 5 0.8085777 0.5966051
## 7 0.8024163 0.5844394
## 10 0.8056421 0.5914451
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.
## [1] "---------------------------------------------------------------------"
## [1] "500 trees in the forest."
## Random Forest
##
## 320 samples
## 10 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 288, 288, 288, 288, 288, 288, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.7499633 0.4428954
## 3 0.8063080 0.5915123
## 5 0.8126588 0.6039848
## 7 0.8093445 0.5951872
## 10 0.8095400 0.5953774
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.
## [1] "---------------------------------------------------------------------"
We immediately see the power of random forests relative to a single decision tree: they deliver roughly a 6 to 8 percentage point increase in cross-validated accuracy.
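If you prefer to compare the three forests programmatically rather than by scanning the printouts, caret's resamples() helper collects their cross-validation results side by side. The sketch below assumes the models_rf list built above.
# Collect and summarize the CV results of the three forests side by side
rf_comparison <- resamples(models_rf)
summary(rf_comparison)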
Looking across the cross-validation results, the 100-tree forest with mtry = 3 actually posts the highest accuracy, though the best results from the three forests are within a couple of percentage points of one another. We will carry the forest with five-hundred trees and mtry = 5 splitting candidates forward and evaluate it on our testing data set.
# Access the model with 500 trees
model_rf <- models_rf[["ntree_500"]]
# Make predictions on the testing data
predictions <- predict(model_rf, newdata = dt_test)
# Create confusion matrix
confusion_matrix <- confusionMatrix(predictions, dt_test[, Sales])
# Print the confusion matrix and other performance metrics
print(confusion_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 44 7
## 1 5 24
##
## Accuracy : 0.85
## 95% CI : (0.7526, 0.92)
## No Information Rate : 0.6125
## P-Value [Acc > NIR] : 0.000003137
##
## Kappa : 0.6802
##
## Mcnemar's Test P-Value : 0.7728
##
## Sensitivity : 0.8980
## Specificity : 0.7742
## Pos Pred Value : 0.8627
## Neg Pred Value : 0.8276
## Prevalence : 0.6125
## Detection Rate : 0.5500
## Detection Prevalence : 0.6375
## Balanced Accuracy : 0.8361
##
## 'Positive' Class : 0
##
We see our model has an 85% accuracy rate, which is quite good and higher than the accuracy estimated by 10-Fold CV! This shows the power of the bagging inherent to random forests, which improves generalization. Also, treating Sales = 1 as the positive outcome, the model generates five false positives (predicted one, actually zero) and seven false negatives (predicted zero, actually one). Since these counts are fairly balanced, neither class of Sales is being misclassified disproportionately, which is a good thing. Overall, these are great results and show why random forests are my favorite ML algorithm.
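Finally, if you need individual numbers from this output in later code rather than reading them off the printed block, the confusionMatrix object stores them in named components. A quick sketch:
# Pull individual pieces out of the confusionMatrix object
confusion_matrix$table                                    # the 2x2 table
confusion_matrix$overall[c("Accuracy", "Kappa")]          # overall metrics
confusion_matrix$byClass[c("Sensitivity", "Specificity")] # class-level metrics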