Lucy D’Agostino McGowan

Like **bagging**, **boosting** is an approach that can be applied to many statistical learning methods. We will discuss how to use **boosting** for decision trees.

Recall that **bagging** involves:

- resampling from the original training data to make many bootstrapped training data sets
- fitting a separate decision tree to each bootstrapped training data set
- combining all trees to make one predictive model
- ☝️ Note, each tree is built on a bootstrap dataset, independent of the other trees
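
To make the recap concrete, here is a minimal sketch of bagging for a regression tree in base R; the data frames `train` and `test` and the outcome `y` are illustrative names, not from the slides:

```
library(rpart)

B <- 100
preds <- matrix(NA, nrow = nrow(test), ncol = B)

for (b in 1:B) {
  # resample the original training data with replacement
  boot <- train[sample(nrow(train), replace = TRUE), ]
  # fit a separate tree to this bootstrapped training set
  tree <- rpart(y ~ ., data = boot)
  preds[, b] <- predict(tree, newdata = test)
}

# combine all B trees by averaging their predictions
bagged_pred <- rowMeans(preds)
```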

**Boosting** is similar, except the trees are grown *sequentially*, using information from the previously grown trees.

- Set \(\hat{f}(x) = 0\) and \(r_i = y_i\) for all \(i\) in the training set
- For \(b = 1, 2, \dots, B\), repeat:
  - Fit a tree \(\hat{f}^b\) with \(d\) splits (\(d + 1\) terminal nodes) to the training data \((X, r)\)
  - Update \(\hat{f}\) by adding in a shrunken version of the new tree: \(\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)\)
  - Update the residuals: \(r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)\)
- Output the boosted model: \(\hat{f}(x) = \sum_{b = 1}^B \lambda \hat{f}^b(x)\)
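
A minimal sketch of this algorithm in base R for a regression problem; the data frame `train` and outcome `y` are illustrative names, not from the slides:

```
library(rpart)

B <- 1000        # number of trees
lambda <- 0.01   # shrinkage parameter
d <- 1           # splits per tree (stumps)

f_hat <- rep(0, nrow(train))  # f_hat(x) = 0
r <- train$y                  # r_i = y_i

pred_df <- train[, setdiff(names(train), "y"), drop = FALSE]

for (b in 1:B) {
  # fit a small tree to the current residuals
  fit_b <- rpart(r ~ ., data = cbind(pred_df, r = r),
                 control = rpart.control(maxdepth = d, cp = 0))
  # add a shrunken version of the new tree, then update the residuals
  update <- lambda * predict(fit_b, newdata = pred_df)
  f_hat <- f_hat + update
  r <- r - update
}
```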

Given the current model, we are fitting a decision tree to the *residuals*. We then add this new decision tree into the fitted function to update the residuals.

Each of these trees can be small (just a few terminal nodes), determined by \(d\)

Instead of fitting a single large decision tree, which could result in overfitting, boosting *learns slowly*.

By fitting small trees to the *residuals*, we *slowly* improve \(\hat{f}\) in areas where it does not perform well. The shrinkage parameter \(\lambda\) slows the process down even more, allowing more, and differently shaped, trees to try to minimize those residuals.

Boosting for classification is similar, but a bit more complex. `tidymodels` will handle this for us, but if you are interested in learning more, you can check out Chapter 10 of *The Elements of Statistical Learning*.

With **bagging**, what could we tune?

\(B\), the number of bootstrapped training samples (the number of decision trees fit; `trees`). It is more efficient to just pick something very large instead of tuning this. For \(B\), you don't really risk overfitting if you pick something too big.

With **random forest**, what could we tune?

The depth of the tree, \(B\), and `m`, the number of predictors to try (`mtry`). The default is \(\sqrt{p}\), and this does pretty well.
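
A small sketch of setting these by hand in `tidymodels` (assuming \(p = 13\) predictors, as in the heart data used later):

```
library(tidymodels)

rf_spec <- rand_forest(
  mode = "classification",
  mtry = floor(sqrt(13)),  # m, roughly sqrt(p) predictors tried at each split
  trees = 1000             # B; picking something large is fine here
) |>
  set_engine("ranger")
```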

With **boosting**, what could we tune?

- \(B\), the number of trees
- \(\lambda\), the shrinkage parameter
- \(d\), the number of splits in each tree

What do you think you can use to pick \(B\)?

Cross-validation, of course! Unlike **bagging** and **random forests**, with **boosting** you can overfit if \(B\) is too large.
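
A hedged sketch of what that cross-validation could look like in `tidymodels` (the `boost_tree()` specification is walked through in more detail below; the `heart` data and `HD` outcome are from the later slides, and the candidate values of `trees` are illustrative):

```
library(tidymodels)

folds <- vfold_cv(heart, v = 5)

boost_wf <- workflow() |>
  add_recipe(recipe(HD ~ ., data = heart) |>
               step_dummy(all_nominal_predictors())) |>
  add_model(boost_tree(mode = "classification", trees = tune(),
                       tree_depth = 1, learn_rate = 0.01) |>
              set_engine("xgboost"))

# try several values of B and compare cross-validated performance
cv_results <- tune_grid(boost_wf, resamples = folds,
                        grid = tibble(trees = c(100, 500, 1000, 2000)))
show_best(cv_results, metric = "roc_auc")
```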

The *shrinkage parameter* \(\lambda\) controls the rate at which boosting learns. \(\lambda\) is a small, positive number, typically 0.01 or 0.001.

It depends on the problem, but typically a very small \(\lambda\) can require a very large \(B\) for good performance

*The number of splits*, \(d\), in each tree controls the *complexity* of the boosted ensemble. Often \(d = 1\) is a good default (*brace yourself for another tree pun!*). In this case we call the tree a *stump*, meaning it just has a single split. This results in an *additive model*. You can think of \(d\) as the *interaction depth*: it controls the interaction order of the boosted model, since \(d\) splits can involve at most \(d\) variables.

- Set the `mode` as you would with a bagged tree or random forest
- `tree_depth` here is the depth of each tree; let's set that to 1
- `trees` is the number of trees that are fit; this is equivalent to \(B\)
- `learn_rate` is \(\lambda\)
- `xgboost` wants you to have all numeric data, so we need to make dummy variables
- Because `HD` (the outcome) is also categorical, we can use `all_nominal_predictors()` to make sure we don't turn the outcome into dummy variables as well
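
Putting those pieces together, a minimal sketch of what the specification could look like (the particular values of `trees` and `learn_rate` are illustrative assumptions, not from the slides):

```
library(tidymodels)

boost_spec <- boost_tree(
  mode = "classification",
  tree_depth = 1,      # d, the depth of each tree (stumps)
  trees = 1000,        # B, the number of trees
  learn_rate = 0.01    # lambda, the shrinkage parameter
) |>
  set_engine("xgboost")

boost_rec <- recipe(HD ~ ., data = heart) |>
  step_dummy(all_nominal_predictors())  # xgboost needs numeric predictors

boost_wf <- workflow() |>
  add_recipe(boost_rec) |>
  add_model(boost_spec)

boost_model <- fit(boost_wf, data = heart)
```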

**Boosting**

How would this code change if I wanted to tune `B`, the number of trees fit?


**Boosting**

Fit a **boosted model** to the data from the previous application exercise.

- For bagged or random forest *regression trees*, we can record the *total RSS* that is decreased due to splits of a given predictor, \(X_i\), averaged over all \(B\) trees. A large value indicates that the variable is *important*
- For bagged or random forest *classification trees*, we can add up the total amount that the Gini index is decreased by splits of a given predictor, \(X_i\), averaged over \(B\) trees

```
library(tidymodels)

# ranger records impurity-based (Gini) variable importance if we ask for it
rf_spec <- rand_forest(
  mode = "classification",
  mtry = 3
) |>
  set_engine("ranger", importance = "impurity")

wf <- workflow() |>
  add_recipe(
    recipe(HD ~ Age + Sex + ChestPain + RestBP + Chol + Fbs +
             RestECG + MaxHR + ExAng + Oldpeak + Slope + Ca + Thal,
           data = heart)
  ) |>
  add_model(rf_spec)

model <- fit(wf, data = heart)
```
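
The importance scores below live in the underlying `ranger` fit; one way to pull them out (the extraction step is not shown on the original slide, so this is an assumption):

```
model |>
  extract_fit_engine() |>   # the underlying ranger object
  ranger::importance()      # named vector of impurity-based importance
```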

```
Age Sex ChestPain RestBP Chol Fbs RestECG
8.4105385 3.9179814 16.2075537 7.8479381 7.0121649 0.8112367 1.5944339
MaxHR ExAng Oldpeak Slope Ca Thal
13.7292165 6.8135135 13.0718556 5.9581823 16.6729564 14.7206036
```
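
The plotting code below expects a data frame called `var_imp_df` with `variable` and `importance` columns; a hedged sketch of how it could be built, along with a first basic plot:

```
var_imp_df <- model |>
  extract_fit_engine() |>
  ranger::importance() |>
  enframe(name = "variable", value = "importance")  # two-column tibble

# a first pass: bars in the default (alphabetical) order
ggplot(var_imp_df, aes(x = variable, y = importance)) +
  geom_col()
```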

How could we make this plot better?

```
# reorder the bars by importance and flip the axes so the labels are readable
var_imp_df |>
  mutate(variable = factor(variable,
                           levels = variable[order(var_imp_df$importance)])) |>
  ggplot(aes(x = variable, y = importance)) +
  geom_col() +
  coord_flip()
```

Dr. Lucy D’Agostino McGowan *adapted from slides by Hastie & Tibshirani*