Boosting Decision Trees and Variable Importance

Lucy D’Agostino McGowan

Boosting

  • Like bagging, boosting is an approach that can be applied to many statistical learning methods

  • We will discuss how to use boosting for decision trees

Bagging

Recall that bagging involves:

  • resampling from the original training data to make many bootstrapped training data sets
  • fitting a separate decision tree to each bootstrapped training data set
  • combining all trees to make one predictive model
  • ☝️ Note, each tree is built on a bootstrapped dataset, independent of the other trees (see the sketch below)
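A minimal from-scratch sketch of these three steps for regression trees, using rpart to fit the individual trees (the function names bag_trees and bag_predict are illustrative, not from a package):

library(rpart)

# Bagging by hand: bootstrap the rows, fit one tree per bootstrapped
# training set, then average the trees' predictions.
bag_trees <- function(formula, data, B = 100) {
  lapply(seq_len(B), function(b) {
    boot <- data[sample(nrow(data), replace = TRUE), ]  # bootstrapped training set
    rpart(formula, data = boot)                         # a separate tree per sample
  })
}

bag_predict <- function(trees, newdata) {
  # combine all B trees by averaging their predictions
  rowMeans(sapply(trees, predict, newdata = newdata))
}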

Boosting

  • Boosting is similar, except the trees are grown sequentially, using information from the previously grown trees

Boosting algorithm for regression trees

Step 1

  • Set \(\hat{f}(x)= 0\) and \(r_i= y_i\) for all \(i\) in the training set

Boosting algorithm for regression trees

Step 2 For \(b = 1, 2, \dots, B\) repeat:

  • Fit a tree \(\hat{f}^b\) with \(d\) splits (\(d + 1\) terminal nodes) to the training data \((X, r)\)

  • Update \(\hat{f}\) by adding in a shrunken version of the new tree: \(\hat{f}(x)\leftarrow \hat{f}(x)+\lambda \hat{f}^b(x)\)

  • Update the residuals: \(r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)\)

Boosting algorithm for regression trees

Step 3

  • Output the boosted model \(\hat{f}(x)=\sum_{b = 1}^B\lambda\hat{f}^b(x)\)

Big picture

  • Given the current model, we are fitting a decision tree to the residuals

  • We then add this new decision tree into the fitted function to update the residuals

  • Each of these trees can be small (just a few terminal nodes), determined by \(d\)

  • Instead of fitting a single large decision tree, which could result in overfitting, boosting learns slowly

Big picture

  • By fitting small trees to the residuals we slowly improve \(\hat{f}\) in areas where it does not perform well

  • The shrinkage parameter \(\lambda\) slows the process down even further, allowing more trees, and trees of different shapes, to attack the residuals (see the sketch below)
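To make Steps 1–3 concrete, here is a minimal from-scratch sketch of the boosting algorithm for regression trees, again using rpart for the individual trees (boost_fit, boost_predict, and the use of maxdepth to keep each tree small are illustrative choices, not a package API):

library(rpart)

# Boosting by hand, following Steps 1-3 above.
boost_fit <- function(data, outcome, B = 1000, lambda = 0.01, d = 1) {
  r <- data[[outcome]]                               # Step 1: residuals start as y
  x <- data[setdiff(names(data), outcome)]
  trees <- vector("list", B)
  for (b in seq_len(B)) {                            # Step 2: repeat B times
    fit_dat <- cbind(x, r = r)
    trees[[b]] <- rpart(                             # fit a small tree to (X, r)
      r ~ ., data = fit_dat,
      control = rpart.control(maxdepth = d, cp = 0)  # d = 1 gives a stump
    )
    r <- r - lambda * predict(trees[[b]], fit_dat)   # update the residuals
  }
  trees
}

# Step 3: the boosted prediction is the shrunken sum over all B trees
boost_predict <- function(trees, newdata, lambda = 0.01) {
  Reduce(`+`, lapply(trees, function(t) lambda * predict(t, newdata)))
}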

Boosting for classification

Tuning parameters

With bagging, what could we tune?

  • \(B\), the number of bootstrapped training samples, which is also the number of decision trees fit (trees in the code)

  • It is more efficient to just pick something very large instead of tuning this

  • For \(B\), you don’t really risk overfitting if you pick something too big

Tuning parameters

With random forests, what could we tune?

  • The depth of the trees, \(B\), and \(m\), the number of predictors to try at each split (mtry in the code)

  • The default for \(m\) is \(\sqrt{p}\), and this does pretty well

Tuning parameters for boosting

  • \(B\), the number of trees
  • \(\lambda\), the shrinkage parameter
  • \(d\), the number of splits in each tree

Tuning parameters for boosting

What do you think you can use to pick \(B\)?

  • Unlike bagging and random forests, with boosting you can overfit if \(B\) is too large

  • Cross-validation, of course!

Tuning parameters for boosting

  • The shrinkage parameter \(\lambda\) controls the rate at which boosting learns

  • \(\lambda\) is a small, positive number, typically 0.01 or 0.001

  • It depends on the problem, but typically a very small \(\lambda\) can require a very large \(B\) for good performance

Tuning parameters for boosting

  • The number of splits, \(d\), in each tree controls the complexity of the boosted ensemble

  • Often \(d=1\) is a good default

  • brace yourself for another tree pun!

  • In this case we call the tree a stump, meaning it has just a single split

  • This results in an additive model

  • You can think of \(d\) as the interaction depth: it controls the interaction order of the boosted model, since \(d\) splits can involve at most \(d\) variables

Boosted trees in R

boost_spec <- boost_tree(
  mode = "classification",
  tree_depth = 1,
  trees = 1000,
  learn_rate = 0.001
) |>
  set_engine("xgboost")
  • Set the mode as you would with a bagged tree or random forest
  • tree_depth here is the depth of each tree; let's set that to 1
  • trees is the number of trees that are fit; this is equivalent to \(B\)
  • learn_rate is \(\lambda\)

Make a recipe

rec <- recipe(HD ~ Age + Sex + ChestPain + RestBP + Chol + Fbs + 
             RestECG + MaxHR + ExAng + Oldpeak + Slope + Ca + Thal,           
           data = heart) |>
  step_dummy(all_nominal_predictors())  
  • xgboost needs all predictors to be numeric, which means we need to make dummy variables
  • because HD (the outcome) is also categorical, we can use all_nominal_predictors() to make sure we don't turn the outcome into dummy variables as well (see the check below)
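To check what the recipe will do, you can prep() it and bake() it on the training data; the nominal predictors come back as 0/1 dummy columns while HD is left as-is (a quick sanity check, assuming the rec defined above):

rec |>
  prep() |>
  bake(new_data = NULL)   # returns the processed training data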

Fit the model

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(boost_spec)
model <- fit(wf, data = heart)
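Once the workflow is fit, you can generate predictions the same way as for other tidymodels fits; for example (predicting on the training data here just to illustrate):

predict(model, new_data = heart)                 # predicted classes (.pred_class)
predict(model, new_data = heart, type = "prob")  # predicted class probabilities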

Boosting

How would this code change if I wanted to tune \(B\), the number of trees?

boost_spec <- boost_tree(
  mode = "classification",
  tree_depth = 1,
  trees = 1000,
  learn_rate = 0.001
) |>
  set_engine("xgboost")

Boosting

Fit a boosted model to the data from the previous application exercise.

Variable Importance

Variable importance

  • For bagged or random forest regression trees, we can record the total amount that the RSS is decreased due to splits over a given predictor, \(X_j\), averaged over all \(B\) trees

  • A large value indicates that the variable is important

Variable importance

  • For bagged or random forest classification trees, we can add up the total amount that the Gini index is decreased by splits over a given predictor, \(X_j\), averaged over all \(B\) trees

Variable importance in R

rf_spec <- rand_forest(
  mode = "classification",
  mtry = 3
) |> 
  set_engine(
    "ranger",
    importance = "impurity") 

wf <- workflow() |>
  add_recipe(
    recipe(HD ~ Age + Sex + ChestPain + RestBP + Chol + Fbs + 
             RestECG + MaxHR + ExAng + Oldpeak + Slope + Ca + Thal,               
           data = heart)
  ) |>
  add_model(rf_spec)
model <- fit(wf, data = heart)
ranger::importance(model$fit$fit$fit)
       Age        Sex  ChestPain     RestBP       Chol        Fbs    RestECG 
 8.4105385  3.9179814 16.2075537  7.8479381  7.0121649  0.8112367  1.5944339 
     MaxHR      ExAng    Oldpeak      Slope         Ca       Thal 
13.7292165  6.8135135 13.0718556  5.9581823 16.6729564 14.7206036 

Variable importance

library(ranger)
importance(model$fit$fit$fit)
var_imp <- ranger::importance(model$fit$fit$fit)

Plotting variable importance

var_imp_df <- data.frame(
  variable = names(var_imp),
  importance = var_imp
)

var_imp_df |>
  ggplot(aes(x = variable, y = importance)) +
  geom_col()

How could we make this plot better?

Plotting variable importance

var_imp_df |>
  ggplot(aes(x = variable, y = importance)) +
  geom_col() + 
  coord_flip()

How could we make this plot better?

Plotting variable importance

var_imp_df |>
  mutate(variable = factor(variable, 
                           levels = variable[order(var_imp_df$importance)])) |>
  ggplot(aes(x = variable, y = importance)) +
  geom_col() + 
  coord_flip()
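An equivalent, slightly more idiomatic way to get the same ordering is forcats::fct_reorder(), which reorders the variable by its importance values directly:

var_imp_df |>
  mutate(variable = forcats::fct_reorder(variable, importance)) |>
  ggplot(aes(x = variable, y = importance)) +
  geom_col() + 
  coord_flip()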