Lucy D’Agostino McGowan

Like **bagging**, **boosting** is an approach that can be applied to many statistical learning methods. We will discuss how to use **boosting** for decision trees.

Recall that **bagging** involves:

- resampling from the original training data to make many bootstrapped training data sets
- fitting a separate decision tree to each bootstrapped training data set
- combining all trees to make one predictive model
- ☝️ Note, each tree is built on a bootstrap dataset, independent of the other trees
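
To make the recap concrete, here is a minimal sketch of bagging for a regression tree in base R; the data frames `train` and `test` and the outcome `y` are illustrative names, not from the slides:

```
library(rpart)

B <- 100
preds <- matrix(NA, nrow = nrow(test), ncol = B)

for (b in 1:B) {
  # resample the original training data with replacement
  boot <- train[sample(nrow(train), replace = TRUE), ]
  # fit a separate tree to this bootstrapped training set
  tree <- rpart(y ~ ., data = boot)
  preds[, b] <- predict(tree, newdata = test)
}

# combine all B trees by averaging their predictions
bagged_pred <- rowMeans(preds)
```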

**Boosting** is similar, except the trees are grown *sequentially*, using information from the previously grown trees.

- Set \(\hat{f}(x) = 0\) and \(r_i = y_i\) for all \(i\) in the training set
- For \(b = 1, 2, \dots, B\), repeat:
  - Fit a tree \(\hat{f}^b\) with \(d\) splits (\(d + 1\) terminal nodes) to the training data \((X, r)\)
  - Update \(\hat{f}\) by adding in a shrunken version of the new tree: \(\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)\)
  - Update the residuals: \(r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)\)
- Output the boosted model: \(\hat{f}(x) = \sum_{b = 1}^B \lambda \hat{f}^b(x)\)
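
A minimal sketch of this algorithm in base R for a regression problem; the data frame `train` and outcome `y` are illustrative names, not from the slides:

```
library(rpart)

B <- 1000        # number of trees
lambda <- 0.01   # shrinkage parameter
d <- 1           # splits per tree (stumps)

f_hat <- rep(0, nrow(train))  # f_hat(x) = 0
r <- train$y                  # r_i = y_i

pred_df <- train[, setdiff(names(train), "y"), drop = FALSE]

for (b in 1:B) {
  # fit a small tree to the current residuals
  fit_b <- rpart(r ~ ., data = cbind(pred_df, r = r),
                 control = rpart.control(maxdepth = d, cp = 0))
  # add a shrunken version of the new tree, then update the residuals
  update <- lambda * predict(fit_b, newdata = pred_df)
  f_hat <- f_hat + update
  r <- r - update
}
```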

Given the current model, we are fitting a decision tree to the *residuals*. We then add this new decision tree into the fitted function to update the residuals.

Each of these trees can be small (just a few terminal nodes), determined by \(d\)

Instead of fitting a single large decision tree, which could result in overfitting, boosting *learns slowly*.

By fitting small trees to the *residuals*, we *slowly* improve \(\hat{f}\) in areas where it does not perform well. The shrinkage parameter \(\lambda\) slows the process down even more, allowing more, and differently shaped, trees to try to minimize those residuals.

Boosting for classification is similar, but a bit more complex. `tidymodels` will handle this for us, but if you are interested in learning more, you can check out Chapter 10 of *The Elements of Statistical Learning*.

With **bagging**, what could we tune?

\(B\), the number of bootstrapped training samples (the number of decision trees fit; `trees`). It is more efficient to just pick something very large instead of tuning this. For \(B\), you don't really risk overfitting if you pick something too big.

With **random forest**, what could we tune?

The depth of the tree, \(B\), and `m`, the number of predictors to try (`mtry`). The default is \(\sqrt{p}\), and this does pretty well.
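
A small sketch of setting these by hand in `tidymodels` (assuming \(p = 13\) predictors, as in the heart data used later):

```
library(tidymodels)

rf_spec <- rand_forest(
  mode = "classification",
  mtry = floor(sqrt(13)),  # m, roughly sqrt(p) predictors tried at each split
  trees = 1000             # B; picking something large is fine here
) |>
  set_engine("ranger")
```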

With **boosting**, what could we tune?

- \(B\), the number of trees
- \(\lambda\), the shrinkage parameter
- \(d\), the number of splits in each tree

What do you think you can use to pick \(B\)?

Cross-validation, of course! Unlike **bagging** and **random forests**, with **boosting** you can overfit if \(B\) is too large.
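
A hedged sketch of what that cross-validation could look like in `tidymodels` (the `boost_tree()` specification is walked through in more detail below; the `heart` data and `HD` outcome are from the later slides, and the candidate values of `trees` are illustrative):

```
library(tidymodels)

folds <- vfold_cv(heart, v = 5)

boost_wf <- workflow() |>
  add_recipe(recipe(HD ~ ., data = heart) |>
               step_dummy(all_nominal_predictors())) |>
  add_model(boost_tree(mode = "classification", trees = tune(),
                       tree_depth = 1, learn_rate = 0.01) |>
              set_engine("xgboost"))

# try several values of B and compare cross-validated performance
cv_results <- tune_grid(boost_wf, resamples = folds,
                        grid = tibble(trees = c(100, 500, 1000, 2000)))
show_best(cv_results, metric = "roc_auc")
```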

The *shrinkage parameter* \(\lambda\) controls the rate at which boosting learns. \(\lambda\) is a small, positive number, typically 0.01 or 0.001.

It depends on the problem, but typically a very small \(\lambda\) can require a very large \(B\) for good performance

*The number of splits*, \(d\), in each tree controls the *complexity* of the boosted ensemble. Often \(d = 1\) is a good default (*brace yourself for another tree pun!*). In this case we call the tree a *stump*, meaning it just has a single split. This results in an *additive model*. You can think of \(d\) as the *interaction depth*: it controls the interaction order of the boosted model, since \(d\) splits can involve at most \(d\) variables.

- Set the `mode` as you would with a bagged tree or random forest
- `tree_depth` here is the depth of each tree; let's set that to 1
- `trees` is the number of trees that are fit; this is equivalent to \(B\)
- `learn_rate` is \(\lambda\)
- `xgboost` wants you to have all numeric data, so we need to make dummy variables
- Because `HD` (the outcome) is also categorical, we can use `all_nominal_predictors()` to make sure we don't turn the outcome into dummy variables as well
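
Putting those pieces together, a minimal sketch of what the specification could look like (the particular values of `trees` and `learn_rate` are illustrative assumptions, not from the slides):

```
library(tidymodels)

boost_spec <- boost_tree(
  mode = "classification",
  tree_depth = 1,      # d, the depth of each tree (stumps)
  trees = 1000,        # B, the number of trees
  learn_rate = 0.01    # lambda, the shrinkage parameter
) |>
  set_engine("xgboost")

boost_rec <- recipe(HD ~ ., data = heart) |>
  step_dummy(all_nominal_predictors())  # xgboost needs numeric predictors

boost_wf <- workflow() |>
  add_recipe(boost_rec) |>
  add_model(boost_spec)

boost_model <- fit(boost_wf, data = heart)
```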

**Boosting**

How would this code change if I wanted to tune `B`, the number of trees fit?


**Boosting**

Fit a **boosted model** to the data from the previous application exercise.

- For bagged or random forest *regression trees*, we can record the *total RSS* that is decreased due to splits of a given predictor, \(X_i\), averaged over all \(B\) trees. A large value indicates that the variable is *important*
- For bagged or random forest *classification trees*, we can add up the total amount that the Gini index is decreased by splits of a given predictor, \(X_i\), averaged over \(B\) trees

```
library(tidymodels)

# ranger records impurity-based (Gini) variable importance if we ask for it
rf_spec <- rand_forest(
  mode = "classification",
  mtry = 3
) |>
  set_engine("ranger", importance = "impurity")

wf <- workflow() |>
  add_recipe(
    recipe(HD ~ Age + Sex + ChestPain + RestBP + Chol + Fbs +
             RestECG + MaxHR + ExAng + Oldpeak + Slope + Ca + Thal,
           data = heart)
  ) |>
  add_model(rf_spec)

model <- fit(wf, data = heart)
```
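
The importance scores below live in the underlying `ranger` fit; one way to pull them out (the extraction step is not shown on the original slide, so this is an assumption):

```
model |>
  extract_fit_engine() |>   # the underlying ranger object
  ranger::importance()      # named vector of impurity-based importance
```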

```
Age Sex ChestPain RestBP Chol Fbs RestECG
8.4105385 3.9179814 16.2075537 7.8479381 7.0121649 0.8112367 1.5944339
MaxHR ExAng Oldpeak Slope Ca Thal
13.7292165 6.8135135 13.0718556 5.9581823 16.6729564 14.7206036
```
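
The plotting code below expects a data frame called `var_imp_df` with `variable` and `importance` columns; a hedged sketch of how it could be built, along with a first basic plot:

```
var_imp_df <- model |>
  extract_fit_engine() |>
  ranger::importance() |>
  enframe(name = "variable", value = "importance")  # two-column tibble

# a first pass: bars in the default (alphabetical) order
ggplot(var_imp_df, aes(x = variable, y = importance)) +
  geom_col()
```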

How could we make this plot better?

```
# reorder the bars by importance and flip the axes so the labels are readable
var_imp_df |>
  mutate(variable = factor(variable,
                           levels = variable[order(var_imp_df$importance)])) |>
  ggplot(aes(x = variable, y = importance)) +
  geom_col() +
  coord_flip()
```

Dr. Lucy D’Agostino McGowan *adapted from slides by Hastie & Tibshirani*