
Lucy D’Agostino McGowan

*Do you* ❤️ *all of the tree puns?*

Random forests provide an improvement over bagged trees by way of a small tweak that *decorrelates* the trees

By *decorrelating* the trees, we reduce the variance even more when we average the trees!

As in bagging, we build a number of decision trees on bootstrapped training samples

Each time the tree is split, instead of considering *all predictors* (like bagging), **a random selection of \(m\) predictors** is chosen as split candidates from the full set of \(p\) predictors

The split is allowed to use only one of those \(m\) predictors

A fresh selection of \(m\) predictors is taken at each split

Typically we choose \(m \approx \sqrt{p}\)
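As a rough sketch of this idea (toy names, not the code a random forest implementation actually runs), each split only sees a fresh random handful of the predictors:

```r
p <- 13                        # total number of predictors
m <- floor(sqrt(p))            # typical choice, roughly sqrt(p); 3 here
predictors <- paste0("x", 1:p) # placeholder predictor names

# at each split, only a fresh random subset of m predictors is eligible
split_candidates <- sample(predictors, size = m)
split_candidates
```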

`Choosing m for Random Forest`

Let’s say you have a dataset with 100 observations and 9 variables. If you were fitting a random forest, what would a good \(m\) be?

`01:00`
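Using the \(m \approx \sqrt{p}\) guideline above, \(p = 9\) gives \(m = \sqrt{9} = 3\).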

*Recall that we are predicting whether a patient has heart disease from 13 predictors*

`mtry` here is `m`. If we are doing *bagging*, what do you think we set this to?

What would we change `mtry` to if we are doing a random forest?

- The default for `rand_forest` is `floor(sqrt(# predictors))` (so 3 in this case; see the sketch below)
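A minimal tidymodels sketch of how `mtry` distinguishes the two; the `heart` data frame and `heart_disease` outcome are placeholders for whatever your last exercise used:

```r
library(tidymodels)

# bagging: all 13 predictors are split candidates at every split
bagging_spec <- rand_forest(mtry = 13, trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# random forest: leave mtry at its default, floor(sqrt(# predictors)) = 3 here,
# so only a random subset of predictors is considered at each split
rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# fit(rf_spec, heart_disease ~ ., data = heart)  # placeholder names
```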

`Application Exercise`

- Open your last application exercise
- Refit your model as a *random forest*

`10:00`

Dr. Lucy D’Agostino McGowan *adapted from slides by Hastie & Tibshirani*