Lab 06 - Ensemble models
Due: Tuesday 2023-04-25
Getting started
- Find the lab instructions under the course syllabus on our website bit.ly/sta-363-s23
- Go to our RStudio Pro workspace and create a new project using my template.
For this assignment, go to RStudio Pro and click:
Step 1. File > New Project
Step 2. “Version Control”
Step 3. Git
Step 4. Copy the following into the “Repository URL”:
https://github.com/sta-363-s23/lab-06-ensemble.git
Packages
In this lab we will work with five packages: ISLR for the data, visdat to visualize the dataset, tidyverse (a collection of packages for doing data analysis in a “tidy” way), tidymodels for statistical modeling, and stacks for stacking the models.
If you’d like to run your code in the Console as well, you’ll need to load the packages there too. To do so, run the following in the console.
library(tidyverse)
library(tidymodels)
library(ISLR)
library(visdat)
library(stacks)
Note that the packages are also loaded with the same commands in your Quarto document.
You may need to install the packages used for boosting and stacking. You can do this by running the following once in the console:
install.packages("xgboost")
install.packages("stacks")
Data
For this lab, we are using the Carseats data from the ISLR package.
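If you want a quick first look at the data before starting the exercises, something like the following works (a sketch only, not part of the required answers):

library(ISLR)
library(visdat)

data(Carseats)      # simulated child car seat sales data from ISLR
vis_dat(Carseats)   # shows variable types and any missing values at a glance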
Exercises
1. Examine the Carseats data set using the visdat package. How many variables are there? What are the variable types? Is there any missing data?
2. Our outcome for this lab is Sales. Create a visualization examining the distribution of this variable.
3. Create a recipe to predict Sales from the remaining variables. We are going to be fitting bagged decision trees, random forests, boosted decision trees, and penalized regression. Make sure to perform any preprocessing steps necessary for each of these models (i.e. normalizing the data, creating dummy variables, etc.). Add this recipe to a workflow.
4. Set a seed to 7. Add the code below to your file and be sure to add the parameter control = ctrl to your tune_grid function. Fit a bagged decision tree estimating the car seat Sales using the remaining 10 variables. You may specify the parameters in any way that you’d like, but tune the number of trees (trees), examining 10, 25, 50, 100, 200, and 300 trees. Add this model specification to your workflow and fit the model to find the best parameters for a bagged decision tree. (A sketch of one possible setup follows the code block below.)
ctrl <- tune::control_resamples(
  save_pred = TRUE,
  save_workflow = TRUE
)
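For reference, here is one possible (not the only) way Exercises 3 and 4 might fit together. It assumes 5-fold cross-validation and treats the bagged tree as a random forest that considers every predictor at each split; the recipe steps, fold count, mtry value, and object names are assumptions you should adapt to your own choices.

set.seed(7)

# Recipe: Sales as the outcome, everything else as predictors.
# Dummy variables and normalization so the same recipe also works for
# the penalized regression later on.
carseats_rec <- recipe(Sales ~ ., data = Carseats) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

carseats_cv <- vfold_cv(Carseats, v = 5)   # assumption: 5 folds

# Bagging = a random forest that tries every predictor at each split;
# adjust mtry to match the number of columns your recipe produces.
bag_spec <- rand_forest(mode = "regression", mtry = 10, trees = tune()) %>%
  set_engine("ranger")

carseats_wf <- workflow() %>%
  add_recipe(carseats_rec) %>%
  add_model(bag_spec)

bag_res <- tune_grid(
  carseats_wf,
  resamples = carseats_cv,
  grid = expand_grid(trees = c(10, 25, 50, 100, 200, 300)),
  control = ctrl
)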
5. Collect the metrics from the bagged tree and filter them to only include the root mean squared error. Fill in the code below to plot these results. Describe what you see.
ggplot(---, aes(x = trees, y = mean)) +
geom_point() +
geom_line() +
labs(y = ---)
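For the collect-and-filter step, a minimal sketch (assuming bag_res holds the tuning results from the sketch above) looks like this; the plotting blanks above are left for you to fill in.

bag_metrics <- bag_res %>%
  collect_metrics() %>%
  filter(.metric == "rmse")   # keep only the RMSE rows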
6. Update the model in your workflow to fit a random forest estimating the car seat Sales using the remaining 10 variables. You may specify the parameters in any way that you’d like, but tune the number of trees (trees), examining 10, 25, 50, 100, 200, and 300 trees. (See the sketch below for one possible specification.)
7. Collect the metrics from the random forest and filter them to only include the root mean squared error. Using similar code as in Exercise 5, plot these results. Describe what you see.
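One possible specification for Exercise 6, reusing the workflow and resamples from the earlier sketch (object names are illustrative):

rf_spec <- rand_forest(mode = "regression", trees = tune()) %>%
  set_engine("ranger")

rf_wf <- carseats_wf %>%
  update_model(rf_spec)   # swap the bagged-tree spec for a random forest

rf_res <- tune_grid(
  rf_wf,
  resamples = carseats_cv,
  grid = expand_grid(trees = c(10, 25, 50, 100, 200, 300)),
  control = ctrl
)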
8. Update the model in your workflow to fit a boosted tree estimating the car seat Sales using the remaining 10 variables. Specify the tree depth to be 1, the learn rate to be 0.1, and tune the number of trees (trees), examining 10, 25, 50, 100, 200, and 300 trees. (See the sketch below for one possible specification.)
9. Collect the metrics from the boosted tree and filter them to only include the root mean squared error. Using similar code as in Exercise 5, plot these results. Describe what you see.
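One possible specification for Exercise 8, again reusing the earlier workflow and resamples (illustrative names):

boost_spec <- boost_tree(
  mode = "regression",
  trees = tune(),
  tree_depth = 1,
  learn_rate = 0.1
) %>%
  set_engine("xgboost")

boost_wf <- carseats_wf %>%
  update_model(boost_spec)

boost_res <- tune_grid(
  boost_wf,
  resamples = carseats_cv,
  grid = expand_grid(trees = c(10, 25, 50, 100, 200, 300)),
  control = ctrl
)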
10. Based on the exercises above and the number of trees attempted, which method would you prefer? What seems to be the optimal number of trees?
11. Update the model in your workflow to fit a penalized regression model using Elastic Net to estimate the car seat Sales using the remaining 10 variables. Use the following grid:
grid <- expand_grid(
  penalty = seq(0, 0.1, by = 0.01),
  mixture = c(0, 0.5, 1)
)
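A sketch of how the Elastic Net model might be specified and tuned over this grid, assuming the workflow and resamples from the earlier sketches (the glmnet engine requires the glmnet package to be installed):

en_spec <- linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")   # mixture = 0 is ridge, 1 is lasso, values between are elastic net

en_wf <- carseats_wf %>%
  update_model(en_spec)

en_res <- tune_grid(
  en_wf,
  resamples = carseats_cv,
  grid = grid,
  control = ctrl
)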
12. Collect the metrics from the penalized regression and filter them to only include the root mean squared error. Using similar code as in Exercise 5, plot these results. Describe what you see.
13. Let’s stack these models to create a single ensemble model. Using the stacks() function along with add_candidates(), blend_predictions(), and fit_members(), put together a final model. Which models were retained in the ensemble model? Using predict, examine how this final ensemble model performs – what is the RMSE? How does this compare to the estimates of the test error for the individual models?
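A sketch of the stacking workflow, assuming the tuning results bag_res, rf_res, boost_res, and en_res from the earlier sketches; swap in your own object names and evaluation data.

carseats_stack <- stacks() %>%
  add_candidates(bag_res) %>%
  add_candidates(rf_res) %>%
  add_candidates(boost_res) %>%
  add_candidates(en_res) %>%
  blend_predictions() %>%   # candidates with zero stacking weight are dropped
  fit_members()             # refit the retained members

carseats_stack   # printing shows which candidates were retained

# Assumption: predicting on Carseats here; use a held-out test set if you made a split.
stack_preds <- predict(carseats_stack, new_data = Carseats) %>%
  bind_cols(Carseats %>% select(Sales))

rmse(stack_preds, truth = Sales, estimate = .pred)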