Lab 06 - Ensemble models

Due: Tuesday 2023-04-25

Getting started

  • Find the lab instructions under the course syllabus on our website bit.ly/sta-363-s23
  • Go to our RStudio Pro workspace and create a new project using my template.

For this assignment, go to RStudio Pro and click:

Step 1. File > New Project
Step 2. “Version Control”
Step 3. “Git”
Step 4. Copy the following into the “Repository URL”:

https://github.com/sta-363-s23/lab-06-ensemble.git

Packages

In this lab we will work with five packages: ISLR for the data, visdat to visualize the dataset, tidyverse, which is a collection of packages for doing data analysis in a “tidy” way, tidymodels for statistical modeling, and stacks for stacking the models.

If you’d like to run your code in the Console as well, you’ll also need to load the packages there. To do so, run the following in the Console.

library(tidyverse) 
library(tidymodels)
library(ISLR)
library(visdat)
library(stacks)

Note that the packages are also loaded with the same commands in your Quarto document.

You may need to install the packages used for boosting and stacking. You can do this by running the following once in the Console:

install.packages("xgboost")
install.packages("stacks")

Data

For this lab, we are using the Carseats data from the ISLR package, a simulated data set containing sales of child car seats at 400 different stores.

Exercises

  1. Examine the Carseats data set using the visdat package. How many variables are there? What are the variable types? Is there any missing data?
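
If you are not sure where to start, here is a minimal sketch (vis_dat() is the visdat function that plots variable types and missingness in a single figure):

# Overview of variable types and missing values in Carseats
vis_dat(Carseats)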

  2. Our outcome for this lab is Sales. Create a visualization examining the distribution of this variable.
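
One possible sketch using ggplot2 (loaded with tidyverse); a histogram is just one reasonable choice of visualization here:

# Distribution of the outcome, Sales
ggplot(Carseats, aes(x = Sales)) +
  geom_histogram(bins = 30)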

  3. Create a recipe to predict Sales from the remaining variables. We are going to be fitting bagged decision trees, random forests, boosted decision trees, and penalized regression. Make sure to perform any preprocessing steps necessary for each of these models (e.g., normalizing the data, creating dummy variables). Add this recipe to a workflow. A sketch of one way to set this up is below.
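
In this sketch, the object names rec and wf are placeholders, and the exact preprocessing steps are up to you:

# Predict Sales from all remaining variables
rec <- recipe(Sales ~ ., data = Carseats) %>%
  step_normalize(all_numeric_predictors()) %>%  # center and scale numeric predictors
  step_dummy(all_nominal_predictors())          # dummy variables for factor predictors

wf <- workflow() %>%
  add_recipe(rec)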

  4. Set the seed to 7. Add the code below to your file, and be sure to pass the argument control = ctrl to your tune_grid() function. Fit a bagged decision tree estimating the car seat Sales using the remaining 10 variables. You may specify the parameters in any way that you’d like, but tune the number of trees (trees), examining 10, 25, 50, 100, 200, and 300 trees. Add this model specification to your workflow and fit the model to find the best parameters for a bagged decision tree.

ctrl <- 
  tune::control_resamples(
    save_pred = TRUE,
    save_workflow = TRUE
  )
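
Here is a sketch of one way to complete this exercise. The names cv, bag_spec, and bag_results, the choice of 5-fold cross-validation, and the ranger engine (which may require install.packages("ranger")) are all assumptions, not requirements of the lab; bagging is expressed as a random forest that considers every predictor at each split:

set.seed(7)

# Resamples to tune over (the number of folds is an assumption)
cv <- vfold_cv(Carseats, v = 5)

# Bagging: a random forest using all 10 predictors at each split
bag_spec <- rand_forest(mtry = 10, trees = tune()) %>%
  set_engine("ranger") %>%
  set_mode("regression")

wf <- wf %>%
  add_model(bag_spec)

bag_results <- tune_grid(
  wf,
  resamples = cv,
  grid = data.frame(trees = c(10, 25, 50, 100, 200, 300)),
  control = ctrl
)

Note that if your recipe creates dummy variables, the number of predictor columns after preprocessing will be larger than 10; for true bagging, set mtry to that expanded count.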

  5. Collect the metrics from the bagged tree and filter them to only include the root mean squared error. Fill in the code below to plot these results. Describe what you see.

ggplot(---, aes(x = trees, y = mean)) + 
  geom_point() + 
  geom_line() + 
  labs(y = ---)
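
One way to build the data for this plot, assuming the tuning results were saved as bag_results as in the sketch above (the filtered data frame fills the first blank, and a descriptive label string fills the second):

# Keep only the RMSE rows from the tuning metrics
bag_metrics <- collect_metrics(bag_results) %>%
  filter(.metric == "rmse")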

  6. Update the model in your workflow to fit a random forest estimating the car seat Sales using the remaining 10 variables. You may specify the parameters in any way that you’d like, but tune the number of trees (trees), examining 10, 25, 50, 100, 200, and 300 trees.
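
A sketch under the same assumptions as the bagging code above; mtry = 3 is one arbitrary choice, and rf_spec and rf_results are placeholder names:

rf_spec <- rand_forest(mtry = 3, trees = tune()) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# Swap the model in the existing workflow
wf <- wf %>%
  update_model(rf_spec)

rf_results <- tune_grid(
  wf,
  resamples = cv,
  grid = data.frame(trees = c(10, 25, 50, 100, 200, 300)),
  control = ctrl
)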

  7. Collect the metrics from the random forest and filter them to only include the root mean squared error. Using similar code as in Exercise 5, plot these results. Describe what you see.

  8. Update the model in your workflow to fit a boosted tree estimating the car seat Sales using the remaining 10 variables. Specify the tree depth to be 1 and the learn rate to be 0.1, and tune the number of trees (trees), examining 10, 25, 50, 100, 200, and 300 trees.
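
A sketch reusing the cv folds and ctrl object from above; boost_spec and boost_results are placeholder names:

boost_spec <- boost_tree(tree_depth = 1, learn_rate = 0.1, trees = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

wf <- wf %>%
  update_model(boost_spec)

boost_results <- tune_grid(
  wf,
  resamples = cv,
  grid = data.frame(trees = c(10, 25, 50, 100, 200, 300)),
  control = ctrl
)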

  9. Collect the metrics from the boosted tree and filter them to only include the root mean squared error. Using similar code as in Exercise 5, plot these results. Describe what you see.

  10. Based on the exercises above and the number of trees attempted, which method would you prefer? What seems to be the optimal number of trees?

  11. Update the model in your workflow to fit a penalized regression model using Elastic Net to estimate the car seat Sales using the remaining 10 variables. Use the following grid:

grid <- expand_grid(
  penalty = seq(0, 0.1, by = 0.01),
  mixture = c(0, 0.5, 1)
)
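
A sketch of the matching specification; lm_spec and lm_results are placeholder names, and the glmnet engine may require install.packages("glmnet"):

lm_spec <- linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("regression")

wf <- wf %>%
  update_model(lm_spec)

lm_results <- tune_grid(
  wf,
  resamples = cv,
  grid = grid,
  control = ctrl
)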

  12. Collect the metrics from the penalized regression and filter them to only include the root mean squared error. Using similar code as in Exercise 5, plot these results. Describe what you see.

  13. Let’s stack these models to create a single ensemble model. Using the stacks() function along with add_candidates(), blend_predictions(), and fit_members(), put together a final model. Which models were retained in the ensemble model? Using predict(), examine how this final ensemble model performs. What is the RMSE? How does this compare to the estimates of the test error for the individual models?
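
A sketch of the stacking step, assuming the four tuning results were saved under the names used in the sketches above. Printing the fitted stack shows which members were retained. Predicting back onto Carseats is for illustration only; evaluate on whatever data your lab setup implies:

car_stack <- stacks() %>%
  add_candidates(bag_results) %>%
  add_candidates(rf_results) %>%
  add_candidates(boost_results) %>%
  add_candidates(lm_results) %>%
  blend_predictions() %>%   # choose member weights via penalized regression
  fit_members()             # refit the retained members

car_stack  # printing shows which candidates were kept

# RMSE of the ensemble's predictions (illustrative)
predict(car_stack, Carseats) %>%
  bind_cols(Carseats %>% select(Sales)) %>%
  rmse(truth = Sales, estimate = .pred)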