Lab 02 - Cross Validation
Due: Tuesday 2023-02-07 at 11:59pm
Getting started
- Find the lab instructions under the course syllabus on our website bit.ly/sta-363-s23
- Go to our RStudio Pro workspace and create a new project using my template.
For this assignment, go to RStudio Pro and click:
Step 1. File > New Project
Step 2. “Version Control”
Step 3. Git
Step 4. Copy the following into the “Repository URL”:
https://github.com/sta-363-s23/lab-02-cv.git
Instructions
Packages
In this lab we will work with three packages: ISLR, which is a package that accompanies your textbook; tidyverse, which is a collection of packages for doing data analysis in a “tidy” way; and tidymodels, which is a collection of packages for statistical modeling.
library(tidyverse) 
library(tidymodels)
library(ISLR)
YAML:
Open the .qmd file in your project, change the author name to your name, and Render the document.
Data
For this lab, we are using the Auto data from the ISLR package.
Exercises
Conceptual questions
- Explain how k-fold Cross Validation is implemented. 
- What are the advantages / disadvantages of k-fold Cross Validation compared to the Validation Set approach? What are the advantages / disadvantages of k-fold Cross Validation compared to Leave-one-out Cross Validation? 
Data exploration
- For this analysis, we are using the Auto dataset from the ISLR package. How many rows are in this dataset? How many columns? Is there any missing data?
- Our outcome of interest is miles per gallon: mpg. Create a publication-ready figure examining the distribution of this variable. For full credit, be sure your figure has correct labels and titles.
- Our main predictor of interest is horsepower. Create a publication-ready figure looking at the relationship between miles per gallon and horsepower.
K-fold cross validation
We are trying to decide between three models of varying flexibility:
- Model 1: \(\texttt{mpg} = \beta_0 + \beta_1 \texttt{horsepower} + \epsilon\)
- Model 2: \(\texttt{mpg} = \beta_0 + \beta_1 \texttt{horsepower} + \beta_2 \texttt{horsepower}^2 + \epsilon\)
- Model 3: \(\texttt{mpg} = \beta_0 + \beta_1 \texttt{horsepower} + \beta_2 \texttt{horsepower}^2 + \beta_3 \texttt{horsepower}^3 + \epsilon\)
- Using the Auto data, split the data into two groups: a training data set, saved as Auto_train, and a testing data set, saved as Auto_test. Be sure to set a seed to ensure that you get the same result each time you Render your document.
You can use the poly() function to fit a model with a polynomial term. For example, to fit the model \(y = \beta_0 + \beta_1 \texttt{x} + \beta_2 \texttt{x}^2 + \beta_3 \texttt{x}^3 + \epsilon\), you would run fit(lm_spec, y ~ poly(x, 3), data = data)
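The call above relies on an lm_spec object that this lab does not define. As a minimal sketch, assuming lm_spec is an ordinary linear regression specification using the "lm" engine, the code below creates one and fits a cubic polynomial. It uses the built-in mtcars data (with hp as the predictor) purely for illustration, not the lab data, and the name cubic_fit is hypothetical.

```r
library(tidymodels)

# Assumption: lm_spec is a linear regression specification fit with lm().
lm_spec <- linear_reg() |>
  set_engine("lm")

# Illustration on mtcars, not the lab data: a degree-3 polynomial in hp.
cubic_fit <- fit(lm_spec, mpg ~ poly(hp, 3), data = mtcars)

# One row per coefficient: the intercept plus three polynomial terms.
cubic_coefs <- tidy(cubic_fit)
cubic_coefs
```

Changing the second argument of poly() changes the degree, so the same pattern covers linear, quadratic, and cubic models.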
- Fit the three models outlined above on the training data. Using the model created on the training data, predict mpg in the test data set for each model. What is the test RMSE for the three models? Which model would you choose?
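The validation set workflow described above (split, fit on the training data, predict on the test data, compute RMSE) might look something like the following sketch. It runs on the built-in mtcars data rather than the lab data, the seed and the 50/50 split proportion are arbitrary choices, and all object names (cars_split, quad_fit, cars_rmse) are hypothetical.

```r
library(tidymodels)

set.seed(363)  # arbitrary seed, used only to make the split reproducible

# Illustration on mtcars, not the lab data: hold out half of the rows.
cars_split <- initial_split(mtcars, prop = 0.5)
cars_train <- training(cars_split)
cars_test  <- testing(cars_split)

# Assumption: a linear regression specification fit with lm().
lm_spec <- linear_reg() |>
  set_engine("lm")

# Fit a quadratic model using the training rows only.
quad_fit <- fit(lm_spec, mpg ~ poly(hp, 2), data = cars_train)

# augment() adds a .pred column to the test rows;
# rmse() compares those predictions with the observed outcome.
cars_rmse <- quad_fit |>
  augment(new_data = cars_test) |>
  rmse(truth = mpg, estimate = .pred)
cars_rmse
```

Because the estimate depends on which rows land in the test set, a different seed will generally give a different RMSE.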
- Fit the same three models, but instead of the validation set approach, perform 5-fold cross validation. Make sure to set a seed so you get the same answer each time you run the analysis. Calculate the overall 5-fold cross validation error for each of the three models. Which model would you choose?
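One way to set up 5-fold cross validation in tidymodels is sketched below, assuming the vfold_cv(), fit_resamples(), and collect_metrics() workflow. It again uses the built-in mtcars data for illustration rather than the lab data; the seed is arbitrary and the object names (cars_cv, cars_res, cars_metrics) are hypothetical.

```r
library(tidymodels)

set.seed(363)  # arbitrary seed, used only to make the folds reproducible

# Illustration on mtcars, not the lab data: assign rows to 5 folds.
cars_cv <- vfold_cv(mtcars, v = 5)

# Assumption: a linear regression specification fit with lm().
lm_spec <- linear_reg() |>
  set_engine("lm")

# Fit the model five times, each time holding out one fold for assessment.
cars_res <- fit_resamples(lm_spec, mpg ~ poly(hp, 2), resamples = cars_cv)

# Average the per-fold metrics; the rmse row is the cross validation error.
cars_metrics <- collect_metrics(cars_res)
cars_metrics
```

Repeating the fit_resamples() call with a different polynomial degree, while reusing the same folds, keeps the comparison between models on equal footing.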
- The tidymodels package allows you to do this faster! Instead of having to fit 3 (or more!) different models to determine the best flexibility, you can (1) create a recipe to specify how you would like to fit a model and then (2) tune this model to determine the best output. Copy the code below. What do you think the line step_poly(horsepower, degree = tune()) does? Hint: you can run ?step_poly in the Console to learn more about this function.
auto_prep <- Auto |>
  recipe(mpg ~ horsepower) |>
  step_poly(horsepower, degree = tune())
- To tune this model, you will replace fit_resamples with tune_grid. The pseudocode to do this is below; you may need to update some names to match what you have named objects in your document. Add the code to tune your model based on the code below.
auto_tune <- tune_grid(lm_spec,
                       auto_prep,
                       resamples = auto_cv)
- Using the collect_metrics function, look at the RMSE for auto_tune. Which degree is preferable?
- You can plot the output from Exercise 11 to make it a bit easier to determine. First, save your output from Exercise 11 as auto_metrics. Then filter this data frame to only include rows where .metric == "rmse". Save this filtered data frame as auto_rmse. Edit the code below to plot degree on the x-axis and mean on the y-axis. Describe what this plot shows.
ggplot(auto_rmse, aes(x = ----, y = ----)) + 
  geom_line() +
  geom_pointrange(aes(ymin = mean - std_err, ymax = mean + std_err)) + 
  labs(x = "Degree",
       y = "Cross validation error",
       title = ---)