Lab 04 - Ridge Lasso Elastic Net

Due: Tuesday 2023-03-14

Getting started

  • Find the lab instructions under the course syllabus on our website bit.ly/sta-363-s23
  • Go to our RStudio Pro workspace and create a new project using my template.

For this assignment, go to RStudio Pro and click:

Step 1. File > New Project
Step 2. “Version Control”
Step 3. Git
Step 4. Copy the following into the “Repository URL”:

https://github.com/sta-363-s23/lab-04-ridge-lasso-elastic.git

Remember to write out your answers in full sentences, do not just rely on the R output

Packages

In this lab we will work with two packages: tidyverse which is a collection of packages for doing data analysis in a “tidy” way and tidymodels for statistical modeling.

If you’d like to run your code in the Console as well you’ll also need to load the packages there. To do so, run the following in the console.

library(tidyverse) 
library(tidymodels)

Note that the packages are also loaded with the same commands in your R Markdown document.

Data

For this lab, we are using a data frame currently in music.csv. This data frame includes 72 predictors that are components of audio files and one outcome, lat, the latitude of where the music originated. We are trying to predict the location of the music’s origin using audio components of the music.

This data frame should be loaded in your RStudio Project when you load your starter files.

Exercises

  1. Set a seed of 7. Split the music data into a training and test set with 75% of the data in the training and 25% in the testing. Call your training set music_train and your testing set music_test. Describe these data sets (how many observations, how many variables).
music <- read_csv("data/music.csv")
  1. We are interested in predicting the latitude (lat) of the music’s origin from all other variables. Using the training data only, examine a visualization of the outcome.

  2. Fit a linear model using least squares on the training set. Report the test root mean squared error obtained.

  3. Fit a ridge regression model on the training set with \(\lambda\) chosen using 10-fold cross validation. Report the \(\lambda\) chosen and explain why. Report the estimated test root mean square error from the cross validation and the test root mean squared error obtained using testing portion of the initially split data frame.

  4. Using the model chosen above on the training data, which variables were included in the final model?

  5. Fit a lasso model on the training set with \(\lambda\) chosen using cross validation. Report the \(\lambda\) chosen and explain why. Report the estimated test root mean square error from the cross validation and the test root mean squared error obtained using testing portion of the initially split data frame.

  6. Using the model chosen above on the training data, which variables were included in the final model?

  7. Fit an elastic net model on the training set with \(\lambda\) and \(\alpha\) chosen using cross validation. Create a Figure with the estimated test root mean squared error on the y-axis and \(\lambda\) on the x-axis with lines colored by the mixture chosen.

  8. Report the \(\lambda\) that you would choose from the previous exercise and explain why. Report the estimated test root mean square error from the cross validation and the test root mean squared error obtained using testing portion of the initially split data frame.

  9. Using the model chosen above on the training data, which variables were included in the final model?

  10. Comment on the results obtained. How well can we predict the latitude of where the music originated? Is there much difference among the test errors resulting from these four approaches?