https://github.com/sta-363-s23/lab-05-non-linear.git
Lab 05 - Nonlinear models
Due: Tuesday 2022-04-04
Getting started
- Find the lab instructions under the course syllabus on our website bit.ly/sta-363-s23
- Go to our RStudio Pro workspace and create a new project using my template.
For this assignment, go to RStudio Pro and click:
Step 1. File > New Project
Step 2. “Version Control”
Step 3. Git
Step 4. Copy the following into the “Repository URL”:
Remember to write out your answers in full sentences, do not just rely on the R output
Packages
In this lab we will work with four packages: tidyverse
which is a collection of packages for doing data analysis in a “tidy” way, tidymodels
for statistical modeling, DALEXtra
to help us explain how the parameters are impacting the final predictions, and visdat
to help us visualize the dataset.
If you’d like to run your code in the Console as well you’ll also need to load the packages there. To do so, run the following in the console.
library(tidyverse)
library(tidymodels)
library(DALEXtra)
library(visdat)
Data
For this lab, we are using wage
data adapted from the ISLR
package. This can be loaded in by running:
load("data/wage.rda")
Below is a data dictionary for this dataset:
variable | definition |
---|---|
year | Year that wage information was recorded |
age | Age of worker |
maritl | A factor with levels 1. Never Married 2. Married 3. Widowed 4. Divorced and 5. Separated indicating marital status |
race | A factor with levels 1. White 2. Black 3. Asian and 4. Other indicating race |
education | A factor with levels 1. < HS Grad 2. HS Grad 3. Some College 4. College Grad and 5. Advanced Degree indicating education level |
region | Region of the country (mid-atlantic only) |
jobclass | A factor with levels 1. Industrial and 2. Information indicating type of job |
health | A factor with levels 1. <=Good and 2. >=Very Good indicating health level of worker |
health_ins | A factor with levels 1. Yes and 2. No indicating whether worker has health insurance |
logwage | Log of workers wage |
wage | Workers raw wage |
Exercises
Using the
visdat
package examine thewage
dataset. What are the variables? How many observations are there? Is there any missing data? If so how much?For this lab, we are interested in predicting the variable
wage
fromage
,health_ins
,jobclass
,education
, andrace
. Let’s first look at the distribution of our outcome. Create a publication-ready figure examining the distribution ofwage
.Create a recipe using the
wage
data. Recall we are interested in predictingwage
fromage
,health_ins
,jobclass
,education
, andrace
. Add a step to your recipe that accounts for the missing data, if there is any. Explain what this step is doing and why you chose this method. (In this exercise, you are just creating the recipe).We want age to be fit using a natural spline. Let’s consider fitting a spline with 3 degrees of freedom. How many knots did will spline include? Where are the knots located by default? What values of age for this dataset will the knots be set at by default?
Instead of pre-specifying the degrees of freedom for our natural spline, let’s allow this to be tuned using cross-validation. Update your recipe to include a step to fit age using a natural spline, but instead of setting the degrees of freedom, use
tune()
for this parameter. (In this exercise, you are just updating the recipe).Create a workflow for fitting the model. Add your recipe to this workflow. Add a model to this workflow (here we are doing linear regression).
Use
tune_grid()
to fit the workflow specified in Exercise 6 using 10-fold cross validation, similar to Lab 04. Using the following grid:
<- data.frame(deg_free = 1:10) grid
Use the show_best
function to show the top 5 models using RMSE as the metric. Choose the model with the lowest RMSE. How many degrees of freedom were used for the age natural spline for this best model? Report the RMSE for this model as well as the chosen degrees of freedom.
Create a “final workflow” using the best parameters from Exercise 7 (for this you will want to use the
finalize_workflow
function along with theselect_best
function to select the best parameters). Fit this finalized model.Using the
explain_tidymodels
andmodel_profile
functions from theDALEXtra
package, along with the functionggplot_pdp
to create a partial dependence plot to examine the effect ofage
on predicted wage in your final model. (To get theggplot_pdp
function to work, be sure to run thesource("ggplot_pdp.R")
line of code provided in your template file). Describe what you see.We are going to create a new recipe, swapping out the natural spline for age with a polynomial. Here we want age to be fit using a polynomial. Use
tune()
to decide how many degrees the polynomial should have for theage
variable. Update the workflow from Exercise 6 with this new recipe using theupdate_recipe
function.Use
tune_grid()
to fit the workflow from Exercise 10 using 10-fold cross validation. Using the following grid:
<- data.frame(degree = 1:10) grid
Choose the model with the lowest RMSE. What degree polynomial was used for age for this best model? Report the RMSE for this model as well as the chosen degree.
Create a “final workflow” using the best parameters from Exercise 11. Fit this finalized model.
Create a publication-ready partial dependence plot to examine the effect of
age
on predicted wage in your final model from Exercise 12. Compare this to the plot created in Exercise 9.If the goal is prediction, which model would you prefer, the one fit in Exercise 8 or the one fit in Exercise 12? Explain your answer.
Using your chosen model, examine whether ridge, lasso, or elastic net would provide a better fit. Use the following grid:
<- expand_grid(
grid penalty = seq(0, 3, by = 0.5),
mixture = seq(0, 1, by = 0.1)
)
To do this, we are going to have to create a new recipe and new workflow. In your recipe make sure that you use all the necessary steps (think about missing data, dummy variables, scaling, the transformation of age, etc). Describe your results.