tidymodels

Lucy D’Agostino McGowan

Application Exercise

  1. Create a new project from this template in RStudio Pro:
https://github.com/sta-363-s23/06-appex.git
  1. Load the packages by running the top chunk of R code

tidymodels

tidymodels.org

  • tidymodels is an opinionated collection of R packages designed for modeling and statistical analysis.
  • All packages share an underlying philosophy and a common grammar.

Step 1: Specify the model

  • Pick the model
  • Set the engine

Specify the model

linear_reg() |>
  set_engine("lm")

Specify the model

linear_reg() |>
  set_engine("glmnet")

Specify the model

linear_reg() |>
  set_engine("spark")

Specify the model

decision_tree() |>
  set_engine("rpart")

Specify the model

  • All available models:

tidymodels.org

Application Exercise

  1. Write a pipe that creates a model that uses lm() to fit a linear regression using tidymodels. Save it as lm_spec and look at the object. What does it return?

Hint: you’ll need https://www.tidymodels.org

05:00

lm_spec <- 
  linear_reg() |> # Pick linear regression
  set_engine(engine = "lm") # set engine
lm_spec
Linear Regression Model Specification (regression)

Computational engine: lm 

Fit the data

  • You can train your model using the fit() function
fit(lm_spec,
    mpg ~ horsepower,
    data = Auto)
parsnip model object


Call:
stats::lm(formula = mpg ~ horsepower, data = data)

Coefficients:
(Intercept)   horsepower  
    39.9359      -0.1578  

Application Exercise

  1. Fit the model:
library(ISLR)
lm_fit <- fit(lm_spec,
              mpg ~ horsepower,
              data = Auto)
lm_fit

Does this give the same results as

lm(mpg ~ horsepower, data = Auto)
03:00

lm_fit <- fit(lm_spec,
              mpg ~ horsepower,
              data = Auto)
lm_fit
parsnip model object


Call:
stats::lm(formula = mpg ~ horsepower, data = data)

Coefficients:
(Intercept)   horsepower  
    39.9359      -0.1578  
lm(mpg ~ horsepower, data = Auto)

Call:
lm(formula = mpg ~ horsepower, data = Auto)

Coefficients:
(Intercept)   horsepower  
    39.9359      -0.1578  

Get predictions

lm_fit |>
  predict(new_data = Auto)
  • Uses the predict() function
  • ‼️ new_data has an underscore
  • 😄 This automagically creates a data frame

Get predictions

lm_fit |>
  predict(new_data = Auto) |>
  bind_cols(Auto)
# A tibble: 392 × 10
   .pred   mpg cylinders displacement horsep…¹ weight accel…²  year origin name 
 * <dbl> <dbl>     <dbl>        <dbl>    <dbl>  <dbl>   <dbl> <dbl>  <dbl> <fct>
 1 19.4     18         8          307      130   3504    12      70      1 chev…
 2 13.9     15         8          350      165   3693    11.5    70      1 buic…
 3 16.3     18         8          318      150   3436    11      70      1 plym…
 4 16.3     16         8          304      150   3433    12      70      1 amc …
 5 17.8     17         8          302      140   3449    10.5    70      1 ford…
 6  8.68    15         8          429      198   4341    10      70      1 ford…
 7  5.21    14         8          454      220   4354     9      70      1 chev…
 8  6.00    14         8          440      215   4312     8.5    70      1 plym…
 9  4.42    14         8          455      225   4425    10      70      1 pont…
10  9.95    15         8          390      190   3850     8.5    70      1 amc …
# … with 382 more rows, and abbreviated variable names ¹​horsepower,
#   ²​acceleration

What does bind_cols do?

Get predictions

lm_fit |>
  predict(new_data = Auto) |>
  bind_cols(Auto)
# A tibble: 392 × 10
   .pred   mpg cylinders displacement horsep…¹ weight accel…²  year origin name 
 * <dbl> <dbl>     <dbl>        <dbl>    <dbl>  <dbl>   <dbl> <dbl>  <dbl> <fct>
 1 19.4     18         8          307      130   3504    12      70      1 chev…
 2 13.9     15         8          350      165   3693    11.5    70      1 buic…
 3 16.3     18         8          318      150   3436    11      70      1 plym…
 4 16.3     16         8          304      150   3433    12      70      1 amc …
 5 17.8     17         8          302      140   3449    10.5    70      1 ford…
 6  8.68    15         8          429      198   4341    10      70      1 ford…
 7  5.21    14         8          454      220   4354     9      70      1 chev…
 8  6.00    14         8          440      215   4312     8.5    70      1 plym…
 9  4.42    14         8          455      225   4425    10      70      1 pont…
10  9.95    15         8          390      190   3850     8.5    70      1 amc …
# … with 382 more rows, and abbreviated variable names ¹​horsepower,
#   ²​acceleration

Which column has the predicted values?

Application Exercise

03:00
  1. Edit the code below to add the original data to the predicted data.
mpg_pred <- lm_fit |> 
  predict(new_data = Auto) |> 
  ---

Get predictions

mpg_pred <- lm_fit |>
  predict(new_data = Auto) |>
  bind_cols(Auto)

mpg_pred
# A tibble: 392 × 10
   .pred   mpg cylinders displacement horsep…¹ weight accel…²  year origin name 
 * <dbl> <dbl>     <dbl>        <dbl>    <dbl>  <dbl>   <dbl> <dbl>  <dbl> <fct>
 1 19.4     18         8          307      130   3504    12      70      1 chev…
 2 13.9     15         8          350      165   3693    11.5    70      1 buic…
 3 16.3     18         8          318      150   3436    11      70      1 plym…
 4 16.3     16         8          304      150   3433    12      70      1 amc …
 5 17.8     17         8          302      140   3449    10.5    70      1 ford…
 6  8.68    15         8          429      198   4341    10      70      1 ford…
 7  5.21    14         8          454      220   4354     9      70      1 chev…
 8  6.00    14         8          440      215   4312     8.5    70      1 plym…
 9  4.42    14         8          455      225   4425    10      70      1 pont…
10  9.95    15         8          390      190   3850     8.5    70      1 amc …
# … with 382 more rows, and abbreviated variable names ¹​horsepower,
#   ²​acceleration

Calculate the error

  • Root mean square error
mpg_pred |>
  rmse(truth = mpg, estimate = .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        4.89

What is this estimate? (training error? testing error?)

Validation set approach

Auto_split <- initial_split(Auto, prop = 0.5)
Auto_split
<Training/Testing/Total>
<196/196/392>

How many observations are in the training set?

Validation set approach

Auto_split <- initial_split(Auto, prop = 0.5)
Auto_split
<Training/Testing/Total>
<196/196/392>

How many observations are in the test set?

Validation set approach

Auto_split <- initial_split(Auto, prop = 0.5)
Auto_split
<Training/Testing/Total>
<196/196/392>

How many observations are there in total?

Validation set approach

Auto_split <- initial_split(Auto, prop = 0.5)
Auto_split
<Training/Testing/Total>
<196/196/392>
  • Extract the training and testing data
training(Auto_split)
testing(Auto_split)

Validation set approach

Auto_train <- training(Auto_split)
Auto_train
# A tibble: 196 × 9
     mpg cylinders displacement horsepower weight accelerat…¹  year origin name 
   <dbl>     <dbl>        <dbl>      <dbl>  <dbl>       <dbl> <dbl>  <dbl> <fct>
 1  19           4        120           88   3270        21.9    76      2 peug…
 2  20.5         6        200           95   3155        18.2    78      1 chev…
 3  37.2         4         86           65   2019        16.4    80      3 dats…
 4  22           6        250          105   3353        14.5    76      1 chev…
 5  14           8        351          148   4657        13.5    75      1 ford…
 6  14           8        351          153   4129        13      72      1 ford…
 7  25           4         97.5         80   2126        17      72      1 dodg…
 8  16           8        351          149   4335        14.5    77      1 ford…
 9  38           4         91           67   1995        16.2    82      3 dats…
10  31.6         4        120           74   2635        18.3    81      3 mazd…
# … with 186 more rows, and abbreviated variable name ¹​acceleration

Application Exercise

  1. Copy the code below, fill in the blanks to fit a model on the training data then calculate the test RMSE.
set.seed(100)
Auto_split  <- ________
Auto_train  <- ________
Auto_test   <- ________
lm_fit      <- fit(lm_spec, 
                   mpg ~ horsepower, 
                   data = ________)
mpg_pred  <- ________ |> 
  predict(new_data = ________) |> 
  bind_cols(________)
rmse(________, truth = ________, estimate = ________)
06:00

A faster way!

  • You can use last_fit() and specify the split
  • This will automatically train the data on the train data from the split
  • Instead of specifying which metric to calculate (with rmse as before) you can just use collect_metrics() and it will automatically calculate the metrics on the test data from the split

A faster way!

set.seed(100)

Auto_split <- initial_split(Auto, prop = 0.5)
lm_fit <- last_fit(lm_spec,
                   mpg ~ horsepower,
                   split = Auto_split) 

lm_fit |>
  collect_metrics()
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       4.96  Preprocessor1_Model1
2 rsq     standard       0.613 Preprocessor1_Model1

A faster way!

set.seed(100)

Auto_split <- initial_split(Auto, prop = 0.5)
lm_fit <- last_fit(lm_spec,
                   mpg ~ horsepower,
                   split = Auto_split) 

lm_fit |>
  collect_metrics()
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       4.96  Preprocessor1_Model1
2 rsq     standard       0.613 Preprocessor1_Model1

A faster way!

set.seed(100)

Auto_split <- initial_split(Auto, prop = 0.5)
lm_fit <- last_fit(lm_spec,
                   mpg ~ horsepower,
                   split = Auto_split) 

lm_fit |>
  collect_metrics()
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       4.96  Preprocessor1_Model1
2 rsq     standard       0.613 Preprocessor1_Model1

What about cross validation?

Auto_cv <- vfold_cv(Auto, v = 5)
Auto_cv
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits           id   
  <list>           <chr>
1 <split [313/79]> Fold1
2 <split [313/79]> Fold2
3 <split [314/78]> Fold3
4 <split [314/78]> Fold4
5 <split [314/78]> Fold5

What about cross validation?

  • Instead of fit we will use fit_resamples
fit_resamples(lm_spec, 
              mpg ~ horsepower,
              resamples = Auto_cv) 

What about cross validation?

fit_resamples(lm_spec, 
              mpg ~ horsepower,
              resamples = Auto_cv) 

What about cross validation?

fit_resamples(lm_spec, 
              mpg ~ horsepower,
              resamples = Auto_cv) 

What about cross validation?

fit_resamples(lm_spec, 
              mpg ~ horsepower,
              resamples = Auto_cv) 

What about cross validation?

fit_resamples(lm_spec,
              mpg ~ horsepower,
              resamples = Auto_cv)
# Resampling results
# 5-fold cross-validation 
# A tibble: 5 × 4
  splits           id    .metrics         .notes          
  <list>           <chr> <list>           <list>          
1 <split [313/79]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]>
2 <split [313/79]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]>
3 <split [314/78]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]>
4 <split [314/78]> Fold4 <tibble [2 × 4]> <tibble [0 × 3]>
5 <split [314/78]> Fold5 <tibble [2 × 4]> <tibble [0 × 3]>

What about cross validation?

How do we get the metrics out? With collect_metrics() again!

results <- fit_resamples(lm_spec,
                         mpg ~ horsepower,
                         resamples = Auto_cv)

results |>
  collect_metrics()
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   4.88      5  0.385  Preprocessor1_Model1
2 rsq     standard   0.616     5  0.0220 Preprocessor1_Model1

Application Exercise

05:00
  1. Edit the code below to get the 5-fold cross validation error rate for the following model:

\(mpg = \beta_0 + \beta_1 horsepower + \beta_2 horsepower^2+ \epsilon\)

Auto_cv <- vfold_cv(Auto, v = 5)

results <- fit_resamples(lm_spec,
                         ----,
                         resamples = ---)

results |>
  collect_metrics()
  • What do you think rsq is?

Auto_cv <- vfold_cv(Auto, v = 5)

results <- fit_resamples(lm_spec,
                         mpg ~ horsepower + I(horsepower^2),
                         resamples = Auto_cv)

results |>
  collect_metrics()

Application Exercise

  1. Fit 3 models on the data using 5 fold cross validation:

    \(mpg = \beta_0 + \beta_1 horsepower + \epsilon\)

    \(mpg = \beta_0 + \beta_1 horsepower + \beta_2 horsepower^2+ \epsilon\)

    \(mpg = \beta_0 + \beta_1 horsepower + \beta_2 horsepower^2+ \beta_3 horsepower^3 +\epsilon\)

  2. Collect the metrics from each model, saving the results as results_1, results_2, results_3

  3. Which model is “best”?

08:00