tidymodels

Lucy D’Agostino McGowan

`Application Exercise`

Create a new project from this template in RStudio Pro:

https://github.com/sta-363-s23/06-appex.git

Load the packages by running the top chunk of R code

tidymodels

tidymodels.org

tidymodels is an opinionated collection of R packages designed for modeling and statistical analysis.
All packages share an underlying philosophy and a common grammar.

Step 1: Specify the model

Pick the model
Set the engine

Specify the model

linear_reg() |>
  set_engine("lm")

Specify the model

linear_reg() |>
  set_engine("glmnet")

Specify the model

linear_reg() |>
  set_engine("spark")

Specify the model

decision_tree() |>
  set_engine("rpart")

Specify the model

All available models:

tidymodels.org

`Application Exercise`

Write a pipe that creates a model that uses lm() to fit a linear regression using tidymodels. Save it as lm_spec and look at the object. What does it return?

Hint: you’ll need https://www.tidymodels.org

05:00

lm_spec <- 
  linear_reg() |> # Pick linear regression
  set_engine(engine = "lm") # set engine
lm_spec

Linear Regression Model Specification (regression)

Computational engine: lm

Fit the data

You can train your model using the fit() function

fit(lm_spec,
    mpg ~ horsepower,
    data = Auto)

parsnip model object


Call:
stats::lm(formula = mpg ~ horsepower, data = data)

Coefficients:
(Intercept)   horsepower  
    39.9359      -0.1578

`Application Exercise`

Fit the model:

library(ISLR)
lm_fit <- fit(lm_spec,
              mpg ~ horsepower,
              data = Auto)
lm_fit

Does this give the same results as

lm(mpg ~ horsepower, data = Auto)

03:00

lm_fit <- fit(lm_spec,
              mpg ~ horsepower,
              data = Auto)
lm_fit

parsnip model object


Call:
stats::lm(formula = mpg ~ horsepower, data = data)

Coefficients:
(Intercept)   horsepower  
    39.9359      -0.1578

lm(mpg ~ horsepower, data = Auto)


Call:
lm(formula = mpg ~ horsepower, data = Auto)

Coefficients:
(Intercept)   horsepower  
    39.9359      -0.1578

Get predictions

lm_fit |>
  predict(new_data = Auto)

Uses the predict() function
‼️ new_data has an underscore
😄 This automagically creates a data frame

Get predictions

lm_fit |>
  predict(new_data = Auto) |>
  bind_cols(Auto)

# A tibble: 392 × 10
   .pred   mpg cylinders displacement horsep…¹ weight accel…²  year origin name 
 * <dbl> <dbl>     <dbl>        <dbl>    <dbl>  <dbl>   <dbl> <dbl>  <dbl> <fct>
 1 19.4     18         8          307      130   3504    12      70      1 chev…
 2 13.9     15         8          350      165   3693    11.5    70      1 buic…
 3 16.3     18         8          318      150   3436    11      70      1 plym…
 4 16.3     16         8          304      150   3433    12      70      1 amc …
 5 17.8     17         8          302      140   3449    10.5    70      1 ford…
 6  8.68    15         8          429      198   4341    10      70      1 ford…
 7  5.21    14         8          454      220   4354     9      70      1 chev…
 8  6.00    14         8          440      215   4312     8.5    70      1 plym…
 9  4.42    14         8          455      225   4425    10      70      1 pont…
10  9.95    15         8          390      190   3850     8.5    70      1 amc …
# … with 382 more rows, and abbreviated variable names ¹horsepower,
#   ²acceleration

What does bind_cols do?

Get predictions

lm_fit |>
  predict(new_data = Auto) |>
  bind_cols(Auto)

# A tibble: 392 × 10
   .pred   mpg cylinders displacement horsep…¹ weight accel…²  year origin name 
 * <dbl> <dbl>     <dbl>        <dbl>    <dbl>  <dbl>   <dbl> <dbl>  <dbl> <fct>
 1 19.4     18         8          307      130   3504    12      70      1 chev…
 2 13.9     15         8          350      165   3693    11.5    70      1 buic…
 3 16.3     18         8          318      150   3436    11      70      1 plym…
 4 16.3     16         8          304      150   3433    12      70      1 amc …
 5 17.8     17         8          302      140   3449    10.5    70      1 ford…
 6  8.68    15         8          429      198   4341    10      70      1 ford…
 7  5.21    14         8          454      220   4354     9      70      1 chev…
 8  6.00    14         8          440      215   4312     8.5    70      1 plym…
 9  4.42    14         8          455      225   4425    10      70      1 pont…
10  9.95    15         8          390      190   3850     8.5    70      1 amc …
# … with 382 more rows, and abbreviated variable names ¹horsepower,
#   ²acceleration

Which column has the predicted values?

`Application Exercise`

03:00

Edit the code below to add the original data to the predicted data.

mpg_pred <- lm_fit |> 
  predict(new_data = Auto) |> 
  ---

Get predictions

mpg_pred <- lm_fit |>
  predict(new_data = Auto) |>
  bind_cols(Auto)

mpg_pred

# A tibble: 392 × 10
   .pred   mpg cylinders displacement horsep…¹ weight accel…²  year origin name 
 * <dbl> <dbl>     <dbl>        <dbl>    <dbl>  <dbl>   <dbl> <dbl>  <dbl> <fct>
 1 19.4     18         8          307      130   3504    12      70      1 chev…
 2 13.9     15         8          350      165   3693    11.5    70      1 buic…
 3 16.3     18         8          318      150   3436    11      70      1 plym…
 4 16.3     16         8          304      150   3433    12      70      1 amc …
 5 17.8     17         8          302      140   3449    10.5    70      1 ford…
 6  8.68    15         8          429      198   4341    10      70      1 ford…
 7  5.21    14         8          454      220   4354     9      70      1 chev…
 8  6.00    14         8          440      215   4312     8.5    70      1 plym…
 9  4.42    14         8          455      225   4425    10      70      1 pont…
10  9.95    15         8          390      190   3850     8.5    70      1 amc …
# … with 382 more rows, and abbreviated variable names ¹horsepower,
#   ²acceleration

Calculate the error

Root mean square error

mpg_pred |>
  rmse(truth = mpg, estimate = .pred)

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        4.89

What is this estimate? (training error? testing error?)

Validation set approach

Auto_split <- initial_split(Auto, prop = 0.5)
Auto_split

<Training/Testing/Total>
<196/196/392>

How many observations are in the training set?

Validation set approach

Auto_split <- initial_split(Auto, prop = 0.5)
Auto_split

<Training/Testing/Total>
<196/196/392>

How many observations are in the test set?

Validation set approach

Auto_split <- initial_split(Auto, prop = 0.5)
Auto_split

<Training/Testing/Total>
<196/196/392>

How many observations are there in total?

Validation set approach

Auto_split <- initial_split(Auto, prop = 0.5)
Auto_split

<Training/Testing/Total>
<196/196/392>

Extract the training and testing data

training(Auto_split)
testing(Auto_split)

Validation set approach

Auto_train <- training(Auto_split)

Auto_train

# A tibble: 196 × 9
     mpg cylinders displacement horsepower weight accelerat…¹  year origin name 
   <dbl>     <dbl>        <dbl>      <dbl>  <dbl>       <dbl> <dbl>  <dbl> <fct>
 1  19           4        120           88   3270        21.9    76      2 peug…
 2  20.5         6        200           95   3155        18.2    78      1 chev…
 3  37.2         4         86           65   2019        16.4    80      3 dats…
 4  22           6        250          105   3353        14.5    76      1 chev…
 5  14           8        351          148   4657        13.5    75      1 ford…
 6  14           8        351          153   4129        13      72      1 ford…
 7  25           4         97.5         80   2126        17      72      1 dodg…
 8  16           8        351          149   4335        14.5    77      1 ford…
 9  38           4         91           67   1995        16.2    82      3 dats…
10  31.6         4        120           74   2635        18.3    81      3 mazd…
# … with 186 more rows, and abbreviated variable name ¹acceleration

`Application Exercise`

Copy the code below, fill in the blanks to fit a model on the training data then calculate the test RMSE.

set.seed(100)
Auto_split  <- ________
Auto_train  <- ________
Auto_test   <- ________
lm_fit      <- fit(lm_spec, 
                   mpg ~ horsepower, 
                   data = ________)
mpg_pred  <- ________ |> 
  predict(new_data = ________) |> 
  bind_cols(________)
rmse(________, truth = ________, estimate = ________)

06:00

A faster way!

You can use last_fit() and specify the split
This will automatically train the data on the train data from the split
Instead of specifying which metric to calculate (with rmse as before) you can just use collect_metrics() and it will automatically calculate the metrics on the test data from the split

A faster way!

set.seed(100)

Auto_split <- initial_split(Auto, prop = 0.5)
lm_fit <- last_fit(lm_spec,
                   mpg ~ horsepower,
                   split = Auto_split) 

lm_fit |>
  collect_metrics()

# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       4.96  Preprocessor1_Model1
2 rsq     standard       0.613 Preprocessor1_Model1

A faster way!

set.seed(100)

Auto_split <- initial_split(Auto, prop = 0.5)
lm_fit <- last_fit(lm_spec,
                   mpg ~ horsepower,
                   split = Auto_split) 

lm_fit |>
  collect_metrics()

# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       4.96  Preprocessor1_Model1
2 rsq     standard       0.613 Preprocessor1_Model1

A faster way!

set.seed(100)

Auto_split <- initial_split(Auto, prop = 0.5)
lm_fit <- last_fit(lm_spec,
                   mpg ~ horsepower,
                   split = Auto_split) 

lm_fit |>
  collect_metrics()

# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       4.96  Preprocessor1_Model1
2 rsq     standard       0.613 Preprocessor1_Model1

What about cross validation?

Auto_cv <- vfold_cv(Auto, v = 5)
Auto_cv

#  5-fold cross-validation 
# A tibble: 5 × 2
  splits           id   
  <list>           <chr>
1 <split [313/79]> Fold1
2 <split [313/79]> Fold2
3 <split [314/78]> Fold3
4 <split [314/78]> Fold4
5 <split [314/78]> Fold5

What about cross validation?

Instead of fit we will use fit_resamples

fit_resamples(lm_spec, 
              mpg ~ horsepower,
              resamples = Auto_cv)

What about cross validation?

fit_resamples(lm_spec, 
              mpg ~ horsepower,
              resamples = Auto_cv)

What about cross validation?

fit_resamples(lm_spec, 
              mpg ~ horsepower,
              resamples = Auto_cv)

What about cross validation?

fit_resamples(lm_spec, 
              mpg ~ horsepower,
              resamples = Auto_cv)

What about cross validation?

fit_resamples(lm_spec,
              mpg ~ horsepower,
              resamples = Auto_cv)

# Resampling results
# 5-fold cross-validation 
# A tibble: 5 × 4
  splits           id    .metrics         .notes          
  <list>           <chr> <list>           <list>          
1 <split [313/79]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]>
2 <split [313/79]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]>
3 <split [314/78]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]>
4 <split [314/78]> Fold4 <tibble [2 × 4]> <tibble [0 × 3]>
5 <split [314/78]> Fold5 <tibble [2 × 4]> <tibble [0 × 3]>

What about cross validation?

How do we get the metrics out? With collect_metrics() again!

results <- fit_resamples(lm_spec,
                         mpg ~ horsepower,
                         resamples = Auto_cv)

results |>
  collect_metrics()

# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   4.88      5  0.385  Preprocessor1_Model1
2 rsq     standard   0.616     5  0.0220 Preprocessor1_Model1

`Application Exercise`

05:00

Edit the code below to get the 5-fold cross validation error rate for the following model:

\(mpg = \beta_0 + \beta_1 horsepower + \beta_2 horsepower^2+ \epsilon\)

Auto_cv <- vfold_cv(Auto, v = 5)

results <- fit_resamples(lm_spec,
                         ----,
                         resamples = ---)

results |>
  collect_metrics()

What do you think rsq is?

Auto_cv <- vfold_cv(Auto, v = 5)

results <- fit_resamples(lm_spec,
                         mpg ~ horsepower + I(horsepower^2),
                         resamples = Auto_cv)

results |>
  collect_metrics()

`Application Exercise`

Fit 3 models on the data using 5 fold cross validation:

\(mpg = \beta_0 + \beta_1 horsepower + \epsilon\)

\(mpg = \beta_0 + \beta_1 horsepower + \beta_2 horsepower^2+ \epsilon\)

\(mpg = \beta_0 + \beta_1 horsepower + \beta_2 horsepower^2+ \beta_3 horsepower^3 +\epsilon\)
Collect the metrics from each model, saving the results as results_1, results_2, results_3
Which model is “best”?

08:00