# tidymodels

Lucy D’Agostino McGowan

## Application Exercise

1. Create a new project from this template in RStudio Pro:
https://github.com/sta-363-s23/06-appex.git
1. Load the packages by running the top chunk of R code

## tidymodels

tidymodels.org

• tidymodels is an opinionated collection of R packages designed for modeling and statistical analysis.
• All packages share an underlying philosophy and a common grammar.

## Step 1: Specify the model

• Pick the model
• Set the engine

## Specify the model

linear_reg() |>
set_engine("lm")

## Specify the model

linear_reg() |>
set_engine("glmnet")

## Specify the model

linear_reg() |>
set_engine("spark")

## Specify the model

decision_tree() |>
set_engine("rpart")

## Specify the model

• All available models:

tidymodels.org

## Application Exercise

1. Write a pipe that creates a model that uses lm() to fit a linear regression using tidymodels. Save it as lm_spec and look at the object. What does it return?

Hint: you’ll need https://www.tidymodels.org

05:00
lm_spec <-
linear_reg() |> # Pick linear regression
set_engine(engine = "lm") # set engine
lm_spec
Linear Regression Model Specification (regression)

Computational engine: lm 

## Fit the data

• You can train your model using the fit() function
fit(lm_spec,
mpg ~ horsepower,
data = Auto)
parsnip model object

Call:
stats::lm(formula = mpg ~ horsepower, data = data)

Coefficients:
(Intercept)   horsepower
39.9359      -0.1578  

## Application Exercise

1. Fit the model:
library(ISLR)
lm_fit <- fit(lm_spec,
mpg ~ horsepower,
data = Auto)
lm_fit

Does this give the same results as

lm(mpg ~ horsepower, data = Auto)
03:00
lm_fit <- fit(lm_spec,
mpg ~ horsepower,
data = Auto)
lm_fit
parsnip model object

Call:
stats::lm(formula = mpg ~ horsepower, data = data)

Coefficients:
(Intercept)   horsepower
39.9359      -0.1578  
lm(mpg ~ horsepower, data = Auto)

Call:
lm(formula = mpg ~ horsepower, data = Auto)

Coefficients:
(Intercept)   horsepower
39.9359      -0.1578  

## Get predictions

lm_fit |>
predict(new_data = Auto)
• Uses the predict() function
• ‼️ new_data has an underscore
• 😄 This automagically creates a data frame

## Get predictions

lm_fit |>
predict(new_data = Auto) |>
bind_cols(Auto)
# A tibble: 392 × 10
.pred   mpg cylinders displacement horsep…¹ weight accel…²  year origin name
* <dbl> <dbl>     <dbl>        <dbl>    <dbl>  <dbl>   <dbl> <dbl>  <dbl> <fct>
1 19.4     18         8          307      130   3504    12      70      1 chev…
2 13.9     15         8          350      165   3693    11.5    70      1 buic…
3 16.3     18         8          318      150   3436    11      70      1 plym…
4 16.3     16         8          304      150   3433    12      70      1 amc …
5 17.8     17         8          302      140   3449    10.5    70      1 ford…
6  8.68    15         8          429      198   4341    10      70      1 ford…
7  5.21    14         8          454      220   4354     9      70      1 chev…
8  6.00    14         8          440      215   4312     8.5    70      1 plym…
9  4.42    14         8          455      225   4425    10      70      1 pont…
10  9.95    15         8          390      190   3850     8.5    70      1 amc …
# … with 382 more rows, and abbreviated variable names ¹​horsepower,
#   ²​acceleration

What does bind_cols do?

## Get predictions

lm_fit |>
predict(new_data = Auto) |>
bind_cols(Auto)
# A tibble: 392 × 10
.pred   mpg cylinders displacement horsep…¹ weight accel…²  year origin name
* <dbl> <dbl>     <dbl>        <dbl>    <dbl>  <dbl>   <dbl> <dbl>  <dbl> <fct>
1 19.4     18         8          307      130   3504    12      70      1 chev…
2 13.9     15         8          350      165   3693    11.5    70      1 buic…
3 16.3     18         8          318      150   3436    11      70      1 plym…
4 16.3     16         8          304      150   3433    12      70      1 amc …
5 17.8     17         8          302      140   3449    10.5    70      1 ford…
6  8.68    15         8          429      198   4341    10      70      1 ford…
7  5.21    14         8          454      220   4354     9      70      1 chev…
8  6.00    14         8          440      215   4312     8.5    70      1 plym…
9  4.42    14         8          455      225   4425    10      70      1 pont…
10  9.95    15         8          390      190   3850     8.5    70      1 amc …
# … with 382 more rows, and abbreviated variable names ¹​horsepower,
#   ²​acceleration

Which column has the predicted values?

## Application Exercise

03:00
1. Edit the code below to add the original data to the predicted data.
mpg_pred <- lm_fit |>
predict(new_data = Auto) |>
---

## Get predictions

mpg_pred <- lm_fit |>
predict(new_data = Auto) |>
bind_cols(Auto)

mpg_pred
# A tibble: 392 × 10
.pred   mpg cylinders displacement horsep…¹ weight accel…²  year origin name
* <dbl> <dbl>     <dbl>        <dbl>    <dbl>  <dbl>   <dbl> <dbl>  <dbl> <fct>
1 19.4     18         8          307      130   3504    12      70      1 chev…
2 13.9     15         8          350      165   3693    11.5    70      1 buic…
3 16.3     18         8          318      150   3436    11      70      1 plym…
4 16.3     16         8          304      150   3433    12      70      1 amc …
5 17.8     17         8          302      140   3449    10.5    70      1 ford…
6  8.68    15         8          429      198   4341    10      70      1 ford…
7  5.21    14         8          454      220   4354     9      70      1 chev…
8  6.00    14         8          440      215   4312     8.5    70      1 plym…
9  4.42    14         8          455      225   4425    10      70      1 pont…
10  9.95    15         8          390      190   3850     8.5    70      1 amc …
# … with 382 more rows, and abbreviated variable names ¹​horsepower,
#   ²​acceleration

## Calculate the error

• Root mean square error
mpg_pred |>
rmse(truth = mpg, estimate = .pred)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr>   <chr>          <dbl>
1 rmse    standard        4.89

What is this estimate? (training error? testing error?)

## Validation set approach

Auto_split <- initial_split(Auto, prop = 0.5)
Auto_split
<Training/Testing/Total>
<196/196/392>

How many observations are in the training set?

## Validation set approach

Auto_split <- initial_split(Auto, prop = 0.5)
Auto_split
<Training/Testing/Total>
<196/196/392>

How many observations are in the test set?

## Validation set approach

Auto_split <- initial_split(Auto, prop = 0.5)
Auto_split
<Training/Testing/Total>
<196/196/392>

How many observations are there in total?

## Validation set approach

Auto_split <- initial_split(Auto, prop = 0.5)
Auto_split
<Training/Testing/Total>
<196/196/392>
• Extract the training and testing data
training(Auto_split)
testing(Auto_split)

## Validation set approach

Auto_train <- training(Auto_split)
Auto_train
# A tibble: 196 × 9
mpg cylinders displacement horsepower weight accelerat…¹  year origin name
<dbl>     <dbl>        <dbl>      <dbl>  <dbl>       <dbl> <dbl>  <dbl> <fct>
1  19           4        120           88   3270        21.9    76      2 peug…
2  20.5         6        200           95   3155        18.2    78      1 chev…
3  37.2         4         86           65   2019        16.4    80      3 dats…
4  22           6        250          105   3353        14.5    76      1 chev…
5  14           8        351          148   4657        13.5    75      1 ford…
6  14           8        351          153   4129        13      72      1 ford…
7  25           4         97.5         80   2126        17      72      1 dodg…
8  16           8        351          149   4335        14.5    77      1 ford…
9  38           4         91           67   1995        16.2    82      3 dats…
10  31.6         4        120           74   2635        18.3    81      3 mazd…
# … with 186 more rows, and abbreviated variable name ¹​acceleration

## Application Exercise

1. Copy the code below, fill in the blanks to fit a model on the training data then calculate the test RMSE.
set.seed(100)
Auto_split  <- ________
Auto_train  <- ________
Auto_test   <- ________
lm_fit      <- fit(lm_spec,
mpg ~ horsepower,
data = ________)
mpg_pred  <- ________ |>
predict(new_data = ________) |>
bind_cols(________)
rmse(________, truth = ________, estimate = ________)
06:00

## A faster way!

• You can use last_fit() and specify the split
• This will automatically train the data on the train data from the split
• Instead of specifying which metric to calculate (with rmse as before) you can just use collect_metrics() and it will automatically calculate the metrics on the test data from the split

## A faster way!

set.seed(100)

Auto_split <- initial_split(Auto, prop = 0.5)
lm_fit <- last_fit(lm_spec,
mpg ~ horsepower,
split = Auto_split)

lm_fit |>
collect_metrics()
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr>   <chr>          <dbl> <chr>
1 rmse    standard       4.96  Preprocessor1_Model1
2 rsq     standard       0.613 Preprocessor1_Model1

## A faster way!

set.seed(100)

Auto_split <- initial_split(Auto, prop = 0.5)
lm_fit <- last_fit(lm_spec,
mpg ~ horsepower,
split = Auto_split)

lm_fit |>
collect_metrics()
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr>   <chr>          <dbl> <chr>
1 rmse    standard       4.96  Preprocessor1_Model1
2 rsq     standard       0.613 Preprocessor1_Model1

## A faster way!

set.seed(100)

Auto_split <- initial_split(Auto, prop = 0.5)
lm_fit <- last_fit(lm_spec,
mpg ~ horsepower,
split = Auto_split)

lm_fit |>
collect_metrics()
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr>   <chr>          <dbl> <chr>
1 rmse    standard       4.96  Preprocessor1_Model1
2 rsq     standard       0.613 Preprocessor1_Model1

Auto_cv <- vfold_cv(Auto, v = 5)
Auto_cv
#  5-fold cross-validation
# A tibble: 5 × 2
splits           id
<list>           <chr>
1 <split [313/79]> Fold1
2 <split [313/79]> Fold2
3 <split [314/78]> Fold3
4 <split [314/78]> Fold4
5 <split [314/78]> Fold5

• Instead of fit we will use fit_resamples
fit_resamples(lm_spec,
mpg ~ horsepower,
resamples = Auto_cv) 

fit_resamples(lm_spec,
mpg ~ horsepower,
resamples = Auto_cv) 

fit_resamples(lm_spec,
mpg ~ horsepower,
resamples = Auto_cv) 

fit_resamples(lm_spec,
mpg ~ horsepower,
resamples = Auto_cv) 

fit_resamples(lm_spec,
mpg ~ horsepower,
resamples = Auto_cv)
# Resampling results
# 5-fold cross-validation
# A tibble: 5 × 4
splits           id    .metrics         .notes
<list>           <chr> <list>           <list>
1 <split [313/79]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]>
2 <split [313/79]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]>
3 <split [314/78]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]>
4 <split [314/78]> Fold4 <tibble [2 × 4]> <tibble [0 × 3]>
5 <split [314/78]> Fold5 <tibble [2 × 4]> <tibble [0 × 3]>

How do we get the metrics out? With collect_metrics() again!

results <- fit_resamples(lm_spec,
mpg ~ horsepower,
resamples = Auto_cv)

results |>
collect_metrics()
# A tibble: 2 × 6
.metric .estimator  mean     n std_err .config
<chr>   <chr>      <dbl> <int>   <dbl> <chr>
1 rmse    standard   4.88      5  0.385  Preprocessor1_Model1
2 rsq     standard   0.616     5  0.0220 Preprocessor1_Model1

## Application Exercise

05:00
1. Edit the code below to get the 5-fold cross validation error rate for the following model:

$mpg = \beta_0 + \beta_1 horsepower + \beta_2 horsepower^2+ \epsilon$

Auto_cv <- vfold_cv(Auto, v = 5)

results <- fit_resamples(lm_spec,
----,
resamples = ---)

results |>
collect_metrics()
• What do you think rsq is?
Auto_cv <- vfold_cv(Auto, v = 5)

results <- fit_resamples(lm_spec,
mpg ~ horsepower + I(horsepower^2),
resamples = Auto_cv)

results |>
collect_metrics()

## Application Exercise

1. Fit 3 models on the data using 5 fold cross validation:

$mpg = \beta_0 + \beta_1 horsepower + \epsilon$

$mpg = \beta_0 + \beta_1 horsepower + \beta_2 horsepower^2+ \epsilon$

$mpg = \beta_0 + \beta_1 horsepower + \beta_2 horsepower^2+ \beta_3 horsepower^3 +\epsilon$

2. Collect the metrics from each model, saving the results as results_1, results_2, results_3

3. Which model is “best”?

08:00