Application Exercise
Use lm() to fit a linear regression using tidymodels. Save it as lm_spec and look at the object. What does it return?
Hint: you'll need https://www.tidymodels.org
05:00
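One way to write this specification is sketched below; it assumes the tidymodels meta-package, and both set_mode() and set_engine("lm") are written out even though they are the defaults for linear_reg().

library(tidymodels)

# model specification: a linear regression fit with the lm engine
lm_spec <- linear_reg() |>
  set_mode("regression") |>
  set_engine("lm")

lm_spec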
fit() function
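The output below comes from a call along these lines (a sketch of the standard parsnip fit() pattern; it assumes the Auto data from the ISLR package and the lm_spec defined above):

library(tidymodels)
library(ISLR)   # for the Auto data

lm_spec <- linear_reg() |> set_engine("lm")

# fit the specification with a formula and the Auto data
lm_fit <- fit(lm_spec,
              mpg ~ horsepower,
              data = Auto)
lm_fit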
parsnip model object
Call:
stats::lm(formula = mpg ~ horsepower, data = data)
Coefficients:
(Intercept) horsepower
39.9359 -0.1578
For comparison, the same model fit with base R's lm():
Call:
lm(formula = mpg ~ horsepower, data = Auto)
Coefficients:
(Intercept) horsepower
39.9359 -0.1578
predict() function
new_data has an underscore
# A tibble: 392 × 10
.pred mpg cylinders displacement horsep…¹ weight accel…² year origin name
* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 19.4 18 8 307 130 3504 12 70 1 chev…
2 13.9 15 8 350 165 3693 11.5 70 1 buic…
3 16.3 18 8 318 150 3436 11 70 1 plym…
4 16.3 16 8 304 150 3433 12 70 1 amc …
5 17.8 17 8 302 140 3449 10.5 70 1 ford…
6 8.68 15 8 429 198 4341 10 70 1 ford…
7 5.21 14 8 454 220 4354 9 70 1 chev…
8 6.00 14 8 440 215 4312 8.5 70 1 plym…
9 4.42 14 8 455 225 4425 10 70 1 pont…
10 9.95 15 8 390 190 3850 8.5 70 1 amc …
# … with 382 more rows, and abbreviated variable names ¹horsepower,
# ²acceleration
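A table like the one above can be produced by piping the fitted model into predict() and binding the original data back on (a sketch, assuming the lm_fit and Auto objects from the previous step):

# predict() returns a one-column tibble named .pred;
# bind_cols() attaches the original Auto columns alongside it
lm_fit |>
  predict(new_data = Auto) |>
  bind_cols(Auto)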
What does bind_cols do?
Which column has the predicted values?
Application Exercise
03:00
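One way the RMSE shown below could be computed with yardstick (a sketch; Auto_preds is an illustrative name for the predictions bound to Auto):

# predictions next to the observed values
Auto_preds <- lm_fit |>
  predict(new_data = Auto) |>
  bind_cols(Auto)

# compare observed mpg (truth) to predicted .pred (estimate)
Auto_preds |>
  rmse(truth = mpg, estimate = .pred)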
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 4.89
What is this estimate? (training error? testing error?)
How many observations are in the training set?
How many observations are in the test set?
How many observations are there in total?
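A sketch of the split these questions refer to, using rsample's initial_split(), training(), and testing() (Auto_train and Auto_test are illustrative names; each half of the 392-row data has 196 rows, and the table below shows one of them):

set.seed(100)
# split Auto in half: 196 rows for training, 196 rows for testing
Auto_split <- initial_split(Auto, prop = 0.5)

Auto_train <- training(Auto_split)
Auto_test  <- testing(Auto_split)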
# A tibble: 196 × 9
mpg cylinders displacement horsepower weight accelerat…¹ year origin name
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 19 4 120 88 3270 21.9 76 2 peug…
2 20.5 6 200 95 3155 18.2 78 1 chev…
3 37.2 4 86 65 2019 16.4 80 3 dats…
4 22 6 250 105 3353 14.5 76 1 chev…
5 14 8 351 148 4657 13.5 75 1 ford…
6 14 8 351 153 4129 13 72 1 ford…
7 25 4 97.5 80 2126 17 72 1 dodg…
8 16 8 351 149 4335 14.5 77 1 ford…
9 38 4 91 67 1995 16.2 82 3 dats…
10 31.6 4 120 74 2635 18.3 81 3 mazd…
# … with 186 more rows, and abbreviated variable name ¹acceleration
Application Exercise
06:00
You can use last_fit() and specify the split. It will automatically train the model on the training data from the split. Instead of specifying which metric to calculate (like rmse as before), you can just use collect_metrics() and it will automatically calculate the metrics on the test data from the split.
set.seed(100)
Auto_split <- initial_split(Auto, prop = 0.5)
lm_fit <- last_fit(lm_spec,
                   mpg ~ horsepower,
                   split = Auto_split)

lm_fit |>
  collect_metrics()
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 rmse standard 4.96 Preprocessor1_Model1
2 rsq standard 0.613 Preprocessor1_Model1
Instead of fit(), we will use fit_resamples().
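A sketch of the call that produces output like the one below (Auto_cv and results are illustrative names; the folds come from rsample's vfold_cv()):

set.seed(100)
# five cross-validation folds of the Auto data
Auto_cv <- vfold_cv(Auto, v = 5)

# refit the linear regression specification on each fold
results <- fit_resamples(lm_spec,
                         mpg ~ horsepower,
                         resamples = Auto_cv)
results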
# Resampling results
# 5-fold cross-validation
# A tibble: 5 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [313/79]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]>
2 <split [313/79]> Fold2 <tibble [2 × 4]> <tibble [0 × 3]>
3 <split [314/78]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]>
4 <split [314/78]> Fold4 <tibble [2 × 4]> <tibble [0 × 3]>
5 <split [314/78]> Fold5 <tibble [2 × 4]> <tibble [0 × 3]>
How do we get the metrics out? With collect_metrics() again!
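For example, with the illustrative results object from the sketch above:

results |>
  collect_metrics()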
Application Exercise
05:00
Fit the model \(mpg = \beta_0 + \beta_1 horsepower + \beta_2 horsepower^2 + \epsilon\) using 5-fold cross validation. What do you think the rsq is?
Application Exercise
Fit 3 models on the data using 5-fold cross validation:
\(mpg = \beta_0 + \beta_1 horsepower + \epsilon\)
\(mpg = \beta_0 + \beta_1 horsepower + \beta_2 horsepower^2 + \epsilon\)
\(mpg = \beta_0 + \beta_1 horsepower + \beta_2 horsepower^2 + \beta_3 horsepower^3 + \epsilon\)
Collect the metrics from each model, saving the results as results_1, results_2, results_3 (one approach is sketched after this exercise).
Which model is “best”?
08:00
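One way to set up the three models is sketched below; the squared and cubed terms are written with I() inside the formula, and Auto_cv is the illustrative 5-fold object from the earlier sketch.

results_1 <- fit_resamples(lm_spec, mpg ~ horsepower,
                           resamples = Auto_cv) |>
  collect_metrics()

results_2 <- fit_resamples(lm_spec, mpg ~ horsepower + I(horsepower^2),
                           resamples = Auto_cv) |>
  collect_metrics()

results_3 <- fit_resamples(lm_spec,
                           mpg ~ horsepower + I(horsepower^2) + I(horsepower^3),
                           resamples = Auto_cv) |>
  collect_metrics()

# compare the cross-validated rmse and rsq across the three fits
results_1
results_2
results_3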
Dr. Lucy D’Agostino McGowan, adapted from Alison Hill’s Introduction to ML with the Tidyverse