Application Exercise
Let’s look at a sample of 116 sparrows from Kent Island. We are interested in the relationship between Weight and WingLength.
How can we quantify how much we’d expect the slope to differ from one random sample to another?
How do we interpret this?
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1.37 0.957 1.43 1.56e- 1
2 WingLength 0.467 0.0347 13.5 2.62e-25
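One way to see this sample-to-sample variability directly is to bootstrap: resample the data with replacement many times and refit the model on each resample. A minimal sketch, assuming the Sparrows data (e.g., from the Stat2Data package) is loaded:
library(tidymodels)

set.seed(1)
boot_slopes <- purrr::map_dbl(1:1000, function(i) {
  # resample rows of Sparrows with replacement
  resample <- Sparrows[sample(nrow(Sparrows), replace = TRUE), ]
  # refit the model and pull out the slope
  coef(lm(Weight ~ WingLength, data = resample))[["WingLength"]]
})
sd(boot_slopes)  # close to the reported std.error of about 0.035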
How do we know what values of this statistic are worth paying attention to?
linear_reg() |>
set_engine("lm") |>
fit(Weight ~ WingLength, data = Sparrows) |>
tidy(conf.int = TRUE)
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1.37 0.957 1.43 1.56e- 1 -0.531 3.26
2 WingLength 0.467 0.0347 13.5 2.62e-25 0.399 0.536
Application Exercise
Fit a linear model using the mtcars data frame, predicting miles per gallon (mpg) from weight and horsepower (wt and hp). Examine the output with the tidy() function demonstrated. How do you interpret these values?
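A starter sketch for one possible solution, mirroring the Sparrows code above:
library(tidymodels)

linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ wt + hp, data = mtcars) |>
  tidy()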
How are these statistics distributed under the null hypothesis?
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1.37 0.957 1.43 1.56e- 1
2 WingLength 0.467 0.0347 13.5 2.62e-25
The distribution of test statistics we would expect, given that the null hypothesis \(\beta_1 = 0\) is true, is a t-distribution with \(n - 2\) degrees of freedom.
How can we compare the observed statistic to this distribution under the null?
The p-value is the probability of getting a statistic as extreme or more extreme than the observed test statistic, given that the null hypothesis is true. For an observed statistic of 1.5, is that the proportion of area less than 1.5, the proportion of area greater than 1.5, or the proportion of area greater than 1.5 or less than -1.5? Because “as extreme or more extreme” covers both tails, it is the proportion of area greater than 1.5 or less than -1.5.
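As a check, the p-value in the tidy() output above can be computed from this t-distribution directly. A sketch, using the observed statistic of 13.5 for the WingLength slope and \(n - 2 = 114\) degrees of freedom:
# two-sided tail area under the t-distribution with 114 df
2 * pt(abs(13.5), df = 114, lower.tail = FALSE)
# on the order of 1e-25, matching the p.value in the output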
Application Exercise
Using the model fit to the mtcars data (predicting mpg from wt and hp), calculate the p-value for the coefficient for weight.
If we used the same sampling method to select different samples and computed an interval estimate for each sample, we would expect the true population parameter ( \(\beta_1\) ) to fall within the interval estimates 95% of the time.
\[\hat\beta_1 \pm t^* \times SE_{\hat\beta_1}\]
linear_reg() |>
set_engine("lm") |>
fit(Weight ~ WingLength, data = Sparrows) |>
tidy(conf.int = TRUE)
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1.37 0.957 1.43 1.56e- 1 -0.531 3.26
2 WingLength 0.467 0.0347 13.5 2.62e-25 0.399 0.536
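We can reproduce the WingLength interval endpoints by hand from the formula above. A sketch, plugging in the estimate and standard error from the tidy() output:
n <- 116                         # sample size
t_star <- qt(0.975, df = n - 2)  # critical value for a 95% interval
0.467 + c(-1, 1) * t_star * 0.0347
# about 0.399 and 0.536, matching conf.low and conf.high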
Using the information here, how could I predict a new sparrow’s weight if I knew the wing length was 30?
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1.37 0.957 1.43 1.56e- 1
2 WingLength 0.467 0.0347 13.5 2.62e-25
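One way to answer this is with predict(). A sketch, assuming tidymodels and the Sparrows data are loaded as above:
lm_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(Weight ~ WingLength, data = Sparrows)

# predicted weight for a sparrow with a wing length of 30
predict(lm_fit, new_data = tibble(WingLength = 30))
# by hand from the coefficients: 1.37 + 0.467 * 30, about 15.4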
What is the residual sum of squares again?
\[RSS = \sum(y_i - \hat{y}_i)^2\]
\[TSS = \sum(y_i - \bar{y})^2\]
\(R^2 = 1 - \frac{RSS}{TSS}\) is the proportion of variability in the response explained by the model.
lm_fit <- linear_reg() |>
set_engine("lm") |>
fit(Weight ~ WingLength, data = Sparrows)
lm_fit |>
predict(new_data = Sparrows) |>
bind_cols(Sparrows) |>
rsq(truth = Weight, estimate = .pred)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rsq standard 0.614
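As a check against the definition, a sketch computing \(1 - RSS/TSS\) by hand with lm_fit and Sparrows from above (for a linear model with an intercept evaluated on its own training data, this equals the squared correlation that rsq reports):
preds <- predict(lm_fit, new_data = Sparrows)$.pred
rss <- sum((Sparrows$Weight - preds)^2)                  # residual sum of squares
tss <- sum((Sparrows$Weight - mean(Sparrows$Weight))^2)  # total sum of squares
1 - rss / tss  # about 0.614, matching the rsq output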
Is this the test \(R^2\) or the training \(R^2\)?
Application Exercise
Fit a model using the mtcars data frame, predicting miles per gallon (mpg) from weight and horsepower (wt and hp), using polynomials with 4 degrees of freedom for both. Calculate the training \(R^2\) using the rsq function.
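A starter sketch for one way to set this up, using poly() for the 4-degree-of-freedom polynomial terms:
library(tidymodels)

poly_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ poly(wt, 4) + poly(hp, 4), data = mtcars)

# training R^2: predict on the same data the model was fit to
poly_fit |>
  predict(new_data = mtcars) |>
  bind_cols(mtcars) |>
  rsq(truth = mpg, estimate = .pred)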
Application Exercise
Using the mtcars data, estimate the test \(R^2\) with cross-validation (fit_resamples and collect_metrics). How does this compare to the training \(R^2\) calculated in the previous exercise?
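A starter sketch, assuming 10-fold cross-validation (the fold count is not specified in the exercise):
library(tidymodels)

set.seed(1)
folds <- vfold_cv(mtcars, v = 10)

# fit the model on each analysis set, evaluate on each assessment set
linear_reg() |>
  set_engine("lm") |>
  fit_resamples(mpg ~ poly(wt, 4) + poly(hp, 4), resamples = folds) |>
  collect_metrics()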
Refer to Chapter 3 for more details on these topics if you need a refresher.