# Cross validation

Dr. D’Agostino McGowan

## Cross validation

### 💡 Big idea

• We have determined that it is sensible to use a test set to calculate metrics like prediction error

Why? How could we do this?

• What if we don’t have a separate data set to test our model on?
• 🎉 We can use resampling methods to estimate the test-set prediction error

## Training error versus test error

What is the difference? Which is typically larger?

• The training error is calculated by using the same observations used to fit the statistical learning model
• The test error is calculated by using a statistical learning method to predict the response of new observations
• The training error rate typically underestimates the true prediction error rate

## Estimating prediction error

• Best case scenario: We have a large data set to test our model on
• This is not always the case!

💡 Let’s instead find a way to estimate the test error by holding out a subset of the training observations from the model-fitting process, then applying the statistical learning method to those held-out observations

## Approach #1: Validation set

• Randomly divide the available set of samples into two parts: a training set and a validation set
• Fit the model on the training set, calculate the prediction error on the validation set

If we have a quantitative outcome, what metric would we use to calculate this test error?

• Often we use Mean Squared Error (MSE)

If we have a qualitative outcome, what metric would we use to calculate this test error?

• Often we use the misclassification rate

## Approach #1: Validation set

$\Large\color{orange}{MSE_{\texttt{test-split}} = \textrm{Ave}_{i\in\texttt{test-split}}[y_i-\hat{f}(x_i)]^2}$

$\Large\color{orange}{Err_{\texttt{test-split}} = \textrm{Ave}_{i\in\texttt{test-split}}I[y_i\neq \mathcal{\hat{C}}(x_i)]}$
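In code, both metrics are simple averages over the held-out observations. A minimal sketch (the arrays here are made-up stand-ins for the true values and the fitted predictions):

```python
import numpy as np

# Hypothetical held-out values and predictions (illustrative numbers only)
y_test = np.array([3.0, 1.5, 2.2])    # true quantitative outcomes
f_hat = np.array([2.8, 1.9, 2.0])     # predictions from a fitted f-hat
labels = np.array(["a", "b", "a"])    # true classes
c_hat = np.array(["a", "a", "a"])     # predictions from a fitted classifier

# Quantitative outcome: average squared error over the test split
mse_test = np.mean((y_test - f_hat) ** 2)

# Qualitative outcome: misclassification rate over the test split
err_test = np.mean(labels != c_hat)
print(mse_test, err_test)
```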

## Approach #1: Validation set

Auto example:

• We have 392 observations.
• Trying to predict mpg from horsepower.
• We can split the data in half and use 196 to fit the model and 196 to test
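A sketch of this split in Python, assuming a cleaned local copy of the ISLR Auto data saved as `Auto.csv` (the file name and its availability are assumptions, not from the slides):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

auto = pd.read_csv("Auto.csv")  # assumed local copy; horsepower already numeric

X = auto[["horsepower"]]
y = auto["mpg"]

# Fit on one half (196 observations), test on the other half
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=196, random_state=1
)
fit = LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_test, fit.predict(X_test)))
```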

## Approach #1: Validation set

[Figure: the data divided into training and test halves, with $MSE_{\texttt{test-split}}$ computed for each of four different random splits]

What if we did this many times?
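A sketch of repeating the random split, here on simulated data (the data-generating step is a stand-in so the example is self-contained):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(392, 1))           # stand-in predictor
y = 2 * X[:, 0] + rng.normal(size=392)  # stand-in response

mses = []
for seed in range(10):
    # A different random half/half split each time
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=seed)
    fit = LinearRegression().fit(X_tr, y_tr)
    mses.append(mean_squared_error(y_te, fit.predict(X_te)))

print(np.round(mses, 3))  # the estimates vary from split to split
```

The spread of these values illustrates the variability described on the next slide.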

## Approach #1: Validation set (Drawbacks)

• The validation estimate of the test error can be highly variable, depending on which observations are included in the training set and which are included in the validation set
• In the validation approach, only a subset of the observations (those that are included in the training set rather than in the validation set) are used to fit the model
• Therefore, the validation set error may tend to overestimate the test error for the model fit on the entire data set

## Approach #2: K-fold cross validation

💡 The idea is to do the following:

• Randomly divide the data into $K$ equal-sized parts
• Leave out part $k$, fit the model to the other $K - 1$ parts (combined)
• Obtain predictions for the left-out $k$th part
• Do this for each part $k = 1, 2, \dots, K$, and then combine the results
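A sketch of these steps by hand with numpy on simulated data (in practice a helper such as `sklearn.model_selection.KFold` does this bookkeeping):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3 * X[:, 0] + rng.normal(size=100)

K = 5
indices = rng.permutation(len(y))    # randomly divide the data...
folds = np.array_split(indices, K)   # ...into K (roughly) equal parts

mse_k = []
for k in range(K):
    test_idx = folds[k]  # leave out part k
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
    fit = LinearRegression().fit(X[train_idx], y[train_idx])  # fit on the rest
    preds = fit.predict(X[test_idx])  # predict the held-out part
    mse_k.append(np.mean((y[test_idx] - preds) ** 2))

print(round(np.mean(mse_k), 3))  # combine the K results
```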

## K-fold cross validation

[Figure: K-fold cross validation with four folds, computing $MSE_{\texttt{test-split-1}}, MSE_{\texttt{test-split-2}}, MSE_{\texttt{test-split-3}}, MSE_{\texttt{test-split-4}}$ on each held-out fold]

Take the mean of the $K$ MSE values

## Application Exercise

If we use 10 folds:

1. What percentage of the training data is used in each analysis for each fold?
2. What percentage of the training data is used in the assessment for each fold?

## Estimating prediction error (quantitative outcome)

• Split the data into $K$ parts, where $C_1, C_2, \dots, C_K$ indicate the indices of observations in part $k$
• $CV_{(K)} = \sum_{k=1}^K\frac{n_k}{n}MSE_k$
• $MSE_k = \sum_{i \in C_k} (y_i - \hat{y}_i)^2/n_k$
• $n_k$ is the number of observations in group $k$
• $\hat{y}_i$ is the fit for observation $i$ obtained from the data with part $k$ removed
• If we set $K = n$, we’d have $n$-fold cross validation, which is the same as leave-one-out cross validation (LOOCV)
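The $n_k/n$ weights only matter when the folds are unequal in size; a tiny numeric sketch (made-up fold sizes and MSEs):

```python
import numpy as np

# Hypothetical per-fold sizes and MSEs for n = 100, K = 3
n_k = np.array([34, 33, 33])
mse_k = np.array([12.0, 15.0, 9.0])

# CV_(K) = sum over k of (n_k / n) * MSE_k
cv_K = np.sum(n_k / n_k.sum() * mse_k)
print(round(cv_K, 3))
```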


## Leave-one-out cross validation

• LOOCV is the special case $K = n$: fit the model $n$ times, each time leaving out a single observation and predicting it with the model fit to the remaining $n - 1$ observations
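A sketch of LOOCV with scikit-learn's `LeaveOneOut` splitter on simulated data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = 1.5 * X[:, 0] + rng.normal(size=50)

errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    # Fit on n - 1 observations, predict the single held-out one
    fit = LinearRegression().fit(X[train_idx], y[train_idx])
    errors.append((y[test_idx][0] - fit.predict(X[test_idx])[0]) ** 2)

print(round(np.mean(errors), 3))  # CV_(n): average of the n squared errors
```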

## Special Case!

• With linear regression, you can actually calculate the LOOCV error without having to iterate!
• $CV_{(n)} = \frac{1}{n}\sum_{i=1}^n\left(\frac{y_i-\hat{y}_i}{1-h_i}\right)^2$
• $\hat{y}_i$ is the $i$th fitted value from the linear model
• $h_i$ is the $i$th diagonal element of the “hat” matrix (remember that! 🎓)
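A numpy sketch of the shortcut on the same kind of simulated data; because the identity is exact for linear regression, it reproduces the explicit LOOCV loop without refitting $n$ times:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1.5 * x + rng.normal(size=50)
X = np.column_stack([np.ones(50), x])  # design matrix with an intercept column

beta = np.linalg.solve(X.T @ X, X.T @ y)        # least-squares fit
y_hat = X @ beta                                # fitted values
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))  # diagonal of the hat matrix

cv_n = np.mean(((y - y_hat) / (1 - h)) ** 2)    # LOOCV error, no iteration
print(round(cv_n, 3))
```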

## Picking $K$

• $K$ can vary from 2 (splitting the data in half each time) to $n$ (LOOCV)
• LOOCV is sometimes useful, but the estimates from each fold are highly correlated, so their average can have high variance
• A better choice tends to be $K=5$ or $K=10$

• Since each training set is only $(K - 1)/K$ as big as the original training set, the estimates of prediction error will typically be biased upward
• This bias is minimized when $K = n$ (LOOCV), but this estimate has a high variance
• $K =5$ or $K=10$ provides a nice compromise for the bias-variance trade-off

## Approach #2: K-fold Cross Validation

Auto example:

• We have 392 observations.
• Trying to predict mpg from horsepower

## Estimating prediction error (qualitative outcome)

• The premise is the same as cross validation for quantitative outcomes
• Split the data into $K$ parts, where $C_1, C_2, \dots, C_K$ indicate the indices of observations in part $k$
• $CV_K = \sum_{k=1}^K\frac{n_k}{n}Err_k$
• $Err_k = \sum_{i\in C_k}I(y_i\neq\hat{y}_i)/n_k$ (misclassification rate)
• $n_k$ is the number of observations in group $k$
• $\hat{y}_i$ is the fit for observation $i$ obtained from the data with the part $k$ removed
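A sketch for the qualitative case on simulated data, with logistic regression standing in for the classifier $\mathcal{\hat{C}}$:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(size=200) > 0).astype(int)

err_k, n_k = [], []
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in folds.split(X, y):
    fit = LogisticRegression().fit(X[train_idx], y[train_idx])
    # Err_k: misclassification rate on the held-out fold
    err_k.append(np.mean(fit.predict(X[test_idx]) != y[test_idx]))
    n_k.append(len(test_idx))

# CV_K = sum over k of (n_k / n) * Err_k
cv_K = np.sum(np.array(n_k) / len(y) * np.array(err_k))
print(round(cv_K, 3))
```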
