What are we minimizing with Ridge Regression?

- \(RSS + \lambda\sum_{j=1}^p\beta_j^2\)

What is the resulting estimate for \(\hat\beta_{ridge}\)?

- \(\hat\beta_{ridge} = (\mathbf{X}^{T}\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\)
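This closed-form estimate can be computed directly with numpy; a minimal sketch on a hypothetical toy data set (the data and function name are illustrative, not from the slides):

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# hypothetical toy data: 5 observations, 2 correlated predictors
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

beta_ols = ridge_estimate(X, y, 0.0)     # lam = 0 recovers least squares
beta_ridge = ridge_estimate(X, y, 10.0)  # larger lam shrinks the coefficients
```

Setting \(\lambda = 0\) gives back the least-squares fit; increasing \(\lambda\) shrinks the coefficient vector toward zero.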

Why is this useful?

How is \(\lambda\) determined?

\[RSS + \lambda\sum_{j=1}^p\beta_j^2\]

What is the bias-variance trade-off?

- Can be used when \(p > n\)
- Can be used to help with multicollinearity
- Will decrease variance (as \(\lambda \rightarrow \infty\))

- Will have increased bias (compared to least squares)
- Does not really help with variable selection (all variables are included in *some* regard, even if their \(\beta\) coefficients are really small)

- The lasso is similar to ridge, but it actually drives some \(\beta\) coefficients to 0! (So it helps with variable selection)
- \(RSS + \lambda\sum_{j=1}^p|\beta_j|\)
- We say lasso uses an \(\ell_1\) penalty, ridge uses a (squared) \(\ell_2\) penalty
- \(||\beta||_1=\sum|\beta_j|\)
- \(||\beta||_2^2=\sum\beta_j^2\)
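The two penalties are easy to compute side by side; a small numpy check on a made-up coefficient vector:

```python
import numpy as np

beta = np.array([3.0, -4.0, 0.0])

l1 = np.sum(np.abs(beta))   # l1 penalty: |3| + |-4| + |0| = 7
l2_sq = np.sum(beta ** 2)   # squared l2 penalty: 9 + 16 + 0 = 25
```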

- Like Ridge regression, lasso shrinks the coefficients towards 0
- In lasso, the \(\ell_1\) penalty **forces** some of the coefficient estimates to be **exactly zero** when the tuning parameter \(\lambda\) is sufficiently large
- Therefore, lasso can be used for **variable selection**
- The lasso can help create **smaller, simpler** models
- Choosing \(\lambda\) again is done via cross-validation
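One way to see how the \(\ell_1\) penalty zeroes out coefficients: in the special case of orthonormal predictors, the lasso solution is soft-thresholding of the least-squares estimates (with the \(RSS + \lambda\sum|\beta_j|\) parametrization, the threshold works out to \(\lambda/2\)). A sketch under that assumption:

```python
import numpy as np

def soft_threshold(beta_ols, lam):
    """Lasso solution for orthonormal predictors: shrink each OLS
    coefficient toward 0, setting the small ones exactly to 0."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2, 0.0)

beta_ols = np.array([2.5, -0.3, 1.0, 0.1])
beta_lasso = soft_threshold(beta_ols, 1.0)  # coefficients with |beta| <= 0.5 become exactly 0
```

Ridge, by contrast, rescales every coefficient by the same factor in this setting, so none becomes exactly zero.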

- Can be used when \(p > n\)
- Can be used to help with multicollinearity
- Will decrease variance (as \(\lambda \rightarrow \infty\))
- Can be used for variable selection, since it will make some \(\beta\) coefficients exactly 0

- Will have increased bias (compared to least squares)
- If \(p>n\), the lasso can select **at most** \(n\) variables

- Neither Ridge nor lasso will universally dominate
- Cross-validation can also be used to determine which method (Ridge or lasso) should be used
- Cross-validation is **also** used to select \(\lambda\) in either method. You choose the \(\lambda\) value for which the cross-validation error is the smallest
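The cross-validation recipe for choosing \(\lambda\) can be sketched with a manual k-fold loop over a grid of candidate values (the data, grid, and helper names here are hypothetical, and ridge is used for concreteness):

```python
import numpy as np

def ridge_fit(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_error(X, y, lam, k=5, seed=0):
    """Average validation MSE over k folds for a given lambda."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)          # everything not in this fold
        beta = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((y[fold] - X[fold] @ beta) ** 2))
    return np.mean(errors)

# simulated data; pick the lambda with the smallest cross-validation error
X = np.random.default_rng(1).normal(size=(40, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + np.random.default_rng(2).normal(scale=0.5, size=40)
grid = [0.01, 0.1, 1.0, 10.0]
best_lam = min(grid, key=lambda lam: cv_error(X, y, lam))
```

The same loop works for lasso (swap in a lasso fitter), which is how cross-validation can also arbitrate between the two methods.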

Elastic net!

\(RSS + \lambda_1\sum_{j=1}^p\beta^2_j+\lambda_2\sum_{j=1}^p|\beta_j|\)

What is the \(\ell_1\) part of the penalty?

What is the \(\ell_2\) part of the penalty?

\[RSS + \lambda_1\sum_{j=1}^p\beta^2_j+\lambda_2\sum_{j=1}^p|\beta_j|\]

When will this be equivalent to Ridge Regression?

\[RSS + \lambda_1\sum_{j=1}^p\beta^2_j+\lambda_2\sum_{j=1}^p|\beta_j|\]

When will this be equivalent to Lasso?

\[RSS + \lambda_1\sum_{j=1}^p\beta^2_j+\lambda_2\sum_{j=1}^p|\beta_j|\]

- The \(\ell_1\) part of the penalty will generate a **sparse** model (shrink some \(\beta\) coefficients to exactly 0)
- The \(\ell_2\) part of the penalty removes the limitation on the number of variables selected (can be \(>n\) now)
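The elastic net penalty, and its two special cases, can be written out directly; a small numpy sketch (the function name and coefficient vector are illustrative):

```python
import numpy as np

def elastic_net_penalty(beta, lam1, lam2):
    """Elastic net penalty: lam1 * sum(beta_j^2) + lam2 * sum(|beta_j|).
    lam2 = 0 recovers the ridge penalty; lam1 = 0 recovers the lasso penalty."""
    return lam1 * np.sum(beta ** 2) + lam2 * np.sum(np.abs(beta))

beta = np.array([1.0, -2.0])
penalty = elastic_net_penalty(beta, 0.5, 1.0)  # 0.5 * 5 + 1.0 * 3 = 5.5
```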

How do you think \(\lambda_1\) and \(\lambda_2\) are chosen?

Dr. Lucy D'Agostino McGowan *adapted from slides by Hastie & Tibshirani*