Ridge Review
What are we minimizing with Ridge Regression?
- \(RSS + \lambda\sum_{j=1}^p\beta_j^2\)
Ridge Regression
What is the resulting estimate for \(\hat\beta_{ridge}\)?
- \(\hat\beta_{ridge} = (\mathbf{X}^{T}\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\)
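A minimal numpy sketch of this closed form, on simulated data (the \(\lambda\) value here is arbitrary, and in practice the predictors are usually standardized first):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # n = 100 observations, p = 5 predictors
y = X @ np.array([2.0, 0.0, -1.0, 0.5, 0.0]) + rng.normal(size=100)

lam = 1.0                                          # tuning parameter lambda (illustrative)
p = X.shape[1]

# beta_ridge = (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge)
```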
Ridge Review
How is \(\lambda\) determined?
\[RSS + \lambda\sum_{j=1}^p\beta_j^2\]
What is the bias-variance trade-off?
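One way to answer both questions at once: \(\lambda\) is chosen by cross-validation, which searches for the value that best trades increased bias against decreased variance, i.e. the \(\lambda\) with the smallest cross-validation error. A sketch with scikit-learn, where \(\lambda\) is called `alpha` and the data are simulated purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

alphas = np.logspace(-3, 3, 25)          # grid of candidate lambda values
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
model.fit(X, y)

# lambda with the smallest cross-validation error
print(model.named_steps["ridgecv"].alpha_)
```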
Ridge Regression
Pros
- Can be used when \(p > n\) (a short sketch follows this list)
- Can be used to help with multicollinearity
- Will decrease variance (variance shrinks as \(\lambda\) increases)
Cons
- Will have increased bias (compared to least squares)
- Does not help with variable selection (all \(p\) variables stay in the model, even if their \(\beta\) coefficients are very small)
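A quick sketch of why ridge still works when \(p > n\): \(\mathbf{X}^T\mathbf{X}\) is singular, so least squares has no unique solution, but adding \(\lambda\mathbf{I}\) makes the system invertible (simulated data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 50                       # more predictors than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# X^T X is a p x p matrix with rank at most n < p, so it is singular
print(np.linalg.matrix_rank(X.T @ X))          # at most n = 20

# Adding lambda * I makes the matrix invertible, so ridge still has a solution
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge.shape)                         # (50,)
```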
Lasso!
- The lasso is similar to ridge, but it actually drives some \(\beta\) coefficients to 0! (So it helps with variable selection)
- \(RSS + \lambda\sum_{j=1}^p|\beta_j|\)
- We say lasso uses an \(\ell_1\) penalty, ridge uses an \(\ell_2\) penalty
- \(||\beta||_1=\sum|\beta_j|\)
- \(||\beta||_2=\sqrt{\sum\beta_j^2}\), so the ridge penalty \(\sum\beta_j^2\) is the squared norm \(||\beta||_2^2\)
Lasso
- Like Ridge regression, lasso shrinks the coefficients towards 0
- In lasso, the \(\ell_1\) penalty forces some of the coefficient estimates to be exactly zero when the tuning parameter \(\lambda\) is sufficiently large
- Therefore, lasso can be used for variable selection
- The lasso can help create smaller, simpler models
- Choosing \(\lambda\) is again done via cross-validation (see the sketch below)
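A sketch of both points with scikit-learn's `LassoCV`, which picks \(\lambda\) (called `alpha`) by cross-validation and typically leaves many coefficients at exactly zero; the simulated data here have only a few truly informative predictors:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated data where only a few predictors are truly informative
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

# LassoCV picks lambda (alpha) by cross-validation, then refits on all the data
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)

coef = model.named_steps["lassocv"].coef_
print("chosen lambda:", model.named_steps["lassocv"].alpha_)
print("nonzero coefficients:", np.sum(coef != 0), "of", coef.size)
```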
Lasso
Pros
- Can be used when \(p > n\)
- Can be used to help with multicollinearity
- Will decrease variance (variance shrinks as \(\lambda\) increases)
- Can be used for variable selection, since it will make some \(\beta\) coefficients exactly 0
Cons
- Will have increased bias (compared to least squares)
- If \(p>n\) the lasso can select at most \(n\) variables
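A small illustration of the last point, with simulated pure-noise data and an arbitrary \(\lambda\) grid: counting the nonzero lasso coefficients shows that, in this example, the fitted model never uses more than \(n\) of the \(p\) predictors.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 40, 200                      # p much larger than n
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Count nonzero coefficients for a few values of lambda (alpha in scikit-learn)
for lam in [0.5, 0.1, 0.01]:
    coef = Lasso(alpha=lam, max_iter=50000).fit(X, y).coef_
    print(f"lambda={lam}: {np.sum(coef != 0)} nonzero coefficients (n = {n})")
```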
Ridge versus lasso
- Neither Ridge nor lasso will universally dominate the other
- Cross-validation can also be used to determine which method (Ridge or lasso) should be used
- Cross-validation is also used to select \(\lambda\) in either method: you choose the \(\lambda\) value for which the cross-validation error is smallest (see the sketch below)
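A hedged sketch of that comparison: an outer cross-validation loop estimates each method's prediction error while `RidgeCV`/`LassoCV` tune \(\lambda\) internally, and whichever has the smaller error is the better choice for these (simulated) data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=15.0, random_state=0)

alphas = np.logspace(-3, 3, 25)
models = {
    "ridge": make_pipeline(StandardScaler(), RidgeCV(alphas=alphas)),
    "lasso": make_pipeline(StandardScaler(), LassoCV(cv=5)),
}

# Outer cross-validation estimates each method's prediction error;
# the inner CV (inside RidgeCV/LassoCV) picks lambda for each method.
for name, model in models.items():
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"{name}: CV mean squared error = {mse:.1f}")
```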
What if we want to do both?
What is the \(\ell_1\) part of the penalty?
What is the \(\ell_2\) part of the penalty?
Elastic net
\[RSS + \lambda_1\sum_{j=1}^p\beta^2_j+\lambda_2\sum_{j=1}^p|\beta_j|\]
When will this be equivalent to Ridge Regression?
Elastic net
\[RSS + \lambda_1\sum_{j=1}^p\beta^2_j+\lambda_2\sum_{j=1}^p|\beta_j|\]
When will this be equivalent to Lasso?
Elastic net
\[RSS + \lambda_1\sum_{j=1}^p\beta^2_j+\lambda_2\sum_{j=1}^p|\beta_j|\]
- The \(\ell_1\) part of the penalty will generate a sparse model (shrink some \(\beta\) coefficients to exactly 0)
- The \(\ell_2\) part of the penalty removes the limitation on the number of variables selected (can be \(>n\) now)
How do you think \(\lambda_1\) and \(\lambda_2\) are chosen?
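Presumably the same way: cross-validation, now over a two-dimensional grid. In scikit-learn's parameterization the two penalties are expressed as an overall strength `alpha` and a mixing weight `l1_ratio` rather than separate \(\lambda_1\) and \(\lambda_2\); `ElasticNetCV` tunes both by cross-validation. A sketch on simulated \(p > n\) data, with all settings illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=200, n_informative=15,
                       noise=10.0, random_state=0)

# ElasticNetCV cross-validates over the overall penalty strength (alpha) and
# the l1/l2 mix (l1_ratio); l1_ratio=1 is a pure lasso penalty, l1_ratio=0 pure ridge.
enet = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 0.99], cv=5, max_iter=50000),
)
enet.fit(X, y)

fitted = enet.named_steps["elasticnetcv"]
print("chosen alpha:", fitted.alpha_)
print("chosen l1_ratio:", fitted.l1_ratio_)
print("nonzero coefficients:", np.sum(fitted.coef_ != 0))
```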