Ridge Review
What are we minimizing with Ridge Regression?
- \(RSS + \lambda\sum_{j=1}^p\beta_j^2\)
Ridge Regression
What is the resulting estimate for \(\hat\beta_{ridge}\)?
- \(\hat\beta_{ridge} = (\mathbf{X}^{T}\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\)
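A minimal numpy sketch of this closed form, on simulated data (the \(\lambda\) value here is arbitrary, and in practice the predictors are usually standardized first):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # n = 100 observations, p = 5 predictors
y = X @ np.array([2.0, 0.0, -1.0, 0.5, 0.0]) + rng.normal(size=100)

lam = 1.0                                          # tuning parameter lambda (illustrative)
p = X.shape[1]

# beta_ridge = (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge)
```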
Ridge Review
How is \(\lambda\) determined?
\[RSS + \lambda\sum_{j=1}^p\beta_j^2\]
What is the bias-variance trade-off?
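One way to answer both questions at once: \(\lambda\) is chosen by cross-validation, which searches for the value that best trades increased bias against decreased variance, i.e. the \(\lambda\) with the smallest cross-validation error. A sketch with scikit-learn, where \(\lambda\) is called `alpha` and the data are simulated purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

alphas = np.logspace(-3, 3, 25)          # grid of candidate lambda values
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
model.fit(X, y)

# lambda with the smallest cross-validation error
print(model.named_steps["ridgecv"].alpha_)
```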
Ridge Regression
Pros
- Can be used when \(p > n\) (a short sketch follows this list)
- Can be used to help with multicollinearity
- Will decrease variance (variance shrinks as \(\lambda\) increases)
Cons
- Will have increased bias (compared to least squares)
- Does not help with variable selection (all \(p\) variables stay in the model, even if their \(\beta\) coefficients are very small)
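A quick sketch of why ridge still works when \(p > n\): \(\mathbf{X}^T\mathbf{X}\) is singular, so least squares has no unique solution, but adding \(\lambda\mathbf{I}\) makes the system invertible (simulated data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 50                       # more predictors than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# X^T X is a p x p matrix with rank at most n < p, so it is singular
print(np.linalg.matrix_rank(X.T @ X))          # at most n = 20

# Adding lambda * I makes the matrix invertible, so ridge still has a solution
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge.shape)                         # (50,)
```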
Lasso!
- The lasso is similar to ridge, but it actually drives some \(\beta\) coefficients to 0! (So it helps with variable selection)
- \(RSS + \lambda\sum_{j=1}^p|\beta_j|\)
- We say lasso uses an \(\ell_1\) penalty, ridge uses an \(\ell_2\) penalty
- \(||\beta||_1=\sum|\beta_j|\)
- \(||\beta||_2=\sqrt{\sum\beta_j^2}\), so the ridge penalty \(\sum\beta_j^2\) is the squared norm \(||\beta||_2^2\)
Lasso
- Like Ridge regression, lasso shrinks the coefficients towards 0
- In lasso, the \(\ell_1\) penalty forces some of the coefficient estimates to be exactly zero when the tuning parameter \(\lambda\) is sufficiently large
- Therefore, lasso can be used for variable selection
- The lasso can help create smaller, simpler models
- Choosing \(\lambda\) is again done via cross-validation (see the sketch below)
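A sketch of both points with scikit-learn's `LassoCV`, which picks \(\lambda\) (called `alpha`) by cross-validation and typically leaves many coefficients at exactly zero; the simulated data here have only a few truly informative predictors:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated data where only a few predictors are truly informative
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

# LassoCV picks lambda (alpha) by cross-validation, then refits on all the data
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)

coef = model.named_steps["lassocv"].coef_
print("chosen lambda:", model.named_steps["lassocv"].alpha_)
print("nonzero coefficients:", np.sum(coef != 0), "of", coef.size)
```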
Lasso
Pros
- Can be used when \(p > n\)
- Can be used to help with multicollinearity
- Will decrease variance (variance shrinks as \(\lambda\) increases)
- Can be used for variable selection, since it will make some \(\beta\) coefficients exactly 0
Cons
- Will have increased bias (compared to least squares)
- If \(p>n\) the lasso can select at most \(n\) variables
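A small illustration of the last point, with simulated pure-noise data and an arbitrary \(\lambda\) grid: counting the nonzero lasso coefficients shows that, in this example, the fitted model never uses more than \(n\) of the \(p\) predictors.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 40, 200                      # p much larger than n
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Count nonzero coefficients for a few values of lambda (alpha in scikit-learn)
for lam in [0.5, 0.1, 0.01]:
    coef = Lasso(alpha=lam, max_iter=50000).fit(X, y).coef_
    print(f"lambda={lam}: {np.sum(coef != 0)} nonzero coefficients (n = {n})")
```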
Ridge versus lasso
- Neither Ridge nor lasso will universally dominate the other
- Cross-validation can also be used to determine which method (Ridge or lasso) should be used
- Cross-validation is also used to select \(\lambda\) in either method: you choose the \(\lambda\) value for which the cross-validation error is smallest (see the sketch below)
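A hedged sketch of that comparison: an outer cross-validation loop estimates each method's prediction error while `RidgeCV`/`LassoCV` tune \(\lambda\) internally, and whichever has the smaller error is the better choice for these (simulated) data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=15.0, random_state=0)

alphas = np.logspace(-3, 3, 25)
models = {
    "ridge": make_pipeline(StandardScaler(), RidgeCV(alphas=alphas)),
    "lasso": make_pipeline(StandardScaler(), LassoCV(cv=5)),
}

# Outer cross-validation estimates each method's prediction error;
# the inner CV (inside RidgeCV/LassoCV) picks lambda for each method.
for name, model in models.items():
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"{name}: CV mean squared error = {mse:.1f}")
```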
What if we want to do both?
What is the \(\ell_1\) part of the penalty?
What is the \(\ell_2\) part of the penalty?
Elastic net
\[RSS + \lambda_1\sum_{j=1}^p\beta^2_j+\lambda_2\sum_{j=1}^p|\beta_j|\]
When will this be equivalent to Ridge Regression?
Elastic net
\[RSS + \lambda_1\sum_{j=1}^p\beta^2_j+\lambda_2\sum_{j=1}^p|\beta_j|\]
When will this be equivalent to Lasso?
Elastic net
\[RSS + \lambda_1\sum_{j=1}^p\beta^2_j+\lambda_2\sum_{j=1}^p|\beta_j|\]
- The \(\ell_1\) part of the penalty will generate a sparse model (shrink some \(\beta\) coefficients to exactly 0)
- The \(\ell_2\) part of the penalty removes the limitation on the number of variables selected (can be \(>n\) now)
How do you think \(\lambda_1\) and \(\lambda_2\) are chosen?
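Presumably the same way: cross-validation, now over a two-dimensional grid. In scikit-learn's parameterization the two penalties are expressed as an overall strength `alpha` and a mixing weight `l1_ratio` rather than separate \(\lambda_1\) and \(\lambda_2\); `ElasticNetCV` tunes both by cross-validation. A sketch on simulated \(p > n\) data, with all settings illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=200, n_informative=15,
                       noise=10.0, random_state=0)

# ElasticNetCV cross-validates over the overall penalty strength (alpha) and
# the l1/l2 mix (l1_ratio); l1_ratio=1 is a pure lasso penalty, l1_ratio=0 pure ridge.
enet = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 0.99], cv=5, max_iter=50000),
)
enet.fit(X, y)

fitted = enet.named_steps["elasticnetcv"]
print("chosen alpha:", fitted.alpha_)
print("chosen l1_ratio:", fitted.l1_ratio_)
print("nonzero coefficients:", np.sum(fitted.coef_ != 0))
```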