Lasso and Elastic Net

Ridge Review

What are we minimizing with Ridge Regression?

  • \(RSS + \lambda\sum_{j=1}^p\beta_j^2\)

Ridge Regression

What is the resulting estimate for \(\hat\beta_{ridge}\)?

  • \(\hat\beta_{ridge} = (\mathbf{X}^{T}\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\)
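
This estimate is easy to verify numerically. Below is a minimal sketch in NumPy on simulated data, with an arbitrary choice of \(\lambda = 1\) (all variable names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam = 1.0  # arbitrary tuning parameter lambda
# Closed-form ridge estimate: (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge)  # close to beta_true, shrunk slightly toward 0
```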

Ridge Regression

Why is this useful?

Ridge Review

How is \(\lambda\) determined?

\[RSS + \lambda\sum_{j=1}^p\beta_j^2\]

What is the bias-variance trade-off?
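
In practice, \(\lambda\) is chosen by cross-validation: fit the model over a grid of candidate \(\lambda\) values and keep the one with the smallest cross-validation error. A minimal sketch with scikit-learn's RidgeCV, reusing the simulated data above (the grid is an arbitrary choice; scikit-learn calls \(\lambda\) "alpha"):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

lambdas = np.logspace(-3, 3, 50)    # candidate lambda values (arbitrary grid)
ridge_cv = RidgeCV(alphas=lambdas)  # scikit-learn's name for lambda is alpha
ridge_cv.fit(X, y)                  # X, y simulated as in the sketch above
print(ridge_cv.alpha_)              # lambda with the smallest cross-validation error
```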

Ridge Regression

Pros

  • Can be used when \(p > n\) (see the sketch after this list)
  • Can be used to help with multicollinearity
  • Will decrease variance (variance shrinks as \(\lambda\) increases)

Cons

  • Will have increased bias (compared to least squares)
  • Does not perform variable selection: every variable stays in the model, even if its \(\beta\) coefficient is shrunk very close to 0
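
One way to see the \(p > n\) point: when \(p > n\), \(\mathbf{X}^T\mathbf{X}\) is singular, so the least squares estimate is not unique, but \(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}\) is invertible for any \(\lambda > 0\). A quick rank check on simulated data (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 50                     # more predictors than observations
X_wide = rng.normal(size=(n, p))
G = X_wide.T @ X_wide
print(np.linalg.matrix_rank(G))                    # at most n = 20 < p: singular
print(np.linalg.matrix_rank(G + 1.0 * np.eye(p)))  # full rank p = 50: invertible
```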

Lasso!

  • The lasso is similar to ridge, but it can drive some \(\beta\) coefficients exactly to 0, so it helps with variable selection!
  • \(RSS + \lambda\sum_{j=1}^p|\beta_j|\)
  • We say lasso uses an \(\ell_1\) penalty, ridge uses an \(\ell_2\) penalty
  • \(||\beta||_1=\sum|\beta_j|\)
  • \(||\beta||_2=\sqrt{\sum\beta_j^2}\), so the ridge penalty is \(\lambda||\beta||_2^2\) and the lasso penalty is \(\lambda||\beta||_1\) (see the numeric check below)
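
A tiny numeric check of the two penalties (the coefficient vector here is made up):

```python
import numpy as np

beta = np.array([2.0, -1.0, 0.0, 0.5])
print(np.abs(beta).sum())   # ||beta||_1 = 3.5    (lasso penalty: lambda * this)
print((beta ** 2).sum())    # ||beta||_2^2 = 5.25 (ridge penalty: lambda * this)
```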

Lasso

  • Like Ridge regression, lasso shrinks the coefficients towards 0
  • In lasso, the \(\ell_1\) penalty forces some of the coefficient estimates to be exactly zero when the tuning parameter \(\lambda\) is sufficiently large
  • Therefore, lasso can be used for variable selection
  • The lasso can help create smaller, simpler models
  • Choosing \(\lambda\) is again done via cross-validation (see the sketch below)
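
A minimal sketch of lasso as a variable-selection tool, using scikit-learn's LassoCV on simulated data where only the first three predictors truly matter (all names and sizes are illustrative):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]   # only the first 3 predictors matter
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lasso = LassoCV(cv=5).fit(X, y)    # lambda chosen by 5-fold cross-validation
print(lasso.alpha_)                # the selected lambda
print(lasso.coef_)                 # most noise coefficients are exactly 0
```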

Lasso

Pros

  • Can be used when \(p > n\)
  • Can be used to help with multicollinearity
  • Will decrease variance (variance shrinks as \(\lambda\) increases)
  • Can be used for variable selection, since it will make some \(\beta\) coefficients exactly 0

Cons

  • Will have increased bias (compared to least squares)
  • If \(p>n\) the lasso can select at most \(n\) variables

Ridge versus lasso

  • Neither ridge nor lasso universally dominates the other
  • Cross-validation can be used to decide which method (ridge or lasso) to use
  • Cross-validation is also used to select \(\lambda\) in either method: you choose the \(\lambda\) value for which the cross-validation error is smallest (see the sketch below)
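
A sketch of that comparison, reusing the simulated data above: cross-validate each method over the same grid of \(\lambda\) values and compare the best cross-validation errors (the grid and fold count are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV

grid = {"alpha": np.logspace(-3, 2, 30)}  # candidate lambdas (sklearn's "alpha")

ridge_search = GridSearchCV(Ridge(), grid,
                            scoring="neg_mean_squared_error", cv=5).fit(X, y)
lasso_search = GridSearchCV(Lasso(max_iter=10_000), grid,
                            scoring="neg_mean_squared_error", cv=5).fit(X, y)

# Keep whichever method (with its best lambda) has the smaller CV error.
print("ridge CV MSE:", -ridge_search.best_score_)
print("lasso CV MSE:", -lasso_search.best_score_)
```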

What if we want to do both?

  • Elastic net!

  • \(RSS + \lambda_1\sum_{j=1}^p\beta^2_j+\lambda_2\sum_{j=1}^p|\beta_j|\)

What is the \(\ell_1\) part of the penalty?

What is the \(\ell_2\) part of the penalty?

Elastic net

\[RSS + \lambda_1\sum_{j=1}^p\beta^2_j+\lambda_2\sum_{j=1}^p|\beta_j|\]

When will this be equivalent to Ridge Regression?

Elastic net

\[RSS + \lambda_1\sum_{j=1}^p\beta^2_j+\lambda_2\sum_{j=1}^p|\beta_j|\]

When will this be equivalent to Lasso?

Elastic Net

\[RSS + \lambda_1\sum_{j=1}^p\beta^2_j+\lambda_2\sum_{j=1}^p|\beta_j|\]

  • The \(\ell_1\) part of the penalty will generate a sparse model (shrink some \(\beta\) coefficients to exactly 0)
  • The \(\ell_2\) part of the penalty removes lasso's limitation on the number of selected variables (more than \(n\) can now be selected when \(p > n\))

How do you think \(\lambda_1\) and \(\lambda_2\) are chosen?
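
Both are again chosen by cross-validation, now over a two-dimensional grid. A sketch with scikit-learn's ElasticNetCV, reusing the simulated data above; note scikit-learn parameterizes the penalty with an overall strength alpha and a mixing weight l1_ratio (and rescales the RSS), so searching over (alpha, l1_ratio) plays the role of searching over \((\lambda_1, \lambda_2)\):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

enet = ElasticNetCV(
    l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0],  # candidate l1/l2 mixes (arbitrary)
    alphas=np.logspace(-3, 1, 30),             # candidate overall penalty strengths
    cv=5,
).fit(X, y)                         # X, y simulated as in the lasso sketch above
print(enet.alpha_, enet.l1_ratio_)  # combination with the smallest CV error
print(enet.coef_)                   # sparse like lasso, but can keep > n variables
```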