Ridge Regression

Linear Regression Review

In linear regression, what are we minimizing? How can I write this in matrix form?

  • RSS!
  • \((\mathbf{y} - \mathbf{X}\hat\beta)^T(\mathbf{y}-\mathbf{X}\hat\beta)\)
  • What is the solution ( \(\hat\beta\) ) to this?
  • \(\mathbf{(X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)

Linear Regression Review

What is \(\mathbf{X}\)?

  • the design matrix!

Linear Regression Review

How did we get \(\mathbf{(X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)?

Take the derivative of the RSS with respect to \(\hat\beta\), set it equal to zero, and solve:

\[ \begin{align} -2\mathbf{X}^T\mathbf{y}+2\mathbf{X}^T\mathbf{X}\hat\beta &= 0\\ 2\mathbf{X}^T\mathbf{X}\hat\beta & = 2\mathbf{X}^T\mathbf{y} \\ \mathbf{X}^T\mathbf{X}\hat\beta & =\mathbf{X}^T\mathbf{y} \\ (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\hat\beta &=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\ \underbrace{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}}_{\mathbf{I}}\hat\beta &=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\ \mathbf{I}\hat\beta &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\ \hat\beta & = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \end{align} \]

Linear Regression Review

Let’s try to find an \(\mathbf{X}\) for which it would be impossible to calculate \(\hat\beta\)

Calculating in R

y x
4 1
3 2
1 5
3 1
5 5

Creating a vector in R

y x
4 1
3 2
1 5
3 1
5 5
y <- c(4, 3, 1, 3, 5)

Creating a Design matrix in R

y x
4 1
3 2
1 5
3 1
5 5
(X <- matrix(c(rep(1, 5), 
               c(1, 2, 5, 1, 5)),
             ncol = 2))
     [,1] [,2]
[1,]    1    1
[2,]    1    2
[3,]    1    5
[4,]    1    1
[5,]    1    5
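
Equivalently (an aside, not in the original code), cbind() can build the same design matrix by recycling the scalar 1 into the intercept column:

x <- c(1, 2, 5, 1, 5)
X <- cbind(1, x)  # the 1 is recycled to length 5, forming the intercept column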

Taking a transpose in R

t(X)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    1    1
[2,]    1    2    5    1    5

Taking an inverse in R

XTX <- t(X) %*% X
solve(XTX)
           [,1]        [,2]
[1,]  0.6666667 -0.16666667
[2,] -0.1666667  0.05952381

Put it all together

solve(t(X) %*% X) %*% t(X) %*% y
           [,1]
[1,]  3.5000000
[2,] -0.1071429
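
As a sanity check (reusing the y and X defined above), lm() gives the same coefficients as the matrix algebra:

coef(lm(y ~ X[, 2]))  # intercept 3.5, slope -0.1071429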

Application Exercise

In R, find a design matrix X where it is not possible to calculate \(\hat\beta\)

solve(t(X) %*% X) %*% t(X) %*% y
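One possible answer (a sketch; any design with linearly dependent columns works):

y <- c(4, 3, 1, 3, 5)

# Make the third column an exact copy of the second, so the columns are
# linearly dependent and t(X) %*% X cannot be inverted
X <- matrix(c(rep(1, 5),
              c(1, 2, 5, 1, 5),
              c(1, 2, 5, 1, 5)), ncol = 3)

solve(t(X) %*% X) %*% t(X) %*% y  # solve() throws an error: the matrix is singular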

Estimating \(\hat\beta\)

\(\hat\beta = \mathbf{(X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)

Under what circumstances is this equation not estimable?

  • when we can’t invert \(\mathbf{X}^T\mathbf{X}\)
  • \(p > n\)
  • multicollinearity
  • A guaranteed way to check whether a square matrix is not invertible is to check whether the determinant is equal to zero

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix}1 & 2 & 3 & 1 \\ 1 & 3 & 4& 0 \end{bmatrix}\]

What is \(n\) here? What is \(p\)?

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix}1 & 2 & 3 & 1 \\ 1 & 3 & 4& 0 \end{bmatrix}\]

Is \(\mathbf{X}^T\mathbf{X}\) going to be invertible?

X <- matrix(c(1, 1, 2, 3, 3, 4, 1, 0), nrow = 2)
det(t(X) %*% X)
[1] 0

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix} 1 & 3 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}\]

Is \(\mathbf{X}^T\mathbf{X}\) going to be invertible?

X <- matrix(c(1, 1, 1, 1, 3, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
det(t(X) %*% X)
[1] 0
cor(X[, 2], X[, 3])
[1] 1

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix} 1 & 3 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}\]

What was the problem this time?

X <- matrix(c(1, 1, 1, 1, 3, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
det(t(X) %*% X)
[1] 0
cor(X[, 2], X[, 3])
[1] 1

Estimating \(\hat\beta\)

What is a sure-fire way to tell whether \(\mathbf{X}^T\mathbf{X}\) will be invertible?

  • Take the determinant!
  • \(|\mathbf{A}|\) means the determinant of matrix \(\mathbf{A}\)
  • For a 2x2 matrix:
  • \(\mathbf{A} = \begin{bmatrix}a&b\\c&d\end{bmatrix}\)
  • \(|\mathbf{A}| = ad - bc\)

Estimating \(\hat\beta\)

What is a sure-fire way to tell whether \(\mathbf{X}^T\mathbf{X}\) will be invertible?

  • Take the determinant!
  • \(|\mathbf{A}|\) means the determinant of matrix \(\mathbf{A}\)
  • For a 3x3 matrix:
  • \(\mathbf{A} = \begin{bmatrix}a&b&c\\d&e&f\\g&h&i\end{bmatrix}\)
  • \(|\mathbf{A}| = a(ei-fh)-b(di-fg) +c(dh-eg)\)

Determinants

It looks funky, but it follows a nice pattern!

\[\mathbf{A} = \begin{bmatrix}a&b&c\\d&e&f\\g&h&i\end{bmatrix}\] \[|\mathbf{A}| = a(ei-fh)-b(di-fg) +c(dh-eg)\]

  • multiply \(a\) by the determinant of the portion of the matrix that is not in \(a\)’s row or column (A)
  • do the same for \(b\) (B) and \(c\) (C)
  • put it together as plus (A) minus (B) plus (C)

Determinants

It looks funky, but it follows a nice pattern!

\[\mathbf{A} = \begin{bmatrix}a&b&c\\d&e&f\\g&h&i\end{bmatrix}\] \[|\mathbf{A}| = a(ei-fh)-b(di-fg) +c(dh-eg)\]

\[|\mathbf{A}| = a \left|\begin{matrix}e&f\\h&i\end{matrix}\right|-b\left|\begin{matrix}d&f\\g&i\end{matrix}\right|+c\left|\begin{matrix}d&e\\g&h\end{matrix}\right|\]
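
A quick numerical check of the pattern, using an arbitrary 3x3 matrix (the values are made up for illustration):

A <- matrix(c(2, 1, 6,
              4, 3, 8,
              7, 5, 9), nrow = 3, byrow = TRUE)

# Expand along the first row: each entry times the determinant of the 2x2 block
# left after deleting that entry's row and column, with alternating signs
expansion <- A[1, 1] * det(A[2:3, 2:3]) -
             A[1, 2] * det(A[2:3, c(1, 3)]) +
             A[1, 3] * det(A[2:3, 1:2])

expansion  # same value as...
det(A)     # ...R's built-in determinant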

Application Exercise

Calculate the determinant of the following matrices in R using the det() function:

\[\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 4 & 5 \end{bmatrix}\]

\[\mathbf{B} = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 6 & 9 \\ 2 & 5 & 7\end{bmatrix}\] Are these both invertible?

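For reference, one way to enter the matrices (a sketch):

A <- matrix(c(1, 2,
              4, 5), nrow = 2, byrow = TRUE)
B <- matrix(c(1, 2, 3,
              3, 6, 9,
              2, 5, 7), nrow = 3, byrow = TRUE)

det(A)  # -3, so A is invertible
det(B)  # (numerically) zero: the second row is 3 times the first, so B is not invertible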

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}\]

Is \(\mathbf{X}^T\mathbf{X}\) going to be invertible?

X <- matrix(c(1, 1, 1, 1, 3.01, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
det(t(X) %*% X)
[1] 0.0056
cor(X[, 2], X[, 3])
[1] 0.999993

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}\]

Is \(\mathbf{X}^T\mathbf{X}\) going to be invertible?

y <- c(1, 2, 3, 2)
solve(t(X) %*% X) %*% t(X) %*% y
            [,1]
[1,]    1.285714
[2,] -114.285714
[3,]   57.285714

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}\]

Is \(\mathbf{X}^T\mathbf{X}\) going to be invertible?

\[\begin{bmatrix}\hat\beta_0\\\hat\beta_1\\\hat\beta_2\end{bmatrix} = \begin{bmatrix}1.29\\-114.29\\57.29\end{bmatrix}\]

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}\]

What is the equation for the variance of \(\hat\beta\)?

\[var(\hat\beta) = \hat\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\]

  • \(\hat\sigma^2 = \frac{RSS}{n-(p+1)}\)
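
A sketch of the computation in R, reusing the X and y from the previous slides; it should reproduce (up to rounding) the matrix on the next slide:

X <- matrix(c(1, 1, 1, 1, 3.01, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
y <- c(1, 2, 3, 2)

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
rss <- sum((y - X %*% beta_hat)^2)       # residual sum of squares
sigma2_hat <- rss / (nrow(X) - ncol(X))  # RSS / (n - (p + 1))
sigma2_hat * solve(t(X) %*% X)           # var(beta_hat)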

Variance of \(\hat\beta\)

\[var(\hat\beta) = \begin{bmatrix} \mathbf{0.918} & -24.489 & 12.132\\ -24.489 & \mathbf{4081.571} & -2038.745 \\ 12.132 & -2038.745 & \mathbf{1018.367}\end{bmatrix}\]

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}\]

\[var(\hat\beta) = \begin{bmatrix} \mathbf{0.918} & -24.489 & 12.132\\ -24.489 & \mathbf{4081.571} & -2038.745 \\ 12.132 & -2038.745 & \mathbf{1018.367}\end{bmatrix}\]

What is the variance for \(\hat\beta_0\)?

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}\]

\[var(\hat\beta) = \begin{bmatrix} \color{blue}{\mathbf{0.918}} & -24.489 & 12.132\\ -24.489 & \mathbf{4081.571} & -2038.745 \\ 12.132 & -2038.745 & \mathbf{1018.367}\end{bmatrix}\]

What is the variance for \(\hat\beta_0\)?

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}\]

\[var(\hat\beta) = \begin{bmatrix} \mathbf{0.918} & -24.489 & 12.132\\ -24.489 & \mathbf{4081.571} & -2038.745 \\ 12.132 & -2038.745 & \mathbf{1018.367}\end{bmatrix}\]

What is the variance for \(\hat\beta_1\)?

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}\]

\[var(\hat\beta) = \begin{bmatrix} \mathbf{0.918} & -24.489 & 12.132\\ -24.489 & \color{blue}{\mathbf{4081.571}} & -2038.745 \\ 12.132 & -2038.745 & \mathbf{1018.367}\end{bmatrix}\]

What is the variance for \(\hat\beta_1\)? 😱

What’s the problem?

  • Sometimes we can’t solve for \(\hat\beta\)

Why?

What’s the problem?

  • Sometimes we can’t solve for \(\hat\beta\)
  • \(\mathbf{X}^T\mathbf{X}\) is not invertible
  • We have more variables than observations ( \(p > n\) )
  • The variables are linear combinations of one another
  • Even when we can invert \(\mathbf{X}^T\mathbf{X}\), things can go wrong
  • The variance can blow up, like we just saw!

What can we do about this?

Ridge Regression

  • What if we add an additional penalty to keep the \(\hat\beta\) coefficients small (this will keep the variance from blowing up!)
  • Instead of minimizing \(RSS\), like we do with linear regression, let’s minimize \(RSS\) PLUS some penalty function
  • \(RSS + \underbrace{\lambda\sum_{j=1}^p\beta^2_j}_{\textrm{shrinkage penalty}}\)

What happens when \(\lambda=0\)? What happens as \(\lambda\rightarrow\infty\)?

Ridge Regression

Let’s solve for the \(\hat\beta\) coefficients using Ridge Regression. What are we minimizing?

\[(\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta)+\lambda\beta^T\beta\]

Try it!

Find \(\hat\beta\) that minimizes this:

\[(\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta)+\lambda\beta^T\beta\]

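Worked out, following the same steps as the linear regression derivation above (take the derivative with respect to \(\hat\beta\), set it equal to zero, and solve):

\[\begin{align} -2\mathbf{X}^T\mathbf{y}+2\mathbf{X}^T\mathbf{X}\hat\beta+2\lambda\hat\beta &= 0\\ (\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})\hat\beta &= \mathbf{X}^T\mathbf{y}\\ \hat\beta &= (\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y} \end{align}\]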

Ridge Regression

\[\hat\beta_{ridge} = (\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\]

  • Not only does this help with the variance, it solves our problem when \(\mathbf{X}^{T}\mathbf{X}\) isn’t invertible!
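
A minimal sketch in R, reusing the nearly collinear X and y from the earlier slides with an arbitrarily chosen \(\lambda = 1\) (the value is for illustration only):

X <- matrix(c(1, 1, 1, 1, 3.01, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
y <- c(1, 2, 3, 2)

lambda <- 1           # illustrative value; in practice chosen by cross validation
I_p <- diag(ncol(X))  # identity matrix with one row/column per coefficient

# Ridge estimate: (X'X + lambda * I)^{-1} X'y
beta_ridge <- solve(t(X) %*% X + lambda * I_p) %*% t(X) %*% y
beta_ridge  # coefficients are now modest in size (compare -114.29 and 57.29 above)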

Choosing \(\lambda\)

  • \(\lambda\) is known as a tuning parameter and is selected using cross validation
  • For example, choose the \(\lambda\) that results in the smallest estimated test error
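
For example (a sketch assuming the glmnet package, which is not used elsewhere in these slides), cv.glmnet() with alpha = 0 fits ridge regression over a grid of \(\lambda\) values and estimates test error by k-fold cross validation; the data here are simulated purely for illustration:

library(glmnet)
set.seed(1)

# Simulated data: 50 observations, 5 predictors
n <- 50
x <- matrix(rnorm(n * 5), ncol = 5)
y <- 1 + 2 * x[, 1] - x[, 2] + rnorm(n)

cv_fit <- cv.glmnet(x, y, alpha = 0)  # alpha = 0 requests the ridge penalty

cv_fit$lambda.min                # lambda with the smallest estimated test error
coef(cv_fit, s = "lambda.min")   # ridge coefficients at that lambda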

Bias-variance tradeoff

How do you think ridge regression fits into the bias-variance tradeoff?

  • As \(\lambda\) ☝️, bias ☝️, variance 👇
  • Bias( \(\hat\beta_{ridge}\) ) = \(-\lambda(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\beta\)

Bias-variance tradeoff

What would this be if \(\lambda\) was 0?

  • Var( \(\hat\beta_{ridge}\) ) = \(\sigma^2(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\)

Bias-variance tradeoff

Is this bigger or smaller than \(\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\)? What is this when \(\lambda = 0\)? As \(\lambda\rightarrow\infty\) does this go up or down?

Ridge Regression

  • IMPORTANT: when doing ridge regression, you must standardize your variables (divide each by its standard deviation)

Why?
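
As a practical aside, a minimal sketch of standardizing predictors in R (the data values are made up to show two very different scales):

x <- matrix(c(1, 2, 5, 1, 5,
              10, 40, 20, 50, 30), ncol = 2)  # two predictors on different scales
x_std <- scale(x)     # center each column and divide by its standard deviation
apply(x_std, 2, sd)   # every column now has standard deviation 1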