Ridge Regression

Linear Regression Review

In linear regression, what are we minimizing? How can I write this in matrix form?

  • RSS!
  • \((\mathbf{y} - \mathbf{X}\hat\beta)^T(\mathbf{y}-\mathbf{X}\hat\beta)\)
  • What is the solution ( \(\hat\beta\) ) to this?
  • \(\mathbf{(X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)

Linear Regression Review

What is \(\mathbf{X}\)?

  • the design matrix!

Linear Regression Review

How did we get \(\mathbf{(X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)?

Take the derivative of the RSS with respect to \(\hat\beta\), set it equal to zero, and solve:

\[ \begin{align} -2\mathbf{X}^T\mathbf{y}+2\mathbf{X}^T\mathbf{X}\hat\beta &= 0\\ 2\mathbf{X}^T\mathbf{X}\hat\beta & = 2\mathbf{X}^T\mathbf{y} \\ \mathbf{X}^T\mathbf{X}\hat\beta & =\mathbf{X}^T\mathbf{y} \\ (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\hat\beta &=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\ \underbrace{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}}_{\mathbf{I}}\hat\beta &=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\ \mathbf{I}\hat\beta &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\ \hat\beta & = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \end{align} \]

Linear Regression Review

Let’s try to find an \(\mathbf{X}\) for which it would be impossible to calculate \(\hat\beta\)

Calculating in R

y x
4 1
3 2
1 5
3 1
5 5

Creating a vector in R

y x
4 1
3 2
1 5
3 1
5 5
y <- c(4, 3, 1, 3, 5)

Creating a Design matrix in R

y x
4 1
3 2
1 5
3 1
5 5
(X <- matrix(c(rep(1, 5), 
               c(1, 2, 5, 1, 5)),
             ncol = 2))
     [,1] [,2]
[1,]    1    1
[2,]    1    2
[3,]    1    5
[4,]    1    1
[5,]    1    5
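
Equivalently (an aside, not in the original code), cbind() can build the same design matrix by recycling the scalar 1 into the intercept column:

x <- c(1, 2, 5, 1, 5)
X <- cbind(1, x)  # the 1 is recycled to length 5, forming the intercept column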

Taking a transpose in R

t(X)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    1    1
[2,]    1    2    5    1    5

Taking an inverse in R

XTX <- t(X) %*% X
solve(XTX)
           [,1]        [,2]
[1,]  0.6666667 -0.16666667
[2,] -0.1666667  0.05952381

Put it all together

solve(t(X) %*% X) %*% t(X) %*% y
           [,1]
[1,]  3.5000000
[2,] -0.1071429
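
As a sanity check (reusing the y and X defined above), lm() gives the same coefficients as the matrix algebra:

coef(lm(y ~ X[, 2]))  # intercept 3.5, slope -0.1071429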

Application Exercise

In R, find a design matrix X where it is not possible to calculate \(\hat\beta\)

solve(t(X) %*% X) %*% t(X) %*% y
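One possible answer (a sketch; any design with linearly dependent columns works):

y <- c(4, 3, 1, 3, 5)

# Make the third column an exact copy of the second, so the columns are
# linearly dependent and t(X) %*% X cannot be inverted
X <- matrix(c(rep(1, 5),
              c(1, 2, 5, 1, 5),
              c(1, 2, 5, 1, 5)), ncol = 3)

solve(t(X) %*% X) %*% t(X) %*% y  # solve() throws an error: the matrix is singular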

Estimating \(\hat\beta\)

\(\hat\beta = \mathbf{(X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)

Under what circumstances is this equation not estimable?

  • when we can’t invert \(\mathbf{X}^T\mathbf{X}\)
  • \(p > n\)
  • multicollinearity
  • A guaranteed way to check whether a square matrix is not invertible is to check whether the determinant is equal to zero

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix}1 & 2 & 3 & 1 \\ 1 & 3 & 4& 0 \end{bmatrix}\]

What is \(n\) here? What is \(p\)?

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix}1 & 2 & 3 & 1 \\ 1 & 3 & 4& 0 \end{bmatrix}\]

Is \(\mathbf{X}^T\mathbf{X}\) going to be invertible?

X <- matrix(c(1, 1, 2, 3, 3, 4, 1, 0), nrow = 2)
det(t(X) %*% X)
[1] 0

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix} 1 & 3 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}\]

Is \(\mathbf{X}^T\mathbf{X}\) going to be invertible?

X <- matrix(c(1, 1, 1, 1, 3, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
det(t(X) %*% X)
[1] 0
cor(X[, 2], X[, 3])
[1] 1

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix} 1 & 3 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}\]

What was the problem this time?

X <- matrix(c(1, 1, 1, 1, 3, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
det(t(X) %*% X)
[1] 0
cor(X[, 2], X[, 3])
[1] 1

Estimating \(\hat\beta\)

What is a sure-fire way to tell whether \(\mathbf{X}^T\mathbf{X}\) will be invertible?

  • Take the determinant!
  • \(|\mathbf{A}|\) means the determinant of matrix \(\mathbf{A}\)
  • For a 2x2 matrix:
  • \(\mathbf{A} = \begin{bmatrix}a&b\\c&d\end{bmatrix}\)
  • \(|\mathbf{A}| = ad - bc\)

Estimating \(\hat\beta\)

What is a sure-fire way to tell whether \(\mathbf{X}^T\mathbf{X}\) will be invertible?

  • Take the determinant!
  • \(|\mathbf{A}|\) means the determinant of matrix \(\mathbf{A}\)
  • For a 3x3 matrix:
  • \(\mathbf{A} = \begin{bmatrix}a&b&c\\d&e&f\\g&h&i\end{bmatrix}\)
  • \(|\mathbf{A}| = a(ei-fh)-b(di-fg) +c(dh-eg)\)

Determinants

It looks funky, but it follows a nice pattern!

\[\mathbf{A} = \begin{bmatrix}a&b&c\\d&e&f\\g&h&i\end{bmatrix}\] \[|\mathbf{A}| = a(ei-fh)-b(di-fg) +c(dh-eg)\]

  • multiply \(a\) by the determinant of the portion of the matrix that is not in \(a\)’s row or column (A)
  • do the same for \(b\) (B) and \(c\) (C)
  • put it together as plus (A) minus (B) plus (C)

Determinants

It looks funky, but it follows a nice pattern!

\[\mathbf{A} = \begin{bmatrix}a&b&c\\d&e&f\\g&h&i\end{bmatrix}\] \[|\mathbf{A}| = a(ei-fh)-b(di-fg) +c(dh-eg)\]

\[|\mathbf{A}| = a \left|\begin{matrix}e&f\\h&i\end{matrix}\right|-b\left|\begin{matrix}d&f\\g&i\end{matrix}\right|+c\left|\begin{matrix}d&e\\g&h\end{matrix}\right|\]
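
A quick numerical check of the pattern, using an arbitrary 3x3 matrix (the values are made up for illustration):

A <- matrix(c(2, 1, 6,
              4, 3, 8,
              7, 5, 9), nrow = 3, byrow = TRUE)

# Expand along the first row: each entry times the determinant of the 2x2 block
# left after deleting that entry's row and column, with alternating signs
expansion <- A[1, 1] * det(A[2:3, 2:3]) -
             A[1, 2] * det(A[2:3, c(1, 3)]) +
             A[1, 3] * det(A[2:3, 1:2])

expansion  # same value as...
det(A)     # ...R's built-in determinant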

Application Exercise

Calculate the determinant of the following matrices in R using the det() function:

\[\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 4 & 5 \end{bmatrix}\]

\[\mathbf{B} = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 6 & 9 \\ 2 & 5 & 7\end{bmatrix}\] Are these both invertible?

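For reference, one way to enter the matrices (a sketch):

A <- matrix(c(1, 2,
              4, 5), nrow = 2, byrow = TRUE)
B <- matrix(c(1, 2, 3,
              3, 6, 9,
              2, 5, 7), nrow = 3, byrow = TRUE)

det(A)  # -3, so A is invertible
det(B)  # (numerically) zero: the second row is 3 times the first, so B is not invertible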

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}\]

Is \(\mathbf{X}^T\mathbf{X}\) going to be invertible?

X <- matrix(c(1, 1, 1, 1, 3.01, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
det(t(X) %*% X)
[1] 0.0056
cor(X[, 2], X[, 3])
[1] 0.999993

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}\]

Is \(\mathbf{X}^T\mathbf{X}\) going to be invertible?

y <- c(1, 2, 3, 2)
solve(t(X) %*% X) %*% t(X) %*% y
            [,1]
[1,]    1.285714
[2,] -114.285714
[3,]   57.285714

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}\]

Is \(\mathbf{X}^T\mathbf{X}\) going to be invertible?

\[\begin{bmatrix}\hat\beta_0\\\hat\beta_1\\\hat\beta_2\end{bmatrix} = \begin{bmatrix}1.29\\-114.29\\57.29\end{bmatrix}\]

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}\]

What is the equation for the variance of \(\hat\beta\)?

\[var(\hat\beta) = \hat\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\]

  • \(\hat\sigma^2 = \frac{RSS}{n-(p+1)}\)
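
A sketch of the computation in R, reusing the X and y from the previous slides; it should reproduce (up to rounding) the matrix on the next slide:

X <- matrix(c(1, 1, 1, 1, 3.01, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
y <- c(1, 2, 3, 2)

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
rss <- sum((y - X %*% beta_hat)^2)       # residual sum of squares
sigma2_hat <- rss / (nrow(X) - ncol(X))  # RSS / (n - (p + 1))
sigma2_hat * solve(t(X) %*% X)           # var(beta_hat)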

Variance of \(\hat\beta\)

\[var(\hat\beta) = \begin{bmatrix} \mathbf{0.918} & -24.489 & 12.132\\ -24.489 & \mathbf{4081.571} & -2038.745 \\ 12.132 & -2038.745 & \mathbf{1018.367}\end{bmatrix}\]

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}\]

\[var(\hat\beta) = \begin{bmatrix} \mathbf{0.918} & -24.489 & 12.132\\ -24.489 & \mathbf{4081.571} & -2038.745 \\ 12.132 & -2038.745 & \mathbf{1018.367}\end{bmatrix}\]

What is the variance for \(\hat\beta_0\)?

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}\]

\[var(\hat\beta) = \begin{bmatrix} \color{blue}{\mathbf{0.918}} & -24.489 & 12.132\\ -24.489 & \mathbf{4081.571} & -2038.745 \\ 12.132 & -2038.745 & \mathbf{1018.367}\end{bmatrix}\]

What is the variance for \(\hat\beta_0\)?

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}\]

\[var(\hat\beta) = \begin{bmatrix} \mathbf{0.918} & -24.489 & 12.132\\ -24.489 & \mathbf{4081.571} & -2038.745 \\ 12.132 & -2038.745 & \mathbf{1018.367}\end{bmatrix}\]

What is the variance for \(\hat\beta_1\)?

Estimating \(\hat\beta\)

\[\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}\]

\[var(\hat\beta) = \begin{bmatrix} \mathbf{0.918} & -24.489 & 12.132\\ -24.489 & \color{blue}{\mathbf{4081.571}} & -2038.745 \\ 12.132 & -2038.745 & \mathbf{1018.367}\end{bmatrix}\]

What is the variance for \(\hat\beta_1\)? 😱

What’s the problem?

  • Sometimes we can’t solve for \(\hat\beta\)

Why?

What’s the problem?

  • Sometimes we can’t solve for \(\hat\beta\)
  • \(\mathbf{X}^T\mathbf{X}\) is not invertible
  • We have more variables than observations ( \(p > n\) )
  • The variables are linear combinations of one another
  • Even when we can invert \(\mathbf{X}^T\mathbf{X}\), things can go wrong
  • The variance can blow up, like we just saw!

What can we do about this?

Ridge Regression

  • What if we add an additional penalty to keep the \(\hat\beta\) coefficients small (this will keep the variance from blowing up!)
  • Instead of minimizing \(RSS\), like we do with linear regression, let’s minimize \(RSS\) PLUS some penalty function
  • \(RSS + \underbrace{\lambda\sum_{j=1}^p\beta^2_j}_{\textrm{shrinkage penalty}}\)

What happens when \(\lambda=0\)? What happens as \(\lambda\rightarrow\infty\)?

Ridge Regression

Let’s solve for the \(\hat\beta\) coefficients using Ridge Regression. What are we minimizing?

\[(\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta)+\lambda\beta^T\beta\]

Try it!

Find \(\hat\beta\) that minimizes this:

\[(\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta)+\lambda\beta^T\beta\]

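Worked out, following the same steps as the linear regression derivation above (take the derivative with respect to \(\hat\beta\), set it equal to zero, and solve):

\[\begin{align} -2\mathbf{X}^T\mathbf{y}+2\mathbf{X}^T\mathbf{X}\hat\beta+2\lambda\hat\beta &= 0\\ (\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})\hat\beta &= \mathbf{X}^T\mathbf{y}\\ \hat\beta &= (\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y} \end{align}\]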

Ridge Regression

\[\hat\beta_{ridge} = (\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\]

  • Not only does this help with the variance, it solves our problem when \(\mathbf{X}^{T}\mathbf{X}\) isn’t invertible!
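
A minimal sketch in R, reusing the nearly collinear X and y from the earlier slides with an arbitrarily chosen \(\lambda = 1\) (the value is for illustration only):

X <- matrix(c(1, 1, 1, 1, 3.01, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
y <- c(1, 2, 3, 2)

lambda <- 1           # illustrative value; in practice chosen by cross validation
I_p <- diag(ncol(X))  # identity matrix with one row/column per coefficient

# Ridge estimate: (X'X + lambda * I)^{-1} X'y
beta_ridge <- solve(t(X) %*% X + lambda * I_p) %*% t(X) %*% y
beta_ridge  # coefficients are now modest in size (compare -114.29 and 57.29 above)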

Choosing \(\lambda\)

  • \(\lambda\) is known as a tuning parameter and is selected using cross validation
  • For example, choose the \(\lambda\) that results in the smallest estimated test error
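
For example (a sketch assuming the glmnet package, which is not used elsewhere in these slides), cv.glmnet() with alpha = 0 fits ridge regression over a grid of \(\lambda\) values and estimates test error by k-fold cross validation; the data here are simulated purely for illustration:

library(glmnet)
set.seed(1)

# Simulated data: 50 observations, 5 predictors
n <- 50
x <- matrix(rnorm(n * 5), ncol = 5)
y <- 1 + 2 * x[, 1] - x[, 2] + rnorm(n)

cv_fit <- cv.glmnet(x, y, alpha = 0)  # alpha = 0 requests the ridge penalty

cv_fit$lambda.min                # lambda with the smallest estimated test error
coef(cv_fit, s = "lambda.min")   # ridge coefficients at that lambda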

Bias-variance tradeoff

How do you think ridge regression fits into the bias-variance tradeoff?

  • As \(\lambda\) ☝️, bias ☝️, variance 👇
  • Bias( \(\hat\beta_{ridge}\) ) = \(-\lambda(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\beta\)

Bias-variance tradeoff

What would this be if \(\lambda\) was 0?

  • Var( \(\hat\beta_{ridge}\) ) = \(\sigma^2(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\)

Bias-variance tradeoff

Is this bigger or smaller than \(\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\)? What is this when \(\lambda = 0\)? As \(\lambda\rightarrow\infty\) does this go up or down?

Ridge Regression

  • IMPORTANT: when doing ridge regression, you must standardize your variables (divide each by its standard deviation)

Why?
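
As a practical aside, a minimal sketch of standardizing predictors in R (the data values are made up to show two very different scales):

x <- matrix(c(1, 2, 5, 1, 5,
              10, 40, 20, 50, 30), ncol = 2)  # two predictors on different scales
x_std <- scale(x)     # center each column and divide by its standard deviation
apply(x_std, 2, sd)   # every column now has standard deviation 1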