Ridge Regression

Linear Regression Review

In linear regression, what are we minimizing? How can I write this in matrix form?

• $(\mathbf{y} - \mathbf{X}\hat\beta)^T(\mathbf{y}-\mathbf{X}\hat\beta)$
• What is the solution ( $\hat\beta$ ) to this?
• $\mathbf{(X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$

Linear Regression Review

What is $\mathbf{X}$?

• the design matrix!

Linear Regression Review

How did we get $\mathbf{(X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$?

Differentiate $(\mathbf{y}-\mathbf{X}\hat\beta)^T(\mathbf{y}-\mathbf{X}\hat\beta)$ with respect to $\hat\beta$, set the gradient to zero, and solve:

\begin{align}
-2\mathbf{X}^T\mathbf{y}+2\mathbf{X}^T\mathbf{X}\hat\beta &= 0\\
2\mathbf{X}^T\mathbf{X}\hat\beta & = 2\mathbf{X}^T\mathbf{y} \\
\mathbf{X}^T\mathbf{X}\hat\beta & =\mathbf{X}^T\mathbf{y} \\
(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\hat\beta &=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\
\underbrace{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}}_{\mathbf{I}}\hat\beta &=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\
\mathbf{I}\hat\beta &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\
\hat\beta & = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}
\end{align}

Linear Regression Review

Let’s try to find an $\mathbf{X}$ for which it would be impossible to calculate $\hat\beta$

y x
4 1
3 2
1 5
3 1
5 5

Creating a vector in R

y x
4 1
3 2
1 5
3 1
5 5
y <- c(4, 3, 1, 3, 5)

Creating a Design matrix in R

y x
4 1
3 2
1 5
3 1
5 5
(X <- matrix(c(rep(1, 5),
               c(1, 2, 5, 1, 5)),
             ncol = 2))
     [,1] [,2]
[1,]    1    1
[2,]    1    2
[3,]    1    5
[4,]    1    1
[5,]    1    5
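
An equivalent shortcut: cbind() recycles the scalar 1 into a full intercept column.

X <- cbind(1, c(1, 2, 5, 1, 5))  # same design matrix: intercept column plus x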

Taking a transpose in R

t(X)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    1    1
[2,]    1    2    5    1    5

Taking an inverse in R

XTX <- t(X) %*% X
solve(XTX)
           [,1]        [,2]
[1,]  0.6666667 -0.16666667
[2,] -0.1666667  0.05952381
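
A quick sanity check that solve() really returned the inverse: a matrix times its inverse is the identity.

round(XTX %*% solve(XTX), 10)  # identity matrix, up to rounding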

Put it all together

solve(t(X) %*% X) %*% t(X) %*% y
           [,1]
[1,]  3.5000000
[2,] -0.1071429
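
As a sanity check, the closed form should agree with R’s built-in fit, and solve(A, b) solves the normal equations directly without ever forming the inverse.

coef(lm(y ~ X[, 2]))           # same estimates from lm(): 3.5 and -0.107
solve(t(X) %*% X, t(X) %*% y)  # solve X^T X beta = X^T y directly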

Application Exercise

In R, find a design matrix X where it is not possible to calculate $\hat\beta$

solve(t(X) %*% X) %*% t(X) %*% y

Estimating $\hat\beta$

$\hat\beta = \mathbf{(X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$

Under what circumstances is this equation not estimable?

• when we can’t invert $\mathbf{X}^T\mathbf{X}$
• $p > n$
• multicollinearity
• A guaranteed way to check whether a square matrix is not invertible is to check whether the determinant is equal to zero

Estimating $\hat\beta$

$\mathbf{X} = \begin{bmatrix}1 & 2 & 3 & 1 \\ 1 & 3 & 4& 0 \end{bmatrix}$

What is $n$ here? What is $p$?

Estimating $\hat\beta$

$\mathbf{X} = \begin{bmatrix}1 & 2 & 3 & 1 \\ 1 & 3 & 4& 0 \end{bmatrix}$

Is $\mathbf{X}^T\mathbf{X}$ going to be invertible?

X <- matrix(c(1, 1, 2, 3, 3, 4, 1, 0), nrow = 2)
det(t(X) %*% X)
[1] 0
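
Trying to invert it anyway errors out; wrapping the call in try() shows the error without stopping the script.

try(solve(t(X) %*% X))  # fails: the matrix is exactly singular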

Estimating $\hat\beta$

$\mathbf{X} = \begin{bmatrix} 1 & 3 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}$

Is $\mathbf{X}^T\mathbf{X}$ going to be invertible?

X <- matrix(c(1, 1, 1, 1, 3, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
det(t(X) %*% X)
[1] 0
cor(X[, 2], X[, 3])
[1] 1

Estimating $\hat\beta$

$\mathbf{X} = \begin{bmatrix} 1 & 3 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}$

What was the problem this time?

X <- matrix(c(1, 1, 1, 1, 3, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
det(t(X) %*% X)
[1] 0
cor(X[, 2], X[, 3])
[1] 1
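
A quick check confirms the exact linear dependence behind that correlation of 1.

all(X[, 3] == 2 * X[, 2])  # TRUE: column 3 is exactly twice column 2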

Estimating $\hat\beta$

What is a sure-fire way to tell whether $\mathbf{X}^T\mathbf{X}$ will be invertible?

• Take the determinant!
• $|\mathbf{A}|$ means the determinant of matrix $\mathbf{A}$
• For a 2x2 matrix:
• $\mathbf{A} = \begin{bmatrix}a&b\\c&d\end{bmatrix}$
• $|\mathbf{A}| = ad - bc$
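
A quick check of the 2x2 formula against det(), with values chosen arbitrarily:

A <- matrix(c(2, 1, 3, 4), nrow = 2)  # a = 2, b = 3, c = 1, d = 4
det(A)         # 5
2 * 4 - 3 * 1  # ad - bc = 5, matching det()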

Estimating $\hat\beta$

What is a sure-fire way to tell whether $\mathbf{X}^T\mathbf{X}$ will be invertible?

• Take the determinant!
• $|\mathbf{A}|$ means the determinant of matrix $\mathbf{A}$
• For a 3x3 matrix:
• $\mathbf{A} = \begin{bmatrix}a&b&c\\d&e&f\\g&h&i\end{bmatrix}$
• $|\mathbf{A}| = a(ei-fh)-b(di-fg) +c(dh-eg)$

Determinants

It looks funky, but it follows a nice pattern!

$\mathbf{A} = \begin{bmatrix}a&b&c\\d&e&f\\g&h&i\end{bmatrix}$ $|\mathbf{A}| = a(ei-fh)-b(di-fg) +c(dh-eg)$

• multiply $a$ by the determinant of the submatrix left after deleting $a$’s row and column (A)
• do the same for $b$ (B) and $c$ (C)
• put them together as plus (A), minus (B), plus (C)

Determinants

It looks funky, but it follows a nice pattern!

$\mathbf{A} = \begin{bmatrix}a&b&c\\d&e&f\\g&h&i\end{bmatrix}$ $|\mathbf{A}| = a(ei-fh)-b(di-fg) +c(dh-eg)$

$|\mathbf{A}| = a \left|\begin{matrix}e&f\\h&i\end{matrix}\right|-b\left|\begin{matrix}d&f\\g&i\end{matrix}\right|+c\left|\begin{matrix}d&e\\g&h\end{matrix}\right|$
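
The same check for the 3x3 cofactor expansion, again with arbitrary values:

A <- matrix(c(2, 1, 1,
              3, 2, 1,
              2, 1, 2), nrow = 3)  # columns are (a, d, g), (b, e, h), (c, f, i)
det(A)                                               # 1
2 * (2*2 - 1*1) - 3 * (1*2 - 1*1) + 2 * (1*1 - 2*1)  # a(ei-fh) - b(di-fg) + c(dh-eg) = 1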

Application Exercise

Calculate the determinant of the following matrices in R using the det() function:

$\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 4 & 5 \end{bmatrix}$

$\mathbf{B} = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 6 & 9 \\ 2 & 5 & 7\end{bmatrix}$

Are these both invertible?


Estimating $\hat\beta$

$\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}$

Is $\mathbf{X}^T\mathbf{X}$ going to be invertible?

X <- matrix(c(1, 1, 1, 1, 3.01, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
det(t(X) %*% X)
[1] 0.0056
cor(X[, 2], X[, 3])
[1] 0.999993
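
Another way to see the trouble: the condition number from base R’s kappa() is enormous for a nearly singular matrix.

kappa(t(X) %*% X)  # a huge condition number signals near-singularity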

Estimating $\hat\beta$

$\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}$

Is $\mathbf{X}^T\mathbf{X}$ going to be invertible?

y <- c(1, 2, 3, 2)
solve(t(X) %*% X) %*% t(X) %*% y
            [,1]
[1,]    1.285714
[2,] -114.285714
[3,]   57.285714

Estimating $\hat\beta$

$\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}$

Is $\mathbf{X}^T\mathbf{X}$ going to be invertible?

$\begin{bmatrix}\hat\beta_0\\\hat\beta_1\\\hat\beta_2\end{bmatrix} = \begin{bmatrix}1.29\\-114.29\\57.29\end{bmatrix}$

Estimating $\hat\beta$

$\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}$

What is the equation for the variance of $\hat\beta$?

$var(\hat\beta) = \hat\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$

• $\hat\sigma^2 = \frac{RSS}{n-(p+1)}$
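
A sketch that reproduces the variance matrix on the next slide, reusing the $\mathbf{X}$ and $\mathbf{y}$ from this example:

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
rss      <- sum((y - X %*% beta_hat)^2)
sigma2   <- rss / (nrow(X) - ncol(X))  # RSS / (n - (p + 1)); ncol(X) counts the intercept
sigma2 * solve(t(X) %*% X)             # var(beta_hat)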

Variance of $\hat\beta$

$var(\hat\beta) = \begin{bmatrix} \mathbf{0.918} & -24.489 & 12.132\\ -24.489 & \mathbf{4081.571} & -2038.745 \\ 12.132 & -2038.745 & \mathbf{1018.367}\end{bmatrix}$

Estimating $\hat\beta$

$\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}$

$var(\hat\beta) = \begin{bmatrix} \mathbf{0.918} & -24.489 & 12.132\\ -24.489 & \mathbf{4081.571} & -2038.745 \\ 12.132 & -2038.745 & \mathbf{1018.367}\end{bmatrix}$

What is the variance for $\hat\beta_0$?

Estimating $\hat\beta$

$\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}$

$var(\hat\beta) = \begin{bmatrix} \color{blue}{\mathbf{0.918}} & -24.489 & 12.132\\ -24.489 & \mathbf{4081.571} & -2038.745 \\ 12.132 & -2038.745 & \mathbf{1018.367}\end{bmatrix}$

What is the variance for $\hat\beta_0$?

Estimating $\hat\beta$

$\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}$

$var(\hat\beta) = \begin{bmatrix} \mathbf{0.918} & -24.489 & 12.132\\ -24.489 & \mathbf{4081.571} & -2038.745 \\ 12.132 & -2038.745 & \mathbf{1018.367}\end{bmatrix}$

What is the variance for $\hat\beta_1$?

Estimating $\hat\beta$

$\mathbf{X} = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\1 & 5& 10\\ 1 & 2 & 4\end{bmatrix}$

$var(\hat\beta) = \begin{bmatrix} \mathbf{0.918} & -24.489 & 12.132\\ -24.489 & \color{blue}{\mathbf{4081.571}} & -2038.745 \\ 12.132 & -2038.745 & \mathbf{1018.367}\end{bmatrix}$

What is the variance for $\hat\beta_1$? 😱

What’s the problem?

• Sometimes we can’t solve for $\hat\beta$

Why?

What’s the problem?

• Sometimes we can’t solve for $\hat\beta$
• $\mathbf{X}^T\mathbf{X}$ is not invertible
• We have more variables than observations ( $p > n$ )
• The variables are linear combinations of one another
• Even when we can invert $\mathbf{X}^T\mathbf{X}$, things can go wrong
• The variance can blow up, like we just saw!

Ridge Regression

• What if we add an additional penalty to keep the $\hat\beta$ coefficients small? (This will keep the variance from blowing up!)
• Instead of minimizing $RSS$, like we do with linear regression, let’s minimize $RSS$ PLUS some penalty function
• $RSS + \underbrace{\lambda\sum_{j=1}^p\beta^2_j}_{\textrm{shrinkage penalty}}$

What happens when $\lambda=0$? What happens as $\lambda\rightarrow\infty$?

Ridge Regression

Let’s solve for the $\hat\beta$ coefficients using Ridge Regression. What are we minimizing?

$(\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta)+\lambda\beta^T\beta$

Try it!

Find $\hat\beta$ that minimizes this:

$(\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta)+\lambda\beta^T\beta$


Ridge Regression

$\hat\beta_{ridge} = (\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$
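
The steps mirror the OLS derivation: differentiate the penalized criterion with respect to $\hat\beta$ and set the gradient to zero.

\begin{align}
-2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\hat\beta + 2\lambda\hat\beta &= 0\\
(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\hat\beta &= \mathbf{X}^T\mathbf{y}\\
\hat\beta_{ridge} &= (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}
\end{align}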

• Not only does this help with the variance, it solves our problem when $\mathbf{X}^{T}\mathbf{X}$ isn’t invertible! For any $\lambda > 0$, $\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}$ is positive definite, so it is always invertible (see the sketch below)
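
A minimal sketch of the closed form, reusing the near-singular $\mathbf{X}$ and $\mathbf{y}$ from earlier with an arbitrary $\lambda = 1$. (This simple version penalizes the intercept too; in practice the intercept is usually left unpenalized.)

lambda <- 1
solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% y  # coefficients no longer blow up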

Choosing $\lambda$

• $\lambda$ is known as a tuning parameter and is selected using cross validation
• For example, choose the $\lambda$ that results in the smallest estimated test error
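
A sketch using the glmnet package (assuming it is installed), on simulated data for illustration; alpha = 0 selects the ridge penalty.

library(glmnet)
set.seed(1)
X_sim <- matrix(rnorm(100 * 5), ncol = 5)           # simulated predictors
y_sim <- X_sim %*% c(2, -1, 0, 0, 1) + rnorm(100)   # simulated response
cv_fit <- cv.glmnet(X_sim, y_sim, alpha = 0)        # 10-fold CV over a lambda grid
cv_fit$lambda.min                                   # lambda with smallest estimated test error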

How do you think ridge regression fits into the bias-variance tradeoff?

• As $\lambda$ ☝️, bias ☝️, variance 👇
• Bias( $\hat\beta_{ridge}$ ) = $-\lambda(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\beta$

What would this be if $\lambda$ was 0?

• Var( $\hat\beta_{ridge}$ ) = $\sigma^2(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I})^{-1}$

Is this bigger or smaller than $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$? What is this when $\lambda = 0$? As $\lambda\rightarrow\infty$ does this go up or down?
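
A small numeric check of that last question, reusing the near-singular $\mathbf{X}$ from earlier: the total variance (the trace, up to the $\sigma^2$ factor) shrinks as $\lambda$ grows.

ridge_var <- function(X, lambda) {
  XtX <- t(X) %*% X
  inv <- solve(XtX + lambda * diag(ncol(X)))
  inv %*% XtX %*% inv  # Var(beta_ridge), without the sigma^2 factor
}
sapply(c(0.01, 0.1, 1, 10), function(l) sum(diag(ridge_var(X, l))))  # decreasing in lambda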