# Linear Regression

Dr. D’Agostino McGowan

## Linear Regression Questions

• Is there a relationship between a response variable and predictors?
• How strong is the relationship?
• What is the uncertainty?
• How accurately can we predict a future outcome?

## Simple linear regression

$Y = \beta_0 + \beta_1 X + \varepsilon$

• $\beta_0$: intercept
• $\beta_1$: slope
• $\beta_0$ and $\beta_1$ are the model's coefficients, also called parameters
• $\varepsilon$: error

## Simple linear regression

We estimate this with

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x$

• $\hat{y}$ is the prediction of $Y$ when $X = x$
• The hat denotes that this is an estimated value
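As a quick numeric illustration (the deck itself contains no code, so this numpy sketch and its coefficient values are hypothetical):

```python
import numpy as np

# Hypothetical estimates for illustration (not fit to any real data)
beta0_hat, beta1_hat = 2.0, 0.5

x = np.array([1.0, 3.0, 10.0])
y_hat = beta0_hat + beta1_hat * x   # prediction of Y at each X = x
assert np.allclose(y_hat, [2.5, 3.5, 7.0])
```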

## Simple linear regression

$Y_i = \beta_0 + \beta_1X_i + \varepsilon_i$

$\varepsilon_i\sim N(0, \sigma^2)$

## Simple linear regression

$Y_i = \beta_0 + \beta_1X_i + \varepsilon_i$

$\varepsilon_i\sim N(0, \sigma^2)$

\begin{align} Y_1 &= \beta_0 + \beta_1X_1 + \varepsilon_1\\ Y_2 &= \beta_0 + \beta_1X_2 + \varepsilon_2\\ \vdots \hspace{0.25cm} & \hspace{0.25cm} \vdots \hspace{0.5cm} \vdots\\ Y_n &=\beta_0 + \beta_1X_n + \varepsilon_n \end{align}

\begin{align} \begin{bmatrix} Y_1 \\Y_2\\ \vdots\\ Y_n \end{bmatrix} & = \begin{bmatrix} \beta_0 + \beta_1X_1\\ \beta_0+\beta_1X_2\\ \vdots\\ \beta_0 + \beta_1X_n\end{bmatrix} + \begin{bmatrix}\varepsilon_1\\\varepsilon_2\\\vdots\\\varepsilon_n\end{bmatrix} \end{align}


## Simple linear regression

$Y_i = \beta_0 + \beta_1X_i + \varepsilon_i$

$\varepsilon_i\sim N(0, \sigma^2)$

\begin{align} Y_1 &= \beta_0 + \beta_1X_1 + \varepsilon_1\\ Y_2 &= \beta_0 + \beta_1X_2 + \varepsilon_2\\ \vdots \hspace{0.25cm} & \hspace{0.25cm} \vdots \hspace{0.5cm} \vdots\\ Y_n &=\beta_0 + \beta_1X_n + \varepsilon_n \end{align}

\begin{align} \begin{bmatrix} Y_1 \\Y_2\\ \vdots\\ Y_n \end{bmatrix} & = \begin{bmatrix} 1 \hspace{0.25cm} X_1\\ 1\hspace{0.25cm} X_2\\ \vdots\hspace{0.25cm} \vdots\\ 1\hspace{0.25cm}X_n\end{bmatrix} \begin{bmatrix}\beta_0\\\beta_1\end{bmatrix} + \begin{bmatrix}\varepsilon_1\\\varepsilon_2\\\vdots\\\varepsilon_n\end{bmatrix} \end{align}
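The factorization into a design matrix times a coefficient vector can be checked numerically. A minimal numpy sketch with simulated data and assumed coefficient values:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
X1 = rng.normal(size=n)                  # the single predictor
X = np.column_stack([np.ones(n), X1])    # design matrix: a column of 1s, then X1
beta = np.array([2.0, 0.5])              # assumed values for (beta0, beta1)

# Row i of X @ beta is beta0 + beta1 * X1[i], matching the stacked equations
assert np.allclose(X @ beta, 2.0 + 0.5 * X1)
```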

## Simple linear regression

\Large \begin{align} \begin{bmatrix} Y_1 \\Y_2\\ \vdots\\ Y_n \end{bmatrix} & = \begin{bmatrix} 1 \hspace{0.25cm} X_1\\ 1\hspace{0.25cm} X_2\\ \vdots\hspace{0.25cm} \vdots\\ 1\hspace{0.25cm}X_n\end{bmatrix} \begin{bmatrix}\beta_0\\\beta_1\end{bmatrix} + \begin{bmatrix}\varepsilon_1\\\varepsilon_2\\\vdots\\\varepsilon_n\end{bmatrix} \end{align}

## Simple linear regression

\Large \begin{align} \begin{bmatrix} Y_1 \\Y_2\\ \vdots\\ Y_n \end{bmatrix} & = \underbrace{\begin{bmatrix} 1 \hspace{0.25cm} X_1\\ 1\hspace{0.25cm} X_2\\ \vdots\hspace{0.25cm} \vdots\\ 1\hspace{0.25cm}X_n\end{bmatrix}}_{\mathbf{X}: \textrm{ Design Matrix}} \begin{bmatrix}\beta_0\\\beta_1\end{bmatrix} + \begin{bmatrix}\varepsilon_1\\\varepsilon_2\\\vdots\\\varepsilon_n\end{bmatrix} \end{align}

What are the dimensions of $\mathbf{X}$?

• $n\times2$

## Simple linear regression

\Large \begin{align} \begin{bmatrix} Y_1 \\Y_2\\ \vdots\\ Y_n \end{bmatrix} & = \underbrace{\begin{bmatrix} 1 \hspace{0.25cm} X_1\\ 1\hspace{0.25cm} X_2\\ \vdots\hspace{0.25cm} \vdots\\ 1\hspace{0.25cm}X_n\end{bmatrix}}_{\mathbf{X}: \textrm{ Design Matrix}} \underbrace{\begin{bmatrix}\beta_0\\\beta_1\end{bmatrix}}_{\beta: \textrm{ Vector of parameters}} + \begin{bmatrix}\varepsilon_1\\\varepsilon_2\\\vdots\\\varepsilon_n\end{bmatrix} \end{align}

What are the dimensions of $\beta$?

• $2\times1$

## Simple linear regression

\Large \begin{align} \begin{bmatrix} Y_1 \\Y_2\\ \vdots\\ Y_n \end{bmatrix} & = \begin{bmatrix} 1 \hspace{0.25cm} X_1\\ 1\hspace{0.25cm} X_2\\ \vdots\hspace{0.25cm} \vdots\\ 1\hspace{0.25cm}X_n\end{bmatrix} \begin{bmatrix}\beta_0\\\beta_1\end{bmatrix} + \underbrace{\begin{bmatrix}\varepsilon_1\\\varepsilon_2\\\vdots\\\varepsilon_n\end{bmatrix}}_{\varepsilon:\textrm{ vector of error terms}} \end{align}

What are the dimensions of $\varepsilon$?

• $n\times1$

## Simple linear regression

\Large \begin{align} \underbrace{\begin{bmatrix} Y_1 \\Y_2\\ \vdots\\ Y_n \end{bmatrix}}_{\textbf{Y}: \textrm{ Vector of responses}} & = \begin{bmatrix} 1 \hspace{0.25cm} X_1\\ 1\hspace{0.25cm} X_2\\ \vdots\hspace{0.25cm} \vdots\\ 1\hspace{0.25cm}X_n\end{bmatrix} \begin{bmatrix}\beta_0\\\beta_1\end{bmatrix} + \begin{bmatrix}\varepsilon_1\\\varepsilon_2\\\vdots\\\varepsilon_n\end{bmatrix} \end{align}

What are the dimensions of $\mathbf{Y}$?

• $n\times1$

## Simple linear regression

\Large \begin{align} \begin{bmatrix} Y_1 \\Y_2\\ \vdots\\ Y_n \end{bmatrix} & = \begin{bmatrix} 1 \hspace{0.25cm} X_1\\ 1\hspace{0.25cm} X_2\\ \vdots\hspace{0.25cm} \vdots\\ 1\hspace{0.25cm}X_n\end{bmatrix} \begin{bmatrix}\beta_0\\\beta_1\end{bmatrix} + \begin{bmatrix}\varepsilon_1\\\varepsilon_2\\\vdots\\\varepsilon_n\end{bmatrix} \end{align}

$\Large \mathbf{Y}=\mathbf{X}\beta+\varepsilon$

## Simple linear regression

\Large \begin{align} \begin{bmatrix} \hat{y}_1 \\\hat{y}_2\\ \vdots\\ \hat{y}_n \end{bmatrix} & = \begin{bmatrix} 1 \hspace{0.25cm} x_1\\ 1\hspace{0.25cm} x_2\\ \vdots\hspace{0.25cm} \vdots\\ 1\hspace{0.25cm}x_n\end{bmatrix} \begin{bmatrix}\hat{\beta}_0\\\hat{\beta}_1\end{bmatrix} \end{align}

$\hat{y}_i=\hat{\beta}_0 + \hat{\beta}_1x_i$

• $\hat\varepsilon_i = y_i - \hat{y}_i$
• $\hat\varepsilon_i = y_i - (\hat{\beta}_0+\hat{\beta}_1x_i)$
• $\hat\varepsilon_i$ is known as the residual for observation $i$

## Simple linear regression

How are $\hat{\beta}_0$ and $\hat{\beta}_1$ chosen? What are we minimizing?

• Minimize the residual sum of squares
• RSS = $\sum\hat\varepsilon_i^2 = \hat\varepsilon_1^2 + \hat\varepsilon_2^2 + \dots+\hat\varepsilon_n^2$

## Simple linear regression

How could we re-write this with $y_i$ and $x_i$?

• Minimize the residual sum of squares
• RSS = $\sum\hat\varepsilon_i^2 = \hat\varepsilon_1^2 + \hat\varepsilon_2^2 + \dots+\hat\varepsilon_n^2$
• RSS = $(y_1 - \hat{\beta_0} - \hat{\beta}_1x_1)^2 + (y_2 - \hat{\beta}_0-\hat{\beta}_1x_2)^2 + \dots + (y_n - \hat{\beta}_0-\hat{\beta}_1x_n)^2$
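The term-by-term version of RSS can be computed directly. A numpy sketch with simulated data and illustrative (not optimal) candidate estimates:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=n)   # simulated data

b0, b1 = 1.9, 0.6                 # candidate estimates (illustrative, not optimal)
resid = y - (b0 + b1 * x)         # eps_hat_i = y_i - (b0 + b1 * x_i)
rss = np.sum(resid ** 2)          # RSS = sum of squared residuals

# Same quantity, written term by term as on the slide
assert np.isclose(rss, np.sum((y - b0 - b1 * x) ** 2))
```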

## Simple linear regression

Let’s put this back in matrix form:

\Large \begin{align} \sum \hat\varepsilon_i^2=\begin{bmatrix}\hat\varepsilon_1 &\hat\varepsilon_2 &\dots&\hat\varepsilon_n\end{bmatrix} \begin{bmatrix}\hat\varepsilon_1 \\ \hat\varepsilon_2 \\ \vdots \\ \hat\varepsilon_n\end{bmatrix} = \hat\varepsilon^T\hat\varepsilon \end{align}

## Simple linear regression

What can we replace $\hat\varepsilon_i$ with? (Hint: look back a few slides)

\Large \begin{align} \sum \hat\varepsilon_i^2 = (\mathbf{Y}-\mathbf{X}\hat\beta)^T(\mathbf{Y}-\mathbf{X}\hat\beta) \end{align}
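This identity is easy to verify numerically. A sketch with made-up data, confirming that the matrix form equals the scalar sum of squared residuals:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)   # simulated response

beta_hat = np.array([1.1, 1.9])   # any candidate estimate works for this identity
e = y - X @ beta_hat              # residual vector Y - X beta_hat

# e^T e (matrix form) equals the scalar sum of squared residuals
assert np.isclose(e @ e, np.sum(e ** 2))
```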

## Simple linear regression

OKAY! So this is the thing we are trying to minimize with respect to $\beta$:

$\Large (\mathbf{Y}-\mathbf{X}\beta)^T(\mathbf{Y}-\mathbf{X}\beta)$

In calculus, how do we minimize things?

• Take the derivative with respect to $\beta$
• Set it equal to 0 (or a vector of 0s!)
• Solve for $\beta$

## Matrix fact

\begin{align} \mathbf{C} &= \mathbf{AB}\\ \mathbf{C}^T &=\mathbf{B}^T\mathbf{A}^T \end{align}

## Try it!

• Distribute (FOIL / get rid of the parentheses) the RSS equation

$RSS = (\mathbf{y} - \mathbf{X}\hat\beta)^T(\mathbf{y}-\mathbf{X}\hat\beta)$


## Matrix fact

\begin{align} \mathbf{C} &= \mathbf{AB}\\ \mathbf{C}^T &=\mathbf{B}^T\mathbf{A}^T \end{align}

## Try it!

• Distribute (FOIL / get rid of the parentheses) the RSS equation

\begin{align} RSS &= (\mathbf{y} - \mathbf{X}\hat\beta)^T(\mathbf{y}-\mathbf{X}\hat\beta) \\ & = \mathbf{y}^T\mathbf{y}-\hat{\beta}^T\mathbf{X}^T\mathbf{y}-\mathbf{y}^T\mathbf{X}\hat\beta + \hat{\beta}^T\mathbf{X}^T\mathbf{X}\hat\beta \end{align}

## Matrix fact

• the transpose of a scalar is a scalar
• $\hat\beta^T\mathbf{X}^T\mathbf{y}$ is a scalar

Why? What are the dimensions of $\hat\beta^T$? What are the dimensions of $\mathbf{X}$? What are the dimensions of $\mathbf{y}$?

## Matrix fact

• $(\mathbf{y}^T\mathbf{X}\hat\beta)^T = \hat\beta^T\mathbf{X}^T\mathbf{y}$

\begin{align} RSS &= (\mathbf{y} - \mathbf{X}\hat\beta)^T(\mathbf{y}-\mathbf{X}\hat\beta) \\ & = \mathbf{y}^T\mathbf{y}-\hat{\beta}^T\mathbf{X}^T\mathbf{y}-\mathbf{y}^T\mathbf{X}\hat\beta + \hat{\beta}^T\mathbf{X}^T\mathbf{X}\hat\beta\\ &=\mathbf{y}^T\mathbf{y}-2\hat{\beta}^T\mathbf{X}^T\mathbf{y} + \hat{\beta}^T\mathbf{X}^T\mathbf{X}\hat\beta\\ \end{align}

## Linear Regression Review

To find the $\hat\beta$ that minimizes this RSS, what do we do? Why?

\begin{align} RSS &= (\mathbf{y} - \mathbf{X}\hat\beta)^T(\mathbf{y}-\mathbf{X}\hat\beta) \\ & = \mathbf{y}^T\mathbf{y}-\hat{\beta}^T\mathbf{X}^T\mathbf{y}-\mathbf{y}^T\mathbf{X}\hat\beta + \hat{\beta}^T\mathbf{X}^T\mathbf{X}\hat\beta\\ &=\mathbf{y}^T\mathbf{y}-2\hat{\beta}^T\mathbf{X}^T\mathbf{y} + \hat{\beta}^T\mathbf{X}^T\mathbf{X}\hat\beta\\ \end{align}

## Matrix fact

• When $\mathbf{a}$ and $\mathbf{b}$ are $p\times 1$ vectors

$\frac{\partial\mathbf{a}^T\mathbf{b}}{\partial\mathbf{b}}=\frac{\partial\mathbf{b}^T\mathbf{a}}{\partial\mathbf{b}}=\mathbf{a}$

• When $\mathbf{A}$ is a symmetric matrix

$\frac{\partial\mathbf{b}^T\mathbf{Ab}}{\partial\mathbf{b}}=2\mathbf{Ab}$

(written as a column vector; its transpose is the row vector $2\mathbf{b}^T\mathbf{A}$)

## Try it!

$\frac{\partial RSS}{\partial\hat\beta} =$

• $RSS = \mathbf{y}^T\mathbf{y}-2\hat{\beta}^T\mathbf{X}^T\mathbf{y} + \hat{\beta}^T\mathbf{X}^T\mathbf{X}\hat\beta$

## Linear Regression Review

How did we get $\mathbf{(X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$?

$RSS = \mathbf{y}^T\mathbf{y}-2\hat{\beta}^T\mathbf{X}^T\mathbf{y} + \hat{\beta}^T\mathbf{X}^T\mathbf{X}\hat\beta$

$\frac{\partial RSS}{\partial\hat\beta}=-2\mathbf{X}^T\mathbf{y}+2\mathbf{X}^T\mathbf{X}\hat\beta = 0$
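The first-order condition can be checked numerically: the gradient $-2\mathbf{X}^T\mathbf{y}+2\mathbf{X}^T\mathbf{X}\hat\beta$ should vanish at the least-squares solution. A sketch with simulated data, using `np.linalg.lstsq` to stand in for the closed form:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -2.0]) + rng.normal(size=n)

# Least-squares fit
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# The gradient of RSS is (numerically) zero at the minimizer
grad = -2 * X.T @ y + 2 * X.T @ X @ beta_hat
assert np.allclose(grad, 0, atol=1e-8)
```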

## Matrix fact

$\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$

What is $\mathbf{I}$?

• identity matrix

$\mathbf{I}=\begin{bmatrix} 1 & 0&\dots & 0 \\ 0&1 & \dots &0 \\ \vdots&\vdots&\ddots&\vdots\\ 0 & 0 & \dots & 1 \end{bmatrix}$

$\mathbf{AI} = \mathbf{A}$

## Try it!

• Solve for $\hat\beta$

$-2\mathbf{X}^T\mathbf{y}+2\mathbf{X}^T\mathbf{X}\hat\beta = 0$


## Linear Regression Review

How did we get $\mathbf{(X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$?

\begin{align} -2\mathbf{X}^T\mathbf{y}+2\mathbf{X}^T\mathbf{X}\hat\beta &= 0\\ 2\mathbf{X}^T\mathbf{X}\hat\beta & = 2\mathbf{X}^T\mathbf{y} \\ \mathbf{X}^T\mathbf{X}\hat\beta & =\mathbf{X}^T\mathbf{y} \\ \end{align}

## Linear Regression Review

How did we get $\mathbf{(X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$?

\begin{align} -2\mathbf{X}^T\mathbf{y}+2\mathbf{X}^T\mathbf{X}\hat\beta &= 0\\ 2\mathbf{X}^T\mathbf{X}\hat\beta & = 2\mathbf{X}^T\mathbf{y} \\ \mathbf{X}^T\mathbf{X}\hat\beta & =\mathbf{X}^T\mathbf{y} \\ (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\hat\beta &=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\ \end{align}

## Linear Regression Review

How did we get $\mathbf{(X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$?

\begin{align} -2\mathbf{X}^T\mathbf{y}+2\mathbf{X}^T\mathbf{X}\hat\beta &= 0\\ 2\mathbf{X}^T\mathbf{X}\hat\beta & = 2\mathbf{X}^T\mathbf{y} \\ \mathbf{X}^T\mathbf{X}\hat\beta & =\mathbf{X}^T\mathbf{y} \\ (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\hat\beta &=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\ \underbrace{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}}_{\mathbf{I}}\hat\beta &=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \end{align}

## Linear Regression Review

How did we get $\mathbf{(X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$?

\begin{align} -2\mathbf{X}^T\mathbf{y}+2\mathbf{X}^T\mathbf{X}\hat\beta &= 0\\ 2\mathbf{X}^T\mathbf{X}\hat\beta & = 2\mathbf{X}^T\mathbf{y} \\ \mathbf{X}^T\mathbf{X}\hat\beta & =\mathbf{X}^T\mathbf{y} \\ (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\hat\beta &=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\ \underbrace{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}}_{\mathbf{I}}\hat\beta &=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\ \mathbf{I}\hat\beta &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \end{align}

## Linear Regression Review

How did we get $\mathbf{(X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$?

\begin{align} -2\mathbf{X}^T\mathbf{y}+2\mathbf{X}^T\mathbf{X}\hat\beta &= 0\\ 2\mathbf{X}^T\mathbf{X}\hat\beta & = 2\mathbf{X}^T\mathbf{y} \\ \mathbf{X}^T\mathbf{X}\hat\beta & =\mathbf{X}^T\mathbf{y} \\ (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\hat\beta &=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\ \underbrace{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}}_{\mathbf{I}}\hat\beta &=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\ \mathbf{I}\hat\beta &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\ \hat\beta & = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \end{align}
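The closed form just derived can be checked against a standard solver. A numpy sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([3.0, 0.5]) + rng.normal(size=n)

# Closed form from the derivation: beta_hat = (X^T X)^{-1} X^T y
beta_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# A standard least-squares solver should agree
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_closed, beta_lstsq)
```

In practice `np.linalg.solve(X.T @ X, X.T @ y)` or `lstsq` is preferred over forming the explicit inverse, for numerical stability.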

## Simple linear regression

\begin{align} \begin{bmatrix}\hat{\beta}_0\\\hat{\beta}_1\end{bmatrix}= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \end{align}

## Simple linear regression

\begin{align} \hat{\mathbf{Y}} &= \mathbf{X}\hat{\beta}\\ \hat{\mathbf{Y}}&=\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \end{align}

## Simple linear regression

\begin{align} \hat{\mathbf{Y}} &= \mathbf{X}\hat{\beta}\\ \hat{\mathbf{Y}}&=\mathbf{X}\underbrace{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}}_{\hat\beta} \end{align}

## Simple linear regression

\begin{align} \hat{\mathbf{Y}} &= \mathbf{X}\hat{\beta}\\ \hat{\mathbf{Y}}&=\underbrace{\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T}_{\textrm{hat matrix}}\mathbf{Y} \end{align}

Why do you think this is called the “hat matrix”?
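The hat matrix's key properties can be verified numerically; this sketch (simulated data) shows it maps $\mathbf{Y}$ to the fitted values and is a projection (symmetric and idempotent):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T     # the hat matrix
y_hat = H @ y                            # H "puts the hat on" y

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
assert np.allclose(y_hat, X @ beta_hat)  # H y gives the fitted values
assert np.allclose(H, H.T)               # symmetric
assert np.allclose(H @ H, H)             # idempotent: projecting twice changes nothing
```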

## Multiple linear regression

We can generalize this beyond just one predictor

\begin{align} \begin{bmatrix}\hat{\beta}_0\\\hat{\beta}_1\\\vdots\\\hat{\beta}_p\end{bmatrix}= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \end{align}

What are the dimensions of the design matrix, $\mathbf{X}$ now?

• $\mathbf{X}_{n\times (p+1)}$

## Multiple linear regression

What are the dimensions of the design matrix, $\mathbf{X}$ now?

\begin{align} \mathbf{X} = \begin{bmatrix} 1 & X_{11} & X_{12} & \dots & X_{1p} \\ 1 & X_{21} & X_{22} & \dots & X_{2p} \\ \vdots & \vdots & \vdots & \vdots & \vdots\\ 1 & X_{n1} & X_{n2} & \dots & X_{np}\end{bmatrix} \end{align}
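Building this $n\times(p+1)$ design matrix and applying the same formula is mechanical; a numpy sketch with placeholder data:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 3
predictors = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), predictors])   # n x (p + 1): intercept column + p predictors
assert X.shape == (n, p + 1)

y = rng.normal(size=n)                          # placeholder response
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y     # same formula as before
assert beta_hat.shape == (p + 1,)               # one coefficient per column of X
```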

## $\hat\beta$ interpretation in multiple linear regression

The coefficient for $x$ is $\hat\beta$ (95% CI: $LB_{\hat\beta}$, $UB_{\hat\beta}$). A one-unit increase in $x$ is associated with an expected change in $y$ of $\hat\beta$, holding all other variables constant.

## Linear Regression Questions

• ✔️ Is there a relationship between a response variable and predictors?
• How strong is the relationship?
• What is the uncertainty?
• How accurately can we predict a future outcome?


## Linear regression uncertainty

• The standard error of an estimator reflects how it varies under repeated sampling

$\textrm{Var}(\hat{\beta}) =\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$

• $\sigma^2 = \textrm{Var}(\varepsilon)$
• In the case of simple linear regression, $\textrm{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n(x_i - \bar{x})^2}$
• This uncertainty is used in the test statistic $t = \frac{\hat\beta_1}{\textrm{SE}(\hat\beta_1)}$ for testing $H_0: \beta_1 = 0$
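These quantities can be computed directly from the matrix formula. A numpy sketch with simulated data; since $\sigma^2$ is unknown in practice, it is estimated here by $\textrm{RSS}/(n-2)$, the usual choice in simple linear regression:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)          # simulated data, true beta1 = 2
X = np.column_stack([np.ones(n), x])

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - 2)            # sigma^2 is unknown; estimate with n - 2 df

var_beta = sigma2_hat * np.linalg.inv(X.T @ X)  # Var(beta_hat) = sigma^2 (X^T X)^{-1}
se_beta1 = np.sqrt(var_beta[1, 1])
t = beta_hat[1] / se_beta1                      # test statistic for beta1

# Matches the scalar formula SE(beta1)^2 = sigma^2 / sum((x_i - xbar)^2)
assert np.isclose(se_beta1, np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2)))
```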