\[Y = \beta_0 + \beta_1 X + \varepsilon\]
We estimate this with
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x\]
\[Y_i = \beta_0 + \beta_1X_i + \varepsilon_i\]
\[\varepsilon_i\sim N(0, \sigma^2)\]
\[ \begin{align} Y_1 &= \beta_0 + \beta_1X_1 + \varepsilon_1\\ Y_2 &= \beta_0 + \beta_1X_2 + \varepsilon_2\\ \vdots \hspace{0.25cm} & \hspace{0.25cm} \vdots \hspace{0.5cm} \vdots\\ Y_n &=\beta_0 + \beta_1X_n + \varepsilon_n \end{align} \]
\[ \begin{align} \begin{bmatrix} Y_1 \\Y_2\\ \vdots\\ Y_n \end{bmatrix} & = \begin{bmatrix} \beta_0 + \beta_1X_1\\ \beta_0+\beta_1X_2\\ \vdots\\ \beta_0 + \beta_1X_n\end{bmatrix} + \begin{bmatrix}\varepsilon_1\\\varepsilon_2\\\vdots\\\varepsilon_n\end{bmatrix} \end{align} \]
\[ \begin{align} \begin{bmatrix} Y_1 \\Y_2\\ \vdots\\ Y_n \end{bmatrix} & = \begin{bmatrix} 1 \hspace{0.25cm} X_1\\ 1\hspace{0.25cm} X_2\\ \vdots\hspace{0.25cm} \vdots\\ 1\hspace{0.25cm}X_n\end{bmatrix} \begin{bmatrix}\beta_0\\\beta_1\end{bmatrix} + \begin{bmatrix}\varepsilon_1\\\varepsilon_2\\\vdots\\\varepsilon_n\end{bmatrix} \end{align} \]
\[ \Large \begin{align} \begin{bmatrix} Y_1 \\Y_2\\ \vdots\\ Y_n \end{bmatrix} & = \underbrace{\begin{bmatrix} 1 \hspace{0.25cm} X_1\\ 1\hspace{0.25cm} X_2\\ \vdots\hspace{0.25cm} \vdots\\ 1\hspace{0.25cm}X_n\end{bmatrix}}_{\mathbf{X}: \textrm{ Design Matrix}} \begin{bmatrix}\beta_0\\\beta_1\end{bmatrix} + \begin{bmatrix}\varepsilon_1\\\varepsilon_2\\\vdots\\\varepsilon_n\end{bmatrix} \end{align} \]
What are the dimensions of \(\mathbf{X}\)?
\[ \Large \begin{align} \begin{bmatrix} Y_1 \\Y_2\\ \vdots\\ Y_n \end{bmatrix} & = \underbrace{\begin{bmatrix} 1 \hspace{0.25cm} X_1\\ 1\hspace{0.25cm} X_2\\ \vdots\hspace{0.25cm} \vdots\\ 1\hspace{0.25cm}X_n\end{bmatrix}}_{\mathbf{X}: \textrm{ Design Matrix}} \underbrace{\begin{bmatrix}\beta_0\\\beta_1\end{bmatrix}}_{\beta: \textrm{ Vector of parameters}} + \begin{bmatrix}\varepsilon_1\\\varepsilon_2\\\vdots\\\varepsilon_n\end{bmatrix} \end{align} \]
What are the dimensions of \(\beta\)?
\[ \Large \begin{align} \begin{bmatrix} Y_1 \\Y_2\\ \vdots\\ Y_n \end{bmatrix} & = \begin{bmatrix} 1 \hspace{0.25cm} X_1\\ 1\hspace{0.25cm} X_2\\ \vdots\hspace{0.25cm} \vdots\\ 1\hspace{0.25cm}X_n\end{bmatrix} \begin{bmatrix}\beta_0\\\beta_1\end{bmatrix} + \underbrace{\begin{bmatrix}\varepsilon_1\\\varepsilon_2\\\vdots\\\varepsilon_n\end{bmatrix}}_{\varepsilon:\textrm{ vector of error terms}} \end{align} \]
What are the dimensions of \(\varepsilon\)?
\[ \Large \begin{align} \underbrace{\begin{bmatrix} Y_1 \\Y_2\\ \vdots\\ Y_n \end{bmatrix}}_{\textbf{Y}: \textrm{ Vector of responses}} & = \begin{bmatrix} 1 \hspace{0.25cm} X_1\\ 1\hspace{0.25cm} X_2\\ \vdots\hspace{0.25cm} \vdots\\ 1\hspace{0.25cm}X_n\end{bmatrix} \begin{bmatrix}\beta_0\\\beta_1\end{bmatrix} + \begin{bmatrix}\varepsilon_1\\\varepsilon_2\\\vdots\\\varepsilon_n\end{bmatrix} \end{align} \]
What are the dimensions of \(\mathbf{Y}\)?
\[\Large \mathbf{Y}=\mathbf{X}\beta+\varepsilon\]
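To make the compact form concrete, here is a minimal numpy sketch (the simulated data, seed, and variable names are made up for illustration, not part of the slides) that builds the design matrix \(\mathbf{X}\) by binding a column of ones to the predictor:

```python
import numpy as np

rng = np.random.default_rng(0)

# simulated data for illustration: y = 2 + 3x + noise
n = 100
x = rng.uniform(0, 10, size=n)
y = 2 + 3 * x + rng.normal(0, 1, size=n)

# design matrix: a column of ones (intercept) next to the predictor
X = np.column_stack([np.ones(n), x])
print(X.shape)   # (100, 2): n rows, one column per parameter
```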
\[ \Large \begin{align} \begin{bmatrix} \hat{y}_1 \\\hat{y}_2\\ \vdots\\ \hat{y}_n \end{bmatrix} & = \begin{bmatrix} 1 \hspace{0.25cm} x_1\\ 1\hspace{0.25cm} x_2\\ \vdots\hspace{0.25cm} \vdots\\ 1\hspace{0.25cm}x_n\end{bmatrix} \begin{bmatrix}\hat{\beta}_0\\ \hat{\beta}_1\end{bmatrix} \end{align} \]
\[\hat{y}_i=\hat{\beta}_0 + \hat{\beta}_1x_i\]
How are \(\hat{\beta}_0\) and \(\hat{\beta}_1\) chosen? What are we minimizing? We minimize the sum of squared residuals:
\[\sum_{i=1}^n\hat\varepsilon_i^2=\sum_{i=1}^n(y_i-\hat{y}_i)^2\]
How could we re-write this with \(y_i\) and \(x_i\)?
\[\sum_{i=1}^n(y_i-\hat{\beta}_0-\hat{\beta}_1x_i)^2\]
Let’s put this back in matrix form:
\[ \Large \begin{align} \sum \hat\varepsilon_i^2=\begin{bmatrix}\hat\varepsilon_1 &\hat\varepsilon_2 &\dots&\hat\varepsilon_n\end{bmatrix} \begin{bmatrix}\hat\varepsilon_1 \\ \hat\varepsilon_2 \\ \vdots \\ \hat\varepsilon_n\end{bmatrix} = \hat\varepsilon^T\hat\varepsilon \end{align} \]
What can we replace \(\hat\varepsilon_i\) with? (Hint: look back a few slides)
\[ \Large \begin{align} \sum \hat\varepsilon_i^2 = (\mathbf{Y}-\mathbf{X}\hat\beta)^T(\mathbf{Y}-\mathbf{X}\hat\beta) \end{align} \]
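As a quick sanity check (a sketch continuing the simulated data above, with an arbitrary candidate \(\hat\beta\) chosen for illustration), the element-wise sum of squared residuals matches the matrix product \(\hat\varepsilon^T\hat\varepsilon\):

```python
# any candidate coefficient vector works for this check
beta_hat = np.array([2.0, 3.0])

resid = y - X @ beta_hat                 # vector of residuals, epsilon-hat
rss_elementwise = np.sum(resid ** 2)     # sum of squared residuals
rss_matrix = resid.T @ resid             # epsilon-hat^T epsilon-hat
print(np.isclose(rss_elementwise, rss_matrix))   # True
```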
OKAY! So this is the thing we are trying to minimize with respect to \(\beta\):
\[\Large (\mathbf{Y}-\mathbf{X}\beta)^T(\mathbf{Y}-\mathbf{X}\beta)\]
In calculus, how do we minimize things?
Matrix fact
\[ \begin{align} \mathbf{C} &= \mathbf{AB}\\ \mathbf{C}^T &=\mathbf{B}^T\mathbf{A}^T \end{align} \]
Try it!
\[RSS = (\mathbf{y} - \mathbf{X}\hat\beta)^T(\mathbf{y}-\mathbf{X}\hat\beta)\]
\[ \begin{align} RSS &= (\mathbf{y} - \mathbf{X}\hat\beta)^T(\mathbf{y}-\mathbf{X}\hat\beta) \\ & = \mathbf{y}^T\mathbf{y}-\hat{\beta}^T\mathbf{X}^T\mathbf{y}-\mathbf{y}^T\mathbf{X}\hat\beta + \hat{\beta}^T\mathbf{X}^T\mathbf{X}\hat\beta \end{align} \]
Matrix fact
\[\hat{\beta}^T\mathbf{X}^T\mathbf{y}=\left(\hat{\beta}^T\mathbf{X}^T\mathbf{y}\right)^T=\mathbf{y}^T\mathbf{X}\hat\beta\]
Why? What are the dimensions of \(\hat\beta^T\)? What are the dimensions of \(\mathbf{X}\)? What are the dimensions of \(\mathbf{y}\)? (The product is \(1\times 1\), a scalar, and a scalar equals its own transpose, so the two cross terms can be combined.)
\[ \begin{align} RSS &= (\mathbf{y} - \mathbf{X}\hat\beta)^T(\mathbf{y}-\mathbf{X}\hat\beta) \\ & = \mathbf{y}^T\mathbf{y}-\hat{\beta}^T\mathbf{X}^T\mathbf{y}-\mathbf{y}^T\mathbf{X}\hat\beta + \hat{\beta}^T\mathbf{X}^T\mathbf{X}\hat\beta\\ &=\mathbf{y}^T\mathbf{y}-2\hat{\beta}^T\mathbf{X}^T\mathbf{y} + \hat{\beta}^T\mathbf{X}^T\mathbf{X}\hat\beta\\ \end{align} \]
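If the algebra feels slippery, the expansion can be checked numerically (continuing the sketch above; both sides should agree to floating-point precision):

```python
# quadratic form in the residuals
lhs = (y - X @ beta_hat).T @ (y - X @ beta_hat)

# expanded version, with the two cross terms combined into -2 * beta^T X^T y
rhs = y @ y - 2 * beta_hat @ X.T @ y + beta_hat @ X.T @ X @ beta_hat
print(np.isclose(lhs, rhs))   # True
```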
To find the \(\hat\beta\) that is going to minimize this RSS, what do we do? Why?
Matrix fact
\[\frac{\partial\mathbf{a}^T\mathbf{b}}{\partial\mathbf{b}}=\frac{\partial\mathbf{b}^T\mathbf{a}}{\partial\mathbf{b}}=\mathbf{a}\]
\[\frac{\partial\mathbf{b}^T\mathbf{Ab}}{\partial\mathbf{b}}=2\mathbf{Ab}\quad\textrm{when }\mathbf{A}\textrm{ is symmetric (as }\mathbf{X}^T\mathbf{X}\textrm{ is)}\]
Try it!
\[\frac{\partial RSS}{\partial\hat\beta} = \]
How did we get \(\mathbf{(X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)?
\[RSS = \mathbf{y}^T\mathbf{y}-2\hat{\beta}^T\mathbf{X}^T\mathbf{y} + \hat{\beta}^T\mathbf{X}^T\mathbf{X}\hat\beta\]
\[\frac{\partial RSS}{\partial\hat\beta}=-2\mathbf{X}^T\mathbf{y}+2\mathbf{X}^T\mathbf{X}\hat\beta = 0\]
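You can also check this gradient without the matrix-calculus facts, using a finite-difference approximation (a rough numerical sketch, continuing the simulated data and candidate beta_hat from above):

```python
def rss(b):
    """Residual sum of squares as a function of the coefficient vector."""
    r = y - X @ b
    return r @ r

# gradient from the slide: -2 X^T y + 2 X^T X beta-hat
grad_analytic = -2 * X.T @ y + 2 * X.T @ X @ beta_hat

# central differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.array([
    (rss(beta_hat + eps * np.eye(2)[j]) - rss(beta_hat - eps * np.eye(2)[j])) / (2 * eps)
    for j in range(2)
])
print(np.allclose(grad_analytic, grad_numeric, rtol=1e-4, atol=1e-6))   # True
```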
Matrix fact
\[\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}\]
What is \(\mathbf{I}\)?
\[\mathbf{I}=\begin{bmatrix} 1 & 0&\dots & 0 \\ 0&1 & \dots &0 \\ \vdots&\vdots&\ddots&\vdots\\ 0 & 0 & \dots & 1 \end{bmatrix}\]
\[\mathbf{AI} = \mathbf{A}\]
Try it!
\[-2\mathbf{X}^T\mathbf{y}+2\mathbf{X}^T\mathbf{X}\hat\beta = 0\]
How did we get \(\mathbf{(X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)?
\[ \begin{align} -2\mathbf{X}^T\mathbf{y}+2\mathbf{X}^T\mathbf{X}\hat\beta &= 0\\ 2\mathbf{X}^T\mathbf{X}\hat\beta & = 2\mathbf{X}^T\mathbf{y} \\ \mathbf{X}^T\mathbf{X}\hat\beta & =\mathbf{X}^T\mathbf{y} \\ (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\hat\beta &=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\ \underbrace{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}}_{\mathbf{I}}\hat\beta &=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\ \mathbf{I}\hat\beta &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\\ \hat\beta & = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \end{align} \]
\[ \begin{align} \begin{bmatrix}\hat{\beta}_0\\\hat{\beta}_1\end{bmatrix}= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \end{align} \]
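A sketch of this closed form in numpy (continuing the simulated data above); np.linalg.lstsq solves the same least-squares problem with a more numerically stable routine, so the two should agree:

```python
# normal-equations solution, exactly as derived above
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# library least-squares solver, for comparison
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)                            # roughly [2, 3] for the simulated data
print(np.allclose(beta_hat, beta_lstsq))   # True
```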
\[ \begin{align} \hat{\mathbf{Y}} &= \mathbf{X}\hat{\beta}\\ \hat{\mathbf{Y}}&=\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \end{align} \]
\[ \begin{align} \hat{\mathbf{Y}} &= \mathbf{X}\hat{\beta}\\ \hat{\mathbf{Y}}&=\mathbf{X}\underbrace{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}}_{\hat\beta} \end{align} \]
\[ \begin{align} \hat{\mathbf{Y}} &= \mathbf{X}\hat{\beta}\\ \hat{\mathbf{Y}}&=\underbrace{\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T}_{\textrm{hat matrix}}\mathbf{Y} \end{align} \]
Why do you think this is called the “hat matrix”?
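A quick look at the hat matrix in code (continuing the sketch): multiplying \(\mathbf{Y}\) by it “puts the hat on,” and it has the usual projection-matrix properties:

```python
# the hat matrix: n x n, projects y onto the column space of X
H = X @ np.linalg.inv(X.T @ X) @ X.T

y_hat = H @ y                              # fitted values: H "puts the hat on" y
print(np.allclose(y_hat, X @ beta_hat))    # True

print(np.allclose(H, H.T))                 # symmetric
print(np.allclose(H @ H, H))               # idempotent: projecting twice changes nothing
```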
We can generalize this beyond just one predictor
\[ \begin{align} \begin{bmatrix}\hat{\beta}_0\\\hat{\beta}_1\\\vdots\\\hat{\beta}_p\end{bmatrix}= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \end{align} \]
What are the dimensions of the design matrix, \(\mathbf{X}\) now?
\[ \begin{align} \mathbf{X} = \begin{bmatrix} 1 & X_{11} & X_{12} & \dots & X_{1p} \\ 1 & X_{21} & X_{22} & \dots & X_{2p} \\ \vdots & \vdots & \vdots & \vdots & \vdots\\ 1 & X_{n1} & X_{n2} & \dots & X_{np}\end{bmatrix} \end{align} \]
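The same formula (and the same line of code) handles \(p\) predictors; only the design matrix changes. A sketch with made-up predictors and coefficients, continuing the setup above:

```python
# p = 3 made-up predictors plus an intercept column
p = 3
Z = rng.normal(size=(n, p))
X_multi = np.column_stack([np.ones(n), Z])                   # n x (p + 1)
true_beta = np.array([1.0, 0.5, -2.0, 3.0])
y_multi = X_multi @ true_beta + rng.normal(0, 1, size=n)

beta_hat_multi = np.linalg.inv(X_multi.T @ X_multi) @ X_multi.T @ y_multi
print(X_multi.shape)       # (100, 4): n rows, p + 1 columns
print(beta_hat_multi)      # roughly [1, 0.5, -2, 3]
```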
The coefficient for \(x\) is \(\hat\beta\) (95% CI: \(LB_{\hat\beta}\), \(UB_{\hat\beta}\)). A one-unit increase in \(x\) yields an expected change in \(y\) of \(\hat\beta\), holding all other variables constant.
\[\textrm{Var}(\hat{\beta}) =\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\]
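In practice \(\sigma^2\) is unknown and is estimated from the residuals. A sketch for the simple-regression fit above, using the usual \(n - 2\) degrees of freedom and a normal critical value for a rough 95% interval (a \(t\) quantile would be slightly more exact):

```python
# estimate sigma^2 from the residuals of the simple regression fit
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - 2)                 # n - (number of parameters)

var_beta_hat = sigma2_hat * np.linalg.inv(X.T @ X)   # estimated Var(beta-hat), 2 x 2
se = np.sqrt(np.diag(var_beta_hat))                  # standard errors

lower = beta_hat - 1.96 * se                         # approximate 95% CI
upper = beta_hat + 1.96 * se
print(np.column_stack([beta_hat, lower, upper]))
```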
Dr. Lucy D’Agostino McGowan adapted from slides by Hastie & Tibshirani