What are some examples of classification problems?
- eye color \(\in\) {blue, brown, green}
- email \(\in\) {spam, not spam}
# scatterplot of balance vs. income, plus boxplots of balance and income by default status
set.seed(1)
Default |>
  sample_frac(size = 0.25) |>
  ggplot(aes(balance, income, color = default)) +
  geom_point(pch = 4) +
  scale_color_manual(values = c("cornflowerblue", "red")) +
  theme_classic() +
  theme(legend.position = "top") -> p1

p2 <- ggplot(Default, aes(x = default, y = balance, fill = default)) +
  geom_boxplot() +
  scale_fill_manual(values = c("cornflowerblue", "red")) +
  theme_classic() +
  theme(legend.position = "none")

p3 <- ggplot(Default, aes(x = default, y = income, fill = default)) +
  geom_boxplot() +
  scale_fill_manual(values = c("cornflowerblue", "red")) +
  theme_classic() +
  theme(legend.position = "none")

grid.arrange(p1, p2, p3, ncol = 3, widths = c(2, 1, 1))
We can code Default as
\[Y = \begin{cases} 0 & \textrm{if }\texttt{No}\\ 1&\textrm{if }\texttt{Yes}\end{cases}\]
Can we fit a linear regression of \(Y\) on \(X\) and classify as Yes if \(\hat{Y}> 0.5\)?
What may do a better job?
# add predicted probabilities from the logistic regression (p), fitted values from a
# linear probability model (p2), and a 0/1 version of default (def)
Default <- Default |>
  mutate(
    p = glm(default ~ balance, data = Default, family = "binomial") |>
      predict(type = "response"),
    p2 = lm(I(default == "Yes") ~ balance, data = Default) |> predict(),
    def = ifelse(default == "Yes", 1, 0)
  )
# left panel: linear probability model fit (p2); right panel: logistic regression fit (p)
Default |>
  sample_frac(0.25) |>
  ggplot(aes(balance, p2)) +
  geom_hline(yintercept = c(0, 1), lty = 2, size = 0.2) +
  geom_line(color = "cornflowerblue") +
  geom_point(aes(balance, def), shape = "|", color = "orange") +
  theme_classic() +
  labs(y = "probability of default") -> p1

Default |>
  sample_frac(0.25) |>
  ggplot(aes(balance, p)) +
  geom_hline(yintercept = c(0, 1), lty = 2, size = 0.2) +
  geom_line(color = "cornflowerblue") +
  geom_point(aes(balance, def), shape = "|", color = "orange") +
  theme_classic() +
  labs(y = "probability of default") -> p2

grid.arrange(p1, p2, ncol = 2)
Which does a better job at predicting the probability of default?
What if we have \(>2\) possible outcomes? For example, someone comes to the emergency room and we need to classify them according to their symptoms
\[ \begin{align} Y = \begin{cases} 1& \textrm{if }\texttt{stroke}\\2&\textrm{if }\texttt{drug overdose}\\3&\textrm{if }\texttt{epileptic seizure}\end{cases} \end{align} \]
What could go wrong here?
This coding implies an ordering of the outcomes, and it says that the difference between stroke and drug overdose is the same as the difference between drug overdose and epileptic seizure.
Logistic regression models the probability \(p(X) = P(Y = 1 \mid X)\) as
\[ p(X) = \frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}} \]
We can rearrange this into the following form:
\[ \log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X \]
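For completeness, the algebra behind this rearrangement:
\[
p(X)\left(1+e^{\beta_0+\beta_1X}\right)=e^{\beta_0+\beta_1X}
\quad\Rightarrow\quad
p(X)=\left(1-p(X)\right)e^{\beta_0+\beta_1X}
\quad\Rightarrow\quad
\frac{p(X)}{1-p(X)}=e^{\beta_0+\beta_1X},
\]
and taking the log of both sides gives the logit form above.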
What is this transformation called?
Logistic regression ensures that our estimates for \(p(X)\) are between 0 and 1 🎉
Refresher: How did we estimate \(\hat\beta\) in linear regression?
In logistic regression, we use maximum likelihood to estimate the parameters
\[\ell(\beta_0,\beta_1)=\prod_{i:y_i=1}p(x_i)\prod_{i:y_i=0}(1-p(x_i))\]
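In practice we maximize the log of this likelihood. A minimal sketch of evaluating it for the Default data (not from the slides; the names y, x, log_lik, and fit are ours) and checking it against glm():

y <- ifelse(Default$default == "Yes", 1, 0)
x <- Default$balance

# log of the likelihood above, as a function of the two coefficients
log_lik <- function(b0, b1) {
  p <- plogis(b0 + b1 * x)                # p(x_i) under the logistic model
  sum(y * log(p) + (1 - y) * log(1 - p))
}

fit <- glm(default ~ balance, data = Default, family = "binomial")
log_lik(coef(fit)[1], coef(fit)[2])  # matches logLik(fit)
log_lik(-10, 0.005)                  # any other (beta0, beta1) gives a smaller value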
We let R do the heavy lifting here, using the logistic_reg() function in R with the glm engine:

# A tibble: 2 × 5
  term         estimate std.error statistic   p.value
  <chr>           <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept) -10.7      0.361        -29.5 3.62e-191
2 balance       0.00550  0.000220      25.0 1.98e-137

What is our estimated probability of default for someone with a balance of $1000?
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -10.6513306 | 0.3611574 | -29.49221 | 0 |
balance | 0.0054989 | 0.0002204 | 24.95309 | 0 |
\[ \hat{p}(X) = \frac{e^{\hat{\beta}_0+\hat{\beta}_1X}}{1+e^{\hat{\beta}_0+\hat{\beta}_1X}}=\frac{e^{-10.65+0.0055\times 1000}}{1+e^{-10.65+0.0055\times 1000}}=0.006 \]
What is our estimated probability of default for someone with a balance of $2000?
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -10.6513306 | 0.3611574 | -29.49221 | 0 |
balance | 0.0054989 | 0.0002204 | 24.95309 | 0 |
\[ \hat{p}(X) = \frac{e^{\hat{\beta}_0+\hat{\beta}_1X}}{1+e^{\hat{\beta}_0+\hat{\beta}_1X}}=\frac{e^{-10.65+0.0055\times 2000}}{1+e^{-10.65+0.0055\times 2000}}=0.586 \]
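Both hand calculations can be checked with predict(); a quick sketch (refitting the model so the chunk stands alone):

# estimated probability of default at balances of $1000 and $2000
fit <- glm(default ~ balance, data = Default, family = "binomial")
predict(fit, newdata = data.frame(balance = c(1000, 2000)), type = "response")
# about 0.006 and 0.586, matching the calculations above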
Let’s refit the model to predict the probability of default given the customer is a student
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -3.5041278 | 0.0707130 | -49.554219 | 0.0000000 |
studentYes | 0.4048871 | 0.1150188 | 3.520181 | 0.0004313 |
\[P(\texttt{default = Yes}|\texttt{student = Yes}) = \frac{e^{-3.5041+0.4049\times1}}{1+e^{-3.5041+0.4049\times1}}=0.0431\]
How will this change if student = No?
\[P(\texttt{default = Yes}|\texttt{student = No}) = \frac{e^{-3.5041+0.4049\times0}}{1+e^{-3.5041+0.4049\times0}}=0.0292\]
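A sketch of that refit and the two predicted probabilities (the object name fit_student is ours):

# default ~ student, with predicted probabilities for students and non-students
fit_student <- glm(default ~ student, data = Default, family = "binomial")
predict(fit_student, newdata = data.frame(student = c("Yes", "No")), type = "response")
# about 0.0431 and 0.0292, as computed above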
\[\log\left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1X_1+\dots+\beta_pX_p\] \[p(X) = \frac{e^{\beta_0+\beta_1X_1+\dots+\beta_pX_p}}{1+e^{\beta_0+\beta_1X_1+\dots+\beta_pX_p}}\]
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -10.8690452 | 0.4922555 | -22.080088 | 0.0000000 |
balance | 0.0057365 | 0.0002319 | 24.737563 | 0.0000000 |
income | 0.0000030 | 0.0000082 | 0.369815 | 0.7115203 |
studentYes | -0.6467758 | 0.2362525 | -2.737646 | 0.0061881 |
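A sketch of the call behind this table (assuming the broom package is available for tidy()):

# multiple logistic regression with balance, income, and student
glm(default ~ balance + income + student, data = Default, family = "binomial") |>
  broom::tidy()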
Why is the coefficient for student negative now when it was positive before? What is going on here?
\[P(Y=k|X) = \frac{e ^{\beta_{0k}+\beta_{1k}X_1+\dots+\beta_{pk}X_p}}{\sum_{l=1}^Ke^{\beta_{0l}+\beta_{1l}X_1+\dots+\beta_{pl}X_p}}\]
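One way to fit a model of this form in R is nnet::multinom(). A minimal sketch on the built-in iris data, purely to show the call shape (the emergency-room example above is hypothetical, not a dataset we have here):

# multinomial (softmax) logistic regression via nnet::multinom()
library(nnet)
fit_multi <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)
head(predict(fit_multi, type = "probs"))  # one column of P(Y = k | X) per class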
How would you get the odds from the log(odds)?
Form | Model |
---|---|
Logit form | \(\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1x\) |
Probability form | \(\Large\pi = \frac{e^{\beta_0 + \beta_1x}}{1+e^{\beta_0 + \beta_1x}}\) |
probability | odds | log(odds) |
---|---|---|
\(\pi\) | \(\frac{\pi}{1-\pi}\) | \(\log\left(\frac{\pi}{1-\pi}\right)=l\) |
Going back the other direction ⬅️
log(odds) | odds | probability |
---|---|---|
\(l\) | \(e^l\) | \(\frac{e^l}{1+e^l} = \pi\) |
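A quick numeric check of these conversions, for an arbitrary log(odds) value:

# log(odds) -> odds -> probability
l <- 0.5
exp(l)                 # odds
exp(l) / (1 + exp(l))  # probability (equivalently, plogis(l))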
A study investigated whether a handheld device that sends a magnetic pulse into a person’s head might be an effective treatment for migraine headaches.

- What is the explanatory variable?
- What type of variable is this?
- What is the outcome variable?
- What type of variable is this?
  | TMS | Placebo | Total |
---|---|---|---|
Pain-free two hours later | 39 | 22 | 61 |
Not pain-free two hours later | 61 | 78 | 139 |
Total | 100 | 100 | 200 |

What if we wanted to calculate this in terms of Not pain-free (with Pain-free as the referent)?

What changed here?

In general, it’s more natural to interpret odds ratios > 1; you can flip the referent to do so.
\(\Large OR = \frac{78/22}{61/39} = \frac{3.545}{1.564} = 2.27\)
The odds of still being in pain for the placebo group are 2.27 times the odds of still being in pain for the TMS group.
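The same arithmetic in R, straight from the two-by-two table:

# odds of still being in pain, placebo vs. TMS, and their ratio
odds_placebo <- 78 / 22
odds_tms     <- 61 / 39
odds_placebo / odds_tms  # about 2.27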
Application Exercise
5157 women were enrolled in a study and were randomly assigned to receive either letrozole or a placebo. The primary response variable of interest was disease-free survival.
  | letrozole | placebo | total |
---|---|---|---|
death or disease | 185 | 341 | 526 |
no death or disease | 2390 | 2241 | 4631 |
total | 2575 | 2582 | 5157 |
Let’s look at some Titanic data. We are interested in whether a passenger’s reported sex (female vs. male) is related to whether they survived.
  | Female | Male | Total |
---|---|---|---|
Survived | 308 | 142 | 450 |
Died | 154 | 709 | 863 |
Total | 462 | 851 | 1313 |
What are the odds of surviving for females versus males?
\[\Large OR = \frac{308/154}{142/709} = \frac{2}{0.2003} = 9.99\]
How do you interpret this?
The odds of surviving for the female passengers were 9.99 times the odds of surviving for the male passengers.
What if we wanted to fit a model? What would the equation be?
\[\Large \log(\textrm{odds of survival}) = \beta_0 + \beta_1 \textrm{Female}\]
How do you interpret this result?
logistic_reg() |>
  set_engine("glm") |>
  fit(Survived ~ Sex, data = Titanic) |>
  tidy(exponentiate = TRUE)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    0.200    0.0919     -17.5 1.70e-68
2 Sexfemale      9.99     0.135       17.1 2.91e-65
The exponentiated coefficient for Sexfemale is 9.99, matching the odds ratio from the table: the odds of surviving for the female passengers were 9.99 times the odds of surviving for the male passengers.
What if the explanatory variable is continuous?
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -19.2       5.63     -3.41 0.000644
2 GPA             5.45      1.58      3.45 0.000553
A one unit increase in GPA yields a 5.45 increase in the log odds of acceptance
logistic_reg() |>
  set_engine("glm") |>
  fit(Acceptance ~ GPA, data = MedGPA) |>
  tidy(exponentiate = TRUE)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  4.56e-9       5.63     -3.41 0.000644
2 GPA          2.34e+2       1.58      3.45 0.000553
A one unit increase in GPA yields a 234-fold increase in the odds of acceptance
How could we get the odds associated with increasing GPA by 0.1?
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   -19.2       5.63     -3.41 0.000644
2 GPA             5.45      1.58      3.45 0.000553
A one-tenth unit increase in GPA yields a 1.73-fold increase in the odds of acceptance
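This comes from scaling the coefficient before exponentiating (the displayed 5.45 is rounded, so the quick check lands just under 1.73). Equivalently, we can rescale GPA before fitting, as in the next chunk.

# odds ratio for a 0.1-unit increase in GPA
exp(0.1 * 5.45)  # about 1.72 with the rounded coefficient; 1.73 with full precision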
MedGPA <- MedGPA |>
  mutate(GPA_10 = GPA * 10)

logistic_reg() |>
  set_engine("glm") |>
  fit(Acceptance ~ GPA_10, data = MedGPA) |>
  tidy(exponentiate = TRUE)
# A tibble: 2 × 5
  term           estimate std.error statistic  p.value
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept) 0.00000000456    5.63      -3.41 0.000644
2 GPA_10      1.73             0.158      3.45 0.000553
A one-tenth unit increase in GPA yields a 1.73-fold increase in the odds of acceptance
Application Exercise
Using the Default data from the ISLR package, fit a logistic regression model predicting whether a customer defaults with whether they are a student and their current balance.
Here is some code to get you started:
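A sketch following the logistic_reg() pattern from earlier in these slides:

# starter sketch: logistic regression of default on student and balance
logistic_reg() |>
  set_engine("glm") |>
  fit(default ~ student + balance, data = Default) |>
  tidy()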
Dr. Lucy D’Agostino McGowan, adapted from slides by Hastie & Tibshirani.