202412091014
Status: #idea
Tags: Classification, Discriminative Models
State: #nascent

Logistic Regression

Logistic Regression is one of the big two of Linear Methods for Classification (the other being Linear Discriminant Analysis (LDA)) and also just a classic classification method.

It arises from the desire to fit a model that can predict the posterior probability $P(G = k \mid X = x)$ given some vector $x$, similarly to what we'd do with linear regression, but while constraining the output to what makes probabilities probabilities: values between 0 and 1 that sum to one over the classes.

It takes the form:

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

Where, as we can see, we use exp(linear model) in both the numerator and the denominator, and then add a 1 to the denominator.

This has two main effects: the output always stays strictly between 0 and 1, so it can be read as a probability, and it increases monotonically (in an S-shaped curve) with the linear model.
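A quick numeric sketch of that squashing behaviour (the coefficient values here are made up, they only illustrate the shape of the output):

```python
import numpy as np

def logistic_p(x, beta0, beta1):
    """p(X) = exp(beta0 + beta1*x) / (1 + exp(beta0 + beta1*x))."""
    z = beta0 + beta1 * x                      # the linear model (the logit)
    return np.exp(z) / (1.0 + np.exp(z))

# made-up coefficients, just to see the squashing effect
print(logistic_p(np.array([-5.0, 0.0, 5.0]), beta0=0.5, beta1=1.2))
# every output stays strictly between 0 and 1
```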

It is called logistic because if we take the log of the odds, that is $\frac{p(X)}{1 - p(X)}$, that transformation recovers the linear model; this means:

$$\ln\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$$

The output of the linear model $\beta_0 + \beta_1 X$, before the application of this transformation (analogous to the activation function in typical machine learning), is called the logit.

The logistic model is in the family of Linear Methods for Classification exactly because of the above: as we can see, the logit is linear in $X$.
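A tiny sketch, with the same made-up coefficients as above, checking that applying the log-odds transformation to $p(X)$ gives back exactly the linear model:

```python
import numpy as np

beta0, beta1 = 0.5, 1.2                            # same made-up coefficients as above
x = np.linspace(-3, 3, 7)
p = np.exp(beta0 + beta1 * x) / (1 + np.exp(beta0 + beta1 * x))

log_odds = np.log(p / (1 - p))                     # log of the odds, i.e. the logit
print(np.allclose(log_odds, beta0 + beta1 * x))    # True: the logit is linear in x
```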

Note that although we showed that the logit of logistic regression looks like a linear regression model, you might mistakenly assume that the vector $\beta$ is estimated in the same way. Nothing could be further from the truth: while in Linear Regression we can generally estimate the parameters directly in closed form (least squares), for the logistic model we must use Maximum Likelihood Estimation to find the answer.

This stems from the fact that while we recover the linear equation through the logit transformation, we never observe the logit itself, only the binary class labels, so there is no closed-form solution and the likelihood has to be maximized iteratively.
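As a rough sketch of what the MLE step looks like, the snippet below maximizes the Bernoulli log-likelihood with plain gradient ascent on fabricated data; real implementations typically use Newton-Raphson / IRLS instead, so treat this as an illustration rather than the actual algorithm libraries run:

```python
import numpy as np

def fit_logistic_mle(X, y, lr=0.1, n_iter=5000):
    """Rough MLE sketch: maximize the Bernoulli log-likelihood by
    plain gradient ascent (real implementations use Newton / IRLS)."""
    X = np.column_stack([np.ones(len(X)), X])     # prepend intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))       # current probabilities
        beta += lr * X.T @ (y - p) / len(y)       # gradient of the average log-likelihood
    return beta

# tiny fabricated dataset, only to show the call shape
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (rng.random(200) < 1 / (1 + np.exp(-(0.5 + 1.2 * X[:, 0])))).astype(float)
print(fit_logistic_mle(X, y))    # should land somewhere near [0.5, 1.2]
```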

What if we have multiple features?

Well, in such a case, you can easily extend the model to more features by analogy.
After all, if we know that the logit for a single variable is simply $\beta_0 + \beta_1 X$, then it follows that for more variables we'd just chuck them into the linear equation, similarly to what we would do in Multiple Linear Regression.

So

$$\ln\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n$$

The rest is just a matter of solving for p(X) which gives:

$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n}}$$

Obviously, along the way we assumed $n$ predictors in total, not counting the intercept.
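A sketch of the multi-predictor case using scikit-learn's LogisticRegression on fabricated data (note that scikit-learn applies L2 regularization by default, so the estimates are shrunk slightly compared to the pure MLE):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# fabricated two-feature data, only to demonstrate the multi-predictor form
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
logit = 0.3 + 1.0 * X[:, 0] - 2.0 * X[:, 1]
y = (rng.random(300) < 1 / (1 + np.exp(-logit))).astype(int)

model = LogisticRegression().fit(X, y)
print(model.intercept_, model.coef_)     # estimates of beta_0 and (beta_1, ..., beta_n)
print(model.predict_proba(X[:3]))        # p(X) for the first three rows
```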

Covered more in depth in Multinomial Logistic Regression.

In the case of more classes, while in theory we can still deal with it by adding more equations for everything that we are trying to compute (one probability for each class, or $K - 1$ of them, depending on whether we use the softmax or the traditional approach), weird errors can ensue. As a result, when we have more than two classes it is often a better idea to just use another method: Linear Discriminant Analysis (LDA), which copes with those problems better than Logistic Regression, if we are adamant on using linear decision boundaries, or any other method that copes better with more classes if not.
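As a hedged sketch of what the softmax version of that extension looks like (one hypothetical coefficient vector per class, with probabilities summing to 1 across the $K$ classes):

```python
import numpy as np

def softmax_probs(X, B):
    """Sketch of the softmax extension: one coefficient column per class,
    returning class probabilities that sum to 1 for every sample."""
    Z = X @ B                                   # (n_samples, K) matrix of logits
    Z -= Z.max(axis=1, keepdims=True)           # subtract row max for numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

# made-up numbers: 4 samples, 3 features, K = 3 classes
rng = np.random.default_rng(2)
P = softmax_probs(rng.normal(size=(4, 3)), rng.normal(size=(3, 3)))
print(P.sum(axis=1))    # each row sums to 1
```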