202412091014
Status: #idea
Tags: Classification, Discriminative Models
State: #nascent

Logistic Regression

Logistic Regression is one of the big two of Linear Methods for Classification (the other being Linear Discriminant Analysis (LDA)) and also just a classic classification method.

It arises from the desire to fit a model that can predict the posterior probability $P(G = k \mid X = x)$ given some vector $x$, similarly to what we'd do with linear regression, but while constraining the output to what makes probabilities probabilities: values between 0 and 1 that sum to one over the classes.

It takes the form:

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

Where, as we can see, we use exp(linear model) in both the numerator and the denominator, and then add a 1 to the denominator.

This has two main effects: the output always stays strictly between 0 and 1, so it can be read as a probability, and it increases monotonically (in an S-shaped curve) with the linear model.
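A quick numeric sketch of that squashing behaviour (the coefficient values here are made up, they only illustrate the shape of the output):

```python
import numpy as np

def logistic_p(x, beta0, beta1):
    """p(X) = exp(beta0 + beta1*x) / (1 + exp(beta0 + beta1*x))."""
    z = beta0 + beta1 * x                      # the linear model (the logit)
    return np.exp(z) / (1.0 + np.exp(z))

# made-up coefficients, just to see the squashing effect
print(logistic_p(np.array([-5.0, 0.0, 5.0]), beta0=0.5, beta1=1.2))
# every output stays strictly between 0 and 1
```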

It is called logistic because if we take the log of the odds, that is $\frac{p(X)}{1 - p(X)}$, that transformation recovers the linear model; this means:

$$\ln\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$$

The output of the linear model $\beta_0 + \beta_1 X$, before the application of this transformation (analogous to the activation function in typical machine learning), is called the logit.

The logistic model is in the family of Linear Methods for Classification exactly because of the above: as we can see, the logit is linear in $X$.
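A tiny sketch, with the same made-up coefficients as above, checking that applying the log-odds transformation to $p(X)$ gives back exactly the linear model:

```python
import numpy as np

beta0, beta1 = 0.5, 1.2                            # same made-up coefficients as above
x = np.linspace(-3, 3, 7)
p = np.exp(beta0 + beta1 * x) / (1 + np.exp(beta0 + beta1 * x))

log_odds = np.log(p / (1 - p))                     # log of the odds, i.e. the logit
print(np.allclose(log_odds, beta0 + beta1 * x))    # True: the logit is linear in x
```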

Note that although we showed that the logit of logistic regression looks like a linear regression model, you might mistakenly assume that the vector $\beta$ is estimated in the same way. Nothing could be further from the truth: while in Linear Regression we can generally estimate the parameters directly in closed form (least squares), for the logistic model we must use Maximum Likelihood Estimation to find the answer.

This stems from the fact that while we recover the linear equation through the logit transformation, we never observe the logit itself, only the binary class labels, so there is no closed-form solution and the likelihood has to be maximized iteratively.
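As a rough sketch of what the MLE step looks like, the snippet below maximizes the Bernoulli log-likelihood with plain gradient ascent on fabricated data; real implementations typically use Newton-Raphson / IRLS instead, so treat this as an illustration rather than the actual algorithm libraries run:

```python
import numpy as np

def fit_logistic_mle(X, y, lr=0.1, n_iter=5000):
    """Rough MLE sketch: maximize the Bernoulli log-likelihood by
    plain gradient ascent (real implementations use Newton / IRLS)."""
    X = np.column_stack([np.ones(len(X)), X])     # prepend intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))       # current probabilities
        beta += lr * X.T @ (y - p) / len(y)       # gradient of the average log-likelihood
    return beta

# tiny fabricated dataset, only to show the call shape
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (rng.random(200) < 1 / (1 + np.exp(-(0.5 + 1.2 * X[:, 0])))).astype(float)
print(fit_logistic_mle(X, y))    # should land somewhere near [0.5, 1.2]
```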

What if we have multiple features?

Well, in such a case, you can easily extend the model to more features by analogy.
After all, if we know that the logit for a single variable is simply $\beta_0 + \beta_1 X$, then it follows that for more variables we'd just chuck them into the linear equation, similarly to what we would do in Multiple Linear Regression.

So

$$\ln\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n$$

The rest is just a matter of solving for p(X) which gives:

$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_n X_n}}$$

Obviously, along the way we assumed $n$ predictors in total, not counting the intercept.
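A sketch of the multi-predictor case using scikit-learn's LogisticRegression on fabricated data (note that scikit-learn applies L2 regularization by default, so the estimates are shrunk slightly compared to the pure MLE):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# fabricated two-feature data, only to demonstrate the multi-predictor form
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
logit = 0.3 + 1.0 * X[:, 0] - 2.0 * X[:, 1]
y = (rng.random(300) < 1 / (1 + np.exp(-logit))).astype(int)

model = LogisticRegression().fit(X, y)
print(model.intercept_, model.coef_)     # estimates of beta_0 and (beta_1, ..., beta_n)
print(model.predict_proba(X[:3]))        # p(X) for the first three rows
```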

Covered more in depth in Multinomial Logistic Regression.

In the case of more classes, while in theory we can still deal with it by adding more equations for everything that we are trying to compute (one probability for each class, or $K - 1$ of them, depending on whether we use the softmax or the traditional approach), weird errors can ensue. As a result, when we have more than two classes it is often a better idea to just use another method: Linear Discriminant Analysis (LDA), which copes with those problems better than Logistic Regression, if we are adamant on using linear decision boundaries, or any other method that copes better with more classes if not.
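As a hedged sketch of what the softmax version of that extension looks like (one hypothetical coefficient vector per class, with probabilities summing to 1 across the $K$ classes):

```python
import numpy as np

def softmax_probs(X, B):
    """Sketch of the softmax extension: one coefficient column per class,
    returning class probabilities that sum to 1 for every sample."""
    Z = X @ B                                   # (n_samples, K) matrix of logits
    Z -= Z.max(axis=1, keepdims=True)           # subtract row max for numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

# made-up numbers: 4 samples, 3 features, K = 3 classes
rng = np.random.default_rng(2)
P = softmax_probs(rng.normal(size=(4, 3)), rng.normal(size=(3, 3)))
print(P.sum(axis=1))    # each row sums to 1
```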