202412110806
Status: #idea
Tags: Logistic Regression, Linear Methods for Classification
State: #nascient

Multinomial Logistic Regression

What if you wanted to use Logistic Regression on a classification problem with more than 2 labels?

You use multinomial logistic regression.
This is a generalization of what we already did in the two-class case, where $p(x) = \frac{e^{X^T\beta}}{1 + e^{X^T\beta}}$ and the complement, which acts as the baseline, ends up being $1 - p(x) = \frac{1}{1 + e^{X^T\beta}}$.

Similarly, for $K$ levels we pick any one of them to be the baseline (it doesn't matter which, as we'll see), which means we do not compute regression parameters for it.

As a result, for any $k \in \{1, \dots, K\} \setminus \{K_{\text{chosen}}\}$
we have

$$P(Y=k \mid X=x) = \frac{e^{\beta_{k0} + \beta_{k1}x_1 + \dots + \beta_{kp}x_p}}{1 + \sum_{i \neq K_{\text{chosen}}} e^{\beta_{i0} + \beta_{i1}x_1 + \dots + \beta_{ip}x_p}}$$

and for the baseline

$$P(Y=K_{\text{chosen}} \mid X=x) = \frac{1}{1 + \sum_{i \neq K_{\text{chosen}}} e^{\beta_{i0} + \beta_{i1}x_1 + \dots + \beta_{ip}x_p}}$$

We'd then assign the label with the highest probability.

We can then show, in a fashion identical to the binary case, that the log-odds of any class $k$ against the baseline $K_{\text{chosen}}$ is linear in $X$, so we still have a linear decision boundary.
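
Concretely, taking the ratio of the two probabilities above, the shared denominator cancels and we are left with

$$\ln\!\left(\frac{P(Y=k \mid X=x)}{P(Y=K_{\text{chosen}} \mid X=x)}\right) = \beta_{k0} + \beta_{k1}x_1 + \dots + \beta_{kp}x_p$$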

For that reason, it doesn't strictly matter which level we pick as the baseline. Software will generally pick either the first or the last level automatically, and if we want to set the baseline ourselves, most frameworks provide a method to do so.

No matter which level we pick, the decisions made by the model will be identical, but being aware of the baseline is still important: the log-odds, and the meaning of a change in one direction or another, are interpreted relative to that baseline.

For that reason, it can sometimes make sense to pick a different baseline than the automatically chosen one if we want to change the reference of comparison.
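
As a quick sanity check on the claim that the baseline choice doesn't affect the model's decisions, here is a minimal numpy sketch with made-up coefficient values: starting from baseline-coded coefficients (the baseline class's row is all zeros), re-expressing the same model with a different baseline amounts to subtracting that class's row from every row, and the resulting probabilities are identical.

```python
import numpy as np

rng = np.random.default_rng(1)
K, p = 3, 2  # illustrative sizes (made up)

# Baseline-coded coefficients with class 0 as the baseline:
# row k holds (intercept, slopes) for class k; the baseline row is all zeros.
B = np.vstack([np.zeros(p + 1), rng.normal(size=(K - 1, p + 1))])
x = rng.normal(size=p)

def probabilities(coefs, x):
    # exp(score_k) / sum_i exp(score_i); with a zero baseline row this matches
    # the e^{...} / (1 + sum e^{...}) expressions above.
    scores = coefs[:, 0] + coefs[:, 1:] @ x
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# The same model re-expressed with class 2 as the baseline:
# subtract class 2's row from every row so its coefficients become zero.
B_alt = B - B[2]

print(probabilities(B, x))      # the two probability vectors are identical,
print(probabilities(B_alt, x))  # even though the coefficient values differ
```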

Softmax Coding

The way we did it above is an extension of the classical logistic regression model, but those familiar with machine learning and deep learning are likely familiar with another notation:

$$P(Y=k \mid X=x) = \frac{e^{\beta_{k0} + \beta_{k1}x_1 + \dots + \beta_{kp}x_p}}{\sum_{i=1}^{K} e^{\beta_{i0} + \beta_{i1}x_1 + \dots + \beta_{ip}x_p}}$$

Here we do not pick a baseline and simply fit parameters for every class, so we estimate $K$ sets of coefficients rather than $K-1$. This is what is typically done in Machine Learning and Deep Learning, because it is computationally convenient and aligns better with the gradient-based optimization frameworks that are commonly used.
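
As a small illustration (using scikit-learn, which is outside the note's derivation): with its default lbfgs solver, recent versions of `LogisticRegression` fit this multinomial (softmax) parameterization for multiclass targets, so we get one row of coefficients per class rather than $K-1$ rows.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)   # 3 classes, 4 features

# With the default lbfgs solver this fits the multinomial (softmax) model.
clf = LogisticRegression(max_iter=1000).fit(X, y)

print(clf.coef_.shape)                    # (3, 4): one row of coefficients per class
print(clf.intercept_.shape)               # (3,): one intercept per class
print(clf.predict_proba(X[:2]).round(3))  # each row of probabilities sums to 1
```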

In this configuration we can take the log-odds between any two classes $k, k'$ as:

$$\ln\!\left(\frac{P(Y=k \mid X=x)}{P(Y=k' \mid X=x)}\right) = (\beta_{k0} - \beta_{k'0}) + (\beta_{k1} - \beta_{k'1})x_1 + \dots + (\beta_{kp} - \beta_{k'p})x_p$$
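
A tiny numpy check of this identity, with made-up sizes and coefficient values chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
K, p = 4, 3                       # illustrative sizes (made up)
B = rng.normal(size=(K, p + 1))   # softmax coding: one (intercept, slopes) row per class
x = rng.normal(size=p)

scores = B[:, 0] + B[:, 1:] @ x                 # linear predictor for each class
probs = np.exp(scores) / np.exp(scores).sum()   # softmax probabilities

k, k2 = 1, 3
lhs = np.log(probs[k] / probs[k2])                       # log-odds from the probabilities
rhs = (B[k, 0] - B[k2, 0]) + (B[k, 1:] - B[k2, 1:]) @ x  # difference of coefficients
print(np.isclose(lhs, rhs))  # True
```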

Pros & Cons

Pros

Cons

For these reasons, if we want to keep the decision boundary linear while still being able to retrieve the class probabilities when we need them, the usual replacement is Linear Discriminant Analysis (LDA).