202412110806
Status: #idea
Tags: Logistic Regression, Linear Methods for Classification
State: #nascient
Multinomial Logistic Regression
What if you wanted to use Logistic Regression on a classification problem with more than 2 labels?
You use multinomial logistic regression.
This a generalization of what we already did in the single variable case where
Similarly for
As a result, for any
we have
and for the baseline
We'd then assign the label with the highest probability.
We can then show in a fashion identical to the binary case that the log-odds of any
For that reason, it doesn't strictly matter which level we pick as the baseline, so if we use software it will generally pick it as either the first level or the last level automatically, or if we want to set a baseline ourselves, we can generally do so using the provided method in whatever framework we use.
While no matter which level we pick the decisions made by the model will be identical, being aware of the baseline is extremely important as the log-odds and the meaning of a change in one direction or another relates back to that baseline.
For that reason, at times it might make sense to pick a different baseline than the automatically picked one if we want to change the comparison reference.
Softmax Coding
The way we did it above is an extension of the classical logistic regression model, but those familiar with machine learning and deep learning are likely familiar with another notation:
Here we do not pick a baseline and simply fit parameters for the whole slew, we then need to estimate the
In this configuration we can take the log-odds between any two ratios
Pros & Cons
Pros
- Natural generalization of Logistic Regression
- Calibrated probability regressor for each class (all sum to
)
Cons
- Can be unstable where the classes are well separable (surprisingly)
- Other issues that I forget at this point
For these reasons, the replacement if we want to keep the decision boundary linear while still being to retrieve the probability if we need it is often the Linear Discriminant Analysis (LDA).