202412090918
Status: #idea
Tags: Classification
State: #nascent
Linear Methods for Classification
There are two standard ways, plus a third that kinda works but is theoretically hard to justify basically all the time. There are more methods that end up having a linear boundary once developed, but ISLR talks about these two. :shrug:
We focus here on Logistic Regression and Linear Discriminant Analysis (LDA).
The idea of Linear Methods for classification is simply that our data lives in some space, and we assume that lines (hyperplanes, in more than two dimensions) exist that can separate the labels of our data neatly.
This is a simple but extremely powerful idea, because if you remember anything from linear regression, by linear we mean linear in the coefficients.
Hence, we are perfectly within our rights to allow transformed features as inputs. We can get complexity and curves out of linear models by augmenting them with themselves: in practice, we call linear any method for which a transformation exists that makes the equation linear. That is why, even though

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

is clearly not linear in $X$, its log-odds

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$$

are indeed linear, and so Logistic Regression is in that class of classifiers (well, technically it's a regressor, like it's in the name, but it regresses a probability that we then use for classification, so kinda potay-to, potah-toe?).
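To see this concretely, here's a minimal sketch (assuming `numpy` and `scikit-learn` are installed; the data is made up) that fits a logistic regression and checks that the log-odds of its predicted probabilities really are affine in $X$:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # noisy binary labels

clf = LogisticRegression().fit(X, y)
p = clf.predict_proba(X)[:, 1]   # regressed probabilities P(Y = 1 | X)
log_odds = np.log(p / (1 - p))   # the logit transformation from above

# The logit is linear in X: it matches beta_0 + beta_1 * X from the fit.
print(np.allclose(log_odds, clf.intercept_[0] + clf.coef_[0, 0] * X[:, 0]))  # True
```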
Why not Linear Regression?
In practice, sometimes you could. If you have exactly two classes, it is quite possible that your linear regression model will give you similar performance to a more standard linear classification model.
So why not?
- The assumptions are often not justifiable (we require normality of the errors $\epsilon$, remember).
- The predictions of the model can be, and often are, negative or above one, which is nonsensical if they represent posterior probabilities. Even if this doesn't happen within the training range, in theory we could always find an input where the regressed value lands outside $[0, 1]$ (see the sketch below).
- For anything more than 2 classes, there will likely be a masking problem: since we can only fit one line, some class will be ignored (masked) even in cases where the data is clearly linearly separable. (We could probably just fit one line per class then? But the other problems still exist, and the sketch below shows that masking hits that approach too.)

*(Figure from The Elements of Statistical Learning, page 124, illustrating the masking problem.)*
So yeah, that's why.
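To make the probability and masking complaints concrete, here's a minimal sketch (again assuming `numpy` and `scikit-learn`, on synthetic data) that shows both failure modes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# 1) Regressing a 0/1 label: the fitted line happily leaves [0, 1].
X = rng.normal(size=(100, 1))
y = (X[:, 0] > 0).astype(float)
reg = LinearRegression().fit(X, y)
print(reg.predict(np.array([[-3.0], [3.0]])))  # below 0 and above 1

# 2) Masking: three classes along one axis, one regression line per
#    class indicator. The middle class's fitted line is nearly flat,
#    never dominates the other two, so argmax never predicts it.
X3 = np.linspace(-3, 3, 300).reshape(-1, 1)
labels = np.digitize(X3[:, 0], bins=[-1.0, 1.0])  # classes 0, 1, 2
scores = LinearRegression().fit(X3, np.eye(3)[labels]).predict(X3)
print(np.unique(scores.argmax(axis=1)))  # [0 2] -- class 1 is masked
```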
So what do we do if we want a linear decision boundary? There are three main candidates, but ISLR only mentions two at this point; let's list them all: