202412090953
Status: #idea
Tags: Classification, Generative Models
State: #nascent

Linear Discriminant Analysis (LDA)

This is a simple but powerful method that is often used thanks to its interpretive power, strong theoretical underpinnings, and wide applicability.

It excels in the following situations:

- When the classes are well separated, where the parameter estimates of logistic regression tend to be unstable.
- When $n$ is small and the predictors are approximately normally distributed within each class.
- When there are more than two response classes.

Contrary to logistic regression, which tries to model the posterior $P(Y = C_k \mid X = x)$ directly, LDA and its sibling Quadratic Discriminant Analysis (QDA) instead try to model:

$$P(X = x \mid Y = C_k)$$

In other words, while logistic regression models the probability of belonging to a class given $x$, linear discriminant analysis models the probability of $x$ being generated by a given class.

The observation is then assigned to whichever class has the highest probability of generating it. Note that if we want to convert that result to a posterior probability of the form $P(Y = C_k \mid X = x)$, we can; after all, by Bayes' theorem:

$$P(Y = C_k \mid X = x) = \frac{P(C_k)\, P(x \mid Y = C_k)}{\sum_{i=1}^{K} P(C_i)\, P(x \mid Y = C_i)}$$

Where $P(x \mid Y = C_k)$ is the likelihood, which is the term LDA estimates; $P(C_k)$ is the prior, generally estimated trivially as the proportion of class $k$ in the training set, $\hat{P}(C_k) = n_k / n$; and the denominator, $P(X = x)$, follows from the Law of Total Probability.

While possible, we typically do not bother: since the denominator is the same for all classes, we can focus on the numerator, to which we apply a few transformations. The result is called the Discriminant Score, given by the natural log of the numerator probability.
We take the natural log because LDA assumes the data within each class is normally distributed; the log then makes the score linear in $x$ (under the equal-covariance assumption made by Linear Discriminant Analysis (LDA)), as sketched below.
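
As a sketch of that step: with $X \mid Y = C_k \sim \mathcal{N}(\mu_k, \Sigma)$, a covariance $\Sigma$ shared across classes, and prior $\pi_k = P(C_k)$, taking the log of $\pi_k f_k(x)$ and dropping every term that does not depend on $k$ leaves the linear discriminant score

$$\delta_k(x) = x^{\top} \Sigma^{-1} \mu_k - \tfrac{1}{2}\, \mu_k^{\top} \Sigma^{-1} \mu_k + \ln \pi_k$$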
If we do not make the equal-covariance assumption, the quadratic term in $x$ no longer cancels and we instead get Quadratic Discriminant Analysis (QDA).
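
The corresponding QDA score, keeping a per-class covariance $\Sigma_k$, is then

$$\delta_k(x) = -\tfrac{1}{2} \ln \lvert \Sigma_k \rvert - \tfrac{1}{2} (x - \mu_k)^{\top} \Sigma_k^{-1} (x - \mu_k) + \ln \pi_k$$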

In either case, the discriminant score preserves the relative ordering of the posterior probabilities, so assigning each observation with an argmax over the scores works just fine.
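
As a minimal sketch of the whole procedure in code (the class `SimpleLDA` and its method names are illustrative, not any library's API), assuming Gaussian class densities with a pooled covariance:

```python
import numpy as np

class SimpleLDA:
    """Minimal LDA sketch: Gaussian class densities, one pooled covariance."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        n, d = X.shape
        # Priors: class proportions in the training set.
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])
        # Per-class means.
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # Pooled covariance: within-class scatter divided by (n - K).
        pooled = np.zeros((d, d))
        for c, mu in zip(self.classes_, self.means_):
            Xc = X[y == c] - mu
            pooled += Xc.T @ Xc
        self.cov_inv_ = np.linalg.inv(pooled / (n - len(self.classes_)))
        return self

    def scores(self, X):
        # delta_k(x) = x^T S^-1 mu_k - 0.5 * mu_k^T S^-1 mu_k + ln(pi_k)
        cols = []
        for mu, pi in zip(self.means_, self.priors_):
            w = self.cov_inv_ @ mu
            cols.append(X @ w - 0.5 * mu @ w + np.log(pi))
        return np.column_stack(cols)

    def predict(self, X):
        # Assign each observation to the class with the highest score.
        return self.classes_[np.argmax(self.scores(X), axis=1)]
```

For real use, scikit-learn's `sklearn.discriminant_analysis.LinearDiscriminantAnalysis` implements the same idea with better numerics.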

Note that in practice we CAN use other class-conditional distributions, but the Normal Distribution is most often used. In those cases, we have general Discriminant Analysis.

Regularization?

In 1989, Friedman proposed a model, Regularized Discriminant Analysis (RDA), which allows a compromise between Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA).

By this we mean that while per-class covariances are still estimated, as in QDA, the regularization term shrinks them toward a common pooled covariance, creating a continuum that connects the two methods.

The regularized covariance matrices are given by:

$$\hat{\Sigma}_k(\alpha) = \alpha \hat{\Sigma}_k + (1 - \alpha) \hat{\Sigma}$$

Where $\hat{\Sigma}$ is the pooled covariance matrix, as in the plain LDA case, and $\alpha \in [0, 1]$ controls the compromise: $\alpha = 1$ recovers QDA and $\alpha = 0$ recovers LDA.
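
A minimal sketch of this shrinkage step (the function name is illustrative), assuming we already have the per-class covariances and the pooled covariance from a fit like `SimpleLDA` above:

```python
def shrink_covariances(class_covs, pooled_cov, alpha):
    """Friedman-style shrinkage of per-class covariances (NumPy arrays):
    interpolate each one toward the pooled covariance.
    alpha = 1 recovers QDA, alpha = 0 recovers LDA."""
    assert 0.0 <= alpha <= 1.0, "alpha must lie in [0, 1]"
    return [alpha * cov_k + (1.0 - alpha) * pooled_cov for cov_k in class_covs]
```

In practice, $\alpha$ is typically chosen by cross-validation.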