202412111029
Status: #idea
Tags: Generative Models
State: #nascent

Naive Bayes

It is a cousin of the Discriminant Analysis models such as Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA).

While they all attempt to solve the same problem, namely obtaining the posterior probability by estimating the priors and the class-conditional likelihoods, Naive Bayes makes an assumption that is arguably even stronger than those of its cousins.

Linear Discriminant Analysis (LDA) assumes normality of X within the classes and equal covariance across classes, so we only have to estimate the means within the respective classes along with one pooled covariance matrix. Quadratic Discriminant Analysis (QDA) keeps the normality assumption while dropping the often untenable assumption of equal covariance. Naive Bayes flips the bird and just says: "You know what, all the predictors are conditionally independent given the class."
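
As a quick sanity check on how the three cousins line up in practice, here is a minimal sketch, assuming a synthetic dataset and scikit-learn's implementations of the three models (the dataset shape and settings are arbitrary illustrations):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Toy problem: the sizes and number of informative features are made up for illustration.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_redundant=0, n_classes=3, random_state=0)

# Cross-validate the three cousins on the same data.
for name, model in [("LDA", LinearDiscriminantAnalysis()),
                    ("QDA", QuadraticDiscriminantAnalysis()),
                    ("Naive Bayes", GaussianNB())]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```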

By that we mean that while the predictors may well have higher readings on average in one class and lower readings in another, once we fix our attention on a single class k the predictors are assumed to be independent of one another, so in particular the covariance between predictors within that class is 0.

In other words, $P(x \mid Y=C_k) = P(x_1 \mid Y=C_k)\times P(x_2 \mid Y=C_k)\times \cdots \times P(x_p \mid Y=C_k)$.
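
A tiny sketch of this factorisation, assuming Gaussian per-feature likelihoods with class means and variances that have already been estimated (the Gaussian choice is an assumption for illustration, not something the note prescribes):

```python
import numpy as np
from scipy.stats import norm

def class_conditional_likelihood(x, means_k, vars_k):
    """P(x | Y = C_k) as a product of per-feature densities: the naive assumption."""
    # Each feature contributes its own one-dimensional density; no covariances anywhere.
    return np.prod(norm.pdf(x, loc=means_k, scale=np.sqrt(vars_k)))

# Example: three predictors with assumed class-k means and variances (made-up numbers).
x = np.array([0.2, 1.5, -0.3])
print(class_conditional_likelihood(x, means_k=np.array([0.0, 1.0, 0.0]),
                                   vars_k=np.array([1.0, 1.0, 1.0])))
```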

Plugging this into Bayes' theorem, as for the other models, we get:

$$P(Y=C_k \mid X=x) = \frac{P(C_k)\times P(x_1 \mid Y=C_k)\times P(x_2 \mid Y=C_k)\times \cdots \times P(x_p \mid Y=C_k)}{\sum_{i=1}^{K} P(C_i)\times P(x_1 \mid Y=C_i)\times P(x_2 \mid Y=C_i)\times \cdots \times P(x_p \mid Y=C_i)}$$
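
Continuing the same assumed Gaussian setup, a minimal sketch of the full posterior, with the priors, means and variances indexed by class:

```python
import numpy as np
from scipy.stats import norm

def posteriors(x, priors, means, variances):
    """P(Y = C_k | X = x) for every class k via the naive factorisation above."""
    # Unnormalised numerators: prior times the product of per-feature likelihoods.
    numerators = np.array([
        prior * np.prod(norm.pdf(x, loc=mu, scale=np.sqrt(var)))
        for prior, mu, var in zip(priors, means, variances)
    ])
    # The denominator is the sum over all classes of the same quantity.
    return numerators / numerators.sum()

# Example with two classes and two predictors (all numbers are made up).
x = np.array([0.5, -0.2])
priors = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0], [1.0, 1.0]])
variances = np.array([[1.0, 1.0], [1.0, 1.0]])
print(posteriors(x, priors, means, variances))  # the two posteriors sum to 1
```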

Through its simplifying assumption, Naive Bayes increases the bias, since we essentially refuse to fit covariance matrices, but in exchange it decreases the variance. This means we would expect it to come into its own when p is big or n is small, that is, in contexts where accepting some bias is important to prevent overfitting.
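
To make the "fewer parameters" point concrete, here is a rough back-of-the-envelope count of the Gaussian parameters each model has to estimate; the values of K and p are arbitrary illustrations:

```python
# K classes, p predictors; counts cover means and (co)variances only,
# leaving out the K - 1 prior probabilities. K and p are arbitrary here.
K, p = 3, 100

lda = K * p + p * (p + 1) // 2      # K mean vectors + one pooled covariance matrix
qda = K * p + K * p * (p + 1) // 2  # K mean vectors + K full covariance matrices
nb = K * p + K * p                  # K mean vectors + K diagonal variance vectors

print(f"LDA: {lda}, QDA: {qda}, Naive Bayes: {nb}")
# -> LDA: 5350, QDA: 15450, Naive Bayes: 600
```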

Naive Bayes' success can be puzzling since, in most contexts, we'd expect the independence assumption to be entirely false.

Still, it can be explained through the lens of the Bias-Variance Tradeoff. Thanks to its assumptions we have far fewer parameters to fit, at the cost of bias. But a rigorous estimation of the full likelihoods would require ungodly amounts of data that in most cases are simply not available, so this bias ends up smaller than the large variance that would arise if other methods were used, because the lack of data prevents us from actually estimating all those parameters well.

As a result, even though the assumptions are really strong, it often gives pretty good results.

References
