202411252000
Status: #idea
Tags: Machine Learning, Non-Parametric Model
State: #awakened

k-Nearest Neighbor

Typically a Classification algorithm in the Machine Learning family (though it can also be used for regression) that assigns a label to a new observation based on the labels of its $k$ nearest neighbors.

Definition:

$$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$$

In English: consider all the $x_i$ within the neighborhood of $x$, look at their labels $y_i$, and take the average. The assumption here is that the label is binary, so it can be encoded as 0 and 1, which makes the average meaningful. If you have more than two levels, you simply assign $x$ to whichever label occurs most often in its neighborhood. You can convince yourself that this is equivalent to the formula above and can be seen as a generalization of it.
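
As a minimal sketch (not from the note itself; the Euclidean distance, the toy data, and the name `knn_classify` are illustrative assumptions), the majority-vote rule fits in a few lines of Python:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Assign x_new the majority label among its k nearest training points."""
    # Euclidean distance from x_new to every training observation
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest points: the neighborhood N_k(x)
    neighbors = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels
    return Counter(y_train[neighbors]).most_common(1)[0][0]

# Toy data: binary labels encoded as 0 and 1
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.95, 1.0]), k=3))  # -> 1
```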

k-NN is a form of Non-Parametric Model, a type of model that makes no assumptions about the functional form of the relationship being learned. In fact, no model is fitted at all, since we simply take a majority vote over the observations in a given $x_0$'s neighborhood (where $x_0$ is the observation we care about).
k-NN could be mistakenly understood to belong to the class of models called transductive models, but it does not. Even though no global model is built and inference is indeed done by relating test data to the training set (comparing the test observation to its neighborhood within the training set), the key difference is that a k-NN model does not use the test cases it is trying to predict in its optimization process (of which there is none).

Since the averaging function used is generalizable to any unseen observation, it is fundamentally different from transductive models for which the insights learned are only applicable to the specific test samples provided.

The main assumption a k-NN model makes is that observations which are close in space are similar. It is a classic example of a low bias, high variance model (especially for small $k$)!

In a Regression context, we'd call it k-Nearest Neighbor Averaging: exactly the same idea, but instead of assigning the majority label of the $k$ nearest neighbors, we assign the average response value of those neighbors.
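
As a hedged sketch of that variant (again an illustration; Euclidean distance and the helper name `knn_regress` are assumptions), only the last step changes, with the majority vote replaced by a mean:

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    """k-Nearest Neighbor Averaging: mean response of the k nearest training points."""
    # Euclidean distance from x_new to every training observation
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest points (the neighborhood)
    neighbors = np.argsort(dists)[:k]
    # Average the neighbors' responses instead of taking a majority vote
    return y_train[neighbors].mean()
```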

k-Nearest Neighbor is likely the most intuitive machine learning algorithm of them all.

The only complexities are choosing $k$ and the distance metric that defines the neighborhood.

Besides that, this is a deceptively simple but effective method that tries to approximate the Bayesian Averaging Estimator $E(Y \mid X)$, the optimal estimator you could hope for, by averaging over a "neighborhood" of "local" points rather than over the values of points exactly at $X$, which in practice is rarely feasible.
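
In the notation of the definition above, the two approximations at play can be written compactly: the expectation is replaced by a sample average, and conditioning at the exact point is relaxed to its neighborhood:

$$\hat{f}(x) = \operatorname{Ave}\bigl(y_i \mid x_i \in N_k(x)\bigr) \approx E\bigl(Y \mid X = x\bigr)$$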