202411252000
Status: #idea
Tags: Machine Learning, Non-Parametric Model
State: #awakened

k-Nearest Neighbor

Typically a Classification algorithm in the Machine Learning family (though it can also be used for regression) that assigns a label to a new observation based on the labels of its $k$ nearest neighbors.

Definition:

$$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$$

In English: consider all the $x_i$ within the neighborhood of $x$, look at their labels $y_i$, and take the average. The assumption here is that the label is binary, so it can be encoded as 0 and 1, which makes the average meaningful. If you have more than two levels, you simply assign $x$ to whichever label occurs most often in its neighborhood. You can convince yourself that this is equivalent to the formula above and can be seen as a generalization of it.
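
As a minimal sketch (not from the note itself; the Euclidean distance, the toy data, and the name `knn_classify` are illustrative assumptions), the majority-vote rule fits in a few lines of Python:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Assign x_new the majority label among its k nearest training points."""
    # Euclidean distance from x_new to every training observation
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest points: the neighborhood N_k(x)
    neighbors = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels
    return Counter(y_train[neighbors]).most_common(1)[0][0]

# Toy data: binary labels encoded as 0 and 1
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.95, 1.0]), k=3))  # -> 1
```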

k-NN is a form of Non-Parametric Model, a type of model that makes no assumptions about the functional form of the relationship being learned. In fact, no model is fitted at all, since we simply take a majority vote over the observations in a given $x_0$'s neighborhood (where $x_0$ is the observation we care about).
k-NN could be mistakenly understood to belong to the class of models called transductive models, but it does not. Even though no global model is built and inference is indeed done by relating test data to the training set (comparing the test observation to its neighborhood within the training set), the key difference is that a k-NN model does not use the test cases it is trying to predict in its optimization process (of which there is none).

Since the averaging function used is generalizable to any unseen observation, it is fundamentally different from transductive models for which the insights learned are only applicable to the specific test samples provided.

The main assumption a k-NN model makes is that observations which are close in space are similar. It is a classic example of a low bias, high variance model (especially for small $k$)!

In a Regression context, we'd call it k-Nearest Neighbor Averaging: exactly the same idea, but instead of assigning the majority label of the $k$ nearest neighbors, we assign the average response value of those neighbors.
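
As a hedged sketch of that variant (again an illustration; Euclidean distance and the helper name `knn_regress` are assumptions), only the last step changes, with the majority vote replaced by a mean:

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    """k-Nearest Neighbor Averaging: mean response of the k nearest training points."""
    # Euclidean distance from x_new to every training observation
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest points (the neighborhood)
    neighbors = np.argsort(dists)[:k]
    # Average the neighbors' responses instead of taking a majority vote
    return y_train[neighbors].mean()
```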

k-Nearest Neighbor is likely the most intuitive machine learning algorithm of them all.

The only complexities are choosing $k$ and the distance metric that defines the neighborhood.

Besides that, this is a deceptively simple but effective method that tries to approximate the Bayesian Averaging Estimator $E(Y \mid X)$, the optimal estimator you could hope for, by averaging over a "neighborhood" of "local" points rather than over the values of points exactly at $X$, which in practice is rarely feasible.
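
In the notation of the definition above, the two approximations at play can be written compactly: the expectation is replaced by a sample average, and conditioning at the exact point is relaxed to its neighborhood:

$$\hat{f}(x) = \operatorname{Ave}\bigl(y_i \mid x_i \in N_k(x)\bigr) \approx E\bigl(Y \mid X = x\bigr)$$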