202411252000
Status: #idea
Tags: Machine Learning, Non-Parametric Model
State: #awakened
k-Nearest Neighbor
Typically a Classification algorithm in the Machine Learning family (though it can also be used for regression) that assigns a label to a new observation based on the labels of its k nearest neighbors in the training set.
Definition:
In English: consider the k observations in the training set that are closest to the new point.
Look at their labels and assign the label that the majority of them share.
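A minimal sketch of that procedure, assuming NumPy (the function name and toy data are illustrative, not part of this note):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Assign to x_new the majority label among its k nearest training points."""
    # Euclidean distance from x_new to every training observation
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest observations
    nearest = np.argsort(distances)[:k]
    # Majority vote over their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.2, 0.1]), k=3))  # -> 0
```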
k-NN is a form of Non-Parametric Model, a type of model that makes no assumptions about the functional form of the relationship being learned. Here, in fact, no model is learned at all, since we simply take a majority vote over the observations in a given neighborhood.
k-NN could be mistakenly understood to belong to a class of models called transductive models, because no global model is built and inference is done by relating test data to the training set (comparing the test observation to its neighborhood within the training set). It is not transductive, however: the key difference is that the averaging rule generalizes to any unseen observation, whereas for transductive models the insights learned apply only to the specific test samples provided.
The main assumption that k-NN makes is that observations that are close to each other in feature space tend to share the same label (local smoothness).
In a Regression context, we'd call it k-Nearest Neighbor Averaging. It uses the exact same idea, but instead of assigning the majority label of the k nearest neighbors, we average their target values.
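A sketch of that regression variant under the same assumptions (NumPy, illustrative names and data):

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    """Predict for x_new the mean target value of its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Toy usage: averages the targets of the two closest training points
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.1, 1.1, 1.9, 3.2])
print(knn_regress(X_train, y_train, np.array([1.5]), k=2))  # -> 1.5
```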
k-Nearest Neighbor is likely the most intuitive machine learning algorithm of them all.
The only complexities are that:
- As opposed to most algorithms, a smaller k means a more complex model. (Intuitively, there are roughly N/k effective parameters: if the neighborhoods were non-overlapping, there would be N/k of them, and we would fit one value in each.)
- The notions of "closest" and "neighborhood" are entirely dependent on the measure of distance. Typically it is the Euclidean distance, but nothing in the specification of the algorithm forces this choice (see the sketch after this list).
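A sketch of that last point, assuming NumPy: the metric is just a parameter of the procedure, and swapping Euclidean for Manhattan distance can change who the nearest neighbor is without touching anything else.

```python
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b, axis=1)

def manhattan(a, b):
    return np.abs(a - b).sum(axis=1)

def k_nearest(X_train, x_new, k, distance=euclidean):
    """Return the indices of the k training points closest to x_new under `distance`."""
    return np.argsort(distance(X_train, x_new))[:k]

# The nearest neighbor of the origin differs under the two metrics
X_train = np.array([[1.0, 1.0], [1.5, 0.0]])
x_new = np.array([0.0, 0.0])
print(k_nearest(X_train, x_new, k=1))                      # [0] under Euclidean
print(k_nearest(X_train, x_new, k=1, distance=manhattan))  # [1] under Manhattan
```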
Besides that, this is a deceptively simple but effective method that tries to approximate the Bayesian Averaging Estimator, i.e. the conditional expectation E[Y | X = x], by replacing the expectation with an average over the neighborhood of x.
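Written out (standard textbook notation, not from this note), the local average stands in for the conditional expectation:

$$\hat{f}(x) = \operatorname{Ave}\bigl(y_i \mid x_i \in N_k(x)\bigr) \approx \mathbb{E}\,[\,Y \mid X = x\,]$$

where $N_k(x)$ is the set of the $k$ training points closest to $x$: the expectation is approximated by an average, and conditioning at the exact point $x$ is relaxed to conditioning on a neighborhood around $x$.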