202412112230
Status: #idea
Tags: Linear Methods for Classification
State: #nascent
Support Vector Classifiers (Soft Margin Classifier)
These classifiers build upon maximal margin classifiers and tackle the cases where the assumption of linear separability is not tenable.
After all, in most real cases it isn't true; moreover, even when it is, using the maximal margin makes us overly sensitive to a few points and might force our hand into a hyperplane whose margin is very narrow and sits quite close to the observations.
The support vector classifier essentially says, "Just chill, we can afford to lose that much" and still maximizes the margin, but allots from the start a budget for margin violations and misclassifications.
It allows observations not only on the wrong side of the margin but also on the wrong side of the hyperplane itself; the latter is inevitable if the data is not linearly separable.
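Concretely, in the standard soft-margin formulation (with $M$ the margin width, $\epsilon_1,\dots,\epsilon_n$ the slack variables, and $C \ge 0$ the budget), the problem we optimize is:

$$
\begin{aligned}
&\underset{\beta_0,\dots,\beta_p,\ \epsilon_1,\dots,\epsilon_n,\ M}{\text{maximize}} \quad M \\
&\text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \\
&\quad y_i\left(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}\right) \ge M(1 - \epsilon_i) \quad \text{for all } i, \\
&\quad \epsilon_i \ge 0, \qquad \sum_{i=1}^{n} \epsilon_i \le C.
\end{aligned}
$$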

We see that it is really similar to the problem we optimize for maximal margin classifiers, except that the constraint on the right-hand side now has an allowance for violations, as long as the total of those violations stays within the budget.
Let's note a few things about the formula:
- The error term $\epsilon_i$ represents the error associated with the $i$th observation.
- If $\epsilon_i = 0$, it means the $i$th observation is on the right side of the margin.
- If $0 < \epsilon_i \le 1$, we see that the observation has violated the margin; it is still on the right side of the hyperplane but has penetrated the margin.
- If $\epsilon_i > 1$, then the observation has not only penetrated the margin, it is fully on the wrong side of the hyperplane.
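A minimal sketch of these three regimes, assuming scikit-learn with a linear kernel; note that sklearn's `C` is a penalty weight (roughly the inverse of the budget used here), and it rescales the problem so the margin sits at $|f(x)| = 1$, which makes the slack $\epsilon_i = \max(0,\, 1 - y_i f(x_i))$:

```python
# Sketch: bucket training points into the three slack regimes above.
# Assumes scikit-learn; sklearn's C is a penalty weight, not the budget.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=0)
y = np.where(y == 1, 1, -1)                      # recode labels as +1 / -1

clf = SVC(kernel="linear", C=1.0).fit(X, y)
slack = np.maximum(0.0, 1.0 - y * clf.decision_function(X))

print("slack == 0 (right side of the margin):   ", int(np.sum(slack == 0)))
print("0 < slack <= 1 (inside the margin):      ", int(np.sum((slack > 0) & (slack <= 1))))
print("slack > 1 (wrong side of the hyperplane):", int(np.sum(slack > 1)))
```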
If $C = 0$ then this model is the same as the Maximal Margin Classifier (Optimal Separating Hyperplane).
By the facts we just explained, enforcing $\sum_{i=1}^{n} \epsilon_i \le C$ means that no more than $C$ observations can end up on the wrong side of the hyperplane, since each such observation uses up $\epsilon_i > 1$ of the budget.
As $C$ grows we become more tolerant of violations and the margin widens; as $C$ shrinks the margin narrows.
For big $C$ many observations violate the margin and become support vectors, so the fit has low variance but potentially high bias; for small $C$ the reverse holds.
Like all other tuning parameters before it, $C$ governs the bias-variance trade-off and is generally chosen through cross-validation.
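A rough sketch of that trade-off with scikit-learn (same caveat: sklearn's `C` penalizes violations, so small sklearn `C` plays the role of a big budget, giving a wide margin and many support vectors), plus the usual cross-validated choice:

```python
# Sketch: number of support vectors as the violation penalty changes,
# and choosing the tuning parameter by cross-validation.
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=1)

for penalty in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=penalty).fit(X, y)
    print(f"C={penalty:>6}: {len(clf.support_vectors_)} support vectors")

search = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, y)
print("cross-validated C:", search.best_params_["C"])
```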
For this method, the Support Vectors are extended to include not only the points that lie exactly on the margin, but also all the points that violate it, since logically moving any of those points would widen or shrink the margin and change the fit.
All the observations that land neatly on the correct side of the margin do not affect the fit at all.
This makes this model much more robust to far-away observations than other similar methods. Linear Discriminant Analysis (LDA), for one, depends on the mean of all the observations within each class (and the within-class covariance) to make predictions. Naive Bayes similarly estimates its priors and likelihoods from all the observations, so changing those observations might significantly affect the fit.
This is in stark contrast to Logistic Regression, which, like the support vector classifier, has very low sensitivity to observations far from the decision boundary.
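A small sketch of that robustness claim (scikit-learn again, linear kernel assumed): refitting on the support vectors alone reproduces essentially the same hyperplane, while the observations that never violate the margin could be dropped without changing anything.

```python
# Sketch: the fitted hyperplane depends only on the support vectors.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=80, centers=2, cluster_std=1.5, random_state=2)

full = SVC(kernel="linear", C=1.0).fit(X, y)
sv = full.support_                              # indices of the support vectors
refit = SVC(kernel="linear", C=1.0).fit(X[sv], y[sv])

print("coefficients, all observations:     ", full.coef_.round(3))
print("coefficients, support vectors only: ", refit.coef_.round(3))
```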
But what if we need something... non-linear? Enter the Support Vector Machine (SVM), the final generalization we consider.