202405170137
Status: #idea
Tags: Machine Learning

# Supervised Learning

Fundamentally the idea of supervised learning is simply to fit a curve to data.

That is literally it.
Everything else is just mathematical jargon and calculus (to find the right parameters and whatnot). I lied, there's also some Probability and Statistics in there, as well as Linear Algebra...

The rest is just math to derive your parameters.

Supervised learning can be summarized as the following equations:

$$
\boldsymbol y = f[\boldsymbol x, \phi]
$$

$$
L = \text{some function that quantifies how much the model fucked up}
$$

y here is a vector of observations, f is whatever class of mathematical functions we think is appropriate to fit in our model. In the case of Simple Linear Regression and Multiple Linear Regression these are linear functions, but in the case of Deep Learning it will invariably be some kind of neural network architecture. Finally we have ϕ, which represents all our parameters, and L, the loss, which quantifies how far the model's predictions land from the observations.
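
To make the notation concrete, here is a minimal sketch with f as a simple line and L as a least-squares loss; the function names and the numbers are made up purely for illustration:

```python
import numpy as np

# A minimal sketch: f is just a line and L is a least-squares loss.
# The names (linear_model, least_squares_loss) and data are illustrative only.

def linear_model(x, phi):
    """f[x, phi]: here just a line, with phi = (intercept, slope)."""
    return phi[0] + phi[1] * x

def least_squares_loss(y, y_hat):
    """L: quantifies how far the predictions y_hat are from the observations y."""
    return np.sum((y - y_hat) ** 2)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])   # observations
phi = np.array([1.0, 2.0])           # parameters
print(least_squares_loss(y, linear_model(x, phi)))
```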

When someone says with their chest puffed out that they are "training" a model or that the model "learned" from the data, it is nothing more than fancy speak for "we mathe-magically tweaked the vector of parameters ϕ until it gave us results that made sense."

The above sentence might change in the future with the advent of things like KANs (Kolmogorov-Arnold Networks), which allow us to train the activation functions as well, but that is beside the point for now.

## Methodology

Well, it depends. If you are hardy enough, for some of the more well-known models you could actually sit down and derive the best parameters (those that minimize the loss) by hand.

But in Deep Learning the models we are dealing with are way too big for that, and even if they weren't, we have computers to do the job. So the idea is more or less always something along the lines of (see the sketch after this list):

  1. Initialize the parameters at random values (making sure those values are in the right ballpark can make the difference between a model that is impossible to train and one that trains in seconds)
  2. Compute the loss function L for those randomly initialized parameters (e.g. Least Squares Error)
  3. Compute the gradient of that loss function (it points in the direction of steepest ascent)
  4. Back-propagate it to your parameters (each parameter now carries a value showing how changing it would increase the loss)
  5. Take a step proportional to your Learning Rate in the direction opposite to the gradient.
  6. Repeat until either:
    • Gradient is 0 and no more improvement can be made
    • We have trained for all the allocated epochs
    • We have run out of time
    • We have run out of resources
    • The model's loss has been stagnating for a given number of epochs
    • The model's loss has actually started going up
    • etc.
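
Here is that loop sketched on the same toy linear model as before, assuming a least-squares loss; the learning rate, epoch budget, and stopping tolerance are arbitrary illustrative values:

```python
import numpy as np

# Gradient descent on a toy linear model with a least-squares loss.
# Learning rate, epoch count, and tolerance are illustrative values only.

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

phi = rng.normal(size=2)                   # 1. random initialization
learning_rate = 0.01

for epoch in range(10_000):
    y_hat = phi[0] + phi[1] * x            # forward pass
    loss = np.sum((y - y_hat) ** 2)        # 2. compute the loss
    # 3./4. gradient of the loss w.r.t. each parameter (derived by hand here;
    # back-propagation does this automatically for deep networks)
    grad = np.array([
        -2.0 * np.sum(y - y_hat),          # dL/d phi_0
        -2.0 * np.sum((y - y_hat) * x),    # dL/d phi_1
    ])
    phi -= learning_rate * grad            # 5. step against the gradient
    if np.linalg.norm(grad) < 1e-8:        # 6. stop once the gradient vanishes
        break

print(phi, loss)
```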

This is all there is to it fundamentally: we just want to find the parameters, often called weights, that minimize our loss function, or in math-speak:

$$
\hat{\phi} = \underset{\phi}{\operatorname{argmin}} \bigl[ L[\phi] \bigr]
$$

Since we never know the actual population statistics, so to speak, of our model, we need to approximate them. The above equation just means "the vector of parameters that best estimates ϕ is the one that minimizes the loss function."
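
For a model as small as the toy linear one, that argmin even has a closed-form answer (the normal equations). A quick sketch, assuming the same data as above, using NumPy's least-squares solver as a sanity check against the iterative version:

```python
import numpy as np

# Closed-form least-squares solution: phi_hat = argmin_phi L[phi],
# computed directly instead of by gradient descent.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

X = np.column_stack([np.ones_like(x), x])        # design matrix [1, x]
phi_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(phi_hat)                                    # best intercept and slope
```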

The rest is just the math to derive the parameters, and whatnot.

Finding good architectures (the form of f itself) is a rough task as well.

## Generative vs Discriminative Models

Discriminative Models (what we presented above) are models which go the intuitive way of using the regressors, or independent variables, to predict the dependent variables. Generative Models, on the other hand, choose the unorthodox approach of "generating" from the observations the measurements/regressors that could have led to them; essentially inverting the direction.

So while discriminative models take the form

$$
\boldsymbol y = f[\boldsymbol x, \phi]
$$

generative models take the form:

$$
\boldsymbol x = g[\boldsymbol y, \phi]
$$

Uh... what? Yeah, we are finding an equation for $\boldsymbol x$ using $\boldsymbol y$, you are not reading that wrong. This has the obvious disadvantage that the model cannot actually be used "as is" to make any inference, since what we are computing is $\boldsymbol x$. As a result, we need to invert the model to get:

$$
\boldsymbol y = g^{-1}[\boldsymbol x, \phi]
$$
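
As a sketch of how this inversion might look in the simplest possible case, assume a one-dimensional linear g (real generative models are rarely this easy to invert; the names and values here are made up for illustration):

```python
import numpy as np

# A trivially invertible toy generative model: x = g(y, phi) = phi_0 + phi_1 * y.

def g(y, phi):
    return phi[0] + phi[1] * y

def g_inverse(x, phi):
    # Solving x = phi_0 + phi_1 * y for y; only works because g is linear
    # and phi_1 != 0.
    return (x - phi[0]) / phi[1]

phi = np.array([0.5, 2.0])
y = np.array([1.0, 2.0, 3.0])
x = g(y, phi)          # "generate" the regressors from the observations
print(g_inverse(x, phi))   # recovers y, so the inverted model can do inference
```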

Once inverted, the model looks like it should. Such a model has an advantage: notice how we started by generating our model from $\boldsymbol y$? This is meaningful in contexts where the value of $\boldsymbol y$ actually tells us something about what kind of data might have generated it; if we have domain knowledge that leads us to believe certain things can be garnered from the outputs, such a model allows us to bake that knowledge into the model. This is quite a significant benefit.

So why have you never heard of the term? Because the above equation assumes that $g$ is invertible. And even when it is invertible, that in no way means the derivation will be trivial. It can be a huge pain to compute, so huge in fact that despite its big advantage, its standard discriminative counterpart is by far the most used. After all, while discriminative models lack that "baked-in knowledge" ability, they more than make up for it with their flexibility and simplicity. By the Universal Approximation Theorem we can already approximate any function we could ever want with a few Rectified Linear Units (ReLUs) anyway.

## References

Understanding Deep Learning