202405220058
Status: #idea
Tags: Machine Learning
Neural Networks
Fundamentally, all a neural network is, is an infinitely flexible class of function that can be anything and everything you want; from the straightest of lines to the squiggliest of squigglies. As Josh Starmer (from StatQuest) calls it, it's not much more than a Big Fancy Squiggle Fitting Machine.
Neural Networks are made of a few components which are all better treated in their own notes, because I foresee that anything more than surface level will evolve into something... complex.
The 4 Pillars of Neural Networks
The fundamental components:
- Nodes (Neurons (Machine Learning)): The "neural" in neural networks. They are not much more than small functions in a network (yes, I said it) of functions. They take in a set of inputs from the previous layer (or from the ether, if we are the input layer), combine those inputs with their respective weights, and pass the result either to the next layer of neurons or back out into the world. A glorified dot product machine (see the first sketch after this list). Note that two neurons on the same Layer never talk to each other.
- Layers (Machine Learning): This is the name given to groups of related but mutually independent neurons. While it is technically possible to approximate any function by doing everything in one layer (something called Shallow Neural Networks), in practice, unless we are in an educational context, we will use anything from a few layers to... arbitrarily big finite numbers! Why? Shallow Neural Networks become too big and cumbersome for anything beyond simple demonstrations and pedagogic material. Adding a neuron is to adding a layer what addition is to multiplication; if it weren't for the ability to stack arbitrarily many layers in a neural network, Machine Learning and Deep Learning would have died in the 20th century and remained the object of phantasms and fiction without much practical use (cue in Alchemy)... I really gotta watch Fullmetal Alchemist...
- Weights (Parameters): In the current paradigm of machine learning, this is what actually changes when you say "I am training a model" or "my model is learning." The ideas are really identical to what is explored in Simple Linear Regression. I say "in the current paradigm" because in the future, trainable Activation Functions might become a thing, which would upend the way we think about training. I have some function (curve) that I am trying to fit; I need to find (estimate) the best parameters such that the model fits the data I have observed, and pray that it generalizes. Well, in general you are a little more active than in prayer and have more bargaining power (Learning Rate, Activation Functions, Dropout, Optimizers, etc.), but yeah, that's the gist. The main difference is that instead of doing fancy calculus and other derivations, you use the good ol' Chain Rule (yes, the one from Calculus) to compute how to improve the parameters (how to reduce the Loss) and let the Machine Learning magic (Backpropagation) occur (a toy version is sketched after this list). This will be treated more rigorously in the actual notes for Weights (Parameters) and Backpropagation.
- Activation Functions: One of the tools you have to bargain with the machine; they give you the ability to give some sort of meaning to a given output (whether it be from a neuron or a specific layer). For example, suppose the outputs of my output layer are supposed to be probabilities, but in their current form they do not sum to 1, have potentially negative values, etc. I could leave them as is (if I am a masochist), or I could pass them through a Softmax and not have to worry about any of that. Activation Functions can be any function you want as long as it makes sense in your case, but common ones are the Logistic Function, which squashes values between 0 and 1 and is often used in binary classification; the Rectified Linear Unit (ReLU) (the GOAT), which is often used on intermediate layers to increase the flexibility of the model; and the Softmax, which is used on multi-class classification problems (more than two labels) to ensure that we end up with probabilities. Softmax is often preferred to other max-like methods because, when we want to use the probabilities as weights, it ensures the lower probabilities can still contribute instead of being squashed to 0 as they would be in a regular max (see the last sketch after this list).
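To make the "glorified dot product machine" point concrete, here is a minimal sketch (assuming numpy is installed; every number here is made up for illustration) of what one neuron, and then a whole layer of them, actually computes:

```python
import numpy as np

def relu(z):
    # ReLU activation: anything negative gets clipped to 0.
    return np.maximum(0.0, z)

# 3 inputs coming from the previous layer (or "the ether" if this is the input layer).
x = np.array([0.5, -1.2, 2.0])

# One neuron = one weight per input plus a bias: a glorified dot product.
w = np.array([0.1, 0.4, -0.3])
b = 0.05
neuron_out = relu(np.dot(w, x) + b)   # a single value passed on to the next layer

# A layer = several such neurons stacked; neurons in the same layer never
# talk to each other, they all just read the same inputs.
W = np.array([[0.1, 0.4, -0.3],       # neuron 1's weights
              [0.7, -0.2, 0.9]])      # neuron 2's weights
B = np.array([0.05, -0.1])
layer_out = relu(W @ x + B)           # shape (2,): one output per neuron

print(neuron_out, layer_out)
```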
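And to make the Chain Rule point concrete, a toy sketch (pure standard-library Python, made-up data) of "training" a single weight by computing the gradient of the loss link by link and taking gradient-descent steps; real Backpropagation is this same idea applied to every weight in the network at once:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 2.0, 1.0   # one observed (input, target) pair (made up)
w = 0.1           # the parameter we are "training"
lr = 0.5          # learning rate: part of our bargaining power

for step in range(20):
    z = w * x
    pred = sigmoid(z)            # prediction = sigmoid(w * x)
    loss = (pred - y) ** 2       # squared-error loss

    # Chain Rule, link by link: dLoss/dw = dLoss/dpred * dpred/dz * dz/dw
    dloss_dpred = 2 * (pred - y)
    dpred_dz = pred * (1 - pred)  # derivative of the sigmoid
    dz_dw = x
    dloss_dw = dloss_dpred * dpred_dz * dz_dw

    w -= lr * dloss_dw            # nudge the weight downhill

print(f"w = {w:.3f}, loss = {loss:.5f}")
```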
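Finally, a tiny sketch (again assuming numpy; the logits are made up) of what Softmax buys you: outputs that are non-negative, sum to 1, and keep even the smallest raw output contributing instead of being zeroed out by a hard max:

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5])   # raw outputs of the output layer

def softmax(z):
    z = z - z.max()                   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax(logits)
print(probs, probs.sum())             # roughly [0.79 0.04 0.18], sums to 1.0
```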
These are the components of neural networks, and I hope you now understand why I decided to keep each of them in its own note.
Why Neural Networks?
Fundamentally, because they are universal approximators (the most important point) and are comparatively straightforward to train.
First, the first point. Yes, it is exactly as cool as it sounds.
It means that it is mathematically provable that a neural network, given enough layers and neurons, can approximate any... well, essentially ANY function you could think of (formally: any continuous function on a compact domain, to arbitrary precision). See Universal Approximation Theorem.
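For reference, here is an informal paraphrase of the classic single-hidden-layer flavour of the result (the exact hypotheses vary between versions of the theorem; this is a from-memory summary, not pulled from a specific note):

$$
\sup_{x \in K}\;\Bigl|\,f(x) - \sum_{i=1}^{N} v_i\,\sigma\!\left(w_i^{\top}x + b_i\right)\Bigr| < \varepsilon
$$

for any continuous $f$ on a compact set $K \subset \mathbb{R}^n$ and any $\varepsilon > 0$, there exist a finite $N$ and parameters $v_i, w_i, b_i$ making the bound hold, provided $\sigma$ is a suitable (non-polynomial) activation.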
The thing is, more than a few classes of functions can do that; the polynomials we all know and love can do it as well. So then why neural networks?
Fundamentally, because they are dead simple and can be used in pretty much any context. The full list of their benefits would be too long, but a few are:
- Can be trained on much higher-dimensional data than other methods (see Convolutional Neural Networks (CNN))
- Non-linearity allows the modelling of complex functions with comparatively fewer parameters
- Features are allowed to "emerge" from the data rather than being explicitly baked in (see Collaborative Filtering)
- Pretrained models and Transfer Learning: as opposed to other types of methods, neural networks excel in their ability to be repurposed and reused by simply fine-tuning them to our data (see the sketch after this list). They allow us to stand on the shoulders of giants, and train record-breaking and even field-revolutionizing models with comparatively small datasets.
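A hedged sketch of what that fine-tuning looks like in practice, assuming a recent torch/torchvision install; the 5-class task and the `train_loader` are hypothetical placeholders, not something from this note:

```python
import torch
import torch.nn as nn
from torchvision import models

# Stand on the shoulders of giants: weights already learned on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained body...
for p in model.parameters():
    p.requires_grad = False

# ...and swap in a fresh head for our own (hypothetical) 5-class task.
model.fc = nn.Linear(model.fc.in_features, 5)

# Only the new head's parameters get updated during fine-tuning.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# train_loader is assumed to yield (images, labels) from our small dataset:
# for images, labels in train_loader:
#     optimizer.zero_grad()
#     loss = loss_fn(model(images), labels)
#     loss.backward()
#     optimizer.step()
```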
No other method boasts the same level of flexibility, generalizability, effectiveness and power. There is a very real issue with the fact that, as things stand, a lot of Machine Learning models, especially on the Deep Learning side of things, are inscrutable black boxes. Still, their other advantages make that significant downside palatable. This is the difference between Inference and Prediction: if what we care about is how specific elements relate to each other, neural networks are terrible; but for pretty much anything else they are a game-changer and about as close to "guaranteed to work" as it gets.
Caution: do not let your hype over Artificial Intelligence and neural networks take you over. While machine learning and neural networks are extremely powerful, high quality mithril hammers, not all the problems in your life require hammering nails into a drago... Pause, I swear as I wrote that I wasn't thinking of anything...
(I am thinking of the dragon innuendo game, I am not that weird.)
Simple and interpretable models like Decision Trees and Random Forests are still relevant and can often offer similar if not better results on the same data as the neural network alternative. Sometimes the data at hand simply does not have enough "hidden features" to make the overhead of a neural network worth it. Furthermore, even if we are adamant about using a neural network, first creating models using simpler methods like Linear Regression, Tree-Based Methods and whatnot has value, even if only as a baseline. You do not want to spend weeks optimizing a model, trying different learning rates and learning schedules, tweaking the number of epochs and the whole shebang, just to realize your model does about the same as (or potentially worse than) a Multiple Linear Regression model you fitted in 5 seconds in R or Python.
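For what it's worth, here is what that "baseline in 5 seconds" looks like, as a sketch assuming numpy and scikit-learn are installed; the data is synthetic, purely to make the snippet self-contained:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Made-up data standing in for whatever dataset we actually have.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
baseline = LinearRegression().fit(X_train, y_train)
print("baseline R^2:", r2_score(y_test, baseline.predict(X_test)))
# Any fancier model now has a concrete number to beat.
```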
References
| File | Folder | Last Modified |
|---|---|---|
| StatQuest ~ Neural Networks ! Deep Learning | 2. White Holes/References | 12:33 PM - December 06, 2025 |
| Practical Deep Learning for Coders | 2. White Holes/References | 12:33 PM - December 06, 2025 |
| Supervised Learning | 1. Cosmos | 12:33 PM - December 06, 2025 |
| Bayes Classifier | 1. Cosmos | 12:33 PM - December 06, 2025 |