202405202143
Status: #idea
Tags: Regression Analysis

The Method Of Least Squares In Simple Linear Regression

Pasted image 20240503141814.png
The above gives the derived formulas for the two estimators. Try to remember the formulas for Y_bar and X_bar and all the other terms.
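For reference, the standard closed-form estimators written in the usual notation (which may differ from the notation in the pasted image):

$$
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,\quad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.
$$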
Pasted image 20240614094953.png
I paste this here because, while showing unbiasedness is almost trivial if you take the general case of Multiple Linear Regression, these formulas make the proof of unbiasedness basically a one-liner.
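A sketch of that one-liner, assuming the pasted formulas are the usual weighted-sum form of the slope estimator:

$$
\hat{\beta}_1 = \sum_{i} k_i Y_i, \qquad k_i = \frac{x_i - \bar{x}}{\sum_{j} (x_j - \bar{x})^2},
$$

and since $\sum_i k_i = 0$ and $\sum_i k_i x_i = 1$, taking expectations under the model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ with $E[\varepsilon_i] = 0$ gives

$$
E[\hat{\beta}_1] = \sum_{i} k_i (\beta_0 + \beta_1 x_i) = \beta_1.
$$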
The idea of the Method of Least Squares is something you should already be keenly aware of, since it's a loss function that is used a lot in Machine Learning. The idea is to take the difference between each observed value and the value the line predicts (the error) and sum these differences together. The issue is that some differences will be positive and others negative, so simply summing them leads to cancellation.

Therefore there are two ways to proceed: either we take the absolute value of the errors, which gives what is called the Mean Absolute Error (L1 Loss), a sensible method used quite often in machine learning models because of its simple interpretation and its resistance to outliers, or we square the errors, which gives the Mean Squared Error (L2 Loss), which is quite sensitive to outliers.
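Written out, for observations $y_i$ and fitted values $\hat{y}_i$, the two losses are:

$$
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert,
\qquad
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2.
$$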

You might be surprised to learn that Mean Squared Error (L2 Loss) is the de facto method we use in Linear Regression problems in Statistics and Probability. Why?

Because, by taking the derivatives of the sum of squared errors with respect to β0, the constant term (here represented as b), and β1, the slope (here represented as m), setting them to zero, and solving, we obtain closed-form expressions for the estimators, which by the Gauss–Markov Theorem are unbiased and have the minimum variance among all linear unbiased estimators.
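A minimal sketch in Python (NumPy only; the toy data and variable names are mine, not from the pasted derivation) that computes the closed-form estimators and checks them against NumPy's own least-squares solver:

```python
import numpy as np

# Toy data: y = 2 + 3x + noise (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=50)

# Closed-form least-squares estimators for simple linear regression.
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope (m)
beta0_hat = y_bar - beta1_hat * x_bar                                     # intercept (b)

# Sanity check against NumPy's least-squares solver on the design matrix [1, x].
X = np.column_stack([np.ones_like(x), x])
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta0_hat, beta1_hat)  # closed-form estimates
print(beta_lstsq)            # should match up to floating-point error
```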

If you remember your calculus, you likely already know why we wouldn't use Mean Absolute Error (L1 Loss) in an algorithm that relies on derivatives: the absolute value function is not differentiable at 0. Try to prove it by checking the one-sided limits of the difference quotient at 0.
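A quick way to see the non-differentiability, via those one-sided limits:

$$
\lim_{h \to 0^{+}} \frac{\lvert h \rvert - \lvert 0 \rvert}{h} = 1,
\qquad
\lim_{h \to 0^{-}} \frac{\lvert h \rvert - \lvert 0 \rvert}{h} = -1,
$$

so the two one-sided limits disagree and the derivative at 0 does not exist.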

If you wonder how Mean Absolute Error (L1 Loss) can be used in Machine Learning even though so many models use Stochastic Gradient Descent (SGD) (a differentiation-based algorithm) as an optimiser, the simple answer is that while Mean Absolute Error (L1 Loss) does not have a gradient at 0, it does have a Subgradient. For a more in-depth explanation, please consult this discussion with the AI: Short ChatGPT Convo on Simple Linear Regression
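For reference, the subdifferential of the absolute value function, which is what subgradient-based optimisers fall back on at the kink:

$$
\partial \lvert x \rvert =
\begin{cases}
\{-1\} & x < 0, \\
[-1, 1] & x = 0, \\
\{+1\} & x > 0,
\end{cases}
$$

so in practice an implementation may return any value in $[-1, 1]$ at $x = 0$ (commonly 0).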