202406061224
Status: #idea
Tags: Regression Analysis
State: #nascent
Analysis of Variance (ANOVA)
What is it?
ANOVA is the method we use to analyze the variance (shocker) of the elements of our model. It is used to check the statistical significance of a model as a whole.
While the name is cryptic, the principle is actually quite intuitive.
We start by considering the total deviation of the observations under consideration from their mean:

$$SSTO = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

This summation has $n - 1$ degrees of freedom, since one degree is spent estimating $\bar{y}$.
Now while a sum of squared deviations from the mean is extremely useful on its own (and is the first step towards variance), for evaluating a model we generally want to break it down further. More specifically, into the sum of squared deviations of the model's fitted values from the sample mean, plus the sum of squared deviations of the observations from the fitted values:

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

that is, $SSTO = SSR + SSE$.
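As a sanity check, this decomposition can be verified numerically. Here is a minimal sketch in pure Python with made-up data (the data and variable names are mine, not from any particular source):

```python
# Sketch: check the identity SSTO = SSR + SSE for a least-squares line
# fit to toy data (assumption: made-up numbers, chosen to be roughly linear).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Ordinary least-squares slope and intercept.
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in xs]

ssto = sum((y - y_bar) ** 2 for y in ys)              # total deviation
ssr = sum((f - y_bar) ** 2 for f in fitted)           # explained by the model
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # left unexplained

assert abs(ssto - (ssr + sse)) < 1e-9  # the decomposition holds
```

The identity holds exactly by algebra (up to floating-point noise), which is why the check passes for any data, not just this toy set.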
This boils down to the share of variability explained by the model (you want this to be high), plus the error part (residuals that are unexplained). The specific degrees of freedom depend on the model under consideration, but after scaling the sums of squares by $\sigma^2$ we get independent Chi-Squared distributions, and dividing each by its degrees of freedom we can take the ratio for comparison.
We can show by Cochran's Theorem that the distributions so obtained are independent and suitable for constructing an F statistic, since they come from the decomposition of a quadratic form and are themselves quadratic forms.
Careful: that F statistic is only useful if, under the null hypothesis, the expected values of the numerator and denominator are the same (so that the comparison is meaningful and we can use a standard Fisher distribution; the non-central Fisher is annoying without R). The numerator and denominator are the respective Mean Sums of Squares, which are analogous to a variance but for the specific thing under consideration; in fact, the sample variance is itself a mean sum of squares.
In practice, we want to see if our model, whatever it may be, explains a significant share of the variability. If the ratio is close to one, then the model explains about as much as is left unexplained. This is true since it can be shown that, under the null hypothesis, both the Mean Sum of Squares of Regression and the Mean Sum of Squares of Residuals have an expected value of $\sigma^2$. In the simple linear regression case:

$$E[MSE] = \sigma^2, \qquad E[MSR] = \sigma^2 + \beta_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2$$

so $E[MSR] = \sigma^2$ exactly when $\beta_1 = 0$.
This is the intuition.
This analysis is done through something called an ANOVA table, which is the following:

| Source of Variation | SS | df | MS | F |
| --- | --- | --- | --- | --- |
| Regression | $SSR$ | $p$ | $MSR = SSR/p$ | $MSR/MSE$ |
| Error | $SSE$ | $n - p - 1$ | $MSE = SSE/(n - p - 1)$ | |
| Total | $SSTO$ | $n - 1$ | | |

(Here $p$ is the number of predictors; for simple linear regression $p = 1$, so the error df is $n - 2$.)
A similar table appears in ONE-Way ANOVA, where the decomposition is into between-group and within-group sums of squares, and we want to check whether the inter-group variance is 0, i.e. whether all the group means are equal.
- SSR is called the Sum of Squares of Regression (it represents the distance between the mean line and the regression line).
- SSE is called the Sum of Squares of Errors, or Sum of Squares of Residuals (it represents the distance between the observations and the regression line).
- SSTO is called the Total Sum of Squares (it represents the distance between the mean line and the observations).
- MS is the Mean Square, which is simply whatever SS is relevant divided by its degrees of freedom.
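Putting the pieces together, here is a sketch that fills in such a table for a simple linear regression fit (pure Python; the data are toy numbers of my own, and $p = 1$):

```python
# Sketch: compute the ANOVA table entries (SS, df, MS, F) for a
# simple linear regression on made-up data (p = 1 predictor).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in xs]

ssr = sum((f - y_bar) ** 2 for f in fitted)
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
ssto = ssr + sse

msr = ssr / 1          # df = p = 1
mse = sse / (n - 2)    # df = n - p - 1 = n - 2
f_stat = msr / mse

print(f"Regression: SS={ssr:.3f}  df=1  MS={msr:.3f}  F={f_stat:.1f}")
print(f"Error:      SS={sse:.3f}  df={n - 2}  MS={mse:.3f}")
print(f"Total:      SS={ssto:.3f}  df={n - 1}")
```

Note that only the Regression row gets an F value: the table is built so the last column is the single ratio MSR/MSE.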
Observe that, since the SS are computed from normally distributed quantities, the following holds under the null hypothesis:

$$\frac{SSR}{\sigma^2} \sim \chi^2_{p}, \qquad \frac{SSE}{\sigma^2} \sim \chi^2_{n-p-1}$$

with the two independent. It thus follows that

$$F = \frac{MSR}{MSE} = \frac{SSR/p}{SSE/(n-p-1)} \sim F_{p,\, n-p-1}$$

since the ratio of two independent chi-squared variables, each divided by its respective degrees of freedom, is a Fisher distribution (the $\sigma^2$ cancels).
Inference
This is done to check whether or not our model is significant, in other words to check if all the coefficients are 0:

$$H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0 \quad \text{vs.} \quad H_1: \beta_j \neq 0 \text{ for at least one } j$$
In the Simple Linear Regression case, this is perfectly equivalent to a Test on an Individual Regression Coefficient, since there is only one coefficient to test (and indeed $F = t^2$ there).
Since we have a Fisher distribution, which is non-symmetric, the test is one-sided: we reject $H_0$ when $F > F_{\alpha;\, p,\, n-p-1}$, i.e. when the statistic falls in the upper tail.
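As an illustration of the rejection rule, a sketch with toy data; the critical value $F_{0.05;\,1,\,4} \approx 7.71$ (which equals $t_{0.025;\,4}^2$) is a tabulated constant I hardcode here rather than pull from a library:

```python
# Sketch: one-sided F test for overall model significance in simple
# linear regression (toy data; critical value hardcoded from an F table).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in xs]

ssr = sum((f - y_bar) ** 2 for f in fitted)
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
f_stat = (ssr / 1) / (sse / (n - 2))  # MSR / MSE, df = (1, n - 2)

F_CRIT = 7.71  # tabulated F_{0.05; 1, 4}; upper tail only, test is one-sided
reject = f_stat > F_CRIT
print("reject H0" if reject else "fail to reject H0")
```

With these strongly linear toy data the F statistic is huge, so the null of no relationship is rejected; the point of the sketch is the one-sided comparison against the upper-tail critical value.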
Interval on Mean Response vs Interval on New Prediction
The former is typically referred to as a Confidence Interval and the latter as a Prediction Interval.
While they are extremely similar, the nuance is really important. When doing an Interval on the Mean Response, we are using the model to predict what the average response would be at a given point $x_0$. In other words, the variance of $\hat{y}_0$ only reflects the uncertainty in the estimated line:

$$\operatorname{Var}(\hat{y}_0) = \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right)$$

whereas a new prediction must also account for the variability of an individual observation around the line, adding a full $\sigma^2$:

$$\operatorname{Var}(y_0 - \hat{y}_0) = \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right)$$
For that reason, the confidence interval is always narrower than or equal to the prediction interval. The two would only be equal if the variation around the mean were null, meaning we are dealing with a deterministic mathematical function; in other words, this will never occur in natural contexts.
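A quick numeric sketch of the two variances at a point $x_0$ (toy data of my own, with $\hat{\sigma}^2 = MSE$ plugged in for $\sigma^2$); the prediction variance exceeds the mean-response variance by exactly one $\hat{\sigma}^2$:

```python
import math

# Sketch: variance of the estimated mean response vs. variance of a
# new prediction at x0, for a simple linear regression fit (toy data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)

x0 = 4.5
leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
var_mean = mse * leverage        # confidence interval uses this
var_pred = mse * (1 + leverage)  # prediction interval adds a full sigma^2

se_mean, se_pred = math.sqrt(var_mean), math.sqrt(var_pred)
assert se_mean < se_pred  # the CI is always narrower than the PI
```

Multiplying each standard error by the appropriate $t$ quantile would give the actual interval half-widths; the ordering is already decided by the extra $\hat{\sigma}^2$ term.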