202406061224
Status: #idea
Tags: Regression Analysis
State: #nascent

Analysis of Variance (ANOVA)

What is it?

ANOVA is the method we use to analyze the variance (shockers) of the elements of our model. It is used to check the statistical significance of a model as a whole.

While the name is cryptic the principle is actually quite intuitive.

We start by considering the total deviation of the observations under consideration from their mean:

$$\sum_{i=1}^{N} x_i^2 - N\bar{x}^2$$

This summation has $N-1$ degrees of freedom, as can be observed from the fact that we have $N$ observations but subtract a fixed $N\bar{x}^2$ from the sum: while $N-1$ of the $x_i$ could land on any value, the last one is determined, as it MUST be whatever value yields the obtained $\bar{x}$ given the rest of the observations.
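A quick numeric sketch of the identity above, on made-up data (the identity $\sum x_i^2 - N\bar{x}^2 = \sum (x_i - \bar{x})^2$ is what makes the shortcut form equal the definitional sum of squared deviations):

```python
import numpy as np

# Made-up sample; the specific distribution and seed are arbitrary
rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=50)

N = len(x)
xbar = x.mean()

ssto_shortcut = np.sum(x**2) - N * xbar**2   # computational form from the note
ssto_direct = np.sum((x - xbar)**2)          # definitional form

assert np.isclose(ssto_shortcut, ssto_direct)
```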

Now, while a sum of squared deviations from the mean is extremely useful on its own (and is the first step towards variance), for evaluating a model we generally want to break it down further. More specifically, break it down into the sum of squared deviations of the model's fitted values from the sample mean, $\sum_{i=1}^{N} \hat{x}_i^2 - N\bar{x}^2$, and then the sum of squared deviations of the model predictions from the actual values, $\sum_{i=1}^{N} (\hat{x}_i - x_i)^2$.

This boils down to the share of variance explained by the model (you want this to be high); the rest is the error part (residuals that are unexplained). The specific degrees of freedom depend on the model under consideration, but dividing each sum of squares by its degrees of freedom, we get independent Chi-Squared Distribution variables, and we can then take their ratio for comparison.
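The decomposition can be verified numerically on a toy simple linear regression (data and coefficients below are made up for illustration): the total sum of squares splits exactly into the explained and residual parts.

```python
import numpy as np

# Made-up data from a hypothetical linear relationship plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=40)
y = 3.0 + 0.8 * x + rng.normal(scale=1.5, size=40)

# Least-squares fit (polyfit returns slope first, then intercept)
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x
y_bar = y.mean()

ssto = np.sum((y - y_bar)**2)     # total sum of squares
ssr = np.sum((y_hat - y_bar)**2)  # explained by the model
sse = np.sum((y - y_hat)**2)      # residual / unexplained

assert np.isclose(ssto, ssr + sse)
```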

We can show by Cochran's Theorem that the distributions so obtained are independent and suitable for constructing an F statistic, since they are the decomposition of a quadratic form and are themselves quadratic forms.

Careful: that F statistic is only useful if, under the null hypothesis, the expected values of the numerator and denominator are the same (so that the comparison is meaningful and we can use a standard Fisher Distribution; the non-central Fisher is annoying without R). The numerator and denominator are the respective Mean Sums of Squares and are analogous to the variance but for the specific thing under consideration; in fact, the variance is itself a mean sum of squares.

In practice, we want to see whether our model, whatever it may be, explains a significant share of the variability. If the ratio is close to one, the model explains about as much as is left unexplained. This holds since it can be shown that both the Mean Sum of Squares of Regression and the Mean Sum of Squares of Residuals have an expected value of $\sigma^2$ under the null hypothesis that the model is non-predictive. Therefore, if the ratio is significantly less than 1, the model is very bad, since it fails to account for most of the variance observed; if the ratio is significantly bigger than 1, the share of variance explained by the model is significant and we can be confident it captures something real.

This is the intuition.

This analysis is done through something called an ANOVA table, which is the following:
![[Pasted image 20260214123031.png]]
This is for One-Way ANOVA, where we assume the intra-group variance is the same across groups, and we want to check whether the inter-group variance is 0 (i.e., whether all group means are equal).
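A minimal one-way ANOVA sketch using scipy's `f_oneway` (the three groups and their means are made up; the last group's mean is shifted so the test should detect a difference):

```python
import numpy as np
from scipy import stats

# Hypothetical three groups with equal within-group variance;
# H0: all group means are equal
rng = np.random.default_rng(2)
g1 = rng.normal(5.0, 1.0, size=20)
g2 = rng.normal(5.5, 1.0, size=20)
g3 = rng.normal(7.0, 1.0, size=20)

f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```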

SSR is called the Sum of Squares of Regression (represents the distance between the mean line and the regression line)
SSE is called the Sum of Squares of Errors or Sum of Squares of Residuals (represents the distance between the observations and the regression line)
SSTO is called the Total Sum of Squares (represents the distance between the mean line and the observations)
MS is the Mean Square, which is simply whatever SS is relevant divided by its degrees of freedom.

Observe that since the SS are computed from normally distributed quantities, the following holds under $H_0$ (simple linear regression case):

$$\frac{SSR}{\sigma^2} \sim \chi^2_{1}, \qquad \frac{SSE}{\sigma^2} \sim \chi^2_{n-2}$$

It thus follows that

$$F = \frac{MSR}{MSE} \sim F_{1,\, n-2}$$

since the ratio of two independent chi-squared variables, each divided by its respective degrees of freedom, is a Fisher Distribution.

Inference

This is done to check whether or not our model is significant; in other words, to check if all the coefficients are 0 ($H_0$) or if at least one coefficient is non-zero ($H_1$).

In the Simple Linear Regression case, this is perfectly equivalent to a Test On Individual Regression Coefficient (Assuming We know our Model Is Significant).

Since we have a Fisher, and the Fisher is non-symmetric, we reject $H_0$ if the F statistic is bigger than the critical value $F_{\alpha}$ (the upper $\alpha$ quantile), not than the $\alpha$ level itself. The bigger the F statistic is, the smaller MSE (the error of our model) is compared to MSR (the deviation of the model from the mean of Y); in other words, the model is useful to predict values. The smaller it is, the bigger the error, and/or the smaller the deviation of the model from the mean, in such a way that the coefficients are not much more useful than the mere mean line.
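The rejection rule in code, with illustrative numbers (the observed F statistic below is made up; the critical value is the genuine upper $\alpha$ quantile of $F_{1, 28}$):

```python
from scipy import stats

# Reject H0 when F exceeds the upper-alpha critical value of the
# Fisher distribution -- not when it exceeds alpha itself
alpha = 0.05
df_reg, df_err = 1, 28

f_crit = stats.f.ppf(1 - alpha, df_reg, df_err)  # upper alpha quantile
f_stat = 9.4                                     # hypothetical observed statistic

reject = f_stat > f_crit
print(f"critical value = {f_crit:.2f}, reject H0: {bool(reject)}")
```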

Interval on Mean Response vs Interval on New Prediction

The former is typically referred to as a Confidence Interval and the latter as a Prediction Interval.

While they are extremely similar, the nuance is really important. When doing an Interval on the Mean Response, we are using the model to predict what the average response would be at a given point. In other words, the variance of Y at that point is inconsequential, since the mean is fixed; this leads to smaller error bounds. On the other hand, for a prediction interval, while the point estimate is the same (since it uses the same estimator), the error becomes bigger because we HAVE to account for variation about the mean, and therefore have to make the interval bigger to compensate for that variation.

For that reason, the confidence interval is always narrower than or equal to the prediction interval. The two will only be equal if the variation around the mean is null, and we are therefore dealing with a deterministic mathematical function. In other words, this will never occur in natural contexts.
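A sketch of both intervals at a hypothetical point $x_0$, using the textbook simple-linear-regression formulas (the data are made up): the only difference is the extra "+1" under the square root for the prediction interval, which accounts for the variance of a new observation.

```python
import numpy as np
from scipy import stats

# Made-up simple linear regression data
rng = np.random.default_rng(4)
n = 25
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
s = np.sqrt(np.sum((y - y_hat)**2) / (n - 2))  # residual standard error
sxx = np.sum((x - x.mean())**2)

x0 = 5.0                                       # hypothetical new point
t = stats.t.ppf(0.975, n - 2)                  # 95% two-sided t quantile
leverage = 1 / n + (x0 - x.mean())**2 / sxx

half_ci = t * s * np.sqrt(leverage)      # mean response: no extra "+1"
half_pi = t * s * np.sqrt(1 + leverage)  # new observation: adds 1 for Var(Y)

assert half_ci < half_pi  # the PI is always at least as wide
```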