202405171706
Status: #idea
Tags: Regression Analysis

Model Adequacy Checking

In statistics we always deal with models.

And each model has a set of assumptions tied to its usage. Model Adequacy Checking is the science of ensuring that a given dataset fulfills all the assumptions that are required to use the model in question.

For linear regression models, we make the following assumptions: the relationship between the predictors and the response is linear, and the error terms are independent, normally distributed, with mean 0 and constant variance.

As a general rule, besides linearity we are focused on the error terms. The issue is that the raw residuals can be all over the place even when the assumptions hold: while they all have mean 0 (by assumption), there is no specific bound on their variance. Therefore we need to scale them so that they are all in the same ballpark.

After all, in a high-variance context, errors being far apart means less than the same spread would in a low-variance context. So, to uniformize the errors and make the residuals comparable, we will "standardize" or "studentize" them to put them all on the same scale, similar to how we normalize gradients in calculus 3 to keep only the direction.

Uniformizing the ε terms:

Standardized Residuals
For standardized residuals, we divide each error term by its own standard deviation.

$$\varepsilon_i^{std} = \frac{\varepsilon_i}{\sqrt{\operatorname{Var}(\varepsilon_i)}} = \frac{\varepsilon_i}{\sqrt{MSE\,(1-h_{ii})}}$$
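As a quick illustration, here is a minimal NumPy sketch (not from the original note; the toy data and variable names are my own) that computes these standardized residuals from the hat matrix:

```python
import numpy as np

# Toy data where the linear-model assumptions hold (hypothetical example)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 1, 50)

X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept
beta = np.linalg.lstsq(X, y, rcond=None)[0]    # OLS coefficients
resid = y - X @ beta                           # raw residuals

H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
h = np.diag(H)                                 # leverages h_ii
mse = resid @ resid / (len(y) - X.shape[1])    # MSE = SSE / (n - p)
standardized = resid / np.sqrt(mse * (1 - h))  # each e_i divided by its own SD
```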

Semi-Standardized Residuals ~ to check for non-constant variance

$$\varepsilon_i^{semi} = \frac{\varepsilon_i}{\sqrt{1-h_{ii}}}$$

We will use this by plotting it against $\hat{y}$ and checking that the spread stays constant around the horizontal 0 line.
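A sketch of that plot, assuming statsmodels is available (the data is again made up); `get_influence()` exposes the leverages $h_{ii}$ directly:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Toy data (hypothetical); in practice, use your own design matrix
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 1, 50)

model = sm.OLS(y, sm.add_constant(x)).fit()
h = model.get_influence().hat_matrix_diag    # leverages h_ii
semi = model.resid / np.sqrt(1 - h)          # semi-standardized residuals

plt.scatter(model.fittedvalues, semi)
plt.axhline(0, color="gray", linestyle="--")  # spread should stay constant around 0
plt.xlabel(r"$\hat{y}$")
plt.ylabel("semi-standardized residual")
plt.show()
```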
Studentized Residuals:

(embedded figure lost: semi-standardized residuals plot)

Checking for Constant Variance

You can also use a scale-location plot, which plots the square root of the absolute value of $\varepsilon$ against $\hat{y}$.
(figures: semi-standardized residuals vs. $\hat{y}$; scale-location plot)
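A hedged sketch of the scale-location plot on a toy fit, using the internally studentized residuals that statsmodels provides; a roughly flat trend suggests constant variance:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Toy fit (hypothetical data)
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 80)
y = 1 + 2 * x + rng.normal(0, 1, 80)
model = sm.OLS(y, sm.add_constant(x)).fit()

# Internally studentized residuals: e_i / sqrt(MSE * (1 - h_ii))
standardized = model.get_influence().resid_studentized_internal

plt.scatter(model.fittedvalues, np.sqrt(np.abs(standardized)))
plt.xlabel(r"$\hat{y}$")
plt.ylabel("sqrt(|standardized residual|)")
plt.title("Scale-location plot")  # flat trend ~ constant variance
plt.show()
```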

Checking for Normality

QQ-Plot
Boxplot of the Errors (around median 0)
Histogram
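A sketch of all three checks side by side on the residuals of a toy fit (the data and figure layout are illustrative, not from the note):

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm

# Residuals from a toy fit (hypothetical data)
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
y = 1 + 2 * x + rng.normal(0, 1, 100)
resid = sm.OLS(y, sm.add_constant(x)).fit().resid

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
stats.probplot(resid, dist="norm", plot=axes[0])  # QQ-plot: points should hug the line
axes[1].boxplot(resid)                            # boxplot: symmetric around median 0
axes[2].hist(resid, bins=15)                      # histogram: roughly bell-shaped
plt.tight_layout()
plt.show()
```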

$R^2$ and Adjusted $R^2$

The coefficient of determination, $R^2$, is given by the ratio:

$$R^2 = \frac{SS_{reg}}{SS_{total}} = 1 - \frac{SS_{e}}{SS_{total}}$$

It is a crucial number that gives us the proportion of the total variation in $y$ that is explained by the regression line.
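A small sketch, on made-up data, verifying that the two expressions agree for an OLS fit with an intercept:

```python
import numpy as np

# Toy data (hypothetical)
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 60)
y = 3 + 1.5 * x + rng.normal(0, 2, 60)

X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta

ss_total = np.sum((y - y.mean()) ** 2)    # total variation in y
ss_reg = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the fit
ss_e = np.sum((y - y_hat) ** 2)           # leftover (residual) variation

print(ss_reg / ss_total)    # R^2, first form
print(1 - ss_e / ss_total)  # R^2, second form -- identical for OLS with intercept
```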

Why do we need an adjusted version, though? Simply because, if I wanted to, I could artificially increase $R^2$ by randomly adding new parameters. Indeed, since the new model contains all the old parameters as well, it can never fit worse than the model with fewer parameters, which is problematic considering a parameter could appear to have some predictive power out of sheer luck.

Therefore, to make sure I only add parameters when they give a substantial advantage, I will "dock" points from my $R^2$ the more parameters there are, to compensate for the spurious increase in performance a random parameter could give us.

Therefore, if the adjusted $R^2$ is still high, I can conclude that the additions were indeed worthwhile.
It is calculated as follows:

$$R^2_{adj} = 1 - \left[(1-R^2)\,\frac{n-1}{n-2}\right]$$
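The factor $\frac{n-1}{n-2}$ is the simple-regression case (two parameters, intercept included); in general the denominator is $n-p$, where $p$ is the number of parameters. A small sketch of the computation (the function name is mine, not from any library):

```python
def adjusted_r2(r2: float, n: int, p: int = 2) -> float:
    """Adjusted R^2 with p parameters (p = 2 gives the note's (n-1)/(n-2))."""
    return 1 - (1 - r2) * (n - 1) / (n - p)

print(adjusted_r2(0.90, n=30))        # ~0.8964: small penalty for 2 parameters
print(adjusted_r2(0.90, n=30, p=6))   # ~0.8792: more parameters dock more points
```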

In practice, we will always use the adjusted version over $R^2$, especially when comparing models with different numbers of variables.

While $R^2$ is useful, one must keep in mind that it is by no means perfect. It is possible to get an $R^2$ that is quite high despite there being no linear relation between the predictors and the response variable. Whether that is acceptable should be considered case by case; while a high $R^2$ typically correlates with linearity, it by no means guarantees it.
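A tiny sketch of that caveat: fitting a straight line to purely quadratic, noise-free data still yields $R^2 \approx 0.96$.

```python
import numpy as np

x = np.linspace(1, 10, 50)
y = x ** 2                                   # y depends on x quadratically, not linearly

X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]  # best straight-line fit
y_hat = X @ beta

r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2)   # ~0.96 -- high R^2, yet the true relationship is not linear
```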