202405171706
Status: #idea
Tags: Regression Analysis
Model Adequacy Checking
In statistics we always deal with models.
And each model has a set of assumptions tied to its usage. Model Adequacy Checking is the science of ensuring that a given dataset fulfills all the assumptions that are required to use the model in question.
For Linear Regression Models we make the following assumptions (summarized as one model equation after this list):
- Linearity (there is a linear (in the parameters $\beta$) relation between the predictor variables and the response variable)
- Normality (the error terms are normally distributed)
- Homoscedasticity (the error terms all share the same variance, i.e. the spread of the errors does not change as $x$ varies)
- Independence of errors (the error terms are independent of one another)
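Written compactly, and assuming the usual multiple-regression setup with $k$ predictors (a sketch in my own notation, not necessarily the course's), these four assumptions amount to the familiar specification:
$$
y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \epsilon_i,
\qquad \epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2).
$$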
As a general rule, apart from linearity, we focus on the error terms. The issue is that the raw error terms can be all over the place even when the assumptions hold: while they all have mean $0$, their spread depends on the scale of the problem. In a context with a really high variance, errors being far apart means less than a similar spread in a low-variance context. Therefore, to uniformize the errors and make the residuals comparable, similar to how we normalize gradients in Calculus 3 to keep only the direction, we will "standardize" or "studentize" the errors to put them all on the same scale.
Uniformizing the terms:
Standardized Residuals
For standardized residuals, we divide each error term by its own standard deviation.
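As a sketch of what that can look like under one common convention (textbooks differ here, so treat the exact formula as an assumption), the estimated standard deviation of the $i$-th residual involves the leverage $h_{ii}$ from the hat matrix:
$$
r_i = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}, \qquad MSE = \frac{\sum_i e_i^2}{n - p}.
$$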
Semi-Standardized Residuals ~ To Check Non-Constant Variance
We will use this by plotting it against $\hat{y}$, the fitted values.
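A minimal sketch in Python with statsmodels, assuming "semi-standardized" means dividing each raw residual by $\sqrt{MSE}$ (the data and variable names are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Made-up data: two predictors and a response generated from a linear model.
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.8, size=100)

fit = sm.OLS(y, X).fit()
semi_standardized = fit.resid / np.sqrt(fit.mse_resid)  # e_i / sqrt(MSE)

# Plot against the fitted values: a flat, structureless band around 0 supports
# constant variance; a funnel shape suggests heteroscedasticity.
plt.scatter(fit.fittedvalues, semi_standardized)
plt.axhline(0, linestyle="--")
plt.xlabel("fitted values")
plt.ylabel("semi-standardized residuals")
plt.show()
```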
Studentized Residuals
These refine the semi-standardized residuals by estimating each residual's standard deviation individually, taking the point's leverage into account, instead of using one common scale for all of them.
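statsmodels exposes the two usual variants directly; a sketch, where whether the note's "studentized" means the internal or the deleted (external) version is my assumption:

```python
import numpy as np
import statsmodels.api as sm

# Made-up data, as before.
rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(50, 2)))
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=50)

influence = sm.OLS(y, X).fit().get_influence()
internal = influence.resid_studentized_internal  # e_i / sqrt(MSE * (1 - h_ii))
external = influence.resid_studentized_external  # same idea, but MSE is recomputed without point i
```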
Checking for Constant Variance
You can also use a scale-location plot, which plots the square root of the absolute value of the standardized residuals against the fitted values $\hat{y}$.
(Plots: semi-standardized residuals vs. $\hat{y}$; scale-location plot)
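A sketch of how such a plot might be produced (same made-up data pattern as above; using the internally studentized residuals here is an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.8, size=100)

fit = sm.OLS(y, X).fit()
standardized = fit.get_influence().resid_studentized_internal

# Scale-location plot: sqrt(|standardized residuals|) vs fitted values.
# A roughly flat trend supports constant variance; an upward trend suggests
# the error spread grows with the fitted value.
plt.scatter(fit.fittedvalues, np.sqrt(np.abs(standardized)))
plt.xlabel("fitted values")
plt.ylabel("sqrt(|standardized residuals|)")
plt.show()
```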
Checking for Normality
- QQ-Plot
- Boxplot of the errors (centered around median 0)
- Histogram
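A sketch of all three checks on the residuals (made-up data again; `sm.qqplot` draws the QQ plot against a normal reference):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.8, size=100)
resid = sm.OLS(y, X).fit().resid

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
sm.qqplot(resid, line="45", fit=True, ax=axes[0])  # points near the line => roughly normal
axes[1].boxplot(resid)                             # roughly symmetric, median near 0
axes[2].hist(resid, bins=20)                       # roughly bell-shaped
plt.show()
```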
$R^2$ and Adjusted $R^2$
The coefficient of determination, called $R^2$, is a crucial number that gives us the proportion of the variation in the response that is explained by the regression line.
Why do we need an adjusted version though? Simply because if I wanted to, I could artificially increase $R^2$ just by throwing more predictors into the model: $R^2$ never decreases when another variable is added, even a useless one.
Therefore, to make sure I only add parameters when they give a substantial advantage, I will "dock" points from my $R^2$ for every extra parameter.
So if the adjusted $R^2$ goes up when I add a variable, the variable is pulling its weight; if it goes down, the extra parameter was not worth it.
It is calculated as follows:
$$
R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p},
$$
where $n$ is the number of observations and $p$ is the number of estimated parameters (including the intercept).
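A quick sketch of the effect (made-up data; `junk` is a predictor deliberately unrelated to the response):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 60
x = rng.normal(size=n)
junk = rng.normal(size=n)               # unrelated to y on purpose
y = 1.0 + 2.0 * x + rng.normal(size=n)

small = sm.OLS(y, sm.add_constant(x)).fit()
big = sm.OLS(y, sm.add_constant(np.column_stack([x, junk]))).fit()

# R^2 cannot go down when a predictor is added, but adjusted R^2 can.
print(small.rsquared, small.rsquared_adj)
print(big.rsquared, big.rsquared_adj)
```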
In practice, we will always use that over the plain $R^2$ when comparing models with different numbers of predictors.
While