There are four assumptions behind linear regression, three of which are relatively straightforward:
Linearity – the data should show a linear trend.
Independent observations – no two points are dependent on each other.
Constant Variability – variation of the points around the least squares line remains roughly constant.
Normality – the residuals should be nearly normal.
Scatter plots of (standardized) residuals (y-axis) on fitted values (x-axis) help you check these assumptions.
Good.
What do we like about the above plot?
Imagine we fit linear regression to non-linear data.
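A small simulated example (not from the text) shows what this looks like; the data-generating process here is a hypothetical quadratic curve, chosen only for illustration:

```r
# Hypothetical simulation: fit a straight line to clearly curved data
set.seed(1)                       # for reproducibility
x <- runif(100, 0, 10)
y <- (x - 5)^2 + rnorm(100)       # quadratic trend plus noise
fit <- lm(y ~ x)                  # a deliberately misspecified linear fit

# Residuals vs. fitted values: a U-shaped pattern reveals the curvature
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)            # reference line at zero
```

The residual plot, not the raw scatter plot, is where the misfit is easiest to see.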
Bad. The residuals are not randomly scattered about, but instead have a clear pattern.
Good.
Residuals show no clear pattern as a function of \(\hat{y}\).
What do we like about the above plot?
Linearity? Yes, because there is no consistent trend in the residuals.
Constant Variability? Yes, because there is no (horizontal) megaphone/cone (nor alligator/pac-man mouth) shape.
Imagine we fit linear regression to data with non-constant variability.
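Again, a hypothetical simulation (not from the text) makes the problem visible; here the noise is built to grow with \(x\):

```r
# Hypothetical simulation: noise whose spread grows with x
set.seed(2)
x <- runif(100, 0, 10)
y <- 2 * x + rnorm(100, sd = 0.5 * x)   # sd increases with x
fit <- lm(y ~ x)

# Residuals vs. fitted values fan out: the "megaphone" shape
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```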
Bad. The residuals have non-constant variation along the fitted values.
Good.
Plots of the residuals on the fitted values are easy to make in R.
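A sketch in base R, assuming `fit` is a model object returned by `lm()`:

```r
# Assuming `fit` was returned by lm()
plot(fitted(fit), rstandard(fit),
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)   # reference line at zero
```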
To check the normality of the residuals, make a histogram of the residuals. Ask, do they seem normal-ish?
Histograms of standardized residuals are just as easy.
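Again assuming a fitted model object `fit` from `lm()`:

```r
# Assuming `fit` was returned by lm()
hist(rstandard(fit),
     xlab = "Standardized residuals",
     main = "Histogram of standardized residuals")
```

If the assumptions hold, this histogram should look roughly like a \(N(0,1)\) bell curve.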
To check independence, you could plot the residuals in the order the observations were recorded, though that ordering is not always available.
Often you simply need to think carefully about how the data were collected.
We could define outliers as observations that have large residuals. Naturally then, the next question is, "How large is large?" We use the normality assumption to help answer this question.
Let’s standardize the residuals to the standard normal distribution, \(N(0,1)\). Since the mean of the residuals will always be equal to zero, we simply divide by the appropriate standard deviation:
\[r_i = \frac{e_i}{\sigma_{e_i}}\]
The R function rstandard will do this for you.
Any observation more than three standard deviations away from the mean could be considered an outlier. Standardized residuals that large are easy to spot; the harder part is identifying which observations those large residuals correspond to.
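One way to locate those observations in base R, assuming a fitted model `fit`; the cutoff of 3 follows the rule of thumb above:

```r
r <- rstandard(fit)   # standardized residuals
which(abs(r) > 3)     # row indices of potential outliers
```

`which()` returns the indices (and row names, if any) of the flagged observations, so you can inspect those rows of the data frame directly.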
Outliers in linear regression are tough. Sometimes they heavily influence your least squares line. The general recommendation is to investigate each outlier rather than automatically dropping it, and to see how much the fit changes with and without it.
Let’s return to our data frame. First use linear regression to predict the response from the explanatory variable, then calculate the quantities we need.
Residuals on fitted values
Histogram of residuals
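The whole check fits in a few lines. A sketch using a hypothetical data frame `dat` with response `y` and predictor `x` (stand-ins for the actual names, which are omitted above):

```r
# Hypothetical names: dat, y, x
fit <- lm(y ~ x, data = dat)   # fit the regression
r   <- rstandard(fit)          # standardized residuals

# Residuals on fitted values
plot(fitted(fit), r,
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)

# Histogram of residuals
hist(r, xlab = "Standardized residuals", main = "")
```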
Checking model assumptions is not an easy skill, but these plots aren’t bad. Plots, yet again, help.