Linear Regression Assumptions

Edward A. Roualdes

Linear Regression Assumptions

There are four assumptions underlying linear regression, three of which are relatively straightforward.

  • Linearity – the data should show a linear trend.

  • Independent observations – no two points are dependent on each other.

  • Constant Variability – the variation of points around the least squares line remains roughly constant.

  • Normality – the residuals should be nearly normal.

Checking Assumptions, linearity and constant variation

Scatter plots of (standardized) residuals (y-axis) on fitted values (x-axis) help you check the assumptions

  • linearity
  • constant variation

Linearity

Good.

Linearity

What do we like about the above plot?

  • Linearity? Yes, because there is no consistent pattern in the residuals.

Linearity

Imagine we fit linear regression to non-linear data.
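A hedged sketch of what this could look like (the simulated quadratic data below are my own illustration, not from the slides):

```r
## Simulate curved (quadratic) data, fit a straight line anyway, and plot
## the standardized residuals on the fitted values; the curved pattern
## gives the misspecification away.
library(ggplot2)

set.seed(42)
x <- runif(125, 0, 10)
y <- x^2 + rnorm(125, sd = 5)    # quadratic trend, not linear
fit <- lm(y ~ x)                 # straight-line fit to curved data

dfr <- data.frame(yhat = fitted(fit), r = rstandard(fit))
ggplot(dfr, aes(yhat, r)) +
    geom_point() +
    geom_hline(aes(yintercept = 0))
```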

Linearity

Bad. The residuals are not randomly scattered about, but instead have a clear pattern.

Linearity

Good.

Residuals show no clear pattern as a function of \(\hat{y}\).

Constant Variation

What do we like about the above plot?

  • Linearity? Yes, because there is no consistent trend in the residuals.

  • Constant Variability? Yes, because there is no (horizontal) megaphone/cone shape (nor an alligator/pacman mouth).

Constant Variation

Imagine we fit linear regression to data with non-constant variability.

Code
## lmData is a course helper (not in base R) that simulates (x, y) data;
## here the noise standard deviation grows from 1 to 10 across observations
vardata <- lmData(125, sd=seq(1, 10, length.out=125))
qplot(x, y, data=vardata) + stat_smooth(method="lm", se=FALSE)

Constant Variation

Bad. The residuals have non-constant variation along the fitted values.

Constant Variation

Good.

  • residuals show no clear pattern as a function of \(\hat{y}\), and
  • the residuals have roughly constant (vertical) spread across the fitted values.

Residuals on Fitted Values

To make plots of the residuals on the fitted values,

Code
## fake code
fit <- lm(y~x, data=data) # first fit a model
dfr <- data.frame(
    yhat = fitted(fit), # fitted values
    r = rstandard(fit)  # standardized residuals
)
ggplot(dfr, aes(yhat, r)) +
    geom_point() +
    geom_hline(aes(yintercept=0))

Normality

To check the normality of the residuals, make a histogram of the residuals. Ask, do they seem normal-ish?

Histogram of Residuals

To make histograms of standardized residuals,

Code
## fake code
fit <- lm(y~x, data=data)
dfr <- data.frame(r = rstandard(fit))
ggplot(dfr, aes(r)) + geom_histogram()

Independence

If the information is available, you could plot the residuals in the order they were recorded, though this information is not always available.

Oftentimes you simply need to think carefully about how the data were collected.
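One hedged way to check this graphically (the simulated data and the assumption that row order matches collection order are mine, not from the slides):

```r
## If row order reflects collection order, plot standardized residuals
## against that order and look for drift, trends, or long runs of
## same-signed residuals.
library(ggplot2)

set.seed(1)
dat <- data.frame(x = rnorm(50), y = rnorm(50))
fit <- lm(y ~ x, data = dat)

dfr <- data.frame(order = seq_len(nrow(dat)), r = rstandard(fit))
ggplot(dfr, aes(order, r)) +
    geom_point() +
    geom_hline(aes(yintercept = 0))
```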

Potential Outliers

We could define outliers as observations that have large residuals. Naturally then, the next question is, "How large is large?" We use the normality assumption to help answer this question.

Potential Outliers

Let’s standardize the residuals to the standard normal distribution, \(N(0,1)\). Since the mean of the residuals will always be equal to zero, we simply divide by the appropriate standard deviation

\[r_i = \frac{e_i}{\sigma_{e_i}}\]

The R function rstandard will do this for you.
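To see what rstandard is doing under the hood, here is an illustrative check (the simulated data are my own; the formula involves the leverages \(h_i\), which these slides do not otherwise introduce):

```r
## Sanity check: rstandard() divides each residual e_i by its estimated
## standard deviation, sigma_hat * sqrt(1 - h_i), where h_i is the
## leverage of observation i.
set.seed(3)
x <- rnorm(30)
y <- 2 * x + rnorm(30)
fit <- lm(y ~ x)

e <- resid(fit)
h <- hatvalues(fit)          # leverages h_i
s <- summary(fit)$sigma      # residual standard error, sigma_hat
r_manual <- e / (s * sqrt(1 - h))

all.equal(r_manual, rstandard(fit))    # TRUE
```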

Potential Outliers

Any observation more than three standard deviations away from the mean could be considered an outlier. It isn't difficult to find such standardized residuals, and the named indices returned below identify the observations these large residuals correspond to.

Code
## fake code
r <- rstandard(fit) # standardized residuals
## named indices where r is greater than 3
which(abs(r) > 3)

Potential Outliers

Outliers in linear regression are tough. Sometimes they heavily influence your least squares line. General recommendations:

  • fit linear regression with and without the outliers
  • report qualitative and quantitative differences between the two models
  • if you are convinced that the outlier(s) are in error
    • you had better have a good reason to justify exclusion; state that reason
    • not liking the model with the point(s) included is not a good reason

Example

Let’s return to the data frame elmhurst. First use linear regression to predict gift_aid with family_income, then calculate the data we need.

Code
url <- "https://raw.githubusercontent.com/roualdes/data/refs/heads/master/elmhurst.csv"
elmhurst <- read.csv(url)
fit <- lm(gift_aid ~ family_income, data=elmhurst)
dfr <- data.frame(
    yhat = fitted(fit),
    r = rstandard(fit)
)

Example

Residuals on fitted values

Code
ggplot(dfr, aes(yhat, r)) +
    geom_point() +
    geom_hline(aes(yintercept = 0))

Example

Histogram of residuals

Code
ggplot(dfr, aes(r)) +
    geom_histogram(binwidth = 1/3)

Example

Code
r <- dfr$r
(idx <- which(abs(r) > 3))
integer(0)
Code
(jdx <- which(abs(r) > 2))
[1] 16 34
Code
(xout <- elmhurst[jdx, "family_income"])
[1] 73.598 97.664
Code
(yout <- elmhurst[jdx, "gift_aid"])
[1] 32.72 10.00

Example

Code
ggplot(data=elmhurst, aes(family_income, gift_aid)) +
    geom_point() +
    stat_smooth(method="lm", se=FALSE) +
    geom_point(data=data.frame(x=xout, y=yout), aes(x, y), colour="red") +
    labs(x="Family income ($1000)", y="Gift aid ($1000)")

Take Away

Checking model assumptions is not an easy skill to learn, but the plots aren’t bad. Plots, yet again, help.

  • Use standardized residuals.
  • Common plots to help check assumptions
    • Residuals on fitted – scatter plot
    • histogram of residuals
  • Outliers are tough
    • You had better have a good, explicitly stated reason to report only the data set (and subsequent model) with the outliers removed