Linear Models
These notes will develop Normal linear regression models. Intuition will start with a line through a scatter plot of data. The likelihood function for linear models will help us develop intuition for particular aspects of these models. Then we'll generalize the likelihood to more than one predictor.
Intuition
One of the simplest Normal linear models starts from a scatter plot of data like the one below, which displays penguins' flipper length measured in millimeters on the x-axis and body mass measured in grams on the y-axis. A line of best fit through these data helps us understand the linear relationship between the predictor flipper length and the response body mass.
Most of this page is about what the phrase best fit means; much of the rest describes particular aspects of the best fit line through such data.
We could imagine a best fitting line through the plot of penguins' body
measurements by simultaneously choosing an intercept, which we'll call
Similarly, the (wide) hat over the variable
To help you develop a sense of what it means to choose values of
Since we want to be able to describe such a line no matter what the
x,y-axis variables are, we'll generalize the specific line above to
abstract x,y-axis variables. The y-axis variable is called the response
variable and is denoted simply
Normal Distribution
Following from the notes on the Likelihood Method, if
we assume we have independent data
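As a concrete sketch of where this is headed, independence lets us write the log-likelihood of Normal data as a sum of log densities. Here is a minimal plain-Python version (the data and parameter values are made up for illustration):

```python
import math

def normal_loglik(data, mu, sigma):
    """Log-likelihood of independent Normal observations.

    Independence lets us sum the log densities of the individual points.
    """
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (d - mu) ** 2 / (2 * sigma ** 2)
        for d in data
    )

ll = normal_loglik([4.2, 5.1, 3.8], mu=4.0, sigma=1.0)
```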
Likelihood Function, one predictor
For Normal linear models with just one predictor, namely the
explanatory variable, we assume the response variable
Somewhere else on the internet, you might see the model above written as
where the error term denoted
Notice that we've dropped the hats,
Because Normal linear regression models assume the errors are Normally
distributed, the simplified log-likelihood from the previous section
gives us the necessary form for the simplified log-likelihood for
(Normal) linear regression models. The only change is to treat the
parameter
If we replace
This (simplified) log-likelihood reveals the distance we use to
measure a line that best fits the data. We measure the squared
(vertical) distance between the observation
When we find the coefficients
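As a sketch of that minimization, the one-predictor least-squares problem has a closed-form solution. The plain-Python function below computes the intercept and slope that minimize the sum of squared vertical distances (the toy data are made up):

```python
def fit_line(x, y):
    """Closed-form least-squares fit of y = b0 + b1 * x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)          # sum of squares of x
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx           # slope
    b0 = ybar - b1 * xbar    # intercept
    return b0, b1

b0, b1 = fit_line([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
```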
Interpreting Coefficients, one predictor
When learning about how to interpret the coefficients of a linear model, it helps to have something tangible to talk about. So let's consider again the data set about penguins' body measurements from above, where body mass is measured in grams (g) and flipper length is measured in millimeters (mm). The fitted linear model equation is
The intercept is estimated as
Even though it doesn't always make sense, a strict interpretation of the intercept goes like this.
When flipper length is
, we expect the body mass to be .
This doesn't make sense for two reasons. First, we don't expect any
penguin to have a flipper length of
An interpretation of the slope goes like this.
For each extra millimeter increase in flipper length, we expect to see an increase of 49.69g in body mass.
Notice the word expect in both interpretations. This word is important. We don't really expect any particular penguin to match the intercept or slope interpretations exactly. Both the intercept and slope are what we expect on average. This is similar in spirit to the fact that a mean doesn't have to be a number in the original data set: the mean number of children per household in the United States is roughly 2.4, yet we don't believe that any particular household has 2.4 children.
Predictions, one predictor
We can use a fitted linear model equation to make predictions, say for a value of the predictor variable that doesn't exist in our data set. Mathematically, this is easy. Simply replace the predictor variable with the value at which you want to make a prediction. Let's consider again the fitted equation about penguins.
The expected body mass for a penguin with a flipper length of
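Mechanically, a prediction is just arithmetic, as in the sketch below. The slope 49.69 g/mm is the one quoted above; the intercept here is a stand-in value for illustration only, not the estimate from these notes:

```python
def predict(x_new, b0, b1):
    """Plug a new predictor value into the fitted equation."""
    return b0 + b1 * x_new

slope = 49.69          # g per mm, quoted in the text above
intercept = -5800.0    # stand-in value, NOT these notes' actual estimate
mass = predict(200.0, intercept, slope)   # expected body mass at 200 mm
```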
Sensitivity to Outliers
- make plot where an observation is movable
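Until that interactive plot exists, a quick numeric sketch makes the point: fit the same toy data with and without one extreme observation and watch the slope move.

```python
def fit_line(x, y):
    """Closed-form least-squares fit of y = b0 + b1 * x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

x, y = [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]     # perfectly linear: slope 1
_, slope = fit_line(x, y)
_, slope_out = fit_line(x + [6], y + [20])  # add one extreme observation
# slope == 1.0, while slope_out is pulled up to 3.0 by a single point
```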
More than one Predictor
Normal linear regression models assume the numeric response variable
In other places, you might see the same model written as
where the error is assumed to follow a Normal distribution,
The predictors
- raised to a power different than one, e.g.
- an indicator function over the levels of a categorical variable,
- a generally non-linear transformation of underlying explanatory variables, e.g.
- part of a spline, e.g. a natural cubic spline, or a B-spline
Including an explanatory variable in quadratic form looks like
This linear model fits a curve through the data, and yet it is still
considered a linear model. ¿Why is this model linear? Because the
adjective linear in the phrase linear model describes a linear
combination of coefficients
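To see the "linear in the coefficients" point concretely, here is a sketch: a design-matrix row for a quadratic model still produces fitted values as a plain linear combination of coefficients (the coefficient values are hypothetical):

```python
def predict(row, coefs):
    """Fitted value = linear combination of coefficients and predictors."""
    return sum(c * v for c, v in zip(coefs, row))

x = 3.0
row = [1.0, x, x ** 2]       # intercept column, x, and x squared
coefs = [2.0, -1.0, 0.5]     # hypothetical b0, b1, b2
yhat = predict(row, coefs)   # 2.0 - 3.0 + 4.5 = 3.5
```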
Predictors can be indicator functions of the levels of a categorical
variable. Suppose we have a categorical variable
This linear model fits group means, where the coefficient
¿Why did lm
sorts the levels (unless told otherwise) in lexicographic order, which
generally means alphabetic order when no special characters are used
in the variable levels.
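A sketch of that default coding: sort the levels lexicographically, treat the first as the reference level, and build one 0/1 indicator column per remaining level (the species values below are illustrative):

```python
def dummy_code(values):
    """Indicator columns for a categorical variable, in the style of R's lm."""
    levels = sorted(set(values))               # lexicographic order
    reference, rest = levels[0], levels[1:]    # first level is the reference
    columns = {lvl: [1 if v == lvl else 0 for v in values] for lvl in rest}
    return reference, columns

species = ["Gentoo", "Adelie", "Chinstrap", "Adelie"]
ref, cols = dummy_code(species)
# ref is "Adelie"; cols holds indicators for "Chinstrap" and "Gentoo"
```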
Univariate non-linear transformations such as
In fact, combinations of multiple explanatory variables are also allowed in the world of linear models. For instance, a second order term in two numeric explanatory variables is a valid predictor
Moreover, any linear combination of the examples discussed above makes a valid linear model.
Likelihood
Like before, the only change necessary to account for
Predictions
Using a linear model to make predictions follows a similar strategy as
before. Figure out the values of the predictors for which you want to
make a prediction of
Suppose we have the model
To make a prediction, first choose values for the predictors. Say we
choose
Interpreting Coefficients
With
The simplest case is when no predictor has a term in common with any
other predictor. In this case, we take a derivative of
Since the partial derivative treats all other variables than
Holding all else constant, for a one unit increase in
, we expect to increase by units.
This is a general statement, from which you should fill in variable
names, and
In the case that one predictor is a prior predictor with a power
greater than one, say
Here,
Next, let's consider one predictor as a multiple of two previously
included predictors, say
Interpreting one of these would go something like this.
Holding all else constant, for a one unit increase in
and when , we expect to increase by units.
Sometimes it is easier to approximate a change in
and
This creates two predictions for a one unit change in
This strategy is really quite general and can be used as a first order approximation for rates of change of quite general linear models.
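That first-order strategy can be sketched directly: predict at the chosen values, predict again with one predictor increased by one unit, and take the difference. The model and coefficients below are hypothetical:

```python
def yhat(x1, x2, b0=1.0, b1=2.0, b2=-0.5, b3=0.25):
    """Hypothetical model with an interaction: b0 + b1*x1 + b2*x2 + b3*x1*x2."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

x1, x2 = 3.0, 4.0
change = yhat(x1 + 1, x2) - yhat(x1, x2)  # effect of +1 in x1, x2 held fixed
# analytically this equals b1 + b3 * x2 = 2.0 + 0.25 * 4.0 = 3.0
```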
Assumptions
Some of the assumptions behind linear models can be deduced from the statistical notation for the model
Linearity is the first assumption of linear models. As mentioned above, this is a bit more delicate than simply looking at the data, both because linear models can successfully fit curves through data and because, with multiple predictors, it's not always simple to determine where unacceptable non-linearities arise.
Equal variance along the range of the data being modeled. This
assumption is sometimes called homogeneity, but such a name seems
less direct about its meaning. The equal variance assumption comes
from the fact that the scale term
Independent errors comes from the development of the likelihood. If you suspect your data have underlying correlations, look into hierarchical and/or time-series modeling.
Normality comes from the fact that we are using a Normal distribution to describe the errors between the linear model and the y-axis observations. This assumption is often listed last, as there are a number of details that work to most applied statisticians' benefit, e.g. the Gauss-Markov Theorem and/or the Central Limit Theorem.
Checking Assumptions
In my opinion, checking assumptions in linear models is more of an art than an exact science. There are certainly more methods to throw at this problem: TODO link to some tests for homogeneity and independence. However, such tests try to answer with a binary hypothesis test (pass/fail), a topic that is, I believe, more grey than hypothesis testing admits. Some of the assumptions of linear models are less important in most instances than others. Some statistical analyses don't require all the assumptions. Some interpretations/decisions/conclusions from a fitted linear model are more robust to varying degrees of departure from the technical assumptions.
To check the assumptions of linearity, equal variance, and Normality, the following two plots are helpful. The assumption of independent errors is largely, though not exclusively, up to the analyst to think hard about.
The standardized residuals are of great help in assessing the
assumptions of linear models. We'll start with residuals. There is
one residual for each data point in the original data set a linear
model was fit to. Let's call them
With the
Standardized residuals are scaled to have a standard deviation of one,
but it takes some work to prove what the proper scaling term is. We
present the proper scaling term as part of the expression of
standardized residuals. The proof is deferred to another page; TODO
make a page proving standardized residuals. The term
where
A histogram of standardized residuals
A scatter plot of the standardized residuals,
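For the one-predictor case, the whole pipeline of fit, residuals, leverage, and scaling fits in a short sketch. The scaling divides each residual by s * sqrt(1 - h_i), with h_i the leverage of observation i (toy data below):

```python
import math

def standardized_residuals(x, y):
    """Residuals scaled by s * sqrt(1 - leverage), one-predictor case."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in resid) / (n - 2)           # estimate of sigma^2
    lev = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]  # leverage h_i
    return [e / math.sqrt(s2 * (1 - h)) for e, h in zip(resid, lev)]

r = standardized_residuals([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
```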
Matrix Notation
If you are comfortable with matrix notation, then let
Then the simplified log-likelihood can be written compactly as
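For a single predictor, the matrix solution to the normal equations can be written out by hand, since X^T X is only a 2x2 matrix. A sketch with made-up toy data:

```python
def ols_matrix(x, y):
    """Solve the normal equations (X^T X) b = X^T y for X = [1, x]."""
    n = len(x)
    sx, sxx = sum(x), sum(xi ** 2 for xi in x)
    sy, sxy = sum(y), sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx ** 2            # determinant of the 2x2 matrix X^T X
    b0 = (sxx * sy - sx * sxy) / det   # intercept
    b1 = (n * sxy - sx * sy) / det     # slope
    return b0, b1

b0, b1 = ols_matrix([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
# same answer as the summation formulas: b0 = 0.15, b1 = 1.94
```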
References
- Penn State's STAT 462: Applied Regression Analysis
- TODO: complete references
- R doc
- Python's statsmodels doc
- Wikipedia