Linear Models

These notes will develop Normal linear regression models. Intuition will start with a line through a scatter plot of data. The likelihood function for linear models will help us develop intuition for particular aspects of these models. Then we'll generalize the likelihood to more than one predictor.

Intuition

One of the simplest Normal linear models assumes we have a scatter plot of data like in the plot below. Here are displayed penguins' flipper length measured in millimeters on the x-axis and body mass measured in grams on the y-axis. A line of best fit through these data helps us understand a linear relationship between the predictor flipper length and the response body mass.

Most of this page is about what the phrase best fit means. The rest describes particular aspects of the best fit line through such data.

We could imagine a best fitting line through the plot of penguins' body measurements by choosing simultaneously an intercept, which we'll call \hat{\beta}_0, and a slope, which we'll call \hat{\beta}_1. The hats on the coefficients help remind us that these are estimated coefficients based on a limited data set of penguins -- if we had all the penguins in the world, the values would almost certainly be different. Such a line, specific to these data, then looks like

\widehat{\text{body mass}} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{flipper length}

Similarly, the (wide) hat over the variable \widehat{\text{body mass}} reminds us that we are predicting expected values of penguins' body mass, not necessarily true values.

To help you develop a sense of what it means to choose values of (\hat{\beta}_0, \hat{\beta}_1), here are a couple of coefficient pairs, with corresponding lines overlaid on the plot. While you guess which coefficients correspond to which lines, start thinking about what distances (physical distances on the plot) are important to define a good choice of coefficients.

Since we want to be able to describe such a line no matter what the x,y-axis variables are, we'll generalize the specific line above to abstract x,y-axis variables. The y-axis variable is called the response variable and is denoted simply y. The x-axis variable is called the explanatory variable and is denoted x. To help you remember which is the explanatory variable, emphasize the x when you say explanatory. The general line looks like

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x

Normal Distribution

Following from the notes on the Likelihood Method, if we assume we have independent data Y_1, \ldots, Y_N \sim \text{Normal}(\mu, \sigma), where \mu is the location parameter and \sigma is the scale parameter, the simplified log-likelihood for the parameter \mu is

\ell(\mu) = -\sum_{n=1}^{N} (y_n - \mu)^2

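As a quick numerical check on the formula above, here's a Python sketch, on made-up data, showing that the sample mean maximizes this simplified log-likelihood.

```python
# Simplified Normal log-likelihood for the location parameter mu:
# ell(mu) = -sum((y_n - mu)^2). The data below are invented.
y = [4.2, 3.9, 5.1, 4.7, 4.4]

def simplified_log_likelihood(mu, data):
    """Return the negative sum of squared deviations of the data from mu."""
    return -sum((yn - mu) ** 2 for yn in data)

# The sample mean maximizes this log-likelihood: nudging mu in either
# direction can only decrease it.
mean_y = sum(y) / len(y)
assert simplified_log_likelihood(mean_y, y) > simplified_log_likelihood(mean_y + 0.1, y)
assert simplified_log_likelihood(mean_y, y) > simplified_log_likelihood(mean_y - 0.1, y)
```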
Likelihood Function, one predictor

For Normal linear models with just one predictor, namely the explanatory variable x, we assume the response variable follows a Normal distribution with a location parameter set by the line through x,

Y \sim \text{Normal}(\beta_0 + \beta_1 x, \sigma)

Somewhere else on the internet, you might see the model above written as

y = \beta_0 + \beta_1 x + \epsilon

where the error term denoted \epsilon is assumed to follow a Normal distribution, \epsilon \sim \text{Normal}(0, \sigma). These two phrasings of the same model are equivalent because the Normal distribution is closed under additive shifts of the location parameter.

Notice that we've dropped the hats, \beta_0, \beta_1, when we describe the model in theory, as we're doing here. The hats only show up when we've estimated the coefficients from the available data.

Because Normal linear regression models assume the errors are Normally distributed, the simplified log-likelihood from the previous section gives us the necessary form for the simplified log-likelihood for (Normal) linear regression models. The only change is to treat the parameter \mu as a placeholder for a line in the coefficients \beta_0, \beta_1.

If we replace \mu with a linear model in the predictor x, that is \mu = \beta_0 + \beta_1 x, then a simple substitution reveals the simplified log-likelihood for (Normal) linear regression models in one predictor.

\ell(\beta_0, \beta_1) = -\sum_{n=1}^{N} \left( y_n - (\beta_0 + \beta_1 x_n) \right)^2

This (simplified) log-likelihood reveals the distance we use to measure a line that best fits the data. We measure the squared (vertical) distance of each observation y_n from the line \beta_0 + \beta_1 x_n, for all the data available, n = 1, \ldots, N.

When we find the coefficients that maximize the log-likelihood above, equivalently minimize the sum of squared distances, we'll have identified the line that best fits the data. We won't develop the calculus in these notes. Instead, try dragging around the coefficients below to find the line that best fits the data. Use the line, the data, and the value of the likelihood presented just below the plot.
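If you'd rather not drag, the calculus yields well-known closed forms for the best fitting coefficients. Here's a Python sketch, on invented data standing in for the penguin measurements, showing that the closed-form line beats another candidate line on the sum of squared errors.

```python
# Closed-form least squares estimates for one predictor. The tiny
# data set is hypothetical, standing in for the penguin measurements.
x = [180.0, 190.0, 200.0, 210.0, 220.0]   # e.g. flipper length (mm)
y = [3200.0, 3600.0, 4100.0, 4400.0, 5000.0]  # e.g. body mass (g)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Maximizing the simplified log-likelihood is the same as minimizing
# sum((y_n - (b0 + b1*x_n))^2); calculus gives these closed forms.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
# Any other line, e.g. one with the intercept shifted up by 50,
# has a larger sum of squared errors.
sse_other = sum((yi - ((b0 + 50) + b1 * xi)) ** 2 for xi, yi in zip(x, y))
assert sse < sse_other
```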

Interpreting Coefficients, one predictor

When learning about how to interpret the coefficients of a linear model, it helps to have something tangible to talk about. So let's consider again the data set about penguins' body measurements from above, where body mass is measured in grams (g) and flipper length is measured in millimeters (mm). The fitted linear model equation is

\widehat{\text{body mass}} = -5780.83 + 49.69 \cdot \text{flipper length}

The intercept is estimated as \hat{\beta}_0 = -5780.83. The slope is estimated to be \hat{\beta}_1 = 49.69.

Even though it doesn't always make sense, a strict interpretation of the intercept goes like this.

When flipper length is 0 mm, we expect the body mass to be -5780.83 g.

This doesn't make sense for two reasons. First, we don't expect any penguin to have a flipper length of 0 mm. Second, we don't expect any penguin to have a negative body mass. Nevertheless, intercepts are useful for best fitting a line through data. So we won't just drop it.

An interpretation of the slope goes like this.

For each extra millimeter increase in flipper length, we expect to see an increase of 49.69g in body mass.

Notice the word expect in both interpretations. This word is important. We don't really expect any particular penguin to match the intercept nor slope interpretations exactly. Both the intercept and slope are what we expect on average. This is in similar spirit to the fact that a mean doesn't have to be a number in the original data set: the mean number of children per household in the United States is roughly 2.4. We don't believe that any particular household will have 2.4 children.

Predictions, one predictor

We can use a fitted linear model equation to make predictions, say for a value of the predictor variable that doesn't exist in our data set. Mathematically, this is easy. Simply replace the predictor variable with the value at which you want to make a prediction. Let's consider again the fitted equation about penguins.

\widehat{\text{body mass}} = -5780.83 + 49.69 \cdot \text{flipper length}

The expected body mass for a penguin with a flipper length of x^* mm is -5780.83 + 49.69 \cdot x^* g.
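Here's this plug-in recipe sketched in Python. The coefficient values are the estimates assumed for the penguin fit above, and the flipper length of 200 mm is chosen purely for illustration.

```python
# Prediction from a fitted one-predictor equation. The coefficient
# values are assumed estimates for the penguin fit (intercept in g,
# slope in g/mm); treat the exact numbers as illustrative.
B0_HAT = -5780.83
B1_HAT = 49.69

def predict_body_mass_g(flipper_length_mm):
    """Expected body mass (g) for a given flipper length (mm)."""
    return B0_HAT + B1_HAT * flipper_length_mm

# A flipper length of 200 mm, chosen here just for illustration:
print(round(predict_body_mass_g(200), 2))  # 4157.17
```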

Sensitivity to Outliers

More than one Predictor

Normal linear regression models assume the numeric response variable Y depends on some predictors x_1, \ldots, x_K via a linear combination of coefficients \beta_0, \beta_1, \ldots, \beta_K. Notationally, we might write

Y \sim \text{Normal}(\beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K, \sigma)

In other places, you might see the same model written as

y = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K + \epsilon

where the error \epsilon is assumed to follow a Normal distribution, \epsilon \sim \text{Normal}(0, \sigma). These two phrasings are equivalent, just as before, despite the fact that we now have more than one predictor.

The predictors can be included in a linear model in a number of different ways. For instance, a predictor can be a numeric explanatory variable itself, a power of a numeric variable, an indicator for a level of a categorical variable, a non-linear transformation of a numeric variable, or a product of other predictors.

Including an explanatory variable x in quadratic form looks like

y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon

This linear model fits a curve through data, and yet it is still considered a linear model. ¿Why is this model linear? Because the adjective linear in the phrase linear model describes a linear combination of coefficients \beta_0, \beta_1, \beta_2. It doesn't really matter what the predictors are, so long as the coefficients are included linearly.

Predictors can be indicator functions of the levels of a categorical variable. Suppose we have a categorical variable g which has levels a, b, c. One can attempt to predict a response variable y using the levels of g. An example of such a linear model might look like

y = \beta_0 + \beta_1 \mathbb{1}(g = b) + \beta_2 \mathbb{1}(g = c) + \epsilon

This linear model fits group means, where the coefficient \beta_0 is the group mean of the variable y for all the data that are level a. The coefficient \beta_1 is the offset for the level b relative to the level a, such that \beta_0 + \beta_1 is the group mean of y for all data that are level b. The coefficient \beta_2 is the offset for the level c relative to the level a.
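To see the group-mean claim concretely, here's a Python sketch on invented data with levels a, b, c: the baseline mean and the two offsets recover each group mean exactly.

```python
# Indicator (dummy) coding for a categorical variable g with levels
# a, b, c; the numbers are invented so the group means are obvious.
g = ["a", "a", "b", "b", "c", "c"]
y = [10.0, 12.0, 20.0, 22.0, 5.0, 7.0]

def group_mean(level):
    """Mean of y over the rows whose level of g matches `level`."""
    vals = [yi for gi, yi in zip(g, y) if gi == level]
    return sum(vals) / len(vals)

mean_a, mean_b, mean_c = group_mean("a"), group_mean("b"), group_mean("c")

# With a as the baseline level, the fitted coefficients are:
b0 = mean_a           # intercept: group mean of the baseline level a
b1 = mean_b - mean_a  # offset of level b relative to a
b2 = mean_c - mean_a  # offset of level c relative to a

assert b0 + b1 == mean_b  # recovers the group mean for level b
assert b0 + b2 == mean_c  # recovers the group mean for level c
```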

¿Why did I assume the baseline level a? I can't break my R habits, and in R, the function lm sorts the levels (unless told otherwise) in lexicographic order, which generally means alphabetic order when no special characters are used in the variable levels.

Univariate non-linear transformations such as \log(x) are also allowed in linear models, since this amounts to a labeling issue and never changes the fact that the coefficients are linear.

In fact, combinations of multiple explanatory variables are also allowed in the world of linear models. For instance, a second order term in two numeric explanatory variables is a valid predictor,

x_1 \cdot x_2

Moreover, any linear combination of the examples discussed above makes a valid linear model.

Likelihood

Like before, the only change necessary to account for K predictors is to treat the parameter \mu as a placeholder for a linear model. If we replace \mu from the likelihood for a Normal distribution with a linear model in the predictors x_1, \ldots, x_K, that is \mu = \beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K, then a simple substitution reveals the simplified log-likelihood for (Normal) linear regression models.

\ell(\beta_0, \ldots, \beta_K) = -\sum_{n=1}^{N} \left( y_n - (\beta_0 + \beta_1 x_{n1} + \cdots + \beta_K x_{nK}) \right)^2

Predictions

Using a linear model to make predictions follows a similar strategy as before. Figure out the values of the predictors for which you want to make a prediction of \hat{y}, then plug these values into the linear model equation and do the math.

Suppose we have the model

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3

To make a prediction, first choose values for the predictors. Say we chose x_1 = x_1^*, x_2 = x_2^*, and x_3 = x_3^*. Then a prediction follows as

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1^* + \hat{\beta}_2 x_2^* + \hat{\beta}_3 x_3^*
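A minimal Python sketch of this plug-in arithmetic; every number below, coefficients and predictor values alike, is hypothetical.

```python
# A hypothetical fitted model with three predictors; all coefficient
# and predictor values below are made up for illustration.
b0, b1, b2, b3 = 2.0, 0.5, -1.0, 3.0

def predict(x1, x2, x3):
    """Plug chosen predictor values into the fitted linear model equation."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x3

# Prediction at the chosen values x1 = 4, x2 = 1, x3 = 0.5:
print(predict(4.0, 1.0, 0.5))  # 2.0 + 2.0 - 1.0 + 1.5 = 4.5
```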

Interpreting Coefficients

With K predictors, it's best to return to calculus to aid our understanding of interpreting coefficients in linear models. Consider the linear model equation

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_K x_K

The simplest case is when no predictor has a term in common with any other predictor. In this case, we take a derivative of \hat{y} with respect to the corresponding predictor. To interpret the coefficient \hat{\beta}_1, we first calculate

\frac{\partial \hat{y}}{\partial x_1} = \hat{\beta}_1

Since the partial derivative treats all variables other than x_1 as constants, we would interpret the coefficient \hat{\beta}_1 as follows.

Holding all else constant, for a one unit increase in x_1, we expect y to increase by \hat{\beta}_1 units.

This is a general statement, into which you should fill in variable names and variable units.

In the case that one predictor is a prior predictor with a power greater than one, say x_2 = x_1^2, then the calculus changes slightly and our interpretation must follow.

\frac{\partial \hat{y}}{\partial x_1} = \hat{\beta}_1 + 2 \hat{\beta}_2 x_1

Here, x_1 is included in the linear model quadratically -- x_1 is included in the linear model both as a first order term, \hat{\beta}_1 x_1, and as a second order term, \hat{\beta}_2 x_1^2. A one unit change in x_1 induces a change in \hat{y} that follows a line in x_1. The change in \hat{y} with respect to a one unit change in x_1 depends on the magnitude of x_1.

Next, let's consider one predictor as a multiple of two previously included predictors, say x_3 = x_1 x_2. Notice then that the derivative of \hat{y} with respect to x_1 depends on the value of x_2, and vice versa.

\frac{\partial \hat{y}}{\partial x_1} = \hat{\beta}_1 + \hat{\beta}_3 x_2

Interpretting one of these would go something like this.

Holding all else constant, for a one unit increase in x_1 and when x_2 = x_2^*, we expect y to increase by \hat{\beta}_1 + \hat{\beta}_3 x_2^* units.

Sometimes it is easier to approximate a change in \hat{y} based on a one unit change in a predictor, say x_1. Using numerical software, you can approximate a change in \hat{y} based on a one unit change in predictor x_1 by calculating two predictions, let's call them \hat{y}_1, \hat{y}_2. Such predictions are calculated as

\hat{y}_1 = \hat{y}(x_1 = x_1^*)

and

\hat{y}_2 = \hat{y}(x_1 = x_1^* + 1)

with all other predictors held at fixed values.

This creates two predictions for a one unit change in x_1. We can thus approximate with a line the change in \hat{y} due to a one unit change in x_1 as

\Delta \hat{y} = \hat{y}_2 - \hat{y}_1

This strategy is really quite general and can be used as a first order approximation for rates of change of quite general linear models.
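Here's the finite-difference strategy sketched in Python for a hypothetical quadratic model. The exact one-unit difference matches the calculus-based rate of change up to an extra term from the curvature.

```python
# Finite-difference approximation of the change in y-hat for a one
# unit increase in x1, in a hypothetical quadratic model.
b0, b1, b2 = 1.0, 2.0, 0.5  # y-hat = b0 + b1*x1 + b2*x1**2

def y_hat(x1):
    return b0 + b1 * x1 + b2 * x1 ** 2

x_star = 3.0
y1 = y_hat(x_star)       # prediction at x1 = x*
y2 = y_hat(x_star + 1)   # prediction at x1 = x* + 1
delta = y2 - y1          # approximate change per one unit of x1

# For this model the exact one-unit difference is b1 + b2*(2*x_star + 1),
# close to the derivative b1 + 2*b2*x_star evaluated at x_star.
assert delta == b1 + b2 * (2 * x_star + 1)
print(delta)  # 5.5
```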

Assumptions

Some of the assumptions behind linear models can be deduced from the statistical notation for the model

Y \sim \text{Normal}(\beta_0 + \beta_1 x_1 + \cdots + \beta_K x_K, \sigma)

Linearity is the first assumption of linear models. As mentioned above, this is a bit more delicate than simply looking at the data, both because linear models can successfully fit curves through data and because with multiple predictors, it's not always simple to determine from where non-acceptable non-linearities arise.

Equal variance along the range of the data being modeled. This assumption is sometimes called homogeneity, but such a name seems less direct about its meaning. The equal variance assumption comes from the fact that the scale term \sigma in the model notation is fixed and not a function of any predictor x_k. In theory, one could develop a model where the scale parameter is a function of the data, but this is less common and will not be discussed here.

Independent errors come from the development of the likelihood. If you suspect your data have underlying correlations, look into hierarchical and/or time-series modeling.

Normality comes from the fact that we are using a Normal distribution to describe the errors between the linear model and the y-axis observations. This assumption is often listed last as there are a number of details that work to most applied statisticians' benefit, e.g. the Gauss-Markov Theorem and/or the Central Limit Theorem.

Checking Assumptions

In my opinion, checking assumptions in linear models is more of an art than it is an exact science. There are certainly more methods to throw at this problem: TODO link to some tests for homogeneity and independence. However, such tests try to answer with a binary hypothesis test (pass/fail) a question that is, I believe, more grey than hypothesis testing admits. Some of the assumptions of linear models are less important in most instances than others. Some statistical analyses don't require all the assumptions. Some interpretations/decisions/conclusions from a fit linear model are more robust to varying degrees of departure from the technical assumptions.

To check the assumptions of linearity, equal variance, and Normality, the following two plots are helpful. The assumption of independent errors is, not exclusively, but largely up to the analyst to think hard about.

The standardized residuals are of great help in assessing the assumptions of linear models. We'll start with residuals. There is one residual for each data point in the original data set a linear model was fit to. Let's call them r_n, where n indexes the data, n = 1, \ldots, N. To calculate the nth residual, first calculate the predicted value for the nth observation,

\hat{y}_n = \hat{\beta}_0 + \hat{\beta}_1 x_{n1} + \cdots + \hat{\beta}_K x_{nK}

With the predictions \hat{y}_n, calculate the residuals as

r_n = y_n - \hat{y}_n

Standardized residuals are scaled to have a standard deviation of one, but it takes some work to prove what the proper scaling term is. We present the proper scaling term as part of the expression of standardized residuals. The proof is deferred to another page; TODO make a page proving standardized residuals. The term h_n is called a hat value and is discussed in the section Matrix Notation below. The standardized residuals are calculated as

r_n^{\text{std}} = \frac{r_n}{\sqrt{\text{MSE} \cdot (1 - h_n)}}

where \text{MSE} is the mean squared error from the fit model; TODO write more explicitly about MSE.
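If you'd like to compute standardized residuals by hand rather than via rstandard, here's a Python sketch for the one-predictor case. It uses the known closed form for the hat values of simple regression; the data are invented.

```python
# Standardized residuals for a one-predictor model, computed by hand.
# Hat values use the closed form for simple linear regression:
# h_n = 1/N + (x_n - x_bar)^2 / sum((x_i - x_bar)^2). Data invented.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Closed-form least squares fit.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
mse = sum(r ** 2 for r in residuals) / (n - 2)  # two coefficients estimated

hat_values = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
standardized = [r / math.sqrt(mse * (1 - h))
                for r, h in zip(residuals, hat_values)]
```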

A histogram of standardized residuals helps to check the assumption of Normality and can be used to help identify potential outliers. It is easiest to use some numerical software to calculate standardized residuals. In R the function rstandard (or rstudent, same link) can be used. In Python, the analogous methods are resid_studentized_internal and resid_studentized_external. With these numbers in hand, make a histogram. You should see at least approximate Normality in the standardized residuals. Outliers will present as points on the x-axis with magnitude greater than some threshold. I tend to use three as this threshold, but there is no set rule.

A scatter plot of the standardized residuals against the predictions from the fit model, for all data in the original data set, will help assess the assumptions of linearity and equal variance. Put the standardized residuals on the y-axis and the predictions on the x-axis.

Matrix Notation

If you are comfortable with matrix notation, then let X be a model matrix, let \beta = (\beta_0, \beta_1, \ldots, \beta_K)^\top, and write

Y \sim \text{Normal}(X\beta, \sigma)

Then the simplified log-likelihood can be written compactly as

\ell(\beta) = -(y - X\beta)^\top (y - X\beta)

References