Simple Linear Regression

Edward A. Roualdes

Recap

recap: point estimates

Point estimates are random variables. Random variables follow shapes, called distributions. Therefore, point estimates follow distributions (and have their own shapes), named sampling distributions.

recap: standard errors

The standard deviation of a sampling distribution is called a standard error. The standard error shrinks with the square root of the sample size.
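
A quick simulation sketch, with made-up Normal data, makes this concrete: quadrupling the sample size roughly halves the standard error.

Code
set.seed(42)
B <- 5000 # number of simulated samples

# the standard deviation of many sample means estimates the standard error
se_n25  <- sd(replicate(B, mean(rnorm(25,  mean = 0, sd = 10))))
se_n100 <- sd(replicate(B, mean(rnorm(100, mean = 0, sd = 10))))

se_n25  # close to 10 / sqrt(25)  = 2
se_n100 # close to 10 / sqrt(100) = 1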

recap: central limit theorem

The Central Limit Theorem says, “If our sample size is large enough, the sample mean will be approximately Normally distributed.”

recap: confidence intervals

From the CLT, we can approximate confidence intervals from an approximate sampling distribution.
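
A sketch with made-up data: an approximate 95% confidence interval for a mean, built from the CLT.

Code
set.seed(42)
x <- rnorm(50, mean = 5, sd = 2) # a made-up sample

xbar <- mean(x)
se <- sd(x) / sqrt(length(x))

# approximate 95% confidence interval, justified by the CLT
xbar + c(-1, 1) * qnorm(0.975) * se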

recap: hypothesis tests

From the CLT, we can approximate area in tails (p-values) from an approximate sampling distribution.
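
A two-sided p-value is twice the area in one tail of the approximate (standard Normal) sampling distribution; a sketch with a made-up test statistic:

Code
z <- 1.8 # a made-up standardized test statistic

# two-sided p-value: area in both tails of the standard Normal
2 * pnorm(-abs(z))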

recap: ANOVA

ANOVA compares the means of a numerical response variable across the levels of one categorical variable.
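
For reference, a minimal sketch using R’s aov with made-up data (the data frame d and its columns are hypothetical):

Code
set.seed(42)
d <- data.frame(
    group = rep(c("a", "b", "c"), each = 20),
    value = rnorm(60, mean = rep(c(5, 6, 7), each = 20))
)

# one numerical response broken up by one categorical variable
summary(aov(value ~ group, data = d))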

recap: scatter plot

Scatterplots are a graphical description of two numerical variables; consider Darwin’s finch data.

Code
url <- "https://raw.githubusercontent.com/roualdes/data/master/finches.csv"
finch <- read.csv(url)
ggplot(data=finch, aes(middletoelength, winglength)) +
    geom_point() +
    labs(xlab="Middle toe length (mm)",
         ylab="Wing length (mm)")

recap: scatter plot, take 2

Some keywords to describe scatterplots

  • associated or not.
  • direction: positive or negative association.
  • structure: linear or nonlinear.

Correlation, definition

Correlation is often denoted by \(R\) and is the numeric analogue of the words above.

  • Correlation describes the strength of the linear relationship between two variables, and always takes values between -1 and 1.
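
For paired observations \((x_i, y_i)\) with sample means \(\bar{x}, \bar{y}\) and sample standard deviations \(s_x, s_y\), one common way to write this definition is

\[R = \frac{1}{n - 1} \sum_{i=1}^n \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)\]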

Correlation, notes

Notes on correlation

  • A more accurate name is the Pearson correlation coefficient.
  • Describes linear relationships only.
  • Bounded by -1 and 1.
  • The value \(0\) denotes no association.
  • The sign dictates directionality.

Correlation, R code

In R we should just use the function cor.

Code
cor(finch$middletoelength, finch$winglength)
[1] 0.7034241
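
As a sanity check, the same value falls out of the definition above computed by hand:

Code
x <- finch$middletoelength
y <- finch$winglength

# correlation from its definition; matches cor(x, y)
sum((x - mean(x)) / sd(x) * (y - mean(y)) / sd(y)) / (length(x) - 1)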

Correlation, examples

[A series of scatterplots illustrating correlations of varying strength and direction.]

Correlation, (watch out) examples

[Scatterplots where the correlation coefficient can be misleading.]

Correlation, plots

This is another reason plots are so important.

Simple Linear Regression

Linear Regression, introduction

For all of the appropriately linear correlation examples above, it was easy to think about a line through the data and then ask, “How closely do the data fall onto that line?” That line through the data, however, has a name and a mathematical definition.

Linear Regression, example

The least squares line for Darwin’s finch data is plotted below in blue.

Code
ggplot(data=finch, aes(middletoelength, winglength)) +
    geom_point() +
    labs(x="Middle toe length (mm)",
         y="Wing length (mm)") +
    stat_smooth(method="lm", se=FALSE)

Linear Regression, idea

Simple linear regression decomposes the response variable \(Y\) into three components:

  • the intercept
    • the value \(Y\) takes on when \(X\) is equal to \(0\);
    • above, the length of a wing when the middle toe length is \(0\)
  • the slope
    • on the explanatory variable \(X\)
    • represents the increase in \(Y\) for a unit increase in \(X\);
    • above, some increase in wing length for every mm increase in the middle toe length
  • errors/residuals
    • some left over bits

Linear Regression, model

Given a response variable \(Y\) and an explanatory variable \(X\), the simple linear regression model is

\[Y = \underbrace{\beta_0}_{\text{intercept}} + \underbrace{\beta_1}_{\text{slope}} X + \underbrace{\epsilon}_{\text{errors}}\]

Simple Linear Regression, parameters

The population parameters \(\beta_0\) and \(\beta_1\) are estimated with \(\hat{\beta}_0\) and \(\hat{\beta}_1\). These estimates within the linear regression equation are written

\[\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\]

The predicted/fitted value of \(Y\) is a function of the estimates of the intercept and slope, dependent on some value of \(X\).

Linear Regression, residuals

Not every observation will fall on the least squares line. The difference between the observed value \(Y_i\) and the predicted value \(\hat{Y}_i\) at \(X_i\) is the \(i\)th residual

\[e_i = Y_i - \hat{Y}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)\]

Simple Linear Regression, residuals by picture

Some residuals will be positive and some negative.

Simple Linear Regression, best

The word “best” is carefully defined and not without debate. The most common definition of best means the line that minimizes the sum of the squared residuals. This idea is intuitive. We are to find the values of \(\beta_0\) and \(\beta_1\) that

  • take \(e_i = Y_i - \hat{Y}_i\), for all \(i\),
  • square each residual, \(e_i^2\), and
  • minimize \(\sum_{i=1}^n e_i^2\) (sketched numerically below).
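
Before turning to lm, here is a minimal numerical sketch of this minimization with optim; the helper function ssr is defined just for illustration, and the minimizing values should closely match the lm estimates below.

Code
# sum of squared residuals for candidate coefficients b = (intercept, slope)
ssr <- function(b, x, y) sum((y - (b[1] + b[2] * x))^2)

# numerically search for the intercept and slope minimizing ssr
optim(c(0, 0), ssr, x = finch$middletoelength, y = finch$winglength)$par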

Simple Linear Regression in R

In R we use the function lm to fit a linear regression.

Code
fit <- lm(winglength ~ middletoelength, data = finch)
## summary(fit) # RStudio
## yhat <- fitted(fit) # predicted/fitted values
## e <- residuals(fit) # residual values
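
As a quick check, the residuals equal the observed values minus the fitted values:

Code
yhat <- fitted(fit) # predicted/fitted values
e <- residuals(fit) # residual values

# observed minus fitted recovers the residuals, up to floating point error
all.equal(e, finch$winglength - yhat)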

Simple Linear Regression, interpretation

The coefficients from the model can be extracted with the function coefficients.

Code
(beta <- coefficients(fit))
    (Intercept) middletoelength 
      23.751547        2.494932 

Thus, our fitted linear model is written as

\[\hat{Y} = 23.75 + 2.49 X\]
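
For a hypothetical bird with a middle toe length of 10 mm (a made-up value for illustration), the fitted model predicts a wing length of about \(23.75 + 2.49 \times 10 \approx 48.7\) mm. The function predict computes the same thing:

Code
# predicted wing length at a hypothetical middle toe length of 10 mm
predict(fit, newdata = data.frame(middletoelength = 10))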

Example

Elmhurst College Data

We’ll consider a dataset named elmhurst. With these data, we might have the question, “How is family income related to the amount of gift aid a student receives from the college?”

Code
url <- "https://raw.githubusercontent.com/roualdes/data/master/elmhurst.csv"
elmhurst <- read.csv(url)
head(elmhurst)
  family_income gift_aid price_paid
1        92.922    21.72      14.28
2         0.250    27.47       8.53
3        53.092    27.75      14.25
4        50.200    27.22       8.78
5       137.613    18.00      24.00
6        47.957    18.52      23.48

Elmhurst College Data

Step 1?

Elmhurst College Data

Plot the data!

Code
cor(elmhurst$family_income, elmhurst$gift_aid)
[1] -0.4985561
Code
ggplot(data = elmhurst, aes(family_income, gift_aid)) +
    geom_point() +
    labs(x = "Family income ($1000)",
         y = "Gift aid ($1000)")

Elmhurst College Data

Code
ggplot(data = elmhurst, aes(family_income, gift_aid)) +
    geom_point() +
    labs(x = "Family income ($1000)",
         y = "Gift aid ($1000)") +
    stat_smooth(method = "lm", se = FALSE) # no standard errors

Elmhurst College Data

Code
elmReg <- lm(gift_aid ~ family_income, data=elmhurst)
## summary(elmReg) ## RStudio
## yhat <- fitted(elmReg)
## e <- residuals(elmReg)

Elmhurst College Data

Code
coef(elmReg)
  (Intercept) family_income 
  24.31932901   -0.04307165 

Elmhurst College Data

Our estimated linear model looks like

\[\widehat{aid} = 24.32 - 0.043 \times family\_income\]

How do we interpret this? Can we make causal connections from this model?
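
As a worked example, a family income of $100,000 corresponds to family_income = 100, since the units are $1000s, so the predicted gift aid is about \(24.32 - 0.043 \times 100 \approx 20.0\), or roughly $20,000. Equivalently, each additional $1000 of family income is associated with about $43 less in gift aid, on average.

Code
# predicted gift aid (in $1000s) at a family income of $100,000
predict(elmReg, newdata = data.frame(family_income = 100))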

Take Away

  • Correlation is a helpful summary statistic between two numerical variables.
  • Linear regression fits the “best” line through a scatterplot.
  • Best means the line that minimizes the sum of squared residuals.
  • The fitted line gives the expected value of the response for a given value of the explanatory variable.