Simple Linear Regression

Edward A. Roualdes

Recap

recap: point estimates

Point estimates are random variables. Random variables follow shapes, called distributions. Therefore, point estimates follow distributions (and have their own shapes), named sampling distributions.

recap: standard errors

The standard deviation of a sampling distribution is called a standard error. The standard error shrinks with the square root of the sample size.
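
A quick simulation sketch, with made-up Normal data, makes this concrete: quadrupling the sample size roughly halves the standard error.

Code
set.seed(42)
B <- 5000 # number of simulated samples

# the standard deviation of many sample means estimates the standard error
se_n25  <- sd(replicate(B, mean(rnorm(25,  mean = 0, sd = 10))))
se_n100 <- sd(replicate(B, mean(rnorm(100, mean = 0, sd = 10))))

se_n25  # close to 10 / sqrt(25)  = 2
se_n100 # close to 10 / sqrt(100) = 1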

recap: central limit theorem

The Central Limit Theorem says, “If our sample size is large enough, the sample mean will be approximately Normally distributed.”

recap: confidence intervals

From the CLT, we can approximate confidence intervals from an approximate sampling distribution.
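
A sketch with made-up data: an approximate 95% confidence interval for a mean, built from the CLT.

Code
set.seed(42)
x <- rnorm(50, mean = 5, sd = 2) # a made-up sample

xbar <- mean(x)
se <- sd(x) / sqrt(length(x))

# approximate 95% confidence interval, justified by the CLT
xbar + c(-1, 1) * qnorm(0.975) * se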

recap: hypothesis tests

From the CLT, we can approximate area in tails (p-values) from an approximate sampling distribution.
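
A two-sided p-value is twice the area in one tail of the approximate (standard Normal) sampling distribution; a sketch with a made-up test statistic:

Code
z <- 1.8 # a made-up standardized test statistic

# two-sided p-value: area in both tails of the standard Normal
2 * pnorm(-abs(z))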

recap: ANOVA

ANOVA compares the means of a numerical response variable across the levels of one categorical variable.
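
For reference, a minimal sketch using R’s aov with made-up data (the data frame d and its columns are hypothetical):

Code
set.seed(42)
d <- data.frame(
    group = rep(c("a", "b", "c"), each = 20),
    value = rnorm(60, mean = rep(c(5, 6, 7), each = 20))
)

# one numerical response broken up by one categorical variable
summary(aov(value ~ group, data = d))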

recap: scatter plot

Scatterplots are a graphical description of two numerical variables; consider Darwin’s finch data.

Code
url <- "https://raw.githubusercontent.com/roualdes/data/master/finches.csv"
finch <- read.csv(url)
ggplot(data=finch, aes(middletoelength, winglength)) +
    geom_point() +
    labs(xlab="Middle toe length (mm)",
         ylab="Wing length (mm)")

recap: scatter plot, take 2

Some keywords to describe scatterplots

  • associated or not.
  • direction: positive or negative association.
  • structure: linear or nonlinear.

Correlation, definition

Correlation is often denoted by \(R\) and is the numeric analogue of the words above.

  • Correlation describes the strength of the linear relationship between two variables, and always takes values between -1 and 1.
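
For paired observations \((x_i, y_i)\) with sample means \(\bar{x}, \bar{y}\) and sample standard deviations \(s_x, s_y\), one common way to write this definition is

\[R = \frac{1}{n - 1} \sum_{i=1}^n \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)\]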

Correlation, notes

Notes on correlation

  • A more accurate name is the Pearson correlation coefficient.
  • Describes linear relationships only.
  • Bounded by -1 and 1.
  • The value \(0\) denotes no association.
  • The sign dictates directionality.

Correlation, R code

In R we should just use the function cor.

Code
cor(finch$middletoelength, finch$winglength)
[1] 0.7034241
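
As a sanity check, the same value falls out of the definition above computed by hand:

Code
x <- finch$middletoelength
y <- finch$winglength

# correlation from its definition; matches cor(x, y)
sum((x - mean(x)) / sd(x) * (y - mean(y)) / sd(y)) / (length(x) - 1)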

Correlation, examples

[A series of scatterplots illustrating correlations of varying strength and direction.]

Correlation, (watch out) examples

[Scatterplots where the correlation coefficient can be misleading.]

Correlation, plots

This is another reason plots are so important.

Simple Linear Regression

Linear Regression, introduction

For all of the appropriately linear correlation examples above, it was easy to think about a line through the data and then ask, “How closely do the data fall onto that line?” That line through the data, however, has a name and a mathematical definition.

Linear Regression, example

The least squares line for Darwin’s finch data is plotted below in blue.

Code
ggplot(data=finch, aes(middletoelength, winglength)) +
    geom_point() +
    labs(x="Middle toe length (mm)",
         y="Wing length (mm)") +
    stat_smooth(method="lm", se=FALSE)

Linear Regression, idea

Simple linear regression decomposes the response variable \(Y\) into three components:

  • the intercept
    • the value \(Y\) takes on when \(X\) is equal to \(0\);
    • above, the length of a wing when the middle toe length is \(0\)
  • the slope
    • on the explanatory variable \(X\)
    • represents the increase in \(Y\) for a unit increase in \(X\);
    • above, some increase in wing length for every mm increase in the middle toe length
  • errors/residuals
    • some left over bits

Linear Regression, model

Given a response variable \(Y\) and an explanatory variable \(X\), the simple linear regression model is

\[Y = \underbrace{\beta_0}_{\text{intercept}} + \underbrace{\beta_1}_{\text{slope}} X + \underbrace{\epsilon}_{\text{errors}}\]

Simple Linear Regression, parameters

The population parameters \(\beta_0\) and \(\beta_1\) are estimated with \(\hat{\beta}_0\) and \(\hat{\beta}_1\). These estimates within the linear regression equation are written

\[\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\]

The predicted/fitted value of \(Y\) is a function of the estimates of the intercept and slope, dependent on some value of \(X\).

Linear Regression, residuals

Not every observation will fall on the least squares line. The difference between the observed value \(Y_i\) and the predicted value \(\hat{Y}_i\) at \(X_i\) is the \(i\)th residual

\[e_i = Y_i - \hat{Y}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)\]

Simple Linear Regression, residuals by picture

Some residuals will be positive and some negative.

Simple Linear Regression, best

The word “best” is carefully defined and not without debate. The most common definition of best means the line that minimizes the sum of the squared residuals. This idea is intuitive. We are to find the values of \(\beta_0\) and \(\beta_1\) that

  • take \(e_i = Y_i - \hat{Y}_i\), for all \(i\),
  • square each residual, \(e_i^2\), and
  • minimize \(\sum_{i=1}^n e_i^2\) (sketched numerically below).
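
Before turning to lm, here is a minimal numerical sketch of this minimization with optim; the helper function ssr is defined just for illustration, and the minimizing values should closely match the lm estimates below.

Code
# sum of squared residuals for candidate coefficients b = (intercept, slope)
ssr <- function(b, x, y) sum((y - (b[1] + b[2] * x))^2)

# numerically search for the intercept and slope minimizing ssr
optim(c(0, 0), ssr, x = finch$middletoelength, y = finch$winglength)$par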

Simple Linear Regression in R

In R we use the function lm to fit a linear regression.

Code
fit <- lm(winglength ~ middletoelength, data = finch)
## summary(fit) # RStudio
## yhat <- fitted(fit) # predicted/fitted values
## e <- residuals(fit) # residual values
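
As a quick check, the residuals equal the observed values minus the fitted values:

Code
yhat <- fitted(fit) # predicted/fitted values
e <- residuals(fit) # residual values

# observed minus fitted recovers the residuals, up to floating point error
all.equal(e, finch$winglength - yhat)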

Simple Linear Regression, interpretation

The coefficients from the model can be extracted with the function coefficients.

Code
(beta <- coefficients(fit))
    (Intercept) middletoelength 
      23.751547        2.494932 

Thus, our fitted linear model is written as

\[\hat{Y} = 23.75 + 2.49 X\]
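
For a hypothetical bird with a middle toe length of 10 mm (a made-up value for illustration), the fitted model predicts a wing length of about \(23.75 + 2.49 \times 10 \approx 48.7\) mm. The function predict computes the same thing:

Code
# predicted wing length at a hypothetical middle toe length of 10 mm
predict(fit, newdata = data.frame(middletoelength = 10))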

Example

Elmhurst College Data

We’ll consider a dataset named elmhurst. With these data, we might have the question, “How is family income related to the amount of gift aid a student receives from the college?”

Code
url <- "https://raw.githubusercontent.com/roualdes/data/master/elmhurst.csv"
elmhurst <- read.csv(url)
head(elmhurst)
  family_income gift_aid price_paid
1        92.922    21.72      14.28
2         0.250    27.47       8.53
3        53.092    27.75      14.25
4        50.200    27.22       8.78
5       137.613    18.00      24.00
6        47.957    18.52      23.48

Elmhurst College Data

Step 1?

Elmhurst College Data

Plot the data!

Code
cor(elmhurst$family_income, elmhurst$gift_aid)
[1] -0.4985561
Code
ggplot(data = elmhurst, aes(family_income, gift_aid)) +
    geom_point() +
    labs(x = "Family income ($1000)",
         y = "Gift aid ($1000)")

Elmhurst College Data

Code
ggplot(data = elmhurst, aes(family_income, gift_aid)) +
    geom_point() +
    labs(x = "Family income ($1000)",
         y = "Gift aid ($1000)") +
    stat_smooth(method = "lm", se = FALSE) # no standard errors

Elmhurst College Data

Code
elmReg <- lm(gift_aid ~ family_income, data=elmhurst)
## summary(elmReg) ## RStudio
## yhat <- fitted(elmReg)
## e <- residuals(elmReg)

Elmhurst College Data

Code
coef(elmReg)
  (Intercept) family_income 
  24.31932901   -0.04307165 

Elmhurst College Data

Our estimated linear model looks like

\[\widehat{aid} = 24.32 - 0.043 \times family\_income\]

How do we interpret this? Can we make causal connections from this model?
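
As a worked example, a family income of $100,000 corresponds to family_income = 100, since the units are $1000s, so the predicted gift aid is about \(24.32 - 0.043 \times 100 \approx 20.0\), or roughly $20,000. Equivalently, each additional $1000 of family income is associated with about $43 less in gift aid, on average.

Code
# predicted gift aid (in $1000s) at a family income of $100,000
predict(elmReg, newdata = data.frame(family_income = 100))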

Take Away

  • Correlation is a helpful summary statistic between two numerical variables.
  • Linear regression fits the “best” line through a scatterplot.
  • Best means the line that minimizes the sum of squared residuals.
  • The fitted line gives the expected value of the response for a given value of the explanatory variable.