Point estimates are random variables. Random variables follow distributions, and distributions have shapes. Therefore, point estimates follow distributions (with shapes of their own), called sampling distributions.
The standard deviation of a sampling distribution is called a standard error. The standard error shrinks with the square root of the sample size; for the sample mean, \(SE = \sigma/\sqrt{n}\).
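A quick simulation sketch of this fact, using an arbitrary Normal population with \(\sigma = 10\) (an illustrative choice):

```r
# Simulated standard error of the sample mean versus sigma / sqrt(n):
# the two agree, and both shrink as n grows.
set.seed(1)
sigma <- 10
for (n in c(25, 100, 400)) {
  xbars <- replicate(5000, mean(rnorm(n, mean = 50, sd = sigma)))
  cat("n =", n,
      " simulated SE =", round(sd(xbars), 3),
      " sigma/sqrt(n) =", round(sigma / sqrt(n), 3), "\n")
}
```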
The Central Limit Theorem says, “If our sample size is large enough, the sample mean will be approximately Normally distributed.”
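A small simulation sketch of this statement, drawing repeated samples of size 50 from a skewed Exponential population (an illustrative choice):

```r
# Even though the population is skewed, the sample means pile up in a
# roughly Normal (bell-shaped) histogram.
set.seed(2)
means <- replicate(5000, mean(rexp(50, rate = 1)))
hist(means, breaks = 40,
     main = "Sampling distribution of the sample mean (n = 50)")
```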
From the CLT, we can approximate confidence intervals from an approximate sampling distribution.
From the CLT, we can approximate area in tails (p-values) from an approximate sampling distribution.
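A sketch of both ideas in R, using a made-up sample and a hypothetical null value \(\mu_0 = 50\):

```r
# CLT-based 95% confidence interval and two-sided p-value for a mean.
set.seed(3)
x    <- rnorm(100, mean = 52, sd = 10)   # illustrative sample
xbar <- mean(x)
se   <- sd(x) / sqrt(length(x))          # standard error of the mean

# Approximate 95% confidence interval from the Normal approximation
ci <- xbar + c(-1, 1) * qnorm(0.975) * se

# Approximate two-sided p-value for H0: mu = 50 (area in the tails)
z    <- (xbar - 50) / se
pval <- 2 * pnorm(-abs(z))

ci
pval
```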
ANOVA compares the means of a numerical response variable across the levels of one categorical explanatory variable.
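A minimal sketch of a one-way ANOVA in R, using the built-in iris data as a stand-in for any numerical response and single categorical explanatory variable:

```r
# Compare mean petal length across the three iris species.
fit_aov <- aov(Petal.Length ~ Species, data = iris)
summary(fit_aov)   # F statistic and p-value for differences in means
```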
Scatterplots are a graphical description of two numerical variables; consider Darwin’s finch data.
Some keywords to describe scatterplots: direction (positive or negative), form (linear or nonlinear), strength (weak to strong), and unusual observations.
Correlation, often denoted by \(R\), is the numeric analogue of the words above: it measures the direction and strength of the linear relationship between two numerical variables.
Notes on correlation
In R we should just use the function cor.
This is another reason plots are so important.
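For instance, Anscombe's quartet (built into R) has four x-y pairs with nearly identical correlations but very different scatterplots:

```r
# Nearly identical correlations (about 0.82 in each panel)...
with(anscombe, round(c(cor(x1, y1), cor(x2, y2), cor(x3, y3), cor(x4, y4)), 3))

# ...but very different pictures.
op <- par(mfrow = c(2, 2))
with(anscombe, {
  plot(x1, y1); plot(x2, y2); plot(x3, y3); plot(x4, y4)
})
par(op)
```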
For all of the appropriately linear correlation examples above, it was easy to think about a line through the data and then ask, “How closely do the data fall onto that line?” That line through the data, however, has a name and a mathematical definition.
The least squares line for Darwin’s finch data is plotted below in blue.
Simple linear regression decomposes the response variable \(Y\) into three components:
Given a response variable \(Y\) and an explanatory variable \(X\), the simple linear regression model is
\[Y = \underbrace{\beta_0}_{\text{intercept}} + \underbrace{\beta_1}_{\text{slope}} X + \underbrace{\epsilon}_{\text{errors}}\]
The population parameters \(\beta_0\) and \(\beta_1\) are estimated with \(\hat{\beta}_0\) and \(\hat{\beta}_1\). These estimates within the linear regression equation are written
\[\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\]
The predicted/fitted value of \(Y\) is a function of the estimates of the intercept and slope, dependent on some value of \(X\).
Not every observation will fall on the least squares line. The difference between the observed value \(Y_i\) and the predicted value \(\hat{Y}_i\) at \(X_i\) is the \(i\)th residual
\[e_i = Y_i - \hat{Y}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)\]
Some residuals will be positive and some negative.
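A small sketch of the residual calculation, using R's built-in cars data as a stand-in and a candidate line chosen by eye:

```r
# Residuals e_i = y_i - (b0 + b1 * x_i) for a candidate line.
x <- cars$speed
y <- cars$dist
b0 <- -17; b1 <- 4        # illustrative intercept and slope, not fitted values
y_hat <- b0 + b1 * x      # predicted values at each x
e <- y - y_hat            # residuals: observed minus predicted
summary(e)                # some residuals are positive, some negative
```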
The word “best” is cleverly defined and not without debate. The most common definition of best is the line that minimizes the sum of the squared residuals. This idea is intuitive. We are to find the values of \(\beta_0\) and \(\beta_1\) that minimize
\[\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(Y_i - \beta_0 - \beta_1 X_i\right)^2.\]
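As a sketch, we can minimize that sum numerically (again using the cars data); the result is approximately the least squares estimates:

```r
# Sum of squared residuals as a function of the candidate (b0, b1).
ssr <- function(b) sum((cars$dist - (b[1] + b[2] * cars$speed))^2)

# Numerically search for the minimizing intercept and slope.
optim(c(0, 0), ssr)$par
```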
In R we use the function lm to fit linear regression.
The coefficients from the model can be extracted with the function coefficients.
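A minimal sketch of that workflow, assuming the finch measurements sit in a data frame named finch with columns parent and offspring (hypothetical names):

```r
# Fit the least squares line and pull out the estimates.
fit <- lm(offspring ~ parent, data = finch)   # hypothetical data frame/columns
coefficients(fit)                             # intercept and slope estimates

# Fitted values and residuals come along for free.
head(fitted(fit))
head(residuals(fit))
```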
Thus, our fitted linear model is written as
\[\hat{Y} = 23.75 + 2.49 X\]
We’ll consider a dataset named elmhurst. With these data, we might have the question, “How is family income related to the amount of gift aid a student receives from the college?”
Step 1?
Plot the data!
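A sketch of this first step, assuming the elmhurst data come from the openintro package with columns family_income and gift_aid (both in $1000s):

```r
# Plot gift aid against family income, then compute their correlation.
library(openintro)
plot(gift_aid ~ family_income, data = elmhurst,
     xlab = "Family income ($1000s)", ylab = "Gift aid ($1000s)")
cor(elmhurst$family_income, elmhurst$gift_aid)
```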
The correlation between family income and gift aid is \(-0.4985561\). Our estimated linear model looks like
\[\widehat{aid} = 23.75 + 2.49 \times family\_income\]
How do we interpret this? Can we make causal connections from this model?