Point estimates are random variables. Random variables follow distributions (they have shapes). Therefore, point estimates follow distributions (and have shapes).
The Central Limit Theorem says, “If our sample size is large enough, the sample mean will be approximately Normally distributed.”
Put differently, the Central Limit Theorem says: if we have a collection of sample means, the shape (histogram) of that collection is basically Normal (unimodal and symmetric).
It doesn’t matter whether the data come from a Normal distribution or not: use the mean => Central Limit Theorem => we can approximate confidence intervals.
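A quick simulation shows the idea. The sketch below is a minimal base R illustration, not part of the finch or article analyses; the exponential distribution and sample size are arbitrary choices.

```r
# Minimal CLT sketch: the raw data are skewed (exponential), but the
# histogram of many sample means is roughly unimodal and symmetric.
set.seed(1)
n <- 50                                    # sample size (arbitrary)

one_sample <- rexp(n, rate = 1)            # clearly not Normal
hist(one_sample, main = "One sample: skewed")

# 1,000 sample means, each from a fresh sample of size n
many_means <- replicate(1000, mean(rexp(n, rate = 1)))
hist(many_means, main = "1,000 sample means: roughly Normal")
```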
A better guess than a point estimate would include a range of likely values, an interval, say, in which the population parameter of interest might live. If we want to be very certain we capture the population parameter, should we use a wider interval or a narrower one? Garfield knows.
Doesn’t it seem reasonable to trade a little bit of confidence for a lot of width?
We combine these two ideas, intervals and confidence, to form a confidence interval. The idea comes from how much data lives between two numbers on a Normal distribution.
We build a confidence interval of the sample mean \(\bar{X}\) by adding and subtracting \(t\) standard errors \(s_{\bar{X}}\). We write this as
\[\bar{X} \pm t * s_{\bar{X}}\]
Note
The CLT is all about the Normal distribution, which is usually represented as \(z\). We use \(t\) here because we also have to estimate \(\sigma\) (with \(s\)).
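To see the price in practice, compare the critical values in R (the sample sizes below are arbitrary illustrations, not from either data set):

```r
# 95% critical values: Normal (z) vs. t, which is wider for small samples
qnorm(0.975)          # z ~ 1.96
qt(0.975, df = 9)     # t ~ 2.26 with n = 10 (df = n - 1)
qt(0.975, df = 99)    # t ~ 1.98; t approaches z as n grows
```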
We will use a sample from Darwin’s finch data set [Swarth:1931]. Make a 95% confidence interval for the mean wing length of Darwin’s finches.
Then make a 98% confidence interval for the same data.
\[\bar{x} \pm t * \frac{s}{\sqrt{n}}\]
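A by-hand sketch of this computation in R is below. The data frame `finch` and the column `wing_length` are assumed names for the finch sample, not objects defined in these notes.

```r
# Build the interval by hand: mean +/- t * s / sqrt(n)
x <- finch$wing_length        # assumed name for the wing length column
n <- length(x)

t_star <- qt(0.975, df = n - 1)       # 95% => 2.5% in each tail
moe    <- t_star * sd(x) / sqrt(n)    # margin of error
mean(x) + c(-1, 1) * moe              # lower and upper bounds

# For the 98% interval only the multiplier changes: qt(0.99, df = n - 1)
```

`t.test(x)` reports the same 95% interval directly, and `t.test(x, conf.level = 0.98)` the 98% one.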
We are 95% confident that the true population mean wing length of Galapagos Island finches is between 70.5 and 72.5 millimeters.
Our best guess \(\bar{X}\) puts us near the population mean. In an effort to capture the true mean we use an interval, i.e. we cast a wide net. The width of the net is formed by adding and subtracting the margin of error.
Despite the fact that we say, “We are \(X\)% confident that the true population mean of [insert context] is between [insert lower bound] and [insert upper bound],” we don’t mean that. What we mean is, of course, more complicated.
We mean: If we were to re-sample \(N\) times and create a confidence interval from each new sample, \(X\)% of those intervals would include the true population mean.
If we were to re-sample \(N\) times (we won’t) and create a confidence interval from each new sample (we won’t), \(X\)% of those intervals would (only with an infinite number of intervals) include the true population mean.
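A simulation sketch of that interpretation, with an invented population (the mean, standard deviation, and sample size below are arbitrary, not from the finch data):

```r
# Re-sample many times from a known population, build a 95% interval
# from each sample, and count how often the true mean is captured.
set.seed(1)
true_mean <- 71      # pretend population mean (invented for illustration)
n <- 30

covers <- replicate(1000, {
  x  <- rnorm(n, mean = true_mean, sd = 4)
  ci <- t.test(x)$conf.int               # default 95% interval
  ci[1] <= true_mean & true_mean <= ci[2]
})
mean(covers)    # close to 0.95, as advertised
```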
We will use a sample of published journal articles [Piwowar:2009]. Make a 99% confidence interval for the proportion of articles that share their data.
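The output shown next appears to come from a call along these lines (inferred from the printed output itself; because `is_data_shared` is coded 0/1, its mean is the sample proportion):

```r
t.test(article$is_data_shared, conf.level = 0.99)
```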
One Sample t-test
data: article$is_data_shared
t = 18.684, df = 396, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
99 percent confidence interval:
0.4036094 0.5334183
sample estimates:
mean of x
0.4685139
We are 99% confident that the true population proportion of published journal articles that share their data is between 0.40 and 0.53.