Confidence Intervals

Edward A. Roualdes

Recap

recap: point estimates

Point estimates are random variables. Random variables follow shapes, called distributions. Therefore, point estimates follow distributions (and have shapes).

recap: Central Limit Theorem

  • The Central Limit Theorem says, “If our sample size is large enough, the sample mean will be approximately Normally distributed.”

  • The Central Limit Theorem says, if we have a collection of sample means, the shape (histogram) of this collection is basically Normal (unimodal and symmetric).

recap: CLT, so what

It doesn’t matter whether the data are from a Normal distribution or not: use the mean => Central Limit Theorem => we can approximate confidence intervals.
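
A minimal sketch of the CLT in action (the Exponential population and the seed below are arbitrary choices): even though each raw sample is skewed, a histogram of many sample means looks roughly Normal.

Code
# simulate many sample means from a skewed (Exponential) population
set.seed(42)                  # arbitrary seed
n <- 30                       # size of each sample
N <- 1000                     # number of repeated samples
xbars <- replicate(N, mean(rexp(n, rate = 1)))
hist(xbars, main = "Sample means are approximately Normal")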

Confidence Interval

Intuition

A better guess than a point estimate would include a range of likely values, an interval say, in which the population parameter of interest might live. If we want to be very certain we capture the population parameter, should we use a wider or a narrower interval? Garfield knows.

A Trade Off

Doesn’t it seem reasonable to trade a little bit of confidence for a lot less width?
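
To put rough numbers on the trade-off, a minimal sketch using Normal quantiles: each extra bit of confidence costs more standard errors of width.

Code
# Normal multipliers for different confidence levels:
# more confidence means more standard errors on each side, i.e. a wider interval
qnorm(0.950)   # 90% confidence: about 1.64
qnorm(0.975)   # 95% confidence: about 1.96
qnorm(0.995)   # 99% confidence: about 2.58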

Confidence plus Intervals

We combine these two ideas, intervals and confidence, to form a confidence interval. The idea comes from how much data lives between two numbers on a Normal distribution.
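
For example, roughly 95% of a standard Normal distribution lives between -1.96 and 1.96, which a one-line check in R confirms:

Code
# area under the standard Normal between -1.96 and 1.96
pnorm(1.96) - pnorm(-1.96)   # about 0.95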

Confidence Intervals, definition for sample mean

We build a confidence interval of the sample mean \(\bar{X}\) by adding and subtracting \(t\) standard errors \(s_{\bar{X}}\). We write this as

\[\bar{X} \pm t * s_{\bar{X}}\]

Note

The CLT is all about the Normal distribution, which is usually represented as \(z\). We use \(t\) here because we also have to estimate \(\sigma\) with the sample standard deviation \(s\).
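
A quick comparison of the two multipliers (a minimal sketch, with arbitrary degrees of freedom): the \(t\) quantile is a bit larger than the \(z\) quantile for small samples, and the two agree as the sample size grows.

Code
# 97.5th percentiles: z versus t
qnorm(0.975)         # z: about 1.96
qt(0.975, df = 9)    # t with n = 10: about 2.26
qt(0.975, df = 99)   # t with n = 100: about 1.98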

Example I

We will use a sample from Darwin’s finch data set [Swarth:1931]. Make a 95% confidence interval for the mean wing length of Darwin’s finches.

Code
# load data
url <- "https://raw.githubusercontent.com/roualdes/data/master/finches.csv"
finch <- read.csv(url)
# look at data in RStudio
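
If you are not in RStudio, a quick look at the data is shown below (a minimal sketch; it assumes only the winglength column used in the next step):

Code
head(finch)   # first few rows of the finch data
nrow(finch)   # number of finches in the sample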

Example, continued

95% confidence interval for Darwin’s finch data set, computed by hand.

Code
xbar <- mean(finch$winglength)   # sample mean
std <- sd(finch$winglength)      # sample standard deviation
n <- length(finch$winglength)    # sample size
t <- qt(0.975, n - 1)            # 97.5th percentile of the t distribution, for a 95% CI

\[\bar{x} \pm t * \frac{std}{\sqrt{n}}\]

Code
xbar - t * std / sqrt(n) # lower
[1] 70.58959
Code
xbar + t * std / sqrt(n) # upper
[1] 72.54276

Example, simplified

Code
t.test(finch$winglength, conf.level = 0.95)

    One Sample t-test

data:  finch$winglength
t = 146.27, df = 67, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 70.58959 72.54276
sample estimates:
mean of x 
 71.56618 
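
The t.test() output reports more than the interval itself. If only the bounds are needed, they can be pulled out of the returned object's conf.int component:

Code
t.test(finch$winglength, conf.level = 0.95)$conf.int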

Confidence Intervals, interpretation in context of the data

We are 95% confident that the true population mean wing length of Galapagos Island finches is between 70.6 and 72.5 millimeters.

Confidence Intervals, what?

Our best guess \(\bar{X}\) puts us near the population mean. In an effort to capture the true mean we use an interval, i.e. we cast a wide net. The width of the net comes from adding and subtracting the margin of error, \(t * s_{\bar{X}}\).
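
Reusing the objects from Example I, the margin of error and the net it casts are (a small recap in code):

Code
me <- t * std / sqrt(n)     # margin of error: t standard errors
c(xbar - me, xbar + me)     # same interval as before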

Confidence Intervals, interpretations

Despite the fact that we say, “We are \(X\)% confident that the true population mean of [insert context] is between [insert lower bound] and [insert upper bound],” we don’t mean that. What we mean is, of course, more complicated.

Confidence Intervals, literal translation

We mean: If we were to re-sample \(N\) times and create a confidence interval from each new sample, \(X\)% of those intervals would include the true population mean.

Confidence Intervals, translation

We mean: If we were to re-sample \(N\) times and create a confidence interval from each new sample, \(X\)% of those intervals would include the true population mean.

If we were to re-sample \(N\) times (we won’t) and create a confidence interval from each new sample (we won’t), \(X\)% of those intervals would (only with an infinite number of intervals) include the true population mean.

Confidence Intervals, translation in picture
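
The picture is many stacked intervals from repeated samples. A minimal simulation sketch of the same idea (the population mean, standard deviation, sample size, and seed below are arbitrary assumptions): draw many samples, build a 95% interval from each, and count how often the true mean is captured.

Code
set.seed(1)                   # arbitrary seed
true_mean <- 70               # assumed population mean
covers <- replicate(1000, {
  x <- rnorm(30, mean = true_mean, sd = 4)       # one new sample
  ci <- t.test(x, conf.level = 0.95)$conf.int    # its 95% confidence interval
  ci[1] <= true_mean & true_mean <= ci[2]        # did it capture the true mean?
})
mean(covers)                  # close to 0.95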

Example II

We will use a sample of published journal articles [Piwowar:2009]. Make a 99% confidence interval for the proportion of articles that share their data.

Code
# load data
url <- "https://raw.githubusercontent.com/roualdes/data/master/articles.csv"
article <- read.csv(url) # look at data in RStudio
t.test(article$is_data_shared, conf.level = 0.99)

    One Sample t-test

data:  article$is_data_shared
t = 18.684, df = 396, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
99 percent confidence interval:
 0.4036094 0.5334183
sample estimates:
mean of x 
0.4685139 
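
As a check, the by-hand calculation from Example I can be repeated on this data set (99% confidence, so the 99.5th percentile of the t distribution); it should reproduce the interval above.

Code
xbar <- mean(article$is_data_shared)   # sample proportion
std <- sd(article$is_data_shared)      # sample standard deviation
n <- length(article$is_data_shared)    # sample size
t <- qt(0.995, n - 1)                  # 99.5th percentile, for a 99% CI
c(xbar - t * std / sqrt(n), xbar + t * std / sqrt(n))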

Example II, interpreted in context of the data

We are 99% confident that the true population proportion of published journal articles that share their data is between 0.40 and 0.53.

Take Away

  • CLT enables confidence intervals based on the Normal distribution (here, via the t-distribution, since \(\sigma\) is estimated)
  • You will be responsible for
    • loading data into R
    • calculating confidence intervals with some % confidence
    • interpreting confidence intervals in the context of the data
    • understanding the literal translation of the interpretation