Central Limit Theorem

Edward A. Roualdes

Recap

recap: point estimates

Point estimates are random variables. Random variables follow shapes, called distributions. Therefore, point estimates follow distributions (and have shapes).

Central Limit Theorem

The Central Limit Theorem says, “If our sample size is large enough, the sample mean will be approximately Normally distributed.”
Put another way: if we have a collection of sample means, each computed from a large enough sample, the shape (histogram) of this collection is approximately Normal (unimodal and symmetric).

CLT, definition (in Statistics)

The Central Limit Theorem says, “Under certain conditions, the sampling distribution for the sample mean converges to the normal distribution as the sample size increases.”

CLT, definition (in symbols)

When \(n\) is sufficiently large,

\[\frac{\bar{X} - \mu}{\sigma_{\bar{X}}} \overset{\cdot}{\sim} N(0, 1).\]

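Here \(\sigma_{\bar{X}} = \sigma / \sqrt{n}\) is the standard deviation of the sample mean (the standard error). As \(n\) increases, the approximation improves.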
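A small simulation can illustrate the statement above. This is only a sketch under assumptions that are not in the slides: the data are drawn from an Exponential(1) population (so \(\mu = 1\) and \(\sigma = 1\)), and the helper name `standardized_means` is made up for illustration. It standardizes many sample means and checks how often they fall below the 0.95 quantile of \(N(0, 1)\); proportions near 0.95 suggest the approximation is working.

```r
set.seed(1)

# assumed population: Exponential(1), which has mean 1 and sd 1
mu <- 1
sigma <- 1

# hypothetical helper: simulate `reps` sample means of size n, then standardize
standardized_means <- function(n, reps = 10000) {
  xbars <- replicate(reps, mean(rexp(n, rate = 1)))
  (xbars - mu) / (sigma / sqrt(n))
}

ns <- c(5, 50, 500)

# proportion of standardized means below the N(0, 1) 0.95 quantile
below <- sapply(ns, function(n) mean(standardized_means(n) <= qnorm(0.95)))
names(below) <- ns
below
```

As `n` grows, the proportions should settle closer to 0.95.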

CLT’s Assumptions

The CLT holds under the following conditions:

the observations are independent,
the observations are identically distributed, \(X_i \sim \mathcal{F}\) for all \(i\), and
the variance is finite, \(\sigma^2 < \infty\).
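To see why the finite-variance condition matters, here is a hedged sketch comparing a distribution that satisfies it (standard Normal) with one that does not (standard Cauchy, chosen here purely as an illustration). The helper `spread_of_means` is a made-up name. It measures the spread of many sample means with the interquartile range, which exists even for the Cauchy.

```r
set.seed(2)
reps <- 2000
ns <- c(10, 100, 1000)

# hypothetical helper: interquartile range of `reps` sample means of size n
spread_of_means <- function(sampler, n) {
  IQR(replicate(reps, mean(sampler(n))))
}

normal_spread <- sapply(ns, function(n) spread_of_means(rnorm, n))
cauchy_spread <- sapply(ns, function(n) spread_of_means(rcauchy, n))

# the Normal row should shrink as n grows; the Cauchy row should not
rbind(n = ns, normal = normal_spread, cauchy = cauchy_spread)
```

The sample mean of Cauchy data does not concentrate as \(n\) grows, so the CLT does not apply to it.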

CLT, so what

The assumptions are mild, so they hold, at least approximately, in many real-world applications. As long as our point estimate is a sample mean, we can say, at least approximately, how the distribution of sample means is shaped, even if we only ever collect one sample and calculate one mean.
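The sketch below illustrates this point with simulated data. It assumes a Uniform(0, 10) population and a sample size of 40; neither comes from the slides. It compares the standard error estimated from a single sample with the standard deviation of many simulated sample means, which we could never collect in practice.

```r
set.seed(3)
n <- 40

# one observed sample, the usual situation in practice
x <- runif(n, min = 0, max = 10)
se_from_one_sample <- sd(x) / sqrt(n)

# many repeated samples, only possible in a simulation
many_means <- replicate(5000, mean(runif(n, min = 0, max = 10)))
sd_of_many_means <- sd(many_means)

c(one_sample = se_from_one_sample, many_samples = sd_of_many_means)
```

The two numbers should be close, which is why a single sample is enough to describe the spread of the sampling distribution.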

Not Necessarily Normal Data

It doesn’t matter whether the data come from a Normal distribution or not; if we use the mean, the Central Limit Theorem applies, provided the assumptions hold.
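As a rough illustration, the sketch below draws strongly right-skewed data and compares its skewness with the skewness of sample means computed from the same population. The Exponential(1) population, the sample size of 50, and the simple `skewness` helper are assumptions made for this example only.

```r
set.seed(4)

# simple sample skewness: near 0 for symmetric shapes, positive for right skew
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# raw data: strongly right-skewed, clearly not Normal
raw_data <- rexp(10000, rate = 1)

# sample means, each computed from n = 50 observations
sample_means <- replicate(5000, mean(rexp(50, rate = 1)))

c(raw = skewness(raw_data), means = skewness(sample_means))
```

Calling `hist(raw_data)` and `hist(sample_means)` shows the same contrast visually: the means look far more symmetric than the data they came from.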

Example

We will use a sample from Darwin’s finch data set [Swarth:1931]. Make a 98% confidence interval for the mean beak height of Darwin’s finches.

# load data
url <- "https://raw.githubusercontent.com/roualdes/data/master/finches.csv"
finch <- read.csv(url)
# look at data in RStudio

Example, continued

98% confidence interval for the mean beak height in Darwin’s finch data set.

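xbar <- mean(finch$beakheight)  # sample mean
std <- sd(finch$beakheight)     # sample standard deviation
n <- length(finch$beakheight)   # sample size
# 98% confidence leaves 1% in each tail, so use the 0.99 quantile
t <- qt(0.99, n - 1)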

\[\bar{x} \pm t \cdot \frac{\text{std}}{\sqrt{n}}\]

xbar - t * std / sqrt(n) # lower

[1] 12.23514

xbar + t * std / sqrt(n) # upper

[1] 13.49427
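The steps above can be wrapped into a small function that works for any confidence level. This is a sketch: `mean_ci` is a hypothetical helper name, and the data below are simulated stand-ins, not the finch measurements. The key detail is that a confidence level of `level` leaves `(1 - level) / 2` in each tail, which is why 98% confidence uses the 0.99 quantile.

```r
# hypothetical helper: t-based confidence interval for a mean
mean_ci <- function(x, level = 0.98) {
  n <- length(x)
  alpha <- 1 - level
  t_crit <- qt(1 - alpha / 2, df = n - 1)  # e.g. 0.99 quantile for level = 0.98
  margin <- t_crit * sd(x) / sqrt(n)
  c(lower = mean(x) - margin, upper = mean(x) + margin)
}

set.seed(5)
x <- rnorm(68, mean = 13, sd = 2)  # simulated stand-in, not the finch data
mean_ci(x, level = 0.98)
```

With the finch data loaded, `mean_ci(finch$beakheight, level = 0.98)` follows the same steps as the calculation above.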

Example, simplified

t.test(finch$beakheight, conf.level = 0.98)


    One Sample t-test

data:  finch$beakheight
t = 48.701, df = 67, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
98 percent confidence interval:
 12.23514 13.49427
sample estimates:
mean of x 
 12.86471

Take Away

Because math/statistics

we can approximate confidence intervals for sample means
we can approximate hypothesis tests (coming soon)…
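the t distribution shows up more often than the Normal distribution when applying the CLT, because we estimate \(\sigma\) from the data
- they’re similar in spirit, but the t distribution has fatter tails
for a long time, this was the only way to do statistics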
we’ll keep referring back to the CLT for the rest of the class
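To make the "fatter tails" point concrete, the sketch below compares critical values for a 98% interval from the t distribution with the Normal critical value. The degrees of freedom are chosen for illustration; 67 matches the `df` reported by `t.test` above.

```r
z_crit <- qnorm(0.99)          # Normal critical value for a 98% interval
dfs <- c(5, 30, 67, 1000)      # illustrative degrees of freedom
t_crit <- qt(0.99, df = dfs)   # t critical values for the same interval

data.frame(df = dfs, t_crit = t_crit, z_crit = z_crit, ratio = t_crit / z_crit)
```

The t critical values are always larger than the Normal one, so t-based intervals are wider, and the gap shrinks as the degrees of freedom grow.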