Hypothesis Testing

Edward A. Roualdes

Motivation

Motivating a Hypothesis Testing Framework

Suppose your buddy claims they shoot 90% from the free throw line. Since we are all statisticians, we a) don’t believe them, and b) insist upon testing their claim empirically. So we collect some data: they step up and start shooting. At what point do we reject their claim?

Motivating a Hypothesis Testing Framework

Implicitly, we used a logical framework to evaluate our buddy’s claim. Let’s unpack that framework and give it a name.

  • Establish two hypotheses

    • they shoot as well as they say they do, \(p = 0.9\)
    • they do not shoot as well as they claim, \(p < 0.9\).
  • Collect data

    • They took \(n\) shots from the line.
  • Analyze the data

    • Estimated \(p\) with \(\hat{p}\)
    • Determined the probability of observing \(\hat{p}\) if indeed they shoot as well as they claim.
  • Make a conclusion based on your analysis

    • Given their claim, if \(\hat{p}\) seems too unlikely, the claim is probably not true.
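This intuition can be sketched with an exact binomial test in R. The numbers below (15 made out of 20 attempts) are hypothetical, chosen only for illustration:

```r
# Hypothetical data: our buddy made 15 of 20 free throws
made <- 15
attempts <- 20

# One-sided exact binomial test of H0: p = 0.9 versus H1: p < 0.9
test <- binom.test(made, attempts, p = 0.9, alternative = "less")
test$p.value  # about 0.04: making only 15 of 20 is unlikely if p really is 0.9
```

At \(\alpha = 0.05\) we would reject our buddy’s claim.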

Hypothesis Test, another example

Suppose Giggle wants to test their browser Chime. They sample 1014 randomly selected websites and make a simple decision for each: the website was displayed properly on Chime or it was not. They found that 984 websites displayed correctly. Test whether Chime displays 99% of webpages correctly, and compare your conclusion to a confidence interval. Choose \(\alpha = 0.05\).

  • Establish two hypotheses
  • Collect data
  • Analyze data \(\Rightarrow\) calculate a z-score-like test statistic
  • Make decision

Hypothesis Test, another example continued

Hypothesis Test:

\[H_0: p = .99 \quad \text{ versus } \quad H_1: p \ne .99\]

Code
n <- 1014
x <- rep(0, n)
x[1:984] <- 1
t.test(x, mu = 0.99, conf.level = 0.99) # or

    One Sample t-test

data:  x
t = -3.679, df = 1013, p-value = 0.0002465
alternative hypothesis: true mean is not equal to 0.99
99 percent confidence interval:
 0.9566753 0.9841531
sample estimates:
mean of x 
0.9704142 
Code
# prop.test(sum(x), n, p = 0.99) # if you're being picky
# compare p-value to α = 0.05

Framework

Hypothesis Testing Framework

We call this framework hypothesis testing. Let’s rephrase hypothesis testing into the language of statistics.

  • State hypotheses, null \(H_0\) and alternative \(H_1\).
  • Collect data.
  • Calculate test statistic and p-value.
  • Conclude by comparing p-value to level of significance.
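As a sketch, the four steps can be carried out by hand in R. Here we reuse the Chime numbers from the earlier slide, with the “z-score-like” statistic for a proportion (a normal approximation, not the t-test R printed above):

```r
# 1. Hypotheses: H0: p = 0.99 versus H1: p != 0.99
p0 <- 0.99

# 2. Data: 984 of 1014 websites displayed correctly
n <- 1014
phat <- 984 / n

# 3. Test statistic and p-value (normal approximation)
z <- (phat - p0) / sqrt(p0 * (1 - p0) / n)
pvalue <- 2 * pnorm(-abs(z))

# 4. Conclude: reject H0 when p-value <= alpha
pvalue <= 0.05  # TRUE, so reject H0
```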

hypotheses

The null and alternative hypotheses generally follow some conventions.

  • \(H_0\) and \(H_1\) are statements about population parameters

  • \(H_0\) declares the parameter of interest to be equal to some value.

    • \(H_0: p = 0.99\) or \(H_0: \mu = 10\)
  • \(H_1\) declares the (same) parameter of interest to be less than, greater than, or not equal to the value in the null hypothesis \(H_0\); the researcher chooses one of these before collecting data, let alone conducting the test.

    • \(H_1: p \ne 0.99\) or \(H_1: \mu \; \{<,>,\ne\} \; 10\)

test statistic

  • A test statistic is a summary statistic that is particularly useful for evaluating a hypothesis test or calculating the p-value.
Code
t.test(x, mu = 0.99, conf.level = 0.99)

    One Sample t-test

data:  x
t = -3.679, df = 1013, p-value = 0.0002465
alternative hypothesis: true mean is not equal to 0.99
99 percent confidence interval:
 0.9566753 0.9841531
sample estimates:
mean of x 
0.9704142 
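The t statistic reported above can be recomputed directly from its definition, \(t = (\bar{x} - \mu_0) / (s / \sqrt{n})\):

```r
# Rebuild the Chime sample: 984 ones (displayed correctly) and 30 zeros
n <- 1014
x <- c(rep(1, 984), rep(0, n - 984))

# t statistic: (sample mean - hypothesized mean) / standard error
t_stat <- (mean(x) - 0.99) / (sd(x) / sqrt(n))
round(t_stat, 3)  # -3.679, matching the t.test() output
```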

p-value

  • The p-value is a probability, which weighs evidence against \(H_0\) (but never provides evidence in favor of anything, not even \(H_1\))

Note

It’s too easy to think you proved something when the p-value \(< \alpha\); however, statistics rarely proves anything. At best, statistics, via p-values, provides evidence against a specific claim, namely \(H_0\).

p-value

  • p-value. The probability of observing the test statistic we did, or something more extreme, assuming the null hypothesis is true.
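With a two-sided alternative, the p-value reported by t.test() is twice the tail area beyond the observed test statistic. Using the Chime test’s reported values:

```r
# Chime test: t = -3.679 with df = 1013
t_stat <- -3.679
df <- 1013

# Two-sided p-value: area in both tails beyond |t|
pvalue <- 2 * pt(-abs(t_stat), df)
pvalue  # about 0.00025, matching the t.test() output
```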

Note

Because the p-value is a probability, there are two sides to it. The other side results in an error in decision making.

p-value, in picture

For an alternative hypothesis of \(\ne\), namely \(H_1: \mu \ne 0\), the p-value is the area under the sampling distribution in both tails, beyond \(\pm\) the observed test statistic.

level of significance

  • We define the largest probability of an incorrect rejection that we are willing to tolerate. We call this value the level of significance, and give it the symbol \(\alpha\).

  • level of significance. The largest probability of incorrectly rejecting \(H_0\) when in fact \(H_0\) is true.
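A small simulation, assuming normally distributed data generated with \(H_0\) true, shows the long-run rejection rate matching \(\alpha\):

```r
# Simulate many samples where H0: mu = 0 is actually true,
# and record how often a t-test (wrongly) rejects at alpha = 0.05
set.seed(42)
alpha <- 0.05
pvals <- replicate(5000, t.test(rnorm(30), mu = 0)$p.value)
mean(pvals <= alpha)  # close to 0.05, the type 1 error rate
```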

Hypothesis Testing, conclusions via p-value

We evaluate hypotheses by comparing the p-value to the significance level, \(\alpha\). If our sample statistic (observed data) is so unusual with respect to the null hypothesis that it casts doubt on the validity of \(H_0\), then we have some evidence against \(H_0\) (but never confirmation of \(H_1\)).

evidence against H_0

Hypothesis testing never confirms anything; it provides some evidence against the null hypothesis.

When

  • p-value \(\leq \alpha \Rightarrow\) reject \(H_0\)
  • p-value \(> \alpha \Rightarrow\) fail to reject \(H_0\)

Note

Despite the overly cautious words above (some, evidence, probably), the world of statistics continues to use the strong phrases “reject” and “fail to reject”.

Last Example, finch beak height

Consider again Darwin’s finch data set. Formally test that the mean beak height is equal to 9mm versus not equal to, at \(\alpha = 0.05\).

Code
url <- "https://raw.githubusercontent.com/roualdes/data/master/finches.csv"
finch <- read.csv(url)

Last Example, finch beak height

  • \(H_0: \mu = 9\) versus \(H_1: \mu \ne 9\); \(\alpha = 0.05\).
  • calculate test statistic
  • calculate p-value from test statistic
  • conclude
Code
t.test(finch$beakheight, mu = 9, conf.level = 0.95)

    One Sample t-test

data:  finch$beakheight
t = 14.63, df = 67, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 9
95 percent confidence interval:
 12.33744 13.39197
sample estimates:
mean of x 
 12.86471 

Last Example, finch beak height

Given \(H_0: \mu = 9\) and \(H_1: \mu \ne 9\), we reject the null hypothesis because the p-value (\(< 2.2 \times 10^{-16}\)) is less than \(\alpha = 0.05\). There is sufficient evidence to say that the true population mean beak height of finches from the Galápagos Islands is not equal to 9mm.

Decision Making is hard

Hypothesis Testing, the decision

We conclude a formal hypothesis test by making a decision between \(H_0\) and \(H_1\). But did we decide correctly?

Hypothesis Testing, decision errors

We essentially made a choice, but our choice could be correct or incorrect.

                  fail to reject \(H_0\)   reject \(H_0\)
  \(H_0\) true    correct                  type 1 error
  \(H_1\) true    type 2 error             correct

Hypothesis Testing, errors

  • Type 1 Error
    • A type 1 error is rejecting the null hypothesis when \(H_0\) is actually true.
  • Type 2 Error
    • A type 2 error is failing to reject the null hypothesis when the alternative is actually true.
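A companion simulation, assuming data generated with the alternative actually true (here a true mean of 0.3 with samples of size 30, numbers chosen for illustration), estimates the type 2 error rate:

```r
# Simulate samples where H1 is true (true mean 0.3, not 0),
# and record how often the t-test fails to reject H0: mu = 0
set.seed(42)
pvals <- replicate(5000, t.test(rnorm(30, mean = 0.3), mu = 0)$p.value)
mean(pvals > 0.05)  # type 2 error rate, roughly 0.65 in this setting
```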

Hypothesis Testing, error trade offs

OS4 Example 5.26. How could we reduce the Type 2 Error rate in US courts? What influence would this have on the Type 1 Error rate?

                        found innocent   found guilty
  didn’t commit crime   correct          type 1 error
  did commit crime      type 2 error     correct
  • Lower Type 2 Error by convicting more people; change standards from “beyond a reasonable doubt” to “beyond a little doubt”.
  • What then of wrongful convictions, i.e. Type 1 Error rates?

Hypothesis Testing, Connection to CIs

Confidence intervals can be used to conclude hypothesis tests when

  • the alternative hypothesis is two sided, and
  • \(\alpha\) is equal to \(1 - \text{confidence level}\).

Back to the finch beak heights: what if we tested \(H_0: \mu = 13\) versus \(H_1: \mu \ne 13\)? Would we reject or fail to reject \(H_0\)?
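The duality can be checked with the Chime data: the 95% confidence interval excludes 0.99 exactly when the test at \(\alpha = 0.05\) rejects \(H_0: p = 0.99\).

```r
# Chime sample again: 984 of 1014 websites displayed correctly
x <- c(rep(1, 984), rep(0, 30))

tt <- t.test(x, mu = 0.99, conf.level = 0.95)
tt$conf.int        # the 95% interval lies entirely below 0.99
tt$p.value < 0.05  # TRUE: the test rejects, agreeing with the interval
```

For the finch question: since 13 falls inside the 95% interval (12.34, 13.39) reported earlier, we would fail to reject \(H_0: \mu = 13\).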

Take Away

Hypothesis testing

  • formalizes someone’s intuition and logical thought process
  • relies on new words: hypotheses, test statistic, p-value, level of significance
  • right or wrong, is ubiquitous in statistics
  • connects to confidence intervals in specific ways