ANOVA

Edward A. Roualdes

Recap

recap: point estimates

Point estimates are random variables. Random variables follow shapes, called distributions. Therefore, point estimates follow distributions (and have shapes).

recap: central limit theorem

The Central Limit Theorem says, “If our sample size is large enough, the sample mean will be approximately Normally distributed.”

recap: confidence intervals

From the CLT, we can construct approximate confidence intervals.

recap: hypothesis tests

From the CLT, we can approximate the area in the tails of a distribution (p-values).

recap: two sample t-tests

The two sample \(t\)-test compares the means of two groups with the hypotheses

\[ \begin{align*} H_0: \quad & \mu_1 = \mu_2 \\ H_A: \quad & \mu_1 \ne \mu_2. \\ \end{align*} \]

R’s function t.test() did all the hard work for us.
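
As a reminder of the mechanics, here is a minimal sketch on made-up data (x and y below are hypothetical samples, not one of the course data sets).

x <- rnorm(30, mean = 5)  # hypothetical sample from group 1
y <- rnorm(30, mean = 6)  # hypothetical sample from group 2
t.test(x, y)              # tests H0: mu_1 = mu_2 against H_A: mu_1 != mu_2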

Analysis of Variance

Analysis of Variance

If there were three or more groups, the two sample \(t\)-test would not work. We could force the test on the data by comparing two groups at a time, but doing so inflates the Type I error rate (see the multiple comparisons section below). We thus require a new statistical method, analysis of variance.

  • Analysis of variance (ANOVA) uses a single hypothesis to check whether the means across two or more groups are equal.

ANOVA, hypothesis

The ANOVA hypothesis test for \(k\) groups is

\[ \begin{align*} H_0: \quad & \mu_1 = \mu_2 = \mu_3 = \ldots = \mu_k \\ H_A: \quad & \text{at least one mean is different.} \end{align*} \]

Note

ANOVA tests equality of means across groups, despite its name.

ANOVA, hypothesis examples

With ANOVA you can compare the means by groups for many different data sets.

  • mean batting average by position,
  • mean movie budget by genre (or year),
  • mean CO2 intake by treatment, location, or concentration,
  • mean REM sleep by conservation status,
  • mean birth/body weight by family.

ANOVA, picture

Let’s visualize what is going on with ANOVA. It starts with box plots by groups – draw more pictures on board.

ANOVA, intuition

Analysis of variance tells us about means by groups, despite its name. Large variation amongst the groups relative to small variation within the groups indicates different population means.

ANOVA, intuition in picture

What do you think of these means? Consider the variation within and amongst the groups.

ANOVA, intuition via picture

What do you think of these means? Consider the variation within and amongst the groups.

ANOVA, test statistic

ANOVA's test statistic is a single ratio built from two numbers: the variation amongst (between) groups and the variation within groups. These two numbers are generally referred to as mean square values; a small sketch of the calculation follows the list below.

  • mean squared amongst groups
    • MSG is a strictly positive measure of the variation across all groups.
  • mean squared error/residuals
    • MSE is a strictly positive measure of the variation within groups.
  • Test statistic
    • \(F = \frac{MSG}{MSE}\)
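
To make these two mean squares concrete, here is a minimal sketch that computes MSG, MSE, and \(F\) by hand on a made-up data frame (toy and its columns are hypothetical); anova() reports matching values when run on the same data.

library(dplyr)

# hypothetical toy data: three groups of ten observations each
toy <- data.frame(group = rep(c("a", "b", "c"), each = 10),
                  y = rnorm(30))

k <- n_distinct(toy$group)   # number of groups
n_total <- nrow(toy)         # total number of observations
grand_mean <- mean(toy$y)

by_group <- toy %>%
    group_by(group) %>%
    summarise(ni = n(), mi = mean(y), ssi = sum((y - mean(y))^2))

MSG <- sum(by_group$ni * (by_group$mi - grand_mean)^2) / (k - 1) # amongst groups
MSE <- sum(by_group$ssi) / (n_total - k)                         # within groups
MSG / MSE                        # the F statistic

anova(lm(y ~ group, data = toy)) # the same F value appears in this table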

First Example

Baseball example, hypotheses

Are baseball players paid differently, on average, by position?

\[ \begin{align*} H_0: \quad & \mu_{catcher} = \mu_{dh} = \mu_{first} = \ldots = \mu_{third} \\ H_A: \quad & \text{at least one mean salary is different.} \end{align*} \]

with \(\alpha = 0.05\).

Baseball example, plot

Load the data and make a plot.

url <- "https://raw.githubusercontent.com/roualdes/data/master/mlb.csv"
mlb <- read.csv(url)
suppressMessages(library(tidyverse))
ggplot(data = mlb, aes(position, salary)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 45, hjust=1))

Baseball example, R code

The R code to run ANOVA

linear_model <- lm(salary ~ position, data = mlb)
anova(linear_model)

Baseball example, R code details

The tilde \(\sim\) is read as "predict the left hand side with the right hand side." Hence, we read the following as: predict salary by the different levels of position.

linear_model <- lm(salary ~ position, data = mlb)

Baseball example, R output

The degrees of freedom, F-statistic, and p-value are the most important pieces of information to extract from an ANOVA table.

anova(linear_model)
Analysis of Variance Table

Response: salary
           Df     Sum Sq  Mean Sq F value    Pr(>F)    
position    8 6.0975e+08 76219146  3.9307 0.0001422 ***
Residuals 819 1.5881e+10 19390502                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
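
The table above is itself an R object (a data frame), so, if you like, these pieces can be pulled out programmatically. A small optional sketch (the name atab is arbitrary):

atab <- anova(linear_model)  # store the table above
atab$Df                      # degrees of freedom
atab$`F value`[1]            # F statistic for position
atab$`Pr(>F)`[1]             # p-value for position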

Baseball example, conclusion

Because the p-value \(\approx 10^{-4} < \alpha = 0.05\), we reject the null in favor of the alternative. There is sufficient evidence to claim that the population mean salary of baseball players varies by position.

Multiple Comparisons

Multiple Comparisons, the problem

Say you’ve got some data and you want to test equality of the means. What to do? ANOVA!

ggplot(data = mlb, aes(position, salary)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 45, hjust=1))

Multiple Comparisons, the problem

What not to do? Immediately compare all pairwise combinations of the \(k = 9\) groups. That would result in 36 tests and an inflated Type I error rate.
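
The 36 comes from counting the distinct pairs amongst \(k = 9\) groups, \(\binom{9}{2}\), which R confirms directly.

choose(9, 2) # number of pairwise comparisons amongst 9 groups
[1] 36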

Multiple Comparisons, how bad a problem

Suppose you chose a level of significance of \(\alpha = 0.05\). What’s the probability of observing at least one significant result just by chance?

Treating the 36 tests as independent, this is just a binomial distribution, \(X \sim\) Binomial\((36, 0.05)\), and we want to know \(P(X \ge 1)\).

1 - pbinom(0, 36, 0.05) # P(X >= 1) = 1 - P(X = 0)
[1] 0.8422208

Multiple Comparisons, a solution

Many solutions exist. A simple one, in R, is Tukey’s honest significant difference (HSD). Only use this if you have first run ANOVA and rejected \(H_0\).

model <- lm(salary ~ position, data = mlb) # if reject H0
TukeyHSD(aov(model)) # do in R

Last Example

Carnivora Brain Weights by Family, plot

Consider brain weights of the Families of the order Carnivora. Let’s compare mean brain weights by Family.

suppressMessages({library(tidyverse)
  library(ape)
  data(carnivora)})
ggplot(data = carnivora, aes(Family, SB)) +
    geom_boxplot() +
    labs(y="Brain weights (g)") +
    theme(axis.text.x=element_text(angle=45, hjust=1))

Carnivora Brain Weights by Family, not all groups

Since some groups don’t have enough data, let’s remove them.

carnivs <- filter(carnivora,
                  !(Family %in% c("Ailuridae",
                                  "Procyonidae",
                                  "Viverridae")))

Carnivora Brain Weights by Family, hypotheses

\[ \begin{align*} H_0: \quad & \mu_c = \mu_f = \mu_h = \mu_m = \mu_u \\ H_A: \quad & \text{at least one mean is different.} \end{align*} \]

with \(\alpha = 0.01\).

Carnivora Brain Weights by Family, R code

mod <- lm(SB ~ Family, data = carnivs)
anova(mod)
Analysis of Variance Table

Response: SB
          Df Sum Sq Mean Sq F value    Pr(>F)    
Family     4 354999   88750  37.333 < 2.2e-16 ***
Residuals 70 166406    2377                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Carnivora Brain Weights by Family, conclusion

Because the p-value is tiny, we reject \(H_0\) in favor of the alternative. There is sufficient evidence to say that at least one mean is different from the rest.

Carnivora Brain Weights by Family, post-hoc analysis

TukeyHSD(aov(mod), conf.level = 0.99) # do in R

Take Away

  • ANOVA breaks up the response variable into groups, chosen by you.
  • ANOVA essentially compares group means using two measures of variation
    • the average error within groups (MSE) and the average error amongst the groups (MSG)
  • Understand the general ANOVA table
  • Conclusions from ANOVA tables are easy
    • compare p-value to level of significance
  • Interpretation of such conclusions is trickier; it takes practice