Summaries By Group

Edward A. Roualdes

Basic Idea

Means by Group

Consider the ChickWeight dataset, which ships with R's datasets package.

Code
suppressMessages(library(datasets))
# random 6 rows of ChickWeight
ChickWeight[sample(1:nrow(ChickWeight), 6),]
    weight Time Chick Diet
159     79    6    14    1
126    168   12    11    1
269     40    0    25    2
215     77   12    20    1
252    135   14    23    2
10     171   18     1    1

Means by Group

Mean chick weight by diet.

Code
suppressMessages(library(dplyr))
summarise(group_by(ChickWeight, Diet),
          Mean=mean(weight))
# A tibble: 4 × 2
  Diet   Mean
  <fct> <dbl>
1 1      103.
2 2      123.
3 3      143.
4 4      135.

Chicken Weight by Diet, means and more

Mean, standard deviation, and first and third quartiles of chick weight by diet.

Code
(df <- summarise(group_by(ChickWeight, Diet),
                 Mn = mean(weight),
                 StDev = sd(weight),
                 q1 = quantile(weight, .25),
                 q3 = quantile(weight, .75)))
# A tibble: 4 × 5
  Diet     Mn StDev    q1    q3
  <fct> <dbl> <dbl> <dbl> <dbl>
1 1      103.  56.7  57.8  136.
2 2      123.  71.6  65.5  163 
3 3      143.  86.5  67.5  199.
4 4      135.  68.8  71.2  185.

Workhorses

Group Summary Statistics, the workhorses

So much of statistics relies on three functions and one sentence enhancer.

  • group_by – groups a data frame by one or more variables
  • summarise – collapses many observations into a few summary values
  • mutate – creates new variables from the existing variables of a data frame
  • %>% – makes code read (almost) like English
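
As a small sketch of what the sentence enhancer buys us, the nested call used earlier and its piped form can be written side by side. The object names nested and piped are made up for this illustration; the pipe passes its left-hand side as the first argument of the function on its right, so both forms should give the same table.

```r
library(dplyr)

# Nested form: read from the inside out
nested <- summarise(group_by(ChickWeight, Diet),
                    Mean = mean(weight))

# Piped form: read top to bottom, "take ChickWeight, then group, then summarise"
piped <- ChickWeight %>%
    group_by(Diet) %>%
    summarise(Mean = mean(weight))

# Compare the two sets of group means
all.equal(nested$Mean, piped$Mean)
```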

group_by

The function group_by is incredibly helpful, but not that exciting.

Code
# ?group_by
head(ChickWeight) # start with this
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1

group_by

If we group the dataset ChickWeight by Diet, things change only slightly: the rows are the same, but the printout now records the grouping (Groups: Diet [4]). What group_by returns is ready to be passed into summarise.

Code
group_by(ChickWeight, Diet)
# A tibble: 578 × 4
# Groups:   Diet [4]
   weight  Time Chick Diet 
    <dbl> <dbl> <ord> <fct>
 1     42     0 1     1    
 2     51     2 1     1    
 3     59     4 1     1    
 4     64     6 1     1    
 5     76     8 1     1    
 6     93    10 1     1    
 7    106    12 1     1    
 8    125    14 1     1    
 9    149    16 1     1    
10    171    18 1     1    
# ℹ 568 more rows
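
To see that grouping only attaches information and leaves the data alone, we can inspect the grouped object with a few dplyr helpers. This is a minimal sketch; the name grouped is made up for the illustration.

```r
library(dplyr)

grouped <- group_by(ChickWeight, Diet)

# Same number of rows and columns as the original data
dim(grouped)
dim(ChickWeight)

group_vars(grouped)  # which variables define the groups
n_groups(grouped)    # how many groups there are

# ungroup() drops the grouping again
is_grouped_df(ungroup(grouped))
```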

summarise

The function summarise collapses multiple observations down into one number, for instance into a summary statistic. As we saw before, we can compute several summary statistics at once.

Code
# ?summarise

summarise

What does the following code do?

Code
summarise(ChickWeight,
          Mn = mean(weight, na.rm=TRUE))
        Mn
1 121.8183
Code
# hint
mean(ChickWeight$weight, na.rm=TRUE)
[1] 121.8183

summarise

We can also compute several summary statistics at once, by group or not.

Code
summarise(ChickWeight,
          mn = mean(weight),
          sd = sd(weight),
          mdn = median(weight),
          iqr = IQR(weight))
        mn       sd mdn    iqr
1 121.8183 71.07196 103 100.75
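
The same four statistics can be computed by group as well. A sketch, assuming we group by Diet as on the earlier slides; the name by_diet is made up for the illustration.

```r
library(dplyr)

by_diet <- ChickWeight %>%
    group_by(Diet) %>%
    summarise(mn = mean(weight),
              sd = sd(weight),
              mdn = median(weight),
              iqr = IQR(weight))

by_diet  # one row per diet, one column per statistic
```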

mutate

The function mutate allows us to create new variables and add them to the data frame. Recall our summarized data, named df.

Code
df
# A tibble: 4 × 5
  Diet     Mn StDev    q1    q3
  <fct> <dbl> <dbl> <dbl> <dbl>
1 1      103.  56.7  57.8  136.
2 2      123.  71.6  65.5  163 
3 3      143.  86.5  67.5  199.
4 4      135.  68.8  71.2  185.

mutate

We can create a new variable (column), here the interquartile range iqr = q3 - q1.

Code
(newdf <- mutate(df, iqr=q3-q1))
# A tibble: 4 × 6
  Diet     Mn StDev    q1    q3   iqr
  <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1      103.  56.7  57.8  136.  78.8
2 2      123.  71.6  65.5  163   97.5
3 3      143.  86.5  67.5  199. 131. 
4 4      135.  68.8  71.2  185. 114. 

mutate

In case our new variable isn’t automatically printed, remember we can make R print things for us.

Code
newdf[,c("Diet", "iqr")]
# A tibble: 4 × 2
  Diet    iqr
  <fct> <dbl>
1 1      78.8
2 2      97.5
3 3     131. 
4 4     114. 

mutate

mutate works on any data frame. For instance, you might have two variables that are obviously better off as a ratio.

  • miles driven and time (hours) \(\Rightarrow\) miles/hr.
  • state income and state population \(\Rightarrow\) per capita income
  • or any state total, \(x\), and state population \(\Rightarrow\) per capita \(x\)
  • surface area and volume \(\Rightarrow\) surface-area-to-volume ratio, SA:V
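
For instance, the first ratio in the list might look like the sketch below. The data frame trips and its values are hypothetical, made up purely for illustration.

```r
library(dplyr)

# Hypothetical trip data, invented for this example
trips <- data.frame(miles = c(120, 45, 300),
                    hours = c(2, 1.5, 5))

# mutate adds the ratio as a new column and keeps the originals
trips <- mutate(trips, mph = miles / hours)
trips
```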

Group Summary Statistics

Summary statistics and a lesson about missing values in R.

Code
library(ape)
data(carnivora)
summarise(
    group_by(carnivora, Family),
        Mn = mean(LS), # litter size
        Total = n(),   # counts
        Sm = sum(LS, na.rm=TRUE))
# A tibble: 8 × 4
  Family         Mn Total    Sm
  <fct>       <dbl> <int> <dbl>
1 Ailuridae    1.5      1   1.5
2 Canidae      4.43    18  79.8
3 Felidae      2.69    19  51.2
4 Hyaenidae    2.4      4   9.6
5 Mustelidae   3.65    30 110. 
6 Procyonidae  3.08     4  12.3
7 Ursidae      2.1      4   8.4
8 Viverridae  NA       32  83.1
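
The NA for Viverridae appears because mean returns NA as soon as any value in the group is missing, unless na.rm = TRUE is set. Note also that n() counts rows, missing or not. A small sketch on a hypothetical data frame (litters, with made-up values) shows the difference.

```r
library(dplyr)

# Hypothetical data, invented for this example: group "b" has one missing value
litters <- data.frame(grp = c("a", "a", "b", "b"),
                      litter = c(2, 4, 3, NA))

na_demo <- litters %>%
    group_by(grp) %>%
    summarise(mn_raw = mean(litter),               # NA if any value is missing
              mn_rm = mean(litter, na.rm = TRUE),  # mean of the non-missing values
              total = n(),                         # rows, including missing ones
              n_obs = sum(!is.na(litter)))         # non-missing values only
na_demo
```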

Group Summary Statistics, take 2

We removed the NAs from the mean calculation and made the code (almost) read like English.

Code
carnivora %>%
    group_by(Family) %>%
    summarise(Mn = mean(LS, na.rm = TRUE)) # Litter Size
# A tibble: 8 × 2
  Family         Mn
  <fct>       <dbl>
1 Ailuridae    1.5 
2 Canidae      4.43
3 Felidae      2.69
4 Hyaenidae    2.4 
5 Mustelidae   3.65
6 Procyonidae  3.08
7 Ursidae      2.1 
8 Viverridae   2.77

Group Summary Statistics, take 3

Code
carnivora %>%
    group_by(Family) %>%
    summarise(
        Mn = mean(LS, na.rm = TRUE),
        Total = n(),
        Sm = sum(LS, na.rm = TRUE)
    )
# A tibble: 8 × 4
  Family         Mn Total    Sm
  <fct>       <dbl> <int> <dbl>
1 Ailuridae    1.5      1   1.5
2 Canidae      4.43    18  79.8
3 Felidae      2.69    19  51.2
4 Hyaenidae    2.4      4   9.6
5 Mustelidae   3.65    30 110. 
6 Procyonidae  3.08     4  12.3
7 Ursidae      2.1      4   8.4
8 Viverridae   2.77    32  83.1

My Favorite Example

mutate

Maybe relative brain size matters to you, i.e. brain weight (SB, in grams) relative to body weight (SW, in kilograms).

Code
heavy <- carnivora %>%
    group_by(Family) %>%
    mutate(brbo = SB/SW) # brain/body

mutate and summarise

Then throw in summarise to find which family has the greatest mean brain-to-body weight ratio.

Code
heavy %>%
    summarise(Mn_brbo = mean(brbo, na.rm=TRUE))
# A tibble: 8 × 2
  Family      Mn_brbo
  <fct>         <dbl>
1 Ailuridae      1.74
2 Canidae        7.90
3 Felidae        5.50
4 Hyaenidae      3.28
5 Mustelidae    10.5 
6 Procyonidae   10.8 
7 Ursidae        1.94
8 Viverridae     9.45

Which family?
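
One way to let R answer the question is to sort the summary, or to keep only the top row. To stay self-contained, the sketch below retypes the rounded means printed above into a tibble (the name brbo_means is made up); on the real data, the same arrange or slice_max step could simply be appended to the pipeline.

```r
library(dplyr)

# Values copied from the rounded output above, not recomputed from carnivora
brbo_means <- tibble(
    Family = c("Ailuridae", "Canidae", "Felidae", "Hyaenidae",
               "Mustelidae", "Procyonidae", "Ursidae", "Viverridae"),
    Mn_brbo = c(1.74, 7.90, 5.50, 3.28, 10.5, 10.8, 1.94, 9.45))

# Sort families from largest to smallest mean ratio
ranked <- arrange(brbo_means, desc(Mn_brbo))
ranked

# Or keep only the family with the largest mean ratio
top <- slice_max(brbo_means, Mn_brbo, n = 1)
top
```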