Plotting Data

Edward A. Roualdes

Categorical Data

Visualizing Categorical Variables

With categorical data, we can make bar charts.

Code

suppressMessages(library(tidyverse))
suppressMessages(library(openintro))
data(email)
ggplot(data = email, aes(number)) +
    geom_bar() +
    labs(title = "email 'Number' Frequency")

`county`

Recall the county data set. How big is it?

Code

data(county)
tail(county[, c("state", "name")])

# A tibble: 6 × 2
  state   name             
  <fct>   <chr>            
1 Wyoming Sublette County  
2 Wyoming Sweetwater County
3 Wyoming Teton County     
4 Wyoming Uinta County     
5 Wyoming Washakie County  
6 Wyoming Weston County

`county`

Frequency table of county data.

Code

table(county$state)


             Alabama               Alaska              Arizona 
                  67                   29                   15 
            Arkansas           California             Colorado 
                  75                   58                   64 
         Connecticut             Delaware District of Columbia 
                   8                    3                    1 
             Florida              Georgia               Hawaii 
                  67                  159                    5 
               Idaho             Illinois              Indiana 
                  44                  102                   92 
                Iowa               Kansas             Kentucky 
                  99                  105                  120 
           Louisiana                Maine             Maryland 
                  64                   16                   24 
       Massachusetts             Michigan            Minnesota 
                  14                   83                   87 
         Mississippi             Missouri              Montana 
                  82                  115                   56 
            Nebraska               Nevada        New Hampshire 
                  93                   17                   10 
          New Jersey           New Mexico             New York 
                  21                   33                   62 
      North Carolina         North Dakota                 Ohio 
                 100                   53                   88 
            Oklahoma               Oregon         Pennsylvania 
                  77                   36                   67 
        Rhode Island       South Carolina         South Dakota 
                   5                   46                   66 
           Tennessee                Texas                 Utah 
                  95                  254                   29 
             Vermont             Virginia           Washington 
                  14                  133                   39 
       West Virginia            Wisconsin              Wyoming 
                  55                   72                   23
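Counts are often easier to compare once they are sorted or turned into proportions. The following is a minimal sketch of that idea. It assumes the built-in ChickWeight data (not county) so that it runs without openintro, and the object names are illustrative only.

```r
# Frequency table for a categorical variable (Diet is a factor)
counts <- table(ChickWeight$Diet)

# Sort from most to least frequent
sorted_counts <- sort(counts, decreasing = TRUE)

# Convert counts to proportions that sum to 1
props <- prop.table(counts)

print(sorted_counts)
print(round(props, 3))
```

The same three calls should work on county$state once openintro is loaded.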

`county`

Bar chart of county data.

Code

(pl <- ggplot(data = county, aes(state)) +
     geom_bar() +
     labs(title = "?", x = "States"))

`county`

Code

pl + theme(axis.text.x = element_text(angle = 45,
    hjust=1))

Numerical Data

Towards Histograms

What do we give up by finding a mean? Consider the county data.

Code

mean(county$per_capita_income, na.rm = TRUE)

[1] 26093.12

Code

min(county$per_capita_income, na.rm = TRUE)

[1] 10466.84

Code

max(county$per_capita_income, na.rm = TRUE)

[1] 69532.86

Histograms

A histogram shows the shape of the data, including the tails that a single mean hides:

Code

(p <- ggplot(data = county, aes(per_capita_income)) +
     geom_histogram(bins = 21) +
     labs(x = "Per capita income"))

Histograms, in words

Think of each observation as belonging to a bin. These binned observations are plotted as bars to form a histogram. That is, the height of each bar depicts the number of observations within each bin.

The county$per_capita_income variable ranges from $10,466.84 to $69,532.86. Setting binwidth = 1800 (rather than bins = 21, as in the previous plot) creates bins of width $1,800 and places each observation in the appropriate bin. Observations on the boundary of a bin are allocated to the lower bin.
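The binning described in words here can be reproduced by hand. This is a rough sketch, assuming the built-in ChickWeight weights and a bin width of 25 g chosen purely for illustration. ggplot2 may align its bin edges differently, so the counts need not match a geom_histogram() plot exactly.

```r
# Chick weights (grams) from the built-in ChickWeight data
x <- ChickWeight$weight

# Bin edges spaced 25 g apart, wide enough to cover every observation
width <- 25
breaks <- seq(floor(min(x) / width) * width,
              ceiling(max(x) / width) * width,
              by = width)

# Assign each observation to a right-closed bin (a, b]
bins <- cut(x, breaks = breaks, right = TRUE, include.lowest = TRUE)

# Bar heights of the histogram: number of observations per bin
bin_counts <- table(bins)
print(bin_counts)

# A value on a boundary falls into the lower bin
boundary_bin <- as.character(cut(50, breaks = c(25, 50, 75), right = TRUE))
print(boundary_bin)
```

Each entry of bin_counts is the height of one bar, and the last line illustrates the boundary rule: 50 lands in (25,50], not (50,75].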

Histograms, picking bins

Since the researcher has to choose the bin width, we should learn the difference between bins that are too narrow and bins that are too wide.

Histograms, too many

Extremely narrow bins

Code

ggplot(data = county, aes(per_capita_income)) +
    geom_histogram(binwidth = 3) +
    labs(x = "Per capita income")

Histograms, too few

Extremely wide bins

Code

ggplot(data = county, aes(per_capita_income)) +
    geom_histogram(binwidth = 80000) +
    labs(x = "Per capita income")
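A little arithmetic shows why these two extremes fail. The sketch below uses the minimum and maximum of per_capita_income reported earlier in these slides. The bin counts are approximate, because ggplot2's edge alignment can add a bin, and the names in widths are labels invented for this example.

```r
# Range of county$per_capita_income as reported earlier in these slides
lo <- 10466.84
hi <- 69532.86

# Bin widths used in the slides
widths <- c(too_narrow = 3, moderate = 1800, too_wide = 80000)

# Approximate number of bins needed to cover the range
n_bins <- ceiling((hi - lo) / widths)
print(n_bins)
```

With a width of 3, roughly 3,000 counties are spread over nearly 20,000 bins, so most bins are empty. With a width of 80,000, every county falls into a single bar.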

Histograms, by group

Compare chicken weights by diet.

Code

ggplot(data=ChickWeight, aes(weight)) +
    geom_histogram(bins = 21) +
    facet_wrap(Diet ~ ., ncol = 1)

Histograms, keywords

right skewed. When data trail to the right and have a longer right tail, the shape is said to be right skewed.
left skewed. Data with the reverse characteristic – a long, thin tail to the left – are said to be left skewed.
symmetric. Data that show roughly equal tails in both directions are called symmetric.
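Skew also shows up in how the mean and median compare. The sketch below uses simulated samples (an assumption, not data from these slides) to illustrate that a long right tail tends to pull the mean above the median, and a long left tail pulls it below.

```r
set.seed(1)  # make the simulated samples reproducible

right_skewed <- rexp(1000, rate = 1)  # long tail to the right
left_skewed  <- -right_skewed         # mirror image: long tail to the left
symmetric    <- rnorm(1000)           # roughly equal tails

# Mean and median of each sample, side by side
centers <- rbind(
  right_skewed = c(mean = mean(right_skewed), median = median(right_skewed)),
  left_skewed  = c(mean = mean(left_skewed),  median = median(left_skewed)),
  symmetric    = c(mean = mean(symmetric),    median = median(symmetric))
)
print(round(centers, 3))
```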

Histograms, and measures of center

Where is the mean on a histogram? Median?

Code

mn <- mean(county$per_capita_income, na.rm = TRUE)
mdn <- median(county$per_capita_income, na.rm = TRUE)
# Mean in red, median in blue (linewidth replaces size for lines in ggplot2 >= 3.4.0)
p + geom_vline(xintercept = mn, color = "red", linewidth = 1) +
    geom_vline(xintercept = mdn, color = "blue", linewidth = 1)

Histograms, they tell us more

What more, other than the center of the data, do histograms tell us? Consider a histogram from the email data set.

Code

em <- mean(email$num_char)
ggplot(data=email, aes(num_char)) +
    geom_histogram(bins = 21) +
    geom_vline(aes(xintercept = em), color = "blue")

Histograms, an ideal

Data Width, 1 standard deviation

Histograms naturally connect with standard deviations. If the data are unimodal (ish) and symmetric (ish), roughly 68% of the data will be within one standard deviation of the mean.

Data Width, 2 standard deviations

If the data are unimodal (ish) and symmetric (ish), roughly 95% of the data will be within two standard deviations of the mean.

Data Width, 3 standard deviations

If the data are unimodal (ish) and symmetric (ish), roughly 99.7% of the data will be within three standard deviations of the mean.
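The 68%, 95%, and 99.7% figures are the values for a normal distribution. As a quick check, assuming a normal curve stands in for "unimodal (ish) and symmetric (ish)" data, they can be computed from the normal CDF:

```r
# Number of standard deviations from the mean
k <- 1:3

# P(-k < Z < k) for a standard normal Z
within_k_sd <- pnorm(k) - pnorm(-k)
names(within_k_sd) <- paste0(k, " sd")

print(round(within_k_sd, 4))
```

Real data will only match these proportions approximately, and the match gets worse as the data depart from a symmetric, single-peaked shape.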

Standard Deviations, be careful

Three different population distributions with the same (population) mean $\mu = 0$ and (population) standard deviation $\sigma = 1$.
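To see why care is needed, here is a sketch with three distributions that share mean 0 and standard deviation 1. The choice of normal, uniform, and shifted exponential is an assumption made for this example and may differ from the three distributions pictured on the slide.

```r
# Proportion of each distribution within one standard deviation of its mean

# Standard normal: mean 0, sd 1
p_normal <- pnorm(1) - pnorm(-1)

# Uniform on [-sqrt(3), sqrt(3)]: mean 0, sd 1
p_uniform <- punif(1, -sqrt(3), sqrt(3)) - punif(-1, -sqrt(3), sqrt(3))

# Exponential(rate = 1) shifted left by 1: mean 0, sd 1.
# If X = E - 1, then -1 < X < 1 is the same event as 0 < E < 2.
p_shifted_exp <- pexp(2, rate = 1) - pexp(0, rate = 1)

print(round(c(normal = p_normal,
              uniform = p_uniform,
              shifted_exp = p_shifted_exp), 4))
```

Under these assumptions the share of the distribution within one standard deviation ranges from about 58% to about 86%, so the 68% rule is a property of shape, not of the mean and standard deviation alone.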

Towards Box plots

We will use the data set carnivora found within the library ape.

Code

library(ape)
data(carnivora)
# ?carnivora

recap: summary statistics

R will fairly easily produce six-number summaries for us. Here’s the variable for average brain weight among male and female animals from the order Carnivora.

Code

summary(carnivora$SB)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   15.68   33.75   56.43   57.17  459.50

Box Plots

A box plot summarizes a data set using five (standard) statistics while also plotting unusual observations.
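The pieces of a box plot can be computed directly. The sketch below assumes the common 1.5 × IQR rule for flagging unusual observations, which is also the default in geom_boxplot(). It uses the built-in ChickWeight weights so it runs without ape, and the object names are illustrative.

```r
# Chick weights (grams) from the built-in ChickWeight data
x <- ChickWeight$weight

# The box: first quartile, median, third quartile
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Fences: observations beyond these are plotted as individual points
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
lower_whisker <- min(x[x >= lower_fence])
upper_whisker <- max(x[x <= upper_fence])

# Unusual observations, drawn as points beyond the whiskers
unusual <- x[x < lower_fence | x > upper_fence]

print(c(lower_whisker = lower_whisker, q, upper_whisker = upper_whisker))
print(sort(unusual))
```

The five printed values are the statistics the box and whiskers display. Anything in unusual would appear as a separate point.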

Box Plot, explained

Box Plot, carnivora example

Code

ggplot(data = carnivora, aes(Order, SB)) +
    geom_boxplot() +
    labs(y = "Average brain weight (g)",
         x = "Order")

Box Plot, better carnivora example

Have you seen some of the families in carnivora? They range from the adorable (Ailuridae) to the huge (Ursidae).

Code

ggplot(data = carnivora, aes(Family, SB)) +
    geom_boxplot() +
    labs(x = "Family", y = "Average brain weight (g)")

Towards scatter plots

The most common bivariate plot is the scatter plot.

A scatterplot provides a case-by-case view of data for two numerical variables.

Scatterplot, `carnivora` example

Code

ggplot(data = carnivora, aes(SW, SB)) +
    geom_point() +
    labs(x = "Average body weight (kg)",
         y = "Average brain weight (g)")

Scatter plots, keywords

associated. When two variables show some connection with one another, they are called associated variables.
independent. If two variables are not associated, then they are said to be independent.

Scatter plots, keywords

Direction of association can be described as

positive association. When two variables have an upward/positive trend, they are called positively associated.
negative association. When two variables have a downward/negative trend, they are called negatively associated.

Scatter plots, keywords

Structure of association can be described as linear or non-linear.
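One way to put a number on direction is the Pearson correlation coefficient. The slides do not name it, so treat it here as one possible summary. Its sign matches the direction of a trend, but it measures only linear association, so it can understate a strong non-linear pattern. The sketch uses two built-in data sets.

```r
# Indometh: drug concentration in the blood over time
r_indometh <- cor(Indometh$time, Indometh$conc)

# CO2: plant CO2 uptake versus ambient CO2 concentration
r_co2 <- cor(CO2$conc, CO2$uptake)

print(round(c(Indometh = r_indometh, CO2 = r_co2), 3))

# The sign gives the direction: -1 is negative, +1 is positive
print(sign(c(Indometh = r_indometh, CO2 = r_co2)))
```

A scatter plot is still needed to judge whether the relationship is linear or curved.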

Scatter plot, `Indometh` example

Code

library(datasets)
# ?Indometh
ggplot(data = Indometh, aes(time, conc)) +
    geom_point()

Scatter plot, `CO2` example

Code

# ?CO2
ggplot(data = CO2, aes(conc, uptake, colour = Type)) +
    geom_point()

Scatter plots, careful

Code

# x: Australian residents, quarterly 03/1971 - 03/1994
# y: WI beaver body temps, 10 min intervals
df <- data.frame(x = as.vector(austres),
                 y = beaver1[1:length(austres), "temp"])
ggplot(data = df, aes(x, y)) +
    geom_point()
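These two series have nothing to do with one another, yet pairing them row by row can still produce an apparent pattern. As a hedged illustration, the sketch below rebuilds the same data frame and computes the correlation, using Pearson correlation as an assumed summary that the slides do not name.

```r
# Pair two unrelated built-in series row by row
n <- length(austres)
df <- data.frame(x = as.vector(austres),          # Australian residents
                 y = beaver1[seq_len(n), "temp"])  # beaver body temperature

# Correlation between two variables that share no real connection
r <- cor(df$x, df$y)
print(round(r, 3))
```

Whatever value is printed, it reflects only that both series were recorded in sequence, not any link between them.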

for fun

https://xkcd.com/552/

Take Away

categorical variable -> bar chart
one numerical variable -> histogram
one numerical and one categorical -> box plot
two numerical -> scatter plot
… many variations
