With categorical data we can make bar charts.
Recall the county data set? How big is it?
Frequency table of county data.
             Alabama               Alaska              Arizona 
                  67                   29                   15 
            Arkansas           California             Colorado 
                  75                   58                   64 
         Connecticut             Delaware District of Columbia 
                   8                    3                    1 
             Florida              Georgia               Hawaii 
                  67                  159                    5 
               Idaho             Illinois              Indiana 
                  44                  102                   92 
                Iowa               Kansas             Kentucky 
                  99                  105                  120 
           Louisiana                Maine             Maryland 
                  64                   16                   24 
       Massachusetts             Michigan            Minnesota 
                  14                   83                   87 
         Mississippi             Missouri              Montana 
                  82                  115                   56 
            Nebraska               Nevada        New Hampshire 
                  93                   17                   10 
          New Jersey           New Mexico             New York 
                  21                   33                   62 
      North Carolina         North Dakota                 Ohio 
                 100                   53                   88 
            Oklahoma               Oregon         Pennsylvania 
                  77                   36                   67 
        Rhode Island       South Carolina         South Dakota 
                   5                   46                   66 
           Tennessee                Texas                 Utah 
                  95                  254                   29 
             Vermont             Virginia           Washington 
                  14                  133                   39 
       West Virginia            Wisconsin              Wyoming 
                  55                   72                   23 
Bar chart of county data.
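A frequency table and its bar chart can be sketched in base R as follows. The small `state` vector is made-up illustration data, not the county data itself; with the real data one would tabulate the state column instead (assumed here to be `county$state`).

```r
# Made-up categorical data: one entry per county, labeled by its state.
# (Hypothetical values, not taken from the county data set.)
state <- c("Delaware", "Hawaii", "Delaware", "Rhode Island",
           "Hawaii", "Delaware", "Hawaii", "Hawaii")

# table() counts how many times each category appears.
freq <- table(state)
print(freq)

# barplot() draws one bar per category, with height equal to its count.
barplot(freq, ylab = "Number of counties")
```

The bar heights are exactly the counts in the table, so the chart and the table carry the same information.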
What do we give up by finding a mean? Consider the county data set.
A histogram helps summarize tails of the data:
Think of each observation as belonging to a bin. These binned observations are plotted as bars to form a histogram. That is, the height of each bar depicts the number of observations within each bin.
The county$per_capita_income variable ranges from $10,466.84 to $69,532.86. Setting binwidth = 1800 creates bins of width 1,800 and places each observation in the appropriate bin. Observations on the boundary of a bin are allocated to the lower bin.
Since the researcher has to choose the bin width, we should learn the difference between bins that are too narrow and bins that are too wide.
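The boundary rule can be checked on a tiny example. This is a minimal sketch with made-up numbers, using base R `hist()` and `cut()` with right-closed bins, which may not be the same call that produced the histogram in these notes.

```r
# Made-up data: the value 2 sits exactly on the boundary between two bins.
x <- c(1, 2, 2, 3, 4)
breaks <- c(0, 2, 4)

# right = TRUE makes bins right-closed, so a boundary value
# falls into the lower bin.
h <- hist(x, breaks = breaks, right = TRUE, plot = FALSE)
h$counts

# cut() shows the bin each observation lands in.
bins <- cut(x, breaks = breaks, right = TRUE)
table(bins)
```

Both 2s are counted in the lower bin, so the counts are 3 and 2.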
Extremely small bins
Extremely large bins
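To see the two extremes side by side, here is a small sketch on simulated data (an assumption for illustration, not the county data). Very narrow bins give a noisy picture with few observations per bar, while very wide bins hide the shape almost entirely.

```r
set.seed(1)                          # make the simulated data reproducible
x <- runif(200, min = 0, max = 100)  # 200 made-up observations

# Same data, two very different bin widths.
narrow <- hist(x, breaks = seq(0, 100, by = 1),  plot = FALSE)
wide   <- hist(x, breaks = seq(0, 100, by = 50), plot = FALSE)

length(narrow$counts)  # 100 bins
length(wide$counts)    # 2 bins

# Draw both histograms next to each other.
par(mfrow = c(1, 2))
plot(narrow, main = "Bin width 1")
plot(wide,   main = "Bin width 50")
```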
Compare chicken weights by diet.
right skewed. When data trail to the right and have a longer right tail, the shape is said to be right skewed.
left skewed. Data with the reverse characteristic – a long, thin tail to the left – are said to be left skewed.
symmetric. Data that show roughly equal tails in both directions are called symmetric.
Where is the mean on a histogram? Median?
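One way to explore the question is to mark both statistics on a histogram. The sketch below uses made-up right-skewed data; in this example the long right tail pulls the mean above the median, which is typical of right-skewed data but not guaranteed.

```r
# Made-up right-skewed data: most values are small, one is large.
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 20)

mean(x)    # about 4.67, pulled toward the long right tail
median(x)  # 3

# Mark the mean (dashed) and the median (solid) on the histogram.
hist(x, breaks = seq(0, 20, by = 2), main = "Right-skewed toy data")
abline(v = mean(x), lty = 2)
abline(v = median(x), lty = 1)
```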
What more, other than the center of the data, do histograms tell us? Consider a histogram from the email data set.
Histograms naturally connect with standard deviations. If the data are unimodal (ish) and symmetric (ish), roughly 68% of the data will be within one standard deviation of the mean.
If the data are unimodal (ish) and symmetric (ish), roughly 95% of the data will be within two standard deviations of the mean.
If the data are unimodal (ish) and symmetric (ish), roughly 99.7% of the data will be within three standard deviations of the mean.
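These three percentages can be checked by simulation. The sketch below draws from a normal distribution (an assumption chosen because it is unimodal and symmetric) and computes the share of observations within one, two, and three standard deviations of the mean; `within_k_sd` is a helper defined here for illustration.

```r
set.seed(42)        # make the simulated data reproducible
x <- rnorm(10000)   # simulated unimodal, symmetric data

# Proportion of observations within k standard deviations of the mean.
within_k_sd <- function(x, k) {
  mean(abs(x - mean(x)) <= k * sd(x))
}

p <- sapply(1:3, function(k) within_k_sd(x, k))
round(p, 3)  # should be close to 0.68, 0.95, and 0.997
```

For skewed or multimodal data the proportions can differ noticeably from these values.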
Three different population distributions with the same (population) mean \(\mu\) = 0 and (population) standard deviation \(\sigma = 1\).
We will use the data set carnivora found within the library ape.
R readily produces six-number summaries for us. Here’s the variable for average brain weight among male and female animals from the order Carnivora.
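A six-number summary can be sketched with base R `summary()`. The vector below is made-up data; with carnivora loaded from ape, the same call would be applied to the brain weight column (assumed here to be `carnivora$SB`).

```r
# Made-up numeric data standing in for brain weights.
x <- c(2, 4, 4, 5, 7, 9, 30)

# summary() reports minimum, first quartile, median, mean,
# third quartile, and maximum.
s <- summary(x)
print(s)
```

The mean is the sixth number; the other five are the statistics a box plot is built from.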
A box plot summarizes a data set using five (standard) statistics while also plotting unusual observations.
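The five statistics and the unusual observations can be inspected directly with `boxplot.stats()`. This is a minimal sketch on made-up data, using R's default rule of flagging points more than 1.5 times the box length beyond the box.

```r
# Made-up data with one unusually large value.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

b <- boxplot.stats(x)
b$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out    # observations plotted individually as unusual points

# Draw the corresponding box plot.
boxplot(x, horizontal = TRUE)
```

Here the value 50 is flagged as unusual, so the upper whisker stops at 9, the largest observation that is not flagged.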
Have you seen some of the families in carnivora? From adorable, Ailuridae, to huge, Ursidae.
The most common bivariate plot is the scatterplot.
A scatterplot provides a case-by-case view of data for two numerical variables.
carnivora example
associated. When two variables show some connection with one another, they are called associated variables.
independent. If two variables are not associated, then they are said to be independent.
Direction of association can be described as
positive association. When two variables have an upward/positive trend, they are called positively associated.
negative association. When two variables have a downward/negative trend, they are called negatively associated.
Structure of association can be described as linear or non-linear.
Indometh example
CO2 example
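Both data sets ship with base R, so scatterplots of them can be sketched as below. The choice of variables on each axis is an assumption, since the original plots are not shown here. The sign of the correlation coefficient is used only as a rough numeric check of direction; it is not part of the notes above and says nothing about whether the structure is linear.

```r
# Indometh: drug concentration over time (assumed pairing of variables).
plot(conc ~ time, data = Indometh,
     main = "Indometh: concentration vs. time")

# CO2: plant CO2 uptake against ambient concentration (assumed pairing).
plot(uptake ~ conc, data = CO2,
     main = "CO2: uptake vs. concentration")

# Rough check of direction: negative sign means a downward trend,
# positive sign means an upward trend.
r_indometh <- cor(Indometh$time, Indometh$conc)
r_co2      <- cor(CO2$conc, CO2$uptake)
sign(c(r_indometh, r_co2))
```

In both plots the points follow a curve rather than a straight line, which is what a non-linear association looks like.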