With categorical data we can make bar charts.
Recall the county data set. How big is it?
Frequency table of county data (number of counties per state).
Alabama         Alaska          Arizona
67              29              15
Arkansas        California      Colorado
75              58              64
Connecticut     Delaware        District of Columbia
8               3               1
Florida         Georgia         Hawaii
67              159             5
Idaho           Illinois        Indiana
44              102             92
Iowa            Kansas          Kentucky
99              105             120
Louisiana       Maine           Maryland
64              16              24
Massachusetts   Michigan        Minnesota
14              83              87
Mississippi     Missouri        Montana
82              115             56
Nebraska        Nevada          New Hampshire
93              17              10
New Jersey      New Mexico      New York
21              33              62
North Carolina  North Dakota    Ohio
100             53              88
Oklahoma        Oregon          Pennsylvania
77              36              67
Rhode Island    South Carolina  South Dakota
5               46              66
Tennessee       Texas           Utah
95              254             29
Vermont         Virginia        Washington
14              133             39
West Virginia   Wisconsin       Wyoming
55              72              23
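As a rough sketch of how a frequency table like this and its bar chart might be produced in R, the snippet below uses a small made-up vector in place of county$state. The real data set is not loaded here; the three states and their counts are simply copied from the table above, and the variable names are assumptions for illustration.

```r
# Hypothetical stand-in for county$state: one entry per county.
# Only three states are included; their counts match the table above.
state <- c(rep("Delaware", 3), rep("Hawaii", 5), "District of Columbia")

# table() counts how often each category occurs.
state_counts <- table(state)
print(state_counts)

# barplot() draws one bar per category, with height equal to the count.
barplot(state_counts, ylab = "Number of counties")
```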
Bar chart of county data.
What do we give up by finding a mean? Consider the county data set.
A histogram helps summarize tails of the data:
Think of each observation as belonging to a bin. These binned observations are plotted as bars to form a histogram. That is, the height of each bar depicts the number of observations within each bin.
The county$per_capita_income data range from $10,466.84 up to $69,532.86. Setting binwidth = 1800 creates bins of width 1,800 and places the appropriate observations in each. Observations on the boundary of a bin are allocated to the lower bin.
Since the researcher has to choose the bin width, we should learn the difference between bins that are too narrow and bins that are too wide.
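The boundary rule can be checked on a few made-up values. The sketch below uses base R's hist(), which takes explicit bin edges through breaks rather than a binwidth argument (binwidth looks like the ggplot2 geom_histogram() argument, which is an assumption about the original code). The incomes are hypothetical and chosen so that some sit exactly on a bin edge.

```r
# Hypothetical incomes (not the county data), chosen so that some values
# fall exactly on a bin boundary.
income <- c(10800, 12600, 12601, 15000, 16200, 19800)

# Bin edges every 1,800 dollars: 9000, 10800, ..., 19800.
edges <- seq(9000, 19800, by = 1800)

# right = TRUE (the default) makes each bin include its upper edge, so an
# observation sitting on a boundary is counted in the lower bin.
binned <- hist(income, breaks = edges, right = TRUE, plot = FALSE)
print(binned$counts)

# cut() applies the same rule and shows which bin each value lands in.
print(cut(income, breaks = edges, right = TRUE, dig.lab = 5))
```

Here 12600 is counted in the bin ending at 12600, while 12601 moves up to the next bin.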
Extremely small bins
Extremely large bins
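To get a feel for the two extremes, the following sketch bins the same simulated data twice. The data, the helper name bin_counts, and the two widths are all assumptions made for illustration; they are not the county data or the settings used for the figures.

```r
set.seed(42)  # fixed seed so the simulated data are reproducible

# Simulated, right-skewed "incomes" in whole dollars (hypothetical data).
income <- round(rlnorm(500, meanlog = 10, sdlog = 0.3))

# Helper (hypothetical name): histogram counts for a given bin width.
bin_counts <- function(x, width) {
  edges <- seq(floor(min(x) / width) * width,
               ceiling(max(x) / width) * width,
               by = width)
  hist(x, breaks = edges, plot = FALSE)$counts
}

narrow <- bin_counts(income, width = 100)    # extremely small bins
wide   <- bin_counts(income, width = 20000)  # extremely large bins

# Many sparsely filled bins versus a handful of crowded ones.
cat("narrow bins:", length(narrow), "largest count:", max(narrow), "\n")
cat("wide bins:  ", length(wide), "largest count:", max(wide), "\n")
```

Very narrow bins mostly show noise, while very wide bins hide the shape of the distribution.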
Compare chicken weights by diet.
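One possible way to make this comparison is sketched below. The text does not say which data set it has in mind; R's built-in ChickWeight data (with columns weight, Time and Diet) is assumed here, and restricting to the final measurement day is a choice made only for this illustration.

```r
# Assumption: the built-in ChickWeight data (weight in grams, Diet 1-4).
# Keep only the final measurement day so each chick appears at most once.
final <- subset(ChickWeight, Time == 21)

# Median final weight per diet.
print(tapply(final$weight, final$Diet, median))

# One histogram per diet on a shared set of bins, so shapes are comparable.
edges <- seq(0, 400, by = 50)
op <- par(mfrow = c(2, 2))
for (d in levels(final$Diet)) {
  hist(final$weight[final$Diet == d], breaks = edges,
       main = paste("Diet", d), xlab = "Weight (g)")
}
par(op)
```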
right skewed. When data trail to the right and have a longer right tail, the shape is said to be right skewed.
left skewed. Data with the reverse characteristic – a long, thin tail to the left – are said to be left skewed.
symmetric. Data that show roughly equal tails in both directions are called symmetric.
Where is the mean on a histogram? Median?
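A small made-up sample can help answer this. In the sketch below (hypothetical data, not from any data set in the text), one large value stretches the right tail, and the mean and median are marked on the histogram.

```r
# Hypothetical right-skewed sample: most values are small, one is large.
x <- c(1, 2, 2, 3, 3, 3, 4, 20)

print(mean(x))    # pulled toward the long right tail
print(median(x))  # stays near the bulk of the data

# Mark both on a histogram: the mean sits to the right of the median.
hist(x, breaks = seq(0, 20, by = 2), main = "Right-skewed sample")
abline(v = mean(x), lty = 1)    # solid line: mean
abline(v = median(x), lty = 2)  # dashed line: median
```

For left-skewed data the order flips, and for symmetric data the two roughly coincide.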
What more, other than the center of the data, do histograms tell us? Consider a histogram from the email data set.
Histograms naturally connect with standard deviations. If the data are unimodal (ish) and symmetric (ish), roughly 68% of the data will be within one standard deviation of the mean.
If the data are unimodal (ish) and symmetric (ish), roughly 95% of the data will be within two standard deviations of the mean.
If the data are unimodal (ish) and symmetric (ish), roughly 99.7% of the data will be within three standard deviations of the mean.
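These three percentages can be checked by simulation. The sketch below draws from a normal distribution as one convenient example of unimodal, symmetric data; the mean of 50, the standard deviation of 10, and the helper name within_k_sd are assumptions for illustration.

```r
set.seed(1)  # reproducible simulation

# Simulated unimodal, symmetric data (normal draws; an assumption for
# illustration, not one of the data sets used in the text).
x <- rnorm(10000, mean = 50, sd = 10)

# Proportion of observations within k standard deviations of the mean.
within_k_sd <- function(x, k) {
  mean(abs(x - mean(x)) <= k * sd(x))
}

coverage <- sapply(1:3, function(k) within_k_sd(x, k))
names(coverage) <- c("1 SD", "2 SD", "3 SD")
print(round(coverage, 3))
```

The printed proportions should land close to 0.68, 0.95 and 0.997, with small differences due to sampling.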
Three different population distributions with the same (population) mean \(\mu\) = 0 and (population) standard deviation \(\sigma = 1\).
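The text does not say which three distributions appear in the figure. As a sketch under that caveat, the code below simulates three different shapes that all have mean 0 and standard deviation 1: a standard normal, a uniform on (-sqrt(3), sqrt(3)), and an exponential with rate 1 shifted left by 1. These choices are assumptions for illustration only.

```r
set.seed(2)
n <- 100000

# Three populations, each with mean 0 and standard deviation 1.
# These particular choices are assumptions made for illustration.
draws <- list(
  normal  = rnorm(n, mean = 0, sd = 1),              # symmetric, bell-shaped
  uniform = runif(n, min = -sqrt(3), max = sqrt(3)), # symmetric, flat
  skewed  = rexp(n, rate = 1) - 1                    # right skewed
)

# Sample means and standard deviations should be close to 0 and 1.
print(round(sapply(draws, mean), 2))
print(round(sapply(draws, sd), 2))

# Share of each sample within one standard deviation of the mean.
print(round(sapply(draws, function(x) mean(abs(x) <= 1)), 2))
```

Although the means and standard deviations agree, the share within one standard deviation differs by shape: in theory about 68% for the normal, about 58% for the uniform, and about 86% for the shifted exponential.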
We will use the data set carnivora found within the library ape.
R will fairly easily produce six-number summaries for us. Here’s the variable for average brain weight among male and female animals from the order Carnivora.
A box plot summarizes a data set using five (standard) statistics while also plotting unusual observations.
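The summaries behind a box plot can be sketched with base R functions. The values below are made up and stand in for a brain-weight variable; they are not taken from carnivora, and the variable name brain is hypothetical.

```r
# Hypothetical brain weights in grams (not the carnivora data).
brain <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)

# summary() reports six numbers: the five below plus the mean.
print(summary(brain))

# fivenum() gives minimum, lower hinge, median, upper hinge, maximum.
print(fivenum(brain))

# boxplot.stats() lists the points a box plot would draw individually,
# i.e. those more than 1.5 box lengths beyond the hinges.
print(boxplot.stats(brain)$out)

boxplot(brain, horizontal = TRUE, xlab = "Brain weight (g)")
```

In this sample the value 30 lies far beyond the upper hinge, so it is plotted as an individual point rather than being absorbed into the whisker.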
Have you seen some of the families in carnivora? They range from the adorable Ailuridae to the huge Ursidae.
The most common bivariate plot is the scatterplot.
A scatterplot provides a case-by-case view of data for two numerical variables.
carnivora example
associated. When two variables show some connection with one another, they are called associated variables.
independent. If two variables are not associated, then they are said to be independent.
Direction of association can be described as
positive association. When two variables have an upward/positive trend, they are called positively associated.
negative association. When two variables have a downward/negative trend, they are called negatively associated.
Structure of association can be described as linear or non-linear.
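Direction and structure can be illustrated with made-up data. In the sketch below, the sign of the correlation coefficient is used as one simple numerical check of direction; this is an addition for illustration, since the text itself describes trends visually. All variable names and values are hypothetical.

```r
x <- 1:10

# Hypothetical responses illustrating each kind of association.
y_pos_linear    <- 2 * x + 1   # upward, straight line
y_neg_linear    <- 20 - 3 * x  # downward, straight line
y_pos_nonlinear <- exp(x / 2)  # upward, but curved

# The sign of the correlation matches the direction of the trend.
r_pos   <- cor(x, y_pos_linear)
r_neg   <- cor(x, y_neg_linear)
r_curve <- cor(x, y_pos_nonlinear)
print(c(positive = r_pos, negative = r_neg, curved = r_curve))

# Scatterplots show the structure: two straight lines and one curve.
op <- par(mfrow = c(1, 3))
plot(x, y_pos_linear, main = "Positive, linear")
plot(x, y_neg_linear, main = "Negative, linear")
plot(x, y_pos_nonlinear, main = "Positive, non-linear")
par(op)
```

The curved example is still positively associated, but its correlation falls short of 1 even though y is determined exactly by x, which is why the plot matters for judging structure.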
Indometh example
CO2 example
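Both data sets ship with R, so the examples can be sketched directly. The text only names the data sets, so the choice of which variables to plot against each other is an assumption here.

```r
# Indometh and CO2 are part of R's built-in datasets package.
# Indometh: plasma concentration of indomethacin (conc) over time (time).
# CO2: carbon dioxide uptake of grass plants (uptake) against the
#      ambient carbon dioxide concentration (conc).
op <- par(mfrow = c(1, 2))
plot(conc ~ time, data = Indometh, main = "Indometh")
plot(uptake ~ conc, data = CO2, main = "CO2")
par(op)

# Sign of the correlation as a quick check of direction.
r_indometh <- cor(Indometh$time, Indometh$conc)
r_co2      <- cor(CO2$conc, CO2$uptake)
print(c(Indometh = r_indometh, CO2 = r_co2))
```

Indometh shows a negative, non-linear association (concentration drops quickly and then levels off), while CO2 shows a positive association that flattens at higher concentrations.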