Summary Statistics

Edward A. Roualdes

Basic R

Data Frames in R

Run these commands in R

Code

suppressMessages(library(openintro))
data(email) # load dataset email
str(email)  # ensure we're looking at a df
head(email) # top 6 rows; try tail
head(email, n = 10) # use n = x, for head/tail
email$spam  # just one variable
names(email) # column names
some_cols <- c("spam", "num_char", "format", "number")
email[ , some_cols] # specified variables/columns
email[c(1,2,3,50), some_cols] # Table 1

Categorical Data

recap: `email`

Recall the email data set.

Code

data(email)
head(email[, c("spam", "format", "number")])

# A tibble: 6 × 3
  spam  format number
  <fct> <fct>  <fct> 
1 0     1      big   
2 0     1      small 
3 0     1      small 
4 0     1      small 
5 0     0      none  
6 0     0      none

`email`, table

A good start is to count the observations in each category.

Code

table(email$number)


 none small   big 
  549  2827   545

`email`, contingency table

A contingency table is a table that summarizes data for two categorical variables.

Code

(t <- table(email$spam, email$number))

   
    none small  big
  0  400  2659  495
  1  149   168   50

`email`, contingency table of proportions

Code

round(prop.table(t), 2)

   
    none small  big
  0 0.10  0.68 0.13
  1 0.04  0.04 0.01
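The proportions above divide every count by the grand total. A minimal sketch, assuming we would rather compare proportions within each column or each row, uses the `margin` argument of `prop.table`. The counts are typed in by hand from the contingency table above so the example runs without `openintro`; the names `tab`, `col_props`, and `row_props` are only illustrative.

```r
# Counts copied from the contingency table above
# (rows: spam, columns: number). matrix() fills column by column.
tab <- as.table(matrix(
  c(400, 149, 2659, 168, 495, 50),
  nrow = 2,
  dimnames = list(spam = c("0", "1"), number = c("none", "small", "big"))
))

# margin = 2 divides each cell by its column total,
# so every column sums to 1.
col_props <- prop.table(tab, margin = 2)
round(col_props, 2)

# margin = 1 divides each cell by its row total instead,
# so every row sums to 1.
row_props <- prop.table(tab, margin = 1)
round(row_props, 2)
```

Which margin to use depends on the question: column proportions describe spam within each level of `number`, while row proportions describe `number` within spam and non-spam emails.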

Numerical Data

`county` data

Consider U.S. county data. Which of these variables is/are numerical?

Code

data(county)
tail(county[, c("name", "state", "pop2010", "per_capita_income")], n=4)

# A tibble: 4 × 4
  name            state   pop2010 per_capita_income
  <chr>           <fct>     <dbl>             <dbl>
1 Teton County    Wyoming   21294            48557.
2 Uinta County    Wyoming   21118            27048.
3 Washakie County Wyoming    8533            27495.
4 Weston County   Wyoming    7208            33297.

Measures of Center

Summary Statistics, mean

The sample mean, denoted \(\bar{x}\), is the sum of a sample of numbers divided by how many numbers are in the sample. Mathematically, we write

\[ \begin{aligned} \bar{x} & = \frac{x_1 + x_2 + \ldots + x_n}{n} \\ & = \frac{\sum_{i=1}^n x_i}{n} \\ & = \frac{1}{n} \sum_{i=1}^n x_i \end{aligned} \]

Summary Statistics, mean: example by hand

Find the mean of the following numbers

Code

x <- sample(0:100, 10)
x

 [1] 72 12 39 77 78 76 98 79 87 97

Summary Statistics, mean: example by R (computer)

Code

x

 [1] 72 12 39 77 78 76 98 79 87 97

Code

(m <- mean(x))

[1] 71.5
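To connect `mean()` back to the formula, here is a small sketch that computes \(\bar{x}\) directly as the sum divided by \(n\). Because `x` was drawn at random, the ten numbers are typed in by hand from the output above; the names `n` and `xbar` are just for illustration.

```r
# The ten numbers printed above.
x <- c(72, 12, 39, 77, 78, 76, 98, 79, 87, 97)

n <- length(x)        # how many numbers are in the sample
xbar <- sum(x) / n    # sum of the sample divided by n
xbar

# The built-in function gives the same value.
all.equal(xbar, mean(x))
```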

Mean, by picture

Where is the mean relative to the data?

Mean, an example

Find the mean of a variable stored in a data frame.

Code

mean(county$pop2010)

[1] 98262.04

Mean, another example

Find the mean of a variable stored in a data frame.

Code

mean(email$num_char)

[1] 10.70659

Mean, point estimator

We call the sample mean a point estimator. What is it estimating?

Summary Statistics, median

median. If the data are ordered from smallest to largest, the median is the observation right in the middle. If there is an even number of observations, there are two values in the middle, and the median is taken as their mean. There is no simple mathematical expression for this statistic.

Summary Statistics, median: examples by R

Find the median of the following numbers:

Code

x

 [1] 72 12 39 77 78 76 98 79 87 97

Code

median(x)

[1] 77.5
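The definition of the median can be followed step by step: sort, then pick the middle. The sketch below does this with a hypothetical helper, `median_by_hand`, which is not part of R and is written here only to mirror the definition. The ten numbers are typed in by hand from the output above.

```r
x <- c(72, 12, 39, 77, 78, 76, 98, 79, 87, 97)

# Hypothetical helper: median by sorting and picking the middle.
median_by_hand <- function(v) {
  s <- sort(v)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # odd n: the single middle value
  } else {
    mean(s[c(n / 2, n / 2 + 1)])    # even n: mean of the two middle values
  }
}

sort(x)             # the two middle values are the 5th and 6th
median_by_hand(x)   # agrees with median(x)
```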

Summary Statistics: comparing measures of center

Both the mean and the median measure the center of a dataset. Thus, they are jointly referred to as measures of center.

Code

mean(x)

[1] 71.5

Code

median(x)

[1] 77.5

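They measure different notions of center, though: the mean is the balance point of the data, while the median is the middle of the ordered data.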

What does more data do?

If we add a number to the previous variable x, what will be the effect on the

mean?
median?

What does more data do?

… join \(100\) to x

Code

mean(x)

[1] 71.5

Code

mean(c(x, 100))

[1] 74.09091

Code

median(x)

[1] 77.5

Code

median(c(x, 100))

[1] 78

What does more data do?

How about joining \(10000\) to x?

Code

mean(x)

[1] 71.5

Code

mean(c(x, 10000))

[1] 974.0909

Code

median(x)

[1] 77.5

Code

median(c(x, 10000))

[1] 78

Summary Statistics, outliers

An outlier is an observation that appears extreme relative to the rest of the data.

Note

The number \(10000\) in the datasets above appears to be an outlier.
What effect did this value have on the mean? On the median?
What would have happened if we added \(-100\)? mean? median?
Outliers are often the outcome of a confounding variable.
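One way to explore the question about \(-100\) is to try it. This is a small sketch that reuses the ten numbers printed earlier, typed in by hand because `x` was drawn at random.

```r
x <- c(72, 12, 39, 77, 78, 76, 98, 79, 87, 97)

# Compare each statistic before and after joining the extreme value.
c(before = mean(x),   after = mean(c(x, -100)))
c(before = median(x), after = median(c(x, -100)))
```

The mean drops from 71.5 to about 55.9, while the median only moves from 77.5 to 77. As with \(10000\), the extreme value pulls the mean toward itself and barely moves the median.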

Measures of Spread/Width

Summarizing Data, data width

Means and medians help summarize the center of the data. The statistician's rigorous description of the width of data uses the idea of deviations. See OS4 section \(2.1.4\).

Deviations, defined

deviation. We call the signed distance between an observation and the sample mean its deviation. Mathematically, we write

\[ d_i = x_i - \bar{x} \]
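As a quick sketch of this definition, the deviations of the ten numbers used earlier can be computed in one vectorized line. The values are typed in by hand, and `d` is just an illustrative name.

```r
x <- c(72, 12, 39, 77, 78, 76, 98, 79, 87, 97)

d <- x - mean(x)   # d_i = x_i - xbar, one deviation per observation
d

# Deviations above and below the mean cancel out.
sum(d)
```

Because the deviations always sum to zero, their plain average says nothing about spread. Squaring them first, as the variance does, avoids this cancellation.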

Variance and Standard Deviation

The two most common measures of spread are the sample variance and the sample standard deviation.

The sample variance is (almost) the average of the squared deviations.

\[ s^2 = \frac{d_1^2 + d_2^2 + \ldots + d_n^2}{n-1} = \frac{1}{n-1}\sum_{i=1}^n d_i^2 \]

The sample standard deviation is the square root of the variance.

\[s = \sqrt{s^2}\]
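The two formulas can be checked against R's built-in functions. This is a minimal sketch using the ten numbers from earlier, typed in by hand; the names `s2` and `s` simply follow the notation above.

```r
x <- c(72, 12, 39, 77, 78, 76, 98, 79, 87, 97)

n <- length(x)
d <- x - mean(x)            # deviations from the sample mean
s2 <- sum(d^2) / (n - 1)    # sample variance: note the n - 1 divisor
s <- sqrt(s2)               # sample standard deviation

c(variance = s2, sd = s)

# R's var() and sd() use the same n - 1 divisor.
all.equal(s2, var(x))
all.equal(s, sd(x))
```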

Standard Deviation in R, a simple example

Code

sd(email$num_char)

[1] 14.64579

Notation of Populations

Recall, sample datasets provide information about the population.

The sample mean \(\bar{x}\) estimates the population mean \(\mu\).
The sample standard deviation \(s\) estimates the population standard deviation \(\sigma\).
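A small simulation can make the notation concrete. Everything in this sketch is an assumption made for illustration: the population is made up (100,000 draws from a normal distribution), and the seed and sample size are arbitrary.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# A made-up population of 100,000 values.
population <- rnorm(100000, mean = 50, sd = 10)
mu <- mean(population)                      # population mean
sigma <- sqrt(mean((population - mu)^2))    # population sd divides by N

# A random sample of 500 observations from that population.
samp <- sample(population, 500)
xbar <- mean(samp)   # estimates mu
s <- sd(samp)        # estimates sigma

c(mu = mu, xbar = xbar, sigma = sigma, s = s)
```

The sample values should land near the population values without matching them exactly. A different seed or sample gives slightly different estimates.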

Standard Deviation Describes Variability

Focus on the conceptual meaning of the standard deviation as a descriptor of variability rather than the formulas.

Other Measures of Spread

Towards Other Measures of Spread

The median is also called the 50th quantile, or \(Q_2\). There are other quantiles.

The \(q\)th quantile is the value that has \(q\)% of the observations below it. It is also called the \(q\)th percentile.

The 25th quantile is also known as the first quartile, or \(Q_1\).
The 75th quantile is also known as the third quartile, or \(Q_3\).

Towards Other Measures of Spread, continued

Another way to think about the 25th and 75th quantiles:

The 25th quantile is the median of the dataset that consists of the smallest number up to the median.
The 75th quantile is the median of the dataset that consists of the median up to the largest number.
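The "median of each half" idea can be sketched with the ten numbers from earlier, typed in by hand. This sketch assumes an even number of observations, so the data split cleanly into two halves. With an odd number, one has to decide whether the median itself belongs to each half, which is not settled here.

```r
x <- c(72, 12, 39, 77, 78, 76, 98, 79, 87, 97)

s <- sort(x)
n <- length(s)                 # n = 10, an even number
lower <- s[1:(n / 2)]          # smallest number up to the median
upper <- s[(n / 2 + 1):n]      # median up to the largest number

q1 <- median(lower)
q3 <- median(upper)
c(Q1 = q1, Q3 = q3)

# quantile() interpolates by default, so its answers can differ slightly.
quantile(x, c(0.25, 0.75))
```

For these ten numbers the halves give 72 and 87, while the default `quantile()` method gives 73 and 85. Both are reasonable; they are different conventions for small datasets, and the gap shrinks as the dataset grows.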

Interquartile Range

The interquartile range (IQR) is calculated as the difference between the 75th and the 25th quantiles, \(Q_3\) and \(Q_1\), respectively:

\[ IQR = Q_3 - Q_1 \]

Quantiles and IQR in R

Code

quantile(email$num_char, 0.25)

  25% 
1.459

Code

q <- quantile(email$num_char, c(0.75, 0.25))
q[1] - q[2]

   75% 
12.625

Code

IQR(email$num_char)

[1] 12.625

Mixing Types

Mixing Variable Types

Often with binary categorical data, it’s advantageous to encode the variable numerically. For example,

female \(\Rightarrow 1=\) female, \(0=\) other
high school educated \(\Rightarrow 1=\) yes, \(0=\) other (no)
took pill \(\Rightarrow 1=\) yes, \(0=\) no
…

Mixing Variable Types

Consider the variable spam in the dataset email. We are likely interested in the proportion of emails that are spam, \(\hat{p}\).

Code

spam_number <- ifelse(email$spam == 1, 1, 0)
(phat <- mean(spam_number))

[1] 0.09359857
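Why does the mean of a 0/1 variable give a proportion? A sketch under the assumption that we only have the counts from the contingency table earlier in these slides: rebuild the 0/1 variable from those counts and compare its mean with the count of 1s divided by the total. The names `n_spam` and `n_not_spam` are illustrative.

```r
# Counts taken from the contingency table earlier in these slides.
n_spam <- 149 + 168 + 50          # emails with spam == 1
n_not_spam <- 400 + 2659 + 495    # emails with spam == 0

# Rebuild the 0/1 variable from the counts.
spam_number <- c(rep(1, n_spam), rep(0, n_not_spam))

# Summing 0s and 1s counts the 1s, so the mean is the proportion of 1s.
phat <- mean(spam_number)
phat
n_spam / (n_spam + n_not_spam)
```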

Take Away

Summarizing data is crucial to statistics.
The mean and standard deviation are by far the most important summaries.
- study their interpretations and intuitions
- see OS4 sections 2.1.2 and 2.1.4
We would prefer everything to be a mean, of sorts.
