Summary Statistics, mean
The sample mean, denoted \(\bar{x}\), is the sum of a sample of numbers divided by how many numbers are in the sample. Mathematically, we write
\[
\begin{align*}
\bar{x} & = \frac{x_1 + x_2 + \ldots + x_n}{n} \\
& = \frac{\sum_{i=1}^n x_i}{n} \\
& = \frac{1}{n} \sum_{i=1}^n x_i
\end{align*}
\]
Summary Statistics, mean: example by hand
Find the mean of the following numbers:
[1] 72 12 39 77 78 76 98 79 87 97
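Worked out by hand, adding the ten values and dividing by \(n = 10\) gives
\[
\bar{x} = \frac{72 + 12 + 39 + 77 + 78 + 76 + 98 + 79 + 87 + 97}{10} = \frac{715}{10} = 71.5
\]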
Summary Statistics, mean: example by R (computer)
[1] 72 12 39 77 78 76 98 79 87 97
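The original code chunk did not survive on this slide. A minimal sketch of the computation, assuming the ten numbers are stored in a vector named x (the name used on later slides), might look like:

```r
# Store the sample in a numeric vector
x <- c(72, 12, 39, 77, 78, 76, 98, 79, 87, 97)

# By the formula: sum of the values divided by the sample size
sum(x) / length(x)
#> [1] 71.5

# The built-in function gives the same result
mean(x)
#> [1] 71.5
```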
Mean, by picture
Where is the mean relative to the data?
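The picture itself is not reproduced here. One possible way to draw something similar, assuming the same vector x as before and using base R graphics (the choice of a strip chart is an assumption, not necessarily the plot shown in lecture):

```r
x <- c(72, 12, 39, 77, 78, 76, 98, 79, 87, 97)
xbar <- mean(x)

# One-dimensional scatter of the observations
stripchart(x, method = "stack", pch = 19, xlab = "x")

# Mark the sample mean with a dashed vertical line
abline(v = xbar, lty = 2)
```

In this sample eight of the ten observations lie above the mean, because the low value 12 pulls the mean to the left.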
Mean, an example
Find the mean of a variable stored in a data frame.
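The data frame used in lecture is not shown here. As a stand-in, this sketch assumes the built-in mtcars data frame and its mpg column, and extracts the column with `$` before averaging:

```r
# mtcars is a data frame that ships with R; mpg is one of its columns
mean(mtcars$mpg)
#> [1] 20.09062
```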
Mean, another example
Find the mean of a variable stored in a data frame.
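Again the original data frame is not shown. This sketch assumes the built-in iris data frame and shows two equivalent ways to reach a column, `$` and `with()`:

```r
# iris ships with R; Sepal.Length is one of its numeric columns
mean(iris$Sepal.Length)
#> [1] 5.843333

# with() evaluates the expression using the data frame's columns as variables
with(iris, mean(Sepal.Length))
#> [1] 5.843333
```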
Mean, point estimator
We call the sample mean a point estimator. What is it estimating?
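A small simulation can make the idea concrete. This is only a sketch: the population is hypothetical (normal, with a mean mu chosen by us), and the names mu, pop_sd, and smpl are made up for illustration.

```r
set.seed(1)  # make the simulated draws reproducible

mu <- 50      # population mean (known here only because we simulate)
pop_sd <- 10  # population standard deviation

# Draw a sample of 1000 observations from the hypothetical population
smpl <- rnorm(1000, mean = mu, sd = pop_sd)

# The sample mean is a single number (a point) that lands near mu
xbar <- mean(smpl)
xbar
```

In practice mu is unknown, and the single number xbar serves as our best guess for it.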
Summary Statistics: comparing measures of center
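Both the mean and the median measure the center of a dataset, so they are jointly referred to as measures of center.
They measure different notions of center, though: the mean is the arithmetic average, while the median is the middle value of the sorted data.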
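To see the two centers differ on a concrete sample, here is a short sketch that assumes x holds the ten numbers from the earlier mean example:

```r
x <- c(72, 12, 39, 77, 78, 76, 98, 79, 87, 97)

mean(x)    # arithmetic average
#> [1] 71.5

median(x)  # average of the two middle sorted values, 77 and 78
#> [1] 77.5
```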
What does more data do?
If we add a number to the previous variable x, what will be the effect on the mean? On the median?
- Join \(100\) to x.
- How about joining \(10000\) to x?
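The two experiments can be sketched as follows, assuming "the previous variable x" is the ten-number vector from the mean example (the names x100 and x10000 are made up for illustration):

```r
x <- c(72, 12, 39, 77, 78, 76, 98, 79, 87, 97)

# Join 100 to x
x100 <- c(x, 100)
mean(x100)      # 815 / 11, about 74.09
median(x100)
#> [1] 78

# Join 10000 to x
x10000 <- c(x, 10000)
mean(x10000)    # 10715 / 11, about 974.09
median(x10000)
#> [1] 78
```

Under this assumption the mean moves from 71.5 to roughly 974 when 10000 is joined, while the median only moves from 77.5 to 78.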
Summary Statistics, outliers
An outlier is an observation that appears extreme relative to the rest of the data.
Note
- The number \(10000\) in the datasets above appears to be an outlier.
- What effect did this value have on the mean? On the median?
- What would have happened to the mean and the median if we had added \(-100\) instead?
- Outliers are often the outcome of a confounding variable.
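The \(-100\) question can be checked the same way. This sketch again assumes x is the ten-number vector from the mean example (x_low is a made-up name):

```r
x <- c(72, 12, 39, 77, 78, 76, 98, 79, 87, 97)

# Join -100 to x
x_low <- c(x, -100)

mean(x_low)    # 615 / 11, about 55.91 (was 71.5)
median(x_low)  # was 77.5
#> [1] 77
```

A low outlier drags the mean down just as a high one drags it up, and the median again barely moves.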
Summarizing Data, data width
Means and medians help summarize the center of the data. The statistician's rigorous description of the width (spread) of data uses the idea of deviations. See OS4 section \(2.1.4\).
Deviations, defined
Deviation. We call the distance of an observation from the mean its deviation. The deviation carries a sign: it is negative when the observation lies below the mean. Mathematically, we write
\[ d_i = x_i - \bar{x} \]
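In R the deviations come from one vectorized subtraction. A sketch, assuming x is the ten-number vector from the mean example:

```r
x <- c(72, 12, 39, 77, 78, 76, 98, 79, 87, 97)

# Subtract the sample mean (71.5) from every observation
d <- x - mean(x)
d  # 0.5, -59.5, -32.5, 5.5, 6.5, 4.5, 26.5, 7.5, 15.5, 25.5

# Positive and negative deviations cancel: they sum to zero
sum(d)
#> [1] 0
```

Because the deviations always sum to zero, their plain average is useless as a measure of width.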
Variance and Standard Deviation
The two most common measures of spread are the sample variance and the sample standard deviation.
The sample variance is (almost) the average of the squared deviations.
\[ s^2 = \frac{d_1^2 + d_2^2 + \ldots + d_n^2}{n-1} = \frac{1}{n-1}\sum_{i=1}^n d_i^2 \]
The sample standard deviation is the square root of the variance.
\[s = \sqrt{s^2}\]
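The formulas can be followed step by step and compared with R's built-in `var()` and `sd()`. A sketch, assuming x is the ten-number vector from the mean example:

```r
x <- c(72, 12, 39, 77, 78, 76, 98, 79, 87, 97)
n <- length(x)

d <- x - mean(x)          # deviations
s2 <- sum(d^2) / (n - 1)  # sample variance: 6338.5 / 9, about 704.28
s <- sqrt(s2)             # sample standard deviation, about 26.54

# The built-in functions use the same n - 1 denominator
all.equal(s2, var(x))
#> [1] TRUE
all.equal(s, sd(x))
#> [1] TRUE
```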
Standard Deviation in R, a simple example
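The code for this slide is missing here. A minimal stand-in, using a small made-up vector y rather than the data from lecture:

```r
# A tiny made-up sample with mean 3
y <- c(1, 2, 3, 4, 5)

# Squared deviations are 4, 1, 0, 1, 4, so the variance is 10 / 4
var(y)
#> [1] 2.5

# The standard deviation is the square root of the variance
sd(y)
#> [1] 1.581139
```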
Notation of Populations
Recall, sample datasets provide information about the population.
- The sample mean \(\bar{x}\) estimates the population mean \(\mu\).
- The sample standard deviation \(s\) estimates the population standard deviation \(\sigma\).
Standard Deviation Describes Variability
Focus on the conceptual meaning of the standard deviation as a descriptor of variability rather than the formulas.
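One way to see this is to compare two samples that share a center but differ in width. The vectors a and b below are made up for illustration:

```r
# Two made-up samples, both with mean 50
a <- c(48, 49, 50, 51, 52)  # tightly clustered around 50
b <- c(10, 30, 50, 70, 90)  # widely scattered around 50

mean(a)
#> [1] 50
mean(b)
#> [1] 50

sd(a)  # small, about 1.58
sd(b)  # large, about 31.62
```

The means cannot tell the two samples apart; the standard deviations can.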