Sample statistics are best guesses about characteristics of the population; we estimate each unknown population parameter with its corresponding sample statistic.
Population Parameter | Sample Statistic |
---|---|
\(\mu\) | \(\bar{x}\) |
\(\sigma\) | \(s, \hat{\sigma}\) |
\(\sigma^2\) | \(s^2, \hat{\sigma}^2\) |
\(p\) | \(\hat{p}\) |
… | … |
We can use a sample (visualized as a histogram) to estimate the proportion of people between two heights.
Proportion of sample between 180 and 190cm tall?
\(\frac{\text{# people between 180 and 190}}{n} =\) 0.3411
We are still estimating things about the population. Now we are estimating proportions between two numbers, instead of where the distribution is centered (\(\mu\)) or how wide it is (\(\sigma\)).
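This estimate is easy to simulate. A minimal sketch in Python (the `rnorm`/`mean` idiom in R translates directly); the population here is hypothetical, drawn as \(N(190, 10)\), with parameters chosen only to be consistent with the 0.3411 figure above:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: heights ~ N(190, 10) cm; in practice these
# parameters are unknown -- that is exactly what we are estimating.
sample = rng.normal(loc=190, scale=10, size=10_000)

# Sample proportion between 180 and 190 cm
prop = np.mean((sample > 180) & (sample < 190))
print(round(prop, 4))
```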
Distributions are the population analogue to histograms, just as \(\mu\) is the population analogue to \(\bar{X}\). Distributions are mathematical descriptions of populations, defined by a function that has a measure of center \(\mu\) and a measure of spread \(\sigma\).
We often model adult male heights with a continuous distribution. Simplified, this means we draw a smooth curve over a histogram of infinitely many heights whose binwidths shrink toward zero.
Without a sample, we hypothesize the (population) distribution of heights to have a nice mathematical form.
This smooth curve represents a probability density function (also called a density or distribution). The total area under the density function is equal to 1.
What is the probability that a randomly selected adult is between 180 and 190cm tall?
\(P(180 < X < 190) =\) 0.3413
The proportion of the sample between 180cm and 190cm, 0.3411, is a close estimate of the population probability, 0.3413, that a randomly chosen person is between those same heights.
Proportion of sample less than 201cm?
\(\frac{\# \text{people} < 201}{n} =\) 0.8641
What is the probability that a randomly selected adult is shorter than 201cm?
\(P(X < 201) =\) 0.864
Again, the proportion of the sample shorter than 201cm, 0.8641, is a close estimate of the population probability, 0.864, that a randomly chosen person is shorter than \(201\)cm.
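Both population probabilities can be checked directly from the normal CDF. A sketch using only Python's standard library, assuming (hypothetically) heights \(X \sim N(190, 10)\), parameters chosen to reproduce the quoted 0.3413 and 0.864:

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma), computed via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma = 190, 10  # assumed population parameters

p_between = normal_cdf(190, mu, sigma) - normal_cdf(180, mu, sigma)  # P(180 < X < 190)
p_below = normal_cdf(201, mu, sigma)                                 # P(X < 201)
print(round(p_between, 4), round(p_below, 3))
```

In R the same checks are `pnorm(190, 190, 10) - pnorm(180, 190, 10)` and `pnorm(201, 190, 10)`.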
Though, there is something odd going on. We’ve assumed we know the population distribution: its shape, its mean, and its standard deviation.
In practice, we don’t and won’t know the population distribution. Nevertheless, we can learn things about it via a sample. Soon we’ll find out that this lack of knowledge about the population really doesn’t matter much.
Continuing our height example: in the case of randomly selected individuals and their heights, the random variable \(X\) is the height of a yet-to-be-selected individual.
In this example, the random variable is an adult’s height. Since we don’t yet know how tall a randomly selected adult will be, we denote the to-be-determined value by \(X\). Once we observe the variable’s outcome, we write the observed height as \(x\).
A random variable is a random process or variable with a numerical outcome; see OS4 Section 3.4.1.
Each random variable has a distribution function that describes the form of the random variable. Tying the pieces together, we say that each random variable (via its distribution function) has a measure of center \(\mu\) and a measure of spread \(\sigma\).
The discrete uniform distribution is the simplest of the discrete distributions. We write \(X \sim U(a, b)\), meaning the random variable \(X\) takes each of the values \(a, a+1, \ldots, b\) with equal probability.
The canonical example is die rolling. If we let the random process of die rolling be denoted by \(X\), then \(X \sim U(1, 6)\).
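A quick simulation of the die (sketched in Python; `sample(1:6, n, replace = TRUE)` would do the same in R): each face should appear with relative frequency near \(1/6\).

```python
import numpy as np

rng = np.random.default_rng(7)

# Roll a fair die 60,000 times: X ~ U(1, 6)
rolls = rng.integers(low=1, high=7, size=60_000)  # high is exclusive

# Relative frequency of each face; all six should be near 1/6 ≈ 0.167
face_props = np.bincount(rolls, minlength=7)[1:] / len(rolls)
print(face_props.round(3))
```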
Suppose you work for the IRS and you are interested in the probability that a randomly selected taxpayer is attempting to commit fraud on their tax return. So you and your boss decide to test this: from a sample of 900 taxpayers, you perform an in-depth search through each of the 900 tax forms.
Suppose you want to know the frequency of a dominant gene, say A, in a population. You draw 50 members of the population at random and find that 12 of them display the dominant phenotype.
You are interested in calculating the average G/C fraction of human genomic DNA across the whole genome. You sample \(50\) individuals, sequence their genomes …
Notice what is common to all of these scenarios:

- the random variable has just two outcomes: “success” or “failure”
- the trials are independent
- the probability of “success,” \(p\), stays the same across trials
The Bernoulli distribution describes a single trial with two outcomes, “success” (with probability \(p\)) and “failure.” Counting the number of successes across \(n\) independent Bernoulli trials, each with probability of success \(p\), gives the binomial distribution.
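To make the counting concrete, here is a sketch of the binomial probability mass function built from \(n\) independent Bernoulli trials; the fraud rate \(p = 0.05\) below is purely illustrative, not a figure from the IRS scenario.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k successes in n independent Bernoulli(p) trials)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Illustrative only: if 5% of returns were fraudulent, the chance of
# seeing exactly 2 fraudulent returns among 10 sampled would be
print(round(binom_pmf(2, 10, 0.05), 4))
```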
Let’s discuss: return to the IRS example above, where you search each of the 900 sampled tax forms for attempted fraud.
If we can estimate \(p\) from a sample of Bernoulli random variables, we can then answer questions of the sort posed in the scenarios above, e.g., how likely it is that a randomly selected tax return is fraudulent.
The normal distribution is ubiquitous in statistics.
The normal distribution is a probability density function that is symmetric, unimodal, and bell shaped.
Notes
The z-score of an observation is the number of standard deviations the observation lies above or below the mean. We compute the z-score for an observation \(x\) that follows a distribution with mean \(\mu\) and standard deviation \(\sigma\) using \[z = \frac{x - \mu}{\sigma}\]
Suppose SAT scores are distributed \(X \sim N(1500, 300)\) and ACT scores are distributed \(N(21, 5)\). If Ann scored 1800 on the SAT and Tom scored 24 on the ACT, who performed better?
\(z_A = \frac{1800 - 1500}{300} =\) 1
\(z_T = \frac{24 - 21}{5} =\) 0.6
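The comparison can be scripted directly from the z-score formula (a Python sketch of the computation above):

```python
def z_score(x, mu, sigma):
    """Standard deviations x lies above (positive) or below (negative) the mean."""
    return (x - mu) / sigma

z_ann = z_score(1800, mu=1500, sigma=300)  # SAT ~ N(1500, 300)
z_tom = z_score(24, mu=21, sigma=5)        # ACT ~ N(21, 5)

# Ann sits 1 standard deviation above the SAT mean, Tom only 0.6 above
# the ACT mean, so Ann performed better relative to her test.
print(z_ann, z_tom)
```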
These z-scores are effectively quantiles/percentiles of the standard normal distribution.
Use R to justify the following approximate numbers.
All random variables have a theoretical mean, called the expected value. The sample mean \(\bar{X}\) is the sample analogue of this quantity.
Let \(Y\) take on the values \(0,1\) with equal probability – think coin flip. We can use R to simulate this random variable. We generate a sequence \(y_1, y_2, \ldots, y_{10000}\), each of which will be \(0\) or \(1\) once observed, and calculate the running mean. That is, the \(k\)th running mean is \(\bar{y}_k = \frac{1}{k}\sum_{i=1}^{k} y_i\).
Since we expect the long-run average of a coin flip to be \(1/2\), our running mean should converge to \(1/2\).
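The simulation sketched in Python (numpy’s `cumsum` plays the role of R’s `cumsum` here):

```python
import numpy as np

rng = np.random.default_rng(1)

# 10,000 fair coin flips: Y is 0 or 1 with equal probability
flips = rng.integers(low=0, high=2, size=10_000)

# Running mean after k flips, for k = 1, ..., 10000; should settle near 1/2
running_mean = np.cumsum(flips) / np.arange(1, len(flips) + 1)
print(running_mean[-1])
```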
Random variables also have a common measure of spread, called the variance. The sample variance \(s^2\) (the squared sample standard deviation) is the sample analogue of this quantity.
Consider \(Y\) a random variable following a Bernoulli distribution with probability of “success” \(p\).
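Its expected value and variance follow directly from the definition of expectation over the two outcomes:

\[E[Y] = 0 \cdot (1 - p) + 1 \cdot p = p\]

\[\mathrm{Var}(Y) = E[Y^2] - (E[Y])^2 = \left(0^2 \cdot (1 - p) + 1^2 \cdot p\right) - p^2 = p(1 - p)\]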
Random variables are said to follow specific distribution functions, e.g., \(X\) is distributed as [some name].
We calculate the expected value and variance of random variables from these distribution functions. When we calculate statistics, it is these mathematical quantities that we are estimating.
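For instance, for the fair die \(X \sim U(1, 6)\), the distribution function gives \(E[X] = 3.5\) and \(\mathrm{Var}(X) = 35/12\); a Python sketch of the calculation:

```python
# Fair six-sided die: X ~ U(1, 6), each face with probability 1/6
faces = range(1, 7)
pmf = {x: 1 / 6 for x in faces}

expected = sum(x * p for x, p in pmf.items())                    # E[X]
variance = sum((x - expected) ** 2 * p for x, p in pmf.items())  # Var(X)
print(expected, variance)
```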