---
title: "Worksheet 03"
format:
  html:
    self-contained: true
    toc: true
    toc-location: right
---
::: {.callout-note}
## Due Date
2023-03-03 Friday @ 11:59pm
:::
Please create a folder named `worksheet03` and create a QMD file named
`main.qmd` within it. Put your solutions to Worksheet 03 into
`main.qmd`. When finished, render the file and submit by dragging
and dropping `worksheet03`, the entire folder along with all its
contents, into our shared Google Drive folder. You've successfully
submitted Worksheet 03 when our shared Google Drive folder contains
just one folder, named `worksheet03`, with your Worksheet 03
solutions inside it.
## Steps to find maximum likelihood estimators
The likelihood function is defined as
$$L(\theta | \mathbf{X}) = \prod_{n=1}^N f(X_n | \theta)$$
It's generally easier to work with the log-likelihood than with the
likelihood itself. The log-likelihood is a function of the
unknown parameter(s) $\theta$:
$$ l(\theta) = \log{L(\theta | \mathbf{X})} = \sum_{n=1}^N \log{f(X_n | \theta)} $$
To find the best guess of $\theta$ in terms of data $X_1, \ldots, X_N$
perform the following steps.
1. Plug in the appropriate density function (for your particular
problem) $f(x|\theta)$ into the log-likelihood function. Simplify.
2. Differentiate (the simplified) log-likelihood function $l(\theta)$
with respect to $\theta$, simplify, and set the derivative equal to
$0$. In symbols we'd write,
$$ \frac{d}{d \theta} l(\theta) = 0 $$
3. Solve for parameter $\theta$ in terms of the data $\mathbf{X}$
$$ \hat{\theta} = \hat{\theta}(\mathbf{X}) $$
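As a quick illustration of these three steps (with a distribution
that is not one of the problems below), consider data $X_1, \ldots,
X_N \sim_{iid} \text{Poisson}(\lambda)$, with density $f(x | \lambda)
= \lambda^x e^{-\lambda} / x!$:
$$
\begin{aligned}
l(\lambda) &= \sum_{n=1}^N \log{f(X_n | \lambda)} = \sum_{n=1}^N \left( X_n \log{\lambda} - \lambda - \log{X_n!} \right) \\
\frac{d}{d\lambda} l(\lambda) &= \frac{1}{\lambda} \sum_{n=1}^N X_n - N = 0 \\
\hat{\lambda} &= \frac{1}{N} \sum_{n=1}^N X_n
\end{aligned}
$$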
## Central Limit Theorem
For $X_1, \ldots, X_N \sim_{iid} F$ where $\mathbb{V}[X] < \infty$, a
Central Limit Theorem for the sample mean, $m_N = N^{-1} \sum_{n=1}^N
X_n$, states that for large $N$, approximately,
$$m_N \sim N(\mathbb{E}[X], \mathbb{V}[X] / N).$$
The standard deviation of the sample mean $m_N$ (itself a random
variable) is called the standard error and is given by
$$\sqrt{\frac{\mathbb{V}[X]}{N}}.$$
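For example, for $N = 100$ draws from a Uniform$(0, 1)$ distribution,
$\mathbb{V}[X] = 1/12$, so the standard error of the sample mean is
$$\sqrt{\frac{1/12}{100}} \approx 0.029.$$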
## 1. Exponential Distribution
a. Find the maximum likelihood estimator for the unknown parameter
$\lambda$ from an Exponential distribution. Assume you have data
$X_1, \ldots, X_N \sim_{iid} \text{Exponential}(\lambda)$. The
density function for the Exponential distribution is
$$ f(x | \lambda) = \lambda e^{-\lambda x}, \quad x \ge 0 $$
b. Use the R function `rexp(N, l)` to generate $N = 101$ observations
from the Exponential distribution with parameter $\lambda =$ `l`. You
should choose a specific value for `l`. Store these data into a
variable named `data`.
c. Use your solution from a. and the data from b. to estimate the
value of `l`, as if you didn't already know it (a sketch of this
workflow, for a different distribution, follows the tip below).
::: {.callout-tip}
Provide between 3 and 7 lines of math for part a., and R code for parts b. and c.
:::
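To see the shape of the b./c. workflow without giving away part a.,
here is a minimal sketch using the Poisson distribution, whose
maximum likelihood estimator we derived above; your solution will use
`rexp()` and your estimator from part a. instead.
```{r}
lambda <- 3                  # true parameter, chosen by us
data <- rpois(101, lambda)   # N = 101 observations from Poisson(lambda)
lambda_hat <- mean(data)     # the Poisson MLE derived above
lambda_hat                   # hopefully close to 3
```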
## 2. Gamma Distribution
The density function for the Gamma distribution is
$$ f(x | \alpha, \beta) = \frac{\beta ^ {\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x} $$
where $\Gamma$ is the [Gamma
function](https://en.wikipedia.org/wiki/Gamma_function), the
generalization of the factorial ($!$) to non-integer values.
Unfortunately, there is no closed form solution for the maximum
likelihood estimators of the parameters $\alpha$ and $\beta$ of the
Gamma distribution. Imagine trying to take the derivative of the log
of the Gamma function (it turns out this is called the [digamma
function](https://en.wikipedia.org/wiki/Digamma_function), but
c'mon, seriously!?). Instead, we're going to use a computer to do
the calculus for us.
a. Pick values for $\alpha$ and $\beta$ and store them into variables
named `a` and `b`. Use the R function `rgamma(N, a, rate = b)` to
generate $N = 101$ observations (data) from the Gamma distribution with
parameters $\alpha=$ `a` and $\beta =$ `b`. You pick values for `a`
and `b`. Store these data into a variable named `data`.
b. From the above Steps to find the maximum likelihood estimators,
the log-likelihood function of data $X_1, \ldots, X_N$ from the Gamma
distribution can be written in R code as follows. Note the leading
minus sign: we work with the *negative* log-likelihood because
`optim()`, used in part c., minimizes by default.
```{r}
ll <- function(theta, data) {
  a <- theta[1] # shape parameter alpha
  b <- theta[2] # rate parameter beta
  # negative log-likelihood of the Gamma distribution
  return(-sum(dgamma(data, a, rate = b, log = TRUE)))
}
```
Copy and paste this function into your solutions. Try calling the
function with your two values of `a` and `b` from above. First create
a new variable `theta <- c(a, b)`. Then call `ll(theta, data)`, where
`data` is the data generated in a.
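For instance, with `a <- 2` and `b <- 3` (any positive values work),
the calls in a. and b. might look like:
```{r}
a <- 2
b <- 3
data <- rgamma(101, a, rate = b)  # N = 101 observations from Gamma(a, b)
theta <- c(a, b)
ll(theta, data)                   # negative log-likelihood at the true parameters
```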
c. Use the R function `optim(rexp(2), ll, data = data, method = "L-BFGS-B", lower =
c(1e-5, 1e-5))`, where `rexp(2)` supplies two random positive
starting values, to have R finish steps 2 and 3 from the [Steps to find
maximum likelihood
estimators](#steps-to-find-maximum-likelihood-estimators). Look for
the output labeled `$par`, something like
```{r, eval = FALSE}
...
$par
[1] two numbers here (hopefully) close to a and b
...
```
If the two numbers listed here are close to your values of `a` and `b`
then things probably went well for you :)
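Putting it together, a sketch (continuing the `a = 2`, `b = 3`
example): store `optim()`'s output in a variable so you can pull out
`$par` directly.
```{r}
fit <- optim(rexp(2), ll, data = data, method = "L-BFGS-B",
             lower = c(1e-5, 1e-5))
fit$par  # maximum likelihood estimates; hopefully close to a and b
```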
## 3. The Sample Mean and its standard deviation
Pick your favorite distribution with a finite variance,
$\mathbb{V}[X] < \infty$. Let's call your distribution $F$.
a. Generate $N$ observations from $F$ with whatever finite parameter
values you want -- you can either ask me or Google about how to
generate random data from $F$. Calculate the mean of these
observations.
b. Repeat part a. $R$ times and store the $R$ means into a variable
named `means`. I'd recommend writing a for-loop to do this. Here's
some starter code; a filled-in example follows it.
```{r, eval = FALSE}
N <- # some number
R <- 501
means <- rep(NA, R) # instantiate a vector of R NAs
for (r in 1:R) {
  data <- rF(N, ...) # generate N data from F(...); you'll need to change F
  means[r] <- mean(data) # store the rth mean
}
```
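For instance, with $F = \text{Exponential}(2)$ standing in for your
favorite distribution (pick your own!), the filled-in loop might look
like:
```{r}
N <- 101
R <- 501
means <- rep(NA, R)      # instantiate a vector of R NAs
for (r in 1:R) {
  data <- rexp(N, 2)     # N draws from Exponential(2)
  means[r] <- mean(data) # store the rth mean
}
```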
c. Based on the parameter(s) you chose for your distribution,
calculate and store the expectation of the distribution as a variable
named `m`. Feel free to look up on Wikipedia the expectation/mean of
your distribution.
d. Based on the parameter(s) you chose for your distribution,
calculate and store the standard deviation as a variable named `s`.
Feel free to look up on Wikipedia the standard deviation of your
distribution.
e. Estimate the standard error of the sample mean as $s / \sqrt{N}$
and store this into a variable named `se`.
f. Using the R function `hist(means, breaks = "fd", freq=FALSE)`, make
a histogram of your means. Describe the shape, location, and spread
of this distribution.
g. Generate a sequence of x-axis values with a range that matches the
x-axis of your histogram into a variable named `x`. Use the R
function `dnorm(x, m, se)` to evaluate the approximate Normal
distribution's density function appropriately matching $F$. Store the
evaluations of the Normal density function into a variable named `fx`.
h. Use the R function `lines(x, fx)` to draw the Normal density
function over your histogram.
i. Increase $N$ by a factor of $100$ and repeat a. through h. What
happened to the standard error when you increased $N$?
j. Note that theoretically, one does not need to have $R$ data sets to
use this Normal approximation (known as ~~the~~ a Central Limit
Theorem for the sample mean). One data set will do. We only need $R$
data sets to visualize this approximation.
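Putting parts c. through h. together for the Exponential(2) example
above (for Exponential($\lambda$), the expectation and the standard
deviation are both $1/\lambda$), a minimal sketch:
```{r}
m <- 1 / 2                                         # expectation, part c.
s <- 1 / 2                                         # standard deviation, part d.
se <- s / sqrt(N)                                  # standard error, part e.
hist(means, breaks = "fd", freq = FALSE)           # part f.
x <- seq(min(means), max(means), length.out = 200) # part g.
fx <- dnorm(x, m, se)
lines(x, fx)                                       # part h.
```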
## 4. A single data set
Continue to use the distribution $F$ from 3.
a. Generate $N$ (same value as above) observations from $F$ with the
same parameter values as above. Store these data into a variable
named `data`.
b. Estimate the expectation and the standard deviation of $F$ with the
R functions `mean(...)` and `sd(...)`. Store these as variables named
`m` and `s`.
c. Generate a sequence of x-axis values with a range from `m - 5 * s /
sqrt(N)` to `m + 5 * s / sqrt(N)` into a variable named `x`. Use the
R function `dnorm(x, m, s/sqrt(N))` to evaluate the approximate Normal
distribution's density function appropriately matching $F$. Store the
evaluations of the Normal density function into a variable named `fx`.
d. Use the R function `plot(x, fx, type = "l")` to draw the Normal
density function.
e. Comment on the differences between your plot from 3. and the plot
produced here.
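As with problem 3, here is a minimal sketch of parts a. through d.,
continuing the Exponential(2) example; swap in your own distribution
$F$ and parameter values.
```{r}
data <- rexp(N, 2)             # part a.: one data set of N observations
m <- mean(data)                # part b.: estimated expectation
s <- sd(data)                  #          estimated standard deviation
x <- seq(m - 5 * s / sqrt(N), m + 5 * s / sqrt(N),
         length.out = 200)     # part c.
fx <- dnorm(x, m, s / sqrt(N))
plot(x, fx, type = "l")        # part d.
```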