--- title: "Worksheet 03" format: html: self-contained: true toc: true toc-location: right --- ::: {.callout-note} ## Due Date 2023-03-03 Friday @ 11:59pm ::: Please create a folder named worksheet03 and create a QMD file named main.qmd within it. Put your solutions to Worksheet 03 into main.qmd. When finished, render the file, and submit by dragging and dropping worksheet03, the entire folder along with all its contents, into our shared Google Drive folder. Thus, you've successfully submit Worksheet 03 when our shared Google Drive folder has just one folder, named worksheet03, in it and your Worksheet 03 solutions contained within worksheet03. ## Steps to find maximum likelihood estimators The likelihood function is defined as $$L(\theta | \mathbf{X}) = \prod_{n=1}^N f(X_n | \theta)$$ It's generally easier to work with the log-likelihood, instead of the likelihood itself. The log-likelihood is defined as a function of the unknown parameter(s) $\theta$. $$l(\theta) = \log{L(\theta | \mathbf{X})} = \sum_{n=1}^N log(f(X_n | \theta))$$ To find the best guess of $\theta$ in terms of data $X_1, \ldots, X_N$ perform the following steps. 1. Plug in the appropriate density function (for your particular problem) $f(x|\theta)$ into the log-likelihood function. Simplify. 2. Differentiate (the simplified) log-liklihood function $l(\theta)$ with respect to $\theta$, simplify, and set derivative equal to $0$. In symbols we'd write, $$\frac{d}{d \theta} l(\theta) = 0$$ 3. Solve for parameter $\theta$ in terms of the data $\mathbf{X}$ $$\hat{\theta} = \hat{\theta}(\mathbf{X})$$ ## Central Limit Theorem For $X_1, \ldots, X_N \sim_{iid} F$ where $\mathbb{V}[X] < \infty$, a Central Limit Theorem for the sample mean, $m_N = N^{-1} \sum_{n=1}^N X_n$, states that as $N \to \infty$ $$m_N \sim N(\mathbb{E}[X], \mathbb{V}[X] / N).$$ The standard deviation of the (random variable) the sample mean $m_N$ is called the standard error and has formula $$\sqrt{\frac{\mathbb{V}[X]}{N}}.$$ ## 1. 
a. Find the maximum likelihood estimator for the unknown parameter $\lambda$ of an Exponential distribution. Assume you have data $X_1, \ldots, X_N \sim_{iid} \text{Exponential}(\lambda)$. The density function for the Exponential distribution is

$$f(x | \lambda) = \lambda e^{-x\lambda}$$

b. Use the R function `rexp(N, l)` to generate $N = 101$ observations from the Exponential distribution with parameter $\lambda =$ `l`. You should choose a specific value for `l`. Store these data into a variable named `data`.

c. Use your solution from a. and the data from b. to estimate the value of `l`, as if you didn't already know it.

::: {.callout-tip}
Provide between 3 and 7 lines of math for part a., and R code for parts b. and c.
:::

## 2. Gamma Distribution

The density function for the Gamma distribution is

$$f(x | \alpha, \beta) = \frac{\beta ^ {\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}$$

where $\Gamma$ is the [Gamma function](https://en.wikipedia.org/wiki/Gamma_function), the generalization of the factorial ($!$) to non-integer values. Unfortunately, there is no closed form solution for the maximum likelihood estimators of the parameters $\alpha$ and $\beta$ of the Gamma distribution. Imagine trying to take the derivative of the log of the Gamma function (it turns out this is called the [digamma function](https://en.wikipedia.org/wiki/Digamma_function), but c'mon, seriously!?). Instead, we're going to use a computer to do the calculus for us.

a. Pick values for $\alpha$ and $\beta$ and store them into variables named `a` and `b`. Use the R function `rgamma(N, a, rate = b)` to generate $N = 101$ observations (data) from the Gamma distribution with parameters $\alpha =$ `a` and $\beta =$ `b`. You pick the values for `a` and `b`. Store these data into a variable named `data`.
b. Following the [Steps to find maximum likelihood estimators](#steps-to-find-maximum-likelihood-estimators) above, the negative log-likelihood function of data $X_1, \ldots, X_N$ from the Gamma distribution can be written in R code as (negated because `optim()` below minimizes by default)

```{r}
ll <- function(theta, data) {
  a <- theta[1] # first element of theta is alpha
  b <- theta[2] # second element of theta is beta
  return(-sum(dgamma(data, a, rate = b, log = TRUE)))
}
```

Copy and paste this function into your solutions. Try calling the function for your two values of `a` and `b` from above. First create a new variable `theta <- c(a, b)`. Then call `ll(theta, data)` where `data` is the data generated in a.

c. Use the R function `optim(rexp(2), ll, data = data, method = "L-BFGS-B", lower = c(1e-5, 1e-5))` to have R finish steps 2 to 3 of the [Steps to find maximum likelihood estimators](#steps-to-find-maximum-likelihood-estimators). Look for the output labeled `$par`, something like

```{r, eval = FALSE}
...
$par
two numbers here (hopefully) close to a and b
...
```

If the two numbers listed here are close to your values of `a` and `b`, then things probably went well for you :)

## 3. The Sample Mean and its standard deviation

Pick your favorite distribution with a finite variance, $\mathbb{V}[X] < \infty$. Let's call your distribution $F$.

a. Generate $N$ observations from $F$ with whatever finite parameter values you want -- you can either ask me or Google about how to generate random data from $F$. Calculate the mean of these observations.

b. Repeat part a. $R$ times and store the $R$ means into a variable named `means`. I'd recommend that you write a for-loop to do this. Here's some starter code.

```{r, eval = FALSE}
N <- # some number
R <- 501
means <- rep(NA, R) # instantiate a vector of R NAs
for (r in 1:R) {
  data <- rF(N, ...) # generate N data from F(...); you'll need to change F
  means[r] <- mean(data) # store the rth mean
}
```

c. Based on the parameter(s) you chose for your distribution, calculate and store the expectation of the distribution as a variable named `m`. Feel free to look up the expectation/mean of your distribution on Wikipedia.
d. Based on the parameter(s) you chose for your distribution, calculate and store the standard deviation as a variable named `s`. Feel free to look up the standard deviation of your distribution on Wikipedia.

e. Estimate the standard error of the sample mean as $s / \sqrt{N}$ and store this into a variable named `se`.

f. Using the R function `hist(means, breaks = "fd", freq = FALSE)`, make a histogram of your means. Describe the shape, location, and spread of this distribution.

g. Generate a sequence of x-axis values, with a range that matches the x-axis of your histogram, into a variable named `x`. Use the R function `dnorm(x, m, se)` to evaluate the approximate Normal distribution's density function appropriately matching $F$. Store the evaluations of the Normal density function into a variable named `fx`.

h. Use the R function `lines(x, fx)` to draw the Normal density function over your histogram.

i. Increase $N$ by a factor of $100$ and repeat a. through h. What happened to the standard error when you increased $N$?

j. Note that, theoretically, one does not need $R$ data sets to use this Normal approximation (known as ~~the~~ a Central Limit Theorem for the sample mean). One data set will do. We only need $R$ data sets to visualize this approximation.

## 4. A single data set

Continue to use the distribution $F$ from 3.

a. Generate $N$ (same value as above) observations from $F$ with the same parameter values as above. Store these data into a variable named `data`.

b. Estimate the expectation and the standard deviation of $F$ with the R functions `mean(...)` and `sd(...)`. Store these as variables named `m` and `s`.

c. Generate a sequence of x-axis values, with a range from `m - 5 * s / sqrt(N)` to `m + 5 * s / sqrt(N)`, into a variable named `x`. Use the R function `dnorm(x, m, s/sqrt(N))` to evaluate the approximate Normal distribution's density function appropriately matching $F$. Store the evaluations of the Normal density function into a variable named `fx`.
d. Use the R function `plot(x, fx, type = "l")` to draw the Normal density function.

e. Comment on the differences between your plot from 3. and the plot produced here.
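To see how the pieces of 3. fit together, here is a minimal end-to-end sketch. It uses the Uniform(0, 1) distribution purely as a stand-in for $F$ (its mean is $1/2$ and its standard deviation is $\sqrt{1/12}$). Treat it as a check on the mechanics, not as a template; your own solution should use the distribution you chose.

```{r}
# Illustrative sketch only: Uniform(0, 1) stands in for F, and the
# variable names (N, R, means, m, s, se, x, fx) follow the worksheet.
set.seed(42)

N <- 101
R <- 501
means <- rep(NA, R)                  # instantiate a vector of R NAs
for (r in 1:R) {
  data <- runif(N)                   # N draws from Uniform(0, 1)
  means[r] <- mean(data)             # store the rth sample mean
}

m  <- 1 / 2                          # E[X] for Uniform(0, 1)
s  <- sqrt(1 / 12)                   # SD[X] for Uniform(0, 1)
se <- s / sqrt(N)                    # standard error of the sample mean

hist(means, breaks = "fd", freq = FALSE)
x  <- seq(min(means), max(means), length.out = 200)
fx <- dnorm(x, m, se)                # the CLT's Normal approximation
lines(x, fx)                         # density curve over the histogram
```

Rendering this chunk should show the histogram of the $R = 501$ means hugging the Normal curve; increasing $N$ (part i.) tightens both around `m`.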