https://classroom.github.com/a/w0Z7szJw

Submit both your Rmd and compiled HTML files.

You must show all your work/code. If you write any math down on paper, then either reproduce the math in LaTeX or take a picture of your work and include the picture in your compiled HTML file.

You may use any resources you like, so long as you abide by our Academic Integrity Policy, which has been in our Syllabus since day 0 of this semester.

Due: 2020-04-10 by 11:59pm

Consider the following dataset about admissions to UC Berkeley’s graduate school. Please read the README file associated with this dataset.

df <- read.csv("https://raw.githubusercontent.com/roualdes/data/master/dslabs/admissions.csv")
df[,c("major", "admitted", "applicants")]
##    major admitted applicants
## 1      A       62        825
## 2      B       63        560
## 3      C       37        325
## 4      D       33        417
## 5      E       28        191
## 6      F        6        373
## 7      A       82        108
## 8      B       25         68
## 9      C       34        593
## 10     D       35        375
## 11     E       24        393
## 12     F        7        341
  1. Explain why the Binomial distribution reasonably represents these data.

  2. Calculate the simplified log-likelihood for \(N = 12\) observations from the Binomial distribution. Hint: The number of trials in this dataset varies by major, so in your math the number of trials should carry a subscript \(n\), written \(K_n\), for the same reason that the data \(X_n\) carry a subscript. Your R code should handle this with vectorization.

  3. Using this dataset, estimate the average proportion of applicants that get admitted to UC Berkeley’s graduate school across all majors. Use the likelihood method and R’s function optim(...) to estimate the population parameter \(p\), and store the estimate in a variable named phat.

  4. For full credit, use the bootstrap method to draw R = 1001 samples from the sampling distribution of phat. For half credit, use the bootstrap method to draw R = 1001 samples from the sampling distribution of the sample mean of the number of admitted applicants per major. Show your code and don’t forget to pre-allocate.

  5. Use ggplot2 to make a density plot of your R bootstrap resampled statistics. Describe the shape of this sampling distribution.

  6. If the sample size were to increase, such that we sampled more than 12 majors, would the width of the sampling distribution increase or decrease? Explain.

  7. Use quantile() to calculate a \(90\)% confidence interval from your bootstrap resampled statistics.

  8. Interpret this confidence interval in the context of the data. Be specific about what your sampling distribution represents.

  9. The bootstrap procedure has essentially two steps: resample, and on each resample, calculate something.

    1. What do we resample from, how is the resampling done, and what is being calculated on each resample?

    2. What is the conceptual goal of the bootstrap procedure? Don’t just say to calculate confidence intervals.

  10. Suppose you calculated the following percentiles from some distribution.

    ##       2.5%        50%      97.5% 
    ##   5.317494  34.343947 110.109267
    1. What percentage confidence interval would the \(2.5\)% and \(97.5\)% percentiles make?

    2. Which direction of skew does this distribution appear to have? Explain.
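
As a note on the mechanics referred to in questions 4, 7, and 9, the following is a minimal sketch of the resample-then-calculate pattern in R. It assumes a hypothetical toy vector `x` and uses the sample median as the statistic. Neither is part of this assignment's data or its expected answers, and the variable names are illustrative only.

```r
# Hypothetical toy data, unrelated to the admissions dataset.
set.seed(1)
x <- c(2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9)

R <- 1001                          # number of bootstrap resamples
boot_medians <- rep(NA_real_, R)   # pre-allocate the storage vector

for (r in seq_len(R)) {
  # Resample row indices with replacement, same size as the original data.
  idx <- sample(length(x), replace = TRUE)
  # Calculate the statistic (here, the median) on this resample.
  boot_medians[r] <- median(x[idx])
}

# Percentile interval: the middle 90% of the resampled statistics.
ci <- quantile(boot_medians, probs = c(0.05, 0.95))
ci
```

Resampling indices rather than values is one possible design choice. It carries over to data frames, where `df[idx, ]` keeps each row's columns together.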