https://classroom.github.com/a/Jl551RUZ

Due: 2020-05-08 by 11:59pm

Use the dataset hospital, which records various measurements about hospitals, sometimes collapsing a number of measurements about each hospital’s patients into one observation for that hospital; see the README.

Use infection_risk as your numeric response variable. Infection risk is an index, not a percentage or a probability. It’s a unitless representation of risk, in which higher numbers imply greater risk and lower numbers imply lesser risk.

Use region as a categorical explanatory variable. Be careful with the type of region: it identifies distinct regions of the US but is encoded numerically, so make sure it is treated as categorical rather than as a number.
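As a minimal sketch of the type issue described above, the following uses a small made-up data frame (the `toy` object and its values are hypothetical, not the real hospital data) to show how a numerically coded region can be converted to a factor in base R.

```r
# Toy stand-in for the hospital data; all values are made up for illustration.
toy <- data.frame(
  infection_risk = c(4.1, 1.6, 2.7, 5.6, 5.7, 5.1, 4.6, 5.4),
  region         = c(4, 2, 3, 4, 1, 2, 3, 1)
)

# Stored as numbers, region would enter a model as one numeric predictor.
is.numeric(toy$region)

# Converting to a factor makes each region code its own category.
toy$region <- factor(toy$region)
levels(toy$region)
```

Once region is a factor, plotting and modeling functions treat it as a set of groups instead of a quantity.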

Further, pick one numerical explanatory variable that you think will help explain infection risk. Use these three variables throughout.

1. Means by region.

1. Make transparent box plots with your response variable on the y-axis and the categorical variable on the x-axis.

2. Write a complete English sentence about the plot in the context of the data.

2. Simple Linear Regression.

1. Make a scatter plot of your two numerical variables, and put a line through the data using geom_smooth(...).

2. Write a complete English sentence about the plot in the context of the data.

3. Unique intercepts by region.

1. Fit a multiple linear regression model with unique intercepts by levels of the categorical explanatory variable region.

2. Make a scatter plot with lines through the data matching your model.

3. Write a complete English sentence about the plot/model in the context of the data.
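One way to write the unique-intercepts model from this part, assuming dummy (treatment) coding with region 1 as the baseline, which is R's default for factors. The symbols are notation chosen here for illustration: $y_i$ is infection risk, $x_i$ is the chosen numerical explanatory variable, and $R$ is the number of regions.

$$
y_i = \beta_0 + \beta_1 x_i + \sum_{r=2}^{R} \gamma_r \,\mathbb{1}[\text{region}_i = r] + \varepsilon_i
$$

Under this coding every region shares the slope $\beta_1$, while region $r$ has intercept $\beta_0 + \gamma_r$ (and region 1 has intercept $\beta_0$), so the fitted lines are parallel.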

4. Unique slopes by region.

1. Fit a multiple linear regression model with unique slopes, and only one intercept, by levels of the categorical explanatory variable region.

2. Which region has the largest (in absolute value) slope?

3. What does the largest slope indicate about that region? Interpret the largest slope in the context of the data.
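The unique-slopes model with a single intercept can be sketched as follows, again assuming treatment coding with region 1 as the baseline and using illustrative notation ($y_i$ for infection risk, $x_i$ for the numerical explanatory variable, $R$ for the number of regions):

$$
y_i = \beta_0 + \beta_1 x_i + \sum_{r=2}^{R} \delta_r \, x_i \,\mathbb{1}[\text{region}_i = r] + \varepsilon_i
$$

Under this coding the slope for region 1 is $\beta_1$ and the slope for region $r$ is $\beta_1 + \delta_r$, so comparing slopes in absolute value means comparing $|\beta_1 + \delta_r|$ across regions, not the offsets $\delta_r$ alone. All lines share the intercept $\beta_0$ at $x = 0$.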

5. Unique intercepts and slopes by region.

1. Fit a multiple linear regression model with unique intercepts and slopes by levels of the categorical explanatory variable region.

2. Interpret the intercept for region 2 in context of the data. If the interpretation does not make sense, explain why.

3. Interpret the slope, not just the slope offset, for region 4 in context of the data. You’ll need to do some quick math to determine the appropriate slope for region 4.

4. Choose a value within the range of your numerical explanatory variable along the x-axis, call it xnew. Create bootstrap confidence intervals for predictions at xnew for two different regions.

5. Interpret your two confidence intervals in context of the data.

6. Make an informative conclusion about the hospitals in the two different regions based on your confidence intervals.

7. Choose a value outside the range of your numerical explanatory variable along the x-axis, call it xnew_ex. What do we call predicting outside the range of our data?

8. Make a prediction at xnew_ex and interpret it in the context of the data. Does your prediction make sense? Why or why not?
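The steps in this part can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated (the `sim` data frame, the column `x`, and the helper `boot_ci` are all hypothetical), and the interval is a percentile bootstrap that resamples whole rows, which is one common approach and not necessarily the method used in the course.

```r
set.seed(1)

# Simulated stand-in for the hospital data; all values are made up.
n <- 120
sim <- data.frame(
  region = factor(sample(1:4, n, replace = TRUE)),
  x      = runif(n, 7, 15)  # hypothetical numerical explanatory variable
)
sim$infection_risk <- 1 + 0.3 * sim$x + 0.1 * as.integer(sim$region) +
  rnorm(n, sd = 0.8)

# Unique intercepts and slopes: x * region expands to x + region + x:region.
fit <- lm(infection_risk ~ x * region, data = sim)

# Slope for region 4 = baseline (region 1) slope + region 4 slope offset.
slope_r4 <- unname(coef(fit)["x"] + coef(fit)["x:region4"])

# Percentile bootstrap interval for the predicted mean at xnew in one region.
boot_ci <- function(data, xnew, region_level, B = 500, level = 0.95) {
  newdata <- data.frame(
    x      = xnew,
    region = factor(region_level, levels = levels(data$region))
  )
  preds <- replicate(B, {
    # Resample rows with replacement, refit, and predict at xnew.
    resampled <- data[sample(nrow(data), replace = TRUE), ]
    refit <- lm(infection_risk ~ x * region, data = resampled)
    unname(predict(refit, newdata = newdata))
  })
  alpha <- 1 - level
  quantile(preds, c(alpha / 2, 1 - alpha / 2))
}

xnew  <- 10
ci_r1 <- boot_ci(sim, xnew, "1")
ci_r4 <- boot_ci(sim, xnew, "4")

# Check whether a candidate value lies outside the observed range of x.
xnew_ex <- 25
outside_range <- xnew_ex < min(sim$x) || xnew_ex > max(sim$x)
```

Comparing `ci_r1` and `ci_r4` (for example, whether they overlap) is what supports a conclusion about the two regions at `xnew`; a prediction at `xnew_ex` relies on the fitted line holding where no hospitals were observed.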