## Let's try to predict average hospital infection risk using all the
## variables save infection_risk as explanatory variables. We'll use
## the data set https://roualdes.us/data/hospital.csv
hospital <- read.csv("https://roualdes.us/data/hospital.csv")
library(ggplot2)
library(dplyr)
## Calculate correlations amongst all the appropriate explanatory variables. Pick
## at least one variable to throw out and explain why it is reasonable
## to do so.
cor(hospital[,-6])
## I'd drop either nurses or beds; it doesn't much matter which. Since
## beds and nurses are so highly correlated, we won't need both of them
## to predict infection risk. Dropping one also guards against
## multicollinearity, which linear regression assumes is absent among
## the explanatory variables.
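As a quick numeric check on that claim (assuming the data set has `beds` and `nurses` columns, as used later in the model), the single pairwise correlation can be pulled out directly:

```r
## correlation between the two near-redundant explanatory variables
cor(hospital$beds, hospital$nurses)
```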
## Fit multiple linear regression with the remaining explanatory
## variables, fitting multiple intercepts and multiple slopes across
## the numerical, explanatory variable stay.
fit <- lm(infection_risk ~ factor(region) + stay + age + xray + beds +
stay:factor(region), data=hospital)
summary(fit)
## Check the assumptions of your linear model.
r <- rstandard(fit)
yhat <- fitted(fit)
qplot(yhat, r)
qplot(r, geom="histogram", binwidth=1/3)
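A normal quantile-quantile plot is another standard check on the normality assumption; a sketch using the same qplot interface (geom_qq_line assumes ggplot2 >= 3.0):

```r
## points falling near the line support the normality assumption
qplot(sample = r) + geom_qq_line()
```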
## Each coefficient is being tested with a default hypothesis test.
## Write out one example of this test in symbols.
## H0: beta_age = 0
## H1: beta_age != 0
## alpha = 0.05
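That default test shows up as the age row of the coefficient table; the last column is the two-sided p-value that gets compared to alpha:

```r
## estimate, standard error, t statistic, and p-value for age
summary(fit)$coefficients["age", ]
```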
## Why are only three of the four levels of region output? Interpret
## the coefficient estimate of region 1 in the context of these data.
## Region 1 is the baseline level, so its intercept and stay slope are
## absorbed into the overall intercept and the stay coefficient; only
## regions 2 through 4 get their own adjustment terms. The intercept
## estimate, 1.32, is the expected infection risk for a region 1
## hospital when all of the numerical explanatory variables are zero.
## Are there any regions for which an increase in a patient's stay
## significantly increases infection_risk? Explain.
## The stay slopes for regions 2 and 4 are not statistically
## significant; in those regions an increase in stay is not associated
## with a statistically significant change in infection risk. The
## remaining regions do show significant stay slopes.
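To see exactly which stay slopes are significant, the relevant rows of the coefficient table can be pulled out by name (a sketch; grep matches every coefficient whose name involves stay):

```r
ctab <- summary(fit)$coefficients
## the stay main effect plus the stay:region interaction rows
ctab[grep("stay", rownames(ctab)), ]
```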
## Interpret one of the statistically significant slopes in the
## context of these data.
## For each one unit increase in average patient xrays, we expect
## infection risk to go up 0.02 on average, holding the other
## explanatory variables fixed.
## Interpret one of the not statistically significant slopes in the
## context of these data. What does this tell us about this
## variable's ability to predict infection_risk?
## For every extra year of average patient age, the estimated change
## in infection risk is a decline of 0.015; however, this slope is not
## statistically different from zero, so age adds little to our
## ability to predict infection_risk in this model.
## Interpret the adjusted $R^2$ value in the context of these data.
## After adjusting for the number of explanatory variables, roughly
## 43% of the variation in infection risk is accounted for by this
## multiple regression model.
## Why is the $R^2$ value larger than the adjusted $R^2$?
## Because $R^2$ never decreases as explanatory variables are added,
## while adjusted $R^2$ penalizes the model for each additional
## explanatory variable.
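Both quantities are stored in the model summary, so the comparison can be made directly:

```r
s <- summary(fit)
s$r.squared      # plain R^2
s$adj.r.squared  # adjusted R^2, never larger than plain R^2
```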
## If you were to drop any explanatory variable(s) from the model,
## which would you drop first and why?
## Age would be dropped from this model first, because it has the
## largest p-value.
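One way to act on that is to refit without age and compare the two models; update() drops the term and anova() gives the F-test for the comparison (a sketch, with fit_noage a new name):

```r
fit_noage <- update(fit, . ~ . - age)
anova(fit_noage, fit)  # F-test: does age improve the fit?
```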
## Calculate confidence intervals for a slope and interpret it in
## context.
confint(fit)
## We are 95% confident that, holding the other variables fixed, each
## extra bed in a hospital is associated with an average increase in
## infection risk of between 0.0002 and 0.0025.
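A single interval can also be requested by coefficient name via confint's parm argument, instead of printing the whole table:

```r
confint(fit, "beds")  # 95% interval for the beds slope only
```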
## Predict the value for infection_risk for the following values of
## the explanatory variables:
## stay age xray beds region nurses
## 8.34 56.9 74 107 3 54
## Try using vectors c(...), *, and sum()
## Try using a data.frame data.frame(stay=8.34, age=56.9, ..., nurses=54) and
## predict.lm; you can get confidence intervals from this too.
## Coefficient order: intercept, region 2/3/4 dummies, stay, age,
## xray, beds, then the stay:region interactions; region 3 turns on
## the third dummy and its stay interaction.
x <- c(1, 0, 1, 0, 8.34, 56.9, 74, 107, 0, 8.34, 0)
sum(x * coef(fit))
predict(fit, newdata=hospital[3,])
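The data.frame route suggested above, spelled out with a confidence interval for the mean response (the column values are the ones listed in the comment; region is passed as 3 because the formula applies factor() itself):

```r
newobs <- data.frame(stay = 8.34, age = 56.9, xray = 74,
                     beds = 107, region = 3, nurses = 54)
predict(fit, newdata = newobs, interval = "confidence")
```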
## The above data is the third row of the hospital data. Did the
## model under or over predict?
hospital[3,"infection_risk"] - predict(fit, newdata=hospital[3,])
## The difference (observed minus predicted) is negative, so the model
## over-predicted.
## Calculate the residuals for the third observation. Does your
## answer match the third element of the residuals?
residuals(fit)[3]
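A direct check that the manual difference from above matches R's stored residual; unname() strips the row labels so all.equal compares only the values:

```r
manual <- hospital[3, "infection_risk"] - predict(fit, newdata = hospital[3, ])
all.equal(unname(manual), unname(residuals(fit)[3]))
```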