## Let's try to predict beak width using middle toe length and island
## the data set https://roualdes.us/data/finches.csv
library(tidyverse)
finch <- read.csv("https://roualdes.us/data/finches.csv")
## First, ensure there is statistically different mean beak widths
## amongst the islands. Write out the hypotheses and conclude the
## test, using alpha=0.05.
H0: mu_f = mu_scz = mu_scl
H1: at least one mean is different
alpha = 0.05
fit <- lm(beakwidth ~ island, data=finch)
anova(fit)
Because p-value = 0.04751 < 0.05, we reject H0. At least one mean
appears to be different from the rest.
## Calculate the correlation between beak width and middle toe length.
## Based on this calculation, what do you expect the sign of the slope
## coefficient to be?
with(finch, cor(beakwidth, middletoelength))
## from http://r4ds.had.co.nz/pipes.html#other-tools-from-magrittr
Because the correlation is positive, the sign of the slope coefficient
will be positive.
## Fit multiple linear regression using intercepts for each island and
## a shared (amongst all the islands) slope. Write the fitted
## regression equation.
fit <- lm(beakwidth ~ island + middletoelength, data=finch)
summary(fit)
hat{beakwidth} = -8.8 + 0.85*SCL + 0.24*SCZ + 0.97*middletoelength
## Check the assumptions of your linear model.
r <- rstandard(fit)
yhat <- fitted(fit)
qplot(yhat, r) # linearity looks, OK at best, because there is not entirely random scatter, but not an obvious pattern. constant variation is just so so, but it's hard to tell with the linearity not looking great.
qplot(r, geom="histogram", binwidth=1/3) # the residuals look fine because there is no extreme skew.
## Why are only two of the three levels of island obviously displayed
## in the output?
Only two of the levels of island are displayed because the first
level, floreana, was dumped into the intercept term.
## Interpret the coefficient estimate of floreana in the context of
## these data.
When middle toe length is equal to zero, the average beak width for
finches on the island Floreana is expected to be -8.8 mm. This
obviously cant happen, but has a literal interpretation nonetheless.
## Each coefficient is being tested with a default hypothesis test.
## Set up one such hypothesis test, using appropriate subscripts, and
## conclude the hypothesis test in statistical language.
H0: beta_SCZ = 0 H1: beta_SCZ != 0 alpha = 0.05 Because the p-value =
0.476803 > alpha = 0.05, we fail to reject H0. The offset for the
island Santa Cruz does not appear to be different from zero.
## Interpret the slope in the context of these data. Is this variable
## significantly different from zero?
For every 1 mm increase in middle toe length, the average beak width
increases by 0.97 mm for birds from all islands.
## Interpret the adjusted $R^2$ value in the context of these data.
55.87 percent of the varition in beak width is explained by this multiple linear regression model.
## Why is the $R^2$ value larger than the adjusted $R^2$?
Adjusted R^2 is lower than R^2 because adjusted R^2 penalizes the
model for having more variables, especially extra variables that dont
explain the response well.
## Predict average beak width for the island San Cruz when middle
## toe length is equal to 18.6.
## ?predict.lm # the examples within should help
estimates <- coef(fit)
sum(estimates*c(1, 0, 1, 18.6)
## or
newdf <- tibble(island="santacruz", middletoelength=18.6)
predict(fit, newdata=newdf)
## Calculate the residuals for the third observation of your original
## data. Does your answer match the third element of the residuals?
residuals(fit)[3]
finch[3,"beakwidth"] - sum(estimates*c(1, 0, 1, 18.6))
## Did you over/under predict the third residual?
Since the residual is negative we over-predicted.