MATH 314 Exam 02

Only these imports are allowed on Exam 02.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import patsy as pt
from scipy.optimize import minimize

Here are some functions you can use during Exam 02.

def lm(theta, X):
	# linear model: predictions are X @ theta
	return X @ theta

def mse(y, yhat):
	# mean squared error between observations y and predictions yhat
	d = y - yhat
	return np.mean(d * d)

def normal_ll(theta, data):
	# objective for the normal (least squares) model; minimizing this MSE is
	# equivalent to maximizing the normal log-likelihood with respect to theta
	y = data['y']
	X = data['X']
	yhat = lm(theta, X)
	return mse(y, yhat)

def windows(N, K):
	# split the indices 0..N-1 into K contiguous (start, end) windows;
	# the last window absorbs any remainder
	w = N // K
	ws = []
	for k in range(K):
		start = k * w
		end = (k + 1) * w
		if k == K - 1:
			end = N
		ws.append((start, end))
	return ws

  1. In general, we should prefer models that predict better on data the model has never seen / been trained on. A common way to measure such prediction accuracy, when one only has one dataset, is to perform $K$-fold cross validation. If $K=5$, then we will randomly split the dataset up into $5$ (roughly equal) folds. Then, cycling across all folds, train the model on $4$ out of $5$ folds, and measure prediction accuracy on the last (test) fold. Whichever model has lower average (across all test folds) prediction error is considered better.

    Use the hospital dataset and $K$-fold cross validation to determine which model is a better predictor of infection_risk. Consider a model that uses stay versus a model that uses stay and stay ** 2. Use $K=5$. Here's some starter code (one way to fill in the elided pieces is sketched after it):
    ...
    
    rng = np.random.default_rng()
    N = ...
    idx = rng.permutation(np.arange(N))
    
    K = 5
    mses = np.zeros(K)
    mse2s = np.zeros(K)
    ws = windows(N, K)
    
    for k in range(K):
    	b, e = ws[k]
    	jdx = idx[b:e] # test indices 
    	ndx = np.setdiff1d(idx, jdx) # train indices 
    
    	...
    
    (np.mean(mses), np.mean(mse2s))
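    One possible way to fill in the elided pieces, as a sketch only: it assumes the hospital data are in a DataFrame named df with columns infection_risk and stay, and that N, idx, ws, mses, and mse2s are set up as in the starter code above.
    # design matrices for the two candidate models (df is assumed to hold the hospital data)
    y1, X1 = pt.dmatrices('infection_risk ~ stay', df)
    y2, X2 = pt.dmatrices('infection_risk ~ stay + I(stay ** 2)', df)
    y1, y2 = np.asarray(y1).ravel(), np.asarray(y2).ravel()
    X1, X2 = np.asarray(X1), np.asarray(X2)
    
    for k in range(K):
    	b, e = ws[k]
    	jdx = idx[b:e]                # test indices
    	ndx = np.setdiff1d(idx, jdx)  # train indices
    
    	# fit each model on the training fold by minimizing its training MSE
    	fit1 = minimize(normal_ll, np.zeros(X1.shape[1]), args=({'y': y1[ndx], 'X': X1[ndx, :]},))
    	fit2 = minimize(normal_ll, np.zeros(X2.shape[1]), args=({'y': y2[ndx], 'X': X2[ndx, :]},))
    
    	# record prediction error on the held-out (test) fold
    	mses[k] = mse(y1[jdx], lm(fit1.x, X1[jdx, :]))
    	mse2s[k] = mse(y2[jdx], lm(fit2.x, X2[jdx, :]))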
  2. Write a class named LinearRegression that implements the following API
    lr = LinearRegression('infection_risk ~ stay', df)
    lr.fit()
    betas = lr.bootstrap()
    
    plt.scatter(betas[:, 0], betas[:, 1]);
    Your code should work for the example above. Further, for full credit, I will test your code during the exam with a simple change to the code above. Here's the structure of the class I recommend
    class LinearRegression():
    	def __init__(self, ptsystr, df):
    		...
    	def _lm(self, theta, X):
    		...
    	def _mse(self, y, yhat):
    		...
    	def fit(self, y = None, X = None):
    		...
    	def bootstrap(self, R = 1_000):
    		...
    During Exam 02, I'll ask you to recreate just one of the methods from the above class, and I'll supply the code for the other methods.
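    For orientation, here is a minimal sketch of how the methods could fit together, assuming the formula-and-DataFrame interface in the example above; it is one possible approach, not the required implementation.
    class LinearRegression():
    	def __init__(self, ptsystr, df):
    		# build the response vector and design matrix from the patsy formula
    		self.ptsystr = ptsystr
    		self.df = df
    		y, X = pt.dmatrices(ptsystr, df)
    		self.y = np.asarray(y).ravel()
    		self.X = np.asarray(X)
    	def _lm(self, theta, X):
    		# linear model predictions
    		return X @ theta
    	def _mse(self, y, yhat):
    		# mean squared error
    		d = y - yhat
    		return np.mean(d * d)
    	def fit(self, y = None, X = None):
    		# estimate coefficients by minimizing the MSE; defaults to the full dataset
    		y = self.y if y is None else y
    		X = self.X if X is None else X
    		res = minimize(lambda theta: self._mse(y, self._lm(theta, X)), np.zeros(X.shape[1]))
    		self.theta = res.x
    		return self.theta
    	def bootstrap(self, R = 1_000):
    		# resample rows with replacement R times and refit; returns an (R, p) array
    		# (note: each call to fit overwrites self.theta with the latest estimate)
    		rng = np.random.default_rng()
    		N = self.X.shape[0]
    		betas = np.zeros((R, self.X.shape[1]))
    		for r in range(R):
    			rdx = rng.integers(0, N, size = N)
    			betas[r, :] = self.fit(self.y[rdx], self.X[rdx, :])
    		return betas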
  3. Recreate the following plot. The black lines are bootstrapped $95\%$ confidence intervals and the black dots are medians ($50\%$ quantiles), which are also calculated from the bootstrap resampled coefficients. There are three distinct steps to recreate this plot:
    1. Bootstrap the appropriate model
    2. Calculate confidence intervals from the $R$ estimates of the bootstrapped coefficients $\beta$
    3. Make the plot -- hint: learn that you can loop over the output of df.groupby(...)
    During Exam 02, I'll ask you to recreate just one of these parts, and I'll supply the code for the other parts.
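    Because the exact model and grouping behind the plot are not reproduced here, the following is only a hypothetical sketch of steps 2 and 3: it assumes the bootstrap coefficient estimates already sit in an (R, p) array named betas and builds a placeholder summary table just to illustrate looping over the output of groupby.
    # 95% interval endpoints and medians, one per coefficient (columns of betas)
    lo, med, hi = np.quantile(betas, [0.025, 0.5, 0.975], axis = 0)
    summary = pd.DataFrame({'coef': np.arange(betas.shape[1]), 'lo': lo, 'med': med, 'hi': hi})
    
    fig, ax = plt.subplots()
    for name, g in summary.groupby('coef'):
    	ax.plot([name, name], [g['lo'].iloc[0], g['hi'].iloc[0]], color = 'black')  # 95% interval
    	ax.scatter(name, g['med'].iloc[0], color = 'black')                         # median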
  4. The Rayleigh distribution has density function $f(x | \sigma) = \frac{x}{\sigma^2}e^{-x^2 / (2\sigma^2)}$ for $\sigma > 0$. Use the following $21$ data to find the maximum likelihood estimator of $\sigma$ and to produce a $95\%$ confidence interval using the bootstrap method. [2.55949718, 4.52940383, 3.97518473, 3.38191934, 1.40503558, 1.52411053, 6.91676848, 1.45665845, 2.3602429, 1.60873959, 3.94374552, 6.22249149, 1.89894113, 5.20759437, 7.24625107, 2.80471966, 4.81060617, 3.08315807, 4.72002917, 1.2751267, 6.66914935]
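    One possible approach, sketched below: minimize the negative log-likelihood numerically (the optimizer, bounds, and starting value are illustrative choices), then form a percentile bootstrap interval.
    x = np.array([2.55949718, 4.52940383, 3.97518473, 3.38191934, 1.40503558,
                  1.52411053, 6.91676848, 1.45665845, 2.3602429, 1.60873959,
                  3.94374552, 6.22249149, 1.89894113, 5.20759437, 7.24625107,
                  2.80471966, 4.81060617, 3.08315807, 4.72002917, 1.2751267,
                  6.66914935])
    
    def rayleigh_nll(sigma, x):
    	# negative log-likelihood under the Rayleigh density given above
    	return -np.sum(np.log(x) - 2 * np.log(sigma) - x ** 2 / (2 * sigma ** 2))
    
    # a bounded optimizer keeps sigma away from values <= 0
    fit = minimize(rayleigh_nll, np.array([1.0]), args = (x,), method = 'L-BFGS-B', bounds = [(1e-6, None)])
    sigma_hat = fit.x[0]
    
    R = 1_000
    rng = np.random.default_rng()
    sigmas = np.zeros(R)
    for r in range(R):
    	xr = rng.choice(x, size = x.size, replace = True)  # resample the data with replacement
    	sigmas[r] = minimize(rayleigh_nll, np.array([sigma_hat]), args = (xr,), method = 'L-BFGS-B', bounds = [(1e-6, None)]).x[0]
    
    (sigma_hat, np.quantile(sigmas, [0.025, 0.975]))  # MLE and 95% bootstrap interval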
  5. Show, empirically (with code), that the procedure to create a $95\%$ confidence interval does indeed capture the expectation value in roughly $95\%$ of confidence intervals. Your code should repeat the following $R = 1\_000$ times (a sketch follows the list):
    1. generate $N = 501$ data from the exponential distribution with rate parameter $\lambda = 1$
    2. calculate confidence interval (not bootstrap)
    3. determine if the confidence interval captures the expectation, call this value c
    4. increment a running mean with c using the method defined in Homework 03
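    A minimal sketch of this simulation, assuming a normal-approximation (CLT) interval for the mean; the incremental running-mean update shown is the standard form, so adapt it if the Homework 03 version differs.
    rng = np.random.default_rng()
    R = 1_000
    N = 501
    coverage = 0.0
    
    for r in range(R):
    	x = rng.exponential(scale = 1.0, size = N)   # rate lambda = 1 (NumPy uses scale = 1 / lambda), so E[X] = 1
    	xbar = np.mean(x)
    	se = np.std(x, ddof = 1) / np.sqrt(N)        # standard error of the mean
    	lo, hi = xbar - 1.96 * se, xbar + 1.96 * se  # 95% CLT interval
    	c = (lo <= 1.0) and (1.0 <= hi)              # does the interval capture E[X] = 1?
    	coverage += (c - coverage) / (r + 1)         # running mean of the capture indicators
    
    coverage  # should land near 0.95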