pseudocode for bootstrap method

```
for s in S:
    generate random
    if condition:
        count += 1
count / sigma
```

Take the sample mean as an example. To get a confidence interval for a population mean from the sample mean, pick a lower bound at some percentile and an upper bound at some percentile such that we are 95% confident the population mean lies between them.

For sample means, most people construct confidence intervals as the estimate +/- 2 standard errors.

Is a symmetric confidence interval a useful quality? Edward: most people would assume the CLT holds, so it's symmetric, done. But I would argue it's not the most useful, because you're not guaranteed to have a symmetric sampling distribution.
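To make the comparison concrete, here is a minimal sketch that builds both kinds of interval for the same sample. The data are made up (a skewed exponential sample), and the variable names are only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical skewed sample, standing in for real data
x = rng.exponential(scale=10.0, size=30)

# Bootstrap the sample mean
R = 1001
means = np.full(R, np.nan)
for r in range(R):
    idx = rng.choice(x.size, x.size, replace=True)
    means[r] = x[idx].mean()

# Percentile interval: bounds come straight from the bootstrap distribution
lo_p, hi_p = np.percentile(means, [2.5, 97.5])

# Symmetric interval: sample mean +/- 2 standard errors
se = x.std(ddof=1) / np.sqrt(x.size)
lo_s, hi_s = x.mean() - 2 * se, x.mean() + 2 * se

print("percentile:", lo_p, hi_p)
print("+/- 2 SE:  ", lo_s, hi_s)
```

The +/- 2 standard error interval is symmetric around the sample mean by construction. The percentile interval is free to be lopsided if the bootstrap distribution is.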

**Announcements: HW12 is graded.**

We are now officially starting modeling in this class: the likelihood method via minimization, and the bootstrap method. We will use the simplified log likelihood for the normal distribution for the next 6 weeks of this class.

Quick review of the techniques.

Statistical modeling:

$N$ random variables

$X_1, \ldots, X_N \sim \text{Normal}(\mu, \sigma^{2})$

If we don't need to estimate $\sigma$, pretend all we're estimating is the population mean.

We are entirely ignoring $\sigma$; after taking the derivative, those pieces fall away.

Simplified log likelihood template:

$-\sum_{n=1}^{N} (x_n - \mu)^{2}$

The $\mu$ is the piece that will get changed.
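To see why the $\sigma$ pieces fall away, here is a short derivation. It assumes the observations are independent and writes $\log L$ for the full log likelihood (notation introduced here, not from the lecture).

```latex
\begin{aligned}
\log L(\mu, \sigma^{2})
  &= -\frac{N}{2}\log(2\pi\sigma^{2})
     - \frac{1}{2\sigma^{2}}\sum_{n=1}^{N}(x_n - \mu)^{2} \\
\frac{\partial}{\partial\mu}\log L(\mu, \sigma^{2})
  &= \frac{1}{\sigma^{2}}\sum_{n=1}^{N}(x_n - \mu) = 0
  \quad\Longrightarrow\quad
  \hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n
\end{aligned}
```

The first term does not involve $\mu$, and $1/(2\sigma^{2})$ is a positive constant, so the $\mu$ that maximizes the full log likelihood is the same $\mu$ that maximizes $-\sum_{n=1}^{N}(x_n - \mu)^{2}$.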

In [8]:

```
import numpy as np
import bplot as bp
from scipy.stats import norm
from scipy.optimize import minimize
import pandas as pd
```

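There is no leading negative sign in the code: `minimize` minimizes, and minimizing $\sum_{n=1}^{N} (x_n - \mu)^{2}$ is the same as maximizing the simplified log likelihood.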

In [9]:

```
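def ll_normal(mu, X):
    # simplified log likelihood for the normal, with the sign flipped for minimize
    d = X - mu
    return np.sum(d * d)  # d * d instead of d ** 2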
```

Powers are computationally expensive on computers, so use multiplication instead. Be efficient. I don't care what language you're in.
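As a quick sanity check, the sketch below confirms that the `d * d` form gives the same value as the power form, and that minimizing it lands on the sample mean. The data are simulated, not from the course data sets.

```python
import numpy as np
from scipy.optimize import minimize

def ll_normal(mu, X):
    d = X - mu
    return np.sum(d * d)  # multiplication instead of d ** 2

rng = np.random.default_rng(1)
X = rng.normal(loc=60.0, scale=15.0, size=200)  # made-up data

# Both forms compute the same sum of squares
same = np.isclose(ll_normal(55.0, X), np.sum((X - 55.0) ** 2))

# Minimizing the sum of squares should land on the sample mean
fit = minimize(ll_normal, 50.0, args=(X,), method="BFGS")
print(same, fit.x[0], X.mean())
```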

pandas allows us to access real data. The repo for the data sets is `github.com/roualdes/data`.

A **data frame** is the standard name for a 2-D table that holds your data in Python. The difference between a 2-D array and a data frame is that a data frame has *named columns*, similar to *SQL* tables.
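A small sketch of that difference, using made-up numbers and hypothetical column names (`storeA`, `storeB`):

```python
import numpy as np
import pandas as pd

# Made-up prices, only for illustration
arr = np.array([[27.67, 27.95],
                [40.59, 31.14],
                [31.68, 32.00]])

# Same numbers, but each column gets a name
toy = pd.DataFrame(arr, columns=["storeA", "storeB"])

print(arr[:, 0])       # array: column selected by position
print(toy["storeA"])   # data frame: column selected by name
```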

73 rows, each row is a new book

Any time you want to download a data set, download the `*.csv` file. Think of it as a small database table, and read it into Python.

In [10]:

```
df = pd.read_csv("https://raw.githubusercontent.com/roualdes/data/master/books.csv")
```

`head()` prints the first 5 observations in the data set by default; here we ask for 8. Each row is a book.

In [11]:

```
df.head(8) # first 8 books in the data set
```

Out[11]:

Python has a function named `dir()`. I don't know what it stands for. It makes a list of all the members and methods of the object you call it on.

Konstantin: perhaps it is similar to `ls` in Linux. That is a listing of directories, so `dir()` is a listing of the members of an object.
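For a feel of what `dir()` gives back, here is a small sketch on a NumPy array. Filtering out names that start with an underscore is just one convenient way to cut the list down.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# dir() returns a list of attribute and method names as strings
names = dir(x)

# Names starting with an underscore are mostly internal, so filter them out
public = [n for n in names if not n.startswith("_")]

print(public[:10])
print("mean" in public)
```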

In [ ]:

```
dir(df)
```

In [ ]:

```
help(pd.read_csv)
```

Selecting a column returns a pandas `Series` object, which you can think of as a column. A Series wraps a NumPy array.

In [13]:

```
df['uclaNew']
```

Out[13]:

You can perform vectorized math on columns in data frames.

In [14]:

```
df['uclaNew'] + 2
```

Out[14]:

Can you think of data frames as holding multiple columns of variables? Yes. Are they stored in column major order? I believe they are actually stored in row major order. `numpy` is row major by default. Other mathematical libraries use column major order.
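The NumPy half of that statement can be checked directly. The sketch below only inspects NumPy arrays; it says nothing about how pandas lays out a data frame internally.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # NumPy's default layout is row major (C order)
f = np.asfortranarray(a)         # same values, column major (Fortran order)

print(a.flags["C_CONTIGUOUS"], a.flags["F_CONTIGUOUS"])
print(f.flags["C_CONTIGUOUS"], f.flags["F_CONTIGUOUS"])

# Reading the elements in memory order shows the difference
print(a.ravel(order="K"))
print(f.ravel(order="K"))
```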

In [15]:

```
df.columns
```

Out[15]:

Take this new array of vars, and throw it into the bootstrap procedure.

In [16]:

```
df['uclaNew']
df.shape
```

Out[16]:

Data frames are 2-D arrays; the shape here is 73 rows by 3 columns. Rows first, then columns. `shape` returns a tuple.

In [18]:

```
N = df.shape[0] # first element of the tuple
N
```

Out[18]:

Resample $N$ observations from the original data, where $N$ is the length of the original data, with replacement (`replace=True`).
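A minimal sketch of one resample on a tiny made-up array, to show what the index draw looks like:

```python
import numpy as np

np.random.seed(0)  # only so the sketch is reproducible

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # made-up data
N = x.size

# First N: sample from the integers 0..N-1; second N: how many to draw
idx = np.random.choice(N, N, replace=True)
resample = x[idx]

print(idx)
print(resample)
```

Because the draw is with replacement, some observations can show up more than once in a resample and others not at all.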

In [20]:

```
R = 1001
mus = np.full(R, np.nan)
for r in range(R):
    # N twice is intentional: draw N indices from 0..N-1
    idx = np.random.choice(N, N, replace=True)
    # 50 is a reasonable starting guess for the price of a textbook
    # our array of random variables is a column of the data frame
    # mu can be -inf to +inf, so there are no bounds on that param;
    # BFGS can be used when there are no bounds
    tmp = minimize(ll_normal, 50, args=(df['uclaNew'][idx],), method="BFGS")
    mus[r] = tmp.x[0]  # don't forget the x
tmp
```

Out[20]:

`df['uclaNew']` is a pandas Series wrapping a NumPy array. `df['uclaNew'][idx]` returns the elements at the indices in `idx`.

`minimize` returns a `scipy.optimize.OptimizeResult`, which is actually a dictionary. The `x` entry in that dictionary contains the best guess found by the computation. There are also other fields in the `OptimizeResult`; see `help(minimize)`.
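A small sketch of poking at the result object. The function `sum_sq` and the four data values are made up for illustration; it has the same form as `ll_normal`.

```python
import numpy as np
from scipy.optimize import minimize

def sum_sq(mu, X):  # hypothetical stand-in with the same form as ll_normal
    d = X - mu
    return np.sum(d * d)

X = np.array([48.0, 52.0, 61.0, 75.0])  # made-up prices
res = minimize(sum_sq, 50.0, args=(X,), method="BFGS")

print(res.x)               # attribute access
print(res["x"])            # dictionary access, same array
print(res.fun)             # objective value at the solution
print(sorted(res.keys()))  # the other fields that are available
```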

In [21]:

```
R = 1001
N = df.shape[0] # first element of the tuple
mus = np.full(R, np.nan)
for r in range(R):
    idx = np.random.choice(N, N, replace=True)
    mus[r] = minimize(ll_normal, 50, args=(df['uclaNew'][idx],), method="BFGS").x[0] # don't forget the x
```

In [22]:

```
bp.density(mus)
bp.percentile_h(mus, y=0)
```

Out[22]:

In [23]:

```
mus.mean()
```

Out[23]:

An estimate of the true mean price of a textbook at *UCLA*.

In [24]:

```
bp.density(df['uclaNew'])
bp.percentile_h(mus, y=0)
```

Out[24]:

This curve is the density of the original sample. The density plot tells us about the **individual** book prices.

The confidence interval tells us about the **mean** book price.

The confidence interval in this plot comes from the *sample statistics*. It is **NOT** a confidence interval for individual book prices. Do not confuse these two!

We are 95% confident that the mean is around that price.
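The gap between the two can be seen numerically. This sketch uses simulated prices (a gamma sample of size 73, chosen only to mimic the row count), not the books data.

```python
import numpy as np

rng = np.random.default_rng(2)
prices = rng.gamma(shape=4.0, scale=15.0, size=73)  # made-up prices

# Bootstrap distribution of the sample mean
R = 1001
means = np.full(R, np.nan)
for r in range(R):
    idx = rng.choice(prices.size, prices.size, replace=True)
    means[r] = prices[idx].mean()

# Middle 95% of individual prices vs. 95% interval for the mean price
data_lo, data_hi = np.percentile(prices, [2.5, 97.5])
mean_lo, mean_hi = np.percentile(means, [2.5, 97.5])

print("individual prices:", data_lo, data_hi)
print("mean price:       ", mean_lo, mean_hi)
```

The interval for the mean is much narrower than the spread of individual prices, which is why the two should not be read interchangeably.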

Could we develop a function `bootstrap()` that will work with `minimize()`?

`initval` is the initial guess. It is a default parameter: if none is supplied, a random guess is used. Repeat the bootstrap with the new best guess to avoid a local minimum.

In [26]:

```
def min(data, initval=None):
    if initval is None:  # if no argument is provided
        initval = np.random.normal()
    # don't forget the x; ['x'] is also valid syntax
    return minimize(ll_normal, initval, args=(data,), method="BFGS").x[0]

def bootstrap(data, R, fun):
    N = data.size
    thetas = np.full(R, np.nan)
    for r in range(R):
        idx = np.random.choice(N, N, replace=True)
        thetas[r] = fun(data[idx])
    # the 2.5th and 97.5th percentiles bound a 95% interval
    return np.percentile(thetas, [2.5, 97.5])

R = 1001
bootstrap(df['uclaNew'], R, min)
```

Out[26]:

Here is a 95% confidence interval!

To repeat the analysis:

In [27]:

```
R = 1001
bootstrap(df['amazNew'], R, min)
```

Out[27]:

Amazon's book prices are cheaper than the university bookstore's.

`min()` is generally a bad choice of name, though, since it is common and is used in different contexts. It also shadows Python's built-in `min()`.
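One way around the naming problem is a more descriptive name. The sketch below uses `fit_mean`, a hypothetical name not from the lecture, on a few made-up prices, and leaves the built-in `min()` untouched.

```python
import numpy as np
from scipy.optimize import minimize

def ll_normal(mu, X):
    d = X - mu
    return np.sum(d * d)

def fit_mean(data, initval=None):  # hypothetical replacement name for min()
    if initval is None:            # no starting value supplied
        initval = np.random.normal()
    return minimize(ll_normal, initval, args=(data,), method="BFGS").x[0]

data = np.array([27.67, 40.59, 31.68, 16.00, 18.95])  # made-up prices

est = fit_mean(data)
smallest = min(data)  # the built-in min() still works as usual

print(est, data.mean(), smallest)
```

Checking `initval is None` rather than `not initval` means a starting value of `0` is treated as a real guess instead of as missing.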