MATH 385 Week 11 Worksheet

Please submit one Jupyter notebook (worksheet11_solutions.ipynb) to your Week 11 GitHub repository by 11:59pm Pacific time on Friday, November 8.

Welford's online algorithm for computing the mean and variance (and therefore the standard deviation, if desired) works with just one data point at a time, as if the data were streaming in and you didn't have the memory or ability to store past data.

I prefer to write the math a bit differently than the Wikipedia page linked above, but to the same effect. Initialize $N = 0, \; m = 0, \; v = 0$. Update these variables with each new observation $x$:
$\def\pluseq{\mathrel{+}=}\begin{align*}N & \pluseq 1\\w &= 1 / N\\d &= x - m\\m & \pluseq d * w\\v & \pluseq -v * w + d ^ 2 * w * (1 - w)\end{align*}$
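As a quick sanity check of the update rule above, here is a minimal sketch of a single scalar update written as a plain function. The name welford_step and the sample data are made up for illustration and are not part of the required API. The result is compared against Python's statistics module, which stores the whole list.

```python
import statistics


def welford_step(N, m, v, x):
    """Apply one update with observation x; return the new (N, m, v)."""
    N += 1
    w = 1 / N
    d = x - m  # deviation of x from the mean *before* this update
    m += d * w
    v += -v * w + d**2 * w * (1 - w)
    return N, m, v


# Hypothetical sample data, fed in one observation at a time.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
N, m, v = 0, 0.0, 0.0
for x in data:
    N, m, v = welford_step(N, m, v, x)

# Each pair should agree up to floating-point rounding.
print(m, statistics.fmean(data))                    # running mean
print(v, statistics.pvariance(data))                # biased variance
print(v * N / (N - 1), statistics.variance(data))   # unbiased variance
```

Note that d must be computed before m is updated; swapping those two lines changes the result.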
Write a Python class called OnlineMeanVar, which implements the following API:
```
om = OnlineMeanVar()
om.update(1)
om.update(2)
om.update(3)
om.mean()
om.var()
om.var_biased()
om.count() # the number of times update() has been called
om.size()
```
The method om.var_biased() should return the value of v. The method om.var() should return
$v * \text{om.count()} / (\text{om.count() - 1})$
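To check your implementation by hand, it may help to trace the update equations through the three observations used in the API example, $x = 1, 2, 3$. Each row shows the values after that observation has been processed:
$\begin{array}{c|ccccc}x & N & w & d & m & v\\\hline 1 & 1 & 1 & 1 & 1 & 0\\2 & 2 & 1/2 & 1 & 3/2 & 1/4\\3 & 3 & 1/3 & 3/2 & 2 & 2/3\end{array}$
So after the three updates, om.mean() should be close to $2$, om.var_biased() close to $2/3$, and om.var() close to $(2/3) * 3 / 2 = 1$, up to floating-point rounding. Note that the formula for om.var() divides by zero when om.count() is $1$.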
The constructor should accept a size argument, which defaults to $1$, that sets the number of means and variances to be tracked. If size is bigger than $1$, then om.update(x)'s argument x should be a $1$-dimensional array of length size. Further, if size is bigger than $1$, then om.mean(), om.var(), and om.var_biased() should all return $1$-dimensional arrays of length size.
Use the class above to construct a convergence path plot of $101$ paths for each of two different calculations of the variance, var() and var_biased(). Thus, your convergence path plot should have $202$ total paths. Each path should consist of $100$ data points.
- You can either initialize a dataframe with the appropriate columns and then use plotnine, or use matplotlib, like I did in our class notes.
- Generate fake data from whichever distribution you want.
- With a proper design for the plot above, you should never need nested Python loops to answer this question.
- Color the paths based on which variance calculation is used.
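One way to see why a single loop suffices is sketched below. It applies the update equations directly to NumPy arrays rather than using OnlineMeanVar, so that all $101$ paths advance together at each step. The variable names, the seed, and the choice of a normal distribution are assumptions made for illustration only, and plotting is left out.

```python
import numpy as np

rng = np.random.default_rng(385)  # arbitrary seed, for reproducibility only
n_steps, n_paths = 100, 101

# Fake data: one row per time step, one column per path.
data = rng.normal(loc=0.0, scale=2.0, size=(n_steps, n_paths))

N = 0
m = np.zeros(n_paths)
v = np.zeros(n_paths)
biased = np.empty((n_steps, n_paths))  # biased[t] holds v after t + 1 updates

for x in data:  # the only loop: x is a 1-dimensional array of length n_paths
    N += 1
    w = 1 / N
    d = x - m
    m += d * w
    v += -v * w + d**2 * w * (1 - w)
    biased[N - 1] = v

# Rescale every row at once to get the unbiased paths.
counts = np.arange(1, n_steps + 1)[:, None]
with np.errstate(divide="ignore", invalid="ignore"):
    unbiased = biased * counts / (counts - 1)  # first row is nan (count of 1)
```

Each column of biased and unbiased is then one convergence path. matplotlib's plot function draws one line per column when given a 2-dimensional array, so each of the two arrays can be drawn with a single call and its own color.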
Do you notice any differences in the convergence paths between the two estimates of the variance? If so, what do you notice?