MATH 385 Week 11 Worksheet

  1. Welford's online algorithm for computing the mean and variance (and therefore standard deviation, if desired) works with just one data point at a time, as if the data were streaming in and you didn't have the memory or ability to store past data.

    I prefer to write the math a bit differently than the Wikipedia page linked above, but to the same effect. Initialize $N = 0, \; m = 0, \; v = 0$. Update these variables with each new observation $x$:

    \def\pluseq{\mathrel{+}=}
    \begin{align*}
    N & \pluseq 1\\
    w &= 1 / N\\
    d &= x - m\\
    m & \pluseq d * w\\
    v & \pluseq -v * w + d ^ 2 * w * (1 - w)
    \end{align*}
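As a minimal sketch of the update rule above for a single scalar stream: the function name `welford_update` is hypothetical (it is not part of the required API), and the state is passed around as plain variables rather than stored in a class.

```python
def welford_update(N, m, v, x):
    """Apply one streaming update and return the new (N, m, v).

    Hypothetical helper for illustration; not the OnlineMeanVar API.
    """
    N += 1
    w = 1 / N
    d = x - m                          # deviation from the previous mean
    m += d * w                         # running mean
    v += -v * w + d**2 * w * (1 - w)   # running (biased) variance
    return N, m, v


# Feed the observations 1, 2, 3 one at a time.
N, m, v = 0, 0.0, 0.0
for x in [1, 2, 3]:
    N, m, v = welford_update(N, m, v, x)
```

After these three updates, `m` is the mean of the data (2), `v` is the biased variance (2/3), and `v * N / (N - 1)` is the unbiased variance (1), up to floating-point rounding.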

    Write a Python class called OnlineMeanVar, which implements the following API:

    om = OnlineMeanVar()
    om.update(1)
    om.update(2)
    om.update(3)
    om.mean()
    om.var()
    om.var_biased()
    om.count() # the number of times update() has been called
    om.size()
    

    The method om.var_biased() should return the value of v. The method om.var() should return

    $v * \text{om.count()} / (\text{om.count()} - 1)$

    The constructor should accept a size argument, which defaults to 1, that sets the number of means and variances to be tracked. If size is bigger than 1, then om.update(x)'s argument x should be a 1-dimensional array of length size. Further, if size is bigger than 1, then om.mean(), om.var(), and om.var_biased() should all return 1-dimensional arrays of length size.
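One way to picture the size > 1 case, assuming NumPy arrays hold the state: the same update runs elementwise, so each entry of the array tracks its own mean and variance. This is only a sketch of the arithmetic with made-up data, not the class itself.

```python
import numpy as np

size = 3
N = 0
m = np.zeros(size)   # one running mean per tracked quantity
v = np.zeros(size)   # one running biased variance per tracked quantity

# Made-up data: each row is one observation x, a 1-d array of length `size`.
data = np.array([[1.0, 10.0, -1.0],
                 [2.0, 20.0, -1.0],
                 [3.0, 30.0, -1.0]])

for x in data:
    N += 1
    w = 1 / N
    d = x - m                          # elementwise deviation
    m += d * w
    v += -v * w + d**2 * w * (1 - w)
```

Here `m` and `v` stay 1-dimensional arrays of length `size`, and they should agree with `data.mean(axis=0)` and `data.var(axis=0)` up to rounding.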

  2. Use the class above to construct a convergence path plot of 101 paths for each of two different calculations of the variance, var() and var_biased(). Thus, your convergence path plot should have 202 total paths. Each path should consist of 100 data points.

    • You can either initialize a dataframe with the appropriate columns and then use plotnine, or use matplotlib, like I did in our class notes.
    • Generate fake data from whichever distribution you want.
    • Proper design of the plot above ensures you should never need nested Python loops to answer this question.
    • Color the paths based on which variance calculation is used.
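The hints above can be sketched roughly as follows. To stay self-contained, this sketch inlines the update rule instead of using OnlineMeanVar, and it assumes standard normal fake data, a fixed seed, and NaN for the unbiased estimate after one observation (where the formula divides by zero). The names `variance_paths` and `plot_paths` are hypothetical. The single loop runs over the 100 data points; the 101 paths are handled by array arithmetic, and matplotlib draws one line per column of a 2-d array.

```python
import numpy as np


def variance_paths(n_points=100, n_paths=101, seed=0):
    """Return (biased, unbiased) arrays of shape (n_points, n_paths)."""
    rng = np.random.default_rng(seed)
    data = rng.standard_normal((n_points, n_paths))  # assumed distribution
    m = np.zeros(n_paths)
    v = np.zeros(n_paths)
    biased = np.empty((n_points, n_paths))
    unbiased = np.full((n_points, n_paths), np.nan)
    for i, x in enumerate(data):       # one loop: over data points only
        N = i + 1
        w = 1 / N
        d = x - m
        m += d * w
        v += -v * w + d**2 * w * (1 - w)
        biased[i] = v
        if N > 1:                      # unbiased estimate needs N >= 2
            unbiased[i] = v * N / (N - 1)
    return biased, unbiased


def plot_paths(biased, unbiased):
    """Draw all paths, colored by which variance calculation was used."""
    import matplotlib.pyplot as plt    # imported here so the math runs without it
    fig, ax = plt.subplots()
    steps = np.arange(1, biased.shape[0] + 1)
    ax.plot(steps, biased, color="tab:blue", alpha=0.3)      # one line per column
    ax.plot(steps, unbiased, color="tab:orange", alpha=0.3)
    ax.set_xlabel("number of observations")
    ax.set_ylabel("variance estimate")
    return fig


biased, unbiased = variance_paths()
```

Calling `plot_paths(biased, unbiased)` then produces the figure; the same two arrays could instead be reshaped into a long dataframe for plotnine.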

    Do you notice any differences in the convergence paths between the two estimates of the variance? If so, what do you notice?