MATH 385 Week 05 Worksheet

  1. Write a Python function mean(x) which accepts one NumPy array as its only argument, then computes and returns the mean of the array, $\frac{1}{N} \sum_{n=1}^N x_n$.
  2. Write a Python function std(x) which accepts one NumPy array as its only argument, then computes and returns the standard deviation of the array, $\sqrt{\frac{1}{N} \sum_{n=1}^N (x_n - \hat{\mu})^2}$, where $\hat{\mu} =$ mean(x).
  3. Write a Python function median(x) which accepts one NumPy array as its only argument, then computes and returns the median of the array. A simple version of the median can be calculated as follows. If x has an even number of values, return the mean of the two middle values of a sorted copy of x. If x has an odd number of values, return the middle value of a sorted copy of x.
  4. Write a Python function mad(x) which accepts one NumPy array as its only argument, then computes and returns the median absolute deviation of the array. The median absolute deviation of an array can be calculated as follows. Calculate and store the median of x. Return the median of the absolute values of the differences between the elements of x and that median.
  5. Write a Python function dot(x, y) which accepts two NumPy arrays of the same length, then computes and returns the dot product $x \cdot y = \sum_{n=1}^N x_n y_n$.
  6. Write a Python function norm(x) which accepts one NumPy array, then computes and returns the norm $\|x\| = \sqrt{\sum_{n=1}^N x_n^2}$.
  7. Write a Python function cosine_similarity(x, y) which accepts two NumPy arrays, then computes and returns $\text{cosine\_similarity}(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$. This is an increasingly popular function in modern large language models. Here's a relatively simple read about this calculation, albeit written in JavaScript: How does cosine similarity work?
  8. Sampling uniformly from a stream of numbers, when you cannot store more than one of the numbers at a time (for lack of memory or otherwise), takes a clever trick.

    Write a Python class called OnlineUniformSampler, which implements the following API:

    ou = OnlineUniformSampler(95928)
    ou.update(1)
    ou.update(2)
    ou.update(3)
    ou.sample()
    ou.count()

    The method sample() should return one value, chosen uniformly at random, from among all the values passed to the possibly many calls of update().

    The method update() should implement an algorithm based on the following description. For the first value passed to update(), store it. For the $i$-th call to update(x_i), replace the stored value with the passed-in value x_i with probability $1/i$. Not every call to update() will replace the stored value. To replace a stored value $u$ with the $i$-th value $x$ with probability $p$, use code like

    rng = np.random.default_rng(seed)
    if rng.uniform() <= p:
        u = x
    

    where the variable rng should be created once and re-used within this class, rather than re-created on every call to update().

  9. Use Pandas to read the CSV dataset about penguins at the following URL into a DataFrame: https://raw.githubusercontent.com/roualdes/data/master/penguins.csv
  10. Print the top 6 rows of this DataFrame.
  11. Print the bottom 8 rows of this DataFrame.
  12. How many columns are in this DataFrame? How many rows? Instead of writing out numbers, show me the answers to these questions with Python code.
  13. What type is your DataFrame? What type is each of the columns? What are the types of the elements of each column? Instead of writing out the answers, show me the answers to these questions with Python code.
  14. What is the type object all about?
  15. Do your NumPy functions above work on Pandas Series? Why?
  16. Calculate the median bill_length_mm of all penguins whose bill_length_mm is within plus/minus one standard deviation of the mean bill length.
  17. Calculate the cosine similarity between bill_length_mm and bill_depth_mm.
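
The update rule described in item 8 can be sketched as follows. This is one possible reading of the description, not a reference solution. It assumes that the constructor argument (95928 in the example API) is the seed for the random number generator, and that count() returns the number of calls made to update(); the worksheet does not spell out either point. The attribute names n and value are illustrative only.

```python
import numpy as np


class OnlineUniformSampler:
    """Keep one uniformly chosen value from a stream (a sketch under stated assumptions)."""

    def __init__(self, seed):
        # Assumption: the constructor argument is the seed.
        # The generator is created once and re-used by every call to update().
        self.rng = np.random.default_rng(seed)
        self.n = 0         # number of values seen so far
        self.value = None  # the single value currently stored

    def update(self, x):
        self.n += 1
        # The i-th value replaces the stored one with probability 1/i.
        # For i = 1 the probability is 1, so the first value is always stored.
        if self.rng.uniform() <= 1 / self.n:
            self.value = x

    def sample(self):
        return self.value

    def count(self):
        # Assumption: count() reports how many values have been passed to update().
        return self.n


# Rough check: across many independent runs over the stream 1, 2, 3,
# each value should end up stored in roughly one third of the runs.
counts = {1: 0, 2: 0, 3: 0}
for seed in range(3000):
    ou = OnlineUniformSampler(seed)
    for v in (1, 2, 3):
        ou.update(v)
    counts[ou.sample()] += 1
print(counts)
```

Why this gives a uniform sample: the $i$-th value is stored with probability $1/i$ and then survives each later call $j$ with probability $1 - 1/j$, so after $N$ calls it is still stored with probability $\frac{1}{i} \cdot \frac{i}{i+1} \cdots \frac{N-1}{N} = \frac{1}{N}$.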