MATH 385 Project Ideas

  • combine at least 3 Worksheet topics into a single analysis of a data set we didn't consider in class;
  • make a personal website, with at least three webpages, using GitHub pages;
  • perform a data analysis using data science related tools not covered in our class, on a data set we didn't consider in class;
  • write and host a tutorial on a new (not covered in our class) data science related topic;
  • learn a new (not covered in our class) data science related piece of software and host an introduction to it;
  • Explore and make a Python package using the ctypes library.
  • Write a tutorial for the relatively new dataframe library Ibis. Why is it different / better / worse than Pandas?
  • Look into SIMD processing and write a tutorial about the idea, how Python/Numpy uses it, and the speed gains derived from it.
  • SQL vs NoSQL...
  • Python 3.11 ditched TimSort in favor of PowerSort. Teach us about these algorithms.
  • Learn about Writing custom array containers and Subclassing ndarray in NumPy, write a blog post built around relatively simple examples, and then explain how these ideas carry over to pandas.Series when np.mean() is called on one.
  • Explore and develop some novel plots using either of the more popular dynamic/interactive plotting packages in Python: Bokeh and/or Plotly.
  • Compete in the MTA Open Data Challenge and share your analysis.
  • Perform a speed comparison between pandas.Series.apply(...) and/or pandas.Series.str.x and treating the series as a numpy array.
  • The plotnine documentation is lacking examples across many pages, e.g. scale_y_log10(), scale_color_brewer(), geom_jitter(), and geom_spoke(). The plotnine package is just an example here. The project idea is to contribute to at least two of the packages we've used in class, in not insignificant ways. Please consult with Edward about what counts as an insignificant contribution to a package.
  • A/B testing
  • Learn Stan
  • Association Rules
  • Get a local LLM running on your machine, train it on some new task, and show how effective your training was.
  • DVC
  • PySpark
  • Dash
  • H2O Python Module
  • DuckDB
  • Resume filtering via embeddings: Here's a reasonable introduction to embeddings -- Embeddings: What they are and why they matter
  • Polars for Python.
  • Great Tables for making tables look pretty enough for publication.
  • Marimo, a reactive alternative to Jupyter Lab.
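
For the ctypes idea in the list above, a possible starting point is sketched below. It is only a minimal illustration, assuming a POSIX-like system where the C standard library can be loaded; the wrapper name c_strlen is hypothetical and chosen just for this example. A real project would wrap your own compiled C code rather than libc.

```python
import ctypes
import ctypes.util

# Assumption: a POSIX-like system. find_library("c") may return None on
# some platforms; on POSIX, CDLL(None) still exposes the libc symbols.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text: str) -> int:
    """Hypothetical wrapper: byte length of text as computed by C's strlen."""
    return libc.strlen(text.encode("utf-8"))


if __name__ == "__main__":
    print(c_strlen("MATH 385"))
```

Declaring argtypes and restype up front lets ctypes check and convert arguments instead of guessing, which is usually the first design decision when packaging a C function for Python users.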