MATH 385 Project Ideas
- combine at least 3 Worksheet topics into a single analysis
of a data set we didn't consider in class;
- make a personal website, with at least three webpages, using
GitHub pages;
- perform a data analysis using new (not covered in our class)
data science related tools, on a data set we didn't consider
in class;
- write and host a tutorial of a new (not covered in our
class) data science related topic;
- learn a new (not covered in our class) data science related
piece of software and host an introduction to it;
- Explore and make a Python package using the
ctypes library.
- Write a tutorial for the relatively new dataframe library
Ibis. How is it different from, better than, or worse than Pandas?
- Look into SIMD processing and write a tutorial about the idea, how
Python/Numpy uses it, and the speed gains derived from it.
- SQL vs NoSQL...
- Python 3.11 replaced TimSort's merge policy with that of
PowerSort. Teach us about these algorithms.
- Learn about "Writing custom array containers" and
"Subclassing ndarray" in Numpy, write a blog post built on
relatively simple examples, and then explain how these ideas carry
over to pandas.Series when np.mean() is called on one.
- Explore and develop some novel plots using either of the more
popular dynamic/interactive plotting packages in Python:
Bokeh and/or Plotly.
- Compete in the MTA Open Data Challenge and
share your analysis.
- Perform a speed comparison of pandas.Series.apply(...)
and/or pandas.Series.str.x against treating the series as a
numpy array.
- The plotnine documentation is lacking examples across many pages,
e.g. scale_y_log10(), scale_color_brewer(), geom_jitter(), and
geom_spoke(). The plotnine package is just an example here. The
project idea is to contribute to at least two of the packages we've
used in class, in not insignificant ways. Please consult with Edward
about what counts as an insignificant contribution to a package.
- A/B testing
- Learn Stan
- Association Rules
- Get a local LLM running on your machine, train it on some
new task, and show how effective your training was.
- DVC
- PySpark
- Dash
- H2O Python Module
- DuckDB
- Resume filtering via embeddings: Here's a reasonable introduction to embeddings -- Embeddings: What they are and why they matter
- Polars for Python.
- Great Tables for making tables look pretty enough for publication.
- Marimo, a reactive alternative to Jupyter Lab.
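
For anyone weighing the ctypes idea in the list above, here is a rough sketch of what the starting point might look like. It assumes a POSIX system (Linux or macOS) where the C standard library can be loaded by ctypes. The wrapper name c_strlen is hypothetical, chosen only for illustration. A real project would wrap your own compiled C code rather than libc.

```python
import ctypes
import ctypes.util

# Assumption: POSIX system. If find_library returns None, CDLL(None)
# exposes the symbols already loaded into the process, libc included.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text: str) -> int:
    """Return the byte length of text's UTF-8 encoding, computed by C strlen."""
    return libc.strlen(text.encode("utf-8"))


if __name__ == "__main__":
    print(c_strlen("MATH 385"))
```

Declaring argtypes and restype up front lets ctypes check and convert arguments instead of guessing, which is the habit most worth carrying into a full package.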