In [1]:
import pandas as pd # DataFrames
import numpy as np

In [2]:
# https://www.kaggle.com/datasets/vodclickstream/netflix-audience-behaviour-uk-movies
df = pd.read_csv("https://roualdes.us/math314/uk_movies.csv")

You should think of DataFrames as two-dimensional arrays of heterogeneous types.

In [7]:
df.head()

,Unnamed: 0,datetime,duration,title,genres,release_date,movie_id,user_id
0,58773,2017-01-01 01:15:09,0.0,"Angus, Thongs and Perfect Snogging","Comedy, Drama, Romance",2008-07-25,26bd5987e8,1dea19f6fe
1,58774,2017-01-01 13:56:02,0.0,The Curse of Sleeping Beauty,"Fantasy, Horror, Mystery, Thriller",2016-06-02,f26ed2675e,544dcbc510
2,58775,2017-01-01 15:17:47,10530.0,London Has Fallen,"Action, Thriller",2016-03-04,f77e500e7a,7cbcc791bf
3,58776,2017-01-01 16:04:13,49.0,Vendetta,"Action, Drama",2015-06-12,c74aec7673,ebf43c36b6
4,58777,2017-01-01 19:16:37,0.0,The SpongeBob SquarePants Movie,"Animation, Action, Adventure, Comedy, Family, ...",2004-11-19,a80d6fc2aa,a57c992287


The property `shape` describes the number of rows and the number of columns.

In [8]:
df.shape

(671736, 8)

A bit more specifically, a DataFrame is nearly a `dict` of NumPy `array`s, where each array can have a different data type (`dtype`). The keys of a DataFrame are its column names.

In [5]:
df.columns

Index(['Unnamed: 0', 'datetime', 'duration', 'title', 'genres', 'release_date',
       'movie_id', 'user_id'],
      dtype='object')
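To make the `dict`-of-arrays picture concrete, here is a minimal sketch that builds a small DataFrame by hand. The name `toy` and all of its values are invented for illustration and are not taken from the Netflix data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each key becomes a column name,
# and each array keeps its own dtype.
toy = pd.DataFrame({
    "title": np.array(["Vendetta", "Ratter", "Pacific Rim"]),
    "duration": np.array([49.0, 0.0, 10530.0]),
    "views": np.array([2, 1, 5]),
})

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # the dict keys, in insertion order
print(toy.dtypes)         # one dtype per column
```

Just like a `dict`, the column names act as keys, so `toy["duration"]` looks up one column.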

There is a lot of less-than-optimal Pandas advice online, and LLMs have inherited it. You can extract a single column, which has a type like `pd.Series`, with code like

In [11]:
df["title"]

0                Angus, Thongs and Perfect Snogging
1                      The Curse of Sleeping Beauty
2                                 London Has Fallen
3                                          Vendetta
4                   The SpongeBob SquarePants Movie
                            ...                    
671731          Oprah Presents When They See Us Now
671732                                 HALO Legends
671733                                  Pacific Rim
671734    ReMastered: The Two Killings of Sam Cooke
671735                                   Chopsticks
Name: title, Length: 671736, dtype: object
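A related detail that often causes confusion is the difference between single and double brackets. The sketch below uses a hypothetical DataFrame named `toy` (not the Netflix data) to show that a single column name gives back a `pd.Series`, while a list of names gives back a `pd.DataFrame`.

```python
import pandas as pd

# Hypothetical toy data, not the Netflix data set.
toy = pd.DataFrame({
    "title": ["Vendetta", "Ratter", "Pacific Rim"],
    "user_id": ["u1", "u2", "u3"],
})

col = toy["title"]    # a single name: a one-dimensional pd.Series
sub = toy[["title"]]  # a list of names: a one-column pd.DataFrame

print(type(col).__name__, col.shape)
print(type(sub).__name__, sub.shape)
```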

But if you are trying to index into rows and columns of a DataFrame simultaneously, use the `loc` indexer.

In [13]:
df.loc[:10, ["title", "user_id"]]

,title,user_id
0,"Angus, Thongs and Perfect Snogging",1dea19f6fe
1,The Curse of Sleeping Beauty,544dcbc510
2,London Has Fallen,7cbcc791bf
3,Vendetta,ebf43c36b6
4,The SpongeBob SquarePants Movie,a57c992287
5,London Has Fallen,c5bf4f3f57
6,The Water Diviner,8e1be40e32
7,Angel of Christmas,892a51dee1
8,Ratter,cff8ea652a
9,The Book of Life,bf53608c70
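One detail worth knowing about `loc` is that it slices by label and includes both endpoints, whereas its positional counterpart `iloc` follows the usual Python convention and excludes the stop. The following sketch illustrates the difference on a hypothetical DataFrame named `toy`, whose values are made up.

```python
import pandas as pd

# Hypothetical toy data with the default integer index 0..5.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e", "f"],
    "user_id": ["u1", "u2", "u3", "u4", "u5", "u6"],
})

by_label = toy.loc[:3, ["title", "user_id"]]  # labels 0, 1, 2, 3
by_position = toy.iloc[:3, [0, 1]]            # positions 0, 1, 2

print(len(by_label))
print(len(by_position))
```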


There is all sorts of built-in functionality for working with columns of `str`s. From the code below, we learn that 28 rows, each a record of a `user_id` watching (at least some amount of) a movie, have the word "Vendetta" contained within the title.

In [16]:
idx = df["title"].str.contains("Vendetta")
np.sum(idx)

28
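By default `str.contains` is case-sensitive and returns a missing value wherever the string itself is missing. As a minimal sketch, assuming a made-up `pd.Series` named `titles`, the `case` and `na` parameters control both behaviours.

```python
import pandas as pd

# Hypothetical titles, including a missing value and a lowercase match.
titles = pd.Series(["Vendetta", "V for Vendetta", "vendetta road", None, "Ratter"])

# Case-sensitive match; na=False treats a missing title as "no match".
mask = titles.str.contains("Vendetta", na=False)

# case=False also matches "vendetta road".
mask_any_case = titles.str.contains("Vendetta", case=False, na=False)

print(mask.sum())
print(mask_any_case.sum())
```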

Using the `bool`ean `pd.Series` named `idx`, we can find the titles and users corresponding to the rows where the column `title` contains the word "Vendetta".

In [17]:
df.loc[idx, ["title", "user_id"]]

,title,user_id
3,Vendetta,ebf43c36b6
27,Vendetta,ebf43c36b6
3479,Vendetta,a2aeded38e
5739,Vendetta,93e9369e81
8334,Vendetta,3062e6ffc2
9595,Vendetta,95332496bb
17330,Vendetta,7b33f5ae6c
17904,Vendetta,058f6c24e9
17935,Vendetta,058f6c24e9
21454,Vendetta,93de7fca39
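The output above lists the user `ebf43c36b6` twice, so the number of matching rows is not the same as the number of distinct users. A small sketch of how one might count both, using a hypothetical DataFrame `toy` with invented values:

```python
import pandas as pd

# Hypothetical viewing records: user "u1" appears twice.
toy = pd.DataFrame({
    "title": ["Vendetta", "Ratter", "Vendetta", "Vendetta"],
    "user_id": ["u1", "u2", "u1", "u3"],
})

mask = toy["title"].str.contains("Vendetta")

n_rows = int(mask.sum())                      # matching viewing records
n_users = toy.loc[mask, "user_id"].nunique()  # distinct users among them

print(n_rows, n_users)
```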


The proportion of rows (viewing records) whose title contains the word "Vendetta" is

In [19]:
p = np.sum(idx) / df.shape[0]
p

4.168304214751033e-05

Or we could say, based on these data, the probability that a randomly chosen viewing record from the UK is of a movie with the word "Vendetta" in the title is

In [20]:
p

4.168304214751033e-05
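Because `True` counts as 1 and `False` as 0, the proportion `np.sum(idx) / df.shape[0]` is the same as the mean of the boolean mask. The sketch below checks this on a hypothetical mask with made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical mask: 2 matches out of 8 rows.
mask = pd.Series([True, False, False, True, False, False, False, False])

p_sum = np.sum(mask) / mask.shape[0]  # count of True divided by number of rows
p_mean = mask.mean()                  # the same proportion in one step

print(p_sum, p_mean)
```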

We could also find the users that watched both a movie that contains the word "Vendetta" and a movie that contains the word "London".

In [38]:
jdx = df["title"].str.contains("London")
sv = set(df.loc[idx, "user_id"])  # users who watched a "Vendetta" title
sw = set(df.loc[jdx, "user_id"])  # users who watched a "London" title
sw.intersection(sv)               # users who watched both

{'ebf43c36b6'}
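The same set logic can be tried on a small example where the answer is easy to verify by eye. The DataFrame `toy` and the names `watched_v` and `watched_l` below are hypothetical.

```python
import pandas as pd

# Hypothetical viewing records: only "u1" watched both kinds of title.
toy = pd.DataFrame({
    "title": ["Vendetta", "London Has Fallen", "Vendetta",
              "London Has Fallen", "Ratter"],
    "user_id": ["u1", "u1", "u2", "u3", "u2"],
})

watched_v = set(toy.loc[toy["title"].str.contains("Vendetta"), "user_id"])
watched_l = set(toy.loc[toy["title"].str.contains("London"), "user_id"])

both = watched_v & watched_l  # the operator form of set.intersection
print(both)
```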


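Based on these data, a rough estimate of the probability that someone in the UK watched both a movie with the word "Vendetta" in the title and a movie with the word "London" in the title is one user out of all rows. (The conditional probability of watching a "London" title given a "Vendetta" title would instead divide by the number of distinct users who watched a "Vendetta" title.)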

In [37]:
1 / df.shape[0]

1.4886800766967976e-06
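To separate the joint proportion from the conditional one, it can help to count distinct users rather than rows. This is a sketch under that assumption, on a hypothetical DataFrame `toy` with invented values. It is not a recomputation on the Netflix data.

```python
import pandas as pd

# Hypothetical viewing records for three distinct users.
toy = pd.DataFrame({
    "title": ["Vendetta", "London Has Fallen", "Vendetta",
              "London Has Fallen", "Ratter"],
    "user_id": ["u1", "u1", "u2", "u3", "u2"],
})

watched_v = set(toy.loc[toy["title"].str.contains("Vendetta"), "user_id"])
watched_l = set(toy.loc[toy["title"].str.contains("London"), "user_id"])
both = watched_v & watched_l

# Joint: users who watched both, out of all distinct users.
p_joint = len(both) / toy["user_id"].nunique()

# Conditional: users who watched both, out of the "Vendetta" users only.
p_cond = len(both) / len(watched_v)

print(p_joint, p_cond)
```

The conditional proportion is larger here because its denominator is restricted to the users who watched a "Vendetta" title.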