Marks Notes

pandas

pandas is a Python library for analyzing tabular data. Its two core structures are the Series, a one-dimensional labeled array, and the DataFrame, a table whose columns are Series.

Installation

pandas is available on PyPI as pandas. It can be installed with uv:

uv add pandas

Import

By convention, pandas is imported under the alias pd.

import pandas as pd

Series

To create a Series, pass any iterable to the pd.Series() constructor.

my_series = pd.Series(["one", "two", "three"])

Access elements of a series

With the default index, a Series supports the same indexing and slicing syntax as a Python list. Positions are zero-based and the end of a slice is excluded.

To get a single value:

third_element = series[2]

To get a range of values:

third_to_sixth = series[2:6]
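A small sketch of both lookups on a default-index Series. The name numbers and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data; the default index is 0, 1, 2, ...
numbers = pd.Series(["zero", "one", "two", "three", "four", "five", "six"])

third_element = numbers[2]      # label 2 is also position 2 here
third_to_sixth = numbers[2:6]   # positions 2, 3, 4 and 5; the end is excluded
```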

One advantage of using a custom index is that you can then fetch elements through that index.

get_value_for_jack = series["jack"]
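As a rough illustration, a custom index can be supplied through the index= parameter of pd.Series(). The names and values below are invented for the example.

```python
import pandas as pd

# Hypothetical data: ages keyed by first name.
ages = pd.Series([31, 27, 45], index=["jack", "jill", "joe"])

jack_age = ages["jack"]            # lookup by label
jack_to_jill = ages["jack":"jill"]  # label slices include both endpoints
```

Note that, unlike positional slices, a label slice includes its end label.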
Warning

Fetching a value from a Series with [] only behaves like list indexing when the Series uses the default index. The snippet below fetches the third element under the default index. If the index has been replaced with other integer labels, it instead returns whichever value carries the label 2.

get_third_value = series[2]

Additionally, if two or more values share a single label, pandas returns all of them, not just the first. There is no requirement for an index to contain only unique values.
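A minimal sketch of the duplicate-label case, using made-up data:

```python
import pandas as pd

# Hypothetical data: the label "a" appears twice.
scores = pd.Series([10, 20, 30], index=["a", "b", "a"])

repeated = scores["a"]                   # a Series holding both matching values
index_is_unique = scores.index.is_unique  # False for this index
```

Checking index.is_unique before relying on a lookup returning one value is one way to guard against this.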

DataFrames

Creation

DataFrames are created by either:

  1. Passing a list of iterables, where each iterable represents a row. Column names can be passed as a list with the columns= parameter.
my_dataframe = pd.DataFrame([[0, 1, 2], [3, 4, 5]])

  2. Passing a dictionary, where each key is a column name and each value is an iterable holding that column's values.
my_dataframe = pd.DataFrame({"col_name": [0, 3], "col_name2": [1, 4]})
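To show that the two approaches are interchangeable, the sketch below builds the same table both ways. The column names a, b and c are placeholders.

```python
import pandas as pd

# Row-oriented: each inner list is one row, names come from columns=.
from_rows = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["a", "b", "c"])

# Column-oriented: each key is a column name, each value is that column.
from_dict = pd.DataFrame({"a": [0, 3], "b": [1, 4], "c": [2, 5]})

same_table = from_rows.equals(from_dict)
```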

Selection

Columns

df["a"]  # Returns a series
df[["a"]]  # Returns a dataframe
df[["a", "b"]]  # Returns multiple columns as a df

The order in which the columns are specified is the order in which they appear in the returned dataframe.
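A quick sketch of the return types and column ordering, on a made-up frame:

```python
import pandas as pd

# Hypothetical three-column frame.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

as_series = df["a"]         # single label: Series
as_frame = df[["a"]]        # list of labels: DataFrame, even with one column
reordered = df[["c", "a"]]  # columns come back in the order requested
```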

Rows

df[:3]  # Selects the first three rows of data

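To select by position, use .iloc rather than []. With [], an integer may be matched against the index labels rather than positions, which gives surprising results when the index is not the default one.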

ser.iloc[4]  # Selects the 5th element, 0-indexed
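The difference shows up as soon as the index holds integers in a non-default order. A small sketch with invented data:

```python
import pandas as pd

# Hypothetical Series whose integer labels run backwards.
ser = pd.Series(["x", "y", "z"], index=[2, 1, 0])

by_label = ser[0]          # matches the label 0, which is the last element
by_position = ser.iloc[0]  # always the first element, whatever the labels are
```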

Both Columns and Rows

.iloc can select entire rows, entire columns, or a single cell. Selecting a row or a column returns a series; selecting a single cell returns the scalar value itself.

df.iloc[2, 2]  # Returns the value in the 3rd row 3rd column
df.iloc[:, 2]  # Returns the third column
df.iloc[2, :]  # Returns the third row

To get a dataframe back instead, wrap the selector in a list.

df.iloc[[2], :]  # Returns the third row as a df
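The sketch below runs each .iloc form on a small made-up frame so the return types can be compared side by side.

```python
import pandas as pd

# Hypothetical 3x3 frame.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

cell = df.iloc[2, 2]         # scalar: third row, third column
column = df.iloc[:, 2]       # Series: third column
row = df.iloc[2, :]          # Series: third row, indexed by column name
row_frame = df.iloc[[2], :]  # DataFrame with a single row
```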

If you are using a non-default index, it usually makes more sense to select by label instead. This is done with .loc:

ser.loc["Vancouver"]  # Returns the value with index label Vancouver
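.loc accepts the same row and column forms as .iloc, but with labels. A minimal sketch, where the frame trips and its visits column are invented for illustration:

```python
import pandas as pd

# Hypothetical frame indexed by city name.
trips = pd.DataFrame(
    {"province": ["BC", "ON"], "visits": [3, 1]},
    index=["Vancouver", "Toronto"],
)

row = trips.loc["Vancouver"]             # Series: one row, by label
cell = trips.loc["Vancouver", "visits"]  # scalar: row label and column label
row_frame = trips.loc[["Vancouver"]]     # DataFrame: a list of labels
```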

Iteration

There are a couple of ways to iterate over rows in a dataframe:

for index, row in df.iterrows():
    y = row["col"]

However, iterrows() is slow because it builds a series for every row. A better way, if possible, is to iterate over the columns directly with a list comprehension.

result = [f(x) for x in df["col"]]  # For a single column
result = [f(x, y) for x, y in zip(df["col1"], df["col2"])]  # Two columns
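Both approaches give the same values, which the sketch below checks on a tiny made-up frame. The function add stands in for whatever f does.

```python
import pandas as pd

# Hypothetical two-column frame.
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 20, 30]})


def add(x, y):
    """Placeholder for the per-row computation."""
    return x + y


# Row by row: each row arrives as a Series.
via_iterrows = [add(row["col1"], row["col2"]) for _, row in df.iterrows()]

# Column by column: no per-row Series is created.
via_zip = [add(x, y) for x, y in zip(df["col1"], df["col2"])]
```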

Uniqueness

To check whether all values in a column are unique, there are a couple of methods:

result = df["column"].is_unique
# or
result = not df["column"].duplicated().any()
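Note that duplicated().any() is True when duplicates exist, so it answers the opposite question to is_unique. A small sketch with invented columns:

```python
import pandas as pd

# Hypothetical frame: "id" has no repeats, "city" does.
df = pd.DataFrame({"id": [1, 2, 3], "city": ["a", "b", "a"]})

id_is_unique = df["id"].is_unique
city_is_unique = df["city"].is_unique

# duplicated() marks every repeat after the first occurrence.
city_duplicate_flags = df["city"].duplicated().tolist()
city_has_duplicates = bool(df["city"].duplicated().any())
```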