pandas
pandas is a Python library for analyzing tabular data. Its two core structures are the Series, a one-dimensional labeled array, and the DataFrame, a table whose columns are Series.
Installation
pandas is available on PyPI as pandas. It can be installed with uv:
uv add pandas
Import
Below is the convention for importing pandas.
import pandas as pd
Series
To create a series, pass a list-like object (for example a list, tuple, or NumPy array) to the pd.Series() constructor.
my_series = pd.Series(["one", "two", "three"])
Access elements of a series
With the default index, a series is accessed the same way as most sequences in Python, using zero-based positions.
To get a single value:
third_element = series[2]
To get a range of values (the end of the slice is exclusive):
third_to_sixth = series[2:6]
One advantage of using a custom index is that you can then fetch elements by their label in that index.
get_value_for_jack = series["jack"]
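As a minimal sketch of label-based lookup, the snippet below builds a series with a custom index through the index= parameter. The names and numbers are made up for illustration and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical data: names are the index labels, ages are the values.
ages = pd.Series([31, 27, 45], index=["jack", "jill", "joe"])

# Fetch a single value by its label.
jack_age = ages["jack"]

# Positional access is still available through .iloc.
first_two = ages.iloc[:2]
```

Here jack_age is 31, and first_two keeps the labels "jack" and "jill" alongside their values.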
Fetching a value from a series with [] behaves like list indexing only when the series uses the default index. Otherwise, the key is looked up as a label in the index provided. The snippet below fetches the third element only if the series uses the default index; with a custom integer index, it returns whichever value carries the label 2.
get_third_value = series[2]
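To illustrate how an integer key is resolved against a custom integer index, here is a small sketch. The series and its labels are hypothetical and chosen so that labels and positions disagree.

```python
import pandas as pd

# Hypothetical series whose integer labels do not match the positions.
letters = pd.Series(["x", "y", "z"], index=[2, 0, 1])

by_label = letters[2]          # looks up the label 2
by_position = letters.iloc[2]  # looks up position 2 (the third element)
```

by_label is "x" because the label 2 sits at the first position, while by_position is "z". Using .iloc makes the positional intent explicit.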
Additionally, if two or more values share a single label, pandas returns all of them, not just the first. There is no requirement for an index to contain only unique values.
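A short sketch of the duplicate-label behavior, using a made-up series in which the label "a" appears twice:

```python
import pandas as pd

# Hypothetical series with a repeated label.
scores = pd.Series([10, 20, 30], index=["a", "b", "a"])

# A repeated label returns every matching value as a series.
all_a = scores["a"]

# A label that appears once still returns a single value.
only_b = scores["b"]

# The index itself reports whether its labels are unique.
labels_unique = scores.index.is_unique
```

all_a holds the two values 10 and 30, only_b is 20, and labels_unique is False.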
DataFrames
Creation
DataFrames are created by either:
- Passing a list of iterables, where each iterable represents a row
my_dataframe = pd.DataFrame([[0, 1, 2], [3, 4, 5]])
Column names can be passed in with the columns= parameter as a list.
- Passing a dictionary, where each key is a column name and each value is an iterable of values for that column.
my_dataframe = pd.DataFrame({"col_name": [0, 3], "col_name2": [1, 4]})
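The two creation styles above can be sketched side by side. The column names "a", "b", and "c" are placeholders chosen for this example.

```python
import pandas as pd

# Row-oriented: each inner list is one row; columns= names the columns.
from_rows = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["a", "b", "c"])

# Column-oriented: each key is a column name, each value is that column.
from_dict = pd.DataFrame({"a": [0, 3], "b": [1, 4], "c": [2, 5]})

# Both approaches describe the same table.
same_table = from_rows.equals(from_dict)
```

Both dataframes have two rows and three columns, and same_table is True.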
Selection
Columns
df["a"] # Returns a series
df[["a"]] # Returns a dataframe
df[["a", "b"]] # Returns multiple columns as a df
The order in which the columns are specified is the order in which they appear in the returned dataframe.
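A minimal sketch of the three selection forms, using a hypothetical dataframe with columns "a", "b", and "c":

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

as_series = df["a"]         # a single column as a series
as_frame = df[["a"]]        # the same column as a one-column dataframe
reordered = df[["c", "a"]]  # columns come back in the requested order
```

reordered has its columns in the order "c", "a", matching the list that was passed in rather than the order in df.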
Rows
df[:3] # Selects the first three rows of data
To select by position, use .iloc rather than []. With plain [], an integer may be interpreted as an index label instead of a position, which gets confusing.
ser.iloc[4] # Selects the 5th element, 0-indexed
Both Columns and Rows
.iloc can be used to select entire rows, entire columns, or a single cell. Selecting an entire row or column returns a series; selecting a single cell returns the scalar value.
df.iloc[2, 2] # Returns the value in the 3rd row 3rd column
df.iloc[:, 2] # Returns the third column
df.iloc[2, :] # Returns the third row
To return a dataframe instead, wrap the selector in a list.
df.iloc[[2], :] # Returns the third row as a df
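The return types of these .iloc calls can be checked with a small sketch. The 3x3 dataframe below is a made-up example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

cell = df.iloc[2, 2]         # single value from 3rd row, 3rd column
column = df.iloc[:, 2]       # third column as a series
row = df.iloc[2, :]          # third row as a series
row_frame = df.iloc[[2], :]  # third row as a one-row dataframe
```

cell is the value 9, column and row are series, and row_frame is a dataframe with shape (1, 3).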
If you are using a non-default index, it most likely makes sense to select by label instead. This is done with .loc:
ser.loc["Vancouver"] # Returns the value labeled "Vancouver"
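As a sketch of label-based selection on a dataframe, the example below uses city names as the index. The column names and numbers are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical data indexed by city name.
cities = pd.DataFrame(
    {"region": ["west", "east"], "visits": [3, 5]},
    index=["Vancouver", "Toronto"],
)

row = cities.loc["Vancouver"]             # whole row as a series
cell = cities.loc["Vancouver", "visits"]  # single value by row and column label
```

row is a series keyed by the column names, and cell is 3.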
Iteration
There are a couple of ways to iterate over rows in a dataframe:
for index, row in df.iterrows():
    y = row["col"]
However, this is slow because pandas builds a new series for every row. A better way, if possible, is to use a list comprehension over the columns.
result = [f(x) for x in df["col"]] # For a single column
result = [f(x, y) for x, y in zip(df["col1"], df["col2"])] # Two columns
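The two iteration styles can be compared with a small sketch. The function add and the column names are placeholders standing in for f, "col1", and "col2" above.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 20, 30]})


def add(x, y):
    """Placeholder for any per-row computation."""
    return x + y


# Row by row with iterrows(): readable, but slow on large dataframes.
via_iterrows = []
for index, row in df.iterrows():
    via_iterrows.append(add(row["col1"], row["col2"]))

# Same computation with a list comprehension over the zipped columns.
via_zip = [add(x, y) for x, y in zip(df["col1"], df["col2"])]
```

Both lists contain 11, 22, and 33, so the comprehension can replace the loop without changing the result.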
Uniqueness
To check whether all values in a column are unique, there are a couple of methods:
result = df["column"].is_unique
# or
result = not df["column"].duplicated().any()
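A minimal sketch of how the two checks relate, using a hypothetical dataframe. Note that duplicated().any() is True when duplicates exist, so it is the negation of is_unique.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "city": ["a", "b", "a"]})

id_is_unique = df["id"].is_unique      # no repeated ids
city_is_unique = df["city"].is_unique  # "a" appears twice

# duplicated() marks every repeat after the first occurrence.
city_repeat_mask = df["city"].duplicated().tolist()
city_has_duplicates = bool(df["city"].duplicated().any())
```

id_is_unique is True, city_is_unique is False, and city_has_duplicates is True because the third row repeats the value "a".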