Pandas

Pandas is a Python library for analyzing tabular data. Its two main structures are the Series, a one-dimensional labeled array, and the DataFrame, a table whose columns are Series.

Import

Below is the convention for importing pandas.

import pandas as pd

Series

To create a Series, pass a list or another ordered iterable to the pd.Series() constructor.

my_series = pd.Series(["one", "two", "three"])
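As a small sketch of the difference between the default index and a custom one: the index= parameter of pd.Series() attaches labels to the values. The names and numbers below are made up for illustration.

```python
import pandas as pd

# Default index: the labels are 0, 1, 2
my_series = pd.Series(["one", "two", "three"])

# Custom index: the (hypothetical) names replace 0, 1, 2 as labels
ages = pd.Series([31, 45, 27], index=["jack", "jill", "joe"])

print(list(my_series.index))  # [0, 1, 2]
print(list(ages.index))       # ['jack', 'jill', 'joe']
```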

Access elements of a series

A Series with the default index can be accessed the same way as most sequences in Python. Positions start at 0.

To get a single value:

third_element = series[2]

To get a range of values (the stop position is excluded):

third_to_sixth = series[2:6]

A custom index is set with the index= parameter of pd.Series(). One advantage of a custom index is that you can fetch elements by their labels.

get_value_for_jack = series["jack"]
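The three kinds of access can be tried together in a short sketch. The values and names are illustrative only. Because positions count from zero, the third element sits at position 2.

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40, 50, 60, 70])

third_element = series[2]     # default index: label 2 is the third value
third_to_sixth = series[2:6]  # integer slices are positional; stop is excluded

# A custom index (hypothetical names) allows lookup by label
ages = pd.Series([31, 45, 27], index=["jack", "jill", "joe"])
get_value_for_jack = ages["jack"]

print(third_to_sixth.tolist())  # [30, 40, 50, 60]
```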

Warning

Fetching a value with [] only behaves like list indexing when the series uses the default index. Otherwise, the value is looked up by label. The snippet below fetches the second element only if the series has the default index; with a custom integer index, it returns whatever value carries the label 1.

get_second_value = series[1]

Additionally, if two or more values share a single label, pandas returns all of them, not just the first. An index is not required to contain unique values.
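Both pitfalls can be reproduced in a few lines. This is a minimal sketch with made-up data: the first series has integer labels in a shuffled order, the second repeats a label.

```python
import pandas as pd

# Custom integer labels: [] looks up the label, not the position
shuffled = pd.Series(["a", "b", "c"], index=[2, 0, 1])
by_label = shuffled[2]          # "a", the value labeled 2
by_position = shuffled.iloc[2]  # "c", the third value

# Duplicate labels: every matching value comes back, as a Series
dupes = pd.Series([1, 2, 3], index=["x", "x", "y"])
both_x = dupes["x"]

print(both_x.tolist())  # [1, 2]
```

Using .iloc for positions removes the ambiguity in the first case.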

DataFrames

Creation

DataFrames are created by either:

Passing a list of iterables, where each iterable represents a row

my_dataframe = pd.DataFrame([[0, 1, 2], [3, 4, 5]])

Column names can be passed in with the columns= parameter as a list.

Passing a dictionary, where each key is a column name and each value is an iterable holding that column's values.

my_dataframe = pd.DataFrame({"col_name": [0, 3], "col_name2": [1, 4]})
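The two approaches can build the same table. The sketch below reuses the column names from the dictionary example and supplies them to the row-wise form through columns=.

```python
import pandas as pd

# Row-wise: each inner list is a row; columns= names the columns
from_rows = pd.DataFrame([[0, 1], [3, 4]], columns=["col_name", "col_name2"])

# Column-wise: each dictionary key is a column name
from_dict = pd.DataFrame({"col_name": [0, 3], "col_name2": [1, 4]})

print(from_rows.equals(from_dict))  # True
```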

Selection

Columns

df["a"]  # Returns a series
df[["a"]]  # Returns a dataframe
df[["a", "b"]]  # Returns multiple columns as a df

The order in which the columns are specified is the order in which they appear in the returned dataframe.
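A quick way to confirm the return types and the column order, using a small made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

single = df["a"]            # Series
as_frame = df[["a"]]        # DataFrame with one column
reordered = df[["c", "a"]]  # DataFrame, columns in the order requested

print(type(single).__name__, type(as_frame).__name__)  # Series DataFrame
print(list(reordered.columns))                         # ['c', 'a']
```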

Rows

df[:3]  # Selects the first three rows of data

To select by position, use .iloc rather than [], because [] can be confused with label-based lookup on the index.

ser.iloc[4]  # Selects the 5th element, 0-indexed

Both Columns and Rows

.iloc can select entire rows, entire columns, or a single cell. Selecting a row or a column returns a series; selecting a single cell returns the value itself.

df.iloc[2, 2]  # Returns the value in the 3rd row 3rd column
df.iloc[:, 2]  # Returns the third column
df.iloc[2, :]  # Returns the third row

To return a dataframe instead, wrap the row or column position in a list.

df.iloc[[2], :]  # Returns the third row as a df
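The return type of each .iloc form can be checked on a small example. The 3x3 dataframe below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]], columns=["a", "b", "c"])

value = df.iloc[2, 2]     # single cell: the scalar 8
column = df.iloc[:, 2]    # third column as a Series
row = df.iloc[2, :]       # third row as a Series
row_df = df.iloc[[2], :]  # third row as a one-row DataFrame

print(column.tolist())  # [2, 5, 8]
print(row.tolist())     # [6, 7, 8]
print(row_df.shape)     # (1, 3)
```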

If you are using a non-default index, it usually makes more sense to select by label instead. This is done with .loc.

ser.loc["Vancouver"]  # Returns the value labeled "Vancouver"
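A minimal sketch of label-based selection on both a series and a dataframe. The city labels, column names, and numbers are placeholders, not real data.

```python
import pandas as pd

cities = ["Vancouver", "Toronto", "Montreal"]

ser = pd.Series([10, 20, 30], index=cities)
df = pd.DataFrame({"score": [10, 20, 30], "rating": [3, 2, 1]}, index=cities)

value = ser.loc["Vancouver"]           # the value labeled "Vancouver"
row = df.loc["Vancouver"]              # that row, returned as a Series
cell = df.loc["Vancouver", "rating"]   # a single cell by row and column label

print(row.tolist())  # [10, 3]
```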

Iteration

There are a couple of ways to iterate over rows in a dataframe:

for index, row in df.iterrows():
    y = row["col"]

However, this is slow, because iterrows() builds a series for every row. A faster way, when possible, is to use a list comprehension over the columns.

result = [f(x) for x in df["col"]]  # For a single column
result = [f(x, y) for x, y in zip(df["col1"], df["col2"])]  # Two columns
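To check that the two approaches agree, the sketch below runs both on a small made-up dataframe. The function f is a hypothetical stand-in for whatever per-row work is needed.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 20, 30]})

def f(x, y):
    # Hypothetical per-row computation
    return x + y

# Row by row with iterrows(): a Series is built for each row
slow = [f(row["col1"], row["col2"]) for _, row in df.iterrows()]

# Same computation by zipping the two columns
fast = [f(x, y) for x, y in zip(df["col1"], df["col2"])]

print(slow == fast)  # True
```

When f is simple arithmetic like this, operating on whole columns (df["col1"] + df["col2"]) avoids the Python loop entirely.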

Uniqueness

To check whether all values in a column are unique, there are a couple of methods:

result = df["column"].is_unique
# or
result = not df["column"].duplicated().any()
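The two checks answer opposite questions: is_unique is True when no value repeats, while duplicated().any() is True when at least one value repeats. A small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"column": [1, 2, 2, 3]})

all_unique = df["column"].is_unique                # False: 2 appears twice
has_duplicates = df["column"].duplicated().any()   # True: a repeat exists

# duplicated() marks every occurrence after the first
print(df["column"].duplicated().tolist())  # [False, False, True, False]
```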