Pandas
Pandas is a Python library for analyzing tabular data. Its two core structures are the Series, a one-dimensional labeled array, and the DataFrame, a table whose columns are Series.
Import
Below is the convention for importing pandas.
import pandas as pd
Series
To create a series, pass a list-like iterable, such as a list or tuple, to pd.Series().
my_series = pd.Series(["one", "two", "three"])
Access elements of a series
A series with the default index can be accessed the same way most sequences are accessed in Python. Positions count from 0.
To get a single value:
third_element = series[2]
To get a range of values:
third_to_sixth = series[2:6]
A series can also be given a custom index through the index= parameter. One advantage of using a custom index is that you can then fetch elements through that index.
get_value_for_jack = series["jack"]
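As a minimal sketch of label-based access, the example below builds a series with a custom index and fetches values through it. The names and numbers are made up for illustration and are not from these notes.

```python
import pandas as pd

# Hypothetical data: a value per person, keyed by name.
ages = pd.Series([31, 27, 45], index=["jack", "amy", "sam"])

# A single label returns the value stored under that label.
jack_age = ages["jack"]

# A list of labels returns a new series, in the order requested.
subset = ages[["sam", "amy"]]
```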
Warning
Fetching a value from a series with [] behaves like list indexing only when the series uses the default index. Otherwise, the lookup is based on the index labels. The snippet below fetches the third element only if the series uses the default index; with a custom integer index, it fetches whatever value is labeled 2.
get_third_value = series[2]
Additionally, if two or more values share a single label, pandas returns all of them, not just the first. There is no requirement for an index to contain only unique values.
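Both points in this warning can be seen in a small sketch. The series and labels below are invented for illustration.

```python
import pandas as pd

# A custom integer index: the labels do not match the positions.
shuffled = pd.Series(["a", "b", "c"], index=[2, 0, 1])

by_label = shuffled[2]          # looks up the label 2, which is the first element
by_position = shuffled.iloc[2]  # looks up position 2, which is the last element

# An index with a repeated label.
dupes = pd.Series([10, 20, 30], index=["x", "x", "y"])

# Every value stored under "x" comes back, as a series.
all_x = dupes["x"]
```

Using .iloc whenever a position is meant avoids depending on what the index happens to contain.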
DataFrames
Creation
DataFrames are created by either:
- Passing a list of iterables, where each iterable represents a row
my_dataframe = pd.DataFrame([[0, 1, 2], [3, 4, 5]])
Column names can be passed in as a list with the columns= parameter.
- Passing a dictionary, where each key is a column name and each value is an iterable of values for that column.
my_dataframe = pd.DataFrame({"col_name": [0, 3], "col_name2": [1, 4]})
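The two creation styles can produce the same table. The sketch below, with illustrative column names a, b, and c, builds one dataframe row by row and one column by column, then compares them.

```python
import pandas as pd

# Row-oriented: each inner list is one row; columns= names the columns.
from_rows = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["a", "b", "c"])

# Column-oriented: each key is a column name, each value is that column's data.
from_dict = pd.DataFrame({"a": [0, 3], "b": [1, 4], "c": [2, 5]})

# equals() checks that shape, labels, and values all match.
same = from_rows.equals(from_dict)
```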
Selection
Columns
df["a"] # Returns a series
df[["a"]] # Returns a dataframe
df[["a", "b"]] # Returns multiple columns as a df
The columns appear in the returned dataframe in the order in which they are specified.
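A short sketch of the difference between the single and double bracket forms, using a made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

single = df["a"]            # one label: a series
single_df = df[["a"]]       # a list with one label: a one-column dataframe
reordered = df[["c", "a"]]  # columns come back in the order requested
```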
Rows
df[:3] # Selects the first three rows of data
To select by position, use .iloc rather than [], as [] can be confusing when the index contains integer labels.
ser.iloc[4] # Selects the 5th element, 0-indexed
Both Columns and Rows
.iloc can be used to select entire rows, entire columns, or a single cell. Selecting a whole row or column returns a series; selecting a single cell returns the value itself.
df.iloc[2, 2] # Returns the value in the 3rd row 3rd column
df.iloc[:, 2] # Returns the third column
df.iloc[2, :] # Returns the third row
To return a dataframe, place [] around the value.
df.iloc[[2], :] # Returns the third row as a df
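To make the return types concrete, here is a minimal sketch on a small 3x3 dataframe invented for illustration:

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]], columns=["a", "b", "c"])

cell = df.iloc[2, 2]      # a single value from the 3rd row, 3rd column
column = df.iloc[:, 2]    # the 3rd column, as a series
row = df.iloc[2, :]       # the 3rd row, as a series
row_df = df.iloc[[2], :]  # the 3rd row, kept as a one-row dataframe
```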
If you are using a non-default index, it usually makes more sense to select by label instead. This is done with .loc
ser.loc["Vancouver"] # Returns the value labeled Vancouver
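.loc works on dataframes as well, taking a row label and optionally a column label. The sketch below uses a hypothetical table of visit counts; the column name and values are placeholders.

```python
import pandas as pd

# Hypothetical data indexed by city name.
cities = pd.DataFrame(
    {"visits": [3, 5], "rating": [4.5, 4.0]},
    index=["Vancouver", "Toronto"],
)

vancouver_row = cities.loc["Vancouver"]               # the whole row, as a series
vancouver_visits = cities.loc["Vancouver", "visits"]  # a single cell
```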
Iteration
There are a couple of ways to iterate over rows in a dataframe. The most direct is iterrows():
for index, row in df.iterrows():
    y = row["col"]
However, this is slow, because a new series is built for every row. A better way, if possible, is to use a list comprehension over the columns.
result = [f(x) for x in df["col"]] # For a single column
result = [f(x, y) for x, y in zip(df["col1"], df["col2"])] # Two columns
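The two approaches give the same result. In the sketch below, add is a stand-in for the function f above, and the dataframe is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 20, 30]})


def add(x, y):
    """Placeholder for whatever per-row computation is needed."""
    return x + y


# Row by row with iterrows(): readable, but slow on large dataframes.
via_iterrows = [add(row["col1"], row["col2"]) for _, row in df.iterrows()]

# Zipping the columns avoids building a series for every row.
via_zip = [add(x, y) for x, y in zip(df["col1"], df["col2"])]
```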
Uniqueness
To check if all values in a column are unique, there are a couple of ways:
result = df["column"].is_unique
# or
result = not df["column"].duplicated().any()
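As a quick sketch, the example below checks one column that is unique and one that is not, and then lists the repeated values. The column names and data are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {"id": [1, 2, 3], "city": ["Vancouver", "Toronto", "Vancouver"]}
)

ids_unique = df["id"].is_unique

# duplicated() marks repeats, so "no repeats at all" means unique.
cities_unique = not df["city"].duplicated().any()

# keep=False marks every occurrence of a repeated value, not just the later ones.
repeated = df.loc[df["city"].duplicated(keep=False), "city"].tolist()
```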