4. Python data science modules#
Now that we have a grasp on the fundamentals of Python, we’re ready to use Python for data science.
Unlike some more math-oriented programming languages (like R or MATLAB), Python relies on external packages to provide most data science functionality.
The Python community has standardized on the following family of related packages, all of which we’ll be covering in this module, and will be using extensively for the rest of the course:
numpy - math tools, plus working with 1D & multi-dimensional arrays
pandas - working with tabular data
matplotlib - making plots and visualizations
seaborn - statistical plots and visualizations, built on top of matplotlib
Numpy#
Numpy is a Python package for representing array data, and comes with a large library of tools and mathematical functions that operate efficiently on arrays, mostly through its ndarray class.
Numpy is by far the most popular Python package for data science, and is one of the most-downloaded Python packages overall. It's so useful and reliable that most of the mathematical functionality of the other packages covered in this module (pandas, seaborn, matplotlib) is provided by numpy under the hood.
Because numpy is used a lot, it's convention to import it with the np abbreviation:
import numpy as np
Why numpy?#
A numpy array is similar to a Python list: they can both serve as containers for numbers.
python_list = [0, 2, 4, 6]
print(python_list)
[0, 2, 4, 6]
numpy_array = np.array([0, 2, 4, 6])
print(numpy_array)
[0 2 4 6]
So why use numpy instead of lists?
Speed
Although numpy is a Python package, most of its functionality is written in fast C or Fortran code (see the quick timing sketch after this list).
Memory efficient
Numpy uses less memory to store numbers than Python, so you can work on larger datasets.
Functionality
Numpy comes with a huge range of modules with fast and thoroughly-validated algorithms, from interpolation to Fourier transforms. (We'll cover many of these functions shortly!)
Manipulation syntax
Numpy’s syntax makes it clear and easy to perform common array operations, like slicing, filtering, and summarization.
But there are some use cases where lists make more sense
Storing different kinds of data together
Numpy arrays are homogeneous: all elements must be the same type
Working with non-numerical data
Only some numpy functionality works with strings and other types
Creating arrays#
One way to create an array is from a Python sequence like a list, using the array() function
days_per_month_list = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
days_per_month = np.array(days_per_month_list)
days_per_month
array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])
Nesting lists will create higher dimensional arrays.
array_2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
array_2d
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
As well as converting Python lists to arrays, numpy can create its own arrays!
You can create an array that’s filled with zeros (helpful for creating an empty array we’ll fill up later)
np.zeros(5)
array([0., 0., 0., 0., 0.])
or ones (for 2D arrays, we specify the number of rows, then the number of columns)
np.ones((2, 5))
array([[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.]])
Numpy has it’s own version of the Python range()
function:
np.arange(2, 9, 2)
array([2, 4, 6, 8])
and a related linspace() function to create an array with evenly-spaced elements.
np.linspace(0, 10, num=5)
array([ 0. , 2.5, 5. , 7.5, 10. ])
There’s a whole random
module that can create arrays from randomly sampling various distributions
np.random.uniform(low=0, high=10, size=(2, 5))
array([[8.84041009, 0.02064896, 8.58415525, 0.55068929, 8.73000293],
[8.26429282, 9.21427896, 7.0886935 , 3.15084759, 7.18397826]])
Size and shape#
We can use len() to get the length of a 1D array
len(days_per_month)
12
while the .shape attribute will give the length of each dimension (remember the 2D order? rows then cols!)
array_2d.shape
(3, 4)
Array arithmetic#
Arrays let you express mathematical equations without for loops. This makes your code much faster, plus numpy code tends to read more like a math formula than like a program.
Operations with single numbers are applied to the whole array
weeks_per_month = days_per_month / 7
print(weeks_per_month)
[4.42857143 4. 4.42857143 4.28571429 4.42857143 4.28571429
4.42857143 4.42857143 4.28571429 4.42857143 4.28571429 4.42857143]
whereas operations between arrays of the same size are applied element-wise
days_per_month + days_per_month
array([62, 56, 62, 60, 62, 60, 62, 62, 60, 62, 60, 62])
Comparing arrays results in an array of the same size with True/False values.
days_per_month >= 31
array([ True, False, True, False, True, False, True, True, False,
True, False, True])
To perform element-wise boolean logic between arrays, numpy uses & instead of and, and | instead of or.
(np.arange(12) < 3) & (days_per_month >= 31)
array([ True, False, True, False, False, False, False, False, False,
False, False, False])
These boolean arrays can be stored as variables like any other. Creating intermediate variables can make arithmetic logic easier to read
is_long_month = days_per_month >= 31
is_q1_month = np.arange(12) < 3
is_long_q1_month = is_q1_month & is_long_month
print(is_long_q1_month)
[ True False True False False False False False False False False False]
Array slicing and indexing#
Like a Python list, arrays can be sliced
days_per_month[0:3]
array([31, 28, 31])
and individual elements can be indexed out
print(days_per_month[1])
28
You index a 2D array using the same notation: first your row slicing, then a comma (,), then your column slicing.
Here’s our array as a reminder
array_2d
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
For example, here's the element in the 2nd row and the 1st column
print(array_2d[1, 0])
5
or the first three rows of the last column
print(array_2d[0:3, -1])
[ 4 8 12]
Note how our column has lost its “verticalness”: once we’ve sliced it out, it’s just a regular 1D array.
Assigning new values to an array uses the same slice syntax: arrays are mutable!
A single value will be repeated to all elements (a : without numbers means all elements):
array_2d[0, :] = 99
array_2d
array([[99, 99, 99, 99],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
while an equal-length sequence will be assigned elementwise
array_2d[:, 3] = [-1, -2, -3]
array_2d
array([[99, 99, 99, -1],
[ 5, 6, 7, -2],
[ 9, 10, 11, -3]])
If we make a slice then modify it, the original array is modified too!
array_2d_slice = array_2d[1:2, 2:3]
array_2d_slice[0] = -9999
array_2d
array([[ 99, 99, 99, -1],
[ 5, 6, -9999, -2],
[ 9, 10, 11, -3]])
To keep numpy’s speedy performance, making a slice doesn’t copy any data, just provides a “view” to the original array.
If we want to modify a subset of the data independently, we can use the copy method.
q1_months = days_per_month[:3]
q1_months_leap_year = q1_months.copy()
q1_months_leap_year[1] = 29
q1_months_leap_year
array([31, 29, 31])
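If you're ever unsure whether two arrays share the same underlying data, np.shares_memory() can tell you (a quick check using the arrays above):
print(np.shares_memory(days_per_month, q1_months))  # True: q1_months is a view.
print(np.shares_memory(days_per_month, q1_months_leap_year))  # False: the copy is independent.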
Data types#
All values in a numpy array are converted to be the same type. By default, numpy will take an educated guess about what type we want, and we can see its choice with the dtype attribute
array_default = np.array([1, 2, 3])
array_default.dtype
dtype('int64')
We can override the dtype though
array_manual = np.array([1, 2, 3], dtype=np.float64)
array_manual.dtype
dtype('float64')
Other than ints and floats, the boolean dtype is frequently used in numpy
is_positive = array_2d > 0
is_positive.dtype
dtype('bool')
Arrays can store strings, though of course some numpy mathematical tools won’t work
states = np.array(["Washington", "Oregon", "California"])
states.dtype
dtype('<U10')
The U in <U10 stands for "Unicode string". The < and 10 characters represent internal numpy storage attributes (endianness and number of characters respectively).
An array can also store arbitrary Python objects. Let's make an array containing a number as well as the print() function
# This is a terrible idea.
mixed = np.array([42, print])
mixed.dtype
dtype('O')
Numpy gives up and applies the O for "object" dtype (everything in Python is an object).
If you see this in the real world, it's usually a sign that your number parsing has gone wrong! Otherwise, you may as well use a regular list.
Boolean indexing#
Here are some state names, with corresponding mean annual precipitation.
states = np.array(["Washington", "Oregon", "California"])
precip = np.array([38.67, 43.62, 22.97])
To pick out California we can use == to create a boolean array
states == "California"
array([False, False, True])
Passing this boolean array as an index to another array of the same length will slice out only the elements matching True
precip[states == "California"]
array([22.97])
This is really powerful for working between corresponding arrays!
Negation can be done with !=
precip[states != "California"]
array([38.67, 43.62])
or by using ~ to flip all the booleans in an array
~(states == "California")
array([ True, True, False])
match = states == "California"
precip[~match]
array([38.67, 43.62])
We can also filter the other way
high_rainfall = precip > 30
states[high_rainfall]
array(['Washington', 'Oregon'], dtype='<U10')
Mathematical functions#
Numpy comes with array versions of the functions in the builtin math module, and many more.
Many are element-wise transformations
nums = np.linspace(0, 100, 4)
print(nums)
print(np.floor(nums)) # Round down to next integer.
[ 0. 33.33333333 66.66666667 100. ]
[ 0. 33. 66. 100.]
while others summarize
np.median(nums)
np.float64(50.0)
For multidimensional arrays, summarizing functions still give a single overall result
array_2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
np.mean(array_2d)
np.float64(6.5)
But we can specify which dimension the function should summarize using the axis argument. This will be useful later when dealing with observation data that might have at least x, y, and time dimensions.
To get the mean of each row:
np.mean(array_2d, axis=1)
array([ 2.5, 6.5, 10.5])
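and axis=0 gives the mean of each column instead:
np.mean(array_2d, axis=0)
array([5., 6., 7., 8.])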
Here’s a few of the more common numpy functions
Element-wise
sin
,arcsine
,deg2rad
: trigonometryround
,floor
,ceil
: roundingexp
,log
,log10
: logarithmsclip
: limit the elements in an array
Summarizing
mean
,median
,mode
: averagesstd
,var
: skewsum
,prod
: sum/product of all elementspercentile
,quantile
: order
Because numpy is so foundational to scientific Python, if you can't find a common math algorithm in numpy, there's likely to be another package that provides the function with full support for numpy arrays.
Good places to hunt for extended math functions include
scipy has many submodules including functions for interpolation, statistics, and linear algebra (see the sketch after this list)
statsmodels for summary statistics
pandas has functions for dealing with time series and strings
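As a small illustration (assuming scipy is installed), here's a sketch of fitting a straight line to some made-up noisy data with scipy.stats.linregress, which accepts numpy arrays directly:
from scipy import stats

x = np.arange(10)
y = 2.5 * x + np.random.normal(scale=0.5, size=10)  # Synthetic noisy data, just for demonstration.

fit = stats.linregress(x, y)
print(fit.slope, fit.intercept)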
NaNs#
NaN stands for "Not a Number".
They can come from importing missing data
values = ["1", None, "3"]
np.array(values, dtype=np.float64)
array([ 1., nan, 3.])
or as the result of mathematical operations
# You'll see a warning message after running this cell!
#
# Numpy can perform mathematical operations on NaN values without errors.
# But often this is a sign that something has gone wrong, so it gives you
# a helpful heads up.
np.arange(5) / np.arange(5)
/var/folders/df/w347rlh96tx42h1kl69_4r8w0000gn/T/ipykernel_73578/2500404429.py:6: RuntimeWarning: invalid value encountered in divide
np.arange(5) / np.arange(5)
array([nan, 1., 1., 1., 1.])
but most commonly come from other NaNs!
NaNs propagate. An operation where any of the inputs is NaN usually produces a NaN output, which can spread to your entire dataset!
array_2d = np.array([[np.nan, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
array_2d
array([[nan, 2., 3., 4.],
[ 5., 6., 7., 8.],
[ 9., 10., 11., 12.]])
array_2d_normalized = array_2d / np.max(array_2d)
array_2d_normalized
array([[nan, nan, nan, nan],
[nan, nan, nan, nan],
[nan, nan, nan, nan]])
There are a few things we can do to manage and embrace NaN values.
First, if you can’t discard invalid numerical data, then explicitly convert it to NaN.
flowrate_over_time = np.array([None, 1.9, 2.1, 3.0, -1, 4.3, 4.8], dtype=np.float64)
# Remove invalid data. The sensor returns -1 when broken.
flowrate_over_time[flowrate_over_time < 0] = np.nan
flowrate_over_time
array([nan, 1.9, 2.1, 3. , nan, 4.3, 4.8])
If you want to ignore NaNs, many numpy functions have nan-ignoring siblings.
np.nanmax(array_2d)
np.float64(12.0)
You’ll still get a NaN result if your entire input array is NaN though
np.nanmean(array_2d_normalized)
/var/folders/df/w347rlh96tx42h1kl69_4r8w0000gn/T/ipykernel_73578/2690959602.py:1: RuntimeWarning: Mean of empty slice
np.nanmean(array_2d_normalized)
np.float64(nan)
If you’re not expecting any NaNs, enforce that in the code. Numpy has some helpful functions for this.
np.isnan() returns a boolean array that is True for NaN values
np.isnan(flowrate_over_time)
array([ True, False, False, False, True, False, False])
np.isfinite() is similar: it will be False for both NaN values as well as Inf (infinity) values
np.isfinite(flowrate_over_time)
array([False, True, True, True, False, True, True])
Numpy also has np.any() and np.all(), which return a single boolean value that is True if any/all of the elements are True
np.all([True, False])
np.False_
np.any([True, False])
np.True_
With all that in mind, a common quick data validation statement might look like
# All days of the month should be real integers!
assert np.all(np.isfinite(days_per_month)), "Invalid data found!"
Exercise: numpy flow#
Let’s generate some synthetic data, filter some outliers, then perform an analysis on the remaining data.
Generate data for the depth (in meters) of 100 different wells.
Build a numpy array containing 100 random numbers between -20 and 20. You may want to use the numpy random tool.
Negative well depths represent sensor errors. Set all the negative values to NaN.
Remove all the NaN values. Print the length of the array before and after the NaN removal step, so you can see how your data has changed.
Calculate the median of the remaining values.
Things to think about
What should my variables be called?
Do the results change when you rerun the code?
Extra: What other summary statistics does numpy have that could be interesting to know?
Extra: How can I "seed" the random number generation, so that the numbers are still drawn from a random distribution, but the array will have the same values every time the code is run (for reproducibility)?
Extra: What would happen if you got unlucky and all 100 values were negative? How could you modify the code to try out this scenario? How could you handle that possibility in code?
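If you get stuck, here's a minimal sketch of one possible solution (using a seeded random generator so it's reproducible); treat it as a starting point rather than the only answer.
rng = np.random.default_rng(seed=42)  # Seeding makes the run reproducible.
well_depths = rng.uniform(low=-20, high=20, size=100)

# Negative depths are sensor errors: mark them as missing.
well_depths[well_depths < 0] = np.nan

print("Length before NaN removal:", len(well_depths))
well_depths = well_depths[~np.isnan(well_depths)]
print("Length after NaN removal:", len(well_depths))

print("Median well depth (m):", np.median(well_depths))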
Pandas#
Pandas is a package for working with tabular spreadsheet-like data.
What is tabular data?#
Tabular data is anything in a table form!
Common analytical examples include spreadsheets, CSV files, and database tables.
Tabular data consists of rows and columns:
Each row represents an item, and each column represents a common feature of all the items.
Each row has the same columns as the other rows, in the same order.
A single column holds data of the same type, but different columns can have different types.
The order of rows sometimes matters, while the order of columns doesn’t matter.
Tabular data isn’t just work spreadsheets either: for example, a music playlist is tabular data (for each song, we know the title, genre, etc) and your text message inbox is tabular data (for each conversation, we know the participants, unread status, date of most recent message, etc)
Because pandas is so frequently used, it's standard to import it with the pd abbreviation.
import pandas as pd
Loading data#
Pandas comes with many functions for reading lots of different kinds of data. We’ll cover most of them by the end of the course, but here’s a list of the main ones for your later reference
CSVs (read_csv)
SQL databases (read_sql)
Excel files (read_excel)
Parquet files (read_parquet)
Any of the above, as a URL
To start off, we’ll load this CSV file of ground water stations.
url = "https://raw.githubusercontent.com/py4wrds/py4wrds/refs/heads/module-4/data/gwl/stations.csv"
df = pd.read_csv(url)
Note that we’re loading the data directly from GitHub! This won’t work for private repositories, there you’ll have to download the file and use the filepath instead of the url.
DataFrames#
The code above (df = pd.read_csv(url)) has loaded our tabular data into an object called a DataFrame:
type(df)
pandas.core.frame.DataFrame
The DataFrame is one of the two core classes that pandas gives us (the other is Series which represents a column).
DataFrame has a number of attributes (variables) and methods (functions) for inspecting our data, which is the first step in any analysis!
head() shows us the first five rows of data
df.head()
STN_ID | SITE_CODE | SWN | WELL_NAME | LATITUDE | LONGITUDE | WLM_METHOD | WLM_ACC | BASIN_CODE | BASIN_NAME | COUNTY_NAME | WELL_DEPTH | WELL_USE | WELL_TYPE | WCR_NO | ZIP_CODE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 51445 | 320000N1140000W001 | NaN | Bay Ridge | 35.5604 | -121.755 | USGS quad | Unknown | NaN | NaN | Monterey | NaN | Residential | Part of a nested/multi-completion well | NaN | 92154 |
1 | 25067 | 325450N1171061W001 | 19S02W05K003S | NaN | 32.5450 | -117.106 | Unknown | Unknown | 9-033 | Coastal Plain Of San Diego | San Diego | NaN | Unknown | Unknown | NaN | 92154 |
2 | 25068 | 325450N1171061W002 | 19S02W05K004S | NaN | 32.5450 | -117.106 | Unknown | Unknown | 9-033 | Coastal Plain Of San Diego | San Diego | NaN | Unknown | Unknown | NaN | 92154 |
3 | 39833 | 325450N1171061W003 | 19S02W05K005S | NaN | 32.5450 | -117.106 | Unknown | Unknown | 9-033 | Coastal Plain Of San Diego | San Diego | NaN | Unknown | Unknown | NaN | 92154 |
4 | 25069 | 325450N1171061W004 | 19S02W05K006S | NaN | 32.5450 | -117.106 | Unknown | Unknown | 9-033 | Coastal Plain Of San Diego | San Diego | NaN | Unknown | Unknown | NaN | 92154 |
Because our table has so many columns, not all are visible. We can use the .columns attribute to get all the names on one screen
df.columns
Index(['STN_ID', 'SITE_CODE', 'SWN', 'WELL_NAME', 'LATITUDE', 'LONGITUDE',
'WLM_METHOD', 'WLM_ACC', 'BASIN_CODE', 'BASIN_NAME', 'COUNTY_NAME',
'WELL_DEPTH', 'WELL_USE', 'WELL_TYPE', 'WCR_NO', 'ZIP_CODE'],
dtype='object')
How much data are we working with? .shape gives us both the row and column count, which is always in row, col order
print(df.shape)
(43807, 16)
but it’s more clear to take the length directly:
print("n_rows: {}".format(len(df)))
print("n_cols: {}".format(len(df.columns)))
n_rows: 43807
n_cols: 16
dtypes shows the data type of each column.
This is important to check! Pandas makes some guesses about what data type to use, and often gets things wrong. Common pitfalls to be wary of include
Dates might be loaded as strings instead of rich datetime objects.
Numerical columns like $145 or 78% might be loaded as strings instead of as numbers.
A single row with a typo (32.111!) or a non-numeric placeholder (using Unknown instead of NaN) will turn an otherwise numeric column into a string type.
Pandas defaults to using 64 bit integers and floats. If your dataset is maxing out your memory, you can specify 32 bit (or smaller) dtypes to reduce the size once loaded.
ZIP codes should be parsed as strings, not integers, to avoid stripping the leading zero from ZIPs that begin with zero.
print(df.dtypes)
STN_ID int64
SITE_CODE object
SWN object
WELL_NAME object
LATITUDE float64
LONGITUDE float64
WLM_METHOD object
WLM_ACC object
BASIN_CODE object
BASIN_NAME object
COUNTY_NAME object
WELL_DEPTH float64
WELL_USE object
WELL_TYPE object
WCR_NO object
ZIP_CODE int64
dtype: object
We can see pandas has done a pretty good job here! (The object type is what pandas uses to represent strings.)
Fixing some dtype issues can involve more complex analysis. But for simple cases, we can simply tell pandas what to do when loading the data:
df = pd.read_csv(url, dtype={"ZIP_CODE": str})
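If a column has already been loaded as strings, pd.to_numeric() can convert it after the fact; with errors="coerce", anything unparseable becomes NaN. Here's a small sketch using a made-up messy column:
messy = pd.Series(["308", "615", "Unknown", "148"])
cleaned = pd.to_numeric(messy, errors="coerce")  # "Unknown" becomes NaN, the rest become floats.
print(cleaned)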
describe() gives a summary of our numerical columns. With the reloaded dataframe, ZIP_CODE is no longer considered numeric!
df.describe()
STN_ID | LATITUDE | LONGITUDE | WELL_DEPTH | |
---|---|---|---|---|
count | 43807.000000 | 43807.000000 | 43807.000000 | 11646.000000 |
mean | 25718.430753 | 35.957077 | -119.233081 | 470.534003 |
std | 14618.354046 | 1.795673 | 1.850751 | 470.431239 |
min | 2000.000000 | 32.545000 | -124.330000 | 0.000000 |
25% | 13455.500000 | 34.610600 | -120.568000 | 148.000000 |
50% | 24625.000000 | 35.919700 | -119.413000 | 308.000000 |
75% | 35987.500000 | 37.007250 | -117.423000 | 615.000000 |
max | 55326.000000 | 42.012700 | -114.134000 | 3000.000000 |
Finally, we’re not always interested in the whole dataset for every analysis. You can load a subset of the columns to speed up loading, reduce memory pressure, and to just keep your workspace tidier:
df_county = pd.read_csv(url, usecols=["STN_ID", "WELL_DEPTH"])
df_county.tail()
STN_ID | WELL_DEPTH | |
---|---|---|
43802 | 37755 | 125.0 |
43803 | 25155 | 600.0 |
43804 | 38726 | 95.0 |
43805 | 26004 | 380.0 |
43806 | 26051 | 300.0 |
Based on the output of df_county.tail(), what do you think the tail() method does?
As well as reading DataFrames from files, they can also be constructed from a dict of Python sequences. Each column must be the same length (it's OK to mix lists, arrays, tuples, etc) or be a single value (which will be repeated for the whole column)
df_from_dict = pd.DataFrame(
{
"state": ["Washington", "Oregon", "California"],
"precip": np.array([38.67, 43.62, 22.97]),
"country": "USA",
}
)
Series#
A column of a DataFrame is a Series object. You can access a column by using its name in [] brackets, just like a dictionary:
type(df["BASIN_NAME"])
pandas.core.series.Series
df["COUNTY_NAME"]
0 Monterey
1 San Diego
2 San Diego
3 San Diego
4 San Diego
...
43802 Klamath, OR
43803 Klamath, OR
43804 Klamath, OR
43805 Klamath, OR
43806 Klamath, OR
Name: COUNTY_NAME, Length: 43807, dtype: object
You can also access columns with attribute dot notation, like df.COUNTY_NAME.
df.COUNTY_NAME
0 Monterey
1 San Diego
2 San Diego
3 San Diego
4 San Diego
...
43802 Klamath, OR
43803 Klamath, OR
43804 Klamath, OR
43805 Klamath, OR
43806 Klamath, OR
Name: COUNTY_NAME, Length: 43807, dtype: object
That’s shorter and clearer! But there are some catches.
The dot notation won’t work when the column name isn’t a valid Python name (like if it contains spaces).
It also won’t work if the column name would override a DataFrame attribute or method (like
head
).Finally, you can’t use the dot notation for assignment (the code
df.WELL_DEPTH = np.nan
won’t work).
For those reasons, it can be simpler to just use df["column_name"]
syntax always.
We can convert a Series to a list with to_list()
df["COUNTY_NAME"].to_list()[:5]
['Monterey', 'San Diego', 'San Diego', 'San Diego', 'San Diego']
or to a numpy array
df["WELL_DEPTH"].to_numpy()[-5:]
array([125., 600., 95., 380., 300.])
Series supports boolean slicing with []
is_san_diego = df["COUNTY_NAME"] == "San Diego"
san_diego_ids = df["STN_ID"][is_san_diego]
san_diego_ids
1 25067
2 25068
3 39833
4 25069
5 38479
...
3142 10637
3181 31281
3183 31280
3195 31295
3400 10634
Name: STN_ID, Length: 2474, dtype: int64
However, to slice by positional integer, you need to use .iloc[] (for integer location)
series_head = df["COUNTY_NAME"].iloc[0:5]
series_head
0 Monterey
1 San Diego
2 San Diego
3 San Diego
4 San Diego
Name: COUNTY_NAME, dtype: object
Otherwise a pandas Series is very similar to a numpy array (often a numpy array is being used by pandas to store a Series!). You can do slicing, indexing, arithmetic, and boolean logic just like in numpy.
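For example, here's a quick sketch using columns we already have:
# Element-wise arithmetic, just like numpy.
well_depth_m = df["WELL_DEPTH"] / 3.28084

# Comparisons give a boolean Series, which can be used for filtering or counting.
is_deep = df["WELL_DEPTH"] > 500
print(is_deep.sum())  # The number of True values.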
Series methods#
Pandas columns come with a range of helpful methods.
There are methods for most of the same math equations as numpy: min, max, mean, mode, median, quantile, sum, etc. Unlike numpy, pandas methods tend to ignore NaN values.
print(df["WELL_DEPTH"].sum())
5479839.0
There are methods for unique values. If a column is all the same, that could signify a data issue, or perhaps mean we don't need to load that column.
assert df["STN_ID"].is_unique, "There should be no duplicated IDs"
assert df["COUNTY_NAME"].nunique() > 1, "Ensure we're not using a single-county subset"
A particularly useful method is value_counts. Used for summarizing discrete value distributions, it displays the count of each unique value.
By default, NaN values aren't included, but for data exploration it's really important to know where our NaNs are, so we add dropna=False!
df["WELL_USE"].value_counts(dropna=False)
WELL_USE
Unknown 29458
Irrigation 6052
Observation 3910
Residential 1858
NaN 1400
Other 392
Public Supply 391
Stockwatering 208
Industrial 138
Name: count, dtype: int64
If only the unique names are needed without the counts, the unique() method provides that
df["WELL_USE"].unique()
array(['Residential', 'Unknown', 'Observation', 'Other', 'Irrigation',
'Industrial', 'Public Supply', 'Stockwatering', nan], dtype=object)
Column wrangling#
By passing a list of column names, we can slice all rows for only the specified columns.
(Note the double brackets: one set for the indexing, and another pair for the list):
df[["BASIN_NAME", "COUNTY_NAME"]].head()
BASIN_NAME | COUNTY_NAME | |
---|---|---|
0 | NaN | Monterey |
1 | Coastal Plain Of San Diego | San Diego |
2 | Coastal Plain Of San Diego | San Diego |
3 | Coastal Plain Of San Diego | San Diego |
4 | Coastal Plain Of San Diego | San Diego |
New columns can be added as a singular value
df["country_code"] = "US"
or with any list/Series/array of the same length as the DataFrame (pandas will convert it to a Series)
site_code_prefix = []
for code in df.SITE_CODE:
if code:
site_code_prefix.append(code.split("N")[0])
else:
site_code_prefix.append(None)
df["site_code_prefix"] = site_code_prefix
or by performing calculations between columns
df["well_depth_m"] = df.WELL_DEPTH / 3.28084
df[["site_code_prefix", "well_depth_m"]].tail()
site_code_prefix | well_depth_m | |
---|---|---|
43802 | 419995 | 38.099999 |
43803 | 420008 | 182.879994 |
43804 | 420050 | 28.955999 |
43805 | 420127 | 115.823996 |
43806 | 420171 | 91.439997 |
Single columns can be deleted with the del keyword
del df["ZIP_CODE"]
print("ZIP_CODE" in df)
False
and for multiple columns, there’s a drop method.
df = df.drop(columns=["site_code_prefix", "well_depth_m"])
print("site_code_prefix" in df)
False
Filtering and slicing#
We’ve already seen the head()
method, which shows the first n rows.
A similar function is the sample()
method, which shows n random rows. This can give a better sense of the data, in case the first few rows aren’t representative of the rest.
df.sample(n=7)
STN_ID | SITE_CODE | SWN | WELL_NAME | LATITUDE | LONGITUDE | WLM_METHOD | WLM_ACC | BASIN_CODE | BASIN_NAME | COUNTY_NAME | WELL_DEPTH | WELL_USE | WELL_TYPE | WCR_NO | country_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
23341 | 34610 | 361453N1201199W001 | 20S17E34E001M | 20S/17E-34E01 | 36.1453 | -120.120 | USGS quad | 2.5 ft | 5-022.09 | Westside | Fresno | 1125.0 | Irrigation | Single Well | 331888 | US |
11466 | 29524 | 346605N1182276W001 | 07N13W26Q001S | NaN | 34.6605 | -118.228 | Unknown | Unknown | 6-044 | Antelope Valley | Los Angeles | NaN | Unknown | Unknown | NaN | US |
28159 | 17628 | 365805N1192578W001 | 15S25E33G001M | NaN | 36.5805 | -119.258 | Unknown | Unknown | 5-022.08 | Kings | Tulare | NaN | Unknown | Unknown | NaN | US |
12867 | 8589 | 347753N1174123W001 | 08N05W21C001S | NaN | 34.7753 | -117.412 | Unknown | Unknown | 6-041 | Middle Mojave River Valley | San Bernardino | NaN | Unknown | Unknown | NaN | US |
36131 | 29201 | 375207N1208779W001 | 05S10E09C001M | NaN | 37.5207 | -120.878 | Unknown | Unknown | 5-022.03 | Turlock | Stanislaus | NaN | Unknown | Unknown | NaN | US |
4720 | 6224 | 336595N1170887W001 | 06S02W09K002S | NaN | 33.6595 | -117.089 | Unknown | Unknown | 8-005 | San Jacinto | Riverside | NaN | Unknown | Unknown | NaN | US |
4596 | 5845 | 336267N1171204W001 | 06S02W30A001S | NaN | 33.6267 | -117.120 | Unknown | Unknown | NaN | NaN | Riverside | NaN | Unknown | Unknown | NaN | US |
Technically what we’re doing here isn’t just printing some rows of our dataset, but actually creating a new DataFrame with some rows sliced from the old one, and printing that new frame.
There are some other DataFrame methods that return a new DataFrame with a subset of rows. drop_duplicates() returns a DataFrame with repeated rows removed. dropna() returns a DataFrame with only rows that don't have any NaN values:
df_unique = df.drop_duplicates()
df_clean = df_unique.dropna()
len(df_clean)
4027
Because these are both DataFrame methods that return another DataFrame, we can chain them together to save space
df_clean = df.drop_duplicates().dropna()
len(df_clean)
4027
Most pandas methods return a new DataFrame rather than modifying the original one. We can see that our original still has the same number of rows:
len(df)
43807
.iloc does integer indexing on a DataFrame too, but it's clearer to use the column names rather than their positional index. For that, .loc allows us to filter by both row and col at once (remember the row, col order).
# Get all the USGS rows, and keep only the positional data.
df_USGS = df.loc[df.WCR_NO == "USGS", ["WCR_NO", "LATITUDE", "LONGITUDE"]]
df_USGS.head()
WCR_NO | LATITUDE | LONGITUDE | |
---|---|---|---|
39035 | USGS | 38.3759 | -121.955 |
39891 | USGS | 38.6313 | -121.894 |
40978 | USGS | 38.9552 | -122.097 |
41118 | USGS | 38.9884 | -122.007 |
41299 | USGS | 39.0429 | -122.054 |
.loc is also used in pandas for assignment of a subset of rows/cols.
# For every row with COUNTY_NAME equal to "Monterey", set that row's WELL_USE value to Residential.
df.loc[df.COUNTY_NAME == "Monterey", "WELL_USE"] = "Residential"
df.head()
STN_ID | SITE_CODE | SWN | WELL_NAME | LATITUDE | LONGITUDE | WLM_METHOD | WLM_ACC | BASIN_CODE | BASIN_NAME | COUNTY_NAME | WELL_DEPTH | WELL_USE | WELL_TYPE | WCR_NO | country_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 51445 | 320000N1140000W001 | NaN | Bay Ridge | 35.5604 | -121.755 | USGS quad | Unknown | NaN | NaN | Monterey | NaN | Residential | Part of a nested/multi-completion well | NaN | US |
1 | 25067 | 325450N1171061W001 | 19S02W05K003S | NaN | 32.5450 | -117.106 | Unknown | Unknown | 9-033 | Coastal Plain Of San Diego | San Diego | NaN | Unknown | Unknown | NaN | US |
2 | 25068 | 325450N1171061W002 | 19S02W05K004S | NaN | 32.5450 | -117.106 | Unknown | Unknown | 9-033 | Coastal Plain Of San Diego | San Diego | NaN | Unknown | Unknown | NaN | US |
3 | 39833 | 325450N1171061W003 | 19S02W05K005S | NaN | 32.5450 | -117.106 | Unknown | Unknown | 9-033 | Coastal Plain Of San Diego | San Diego | NaN | Unknown | Unknown | NaN | US |
4 | 25069 | 325450N1171061W004 | 19S02W05K006S | NaN | 32.5450 | -117.106 | Unknown | Unknown | 9-033 | Coastal Plain Of San Diego | San Diego | NaN | Unknown | Unknown | NaN | US |
Looping over DataFrames#
To loop over a single column (Series) you can use it directly in a standard for loop:
for county in df.head()["COUNTY_NAME"]:
print(county)
Monterey
San Diego
San Diego
San Diego
San Diego
To loop over a few columns at a time, you can use Python's builtin zip function:
for stn_id, county in zip(df.head()["STN_ID"], df.head()["COUNTY_NAME"]):
print("{}, {}".format(stn_id, county))
51445, Monterey
25067, San Diego
25068, San Diego
39833, San Diego
25069, San Diego
To loop over each row at a time, the most robust approach is to convert the DataFrame to a list of dicts. (You must use the orient="records" argument, as there are many other ways to format the data as a dictionary.)
for row_dict in df.head().to_dict(orient="records"):
print("{}, {}".format(row_dict["STN_ID"], row_dict["COUNTY_NAME"]))
51445, Monterey
25067, San Diego
25068, San Diego
39833, San Diego
25069, San Diego
Pandas also has the functions iterrows and itertuples, which can sometimes be useful for looping over rows, but both have some tricky blind spots (iterrows converts the whole row to a single dtype, itertuples doesn't play well with column names containing spaces). Because of that, we recommend using one of the methods mentioned above.
Pandas index#
In the examples above, you might have noticed that when printing a DataFrame or a Series, it also shows a row number.
df.iloc[[1000, 2000, 3000, 4000]][["STN_ID", "BASIN_NAME"]]
STN_ID | BASIN_NAME | |
---|---|---|
1000 | 32211 | NaN |
2000 | 11760 | Lower San Luis Rey Valley |
3000 | 12002 | Temecula Valley |
4000 | 9362 | Cahuilla Valley |
Every DataFrame and Series has a row ID, which pandas calls an index.
By default the index is the row number, but we could, for example, use an actual ID from our dataset
df_index = df.set_index(df.STN_ID.to_numpy())
df_index.iloc[[1000, 2000, 3000, 4000]][["STN_ID", "BASIN_NAME"]]
STN_ID | BASIN_NAME | |
---|---|---|
32211 | 32211 | NaN |
11760 | 11760 | Lower San Luis Rey Valley |
12002 | 12002 | Temecula Valley |
9362 | 9362 | Cahuilla Valley |
Indexes provide very fast lookups of individual rows using the .loc attribute
df_index.loc[12002]
STN_ID 12002
SITE_CODE 334439N1170645W001
SWN 08S02W35D001S
WELL_NAME NaN
LATITUDE 33.4439
LONGITUDE -117.065
WLM_METHOD Unknown
WLM_ACC Unknown
BASIN_CODE 9-005
BASIN_NAME Temecula Valley
COUNTY_NAME Riverside
WELL_DEPTH NaN
WELL_USE Unknown
WELL_TYPE Unknown
WCR_NO NaN
country_code US
Name: 12002, dtype: object
However, it’s rare to access individual rows in analytical applications, so for simplicity we won’t be using indexes in this course.
You can always achieve the same result using boolean indexing
df[df.STN_ID == 12002].iloc[0] # Filter and take the first match.
STN_ID 12002
SITE_CODE 334439N1170645W001
SWN 08S02W35D001S
WELL_NAME NaN
LATITUDE 33.4439
LONGITUDE -117.065
WLM_METHOD Unknown
WLM_ACC Unknown
BASIN_CODE 9-005
BASIN_NAME Temecula Valley
COUNTY_NAME Riverside
WELL_DEPTH NaN
WELL_USE Unknown
WELL_TYPE Unknown
WCR_NO NaN
country_code US
Name: 3000, dtype: object
String series#
As well as numerical data, the pandas Series class also has methods for working with strings.
We’ll demo this with a dataset that has a few more strings: CIWQS NPDES Permits.
df_npdes = pd.read_excel("./data/npdes_data.xlsx", nrows=1000, dtype={"ZIP CODE": str})
df_npdes.head()
DISCHARGER | DISCHARGER ID | RM STATUS | REGION CODE | REG MEAS ID | ORDER NUMBER | NPDES NUMBER | REG MEAS TYPE | FACILIITY NAME | FACILITY PLACE ID | ... | EXPIRATION DATE | TERMINAITON DATE | STREET ADDRESS | FACILITY CITY | ZIP CODE | COUNTY | DISCHARGER STREET ADDRESS | DISCHARGER CITY | DISCHARGER STATE | DISCHARGER ZIP CODE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AC Products Inc | 623 | Active | 8 | 206294 | R8-2012-0027 | CAG918001 | Enrollee | GW Cleanup-Anaheim-AC Product P-2 | 228176.0 | ... | 07/01/2017 | NaN | 1601 Placentia | Anaheim | 92806 | Orange | 9930 Painter Avenue | Whittier | CA | 90605 |
1 | Action Motorsports of Tahoe | 732 | Active | 6A | 379813 | R6T-2016-0038 | NaN | Enrollee | Timber Cove Marina | 264519.0 | ... | 11/01/2026 | NaN | 3411 Lake Tahoe Boulevard | South Lake Tahoe | 96157 | El Dorado | Po Box 13356 | South Lake Tahoe | CA | 96151 |
2 | Aerojet Rocketdyne, Inc. | 854 | Active | 5S | 441580 | R5-2020-0051 | CA0083861 | NPDES Permits | Aerojet Interim GW Extraction & Treatment System | 232546.0 | ... | 10/31/2025 | NaN | Aerojet Road | Sacramento | 95813-6000 | Sacramento | Po Box 13222 | Sacramento | CA | 95813 |
3 | AES Alamitos LLC | 864 | Active | 4 | 437188 | R4-2020-0134 | CA0001139 | NPDES Permits | Alamitos Generating Station | 207266.0 | ... | 12/31/2025 | NaN | 690 North Studebaker Road | Long Beach | 90803 | Los Angeles | 690 North Studebaker Road | Long Beach | CA | 90803 |
4 | AES Huntington Beach LLC | 866 | Active | 8 | 422464 | R8-2020-0040 | CA0001163 | NPDES Permits | Huntington Beach Generating Station | 226995.0 | ... | 12/31/2025 | NaN | 21730 Newland Street | Huntington Beach | 92646 | Orange | 21730 Newland Way | Huntington Beach | CA | 92646 |
5 rows × 24 columns
Say we want to pull out all the permits related to AT&T.
The problem is that there’s inconsistent naming of the facilities (this is often the case with user-entered data)
df_npdes.iloc[27:38]["FACILIITY NAME"]
27 AT&T Statewide Cable System
28 Region 3 - Pacific Bell (a.k.a. SBC) Regionwid...
29 Region 3 - At&T Regionwide Utility Vaults (NPDES)
30 AT&T Utility Vault Discharges
31 AT&T Utility Vault
32 AT & T/Statewide Utility GP
33 At&T Corporation
34 AT&T Utility Vaults in SFBay Region 2
35 AT&T Facilities Under the General Utilities Pe...
36 Atwater Regional WWTF
37 Atwater WWTF Mun WW Disch that Meet Obj/Criter...
Name: FACILIITY NAME, dtype: object
To address this, let's tidy up the name field. We'll do this in a new column so we don't lose our original data.
df_npdes["tidy_name"] = df_npdes["FACILIITY NAME"].copy()
Most of python’s builtin string functions have equivalent pandas Series methods. The pandas methods are much faster though, and for advanced users, many can be used with regular expressions.
# Replace numeric NaN values with empty strings.
df_npdes["tidy_name"] = df_npdes["tidy_name"].fillna("")
# Remove leading/trailing whitespace.
df_npdes["tidy_name"] = df_npdes["tidy_name"].str.strip()
# Convert to uppercase.
df_npdes["tidy_name"] = df_npdes["tidy_name"].str.upper()
# Fix spacing.
df_npdes["tidy_name"] = df_npdes["tidy_name"].str.replace("AT & T", "AT&T")
The new tidy_name column can now be used for filtering. Here we use a boolean index (based on the str.contains() method) to get only the AT&T rows:
df_att = df_npdes[df_npdes["tidy_name"].str.contains("AT&T")]
df_att[["FACILIITY NAME", "tidy_name"]]
FACILIITY NAME | tidy_name | |
---|---|---|
25 | AT&T Utility vaults | AT&T UTILITY VAULTS |
27 | AT&T Statewide Cable System | AT&T STATEWIDE CABLE SYSTEM |
29 | Region 3 - At&T Regionwide Utility Vaults (NPDES) | REGION 3 - AT&T REGIONWIDE UTILITY VAULTS (NPDES) |
30 | AT&T Utility Vault Discharges | AT&T UTILITY VAULT DISCHARGES |
31 | AT&T Utility Vault | AT&T UTILITY VAULT |
32 | AT & T/Statewide Utility GP | AT&T/STATEWIDE UTILITY GP |
33 | At&T Corporation | AT&T CORPORATION |
34 | AT&T Utility Vaults in SFBay Region 2 | AT&T UTILITY VAULTS IN SFBAY REGION 2 |
35 | AT&T Facilities Under the General Utilities Pe... | AT&T FACILITIES UNDER THE GENERAL UTILITIES PE... |
781 | Pacific Bell (AT&T) | PACIFIC BELL (AT&T) |
We can also normalize to 5-digit zip codes by splitting on the dash, then taking the first group using the str[] indexing tool pandas provides:
df_npdes["tidy_zip_code"] = df_npdes["ZIP CODE"].str.split("-").str[0]
df_npdes[["ZIP CODE", "tidy_zip_code"]].head()
ZIP CODE | tidy_zip_code | |
---|---|---|
0 | 92806 | 92806 |
1 | 96157 | 96157 |
2 | 95813-6000 | 95813 |
3 | 90803 | 90803 |
4 | 92646 | 92646 |
The pandas documentation has a Working with text data guide that goes into more details about regular expressions as well as splitting/joining strings, and has a list of all the string methods.
Timeseries data#
Just like pandas groups string functions with a .str prefix, there is also a .dt prefix that contains functions for working with dates, times, and datetimes (timestamps).
Let’s have a look at some of our date columns:
date_cols = ["ADOPTION DATE", "EFFECTIVE DATE", "EXPIRATION DATE"]
df_npdes[date_cols].head()
ADOPTION DATE | EFFECTIVE DATE | EXPIRATION DATE | |
---|---|---|---|
0 | NaN | 09/23/2002 | 07/01/2017 |
1 | 11/01/2016 | 11/01/2016 | 11/01/2026 |
2 | 10/16/2020 | 11/01/2020 | 10/31/2025 |
3 | 11/12/2020 | 01/01/2021 | 12/31/2025 |
4 | 12/04/2020 | 01/01/2021 | 12/31/2025 |
df_npdes[date_cols].dtypes
ADOPTION DATE object
EFFECTIVE DATE object
EXPIRATION DATE object
dtype: object
The object dtype indicates our dates weren't parsed correctly. There's also a numeric NaN mixed in there.
To fix this we're going to have to go back to the data loading. In this case it's enough to tell pandas which columns to treat as dates with the parse_dates argument.
df_npdes = pd.read_excel("./data/npdes_data.xlsx", nrows=1000, parse_dates=date_cols)
df_npdes[date_cols].head()
ADOPTION DATE | EFFECTIVE DATE | EXPIRATION DATE | |
---|---|---|---|
0 | NaT | 2002-09-23 | 2017-07-01 |
1 | 2016-11-01 | 2016-11-01 | 2026-11-01 |
2 | 2020-10-16 | 2020-11-01 | 2025-10-31 |
3 | 2020-11-12 | 2021-01-01 | 2025-12-31 |
4 | 2020-12-04 | 2021-01-01 | 2025-12-31 |
df_npdes[date_cols].dtypes
ADOPTION DATE datetime64[ns]
EFFECTIVE DATE datetime64[ns]
EXPIRATION DATE datetime64[ns]
dtype: object
Now that our dates have the correct type, we can use pandas date/time functionality! Missing/invalid timestamps are specified with NaT (not a timestamp).
For more complex cases, you can use the pd.to_datetime function.
# Convert dates from a mm/dd/yyyy string, halting with an error in the case of any failed conversion.
df_npdes["ADOPTION DATE"] = pd.to_datetime(df_npdes["ADOPTION DATE"], format="%m/%d/%Y", errors="raise")
The .dt prefix has functions for accessing different parts of the timestamp
# English day of week name. Then replace any None or NaNs with an empty string.
df_npdes["ADOPTION DATE"].dt.day_name().fillna("")
0
1 Tuesday
2 Friday
3 Thursday
4 Friday
...
995 Wednesday
996 Wednesday
997 Friday
998 Wednesday
999
Name: ADOPTION DATE, Length: 1000, dtype: object
as well as functions for manipulating timestamps
# Round to the nearest hour (looks like our data is already rounded!).
df_npdes["ADOPTION DATE"].dt.round("h")
0 NaT
1 2016-11-01
2 2020-10-16
3 2020-11-12
4 2020-12-04
...
995 2012-12-12
996 2018-08-08
997 2019-03-08
998 2023-11-08
999 NaT
Name: ADOPTION DATE, Length: 1000, dtype: datetime64[ns]
In addition to timestamps, pandas also has the concept of differences between two timestamps.
A Timedelta is a fixed difference:
# Shift dates 7 days forward into the future.
df_npdes["ADOPTION DATE"] + pd.Timedelta(days=7)
0 NaT
1 2016-11-08
2 2020-10-23
3 2020-11-19
4 2020-12-11
...
995 2012-12-19
996 2018-08-15
997 2019-03-15
998 2023-11-15
999 NaT
Name: ADOPTION DATE, Length: 1000, dtype: datetime64[ns]
while an offset can vary in length depending on context.
# 10 working days later.
df_npdes["ADOPTION DATE"] + pd.offsets.BusinessDay(n=10)
0 NaT
1 2016-11-15
2 2020-10-30
3 2020-11-26
4 2020-12-18
...
995 2012-12-26
996 2018-08-22
997 2019-03-22
998 2023-11-22
999 NaT
Name: ADOPTION DATE, Length: 1000, dtype: datetime64[ns]
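Subtracting one datetime column from another gives a Series of Timedeltas, which is handy for questions like how long each permit is valid (a quick sketch):
# Timedelta between two datetime columns.
permit_length = df_npdes["EXPIRATION DATE"] - df_npdes["EFFECTIVE DATE"]
print(permit_length.dt.days.head())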
Series mapping#
Series has a map() method that applies a regular Python function separately to every element.
This is helpful if your function is too complicated to represent using numpy/pandas arithmetic/methods, or if you'd like to use an existing function that doesn't support arrays.
def tidy_zip_code(code):
"""Normalize and validate zip codes."""
# Normalize to string.
if code is None or pd.isna(code):
return ""
code = str(code)
# Normalize to 5 digits.
code = code.split("-")[0]
if len(code) == 4:
code = "0" + code
# Validate.
assert len(code) == 5, f"Wrong zip length {code=}"
return code
df_npdes["validated_zip"] = df_npdes["ZIP CODE"].map(tidy_zip_code)
df_npdes[["ZIP CODE", "validated_zip"]].tail()
ZIP CODE | validated_zip | |
---|---|---|
995 | 94941 | 94941 |
996 | 94941 | 94941 |
997 | 94941 | 94941 |
998 | 94941 | 94941 |
999 | 93263 | 93263 |
The map() method also accepts a dictionary for simpler cases of input -> output mapping
program_descriptions = {
"INDSTW": "Industrial Stormwater",
"NPDMINING": "National Pollutant Discharge - Mining",
"NPDNONMUNIPRCS": "National Pollutant Discharge - Non-Municipal - Permit-required Confined Space",
"NPDMUNILRG": "National Pollutant Discharge - Large Municipal",
}
df_npdes["full_program_name"] = df_npdes.PROGRAM.map(program_descriptions)
df_npdes[["DISCHARGER", "PROGRAM", "full_program_name"]].head()
DISCHARGER | PROGRAM | full_program_name | |
---|---|---|---|
0 | AC Products Inc | NPDNONMUNIPRCS | National Pollutant Discharge - Non-Municipal ... |
1 | Action Motorsports of Tahoe | INDSTW | Industrial Stormwater |
2 | Aerojet Rocketdyne, Inc. | NPDNONMUNIPRCS | National Pollutant Discharge - Non-Municipal ... |
3 | AES Alamitos LLC | NPDINDLRG | NaN |
4 | AES Huntington Beach LLC | NPDNONMUNIPRCS | National Pollutant Discharge - Non-Municipal ... |
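Codes that are missing from the dictionary come through as NaN (note the NPDINDLRG row above). If we'd rather fall back to the raw code in those cases, one option is to fill the gaps from the original column:
# Use the raw PROGRAM code wherever the dictionary had no description.
df_npdes["full_program_name"] = df_npdes.PROGRAM.map(program_descriptions).fillna(df_npdes.PROGRAM)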
Concatenating and joining DataFrames#
To stack multiple DataFrames that have the same columns, pandas has the concat function.
df_placer = pd.read_csv("data/gwl/Placer.csv")
df_sacramento = pd.read_csv("data/gwl/Sacramento.csv")
df_all = pd.concat([df_placer, df_sacramento])
print(f"{df_placer.shape=}")
print(f"{df_sacramento.shape=}")
print(f"{df_all.shape=}")
df_placer.shape=(184, 14)
df_sacramento.shape=(494, 14)
df_all.shape=(678, 14)
concat stacks rows together by default. But with the axis="columns" argument, we can merge columns from different datasets with the same rows.
df_location = df_placer[["LATITUDE", "LONGITUDE"]]
df_basin = df_placer[["BASIN_CODE", "BASIN_NAME"]]
pd.concat([df_basin, df_location], axis="columns")
BASIN_CODE | BASIN_NAME | LATITUDE | LONGITUDE | |
---|---|---|---|---|
0 | 5-021.64 | North American | 38.7626 | -121.365 |
1 | 5-021.64 | North American | 38.7005 | -121.614 |
2 | 5-021.64 | North American | 38.7005 | -121.614 |
3 | 5-021.64 | North American | 38.7216 | -121.247 |
4 | 5-021.64 | North American | 38.7222 | -121.292 |
... | ... | ... | ... | ... |
179 | 6-108 | Olympic Valley | 39.1980 | -120.230 |
180 | 6-108 | Olympic Valley | 39.1980 | -120.230 |
181 | 6-067 | Martis Valley | 39.3006 | -120.112 |
182 | 6-067 | Martis Valley | 39.3072 | -120.132 |
183 | NaN | NaN | 39.3079 | -120.245 |
184 rows × 4 columns
For more complex database-style joins, we have the pandas.merge function.
Joins combine DataFrames by linking rows based on matching column values.
Let's start by making a DataFrame from a dict of sequences
df_program = pd.DataFrame(
{
"PROGRAM": ["INDSTW", "NPDMINING", "NPDNONMUNIPRCS", "NPDMUNILRG"],
"program_name": [
"Industrial Stormwater",
"National Pollutant Discharge - Mining",
"National Pollutant Discharge - Non-Municipal - Permit-required Confined Space",
"National Pollutant Discharge - Large Municipal",
],
"program_launch_date": pd.to_datetime(["2024-01-01", "2023-12-20", "2016-01-20", "2024-01-01"]),
}
)
df_program
PROGRAM | program_name | program_launch_date | |
---|---|---|---|
0 | INDSTW | Industrial Stormwater | 2024-01-01 |
1 | NPDMINING | National Pollutant Discharge - Mining | 2023-12-20 |
2 | NPDNONMUNIPRCS | National Pollutant Discharge - Non-Municipal ... | 2016-01-20 |
3 | NPDMUNILRG | National Pollutant Discharge - Large Municipal | 2024-01-01 |
and perform a join on the “PROGRAM” column.
# A simple inner join.
df_joined = pd.merge(df_npdes, df_program, on="PROGRAM")
df_joined[["DISCHARGER", "PROGRAM", "program_name", "program_launch_date"]].tail()
DISCHARGER | PROGRAM | program_name | program_launch_date | |
---|---|---|---|---|
814 | Sewerage Agency of Southern Marin | NPDMUNILRG | National Pollutant Discharge - Large Municipal | 2024-01-01 |
815 | Sewerage Agency of Southern Marin | NPDMUNILRG | National Pollutant Discharge - Large Municipal | 2024-01-01 |
816 | Sewerage Agency of Southern Marin | NPDMUNILRG | National Pollutant Discharge - Large Municipal | 2024-01-01 |
817 | Sewerage Agency of Southern Marin | NPDMUNILRG | National Pollutant Discharge - Large Municipal | 2024-01-01 |
818 | Shafter City | NPDNONMUNIPRCS | National Pollutant Discharge - Non-Municipal ... | 2016-01-20 |
Our joined DataFrame has all the columns from both input DataFrames!
You might have noticed that while df_npdes started off with 1000 rows, the last row number on df_joined is 818!
By default, merge() does an inner merge, which means any row with a value for PROGRAM that isn't in both DataFrames is discarded.
Sometimes we do only care about the intersection of two datasets, in which case an inner merge is perfect! But more often in analysis we want to keep our main DataFrame intact, and just enhance it with more columns wherever we have data. This is a left join.

Fig. 1 Source: Pratik Joshi#
There’s a couple more tweaks we can make to our join for improved resilience.
Often the joining column will have different names between our dataset. We can specify this with the
left_on
andright_on
arguments.If our incoming DataFrame has multiple rows with the same join column value, we can end up with extra data post-join. Passing
validate="many_to_one"
enforces the lookup style of join.
Here’s a more typical left join:
# A typical data science left join.
df_joined = pd.merge(
df_npdes, # Left dataframe
df_program, # Right dataframe,
how="left",
left_on="PROGRAM",
right_on="PROGRAM",
validate="many_to_one",
)
df_joined[["DISCHARGER", "PROGRAM", "program_name", "program_launch_date"]].tail()
DISCHARGER | PROGRAM | program_name | program_launch_date | |
---|---|---|---|---|
995 | Sewerage Agency of Southern Marin | NPDMUNILRG | National Pollutant Discharge - Large Municipal | 2024-01-01 |
996 | Sewerage Agency of Southern Marin | NPDMUNILRG | National Pollutant Discharge - Large Municipal | 2024-01-01 |
997 | Sewerage Agency of Southern Marin | NPDMUNILRG | National Pollutant Discharge - Large Municipal | 2024-01-01 |
998 | Sewerage Agency of Southern Marin | NPDMUNILRG | National Pollutant Discharge - Large Municipal | 2024-01-01 |
999 | Shafter City | NPDNONMUNIPRCS | National Pollutant Discharge - Non-Municipal ... | 2016-01-20 |
Now all 1000 rows are there, and we can clearly identify the added columns.
The pandas documentation has a join guide that goes into more detail about the different join types, validation options, and merge arguments.
Grouping#
Grouping is another operation that will be familiar to those who have worked with databases.
The idea is to categorize rows of a DataFrame into groups, then calculate summary statistics for each group.
You first create a group by passing one or more columns to groupby(). Then define each aggregate with three pieces of information:
the name of the result column is the name of the input argument
the first value of the tuple is the group's column to use
the second value of the tuple is the function to apply to the group. It can be the name of a pandas function as a string, or an actual function.
def active_frac(group):
"""The fraction of the group that has the value 'Active'"""
return np.mean(group == "Active")
df_npdes.groupby(["DISCHARGER STATE"]).agg(
n_permits=("DISCHARGER ID", "size"),
first_adoption=("ADOPTION DATE", np.min),
active_frac=("RM STATUS", active_frac),
)
/var/folders/df/w347rlh96tx42h1kl69_4r8w0000gn/T/ipykernel_73578/81745869.py:6: FutureWarning: The provided callable <function min at 0x10a3f3880> is currently using SeriesGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
df_npdes.groupby(["DISCHARGER STATE"]).agg(
n_permits | first_adoption | active_frac | |
---|---|---|---|
DISCHARGER STATE | |||
CA | 950 | 1990-05-11 | 0.851579 |
CO | 11 | 2015-08-13 | 1.000000 |
IL | 1 | NaT | 0.000000 |
KS | 1 | NaT | 1.000000 |
MI | 1 | NaT | 1.000000 |
MN | 1 | NaT | 0.000000 |
Multiple Addresses | 8 | 2012-12-12 | 0.875000 |
NJ | 8 | 2006-07-19 | 1.000000 |
NV | 3 | 2014-07-16 | 0.666667 |
PA | 2 | NaT | 1.000000 |
TX | 3 | NaT | 1.000000 |
VA | 1 | 2012-02-08 | 1.000000 |
WA | 1 | NaT | 1.000000 |
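As the FutureWarning above suggests, passing the aggregation function as the string "min" rather than np.min gives the same result without the warning:
df_npdes.groupby(["DISCHARGER STATE"]).agg(
    n_permits=("DISCHARGER ID", "size"),
    first_adoption=("ADOPTION DATE", "min"),
    active_frac=("RM STATUS", active_frac),
)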
Plotting with matplotlib#
Because the module name is extremely long, it’s customary to import using
import matplotlib.pyplot as plt
To begin, we’ll load in this DataFrame of groundwater measurements at a 10 different stations over time
df_10 = pd.read_csv("data/gwl/10-sites.csv", usecols=["SITE_CODE", "MSMT_DATE", "GSE_WSE", "COUNTY_NAME"], parse_dates=["MSMT_DATE"])
assert df_10.SITE_CODE.nunique() == 10
df_10.sample(5)
SITE_CODE | MSMT_DATE | GSE_WSE | COUNTY_NAME | |
---|---|---|---|---|
4538 | 388943N1214335W001 | 2005-04-07 | 49.90 | Placer |
6059 | 384121N1212102W001 | 1996-01-13 | 141.74 | Sacramento |
6057 | 384121N1212102W001 | 1996-01-27 | 141.21 | Sacramento |
1360 | 385567N1214751W001 | 1988-01-21 | 26.00 | Sacramento |
2178 | 387511N1213389W001 | 2000-12-04 | 154.40 | Placer |
We start with an empty plot using the subplots() function. This returns two things:
a Figure object, which has methods for adjusting the plot layout and saving your plot
one or more Axes objects, which have methods for plotting data
To plot a simple line for a single site, we need the Axes.plot() method, and pass first an array of our x-axis data, then our y-axis data.
df_site = df_10[df_10.SITE_CODE == "384121N1212102W001"]
fig, ax = plt.subplots()
ax.plot(df_site.MSMT_DATE, df_site.GSE_WSE)
[<matplotlib.lines.Line2D at 0x14df57ed0>]

It’s a basic plot but it works!
Let’s improve the readability of this plot a bit. Can you see in the code below where each of these changes was made?
Increase the resolution of the image (from the default 100 dpi)
Label each axis
Label the entire figure
fig, ax = plt.subplots(dpi=200)
ax.plot(df_site.MSMT_DATE, df_site.GSE_WSE)
ax.set_xlabel("Date")
ax.set_ylabel("Groundwater depth [ft]")
fig.suptitle("Site 384121N1212102W001 (Sacremento)")
Text(0.5, 0.98, 'Site 384121N1212102W001 (Sacremento)')

We can visualize the same data using a scatter plot. To show the density of overlapping points, add an alpha keyword, where 1 is fully opaque and 0 is fully transparent
fig, ax = plt.subplots(dpi=200)
ax.scatter(
df_site.MSMT_DATE,
df_site.GSE_WSE,
alpha=0.5, # Each marker is 50% transparent.
color="black",
s=5, # Marker size.
)
<matplotlib.collections.PathCollection at 0x14df63cb0>

Multiple things can be plotted on the same Axes object. For example, it can be helpful to have both the interpolated line from plot() as well as the observation locations from scatter(). The most recent plot will be on top.
fig, ax = plt.subplots(dpi=200)
ax.scatter(
df_site.MSMT_DATE,
df_site.GSE_WSE,
alpha=0.2,
color="black",
s=6,
)
ax.plot(df_site.MSMT_DATE, df_site.GSE_WSE, color="red", alpha=0.75)
[<matplotlib.lines.Line2D at 0x14e979090>]

Going back to our original dataset of 10 different stations, let's plot all 10 in a loop onto the same Axes object. To tell the lines apart, we add a label= argument to each plot call, then add a legend box to the axis with the legend() method
fig, ax = plt.subplots(dpi=200)
for site_code in df_10.SITE_CODE.unique():
df_site_code = df_10[df_10.SITE_CODE == site_code]
ax.plot(df_site_code.MSMT_DATE, df_site_code.GSE_WSE, alpha=0.75, label=site_code)
ax.legend(
fontsize="small",
bbox_to_anchor=(1, 1), # Shift legend outside of plot.
)
<matplotlib.legend.Legend at 0x14e9041a0>

Saving plots#
You can save a plot using the Figure's savefig method. For instance, to save a pdf version of the most recent plot
fig.savefig("10-site-water-depth.pdf")
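A PNG works too; as an optional tweak, bbox_inches="tight" makes sure the legend we placed outside the plot isn't cropped off, and dpi controls the resolution:
# Higher-resolution PNG; "tight" expands the saved area to include the outside legend.
fig.savefig("10-site-water-depth.png", dpi=300, bbox_inches="tight")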
Even quicker in a notebook environment: right click on the plot, and copy/save as a PNG!
Plotting with seaborn#
With data that requires aggregation to plot (such as grouping or splitting), the seaborn package can handle much of the complexity for you.
Seaborn is normally imported with the sns abbreviation.
import seaborn as sns
Using seaborn with DataFrames is a little different than with matplotlib directly. You typically pass the whole DataFrame as a data argument, then the column names (rather than the data) as the x and y arguments.
Seaborn has it’s own scatter function (scatterplot
):
fig, ax = plt.subplots(dpi=200)
sns.scatterplot(
ax=ax,
data=df_site,
x="MSMT_DATE",
y="GSE_WSE",
alpha=0.5,
color="black",
s=6,
)
<Axes: xlabel='MSMT_DATE', ylabel='GSE_WSE'>

With seaborn, we still use a matplotlib Figure and Axes. But we automatically get axis labels!
Most seaborn functions perform some data processing before plotting. For example, histplot calculates and plots a histogram
fig, ax = plt.subplots(dpi=200)
sns.histplot(ax=ax, data=df_site, x="GSE_WSE")
<Axes: xlabel='GSE_WSE', ylabel='Count'>

The thin lines of a KDE distribution plot (kdeplot()) make it easier to compare the distributions of all 10 sites
fig, ax = plt.subplots(dpi=200)
sns.kdeplot(ax=ax, data=df_10, x="GSE_WSE", hue="SITE_CODE")
<Axes: xlabel='GSE_WSE', ylabel='Density'>

It’s still a little messy though, a box plot might be clearer still.
fig, ax = plt.subplots(dpi=200)
sns.boxplot(ax=ax, data=df_10, x="GSE_WSE", y="SITE_CODE")
<Axes: xlabel='GSE_WSE', ylabel='SITE_CODE'>

Seaborn has dozens of plot functions, mostly focused on comparing distributions of data.
The seaborn gallery is the place to go for a good overview of what seaborn can do.