3. Python fundamentals II#

As your programs get larger, you’ll want to get organized. This module covers the hierarchy of concepts in Python for grouping and reusing code:

  • Functions

  • Classes

  • Modules

  • Packages

Functions#

A function is a chunk of code that takes inputs, does some processing on them, then returns an output.

Functions allow us to

  • Reuse code without having to write it out again

  • Ensure consistency and reproducibility by confining logic to a single implementation

  • Group related code to keep things organized

  • Use code developed by others

Anatomy of a Python function#

Let’s dive right in and look at a function. Here’s a function that converts volume amounts from acre-feet to cubic meters.

def acre_feet_to_m3(volume_acre_feet):
    """Converts volume in US acre-feet to SI m³."""
    volume_m3 = volume_acre_feet * 1_233.482
    return volume_m3

A Python function has the following parts:

  • Functions are created using the def keyword

  • Then the name of the function

  • Zero or more input arguments inside parentheses ()

  • A colon :

  • A docstring documenting the function

    • Technically these are optional, but they really help with understanding code!

  • The code of the function, indented one level (like a for loop)

  • An output, signified by the return keyword

Nothing is displayed when defining a function: the code within the function runs only when it is called.

Running this function works just like we’ve already seen with built-in Python functions like print and str:

lake_tahoe_volume_acre_feet = 120_000_000
acre_feet_to_m3(lake_tahoe_volume_acre_feet)
148017840000.0

Note

Python lets you use underscores _ to group digits. So writing 120_000_000 is the same as 120000000. This really helps reading large numbers!

Scoping#

Any variables created inside a function are only available inside that function. If you try to use volume_m3 now, you will see an error. This scoping is one of the benefits of functions: it keeps your workspace clean of intermediate variables.

print(volume_m3)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 volume_m3

NameError: name 'volume_m3' is not defined

Default arguments#

Input arguments can have default values.

Normally, running a function without specifying all of its arguments results in an error.

acre_feet_to_m3()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: acre_feet_to_m3() missing 1 required positional argument: 'volume_acre_feet'

But when default values are given for arguments using an = sign in the function definition, Python will use that default for any missing arguments.

def say_hello(user="anonymous user"):
    print("Hi {}!".format(user))
say_hello("Andrew")
Hi Andrew!
say_hello()
Hi anonymous user!

It’s nice to give your arguments defaults when you know what the value will be most of the time, but still allow the possibility of using a different input when you need to override that common case.

Named arguments#

A function receives its arguments in the order you provide them when calling it:

def bounds_area(left, bottom, right, top, force_positive=True):
    """Area of rectangular bounding box."""
    height = top - bottom
    width = right - left
    area = height * width
    if force_positive:
        area = abs(area)
    return area


bounds_area(4137299, 606008, 4137399, 606009)
100

Using argument names lets you control the order of the arguments. Plus, it makes the code much easier to understand when you’re trying to read it later!

bounds_area(left=4137299, right=4137399, bottom=606008, top=606009)
100

In science and geospatial coding, we frequently move between different lat/lon, x/y, and row/col ordering conventions, and work with complicated algorithms that take a large number of parameters.

By writing out argument names in full wherever they’re not completely obvious, we help others and our future selves to read our code, and make our code more resilient to bugs.

Outputs#

Technically, a Python function can only return a single value as output.

But if we want zero output we can just return None (this is also what happens when we have a function without a return statement):

import shutil
from pathlib import Path


def delete_temp_run_files(run_id):
    """Tidy up any temporary files from the last run."""
    tmp_folder_path = Path("./runs/{}/tmp/".format(run_id))
    if tmp_folder_path.exists():
        # Path.unlink only removes files; shutil.rmtree removes a whole folder.
        shutil.rmtree(tmp_folder_path)


result = delete_temp_run_files(run_id="2025-02-01_v2")
print(result)
None

To pack multiple outputs into one variable, we can use a tuple:

def extract_lat_lon(latlon):
    """Extract lat,lon from a string like '37.364,-122.010'"""
    parts = latlon.split(",")
    lat = float(parts[0])
    lon = float(parts[1])
    return (lat, lon)


result_lat, result_lon = extract_lat_lon("37.364,-122.010")

result_lat
37.364

Exercise: create a function#

Imagine you have data with precise elevation readings in meters like 4421.01237.

You want to produce a report that’s easy to read: with elevation values in feet and rounded to the nearest whole number (14505).

Write a function that takes an elevation in meters, and returns a rounded reading in feet.

Think about

  • What should the function be called?

  • What variable names should I use?

  • How should I document the function?

  • Does the order of operations (rounding and converting) make a difference? If you’re not sure, try it both ways and see!

  • Hint: python has a round function.

  • Extra credit: add an optional parameter to control the number of decimal places to round to.

Function decorators#

A decorator (signified with a @ symbol) modifies the functionality of a function.

Mastering decorators is a more advanced topic. But there’s one super useful decorator you should know about: the cache decorator in the builtin functools library.

Say you have a function that is very slow and always returns the same output when given the same input argument. (This is really common in data science, think loading data from remote servers or performing very complicated but deterministic calculations).

Adding @functools.cache before your slow deterministic function means that the first time it’s called, the result is stored by Python. Any subsequent times the function is called with the same input arguments, that stored result is returned instantly!

In the example below, load_webpage is our slow function.

import functools
import requests


@functools.cache
def load_webpage(url):
    return requests.get(url).text

The first time the function is run, it takes over 50ms to query the server and download the webpage.

(Note the %%time notebook syntax here, which prints out the time taken to run a cell).

%%time
load_webpage("https://example.com/")
CPU times: user 5.51 ms, sys: 2.69 ms, total: 8.2 ms
Wall time: 122 ms

But running the function again with the same input, the result takes a few μs. That’s about as close to instantaneous as you get with Python!

%%time
load_webpage("https://example.com/")
CPU times: user 2 μs, sys: 0 ns, total: 2 μs
Wall time: 4.05 μs

Running the function with a new input goes back to the “slow” timing of 50ms+. Results are only cached for the same inputs.

%%time
load_webpage("https://example.com/about.html")
CPU times: user 6.02 ms, sys: 1.49 ms, total: 7.52 ms
Wall time: 805 ms

The caching decorator is great wherever you have a function that is slow, and will be called repeatedly with the same input.

Common applications include URL queries, API requests, SQL queries, and data loading.
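To see the effect without a network connection, here’s a minimal sketch with a made-up slow_square function, where time.sleep stands in for the slow work:

```python
import functools
import time


@functools.cache
def slow_square(x):
    """Pretend-expensive but deterministic calculation."""
    time.sleep(0.5)  # stands in for a slow query or computation
    return x * x


start = time.perf_counter()
first_result = slow_square(12)   # first call runs the slow code
first_duration = time.perf_counter() - start

start = time.perf_counter()
second_result = slow_square(12)  # same input: returns the cached result
second_duration = time.perf_counter() - start

print(first_result, second_result)
print(second_duration < first_duration)
```

The second call skips the sleep entirely because functools.cache stored the result from the first call.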

Type annotations#

Modern versions of Python (like the one we’re using in our conda environment!) let you document and restrict the types of your input arguments and output using a special syntax.

For the extract_lat_lon function above that would look like this:

def extract_lat_lon(latlon: str) -> tuple[float, float]:
    ...
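Filling in the body from the earlier definition, the annotated version runs exactly the same way; the annotations are documentation only and aren’t enforced at runtime:

```python
def extract_lat_lon(latlon: str) -> tuple[float, float]:
    """Extract lat,lon from a string like '37.364,-122.010'"""
    parts = latlon.split(",")
    lat = float(parts[0])
    lon = float(parts[1])
    return (lat, lon)


print(extract_lat_lon("37.364,-122.010"))
```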

These type annotations are slowly being adopted by many new software projects. But their use isn’t widespread in scientific computing.

We chose not to use type annotations for this course as many of the core scientific Python packages we’ll be using don’t support them (yet!). But you may see them when viewing Python code in the future.

When to use functions?#

There’s a balance to strike here with functions. Not enough functions makes for code that’s hard to navigate and prone to bugs. But wrapping every little statement in a function slows development and adds counter-productive complexity.

Here are some suggestions for when it’s time to put code in a function:

  • To avoid repetition.

    • If you’ve already written some code to read data from Snowflake, there’s no need to rewrite it every time your code needs Snowflake data. Be kind to yourself: make a read_from_snowflake function!

    • There’s no need to force it though. Sometimes it makes sense to copy and modify code, rather than trying to write a single function that handles two slightly different situations.

  • When consistency is needed.

    • There are two slightly different definitions of the acre-foot unit (the US survey acre-foot converts at about 1,233.489 m³, versus 1,233.482 m³ for the standard one). By using our acre_feet_to_m3 function instead of multiplying by a literal factor throughout the code, we ensure that every part of the program uses a consistent definition of the unit.

    • The Mars Climate Orbiter failed because NASA’s calculations were in metric units while Lockheed Martin used US units. If the manufacturers had used a shared library of unit conversion functions, perhaps the orbiter would have made it to orbit!

  • To structure a program.

    • You can use functions to split up your program so that it reads almost like English.

    • A good guideline: all lines of a function should fit on a computer screen.

    • For example, the top-level of this flow rate forecasting program is split into 7 high-level functions:

      def forecast_flow_rate(streamgage_id, forecast_date):
          """Single day ML forecast for flow at a streamgage location."""
          # Inputs.
          historic_flow = load_historic_flow(streamgage_id)
          watershed = load_watershed(streamgage_id)
          dem = load_dem(watershed.bounds)
          precip = load_precipitation_forecast(watershed.bounds, forecast_date)

          # Run model.
          forecast_result = run_lstm_flow_simulation(historic_flow, dem, precip, forecast_date)
          validate_single_gage_forecast(forecast_result)

          # Save result.
          save_result_to_snowflake(forecast_result)


      Skimming just that function gives the reader a good overview of how the program works, and offers a clear directory of where to look to resolve an Invalid Forecast Result error, for example.

Library functions#

Python comes with a large standard (builtin) library of functions for all sorts of different things.

In Python, functions are grouped together for distribution into modules, which are accessed using import. For example, to use the mode function that’s in the builtin statistics module:

import statistics

states = ["CA", "NV", "NV", "AZ"]
statistics.mode(states)
'NV'

We’ll cover modules in more detail towards the end of this unit!

Errors in functions#

By now you’ve probably seen what happens in Python when you trigger an error!

42 / 0
ZeroDivisionError                         Traceback (most recent call last)
Cell In[71], line 1
----> 1 42 / 0

ZeroDivisionError: division by zero

In this simple example Python is telling you a few pieces of information

  • The error name: ZeroDivisionError. This is the bit you should use for googling more information about the error!

  • A plain-English description of the error: division by zero.

  • The line of code that triggered the error: 42 / 0.

  • You also get a line number: line 1. In notebooks this is less helpful, but for Python files this can help you quickly find the problematic line (you also get the filename).

Tracebacks#

Often in Python our code gets deeply nested: we have a function, that calls another function in a different module, which then calls a different function. In the example below we have build_bounds which calls print_bounds_details which calls calculate_aspect_ratio:

def calculate_aspect_ratio(width, height):
    return height / width


def print_bounds_details(bounds):
    print(f"{bounds=}")

    width = bounds[2] - bounds[0]
    height = bounds[3] - bounds[1]
    print(f"{width=} {height=}")

    aspect_ratio = calculate_aspect_ratio(width, height)
    print(f"{aspect_ratio=}")


def build_bounds(left, bottom, right, top):
    bounds = (left, bottom, right, top)
    print_bounds_details(bounds)
    return bounds

Look what happens when we try to build a bounding box with zero width:

build_bounds(1, 100, 1, 101)
ZeroDivisionError                         Traceback (most recent call last)
Cell In[75], line 20
     16     print_bounds_details(bounds)
     17     return bounds
---> 20 build_bounds(1, 100, 1, 101)

Cell In[75], line 16, in build_bounds(left, bottom, right, top)
     14 def build_bounds(left, bottom, right, top):
     15     bounds = (left, bottom, right, top)
---> 16     print_bounds_details(bounds)
     17     return bounds

Cell In[75], line 11, in print_bounds_details(bounds)
      8 height = bounds[3] - bounds[1]
      9 print(f"{width=} {height=}")
---> 11 aspect_ratio = calculate_aspect_ratio(width, height)
     12 print(f"{aspect_ratio=}")

Cell In[75], line 2, in calculate_aspect_ratio(width, height)
      1 def calculate_aspect_ratio(width, height):
----> 2     return height / width

ZeroDivisionError: division by zero

Now Python is giving us a Traceback: it shows the line that triggered the error at each level of nesting.

This is helpful because often the line that caused Python to fail in the code isn’t the one that is problematic at the conceptual level.

In the example above, Python crashed because it was trying to divide by zero when calculating the aspect ratio of the bounds. But we can’t fix the mathematics of that line: the proper fix might be at the build_bounds level of nesting to detect and reject zero-width bounding boxes before trying to print_bounds_details.


One last note on errors: embrace errors during development! Except in a few circumstances (writing to a database, deleting files), crashing your code has zero negative consequences, and gives you direct valuable feedback via the error message. Instead of wondering “will this code work?” it’s often faster just to run it and see!

As we move into more complex coding, we’ll rely more heavily on errors and tracebacks.

Triggering errors#

You can trigger an error using the raise keyword with an error type (there are many to choose from, ValueError is the most common) and an error message string:

def build_bounds(left, bottom, right, top):
    if left == right or bottom == top:
        raise ValueError("Zero-area bounds detected!")
    bounds = (left, bottom, right, top)
    print_bounds_details(bounds)
    return bounds

Why would you want to crash your code?! Well now when we try to create a bad bounding box, our code fails early (avoiding running code that’s doomed to fail), and we get a clear error message about what went wrong, rather than a long traceback triggered in a utility function:

build_bounds(1, 100, 1, 101)
ValueError                                Traceback (most recent call last)
Cell In[76], line 29
     24     print_bounds_details(bounds)
     25     return bounds
---> 29 build_bounds(1, 100, 1, 101)

Cell In[76], line 22, in build_bounds(left, bottom, right, top)
     20 def build_bounds(left, bottom, right, top):
     21     if left == right or bottom == top:
---> 22         raise ValueError("Zero-area bounds detected!")
     23     bounds = (left, bottom, right, top)
     24     print_bounds_details(bounds)

ValueError: Zero-area bounds detected!

In general, we want our code to fail as early as possible, with the best description possible.

Catching errors#

By default, when Python encounters an error, it immediately quits without running any more code.

When we want Python to continue after an error in some way, we can use the try and except keywords. When any code under try raises an error, the code under except is run instead.

For example, we could catch whatever error the string parsing raises and print something more useful:

def process_latlon(x):
    """Parse a lat,lon string."""
    try:
        parts = x.split(",")
        lat = float(parts[0])
        lon = float(parts[1])
        return lat, lon
    except Exception:
        print("Unable to parse coordinates, they should be in 'lat,lon' format.")
        return None
process_latlon("37.366, -122.027")
(37.366, -122.027)
process_latlon("45°45’32.4″N 009°23’39.9″E")
Unable to parse coordinates, they should be in 'lat,lon' format.

Another common situation is where we still want Python to crash on an error, but do something before crashing. Common examples include retrying what just failed, cleaning up any work we did do, or logging any information that might help us debug the error.

In the example below we save our data after a crash, which might help with debugging later. When we use except Exception as e, Python stores the error into the variable e which we can raise once we’re done cleaning up.

df = load_data_frame()

try:
    x, y = prepare_ml_features(df)
    model = RandomForestClassifier(max_depth=2)
    model.fit(x, y)
except Exception as e:
    df.to_csv("./df-that-caused-training-to-fail.csv")
    raise e
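Related: when cleanup should run whether or not an error occurred (closing files, connections), Python’s finally keyword is the standard tool. A minimal sketch, with a made-up parse_first_line function and a throwaway file:

```python
import tempfile


def parse_first_line(path):
    """Read and parse the first line of a file, always closing it."""
    f = open(path)
    try:
        return int(f.readline())
    finally:
        f.close()  # runs on success AND on error, before any exception propagates


# Demonstrate with a temporary file containing one number.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("42\n")

value = parse_first_line(tmp.name)
print(value)
```

Unlike except, a finally block doesn’t swallow the error: Python still crashes afterwards, but only once the cleanup has run.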

Lastly, we can treat different error types differently by providing multiple except blocks.

def load_demand(basin_id):
    """Query snowflake for the mean demand."""
    try:
        df = pd.read_sql_table("basin_demand", snowflake_connection)
        df_basin = df[df.basin_id == basin_id]
        demand = df_basin.demand.mean()
        return demand
    except snowflake.ForbiddenError as e:
        print("Snowflake permission error, check with IT that you have access to 'basin_demand'")
        raise e
    except AttributeError:
        # If the basin table has no demand column, then the demand is zero.
        return 0

You don’t need to know all the ways your code might fail before using try/except: code like this is typically built up over time as you encounter different errors.

Assertions#

The assert keyword is a quick way to enforce and document assumptions.

This check of simulation data

assert simulation_year <= 2027, "We don't have forecast data after 2027."

is the same as writing

if not (simulation_year <= 2027):
    raise AssertionError("We don't have forecast data after 2027.")

When an invalid assertion is encountered, the error message is displayed and Python crashes

simulation_year = 2030

assert simulation_year <= 2027, "We don't have forecast data after 2027."
AssertionError                            Traceback (most recent call last)
Cell In[17], line 3
      1 simulation_year = 2030
----> 3 assert simulation_year <= 2027, "We don't have forecast data after 2027."

AssertionError: We don't have forecast data after 2027.

A common application of this is to perform assertions after loading your data, based on how you expect the data to look, and any assumptions that your model makes about the data.

# Load demand data.
df = pd.read_csv("RUSSIAN_RIVER_DATABASE_2022.csv")
assert len(df) > 0, "Empty csv found"
assert np.all(np.isfinite(df.LATITUDE)), "Missing latitudes"

These assertions give a number of benefits

  • Invalid data gives an immediate failure with a clear message, rather than trying to debug a “Coordinate Transform Error” later in the code.

  • As a reader, I know that there are no missing latitudes. So if I use this dataframe, I don’t need to handle missing data. And if I add code that modifies the latitudes, it’s important that I don’t introduce missing data.

  • Asserts can turn sneaky bugs into clear errors. For example, code that divides by len(df) might introduce invalid NaN values into your results without raising an error. Better to fail early and clearly.

Common validations include

  • Column data formats (make sure your numbers are floats 7.5 rather than strings "7.5")

  • Any columns that shouldn’t contain NaN or None values

  • Min/max values

  • The format of strings
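The checks above can also be sketched in plain Python, without pandas; the record format here is made up for illustration:

```python
# Hypothetical records loaded from a CSV: one dict per row.
records = [
    {"site": "11467000", "latitude": 38.51, "flow_cfs": 152.0},
    {"site": "11463000", "latitude": 38.98, "flow_cfs": 87.5},
]

assert len(records) > 0, "No records loaded"
for r in records:
    assert isinstance(r["flow_cfs"], float), "Flow should be a float, not a string"
    assert r["latitude"] is not None, "Missing latitude"
    assert -90 <= r["latitude"] <= 90, "Latitude outside valid range"
    assert r["site"].isdigit(), "Site IDs should be numeric strings"
print("All validations passed")
```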

Again there’s a tradeoff here: assertions take time to write. Times to consider using asserts include

  • Interfacing with external systems that could change in the future.

  • Situations where you know a certain input will result in particularly hard-to-debug results.

  • Situations where you’ve had to fix bugs before!

Exercise: error handling#

Here’s a function to extract the ZIP code from a well-formed address.

def extract_zip_code(address):
    """Extract the trailing ZIP code from a space-delimited address."""
    parts = address.split(" ")
    zip_code = parts[-1]
    return zip_code



extract_zip_code("123 Demo St, CA 90210")
'90210'

Add error handling to this function so that when an invalid address is used, it displays a helpful message and/or returns a value.

Some test cases:

extract_zip_code("123 Demo St, CA")
extract_zip_code("")
extract_zip_code(None)

Hints:

  • You could use

    • An if statement to check for validity, and return a special value before any string processing is done.

    • Assert statements that fail with a helpful message.

    • A try/except block that prints a helpful message or raises a more specific error.

Classes#

The next step on our organizational journey is classes and objects.

Just like functions group related statements together, classes group related variables and functions.

And just like how organizing your code into functions is optional, so is defining your own classes. The programming technique where most things are organised into classes is called Object Oriented Programming (OOP), but you don’t have to go full OOP to selectively enjoy many of the benefits of classes!

Defining classes#

Classes are defined with the class keyword. Here’s a simple class representing a point on a map.

class Point:
    """A lat/lon point."""

We can create an instance of the class using the () syntax. An instance of a class is also called an object in Python.

point_half_dome = Point()

Object attributes#

Ok so what do we do with classes and instances?

Classes group together variables and functions. Let’s start with the variables: you can create variables known as attributes on your instances using the . syntax. Let’s give our point a latitude and a longitude.

point_half_dome.lat = 37.745
point_half_dome.lon = -119.533

We can use these object attributes as variables like any other:

print(point_half_dome.lat)
print(round(point_half_dome.lon))
37.745
-120

The __init__ method#

Despite the example above, it’s best not to set attributes directly. It’s too easy to introduce inconsistency: you might have point.lat in one place and point.latitude in another, resulting in chaos.

Instead, we let the class define its own attribute names using a function. The functions of a class are called methods.

Python gives us a few special method names we can use that have special functionality. The __init__ method is used to initialize (create) an instance of the class.

Let’s rewrite our Point class:

class Point:
    """A lat/lon point."""

    def __init__(self, lat, lon):
        self.lat = lat

        # Wrap longitude to range (-180, 180].
        while lon > 180:
            lon = lon - 360
        while lon <= -180:
            lon = lon + 360
        self.lon = lon


point_half_dome = Point(lat=37.745, lon=-119.533)
print(point_half_dome.lat)
37.745

The initializer now stores the first argument as an attribute called lat: no room for latitude ambiguity! The initializer also takes the opportunity to ensure that the longitude is between -180 and 180: by putting this check in a single place, we can be sure of the longitude convention used by all Point objects.

All class methods get the instance passed automatically as the first argument, which is called self by convention.
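A tiny illustration with a made-up Greeter class: calling a method on an instance is equivalent to calling the class’s function with the instance as the first argument.

```python
class Greeter:
    """Minimal class to show how self is passed."""

    def __init__(self, name):
        self.name = name

    def greet(self):
        return "Hi {}!".format(self.name)


g = Greeter("Ada")
print(g.greet())         # Python passes g as self automatically
print(Greeter.greet(g))  # the explicit equivalent
```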

Custom methods#

Pretty much all of your classes are going to start like this

class SomeClass:
    """What SomeClass does."""

    def __init__(self, arg1, arg2, arg3):
        self.arg1 = arg1
        self.arg2 = arg2
        ...

But as you expand the functionality of each class, you’ll add more attributes and methods.

A big feature of methods is that they can modify the object. Let’s add a move method to our Point class. Remember that self is always inserted as the first argument.

class Point:
    """A lon/lon point."""

    def __init__(self, lat, lon):
        self.lat = lat

        # Wrap longitude to range (-180, 180].
        while lon > 180:
            lon = lon - 360
        while lon <= -180:
            lon = lon + 360
        self.lon = lon

    def move(self, delta_lat=0, delta_lon=0):
        """Shifts the point."""
        self.lat = self.lat + delta_lat
        self.lon = self.lon + delta_lon


point_half_dome = Point(lat=37.745, lon=-119.533)
print(point_half_dome.lat)
37.745

Like data attributes, we also access our method functions using the dot . syntax:

point_half_dome.move(delta_lat=0.1)
print(point_half_dome.lat)
37.845

Using the move method on the object has permanently modified the instance’s lat attribute.

The __repr__ method#

The __repr__ method (like all methods beginning with double underscore) is another special method. It lets you control the representation when you print your instance.

Python classes have a default __repr__ method that isn’t very user friendly:

print(point_half_dome)
<__main__.Point object at 0x104feba10>

Adding our own method can help with debugging and logging:

class Point:
    """A lon/lon point."""

    def __init__(self, lat, lon):
        self.lat = lat

        # Wrap longitude to range (-180, 180].
        while lon > 180:
            lon = lon - 360
        while lon <= -180:
            lon = lon + 360
        self.lon = lon

    def move(self, delta_lat=0, delta_lon=0):
        """Shifts the point."""
        self.lat = self.lat + delta_lat
        self.lon = self.lon + delta_lon

    def __repr__(self):
        return f"{self.lat}, {self.lon}"


point_half_dome = Point(lat=37.745, lon=-119.533)
print(point_half_dome)
37.745, -119.533

Inheritance#

Inheritance is used to specialise existing classes.

With inheritance, you take an existing class and add new attributes and methods, or modify existing ones.

Let’s make a new class that represents a streamgage. We can add a new attribute by overriding the __init__ method:

class Streamgage(Point):
    """A USGS streamgage."""

    def __init__(self, lat, lon, usgs_id):
        super().__init__(lat, lon)
        self.usgs_id = usgs_id

OOP sometimes uses a family-tree metaphor to describe the relationship between classes

  • Streamgage is a subclass aka child of Point

  • Point is the superclass aka parent of Streamgage

The line super().__init__(lat, lon) calls the parent’s __init__ method. It’s a bit like writing self = Point(lat, lon).

And we can also add a brand new method that wouldn’t make sense on the generic parent Point class.

class Streamgage(Point):
    """A USGS streamgage."""

    def __init__(self, lat, lon, usgs_id):
        super().__init__(lat, lon)
        self.usgs_id = usgs_id

    def current_flowrate(self):
        """The curent flowrate at the gage, in ft³s⁻¹."""
        return query_usgs_api(site=self.usgs_id, field="streamflow")

Because Streamgage inherits from Point, and because we called Point’s __init__ method inside of our own, we get to keep all the helpful functionality of Point like the longitude wrapping and moving:

gage = Streamgage(lat=35.8494, lon=243.7692, usgs_id="10251300")
gage.move(delta_lat=0.00025)
gage
35.849650000000004, -116.23079999999999

You can see we’ve also inherited the __repr__ method that only displays the coordinates. As an exercise, can you copy the Point and Streamgage classes, then add a __repr__ method to Streamgage that also displays the id?

Dataclasses#

Classes are helpful but as you can see, object oriented programming gets complicated fast!

If you just want a simple structure to hold a few variables together, the builtin dataclasses module automates away some of the annoying boilerplate. You create one using the dataclasses.dataclass decorator, and you have to specify the type of each attribute:

import dataclasses


@dataclasses.dataclass
class SimulationConfig:
    results_path: str
    num_iterations: int
    num_trials: int = 1

    def total_num_steps(self):
        return self.num_iterations * self.num_trials


SimulationConfig("/tmp/results.csv", num_iterations=5)
SimulationConfig(results_path='/tmp/results.csv', num_iterations=5, num_trials=1)

See how we didn’t need to define an __init__ method, and __repr__ was also handled for us!

In data science, consider starting out with dataclasses, and upgrade them to normal classes when you need control over __init__ or complex inheritance.

When to use OOP in data science#

OOP is great for simply representing a complex data structure.

Data science, in contrast, often deals with simple data structures (data tables) that undergo complex transformations.

As a result, functional programming tends to be the dominant approach in data science.

That said, there’s lots of places where OOP can help your data science development:

  • Productionizing data science analysis as an external tool or service.

  • Representing configuration and model inputs.

  • Data that’s too complex to be a row in a data table: rasters, experiment results, other domain-specific data structures.

And finally, some of the biggest tools in data science are very object oriented (e.g., pandas, scikit-learn). So even if you’re not writing custom classes, understanding classes will help.

Modules#

A module is just a Python file, typically grouping together related functions and classes.

We might put all our unit conversion functions in one file. This keeps them in one place so they’re easier to find, reduces clutter and distraction in our main code, and means these helpful utilities can be reused by multiple parts of our program.

# conversion.py
ONE_DAY_IN_SECONDS = 86_400

def acre_feet_to_m3(volume_acre_feet):
    """Converts volume in US acre-feet to SI m³."""
    volume_m3 = volume_acre_feet * 1_233.482
    return volume_m3

def cms_to_cfs(flow_cms):
    """Converts flow rate in cubic feet per second to cubic metres per second."""
    flow_cfs = flow_cms * 35.3146662
    return flow_cfs

The import statement loads and executes the module, so we can use the functions it defines

import conversion

conversion.cms_to_cfs(10)
353.146662

Import as#

By default, a module is imported as the name of the file (conversion), but you can rename the module using the as keyword:

import conversion as units

units.ONE_DAY_IN_SECONDS
86400

Really the only time you want to do this is for external modules and packages that are frequently used. It’s standard in Python data science to import these modules with short names, typing out matplotlib.pyplot in full would get annoying fast!

import numpy as np  # Array math.
import pandas as pd  # Data tables.
import matplotlib.pyplot as plt  # Plotting.

Import from#

You can import individual functions, variables, and classes out of a module.

from conversion import acre_feet_to_m3

acre_feet_to_m3(1e-3)
1.233482

This makes code shorter: you don’t have to type out the whole module name each time you call the function.

The downside is that it’s harder to keep track of where acre_feet_to_m3 came from, so use sparingly.

Module execution#

When a module is imported, all the code in the module is executed.

Avoid doing things that are slow (like reading data) or that have side-effects (like writing data) in the top level of a module:

# badmodule.py

rain_data = load_massive_table_from_snowflake()
rain_data.save(overwrite=True)

Just calling import badmodule will now cause your program/notebook to hang for 30 seconds, then overwrite some of your data!

Instead, stick to defining functions, classes and simple variables in your modules. Let the caller decide when to trigger the functionality.

# bettermodule.py

# Quick, simple calculations are fine.
ONE_DAY_IN_SECONDS = 60 * 60 * 24


# Everything else, wrap in a function
#
# Remember, the code inside functions only runs when called (not when imported/defined!)
def refresh_rain_data():
    """Load the latest rainfall and save it locally."""
    rain_data = load_massive_table_from_snowflake()
    rain_data.save(overwrite=True)

Modules and notebooks#

For the most part, using import in a notebook is not different from inside a Python file.

One potential catch is that an imported module loads only once. Repeated imports just return the previously-imported module.

That means if we add a new function to conversion.py and import it again (either with a new import statement, or by rerunning the import cell), the new function still won’t be available.

We can change this by adding the following to the first cell of our notebook:

%load_ext autoreload
%autoreload 2

Now, the autoreload extension will monitor all our modules, and if one changes, reload it for us in the background.
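If you’d rather reload a module once, explicitly, the standard library’s importlib.reload re-executes it on demand (shown here with the builtin json module standing in for conversion):

```python
import importlib
import json  # stands in for your own module, e.g. conversion

json = importlib.reload(json)  # re-runs the module's code and rebinds the name
print(json.dumps({"ok": True}))
```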

Finally, you can’t import a notebook file as a module, and you generally wouldn’t want to anyway, as executing the full notebook on import would be slow!

Packages#

A package is a collection of modules. It’s the top of the Python organizational food chain: packages can also contain subpackages.

We’re going to be relying on external packages extensively for the remainder of the course. They’re used much like modules but with more dots representing the extra level of organization.

Check out this import statement:

from numpy.random import uniform

Here

  • numpy is the package name (providing extended mathematical tools)

  • random is a module inside the numpy package (that groups together random distribution samplers)

  • uniform is a function inside the random module (which samples randomly from the uniform distribution)
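The standard library has the same shape, so we can check our understanding without installing anything: urllib is a package, parse is a module inside it, and urlparse is a function inside that module (the URL below is just an example):

```python
from urllib.parse import urlparse

result = urlparse("https://waterdata.usgs.gov/monitoring-location/10251300")
print(result.netloc)  # the server part of the URL
print(result.path)    # the path part of the URL
```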

You can also make your own packages for distributing code. For more details, this tutorial has a great overview of Python packaging.

VSCode#

So far, we’ve mostly used Jupyter notebook as our text editor and development environment. Jupyter is great for working with notebooks, but VSCode can also work with notebooks and has many more tools for dealing with non-notebook files.

Try both and see which you prefer!

Exercise: VSCode#

Going through the items below should give you a good sense of what VSCode can do, and how that compares with Jupyter.

  • Open VSCode

  • Open the py4wrds folder

  • Browse the list of files in the folder

  • Open an .ipynb file (or create a new one!)

  • Select the py4wrds conda environment

  • Add some commands and run them. Can you see the outputs in VSCode?

  • Open a CSV file in VSCode.

  • Open an image in VSCode

  • Run the black code formatter in VSCode.