5. Geospatial Python#

Because of the wide array (pun intended) of open-source data science packages available for it, Python has emerged as a flagship language for processing, analyzing, and visualizing geospatial data. There are several key geospatial packages available in Python:

  • Geopandas - processing vector data (e.g. shapefiles, .geojson, .kml; see here for a full list of geospatial vector filetypes)

  • Rasterio - processing raster data (e.g. satellite imagery in .tif or .tiff format)

  • Xarray - multidimensional array analysis (e.g. time series gridded or satellite datasets in .nc or .hdf format)

  • GDAL - the key library (originally written in C and C++) that underpins most of the raster and vector processing operations in the above libraries, as well as GUI-based GIS software (ArcMap, QGIS, Google Earth/Maps, etc.)

We will begin with (1) an (abridged) overview of map projections and coordinate systems, and (2) a brief discussion of the two fundamental types of geospatial data to motivate the topic.

Coordinate Systems and Projections#

A coordinate system is a framework used to determine positions on the surface of the earth. Together with map projections, coordinate systems are practical solutions to the challenge of depicting a 3D object (the earth) on a 2D surface (a map or screen).

An example of a coordinate system for a 2-D plane is the Cartesian coordinate system (i.e. X and Y).

You’ve probably seen the Geographic Coordinate System (GCS), the three-dimensional coordinate system commonly used to define locations on the earth’s surface.

  • The unit of measure in the GCS is degrees,

  • locations are defined by latitude and longitude

A Map Projection is how we transform geographic data from a three-dimensional, spherical coordinate system to a two-dimensional planar system. Map projections introduce distortions in distance, direction, and area. They are useful for converting data with units of degrees (in a GCS) into more practical length units such as meters or feet (see the short pyproj sketch below). There are several types of projected coordinate systems and projection types; the most common are:

  • Planar

  • Conical

  • Cylindrical

(figure: planar, conical, and cylindrical projection types)
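
To make the degrees-to-meters idea concrete, here is a minimal sketch using the pyproj library (a geopandas dependency); the point coordinates are made up for illustration:

import pyproj

# Build a transformer from geographic WGS 84 (degrees) to UTM zone 10N (meters)
transformer = pyproj.Transformer.from_crs("EPSG:4326", "EPSG:32610", always_xy=True)

# A made-up point in northern California: longitude, latitude in degrees
easting, northing = transformer.transform(-121.5, 38.5)
print(f"Easting: {easting:.1f} m, Northing: {northing:.1f} m")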

Vector and Raster data#

Geospatial data can be broadly categorized as either Vector or Raster:

Vector#

  • Vector data consist of points, lines, and polygons (see the short shapely sketch after this list)

  • Stored as tabular data files with geographic metadata

  • Lines are made up of points and have a distinct starting and ending point

  • Polygons are made up of points whose starting point equals the ending point (i.e. the ring is closed)

  • Can have a topology: an organizational scheme that e.g. dictates the order or direction of features
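
As referenced above, here is a minimal sketch of the three vector primitives using the shapely library (which geopandas uses under the hood); the coordinates are made up:

import shapely.geometry as sg

# A point is a single coordinate pair
pt = sg.Point(-120.0, 35.0)

# A line is an ordered sequence of points with a distinct start and end
line = sg.LineString([(-120.0, 35.0), (-119.5, 35.2), (-119.0, 35.1)])

# A polygon is a closed ring: the first and last points coincide
poly = sg.Polygon([(-120.0, 35.0), (-119.5, 35.0), (-119.5, 35.5), (-120.0, 35.0)])

print(pt.geom_type, line.geom_type, poly.geom_type)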

Raster#

  • Raster data consist of grids of numbers - the term comes from the Latin rastrum ('rake'), a nod to the row-by-row pattern in which screens are scanned

  • Can contain multiple channels / bands - a true color image has Red, Green, and Blue channels (see the numpy sketch after this list)

  • The Spatial Resolution of a raster image refers to the size of a pixel (e.g. 1mm, 1cm, or 1m)

  • Satellite / drone / camera data is natively raster
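
As referenced above, a tiny numpy sketch of the raster idea - grids of numbers stacked into bands (the values are made up):

import numpy as np

# A single-band raster: a 3x4 grid of values (one number per pixel)
band = np.array([[10, 12, 13, 11],
                 [14, 15, 16, 12],
                 [13, 14, 15, 13]])

# A 3-band raster (e.g. red, green, blue) stacks grids along a band axis
rgb = np.stack([band, band + 1, band + 2])
print(band.shape, rgb.shape)  # (3, 4) and (3, 3, 4): (bands, rows, cols)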

The image below shows vector data on the left and raster data on the right. (figure: vector data vs. raster data)

Now that we have some foundational geospatial knowledge under our belt, let's cover some of the Python tools we can use to process, analyze, and visualize geospatial data:

Geopandas#

Remember pandas from the last module? Enter geopandas - a natural geospatial extension that stores tabular vector data (think points, lines, and polygons) and simplifies spatial operations while keeping pandas-like syntax.

Let’s load the same stations.csv file and convert it from a pandas.DataFrame to a geopandas.GeoDataFrame:

import pandas as pd
url = "https://raw.githubusercontent.com/py4wrds/py4wrds/refs/heads/module-4/data/gwl/stations.csv"
df = pd.read_csv(url)
df.head()
STN_ID SITE_CODE SWN WELL_NAME LATITUDE LONGITUDE WLM_METHOD WLM_ACC BASIN_CODE BASIN_NAME COUNTY_NAME WELL_DEPTH WELL_USE WELL_TYPE WCR_NO ZIP_CODE
0 51445 320000N1140000W001 NaN Bay Ridge 35.5604 -121.755 USGS quad Unknown NaN NaN Monterey NaN Residential Part of a nested/multi-completion well NaN 92154
1 25067 325450N1171061W001 19S02W05K003S NaN 32.5450 -117.106 Unknown Unknown 9-033 Coastal Plain Of San Diego San Diego NaN Unknown Unknown NaN 92154
2 25068 325450N1171061W002 19S02W05K004S NaN 32.5450 -117.106 Unknown Unknown 9-033 Coastal Plain Of San Diego San Diego NaN Unknown Unknown NaN 92154
3 39833 325450N1171061W003 19S02W05K005S NaN 32.5450 -117.106 Unknown Unknown 9-033 Coastal Plain Of San Diego San Diego NaN Unknown Unknown NaN 92154
4 25069 325450N1171061W004 19S02W05K006S NaN 32.5450 -117.106 Unknown Unknown 9-033 Coastal Plain Of San Diego San Diego NaN Unknown Unknown NaN 92154
# Create a GeoDataFrame from the DataFrame

import geopandas as gpd

gdf_stns = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["LONGITUDE"], df["LATITUDE"]),
    crs="EPSG:4326",  # Set the coordinate reference system (CRS) - more on this later
)

GeoDataFrame objects are a subclass of pandas.DataFrame, so you can use all of the same methods. In fact, a GeoDataFrame is just a pandas.DataFrame with a single extra column (geometry) and some extra features that make use of that column.
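
We can verify this by inspecting the object and its geometry column directly (a quick check, assuming the gdf_stns created above):

# Still a DataFrame under the hood, with one extra geometry column
print(type(gdf_stns))
print(gdf_stns["geometry"].head())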

plot() is a geopandas method that plots the geometry column (here, the station locations) of the GeoDataFrame:

# Plot our results:
gdf_stns.plot()
<Axes: >
_images/85a298841598037e7d79310f2696273ce01462dd0f485675d15000453c4ee06e.png

When printed, the tabular data is identical, and all pandas-like syntax can be used to sort and filter data:

# Get unique county names
print(gdf_stns["COUNTY_NAME"].unique())
# Plotting all stations in SLO county only
gdf_stns[gdf_stns["COUNTY_NAME"] == "San Luis Obispo"].plot()
['Monterey' 'San Diego' 'Imperial' 'Riverside' 'Orange' 'Kern'
 'Los Angeles' 'San Bernardino' 'Ventura' 'Santa Barbara' 'Alameda'
 'San Luis Obispo' 'Amador' 'Alpine' 'Tulare' 'Inyo' 'Fresno' 'Kings'
 'Merced' 'San Benito' 'San Joaquin' 'Glenn' 'Madera' 'Santa Cruz'
 'Santa Clara' 'Plumas' 'Stanislaus' 'San Mateo' 'Mono' 'Mariposa'
 'San Francisco' 'Contra Costa' 'Tuolumne' 'Marin' 'Calaveras' 'Sonoma'
 'Sacramento' 'Sutter' 'Solano' 'Napa' 'Placer' 'Yolo' 'Lake' 'Colusa'
 'El Dorado' 'Mendocino' 'Yuba' 'Butte' 'Nevada' 'Sierra' 'Lassen'
 'Tehama' 'Shasta' 'Humboldt' 'Modoc' 'Siskiyou' 'Del Norte' 'Klamath, OR']
<Axes: >
_images/f349a6527ab234661d63bfb18a4bc6347672dcd359a74375c51e4747ac6cb3dc.png

Reading and writing vector files#

Geopandas excels at making it easy to read and write vector files of several formats, which facilitates conversions and interoperability.

Simply import geopandas as gpd and then read any common vector file format with the gpd.read_file('filename') function, which works for formats such as .shp (with its supporting .dbf and .shx files), .kml, .kmz, .geojson, etc.

# Read a geojson file
url = "https://raw.githubusercontent.com/scottpham/california-counties/refs/heads/master/caCountiesTopo.json"
gdf_counties = gpd.read_file(url)
gdf_counties.head()
id name fullName areaLand geometry
0 06091 Sierra Sierra County 2468686345 POLYGON ((-120.14666 39.70738, -120.13527 39.7...
1 06067 Sacramento Sacramento County 2499176690 POLYGON ((-121.14147 38.7118, -121.14043 38.71...
2 06083 Santa Barbara Santa Barbara County 7083926262 MULTIPOLYGON (((-119.46758 34.06296, -119.4862...
3 06009 Calaveras Calaveras County 2641819811 POLYGON ((-120.07212 38.50985, -120.07212 38.5...
4 06111 Ventura Ventura County 4773381212 MULTIPOLYGON (((-119.63631 33.27314, -119.6363...
gdf_counties.plot()
<Axes: >
_images/7d6e09df646bdd4ce016efb6056097285f72fd9879c7c2dedce06b64f2cd901c.png

To write files, simply use gdf.to_file("path/to/out_file.geojson"). We recommend storing files as .geojson for their readability and the simplicity of storing and retrieving them.

gdf_counties.to_file("ca_counties.geojson")
/Users/andrew/opt/miniconda3/envs/py4wrds2/lib/python3.13/site-packages/pyogrio/geopandas.py:662: UserWarning: 'crs' was not provided.  The output dataset will not have projection information defined and may not be usable in other systems.
  write(

We received a warning that our CRS was not defined. We can eliminate this warning by defining the geographic WGS 84 coordinate reference system (CRS) through its EPSG code using gdf.set_crs('epsg:4326').

As with other pandas and DataFrame operations, we specify the inplace=True argument to modify the GeoDataFrame in place (otherwise a new object is returned):

gdf_counties.set_crs("epsg:4326", inplace=True)
gdf_counties.to_file("ca_counties.geojson")
# verify that file was written in our current directory:
import os

[x for x in os.listdir(os.getcwd()) if "geojson" in x]
['ca_counties.geojson']

We can inspect the coordinate reference system using the gdf.crs attribute:

# Geographic CRS example:
gdf_stns.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

Reprojections#

Often, we’ll want to change the coordinate system of some data. This can be achieved through reprojection.

UTM zones:#

Universal Transverse Mercator (UTM) is often an appropriate coordinate system for measuring distances, areas, etc

  • Units in meters, expressed in Easting / Northing

(figure: UTM coordinate system zones)

Let’s reproject the GeoDataFrame into a projected CRS that covers much of California - UTM Zone 11N, which is epsg:32611:

gdf_rpj = gdf_stns.to_crs("epsg:32611")
gdf_rpj.crs
<Projected CRS: EPSG:32611>
Name: WGS 84 / UTM zone 11N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: Between 120°W and 114°W, northern hemisphere between equator and 84°N, onshore and offshore. Canada - Alberta; British Columbia (BC); Northwest Territories (NWT); Nunavut. Mexico. United States (USA).
- bounds: (-120.0, 0.0, -114.0, 84.0)
Coordinate Operation:
- name: UTM zone 11N
- method: Transverse Mercator
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

The estimate_utm_crs() method is a convenient way to find an appropriate UTM CRS when going between geographic and projected coordinate systems:

gdf_stns.estimate_utm_crs()
<Projected CRS: EPSG:32611>
Name: WGS 84 / UTM zone 11N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: Between 120°W and 114°W, northern hemisphere between equator and 84°N, onshore and offshore. Canada - Alberta; British Columbia (BC); Northwest Territories (NWT); Nunavut. Mexico. United States (USA).
- bounds: (-120.0, 0.0, -114.0, 84.0)
Coordinate Operation:
- name: UTM zone 11N
- method: Transverse Mercator
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

Geometric Operations#

We can easily calculate geometry properties such as area, centroids, bounds, and distances using geopandas.

Recall that one of our datasets consists of points (gdf_stns), and the other consists of polygons (gdf_counties)

Calculating Areas#

# Calculate area for each geometry in a UTM CRS
# Note - the units depend on the coordinate system,
# so be sure to reproject appropriately and perform conversions.
# Point geometries have zero area; see the polygon example below.

gdf_rpj["area_sqkm"] = gdf_rpj.area * 1e-6  # Sq m to sq km
gdf_rpj.head()
STN_ID SITE_CODE SWN WELL_NAME LATITUDE LONGITUDE WLM_METHOD WLM_ACC BASIN_CODE BASIN_NAME COUNTY_NAME WELL_DEPTH WELL_USE WELL_TYPE WCR_NO ZIP_CODE geometry area_sqkm
0 51445 320000N1140000W001 NaN Bay Ridge 35.5604 -121.755 USGS quad Unknown NaN NaN Monterey NaN Residential Part of a nested/multi-completion well NaN 92154 POINT (68916.162 3945609.103) 0.0
1 25067 325450N1171061W001 19S02W05K003S NaN 32.5450 -117.106 Unknown Unknown 9-033 Coastal Plain Of San Diego San Diego NaN Unknown Unknown NaN 92154 POINT (490047.407 3600852.39) 0.0
2 25068 325450N1171061W002 19S02W05K004S NaN 32.5450 -117.106 Unknown Unknown 9-033 Coastal Plain Of San Diego San Diego NaN Unknown Unknown NaN 92154 POINT (490047.407 3600852.39) 0.0
3 39833 325450N1171061W003 19S02W05K005S NaN 32.5450 -117.106 Unknown Unknown 9-033 Coastal Plain Of San Diego San Diego NaN Unknown Unknown NaN 92154 POINT (490047.407 3600852.39) 0.0
4 25069 325450N1171061W004 19S02W05K006S NaN 32.5450 -117.106 Unknown Unknown 9-033 Coastal Plain Of San Diego San Diego NaN Unknown Unknown NaN 92154 POINT (490047.407 3600852.39) 0.0
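
Since point geometries have zero area, the syntax above is more instructive for polygons. A minimal sketch (assuming the gdf_counties loaded earlier) reprojects the county polygons to an estimated UTM CRS and computes their areas in square kilometers:

# Reproject the county polygons to a projected CRS, then compute areas
counties_utm = gdf_counties.to_crs(gdf_counties.estimate_utm_crs())
county_areas_sqkm = counties_utm.area * 1e-6  # sq m to sq km
print(county_areas_sqkm.head())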

Centroids#

The centroid is the geographic center point of a polygon (i.e. the mean of all the points that make up its surface)

Centroids can be computed with the gdf.centroid attribute:

print(gdf_counties.centroid.head())
gdf_counties.centroid.plot()
0      POINT (-120.5159 39.5804)
1    POINT (-121.34431 38.44933)
2    POINT (-120.03075 34.53832)
3    POINT (-120.55404 38.20466)
4    POINT (-119.12598 34.35742)
dtype: geometry
/var/folders/df/w347rlh96tx42h1kl69_4r8w0000gn/T/ipykernel_73601/3389763898.py:1: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  print(gdf_counties.centroid.head())
/var/folders/df/w347rlh96tx42h1kl69_4r8w0000gn/T/ipykernel_73601/3389763898.py:2: UserWarning: Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  gdf_counties.centroid.plot()
<Axes: >
_images/43dc471a06630e0699949835e84ad8c02935b31f30b5fe32ab249da1ff22c804.png

Note the warning above: computing centroids while the geometry is in a geographic CRS (degrees) can give slightly incorrect results, so reproject to a projected CRS first for precise work. The alpha keyword used in the plots below sets the transparency of the plotted geometries, with 0 being fully transparent and 1 being fully opaque.

Boundaries#

Similarly, boundaries can be obtained with the gdf.boundary attribute:

# Get the boundary of each polygon
print(gdf_counties.boundary.head())
gdf_counties.boundary.plot(alpha=0.3)
0    LINESTRING (-120.14666 39.70738, -120.13527 39...
1    LINESTRING (-121.14147 38.7118, -121.14043 38....
2    MULTILINESTRING ((-119.46758 34.06296, -119.48...
3    LINESTRING (-120.07212 38.50985, -120.07212 38...
4    MULTILINESTRING ((-119.63631 33.27314, -119.63...
dtype: geometry
<Axes: >
_images/d907ff55600eeb553bf32a62d8df2733676cc5d471215e1b7c7d8ac4c2bb4e9c.png

Distances#

Similarly, distances between two points can be computed with the distance function:

Let’s calculate the distance from the first station to each of the other stations in our dataset:

gdf_stns.plot(markersize=1, alpha=0.01)
<Axes: >
_images/ca6ac441ae1cca7165edbf7b087d3843fbed0f4e84004241b630013474e0b975.png
# Reproject to UTM zone 11N
gdf_stns.to_crs(gdf_stns.estimate_utm_crs(), inplace=True)

# calculate distance between the first station and each of the other stations
gdf_stns["dist_from_stn_1"] = gdf_stns.distance(gdf_stns.iloc[0].geometry)
# Plot stations, colored by distance
ax = gdf_stns.plot(column="dist_from_stn_1", legend=True)
# Plot the first station in red
gdf_stns.loc[[0], "geometry"].plot(markersize=50, color="red", ax=ax, label="0th station")
# add a legend and a title
ax.legend()
ax.set_title("dist (m) from 0th station")
Text(0.5, 1.0, 'dist (m) from 0th station')
_images/dad9c522502dc755071f5021b12ad2906522d895487de59f70f64f98dd4726d8.png

Plotting#

Just above we saw how to color a GeoDataFrame by a column, by passing the column='column_name' argument to gdf.plot().

We can create interactive plots with the gdf.explore() function:

gdf_slo = gdf_stns[gdf_stns["COUNTY_NAME"] == "San Luis Obispo"]

gdf_slo.explore(legend=False)
(interactive map output showing the SLO county stations)

Spatial Operations#

Geopandas makes it easy to perform basic spatial operations such as Buffers, Convex Hull, and Spatial Joins

Buffer#

# Apply a 1000m buffer to the first 5 points in the SLO well data
ax = gdf_slo[:5].buffer(1000).plot(alpha=0.3)
# Plot the same 5 but with a smaller (100m) buffer
gdf_slo[:5].buffer(100).plot(ax=ax, color="red")
<Axes: >
_images/b98fa684f09c50683af68b19142f792e263d306f0b2351758712843115f8b96f.png

Convex Hull#

A convex hull is the smallest convex polygon that encloses a set of points (or all the points of another geometry).

gdf_counties["convex_hull"] = gdf_counties.convex_hull
# Plot the convex hulls
ax = gdf_counties["convex_hull"].plot(alpha=0.5, color="lightblue", edgecolor="black")
ax.set_title("Convex Hull of CA Counties")
Text(0.5, 1.0, 'Convex Hull of CA Counties')
_images/239299d7078c37b5adfdfbb38cf1d707bb52537eb3794463b6ae62ee925eb934.png

Dissolving Overlapping Polygons#

Say we want to merge several overlapping polygons into a single polygon; we can achieve this with a dissolve operation:

# First we set the geometry column of the gdf to be the buffered polygons instead of the points
gdf_slo_subset = gdf_slo.iloc[:5, :]
gdf_slo_subset.loc[:, "geometry"] = gdf_slo_subset.buffer(1000)
gdf_slo_subset.plot()
<Axes: >
_images/bb3816be4ee52b74adfa0aa179f55e5ad5870e24a106f457532b74b19b096a00.png
gdf_slo_subset.head()
STN_ID SITE_CODE SWN WELL_NAME LATITUDE LONGITUDE WLM_METHOD WLM_ACC BASIN_CODE BASIN_NAME COUNTY_NAME WELL_DEPTH WELL_USE WELL_TYPE WCR_NO ZIP_CODE geometry dist_from_stn_1
14732 14515 349075N1195236W001 10N25W35F001S NaN 34.9075 -119.524 Unknown Unknown 3-013 Cuyama Valley San Luis Obispo NaN Unknown Unknown NaN 92154 POLYGON ((270395.988 3865693.001, 270391.173 3... 215821.092758
14834 32411 349130N1195350W001 10N25W27R001S NaN 34.9130 -119.535 Unknown Unknown 3-013 Cuyama Valley San Luis Obispo NaN Unknown Unknown NaN 92154 POLYGON ((269406.21 3866328.542, 269401.395 38... 214666.454544
14835 38827 349131N1195309W001 10N25W27R002S NaN 34.9131 -119.531 Unknown Unknown 3-013 Cuyama Valley San Luis Obispo NaN Unknown Unknown NaN 92154 POLYGON ((269772.007 3866330.38, 269767.192 38... 215005.755313
14924 14511 349178N1195134W001 10N25W25M003S NaN 34.9178 -119.513 Unknown Unknown 3-013 Cuyama Valley San Luis Obispo NaN Unknown Unknown NaN 92154 POLYGON ((271429.923 3866810.295, 271425.107 3... 216372.474901
14942 32409 349191N1195092W001 10N25W25M004S NaN 34.9191 -119.509 Unknown Unknown 3-013 Cuyama Valley San Luis Obispo NaN Unknown Unknown NaN 92154 POLYGON ((271799.033 3866945.332, 271794.218 3... 216667.216124
gdf_slo_subset.dissolve().plot()
<Axes: >
_images/28b880074ba98f733426374d0c4d2921c982705472c8be41367aab5e52e8f757.png
# Notice that there is only one row in the gdf after performing the dissolve
# values are inherited from the first row, but the geometry is the unary union of all geometries
gdf_slo_subset.dissolve().head()
geometry STN_ID SITE_CODE SWN WELL_NAME LATITUDE LONGITUDE WLM_METHOD WLM_ACC BASIN_CODE BASIN_NAME COUNTY_NAME WELL_DEPTH WELL_USE WELL_TYPE WCR_NO ZIP_CODE dist_from_stn_1
0 POLYGON ((270376.773 3865497.911, 270352.928 3... 14515 349075N1195236W001 10N25W35F001S None 34.9075 -119.524 Unknown Unknown 3-013 Cuyama Valley San Luis Obispo NaN Unknown Unknown None 92154 215821.092758

Spatial Joins, Queries, and Relations#

Intersections#

To find the spatial intersection between two geographic datasets, we can use gpd.overlay with how="intersection":

# Check which wells intersect with SLO's geometry
gdf_slo = gdf_counties[gdf_counties["name"] == "San Luis Obispo"].to_crs(gdf_counties.estimate_utm_crs())

gdf_slo.plot()
<Axes: >
_images/48c86beaa07e25a61234e3902c9154b4dba92c911ad8495ae3660cad11f6374b.png
slo_co_stns = gpd.overlay(gdf_stns, gdf_slo, how="intersection")
slo_co_stns.plot()
<Axes: >
_images/32ac27ae47128a2ebc55859cee769bb5c4857ce9b713462d49503f5f6d87b6a7.png
ax = gdf_slo.plot(color="white", edgecolor="red")
slo_co_stns.plot(alpha=0.5, markersize=5, color="gray", label="stations", ax=ax)
ax.legend()
<matplotlib.legend.Legend at 0x16c63cb90>
_images/f1d16ca26f5afa38742a823dd75e7fda201aec406e790338c8dc30002043a8e9.png
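
This section's title also mentions spatial joins. Although not shown above, geopandas provides gpd.sjoin for attaching attributes from one layer to another based on a spatial predicate. A minimal sketch (assuming the gdf_stns and gdf_counties objects from above, and a geopandas version recent enough to accept the predicate keyword) tags each station with the county polygon that contains it:

# Match the counties' CRS to the stations' CRS before joining
counties_match = gdf_counties[["name", "geometry"]].to_crs(gdf_stns.crs)

# For each station (left), find the county polygon (right) it falls within
stns_with_county = gpd.sjoin(gdf_stns, counties_match, how="left", predicate="within")
print(stns_with_county[["STN_ID", "COUNTY_NAME", "name"]].head())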

Exercise#

  1. Determine how many “Industrial”, “Irrigation”, and “Observation” Wells are in SLO County

  2. Create a basic map showing each category as a different color

Rasterio#

Rasterio is a powerful library for working with raster data (think: images or grids of numbers with pixels) built on top of the Geospatial Data Abstraction Library (GDAL) that makes it easy to perform a wide variety of raster processing and visualization tasks.

Reading and inspecting data#

We can open geospatial raster datasets (e.g. .tif and .tiff files) using the rio.open() function:

import rasterio as rio

# raster_path = "https://github.com/opengeos/datasets/releases/download/raster/dem_90m.tif"
raster_path = "./data/dem_90m.tif"
src = rio.open(raster_path)
print(src)
<open DatasetReader name='./data/dem_90m.tif' mode='r'>

Notice that we are reading a local file here; rio.open() works identically for remote URLs (such as the commented-out example above).

Once a dataset has been opened, we can inspect lots of useful information, including:

  • Projection information

  • Bounding box

  • Spatial resolution

  • Dimensions - width / height

  • Other metadata

# Inspect crs
src.crs
CRS.from_wkt('PROJCS["WGS 84 / Pseudo-Mercator",GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4326"]],PROJECTION["Mercator_1SP"],PARAMETER["central_meridian",0],PARAMETER["scale_factor",1],PARAMETER["false_easting",0],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],EXTENSION["PROJ4","+proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs"],AUTHORITY["EPSG","3857"]]')
# Get the EPSG code of the CRS
src.crs.to_epsg()
3857
# Inspect metadata
src.meta
{'driver': 'GTiff',
 'dtype': 'int16',
 'nodata': None,
 'width': 4269,
 'height': 3113,
 'count': 1,
 'crs': CRS.from_wkt('PROJCS["WGS 84 / Pseudo-Mercator",GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4326"]],PROJECTION["Mercator_1SP"],PARAMETER["central_meridian",0],PARAMETER["scale_factor",1],PARAMETER["false_easting",0],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],EXTENSION["PROJ4","+proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs"],AUTHORITY["EPSG","3857"]]'),
 'transform': Affine(90.0, 0.0, -13442488.3428,
        0.0, -89.99579177642138, 4668371.5775)}

Discussion: What do the dtype, count, and nodata values in the src.meta dict mean?

# Inspect spatial res
src.res
(90.0, 89.99579177642138)
# Inspect dimensions or width / height
print(src.shape, src.width, src.height)
(3113, 4269) 4269 3113
# Inspect bounding box
src.bounds
BoundingBox(left=-13442488.3428, bottom=4388214.6777, right=-13058278.3428, top=4668371.5775)

Plotting and visualizing#

We can use the read function to read raster bands from .tiff or .tif files as numpy arrays, and then make use of the imshow function from the trusty matplotlib module!

import numpy as np
import matplotlib.pyplot as plt

arr = src.read(1)
plt.imshow(arr)
plt.colorbar()
<matplotlib.colorbar.Colorbar at 0x1754dafd0>
_images/771d7b47ae39799a1e29248e765eb45916f419c215bfc3c0845515f32b0b487c.png

In the above code, we are reading the first band from the raster using src.read(1). For multiband rasters, we can read the other bands using the same syntax.

Note that this method does not preserve any projection information in the plot - the x and y axes are just the number of rows / cols in the dataset.
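
One way to attach approximate coordinates to a plain imshow plot is to pass the raster's bounds as matplotlib's extent argument - a sketch, assuming the src and arr objects from the cells above:

# Use the raster bounds (in the raster's CRS, here Web Mercator meters) as the plot extent
b = src.bounds
plt.imshow(arr, extent=(b.left, b.right, b.bottom, b.top))
plt.colorbar(label="Elevation")
plt.show()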

If we want to preserve any geospatial coordinate information, we can use the rasterio plotting module:

import rasterio.plot

rio.plot.show(src)
_images/21a30c77760dadfb6c49ff793744529d926c1311c43515a3237ecbed938fc150.png
<Axes: >

Exercise#

  • Set all values in the raster less than 500 equal to np.nan

  • Set all values greater than 1000 to np.nan

  • Calculate the mean of the remaining raster cells

  • Hint - inspect the data type of the array using the arr.dtype syntax, and convert as needed

Formatting Plots#

We can leverage matplotlib’s functionality to style our plots, including adding titles and choosing colormaps:

fig, ax = plt.subplots(figsize=(8, 8))
rasterio.plot.show(src, cmap="terrain", ax=ax, title="Digital Elevation Model (DEM)")
plt.show()
_images/b668a3460b72f0a9ecbb5a741767d87f2ab6802b8c63add838ee3e69fe152c72.png

Plotting vector and raster data#

Tuolumne_co = gdf_counties[
    gdf_counties["name"] == "Tuolumne"
].copy()  # We need to make a copy of this object to avoid a SettingWithCopyWarning
Tuolumne_co.to_crs(src.crs, inplace=True)
fig, ax = plt.subplots(figsize=(8, 8))
rasterio.plot.show(src, cmap="terrain", ax=ax, title="Digital Elevation Model (DEM) and Tuolumne co")
Tuolumne_co.plot(ax=ax, edgecolor="red", facecolor="none", lw=2)
<Axes: title={'center': 'Digital Elevation Model (DEM) and Tuolumne co'}>
_images/992c684586d23d92b529b609d72f24850c5399b7a1bef8fc24d351ce99e290f6.png

Clipping raster to vector data#

A common task involves clipping a raster to a vector dataset. We can use rasterio.mask to clip the DEM above to the boundary of Tuolumne county, defined in the ca_counties.geojson file:

import rasterio.mask

# Read raster
raster_path = "https://github.com/opengeos/datasets/releases/download/raster/dem_90m.tif"
raster_path = "./data/dem_90m.tif"

src = rio.open(raster_path)

# Read shape, isolate Tuolumne
shp_geom = gpd.read_file("ca_counties.geojson")
Tuolumne_co = shp_geom[shp_geom["name"] == "Tuolumne"].copy()  # We need to make a copy of this object to avoid a SettingWithCopyWarning

# Ensure CRS are consistent
Tuolumne_co.to_crs(src.crs, inplace=True)

# Perform clip
out_image, out_transform = rio.mask.mask(src, Tuolumne_co["geometry"], crop=True)
# Verify results:

rasterio.plot.show(out_image[0, :, :]);
_images/192079daafcf6f06c71509bfc4773f76e0f93fb9cac5356e114100d045f3da98.png

Notice that rasterio returns by default an array with 3 dimensions (out_image), for which we have selected the first band manually. This functionality exists to deal with multiband rasters.

Writing rasters#

Let’s save the raster we just clipped

# Set the output file
output_raster_path = "./data/clipped.tif"

# Extract some information from the image we processed
kwargs = src.meta
kwargs.update(height=out_image.shape[1], width=out_image.shape[2], count=out_image.shape[0], dtype=out_image.dtype, crs=src.crs, transform=out_transform)  # include the clipped transform so the output is georeferenced correctly

# Save
with rasterio.open(output_raster_path, "w", **kwargs) as dst:
    dst.write(out_image)

print(f"Raster data has been written to {output_raster_path}")
Raster data has been written to ./data/clipped.tif

Reprojecting Rasters#

We use the rasterio.warp module to change raster projections.

Let’s reproject our clipped raster from Web Mercator (EPSG:3857) to the geographic WGS 84 CRS (EPSG:4326) and save the reprojected raster to a new file.

from rasterio.warp import calculate_default_transform, reproject, Resampling

# Set output filename
raster_path = "./data/clipped.tif"
dst_crs = "EPSG:4326"  # WGS 84
output_reprojected_path = "./data/clipped_reprojected.tif"

with rasterio.open(raster_path) as src:
    transform, width, height = calculate_default_transform(src.crs, dst_crs, src.width, src.height, *src.bounds)

    profile = src.profile
    profile.update(crs=dst_crs, transform=transform, width=width, height=height)

    with rasterio.open(output_reprojected_path, "w", **profile) as dst:
        for i in range(1, src.count + 1):
            reproject(
                source=rasterio.band(src, i),
                destination=rasterio.band(dst, i),
                src_transform=src.transform,
                src_crs=src.crs,
                dst_transform=transform,
                dst_crs=dst_crs,
                resampling=Resampling.nearest,
            )
print(f"Reprojected raster saved at {output_reprojected_path}")
Reprojected raster saved at ./data/clipped_reprojected.tif

Multiband Rasters#

Satellite data is generally composed of a number of spectral bands. Each band measures the light reflected between two wavelengths (hence the term ‘band’). Rasterio allows us to work with these bands individually and in concert.

For example, NASA’s Landsat Satellites record the following bands:

Name     Wavelength        Description
SR_B1    0.435-0.451 μm    Band 1 (ultra blue, coastal aerosol) surface reflectance
SR_B2    0.452-0.512 μm    Band 2 (blue) surface reflectance
SR_B3    0.533-0.590 μm    Band 3 (green) surface reflectance
SR_B4    0.636-0.673 μm    Band 4 (red) surface reflectance
SR_B5    0.851-0.879 μm    Band 5 (near infrared) surface reflectance
SR_B6    1.566-1.651 μm    Band 6 (shortwave infrared 1) surface reflectance
SR_B7    2.107-2.294 μm    Band 7 (shortwave infrared 2) surface reflectance

Let’s plot the bands in this raster, one per subplot:

# Read multiband Landsat image:
raster_path = "https://github.com/opengeos/datasets/releases/download/raster/LC09_039035_20240708_90m.tif"
# raster_path = "./data/LC09_039035_20240708_90m.tif"
src = rasterio.open(raster_path)
print(src)

# Specify band names:
band_names = ["Coastal Aerosol", "Blue", "Green", "Red", "NIR", "SWIR1", "SWIR2"]
<open DatasetReader name='https://github.com/opengeos/datasets/releases/download/raster/LC09_039035_20240708_90m.tif' mode='r'>
# Plot raster:

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(8, 10))
axes = axes.flatten()  # Flatten the 2D array of axes to 1D for easy iteration

for band in range(1, src.count):
    data = src.read(band)
    ax = axes[band - 1]
    im = ax.imshow(data, cmap="gray", vmin=0, vmax=0.5)
    ax.set_title(f"Band {band_names[band - 1]}")
    ax.axis("off")
    fig.colorbar(im, ax=ax, label="Reflectance", shrink=0.5)

plt.tight_layout()
plt.show()
_images/93a4b48aed2f16185355f7893deec441c3d414704fa89afe76d17a856e8012be.png

True color image#

In order to visualize a true color image, we would combine band 2 (blue), band 3 (green), and band 4 (red) in the same image. The code below instead stacks the NIR, red, and green bands - a false-color composite commonly used to highlight vegetation:

nir_band = src.read(5)
red_band = src.read(4)
green_band = src.read(3)

# Stack the bands into a single array
rgb = np.dstack((nir_band, red_band, green_band)).clip(0, 1)

# Plot the stacked array
plt.figure(figsize=(6, 6))
plt.imshow(rgb)
plt.axis("off")
plt.title("Bands NIR, Red, and Green combined")
plt.show()
_images/0d14bc938e6a1eff9ea756ff5501eaf4afc5a64e26e7d97812fa1e76d8a96014.png
src.nodata
-inf

Band math: Calculating NDVI#

A common metric that we can derive from multiple bands is Normalized Difference Vegetation Index (NDVI), which is defined as:

\[NDVI = (NIR - Red) / (NIR + Red)\]
# NDVI Calculation: NDVI = (NIR - Red) / (NIR + Red)
ndvi = (nir_band - red_band) / (nir_band + red_band)
/var/folders/df/w347rlh96tx42h1kl69_4r8w0000gn/T/ipykernel_73601/134765836.py:2: RuntimeWarning: invalid value encountered in subtract
  ndvi = (nir_band - red_band) / (nir_band + red_band)

We received a RuntimeWarning about invalid values in our subtract operation. This is likely due to the ‘no data’ areas denoted by the black pixels in the true color image above. We can read the nodata value directly from rasterio and use numpy to change the nodata values in order to fix this issue:

# Set the nodata value to numpy nan for each of the bands
nir_band[nir_band == src.nodata] = np.nan
red_band[red_band == src.nodata] = np.nan
# NDVI Calculation: NDVI = (NIR - Red) / (NIR + Red)
ndvi = (nir_band - red_band) / (nir_band + red_band)
ndvi = ndvi.clip(-1, 1)

plt.figure(figsize=(6, 6))
plt.imshow(ndvi, cmap="RdYlGn", vmin=-1, vmax=1)
plt.colorbar(label="NDVI", shrink=0.5)
plt.title("NDVI")
plt.xlabel("Column #")
plt.ylabel("Row #")
plt.show()
_images/0ef996e1b4de6a136958cc468ff94c28a4b1ff05d2217608ee8dbbc1f5d84594.png

Xarray#

Xarray makes working with labelled multi-dimensional arrays in Python simple, efficient, and fun!

Xarray extends NumPy functionality by providing data structures specifically for geospatial multi-dimensional arrays:

DataArray: A labeled, multi-dimensional array, which includes dimensions, coordinates, and attributes.

Dataset: A collection of DataArray objects that share the same dimensions.
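
A minimal sketch of building these two structures by hand (the dimensions, coordinates, and values below are made up for illustration):

import numpy as np
import xarray as xr

# A DataArray: values plus named dimensions, coordinates, and attributes
temps = xr.DataArray(
    np.random.rand(2, 3, 4),
    dims=("time", "lat", "lon"),
    coords={
        "time": ["2024-01", "2024-02"],
        "lat": [35.0, 36.0, 37.0],
        "lon": [-122.0, -121.0, -120.0, -119.0],
    },
    attrs={"units": "degC"},
)

# A Dataset: a dictionary-like collection of DataArrays sharing dimensions
ds_example = xr.Dataset({"temperature": temps})
print(ds_example)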

Xarray data representation#

Xarray represents stacks of images as ‘cubes’, where two axes represent the latitude and longitude coordinates of spatial data, and a third axis represents time. Note that an xarray dataset can contain more than one variable!

The schematic below shows examples for temperature and precipitation

(figure: xarray data cube schematic showing temperature and precipitation variables)

Loading Data and inspecting attributes#

This data was downloaded from NOAA: https://psl.noaa.gov/data/gridded/data.ncep.reanalysis.html

import xarray as xr

ds = xr.open_dataset("./data/runof.sfc.mon.mean.nc")
ds
<xarray.Dataset> Size: 67MB
Dimensions:  (lat: 94, lon: 192, time: 923)
Coordinates:
  * lat      (lat) float32 376B 88.54 86.65 84.75 82.85 ... -84.75 -86.65 -88.54
  * lon      (lon) float32 768B 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
  * time     (time) datetime64[ns] 7kB 1948-01-01 1948-02-01 ... 2024-11-01
Data variables:
    runof    (time, lat, lon) float32 67MB ...
Attributes:
    description:    Data is from NMC initialized reanalysis\n(4x/day).  It co...
    platform:       Model
    Conventions:    COARDS
    NCO:            20121013
    history:        Mon Jul  5 23:58:07 1999: ncrcat runof.mon.mean.nc /Datas...
    title:          monthly mean runof.sfc from the NCEP Reanalysis
    dataset_title:  NCEP-NCAR Reanalysis 1
    References:     http://www.psl.noaa.gov/data/gridded/data.ncep.reanalysis...
ds["runof"].mean(dim="time").plot()
<matplotlib.collections.QuadMesh at 0x38e041090>
_images/508ff3927a204859d26cd20a99b821c02e36a6b0bf44064f3fad0feb73c5f6e6.png
# Select the data array describing runoff
ro = ds["runof"]
# Print some of the data array attributes

# Obtain a numpy array of the values
print(ro.values.shape)
print(ro.values)
(923, 94, 192)
[[[      nan       nan       nan ...       nan       nan       nan]
  [      nan       nan       nan ...       nan       nan       nan]
  [      nan       nan       nan ...       nan       nan       nan]
  ...
  [1.1632297 1.1696817 1.1729066 ... 1.1600037 1.1600037 1.1600037]
  [1.1600037 1.1600037 1.1600037 ... 1.1600037 1.1600037 1.1600037]
  [1.1729066 1.1761327 1.1790358 ... 1.1696817 1.1696817 1.1729066]]

 [[      nan       nan       nan ...       nan       nan       nan]
  [      nan       nan       nan ...       nan       nan       nan]
  [      nan       nan       nan ...       nan       nan       nan]
  ...
  [1.5599976 1.5599976 1.5599976 ... 1.4944816 1.5186188 1.5427573]
  [1.5389636 1.5389636 1.5389636 ... 1.5358603 1.5358603 1.5389636]
  [1.5599976 1.5599976 1.5599976 ... 1.5599976 1.5599976 1.5599976]]

 [[      nan       nan       nan ...       nan       nan       nan]
  [      nan       nan       nan ...       nan       nan       nan]
  [      nan       nan       nan ...       nan       nan       nan]
  ...
  [1.5599976 1.5599976 1.5599976 ... 1.5599976 1.5599976 1.5599976]
  [1.5599976 1.5599976 1.5599976 ... 1.5599976 1.5599976 1.5599976]
  [1.5599976 1.5599976 1.5599976 ... 1.5599976 1.5599976 1.5599976]]

 ...

 [[0.        0.        0.        ... 0.        0.        0.       ]
  [0.        0.        0.        ... 0.        0.        0.       ]
  [0.        0.        0.        ... 0.        0.        0.       ]
  ...
  [2.8000002 2.9566665 3.1999993 ... 2.8000002 2.8000002 2.8000002]
  [2.8000002 2.8000002 2.8000002 ... 2.8000002 2.8000002 2.8000002]
  [2.8000002 2.8000002 2.8000002 ... 2.8000002 2.8000002 2.8000002]]

 [[0.        0.        0.        ... 0.        0.        0.       ]
  [0.        0.        0.        ... 0.        0.        0.       ]
  [0.        0.        0.        ... 0.        0.        0.       ]
  ...
  [2.8000004 2.8000004 3.0354831 ... 2.8000004 2.8000004 2.8000004]
  [2.8000004 2.8000004 2.8000004 ... 2.8000004 2.8000004 2.8000004]
  [2.8000004 2.8000004 2.8000004 ... 2.8000004 2.8000004 2.8000004]]

 [[0.        0.        0.        ... 0.        0.        0.       ]
  [0.        0.        0.        ... 0.        0.        0.       ]
  [0.        0.        0.        ... 0.        0.        0.       ]
  ...
  [2.8000002 2.8000002 2.8000002 ... 2.8000002 2.8000002 2.8000002]
  [2.8000002 2.8000002 2.8000002 ... 2.8000002 2.8000002 2.8000002]
  [2.8000002 2.8000002 2.8000002 ... 2.8000002 2.8000002 2.8000002]]]
print(ro.coords)
Coordinates:
  * lat      (lat) float32 376B 88.54 86.65 84.75 82.85 ... -84.75 -86.65 -88.54
  * lon      (lon) float32 768B 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
  * time     (time) datetime64[ns] 7kB 1948-01-01 1948-02-01 ... 2024-11-01
print(ro.attrs)
{'long_name': 'Monthly Mean of Water Runoff', 'valid_range': array([-400.,  700.], dtype=float32), 'units': 'kg/m^2', 'precision': np.int16(1), 'var_desc': 'Water Runoff', 'level_desc': 'Surface', 'statistic': 'Mean', 'parent_stat': 'Individual Obs', 'dataset': 'NCEP Reanalysis Derived Products', 'actual_range': array([-5.e+00,  1.e+30], dtype=float32)}
print(ro.dims)
('time', 'lat', 'lon')

Filtering and subsetting#

We can easily select data based on dimension labels, which is very intuitive when working with geospatial data.

# Select data for a specific time and location
selected_data = ro.sel(time="2000-01-01", lat=40.0, lon=120.0, method="nearest")
selected_data
<xarray.DataArray 'runof' ()> Size: 4B
array(1.587088, dtype=float32)
Coordinates:
    lat      float32 4B 40.95
    lon      float32 4B 120.0
    time     datetime64[ns] 8B 2000-01-01
Attributes:
    long_name:     Monthly Mean of Water Runoff
    valid_range:   [-400.  700.]
    units:         kg/m^2
    precision:     1
    var_desc:      Water Runoff
    level_desc:    Surface
    statistic:     Mean
    parent_stat:   Individual Obs
    dataset:       NCEP Reanalysis Derived Products
    actual_range:  [-5.e+00  1.e+30]
# Slice data across a range of times
time_slice = ro.sel(time=slice("2013-01-01", "2013-01-31"))
time_slice
<xarray.DataArray 'runof' (time: 1, lat: 94, lon: 192)> Size: 72kB
array([[[0.      , 0.      , ..., 0.      , 0.      ],
        [0.      , 0.      , ..., 0.      , 0.      ],
        ...,
        [2.580681, 2.590358, ..., 2.567777, 2.574229],
        [2.800049, 2.800049, ..., 2.800049, 2.800049]]],
      shape=(1, 94, 192), dtype=float32)
Coordinates:
  * lat      (lat) float32 376B 88.54 86.65 84.75 82.85 ... -84.75 -86.65 -88.54
  * lon      (lon) float32 768B 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
  * time     (time) datetime64[ns] 8B 2013-01-01
Attributes:
    long_name:     Monthly Mean of Water Runoff
    valid_range:   [-400.  700.]
    units:         kg/m^2
    precision:     1
    var_desc:      Water Runoff
    level_desc:    Surface
    statistic:     Mean
    parent_stat:   Individual Obs
    dataset:       NCEP Reanalysis Derived Products
    actual_range:  [-5.e+00  1.e+30]
# Calculate the mean runoff over time
mean_ro = ro.mean(dim="time")
mean_ro
<xarray.DataArray 'runof' (lat: 94, lon: 192)> Size: 72kB
array([[0.       , 0.       , 0.       , ..., 0.       , 0.       ,
        0.       ],
       [0.       , 0.       , 0.       , ..., 0.       , 0.       ,
        0.       ],
       [0.       , 0.       , 0.       , ..., 0.       , 0.       ,
        0.       ],
       ...,
       [2.635003 , 2.6681736, 2.7444425, ..., 2.6152654, 2.6228926,
        2.6288702],
       [2.6165829, 2.6178818, 2.6189005, ..., 2.6109586, 2.6131048,
        2.6150517],
       [2.6433473, 2.6445358, 2.645743 , ..., 2.6407027, 2.641557 ,
        2.6424491]], shape=(94, 192), dtype=float32)
Coordinates:
  * lat      (lat) float32 376B 88.54 86.65 84.75 82.85 ... -84.75 -86.65 -88.54
  * lon      (lon) float32 768B 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1

Plotting and visualization#

Xarray provides builtin plotting methods that can be readily modified and integrated with standard matplotlib syntax

mean_ro.plot(cmap="jet", figsize=(10, 6))
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Mean Monthly Runoff")
Text(0.5, 1.0, 'Mean Monthly Runoff')
_images/50103e3ef66df65a8a8524344150233f0759f322ac05674ffea65f5fa661fcba.png

We can use the .isel method to select a spatial subset of the data by index:

cropped_ds = mean_ro.isel(lat=slice(20, 40), lon=slice(120, 160))
cropped_ds.plot()
<matplotlib.collections.QuadMesh at 0x38e70ad50>
_images/10a7b963a84f432f93cdd055b8625ce05b72ea1126a58ff4a7a1bd82be89d403.png
# Plot a time series for a specific location
ro.sel(lat=35.0, lon=240, method="nearest").plot()
plt.show()
_images/5fb741939d33a408fbc48abca6aa077e480f545ef49539a1d2238f98569d8b69.png

GDAL#

The Geospatial Data Abstraction Library (GDAL) is the magical piece of open source software that underlies all of the above libraries.

While GDAL has Python bindings, their syntax is more complicated and less user-friendly than the libraries above, so explicit coverage of GDAL syntax is outside the scope of this course.

For large batch, big data, or highly custom processing tasks, GDAL offers a large array of command-line utilities (see here) that can be run from the Anaconda Prompt command line introduced in module 1; a short example follows.
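
As a small taste of those utilities, the sketch below shells out from Python with subprocess (an assumption: GDAL's command-line tools are installed and on your PATH; the output filename is hypothetical). gdalinfo prints a raster's metadata and gdalwarp reprojects it:

import subprocess

# Print the CRS, size, bounds, and band metadata of the clipped DEM from earlier
subprocess.run(["gdalinfo", "./data/clipped.tif"], check=True)

# Reproject the raster to geographic WGS 84; -t_srs sets the target CRS
subprocess.run(["gdalwarp", "-t_srs", "EPSG:4326", "./data/clipped.tif", "./data/clipped_gdalwarp.tif"], check=True)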