Spatial analysis is a type of GIS analysis that uses math and geometry to understand patterns that happen over space and time, including patterns of human behavior and natural phenomena.
When performing spatial analysis or spatial data science, the right tools can open a world of free and collaborative analytics capabilities without costly software licenses.
We are going to give you a quick tour of some of the open source Python libraries available for geospatial analysis. All of these libraries integrate easily with JupyterLab, and many are built to scale to large datasets.
Let’s get to it:
Vector Data Libraries
In GIS, the term “vector” describes discrete geometries (points, lines, polygons) with related attribute data (e.g. name, county identifier, population). We can use different geometries to represent the same phenomenon depending on our scale and level of measurement. For instance, we can represent the White House as a point, line, or polygon depending on whether we want to look at a building point-of-interest, building outline, or building footprint.
GeoPandas is all about making it easy to work with geospatial data in Python. It extends the pandas data types with a new data structure called the GeoDataFrame, a DataFrame that carries a geometry column. GeoPandas builds on the foundational Python packages Shapely and Fiona, both great packages created by Sean Gillies.
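As a minimal sketch of what working with a GeoDataFrame can look like, the example below builds one from in-memory points and filters it by location. The site names, visitor counts, and study-area rectangle are made up for illustration.

```python
import geopandas as gpd
from shapely.geometry import box

# Illustrative, made-up attribute values; coordinates are (longitude, latitude).
sites = gpd.GeoDataFrame(
    {"name": ["site_a", "site_b", "site_c"], "visitors": [120, 45, 300]},
    geometry=gpd.points_from_xy([-77.04, -122.33, -77.01], [38.90, 47.61, 38.89]),
    crs="EPSG:4326",
)

# A rectangular study area: box(minx, miny, maxx, maxy).
study_area = box(-78.0, 38.0, -76.0, 40.0)

# Spatial filter: keep only the rows whose point falls inside the study area.
in_area = sites[sites.geometry.within(study_area)]

print(in_area[["name", "visitors"]])
print("Total visitors in study area:", in_area["visitors"].sum())
```

The attribute columns behave like ordinary pandas columns, so the usual filtering and aggregation idioms carry over.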
Moving down in the stack from GeoPandas, Shapely wraps GEOS and defines the actual geometry objects (points, lines, polygons) and the spatial relationships between them (e.g. touches, within, contains). You can use Shapely directly without GeoPandas, but in a dataframe-centric world, Shapely is less of a direct tool and more of a dependency for higher-level packages.
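For a feel of Shapely used on its own, here is a small sketch with hand-made geometries (the coordinates are arbitrary) that checks a few spatial relationships.

```python
from shapely.geometry import Point, Polygon

# Two unit-less squares that share an edge, plus a point inside the first one.
square = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
neighbor = Polygon([(2, 0), (4, 0), (4, 2), (2, 2)])
inside = Point(1, 1)

# Spatial predicates return plain booleans.
point_within_square = inside.within(square)
square_contains_point = square.contains(inside)
squares_touch = square.touches(neighbor)        # shared boundary, no shared interior
point_within_neighbor = inside.within(neighbor)

print(point_within_square, square_contains_point, squares_touch, point_within_neighbor)
print("Area of square:", square.area)
```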
Fiona can read and write many kinds of geospatial vector data and integrates easily with other Python GIS libraries. It relies on GDAL/OGR to read shapefiles, GeoPackages, GeoJSON, TopoJSON, KML, and GML from both the local filesystem and cloud services like Amazon S3, where it can use Python’s boto3 library for credentials.
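A rough sketch of a Fiona round trip follows: it writes two point features to a temporary GeoJSON file and reads them back. The file name and feature names are hypothetical, and no coordinate reference system is passed, on the assumption that GeoJSON's default of longitude/latitude is acceptable here.

```python
import os
import tempfile

import fiona

# Schema: the geometry type plus attribute names and their types.
schema = {"geometry": "Point", "properties": {"name": "str"}}

# Hypothetical records for illustration; coordinates are (longitude, latitude).
records = [
    {"geometry": {"type": "Point", "coordinates": (-77.0365, 38.8977)},
     "properties": {"name": "white_house"}},
    {"geometry": {"type": "Point", "coordinates": (-122.33, 47.61)},
     "properties": {"name": "seattle"}},
]

path = os.path.join(tempfile.mkdtemp(), "places.geojson")

# Write the features to disk.
with fiona.open(path, "w", driver="GeoJSON", schema=schema) as dst:
    dst.writerecords(records)

# Read them back, one feature at a time.
with fiona.open(path) as src:
    driver = src.driver
    names = [feature["properties"]["name"] for feature in src]

print(driver, names)
```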
Spatialpandas provides pandas and Dask extensions for vector-based spatial and geometric operations. It is a good tool for working with vectorized geometric algorithms written with Numba or Python. The library was first used for polygon rasterization with Datashader and has since become its own standalone project.
PySAL is a good tool for developing high-level applications for spatial regression, spatial econometrics, statistical modeling on spatial networks, and spatio-temporal analysis, as well as for detecting hot spots, clusters, and outliers.
It consists of four groups of packages, each focused on a different aspect of spatial analysis:
Lib for computational geometry problems
Explore for exploratory analysis of spatial and spatio-temporal data
Model for confirmatory analysis through estimations of relationships
Viz to create geovisualisations and link to other visualization tool-kits in the Python ecosystem
Raster Data Libraries
Rasters are regularly gridded datasets like GeoTIFFs, JPGs, and PNGs. Regular grids are useful for representing continuous phenomena that are not cleanly represented by points, lines, and polygons. For instance, to analyze weekly rainfall for Seattle, we would start with weather station rainfall measurements (points) and interpolate values to create a raster (a continuous surface) that represents rainfall over the entire city.
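The points-to-raster idea can be sketched with plain NumPy. The example below uses inverse distance weighting, which is only one of several interpolation methods and is an assumption on our part; the station locations and rainfall values are made up.

```python
import numpy as np

# Hypothetical weather stations: (x, y) positions and measured rainfall in mm.
stations_xy = np.array([[1.0, 1.0], [8.0, 2.0], [5.0, 8.0]])
rain_mm = np.array([10.0, 20.0, 30.0])

# Regular grid of cell centers covering the area of interest.
xs = np.linspace(0.0, 10.0, 11)
ys = np.linspace(0.0, 10.0, 11)
grid_x, grid_y = np.meshgrid(xs, ys)
cells = np.column_stack([grid_x.ravel(), grid_y.ravel()])

# Distance from every cell to every station; clamp to avoid division by zero.
dist = np.linalg.norm(cells[:, None, :] - stations_xy[None, :, :], axis=2)
dist = np.maximum(dist, 1e-12)

# Inverse distance weighting: closer stations get larger weights.
weights = 1.0 / dist**2
raster = (weights @ rain_mm / weights.sum(axis=1)).reshape(grid_x.shape)

print(raster.shape)
print(raster.round(1))
```

Each cell ends up as a weighted average of the station values, so the surface stays within the range of the measurements and matches a station's value at that station's cell.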
GDAL is the Geospatial Data Abstraction Library, which contains input, output, and analysis functions for over 200 geospatial data formats. It offers bindings for many popular programming languages and includes a CLI (command line interface) for quick raster processing tasks (resampling, type conversion, etc.).
Rasterio, another creation from the prolific Sean Gillies, is a wrapper around GDAL for use within the Python scientific data stack and integrates well with Xarray and NumPy. It can read and write several raster formats, including Cloud Optimized GeoTIFFs (COGs). A key goal is to provide high performance and reduced cognitive load for Python developers by using a familiar syntax.
Datashader is a general-purpose rasterization pipeline. We’ve mentioned the difference between vector and raster, so what if you want to convert from a vector type to a raster type? This is where Datashader comes in, letting you grid your data in a controlled way. It makes it easy to create graphics pipelines with a little bit of code, handles large datasets, and produces meaningful visualizations through a stepwise process that reduces the trial and error usually involved in visualizing large datasets. As a side note, the makepath team includes core developers on Datashader.
Xarray-Spatial implements common raster analysis functions using Numba and provides a codebase that is easy to install and extend. It originated from the Datashader project and includes tools for surface analysis (e.g. slope, curvature, hillshade, viewshed), proximity analysis (e.g. Euclidean distance, great-circle distance), and zonal / focal analysis (summary statistics by region or neighborhood). It does not depend on GDAL or GEOS and was created to support the core raster analysis functions that GIS developers and analysts need.
RTree wraps the libspatialindex library for building and querying large indexes of rectangles. Most of the time these rectangles represent the bounding boxes of polygons, which makes RTree essential for fast point-in-polygon operations. One downside is that the underlying C/C++ code is not thread-safe, which can cause problems when the same index is accessed from different threads or processes. It is still a very useful tool, and GeoPandas also wraps it.
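The bounding-box trick can be sketched as a two-step query: ask the index for polygons whose bounding box contains the point, then run the exact geometric test only on those candidates. The polygons below are made up, Shapely is brought in for the exact test, and `polygons_containing` is a hypothetical helper name.

```python
from rtree import index
from shapely.geometry import Point, Polygon

# Two made-up polygons keyed by id: a triangle and a square.
polygons = {
    0: Polygon([(0, 0), (4, 0), (0, 4)]),
    1: Polygon([(5, 5), (7, 5), (7, 7), (5, 7)]),
}

# Index each polygon by its bounding box (minx, miny, maxx, maxy).
idx = index.Index()
for pid, poly in polygons.items():
    idx.insert(pid, poly.bounds)


def polygons_containing(x, y):
    """Return (bounding-box candidates, exact hits) for a point."""
    point = Point(x, y)
    # Cheap step: which bounding boxes contain the point?
    candidates = list(idx.intersection((x, y, x, y)))
    # Exact step: run the real point-in-polygon test on the candidates only.
    hits = [pid for pid in candidates if polygons[pid].contains(point)]
    return sorted(candidates), sorted(hits)


print(polygons_containing(1, 1))  # inside the triangle
print(polygons_containing(3, 3))  # inside the triangle's bounding box only
print(polygons_containing(6, 6))  # inside the square
```

The point (3, 3) shows why the second step matters: the index reports the triangle as a candidate, but the exact test rejects it.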
Uber developed H3, a hexagonal grid indexing system, for more targeted exploration and visualization of its spatial data. H3 indexes locations with hexagons, which account for the mobility of data points better and introduce less quantization error than other shapes such as squares.
Hexagons are also a good choice for quick and easy radius approximations. The hierarchical approach lets you truncate the precision/resolution of an index without losing the original indexes.
H3 was written in C, and there is also a Python binding, to “hexagonify your world”.
NetworkX comes into play for the analysis of graphs and complex networks. A nice plus is its flexibility: it works with a variety of data types, from text and images to XML records, and with large volumes of data, up to tens of millions of nodes and edges.
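A typical geospatial use is routing over a road network. The sketch below builds a tiny made-up graph whose edge weights stand in for travel distances and asks for the cheapest route between two nodes.

```python
import networkx as nx

# A made-up road network: (from, to, distance) triples.
roads = nx.Graph()
roads.add_weighted_edges_from([
    ("a", "b", 4.0),
    ("b", "c", 3.0),
    ("a", "c", 10.0),
    ("c", "d", 2.0),
])

# Weighted shortest path from "a" to "d".
route = nx.shortest_path(roads, source="a", target="d", weight="weight")
length = nx.shortest_path_length(roads, source="a", target="d", weight="weight")

print(route, length)
```

Going through "b" costs 4 + 3 + 2 = 9, which beats the direct "a" to "c" edge at 10 + 2 = 12.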
PyProj is useful for map projections, which define how we distort a 3D world when converting it to a 2D map. PyProj wraps the PROJ library (formerly Proj4) and performs cartographic transformations between coordinate reference systems like WGS84 (longitude / latitude) and UTM (meters east / meters north). Map projections can be difficult to understand, and PyProj does a great job of making them approachable.
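As a minimal sketch, the example below converts an approximate Seattle location from WGS84 longitude / latitude to UTM zone 10N and back. The EPSG codes 4326 and 32610 identify those two systems; the coordinates are rounded for illustration.

```python
from pyproj import Transformer

# WGS84 (EPSG:4326) to UTM zone 10N (EPSG:32610).
# always_xy=True fixes the axis order to (longitude, latitude) / (easting, northing).
to_utm = Transformer.from_crs("EPSG:4326", "EPSG:32610", always_xy=True)
to_wgs84 = Transformer.from_crs("EPSG:32610", "EPSG:4326", always_xy=True)

# Approximate longitude / latitude of Seattle.
lon, lat = -122.33, 47.61

easting, northing = to_utm.transform(lon, lat)
lon_back, lat_back = to_wgs84.transform(easting, northing)

print(f"UTM: {easting:.1f} m east, {northing:.1f} m north")
print(f"Round trip: {lon_back:.6f}, {lat_back:.6f}")
```

Working in a projected system like UTM means distances and areas come out in meters rather than degrees.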
For interactive map visualization, Folium supports GeoJSON, TopoJSON, image and video overlays.
There are many tools at our disposal to do geospatial data analysis and visualizations. Although we just highlighted some tools in the Python stack, geospatial analysis is not limited to Python.
Python’s motto is “Programming for Everybody” and this certainly holds true for the geo community. Python is the first language for many aspiring data scientists and we hope this list will help you on your geo-journey.
We would love to hear from you!
Do you have any questions, suggestions, or Python/non-Python stacks you love doing your spatial analysis with? Share your ideas in the comments.