Blog

Superpowered GIS: Esri’s ArcGIS + Open Source Spatial Analysis Tools.

The 2020 Esri User Conference is happening next week.  It’s virtual this year due to COVID-19, but I’m sure the event will be full of great presentations and announcements.

The UC is a great time to expand the depth and breadth of your GIS toolkit.  I’ve always found the technical tracks great, but there are some topics that are too hot for the UC.

Let’s talk about one of those too hot for UC topics…Esri + Open Source.

This topic isn’t actually as hot as you’d think:

  1. Esri has done a great job at integrating open source tools and standards directly into their proprietary products. Some of these tools include:
    1. Python
    2. GeoJSON
    3. OGC Services
    4. NetCDF
    5. Conda
    6. Numpy
    7. Jupyter Notebook

These integrations make it much easier to interoperate with 3rd-party analysis tools outside of ArcGIS. With additional tools, it’s possible to expand the breadth and depth of analysis questions.

  2. Esri has open-sourced many of its internal projects, which gives the larger open source ecosystem a huge boost. Some examples include:
    1. Shapefile spec
    2. Java geometry toolkit
    3. Multiple Web APIs (Esri JS API, Esri-Leaflet, not to mention Flex and Silverlight APIs which were awesome…shout out to the Flex Team circa 2010…Adobe Flash will never die…)
    4. Beautiful map books and cartography guides
    5. Great algorithm documentation
  3. Esri is a major sponsor of and contributor to FOSS4G, PyData, and a number of meetups that help promote geoscience and open source software.

But it wouldn’t be 2020 if we didn’t get a little spicier…

Let’s talk about how you can keep the Esri stack, while adding some non-Esri open source projects to round out your toolkit.

File Geodatabases are handy within ArcGIS, but are rarely supported outside of the Esri ecosystem.  QGIS is an open source desktop GIS capable of reading and writing Esri’s File Geodatabase format. QGIS also has over 400 unique geoprocessing tools and can export data in interesting formats like GeoPackages and Cloud Optimized GeoTIFFs.
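
If you want to pull File Geodatabase layers straight into Python as well, GDAL’s open source OpenFileGDB driver can read them through GeoPandas. A minimal read-only sketch (the .gdb path and layer name below are placeholders, not real data):

import geopandas

# GDAL's OpenFileGDB driver handles the .gdb directory; no Esri license is required to read it
parcels = geopandas.read_file("city_data.gdb", layer="parcels")
print(parcels.crs, len(parcels))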

ArcGIS Online and ArcGIS Server allow you to interact with feature services via their REST APIs. 

You can start by constructing a query URL to a feature service that returns GeoJSON (`service/query?f=geojson`). 

GeoPandas’ `read_file()` function can use this URL directly to query the service and return a GeoPandas GeoDataFrame. GeoPandas wraps shapely, pyproj, and rtree, which are keystone tools for geospatial vector analysis in Python.
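
A minimal sketch (the service URL and query parameters below are placeholders, not a real endpoint):

import geopandas

# Hypothetical feature service endpoint -- substitute your own organization, service, and query
url = ("https://services.arcgis.com/<org_id>/arcgis/rest/services/"
       "Pharmacies/FeatureServer/0/query?where=1%3D1&outFields=*&f=geojson")

# read_file accepts a URL; the GeoJSON response is parsed into a GeoDataFrame
gdf = geopandas.read_file(url)
print(gdf.head())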

One of the superpowers within ArcGIS is arcpy’s RasterToNumPyArray and NumPyArrayToRaster.  These I/O functions convert between ArcGIS rasters and NumPy arrays.  Many other libraries speak “NumPy array” because NumPy is a foundational library for Scientific Python and serves as a connector to an entire ecosystem of tools like scikit-learn, Datashader, Xarray-Spatial, TensorFlow, pandas, and Rasterio!
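
Here is a rough sketch of that round trip (the raster name, the rescaling step, and the output path are placeholder assumptions; arcpy itself requires an ArcGIS license):

import arcpy

# ArcGIS raster -> NumPy array
dem = arcpy.Raster("elevation.tif")
arr = arcpy.RasterToNumPyArray(dem)

# ...hand arr to any NumPy-aware library; as a trivial example, rescale the values to 0-1...
rescaled = (arr - arr.min()) / (arr.max() - arr.min())

# NumPy array -> ArcGIS raster, preserving the original extent and cell size
out = arcpy.NumPyArrayToRaster(rescaled,
                               arcpy.Point(dem.extent.XMin, dem.extent.YMin),
                               dem.meanCellWidth,
                               dem.meanCellHeight)
out.save("elevation_rescaled.tif")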

Examples of using ArcGIS with Xarray-Spatial + Datashader

Do you want Superpowered GIS? Open source projects can expand your toolbox as an Esri user. If you want to get started, here is a resource with some open source spatial analysis tools for Python.

At makepath, we sponsor and contribute to Xarray-Spatial to extend your raster analysis capabilities while still using the powerful Esri toolset.

Do you have your own ways to supercharge your GIS analysis? We’d love to hear from you in the comments or at contact@makepath.com.

In-Depth Video: Pharmacy Desert Identification with Open Source Spatial Analysis Tools

Our co-founder Brendan Collins did an in-depth walkthrough of how we used spatial analysis tools to identify pharmacy deserts.

This video is a companion to the original blog post which was published on June 9th, 2020.

Do you have any thoughts, comments, or ideas about this post?

We would love to hear from you in the comments or through contact@makepath.com.

Seniors at Risk: Using Spatial Analysis to Identify Pharmacy Deserts

10 US Counties identified as pharmacy deserts with high percentages of senior citizens.

Seniors are especially vulnerable during the COVID-19 pandemic.

At makepath, we build and use open source spatial analysis tools such as Xarray-Spatial and Datashader.

In this post, we show you how to use those tools to identify and quantify pharmacy deserts where seniors are at especially high risk.

We will walk through our analysis to show who is at risk. For readers who want to dive deeper, we provide a link to an example notebook showing the implementation using Xarray-Spatial and Datashader.

With analysis results in hand, we called people who live in these areas to get their perspectives and put faces and voices to the problem of pharmacy deserts.

You can apply these free and open source spatial analysis tools to your own problem domains while using this case study and accompanying notebook as a starting place.

At makepath we specialize in custom implementations to tackle big data questions, especially those with a focus on geospatial analysis. If you have any questions, please email us at contact@makepath.com.

What is a Pharmacy Desert?

A pharmacy desert is an area with reduced access to prescription and over-the-counter medications. Older residents are more vulnerable to pharmacy deserts because they are more dependent on these essential supplies.

Understanding pharmacy deserts is important to businesses looking to expand and to government agencies looking to guide development, provide services, and save lives.

Using spatial analytics and publicly available data, we identified 10 counties with high percentages of residents 65 and older who live in pharmacy deserts.

These counties include:

  • Hooker County, Nebraska
  • Catron County, New Mexico
  • Harding County, New Mexico
  • Wheeler County, Oregon
  • Jeff Davis County, Texas
  • Prairie County, Montana
  • Edwards County, Texas
  • Real County, Texas
  • Idaho County, Idaho
  • Valley County, Idaho

We identified these counties using the raster-based spatial analysis tools within Xarray-Spatial. Our conclusion is that in counties like Hooker, Catron and Harding, pharmacies have an opportunity to serve these communities.

Let’s now understand how we arrived at these 10 counties. The following analysis can be performed on consumer-grade computers using free and open source tools with publicly available data.

Step 0. Get Publicly Available Data

Our analysis of pharmacy deserts starts by obtaining pharmacy locations from the Department of Homeland Security (DHS). The DHS regularly updates pharmacy locations, with precise latitude/longitude coordinates, via its ArcGIS Online organization.

With pharmacy locations downloaded, we now need age information. ArcGIS Online also makes census information available pre-joined with age and sex demographics. Here we are able to obtain census blockgroups with our desired attributes as a shapefile. Let’s also grab county boundaries from the US Census Bureau.

In our analysis notebook, we load data into memory using pandas (pharmacy locations), and geopandas (block groups, counties).

import pandas
import geopandas

pharmacies_df = pandas.read_csv("pharmacy_locations.csv")
blockgroup_df = geopandas.read_file("bg_wgs84.shp")
counties_df = geopandas.read_file("counties_wgs84.shp")

These three layers (pharmacies, blockgroup ages, and counties) are all we need to begin an exploratory spatial analysis of pharmacy deserts. We will be performing raster-based analysis to align data to a regular grid overlay with layers on top of one another.

Step 1. Define Study Area

1600×800 pixel mask of lower-48 United States Study Area

We start our analysis by defining the study area. We’ll spatially align our data to this study area’s resolution. The precision of our analysis is controlled by the resolution of this layer.
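
A minimal sketch of one way to set up such a grid with Datashader (the bounding box and column names are assumptions, not values from our notebook):

import datashader

# Hypothetical lower-48 bounding box in lon/lat at 1600 x 800 pixels
canvas = datashader.Canvas(plot_width=1600, plot_height=800,
                           x_range=(-125, -66), y_range=(24, 50))

# Rasterize pharmacy points onto the study grid (assumes longitude/latitude columns named X and Y)
pharmacy_raster = canvas.points(pharmacies_df, x='X', y='Y')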

Step 2. Create Distance to Pharmacy Layer

Proximity grid to nearest pharmacy classified into 5 categories using natural breaks. Lighter color means farther away from pharmacy.
distance = proximity(pharmacy_raster, distance_metric='GREAT_CIRCLE')
classified_distance = natural_breaks(distance, k=5)

Using pharmacy latitude / longitude locations, we can generate a proximity grid which shows the distance of every pixel to the nearest pharmacy location. Since we are dealing with geographic coordinates, we use great circle distance to compute distance over a spheroid. Then we classify the distances into 5 categories using natural breaks.

Step 3. Create Percent Older than 65 Layer

Percent of residents over 65 years old, classified into 5 categories using natural breaks. Lighter means a greater percentage of 65+ residents.
classified_age = natural_breaks(percent_pop_65_raster, k=5)

Using US Census Block Groups, we can create a raster based on the percentage of 65+ residents in the block group. We then take this age raster and classify it into 5 categories using natural breaks.

Step 4. Combine Layers to Find Areas of Interest

This binary raster (values either 0 or 1) shows areas of interest, defined as locations with a high percentage of senior citizens AND far from the nearest pharmacy.
is_pharmacy_desert = binary(classified_distance, [5]) # only category 5
is_high_percent_above_65 = binary(classified_age, [5]) # only category 5
area_of_interest = is_pharmacy_desert * is_high_percent_above_65

With distance and age layers defined, we can create binary (0 / 1) layers for both and multiply them to get our areas of interest. When we use the Xarray-Spatial binary classification tool, we get back a new layer of only ones and zeros. When we multiply these two 0/1 layers together, all locations with a zero drop out and only locations with a 1 in both the age and distance categories remain. We’ve now isolated our areas of interest: those that are far from pharmacies and have a high percentage of senior citizens.

Step 5. Summarize Areas of Interest by County

County boundaries from US Census Bureau which are used to summarize our pharmacy desert layer.
results = zonal_stats(counties_raster, area_of_interest, zonal_funcs)
results.nlargest(10, 'pharmacy_desert_score')

Zonal statistics allow us to summarize our areas of interest by county. The results are individual pharmacy_desert_scores for each county. We can then take the top 10 scores, which gives us our final list of counties to target with outreach efforts.


COUNTY NAME     SENIOR CITIZEN PHARMACY DESERT SCORE
Hooker          1.00
Catron          0.81
Harding         0.64
Wheeler         0.51
Prairie         0.47
Jeff Davis      0.47
Edwards         0.44
Real            0.41
Idaho           0.34
Valley          0.33
Output table from Zonal Statistics operation which summarizes our pharmacy desert score raster for each county in the lower-48 United States. We then take the top 10 counties by pharmacy_desert_score.

Picking Up the Phone

We used tools available in Xarray-Spatial and Datashader to visualize and quantify pharmacy deserts by county, with a focus on areas whose older residents are at risk from reduced access to medications.

Then it was time to pick up the phone. From this simple analysis, we set up phone calls with residents of these communities to learn more about their lives. These interviews helped us ground-truth the analysis and verify our results.

  • Jennifer Peters lives in Hooker County, Nebraska. She confirmed the challenging reality of living 72 miles from the nearest pharmacy. “Hooker County residents rely on delivery services for prescriptions, and connect on social media to share meds,” said Peters.
  • Scott Johnston is the County Assessor in Catron County, New Mexico. He shared that a large percentage of county residents are above 65 years-old and are negatively impacted by the two-hour round trip drive to get prescriptions filled.
  • Virginia Smith is the Senior Citizen Program Coordinator in Harding County, New Mexico. She described the stress of the three-hour round trip to fill prescriptions. Many Harding County residents do not have a credit card, which is necessary to get medications by mail.

This analysis highlights areas with higher percentages of senior citizens that fall within pharmacy deserts. Our analysis is limited to the contiguous United States, but the techniques can easily be applied to different study areas and topics.

You can find the full analysis in the Xarray-Spatial GitHub repo.

Conclusion

Spatial analysis can be a powerful tool to answer real world questions. Using open source tools, we were able to quickly identify pharmacy deserts. We then confirmed our results by calling people and listening to their challenges.

This was an exploratory analysis. If we were to go further, our next step would be to do a more localized study of the problem areas we initially identified.

This would include zoomed-in maps and a more thorough site selection analysis for future pharmacy builds. We would enrich the layers with additional characteristics of the population and average distance to pharmacies overlaid with available commercial real estate parcels.

We would also conduct a larger number of interviews to ground-truth the conclusions with a larger degree of confidence.

Have you used open source tools to answer this kind of question? What has been your experience with pharmacy deserts or other similar problems? Do you need help developing custom GIS solutions to answer complex geospatial questions?

Please let us know in the comments, or reach out at contact@makepath.com.

The State of Spatial Data Science: A Perspective from CARTO.

This guest post was written by Javier de la Torre, CSO and Founder of CARTO, a makepath partner company.

Spatial data science (SDS) is a subset of Data Science that focuses on the unique characteristics of spatial data, moving beyond simply looking at where things happen to understand why they happen there. SDS treats location, distance & spatial interactions as core aspects of the data using specialized methods & software to analyze, visualize & apply learnings to spatial use cases.

What’s the difference between Spatial Data Science & GIS?

Geographic information systems (GIS) applies to a wide range of users & use cases, yet is one of those strange anomalies that, despite its value spanning many industries, has remained a niche field – often siloed from other business units. GIS typically refers to varied types of information systems such as websites, apps, or databases that store different types of spatial data.

With new types of users such as Data Scientists, GIS is starting to happen more outside of traditional GIS tools – allowing more sophisticated spatial analyses to take place in connection with new Data Science & Big Data solutions. This shift is allowing Spatial Data Science to emerge as a discipline with greater interactivity with Open Source & Cloud technologies.

GIS is exploding, and our industry has never been bigger than it is now – with a growing number of players not only providing cross-industry platforms, but also niche industry geospatial specialists. This means that GIS is happening where we’re not used to it happening – outside of traditional GIS tools.

What are the consequences of the shift we’re seeing?

  1. GIS specialists will have to adapt and join new communities, such as Data Engineering, Data Science or Development groups. As they upskill, they will need to work across multidisciplinary teams and think of how their work can provide value beyond a single tool, bringing their spatial knowledge to the table.
  2. The days of the full stack GIS expert are over. We cannot expect GIS specialists to also be experts in creating websites, managing servers, spatial modeling, creating data pipelines, making dashboards and telling stories. There are totally different skill sets and entire industries devoted to these disciplines, so GIS experts will have to choose where they want to focus their professional development.
  3. Geospatial products will have to adapt or die. GIS is no longer an island where you start and finish with GIS. It is part of a bigger ecosystem, which means that many of our capabilities will need to be connected or blended with other products, rather than competing with them. It is our responsibility as an industry to ensure that newcomers to GIS, who don’t even know what it is, can do great spatial analysis, using good cartography, and ensuring that they learn how not to lie with maps.
  4. Spatial data infrastructures have to change. Infrastructures will need to play well with key cloud solutions and databases, as well as Open Source technologies that are now widely used in the enterprise. Connectivity will be key, which is why our integration with Google BigQuery is so important to many of our users.
  5. Expand awareness of good GIS among different communities. There should always be a track on spatial analysis at Data Science, Data Engineering, and Developer conferences, as we have at the Spatial Data Science Conference.

How does the adoption of spatial analytics vary across industries?

Retail, OOH, Real Estate and Private Equity have been growing rapidly in terms of SDS capability, which is something our Data Science team here at CARTO discussed in this video:

However, we are also seeing huge growth in verticals like Pharmaceuticals, Healthcare and Logistics that are turning to location data. You can see more specifics on the use cases here.

One thing we see in the market is that it is very competitive to hire talent in the SDS space, so very often we see more lucrative industries (such as management consulting or private equity) attracting some of the top candidates. 

What are the biggest challenges facing spatial data science?

One big challenge is that there are still only a limited number of Spatial Data Scientists out there, with only 1 in 3 Data Scientists claiming to be experts in spatial analysis (according to our recent report). Raising awareness and ensuring there are enough great programs, like the ones provided by the Center for Spatial Data Science (CSDS) at the University of Chicago and The Bartlett Centre for Advanced Spatial Analysis at UCL, is key. This is why we’ve been collaborating with them and others to provide resources like our recent Becoming a Spatial Data Scientist ebook.

The other huge challenge we see is that companies and their Data Science professionals are wasting too much time on everything but analysis:

Only 20% of their time is being spent on analysis, vs the other 80% being spent on discovery, evaluation and ETLing. This means that companies have Data Scientists working on everything but what they hired them for, with hundreds of different departments waiting to work on projects with them that require spatial analysis muscle. 

Future Outlook

Our mission is to spatially enable every Data Scientist, saving them time and making it easier to access high quality location data for their spatial models, whether it’s foot traffic, financial, housing, geosocial or climate data. That’s why we’re focusing on CARTOframes and our Data Observatory – making it easier for them to carry out such analysis within their Jupyter notebooks without needing to switch context all the time.

Lots of GIS experts have already realized the changes we’ve discussed in this article. In fact, at the latest edition of the Spatial Data Science Conference, approximately 20% of our attendees came from a GIS background and were looking to connect with the world of Data Science.

We need to make the work of Spatial Data Scientists more productive and enjoyable, making it easier for Developers and Analysts to collaborate with them. GIS is reincarnating, and the current situation we’re living through in society and our economy is only going to accelerate that evolution.

Open Source Spatial Analysis Tools for Python: A Quick Guide

Spatial analysis is a type of GIS analysis that uses math and geometry to understand patterns that happen over space and time, including patterns of human behavior and natural phenomena.

When performing spatial analysis or spatial data science, the right tools can open a world of free and collaborative analytics capabilities without costly software licenses.

We are going to give you a quick tour of some of the open source Python libraries available for geospatial analysis. All of these libraries can be easily integrated with JupyterLab and scale to large datasets. 

Let’s get to it:

Vector Data Libraries

In GIS, the term “vector” describes discrete geometries (points, lines, polygons) with related attribute data (e.g. name, county identifier, population). We can use different geometries to represent the same phenomena depending on our scale and level of measurement. For instance, we can represent the White House as either a point, line, or polygon depending on whether we want to look at a building point-of-interest, building outline, or building footprint.

GeoPandas

GeoPandas is all about making it easy to work with geospatial data in Python. It extends the built-in pandas data types with a new data structure called the GeoDataFrame. GeoPandas wraps the foundational Python packages Shapely and Fiona, both great packages created by Sean Gillies.
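
A minimal sketch (the shapefile and column are placeholders; EPSG:5070 is an equal-area projection for the contiguous US):

import geopandas

counties = geopandas.read_file("counties_wgs84.shp")          # polygons plus attribute table
counties["area_km2"] = counties.to_crs(epsg=5070).area / 1e6  # reproject to equal-area, then measure
print(counties.sort_values("area_km2", ascending=False).head())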

Shapely

Moving down the stack from GeoPandas, Shapely wraps GEOS and defines the actual geometry objects (points, lines, polygons) and the spatial relationships between them (e.g. adjacency, within, contains). You can use Shapely directly without GeoPandas, but in a dataframe-centric world, Shapely is less of a direct tool and more a dependency for higher-level packages.
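
A tiny example of the kinds of objects and predicates Shapely provides:

from shapely.geometry import Point, Polygon

square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
pt = Point(0.5, 0.5)

print(square.contains(pt))        # True: a spatial predicate
print(square.buffer(0.1).area)    # a geometric operation: area of the buffered polygon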

Fiona

Fiona can read and write many kinds of geospatial vector data and easily integrates with other Python GIS libraries. It relies on OGR (the vector side of GDAL) for reading shapefiles, GeoPackages, GeoJSON, TopoJSON, KML, and GML from both the local filesystem and cloud services like Amazon S3 (with help from Python’s boto3 library).
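
A minimal sketch (placeholder shapefile):

import fiona

# Fiona yields each record as a simple feature with geometry and properties
with fiona.open("bg_wgs84.shp") as src:
    print(src.driver, src.crs, len(src))
    for feature in src:
        print(feature["properties"])
        break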

Spatialpandas

Spatialpandas supports Pandas and Dask extensions for vector-based spatial and geometric operations. It is a good tool for working with vectorized geometric algorithms using Numba or Python. The library was first used for polygon rasterization with Datashader and has since become its own standalone project.

PySAL

PySAL is a good tool for developing high-level applications for spatial regression, spatial econometrics, statistical modeling on spatial networks, and spatio-temporal analysis, as well as for detecting hot spots, clusters, and outliers.

It consists of four packages of modules that focus on different aspects of spatial analysis:

  1. Lib for computational geometry problems
  2. Explore for exploratory analysis of spatial and spatio-temporal data
  3. Model for confirmatory analysis through estimations of relationships
  4. Viz to create geovisualisations and link to other visualization tool-kits in the Python ecosystem

PySAL came about through a collaboration between Sergio Rey and Luc Anselin and is available through Anaconda.
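
As a taste of the explore layer, here is a sketch of a global Moran’s I test for spatial autocorrelation (the shapefile and attribute column are placeholders):

from libpysal.weights import Queen
from esda.moran import Moran
import geopandas

tracts = geopandas.read_file("bg_wgs84.shp")      # hypothetical polygon layer
w = Queen.from_dataframe(tracts)                  # contiguity-based spatial weights
mi = Moran(tracts["pct_over_65"], w)              # hypothetical attribute column
print(mi.I, mi.p_sim)                             # statistic and pseudo p-value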

Raster Data Libraries

Rasters are regularly gridded datasets like GeoTIFFs, JPGs, and PNGs. Regular grids are useful in representing continuous phenomena that are not cleanly represented by points, lines, and polygons. For instance, in analyzing weekly rainfall for Seattle, we would first start with weather station rainfall measurements (points), and interpolate values to create a raster (continuous-surface) to represent rainfall over the entire city.

GDAL

GDAL is the Geospatial Data Abstraction Library which contains input, output, and analysis functions for over 200 geospatial data formats. It supports APIs for all popular programming languages and includes a CLI (command line interface) for quick raster processing tasks (resampling, type conversion, etc.).
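
A small sketch of the Python bindings (the GeoTIFF path and target resolution are placeholders; the resolution is in the units of the raster’s CRS):

from osgeo import gdal

ds = gdal.Open("elevation.tif")                       # hypothetical GeoTIFF
print(ds.RasterXSize, ds.RasterYSize, ds.GetProjection())

# Quick resample to a coarser grid straight from Python (the CLI equivalent is gdalwarp)
gdal.Warp("elevation_coarse.tif", ds, xRes=1000, yRes=1000)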

Rasterio

Rasterio, another creation from the prolific Sean Gillies, is a wrapper around GDAL for use within the Python scientific data stack and integrates well with Xarray and NumPy. It can read, write, organize, and store several raster formats like Cloud Optimized GeoTIFFs (COGs). A key goal is to provide high performance and reduced cognitive load for Python developers by using a familiar syntax.
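
A minimal sketch (placeholder GeoTIFF):

import rasterio

with rasterio.open("elevation.tif") as src:
    band = src.read(1)                                # first band as a NumPy array
    print(src.crs, src.res, src.nodata, band.shape)   # georeferencing metadata travels with the file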

Datashader

Datashader is a general-purpose rasterization pipeline.  We’ve mentioned the difference between vector and raster. What if you want to convert from a vector type to a raster type?  This is where Datashader comes in: it allows you to intelligently grid your data. Datashader makes it easy to build graphics pipelines with a little bit of code, and its principled, stepwise approach removes much of the trial and error from visualizing large datasets while still producing meaningful visualizations. As a side note, the makepath team includes core developers on Datashader.
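
A minimal sketch of the points-to-raster pipeline on synthetic data:

import numpy as np
import pandas as pd
import datashader
import datashader.transfer_functions as tf

# A million synthetic points rasterized onto an 800 x 400 grid in one pass
df = pd.DataFrame({"x": np.random.standard_normal(1_000_000),
                   "y": np.random.standard_normal(1_000_000)})
canvas = datashader.Canvas(plot_width=800, plot_height=400)
agg = canvas.points(df, "x", "y")      # per-pixel counts as an xarray DataArray
img = tf.shade(agg, how="log")         # map counts to colors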

Xarray-Spatial

Xarray-Spatial implements common raster analysis functions using Numba and provides a codebase that is easy to install and extend. It originated from the Datashader project and includes tools for surface analysis (e.g. slope, curvature, hillshade, viewshed), proximity analysis (e.g. euclidean distance, great circle distance), and zonal / focal analysis (summary statistics by region or neighborhood). It is not dependent on GDAL or GEOS and was created to support core raster analysis functions that GIS developers and analysts need. 

Xarray-Spatial was pioneered by Brendan Collins, one of the founders of makepath.
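
A minimal sketch on a toy elevation surface (random values standing in for a real DEM):

import numpy as np
import xarray as xr
from xrspatial import hillshade, slope

# Xarray-Spatial functions operate on 2D xarray DataArrays
elevation = xr.DataArray(np.random.rand(200, 300),
                         dims=["y", "x"],
                         coords={"y": np.arange(200), "x": np.arange(300)})
shaded = hillshade(elevation)
steepness = slope(elevation)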

Other Data Libraries

RTree (spatial indexing)

RTree wraps the C library libspatialindex for building and querying large indexes of rectangles.  Most of the time these rectangles represent the bounding boxes of polygons, which makes the RTree library essential for fast point-in-polygon operations.  One downside of this library is that the underlying C/C++ code is not thread-safe, which can cause problems when trying to access the same index from different threads or processes, but it is still a very useful tool, and GeoPandas wraps it as well.
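
A minimal sketch of building and querying an index:

from rtree import index

idx = index.Index()
idx.insert(0, (0.0, 0.0, 1.0, 1.0))     # bounding box of feature 0
idx.insert(1, (2.0, 2.0, 3.0, 3.0))     # bounding box of feature 1

# Which bounding boxes intersect this query window? Returns candidate ids for exact geometry tests.
print(list(idx.intersection((0.5, 0.5, 2.5, 2.5))))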

H3

Uber developed H3, a hexagonal grid indexing system, for more targeted exploration and visualization of its spatial data. H3 indexes space with hexagons, which account for the movement of data points better and minimize quantization error compared with other shapes, say a square.

Hexagons are also a good choice for quick and easy radii approximations. The hierarchical approach used allows you to truncate the precision/resolution of an index without losing the original indexes.

H3 was written in C, and there is also a Python binding, to “hexagonify your world”.
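
A minimal sketch (using v3 of the Python bindings; newer releases renamed some functions):

import h3

cell = h3.geo_to_h3(30.2672, -97.7431, 9)     # lat, lng, resolution -> hexagon index
neighbors = h3.k_ring(cell, 1)                # the cell plus its immediate ring of neighbors
print(cell, len(neighbors))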

NetworkX

NetworkX comes into play for analysis of graphs and complex networks. A nice plus is the flexibility to work with a variety of data types from text and images to XML records as well as large volumes of data, up to tens of millions of nodes and edges. 
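
A tiny sketch of a weighted shortest-path query (toy edges standing in for road segments):

import networkx as nx

G = nx.Graph()
G.add_edge("A", "B", weight=4.2)     # e.g. road segment length in km
G.add_edge("B", "C", weight=1.5)
G.add_edge("A", "C", weight=9.0)

print(nx.shortest_path(G, "A", "C", weight="weight"))   # ['A', 'B', 'C']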

PyProj

PyProj is useful for map projections, which define how we distort the 3D world when converting it to a 2D map. PyProj wraps the PROJ library (formerly Proj4) and performs cartographic transformations between coordinate reference systems like WGS84 (longitude / latitude) and UTM (meters east / meters north).  Map projections can be difficult to understand, and PyProj does a great job of handling the details for you.
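
A minimal sketch of a longitude/latitude to UTM transformation (the coordinates are for Austin, TX):

from pyproj import Transformer

# WGS84 lon/lat -> UTM zone 14N (meters); always_xy keeps (lon, lat) ordering
transformer = Transformer.from_crs("EPSG:4326", "EPSG:32614", always_xy=True)
easting, northing = transformer.transform(-97.7431, 30.2672)
print(easting, northing)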

UI/Notebook Integration

folium

folium runs with the principle of “two is better than one” by merging the benefits of Python (strong data analytics capabilities) and JavaScript (mapping powerhouse). So you can play with your data in Python and then play out your resulting visualizations with an interactive Leaflet map (shout out to Vladimir Agafonkin) via folium.  

It supports GeoJSON, TopoJSON, image and video overlays.
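
A minimal sketch (coordinates are for Austin, TX):

import folium

m = folium.Map(location=[30.2672, -97.7431], zoom_start=12)      # lat, lon
folium.Marker([30.2672, -97.7431], popup="Austin, TX").add_to(m)
m.save("austin_map.html")                                        # interactive Leaflet map in a browser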


There are many tools at our disposal to do geospatial data analysis and visualizations. Although we just highlighted some tools in the Python stack, geospatial analysis is not limited to Python. 

Python’s motto is “Programming for Everybody” and this certainly holds true for the geo community. Python is the first language for many aspiring data scientists and we hope this list will help you on your geo-journey.

We would love to hear from you! 

Do you have any questions, suggestions, or Python/non-Python stacks you love doing your spatial analysis with? 

Share your ideas in the comments.

Do you have questions about how to implement these free tools?
Comment below or you can reach us at contact@makepath.com.

COVID-19 Public-Private Partnership: The Center for Environmental Medicine and Informatics (CEMI) at SUNY College of Environmental Science and Forestry (SUNY ESF) and makepath Working with Safegraph to Understand Virus Mortality

We are excited to announce the partnership between makepath, Safegraph and the Center for Environmental Medicine and Informatics (CEMI) at SUNY College of Environmental Science and Forestry (SUNY ESF) to study the impact of pollution on COVID-19 mortality.

We are starting with a study published by the Harvard T.H. Chan School of Public Health that linked air pollution to higher COVID-19 death rates.

We are building on this study by taking the study’s R-based data and analysis and converting them to Python. 

The teams are then collaborating to add social distancing data from Safegraph to act as an additional control to augment the existing controls of smoking and body mass index (BMI). The goal is to deepen the understanding of how pollution affects COVID-19 outcomes.

We will be also studying additional factors such as race/ethnicity, income and pre-existing chronic health conditions. 

The augmented analysis will help policymakers make better informed decisions as to how to allocate resources to help prevent deaths from the virus.

The team from CEMI-ESF is led by Professor Mary Collins. Mary and her team are experts on using big data to understand how societies interact with the environment. She is particularly interested in equity and justice issues in the context of human health.

We are grateful for the deep expertise in statistical analysis that Mary and her team bring to the project.

The team at makepath will lend its expertise in data visualization and analysis scaling to the project.

We are also expanding our open source spatial analysis library Xarray-Spatial to include analysis tools pertinent to health demographics including hotspot analysis and common choropleth map classifications (quantile, equal-interval, natural breaks).

We look forward to sharing the results of this collaboration as soon as they are available.

For more information about this and other makepath projects, please email us at contact@makepath.com

Nationwide Parcel Data: From Cold, Metal Chains, to the Spatial Foundation of American Society, to your Database.

Note about the author of this guest post: Jerry Paffendorf is co-founder and CEO of Loveland Technologies, makers of landgrid.com. Landgrid and makepath partner together on special projects.

“In the mid-19th century, when the cold tongue of land that is the Michigan peninsula was first being sliced up for development, the surveyors began to discover problems with their measurements, particularly during the winter. The lengths of metal chain they doggedly carried and laid out like giant rulers across the forests and swamps would shrink when the temperature dropped below zero.

“The resulting inconsistencies would only add up to a few inches a day, but over the vast distances of midwestern America the shrinking chains threatened to cause future disputes between landowners. Until a conscientious surveyor called William Burt came up with a solution: every frosty morning, he built a fire and warmed up his chain until it expanded back to exactly its original length.

“Such diligence, respect for figures, and slightly bloody-minded defiance of the elements is a very American combination. So to try to understand the country by describing how it was first surveyed and divided up, as this book does, is likely to be a fruitful enterprise.”

From a book review of Andro Linklater’s book, Measuring America, published by The Guardian

My company specializes in providing nationwide parcel data: the legal boundaries of properties along with addresses and information like ownership, land use, occupancy, buildings, and other data points that can be attached to parcels. It’s been a consuming pursuit as property boundaries underlie everything and form a natural ice cube tray for other data.

Parcels look like this, seen here with building footprints overlaid:

We are based in Detroit, Michigan and originally got into parcel data to help address challenges in the city. At its peak population in 1950, Detroit had about 2 million people and was America’s fourth largest city. Today it has about one-third of that, making its ratio of people to properties much different than the mid-20th century, with all of the attendant stresses to the tax base, maintenance, services, and occupancy you can imagine.

During Detroit’s bankruptcy in 2013 we were hired to assess the current land use, occupancy, and conditions of every single parcel of land in the city for a project called Motor City Mapping. The interactive map, photos, and dataset are archived at motorcitymapping.org.

For that project, 200 Detroiters used our software and mobile app to photograph and describe each property. In the process they identified more than 50,000 vacant buildings and tens of thousands of occupied homes that were at imminent risk of tax foreclosure, among other pressing challenges and opportunities in the landscape.

The data was combined with other datasets and kicked off a wave of innovative data and mapping projects in the city, and provided some much needed insight into the landgrid.

We wanted to be able to do something like that anywhere in the country, which meant we needed nationwide parcel data. Not having any other way to attain it, we set about collecting it from every single county ourselves.

That really sent us down the rabbit hole and got me reading about the history of how and why these parcels came to be in the first place. If you’re looking to read some fascinating history that you may not know much about — I certainly didn’t — do yourself a favor and google the US Public Land Survey or pick up Andro Linklater’s book, Measuring America: How the United States Was Shaped By the Greatest Land Sale in History.


Long story short, at Thomas Jefferson’s urging, and to create a spatial framework for a new nation of citizen farmers, starting in 1790 most of America outside the South and the original colonies was measured out and subdivided into square-mile sections by people dragging metal chains through the woods, across rivers, over mountains, you name it. Every six-by-six block of these sections was called a township. Townships snapped into counties, and counties snapped into states. The land was typically auctioned and then further subdivided over the years, decades, and centuries into the residential, commercial, industrial, agricultural, recreational, and wild parcels we know today.

You can stare at maps and see the straight lines of many states and counties, but it’s easy to overlook that they represent a nested fractal leading down to parcels, which are the atomic unit of owned and managed space in society. Within that landgrid are so many accidents and arbitrary happenings that it makes you wonder how we might one day redraw it or return parts of it to nature. (PS if you like pictures like the one from Wikipedia above, check out the Instagram account, thejeffersongrid, which focuses on big square parcels.)

Our dataset currently consists of 144+ million parcels covering 95% of US residents. You can see a coverage map here, and you can see details about the data by clicking through to any county. Sometimes when the work is hard I think about that person warming their chain in a fire before dragging it through the woods another day, and it feels a little less hard to wrangle digital files with a LaCroix next to my keyboard.

We make the parcel data available for other people to use in their own research, apps, projects, and databases. Every day I wake up with a kaleidoscope of customers and partners and curious people in my inbox who span real estate, energy, insurance, agriculture, forestry, marketing, transportation, outdoor recreation, government, planning, and other industries that touch property, land, housing, and spatial analysis.

Sometimes people want to use the data for geocoding other datasets to a map. Sometimes they need to know who owns things. Sometimes they need to tell open land apart from land with buildings, or they need to identify occupied or vacant properties. Sometimes they need to do door-to-door outreach. Sometimes they use the data for business and sometimes for the joy of discovery.

Moving into the future, we’re really excited about the opportunities for combining parcel data with Machine Learning and aerial imagery. With the parcel boundaries as the picture frames, there are many new data fields and insights that will come from training software to identify the features within a parcel and turn that into structured data to give even greater insight into the grid and how we inhabit it. 

All of this is what makes the work we do so exciting, and it’s why we value partnerships with data scientists like makepath who can take a massive, fundamental dataset like this and make new knowledge from it. 

If we can assist you with parcel data, or if you just want to rap about how crazy the history is and what the future of parcels could look like, please reach out to me at jerry@landgrid.com. And please be safe in these unprecedented times!

The Dask Developer Workshop

Last week makepath attended the first-annual Dask Developer Workshop at Capital One Labs in Arlington, VA. This unique workshop brought together 50+ Dask/Python experts to review the state of Dask.

TL;DR:

  • This was the first annual Dask Developer Workshop
  • For those unfamiliar with the project, Dask provides advanced parallelism for scaling Python to multithread, multicore, and multi-machine (cluster) scenarios.  
  • Dask is analogous to Apache Spark, but written in Python instead of Scala/Java.

Dask unites engineers and researchers of various disciplines in the pursuit of scalable analytics.  It was inspiring to see climate scientists working alongside quantitative finance experts, seismologists brainstorming with supply chain managers, and astronomers helping civil engineers crunch data.  The multidisciplinary nature of the Dask community reinforces Python’s most important motto, “Programming for Everybody.”

New and Popular libraries discussed included:

One of the hot topics of discussion was the desire for heterogeneous Dask workers.  At the moment, Dask only supports a single worker type. By worker type, we mean the profile of a worker machine (number of CPUs/GPUs, RAM, threads, etc.). There are many use cases which could benefit from a cluster composed of various types of workers.  This is one area which I’m sure we’ll see development on in the coming year.

Jim Crist-Harif gave a great talk on dask-gateway, a project anybody using Dask should check out.  Dask-gateway provides Dask Clusters as a Service. 

Some important features of Dask-gateway include:

  • Deployment scripts versioned as part of your code.
  • No need for extra infrastructure, plays well with many deployment backends (YARN, Kubernetes, Slurm)
  • Just python libraries
  • Extensible design. 
  • Provides a REST api for managing clusters
  • SSL / TLS support for scheduler traffic
  • User resource limits
  • Automatic shutdown of idle clusters
  • Strong interoperability with JupyterHub
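
To give a feel for the workflow, here is a minimal connection sketch (the gateway address is a placeholder; see the dask-gateway docs for authentication and configuration details):

from dask_gateway import Gateway

gateway = Gateway("https://dask-gateway.example.com")   # hypothetical gateway address
cluster = gateway.new_cluster()                         # request a new Dask cluster
cluster.scale(4)                                        # ask for four workers
client = cluster.get_client()                           # a standard dask.distributed client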

The Dask Developer Workshop also provided an opportunity for members of the Pangeo Group to meet in person. Makepath is a proud contributor to the Pangeo project. Pangeo is a consortium of engineers and climate scientists focused on using Open Source tools to better understand threats posed by climate change and the challenges of climate change adaptation.

Capital One Labs provided an amazing workspace for the workshop.  Anaconda and Nvidia sponsored our happy hours, which provided some downtime for community members to interact and bond. 

Special thanks to Matthew Rocklin, creator of Dask, founder of Coiled Computing, breaker of chains, father of Pythons, for organizing and leading such a special event.

Want to learn more about Dask? The Dask Examples GitHub repo is a great place to start!

Introducing 🌍 Xarray-Spatial: Raster-Based Spatial Analysis in Python

At makepath, we are committed to the open source community. We are proud to announce the release of Xarray-Spatial, a spatial analysis Python library pioneered by one of our founders, Brendan Collins.

TL;DR? Xarray-Spatial is:

📍 A Fast, Accurate Python library for Raster Operations

⚡️ Extensible with Numba

⏩ Scalable with Dask

🎊 Free of GDAL / GEOS Dependencies

🌍 Designed for General-Purpose Spatial Processing, and is Geared Towards GIS Professionals

Xarray-Spatial implements common raster analysis functions using Numba and provides an easy-to-install, easy-to-extend codebase for raster analysis.

Why this matters

We are building tools to better understand our world and deal with our ever-increasing challenges.

Xarray-Spatial provides free open source spatial analytics for GIS applications.

We think that the best tools to understand our world are open to everyone, and are refined by the collective wisdom of people who care.

We invite you to become a contributor to this project!

Origins

Xarray-Spatial grew out of the Datashader project, which provides fast rasterization of vector data (points, lines, polygons, meshes, and rasters) for use with Xarray-Spatial.

Xarray-Spatial does not depend on GDAL / GEOS, which makes it fully extensible in Python but does limit the breadth of operations that can be covered. Xarray-Spatial is meant to include the core raster-analysis functions needed for GIS developers / analysts, implemented independently of the non-Python geostack.

Raster-huh?

In the GIS world, rasters are used for representing continuous phenomena (e.g. elevation, rainfall, distance), either directly as numerical values, or as RGB images created for humans to view. Rasters typically have two spatial dimensions, but may have any number of other dimensions (time, type of measurement, etc.)
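
As a tiny illustration (toy data, not a real dataset), an xarray DataArray holding a time dimension alongside the two spatial dimensions might look like:

import numpy as np
import xarray as xr

# A toy raster stack: monthly rainfall on a 100 x 200 grid (time plus two spatial dimensions)
rainfall = xr.DataArray(np.random.rand(12, 100, 200), dims=["month", "y", "x"])
january = rainfall.isel(month=0)      # a single 2D spatial slice
print(january.shape)                  # (100, 200)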

Supported Spatial Functions

Xarray-Spatial and GDAL

Within the Python ecosystem, many geospatial libraries interface with the GDAL C++ library for raster and vector input, output, and analysis (e.g. rasterio, rasterstats, geopandas). GDAL is robust, performant, and has decades of great work behind it. For years, off-loading expensive computations to the C/C++ level in this way has been a key performance strategy for Python libraries (obviously…Python itself is implemented in C!).

However, wrapping GDAL has a few drawbacks for Python developers and data scientists:

  • GDAL can be a pain to build / install.
  • GDAL is hard for Python developers/analysts to extend, because it requires understanding multiple languages.
  • GDAL’s data structures are defined at the C/C++ level, which constrains how they can be accessed from Python.

With the introduction of projects like Numba, Python gained new ways to provide high-performance code directly in Python, without depending on or being constrained by separate C/C++ extensions. Xarray-Spatial implements algorithms using Numba and Dask, making all of its source code available as pure Python without any “black box” barriers that obscure what is going on and prevent full optimization. Projects can make use of the functionality provided by Xarray-Spatial where available, while still using GDAL where required for other tasks.

Where does this library fit into the ecosystem?

Interested in using xarray-spatial?

You can find the library on github here.

Acknowledgements

Thank you to Jim Bednar at Anaconda for reading a draft of this and contributing to the explanation of rasters. Thank you to the whole Datashader team. Some of the work that led to this library was done through a NASA SBIR grant; this would not have been possible without that support. Thank you to all present and future contributors to the library. We look forward to continuing to collaborate!

Questions? Thoughts? Ideas? Let us know in the comments, or email us at contact@makepath.com.