The census-parquet tool downloads and processes the raw data from the US Census Bureau to create a set of files in parquet format. The resulting parquet files include both the redistricting demographic data at the census block level and the census boundary shapefiles for a set of key boundaries.
With this tool, you can create and access your own versions of the Census 2020 redistricting data with population stats in a format that is optimized for large data volumes and works well with big data systems such as Dask or Spark.
Four Reasons to Use Parquet File Format
1. Improved load time of files as the data is stored in a columnar format
2. Better compression (for example, with Snappy) and faster queries
3. It is a binary format with a typed schema, so each column is stored with a defined data type rather than as text
4. Each file is partitionable so that data analysis can run on multiple computers
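To make these points concrete, here is a toy sketch in plain Python. It does not read or write Parquet and is not how Parquet is implemented; it only illustrates the ideas of typed columns, reading a single column, compressing repetitive values, and splitting data into partitions. The sample records and the helper names (total_population, partition_by, compressed_size) are made up for illustration.

```python
import zlib
from array import array

# Made-up block-level records: (state FIPS code, county FIPS code, population)
rows = [(6, 37, 120), (6, 37, 95), (6, 59, 210), (48, 201, 80), (48, 201, 143)]

# Column-oriented layout: one typed, fixed-width array per column.
columns = {
    "state": array("B", (r[0] for r in rows)),       # unsigned char, 1 byte per value
    "county": array("H", (r[1] for r in rows)),      # unsigned short
    "population": array("I", (r[2] for r in rows)),  # unsigned int
}


def total_population(cols):
    """Aggregate one column without touching the others."""
    return sum(cols["population"])


def partition_by(cols, key):
    """Split the columns into one group of columns per distinct value of `key`."""
    parts = {}
    for i, k in enumerate(cols[key]):
        if k not in parts:
            parts[k] = {name: array(col.typecode) for name, col in cols.items()}
        for name, col in cols.items():
            parts[k][name].append(col[i])
    return parts


def compressed_size(col):
    """Size in bytes of a column after zlib compression."""
    return len(zlib.compress(col.tobytes()))


by_state = partition_by(columns, "state")
print(total_population(columns))
print(sorted(by_state))

# A column with long runs of repeated values compresses well.
repetitive = array("B", [6] * 1000 + [48] * 1000)
print(len(repetitive.tobytes()), compressed_size(repetitive))
```

Each partition could then be handed to a different worker, which is the idea behind running an analysis on multiple computers.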
How Census-Parquet Works
The tool is freely available for download from PyPI or from GitHub. It accomplishes two main things:
1. It processes a series of 20 different census boundaries ranging from a national-level shapefile down to the census block group level. It then outputs this data in parquet file format, with optimized data types set for each column.
2. It adds population totals to census block shapefiles and partitions the census block files by state and county. It then outputs the data in parquet file format, with optimized data types for each column.
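The second step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the tool's actual code: it relies on the fact that a 15-digit census block GEOID starts with the 2-digit state FIPS code followed by the 3-digit county FIPS code, and the record layout, sample values, and function names are hypothetical.

```python
from collections import defaultdict


def split_geoid(geoid):
    """Return (state FIPS, county FIPS) from a 15-digit census block GEOID."""
    if len(geoid) != 15 or not geoid.isdigit():
        raise ValueError(f"not a 15-digit block GEOID: {geoid!r}")
    return geoid[:2], geoid[2:5]


def attach_population(blocks, population):
    """Add a population total to each block record, matching on GEOID.

    Blocks with no matching total get None rather than a guessed value.
    """
    return [{**b, "population": population.get(b["geoid"])} for b in blocks]


def partition_by_state_county(blocks):
    """Group block records by their (state, county) FIPS codes."""
    parts = defaultdict(list)
    for b in blocks:
        parts[split_geoid(b["geoid"])].append(b)
    return dict(parts)


# Made-up sample data; the geometry values are placeholders.
blocks = [
    {"geoid": "060371011101000", "geometry": "<polygon>"},
    {"geoid": "060590626101001", "geometry": "<polygon>"},
    {"geoid": "482011000001002", "geometry": "<polygon>"},
]
population = {"060371011101000": 120, "060590626101001": 210}

parts = partition_by_state_county(attach_population(blocks, population))
for (state, county), records in sorted(parts.items()):
    print(state, county, [r["population"] for r in records])
```

In a real pipeline, each (state, county) group would be written to its own Parquet file or directory, so that a reader can load a single county without scanning the whole country.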
Behind the scenes, the one-command census-parquet tool runs five specialized scripts: three shell scripts that download the raw data and two Python scripts that process the data.
With just one command, you will have access to your own local version of the formatted census data, ready for your own analysis.
Now It’s Your Turn
Are you interested in processing your own Census data?
Do you have any suggestions, questions or comments about the tool makepath created?