Recently for a project, I was able to play around with some of the cool geopy packages out there and was thinking instead of just shelving the code after the project, that I’d share out some geo-snippets. Well…this post is the start of that.
So for the first share out, I want to run through performing spatial queries in python and the benefits of using a spatial index and I wanted to do that via two examples.
I guess to start, we should setup a virtual environment and install the required geo-packages; shapely, fiona and rtree.
virtualenv env
. env/bin/activate
pip install shapely fiona rtree
If you’re having issues installing any of these packages here’s links to their respective install docs.
- shapely: https://github.com/Toblerity/Shapely#installing-shapely
- fiona: https://github.com/Toblerity/Fiona#installation
- rtree: http://toblerity.org/rtree/install.html
Alright now that everything is setup, let’s run through the examples to answer the question of
“how many addresses are within each garbage zone”.
The first example, uses a “brute force” method for finding our answer and the second we’ll be a little smarter about it and take advantage of a spatial index. But before we start, let me quickly introduce our two volunteer datasets…
AddressPoint Shapefile source:
* Geometry Type: Point
* Number of features: 34,084
* Spatial Reference System: [UTM 10][5]
GIS_GarbageZone source:
* Geometry Type: Polygon
* Number of features: 8
* Spatial Reference System: [UTM 10][5]
Methods
1: Brute Force Spatial Query
Honestly this example is only here to show how important it is to use a spatial index when performing spatial queries. Simply put, this method will iterate over every garbage zone and every address point to see which address points are contained in each garbage zone polygon.
def query_no_index(layer1, layer2, group_field):
result = defaultdict(list)
with fiona.open(layer2) as datasrc2:
with fiona.open(layer1) as datasrc1:
for ft1 in datasrc1:
if ft1['geometry'] is None:
continue
shp1 = shape(ft1['geometry'])
for ft2 in datasrc2:
shp2 = shape(ft2['geometry'])
if shp1.contains(shp2):
result[ft1['properties'][group_field]].append(int(ft2['id']))
return result
2: Index Based Spatial Query
Similar to method 1, but we’ll create and use a RTree index. This index allows us to run an initial “simplified and fast” check to determine, in our example, which addresses are near a particular garbage zone. Since the result of the initial check is based on bounding boxes of the address and garbage zone geometries, we’ll still need to do the contains
check on the actual geometries but now the number of checks that we potentially need run is vastly less.
# Create in-memory or file-based r-tree index
def generate_index(records, index_path=None):
prop = rtree.index.Property()
if index_path is not None:
prop.storage = rtree.index.RT_Disk
prop.overwrite = index_path
sp_index = rtree.index.Index(index_path, properties=prop)
for n, ft in enumerate(records):
if ft['geometry'] is not None:
sp_index.insert(n, shape(ft['geometry']).bounds)
return sp_index
def query_with_index(layer1, layer2, group_field):
result = defaultdict(list)
with fiona.open(layer2) as datasrc2:
fname, _ = os.path.splitext(layer2)
# Load index from file
if os.path.exists(fname + '.dat') and os.path.exists(fname + '.idx'):
layer2_index = rtree.index.Index(fname)
# Need to create a new index.
else:
layer2_index = generate_index(datasrc2, fname)
with fiona.open(layer1) as datasrc1:
for ft1 in datasrc1:
if ft1['geometry'] is None:
continue
shp1 = shape(ft1['geometry'])
# --> Do an initial bounds query with the spatial index <--
# Returns back a generator of addresses ids whose bounds intersects
# the garbage zone bounds
for fid in layer2_index.intersection(shp1.bounds):
# Now check if the address point is contained by
# the garbage zone
shp2 = shape(datasrc2[fid]['geometry'])
if shp1.contains(shp2):
result[ft1['properties'][group_field]].append(fid)
return result
Tests
s1 = 'GIS_GarbageZone.shp'
s2 = 'AddressPoint.shp'
def print_results(r):
for k in sorted(r.iterkeys()):
print '%s: %s' % (k, len(r[k]))
# No index used
print('# 1. No index')
print_results(query_no_index(s1, s2, 'NAME'))
# Index not generated
print('# 2a. Index: Generate')
print_results(query_with_index(s1, s2, 'NAME'))
# Index loaded from file
print('# 2b. Index: From file')
print_results(query_with_index(s1, s2, 'NAME'))
Output
# 1. No index
ZONE 1: 5968
ZONE 2: 5345
ZONE 3: 7537
ZONE 4: 8615
ZONE 5: 6616
# 2a. Index: Generate
ZONE 1: 5968
ZONE 2: 5345
ZONE 3: 7537
ZONE 4: 8615
ZONE 5: 6616
# 2b. Index: From file
ZONE 1: 5968
ZONE 2: 5345
ZONE 3: 7537
ZONE 4: 8615
ZONE 5: 6616
Results
Well the outputs are consistent which is a good sign. As far as performance goes…
- Method 1: 40.26 seconds
- Method 2a: 30.13 seconds (~25% faster than Method 1)
- Method 2b: 24.38 seconds (~39% faster than Method 1 and 19% faster than Method 2a)
Comparing number of address checks using no index (brute force) v.s. with an index
Garbage Zone | No Index | With Index | ↓ in Checks |
---|---|---|---|
Zone 1 | 34,084 | 7,494 | 78% |
Zone 2 | 34,084 | 6,811 | 80% |
Zone 3 | 34,084 | 14,461 | 58% |
Zone 4 | 34,084 | 13,418 | 61% |
Zone 5 | 34,084 | 7,620 | 78% |
Additional
For clarity the example above had been simplified, the code below is the generic version for doing spatial queries and includes options for contains, crosses, disjoint, equals, intersects, touches and within.
file: spatialquery.py