Scaling Geospatial Queries: Dragonfly’s Inverted GEO Index

Introduction

Previously, we introduced Dragonfly Search’s inverted indexes for quickly finding documents by keyword, vector, or range fields. We also explored geospatial indexes and how GEO commands can efficiently query geographic data. In this post, we examine how Dragonfly Search is extended to natively support the indexing and querying of geographical data.

What is an R-Tree?

When dealing with geospatial or multi-dimensional data, efficient indexing is crucial for performing fast queries. Traditional data structures like arrays, hash maps, and binary trees are typically not well-suited for geospatial data, particularly when you need to perform range queries (e.g., finding all points within a specific radius) or nearest-neighbor searches. These structures lack the necessary support for efficiently pruning large portions of search space.

This is where spatial data structures like the R-Tree come into play. Designed specifically for multi-dimensional data, R-Trees organize data in a hierarchical manner, optimizing queries such as range searches and k-nearest-neighbor searches in high-dimensional spaces.

B+ Tree is a widely used data structure used for indexing. Both B+ Trees and R-Trees are balanced tree structures, but they are optimized for different types of data and queries.

B+ Tree: internal nodes store individual keys and the leaf nodes store values. It is great for ordered data but doesn’t handle multi-dimensional data well.
R-Tree: Instead of individual values, nodes store bounding rectangles (or other shapes) that enclose sets of data. These rectangles represent the spatial extent of the child objects, allowing for efficient spatial queries.

Example of an R-tree for 2D Rectangles | Wikipedia

In geospatial queries, this ability to prune search space is crucial. For example, when you query for all points within a certain distance from a given location, an R-Tree quickly eliminates large areas of data that fall outside the query radius, resulting in faster response times.

R-Tree is a versatile data structure, but its performance can vary significantly depending on the splitting strategy used to build the tree. The splitting strategy determines how data is partitioned across the nodes of the tree, which impacts both query speed and insertion/deletion performance.

Here are two of the most commonly used R-Tree splitting strategies:

Linear Split is one of the simplest and most widely used methods. When nodes in the tree become full, the linear split algorithm divides the set of data points into two groups along the longest dimension, minimizing the overlap between bounding boxes as much as possible.
Quadratic Split is more complex than the linear approach. It tries to minimize the total area of the bounding boxes by choosing splits that minimize the area overlap between them. The algorithm considers all pairs of objects and selects the one that minimizes overlap.

For more details about the R-Tree data structure, we highly recommend you take a look at this comprehensive and easy-to-follow blog post by Bartosz Sypytkowski, R-Tree: Algorithm for Efficient Indexing of Spatial Data.

Implementing Geographic Index with Boost Geometry

For our geospatial search implementation, we decided to use the Boost Geometry library. This choice was based on its maturity and optimization for geospatial queries. We represent our data as 2D points using geographic traits (latitude, longitude) and pair each point with a unique document ID. Our R-Tree is configured to use a linear split strategy with a maximum of 16 elements per node.

Here’s how we define our data structures in C++:

With this setup, basic operations like insertion and deletion are simple—just pass the relevant index entry to the API. However, implementing an efficient radius-based search was more challenging.

While Boost Geometry provides efficient APIs for spatial indexing, it doesn’t natively support radius search queries out of the box. To implement this functionality, we devised a custom approach:

Constructing a Temporary Polygon: We first do so around the query point by calculating 4 points at a distance equal to the radius in each of the cardinal directions.
Bounding Box for Search Space: We then create a minimum bounding box (MBB) around this temporary polygon, which defines our search area.
Haversine Distance Calculation: For each point within the minimum bounding box, we calculate the Haversine distance (which accounts for the curvature of the Earth) and check if the point lies within the specified radius.

Visualization of a Geospatial Search Query

In the illustration above, the query finds all points in a radius around Mountain View. The temporary polygon around the point has red edges, while the minimum bounding box has gray edges.

Geospatial Search in Action

Now that we’ve implemented geospatial indexing and radius-based search, let’s see it in action in Dragonfly. We start by adding some geospatial data to Dragonfly using the hash data type. Then, we create a search index on these records with TEXT and GEO types for the desired fields, as shown below:

dragonfly$> HSET cities:1 name "Mountain View" location "-122.08, 37.386"
#=> (integer) 2

dragonfly$> HSET cities:2 name "Palo Alto" location "-122.143, 37.444"
#=> (integer) 2

dragonfly$> HSET cities:3 name "San Jose" location "-121.886, 37.338"
#=> (integer) 2

dragonfly$> HSET cities:4 name "San Francisco" location "-122.419, 37.774"
#=> (integer) 2

dragonfly$> FT.CREATE cities_idx ON HASH PREFIX 1 cities: SCHEMA name TEXT location GEO
#=> OK

By convention, geographical data points are formatted as "longitude, latitude" strings. With everything set up, we can now query for all locations within a 20-mile radius of Mountain View.

dragonfly$> FT.SEARCH cities_idx "@location:[-122.08 37.386 20 mi]"
#=> 1) (integer) 3
#=> 2) "cities:1"
#=> 3) 1) "name"
#=>    2) "Mountain View"
#=>    3) "location"
#=>    4) "-122.08, 37.386"
#=> 4) "cities:2"
#=> 5) 1) "name"
#=>    2) "Palo Alto"
#=>    3) "location"
#=>    4) "-122.143, 37.444"
#=> 6) "cities:3"
#=> 7) 1) "name"
#=>    2) "San Jose"
#=>    3) "location"
#=>    4) "-121.886, 37.338"

This returns results that include locations such as Palo Alto and San Jose. You can also combine other search fields in a query, for example:

dragonfly$> FT.SEARCH cities_idx "@name:Francisco @location:[-122.08 37.386 50 mi]"
#=> 1) (integer) 1
#=> 2) "cities:4"
#=> 3) 1) "name"
#=>    2) "San Francisco"
#=>    3) "location"
#=>    4) "-122.419, 37.774"

This query instead only returns San Francisco, since it matches the name and is within the specified radius.

Performance Matters, As Always

We benchmarked Dragonfly’s geographical index against RediSearch^[1] using the geo-bench tool^[2]. The tests ran with datasets of 1M, 5M, and 10M documents, using various thread counts. The test configuration and results are shown below.

Client & Server Instances

Two AWS c8g.2xlarge instances, one for geo_bench and one for Redis/Dragonfly.
Redis and Dragonfly were tested separately.

Server Configurations

Using the configurations below, we tested with NUM_OF_THREADS with 4 and 8.

# Redis
$> ./redis-server --loadmodule /PATH/TO/RediSearch/bin/linux-aarch64-release/search-community/redisearch.so \
        --save "" --appendonly no \
        --io-threads NUM_OF_THREADS \
        --search-workers NUM_OF_THREADS

# Dragonfly
$> ./dragonfly --logtostderr --dbfilename= \
               --proactor_threads=NUM_OF_THREADS

Dragonfly vs. Redis | Geo Search Query Throughput on AWS c8g.2xlarge

Dragonfly vs. Redis | Geo Search Query Throughput with NUM_OF_THREADS=8

The detailed benchmark results^[3] can be found in the references section below, and they highlight several key advantages of Dragonfly for geospatial search:

Performance: Dragonfly consistently achieves higher requests per second than Redis across all tested dataset sizes (1M, 5M, and 10M documents). For example, with NUM_OF_THREADS=8 for both engines, Dragonfly processes 3-4x more requests per second.
Scalability: Dragonfly scales efficiently with an increasing number of threads, maintaining robust performance under heavier loads. In contrast, Redis exhibits diminishing returns past a certain concurrency point.
Geospatial Search: Dragonfly utilizes R-Tree indexing for fast and accurate radius-based queries. A planned future improvement will add the ability to return results sorted by distance.

Final Thoughts

While RediSearch remains a popular choice for geospatial data indexing, Dragonfly emerges as a more performant and scalable alternative in the context of high-volume geospatial search workloads. It leverages techniques like R-Trees and multi-threaded processing to offer more performant and more efficient results. As geospatial data volumes continue to increase, the speed and scalability advantages of Dragonfly make it the preferred solution for mission-critical applications.

References

[1] RediSearch is a Redis module that provides full-text search, secondary indexing, and querying. It is currently bundled with the Redis server by default.

[2] Commands for benchmarking using geo-bench.

Loading documents, where N is the number of documents that are going to be indexed:

$> ./geo-bench load -i documents.json --db redisearch-hash -n N

Running query on 10,000 points:

$> ./geo-bench query -i documents.json --db redisearch-hash -n 10000

[3] Detailed benchmark results.

Results with NUM_OF_THREADS=4

LOAD	1M	5M	10M
Redis (req/s)	76,922	81,966	81,300
Dragonfly (req/s)	76,922	83,332	83,333

QUERY	1M	5M	10M
Redis (req/s)	909	500	227
Dragonfly (req/s)	2,000	1111	500
Redis - LATENCY AVG	51.075ms	96.681ms	218.525ms
Dragonfly - LATENCY AVG	20.632ms	43.627ms	95.741ms
Redis - LATENCY P99	123.135ms	236.031ms	658.943ms
Dragonfly - LATENCY P99	97.599ms	202.751ms	557.055ms

Results with NUM_OF_THREADS=8

LOAD	1M	5M	10M
Redis (req/s)	76,919	80,644	81,301
Dragonfly (req/s)	76,917	83,332	82,644

QUERY	1M	5M	10M
Redis (req/s)	1000	526	238
Dragonfly (req/s)	3,332	2,000	909
Redis - LATENCY AVG	46.249ms	89.351ms	207.980ms
Dragonfly - LATENCY AVG	10.592ms	22.815ms	50.932ms
Redis - LATENCY P99	121.151ms	220.159ms	646.143ms
Dragonfly - LATENCY P99	45.215ms	104.255ms	280.063ms