Distributed Sharding and Scaling Quadtrees to Billions of Points
Sharding a quadtree across multiple machines requires partitioning by a coarse cell prefix. For example, using level 6 in an S2-like system yields 6 × 4⁶ = 24,576 distinct prefixes (six cube faces, each subdivided into 4⁶ cells). Group multiple prefixes per shard to balance load, aiming for roughly equal point counts per shard. A query fans out only to shards owning prefixes that intersect the query covering. For a 2 km radius query covered by 5 cells, you might touch 2 to 3 shards instead of all shards, bounding fan-out and keeping latency predictable.
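A minimal sketch of that routing step, under stated assumptions: cell IDs are shown as hex-like strings whose leading characters encode the coarse prefix (real S2 or H3 IDs use a different encoding), and the hash-based shard assignment stands in for a load-balanced assignment table.

```python
# Sketch: routing a radius query only to the shards whose coarse prefixes
# intersect the query covering. Cell IDs and the prefix encoding here are
# illustrative stand-ins for a real S2/H3 library.

SHARD_COUNT = 16
PREFIX_CHARS = 7  # characters of the cell ID assumed to encode the coarse (level-6) prefix

def prefix_of(cell_id: str) -> str:
    """Truncate a cell ID to its coarse prefix (assumed string encoding)."""
    return cell_id[:PREFIX_CHARS]

def shard_for_prefix(prefix: str) -> int:
    """Map a prefix to a shard; real systems use a load-balanced assignment table."""
    return hash(prefix) % SHARD_COUNT

def shards_for_query(covering: list[str]) -> set[int]:
    """Fan out only to shards that own a prefix intersecting the covering."""
    return {shard_for_prefix(prefix_of(cell)) for cell in covering}

# A 2 km radius query covered by 5 fine cells collapses to 3 coarse prefixes,
# so at most 3 of the 16 shards are queried.
covering = ["89c25a1c", "89c25a1d", "89c25a34", "89c25a3c", "89c25b08"]
print(shards_for_query(covering))
```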
Keep per-shard indexes entirely in memory for sub-10-millisecond index scan latencies. Persist updates to a write-ahead log (WAL) or commit log for durability, then apply them to the in-memory index asynchronously. Use a log-structured store like RocksDB or a distributed log like Kafka for the durable layer. On restart, replay the log to rebuild the in-memory index. This architecture is common in systems like Apache Pinot or custom geosearch services at Uber and Airbnb. Monitor hot prefixes (high query or update rates) and reassign or split them adaptively to avoid shard hotspots.
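A minimal sketch of the durability pattern, assuming a local append-only file as a stand-in for RocksDB or a Kafka topic; the `GeoShard` class and its flat dictionary index are illustrative only (a real shard would hold a quadtree and apply log entries asynchronously).

```python
# Sketch: write-ahead logging with an in-memory index. A local append-only
# file stands in for RocksDB or a Kafka topic; the flat dict stands in for
# the per-shard quadtree.

import json
import os

class GeoShard:
    def __init__(self, wal_path: str):
        self.wal_path = wal_path
        self.index: dict[str, tuple[float, float]] = {}  # entity_id -> (lat, lng)
        self._replay()

    def _replay(self) -> None:
        """On restart, rebuild the in-memory index by replaying the log."""
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as f:
            for line in f:
                rec = json.loads(line)
                self.index[rec["id"]] = (rec["lat"], rec["lng"])

    def update(self, entity_id: str, lat: float, lng: float) -> None:
        """Append to the WAL first for durability, then apply to the index
        (a production shard would apply asynchronously from the log)."""
        with open(self.wal_path, "a") as f:
            f.write(json.dumps({"id": entity_id, "lat": lat, "lng": lng}) + "\n")
        self.index[entity_id] = (lat, lng)

shard = GeoShard("/tmp/geo-shard-0.wal")
shard.update("driver-42", 37.7749, -122.4194)
```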
Consistency for moving objects requires careful design. Use optimistic versioning on point records with compare-and-swap (CAS) semantics to avoid write conflicts when multiple updates arrive concurrently. Tolerate slightly stale index entries (seconds of lag) if business logic allows; for example, showing a driver 2 seconds behind its actual position is acceptable in many dispatch scenarios. For stricter freshness, maintain a write-through cache of recent updates keyed by entity ID and merge it during query refinement. This keeps 99th-percentile staleness under one second without sacrificing throughput.
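A minimal sketch of the CAS pattern, assuming an in-process store where a lock stands in for the atomic compare-and-swap a real key-value store would provide; `VersionedStore` and `PointRecord` are hypothetical names.

```python
# Sketch: optimistic versioning for moving-object updates. Each point record
# carries a version; an update wins only if it was based on the current
# version (compare-and-swap). The lock stands in for the atomic CAS a real
# key-value store would provide.

import threading
from dataclasses import dataclass

@dataclass
class PointRecord:
    lat: float
    lng: float
    version: int

class VersionedStore:
    def __init__(self) -> None:
        self._records: dict[str, PointRecord] = {}
        self._lock = threading.Lock()

    def compare_and_swap(self, entity_id: str, expected_version: int,
                         lat: float, lng: float) -> bool:
        """Apply the update only if the caller saw the current version."""
        with self._lock:
            current = self._records.get(entity_id)
            current_version = current.version if current else 0
            if current_version != expected_version:
                return False  # a concurrent writer won; re-read and retry
            self._records[entity_id] = PointRecord(lat, lng, current_version + 1)
            return True

store = VersionedStore()
assert store.compare_and_swap("driver-42", expected_version=0, lat=37.77, lng=-122.42)
# A second writer still holding version 0 loses the race and must retry.
assert not store.compare_and_swap("driver-42", expected_version=0, lat=37.78, lng=-122.41)
```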
Monitoring and service level objectives (SLOs) should track key metrics. Average cells per covering indicates query efficiency; spikes suggest boundary issues or covering-algorithm problems. Candidates per query after the index scan measures the false-positive rate; high values mean leaf capacity is too large or cells are too coarse. Leaf size distribution and index height reveal skew; monitor split and merge rates to detect thrashing. Track replication lag in the durability pipeline so geosearch does not return stale results. Set alerting thresholds: cells per query above 20, candidates per query above 5,000, or replication lag above 5 seconds warrant investigation.
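A minimal sketch of evaluating those thresholds against a metrics snapshot; the metric names and the `check_slos` helper are illustrative, and in practice the values would come from your monitoring system.

```python
# Sketch: checking the alerting thresholds above against a metrics snapshot.
# Metric names are illustrative; real deployments would pull these values
# from the monitoring system (e.g. Prometheus).

THRESHOLDS = {
    "cells_per_query": 20,         # covering efficiency
    "candidates_per_query": 5000,  # false-positive rate after the index scan
    "replication_lag_s": 5,        # durability pipeline freshness
}

def check_slos(snapshot: dict[str, float]) -> list[str]:
    """Return an alert message for every metric that exceeds its threshold."""
    return [
        f"{name}={snapshot[name]} exceeds threshold {limit}"
        for name, limit in THRESHOLDS.items()
        if snapshot.get(name, 0) > limit
    ]

print(check_slos({"cells_per_query": 27,
                  "candidates_per_query": 1200,
                  "replication_lag_s": 1.5}))
# -> ['cells_per_query=27 exceeds threshold 20']
```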
💡 Key Takeaways
•Partition by coarse cell prefix (e.g., level 6 giving 24,576 prefixes); group prefixes per shard for load balance; queries fan out only to shards with intersecting prefixes, bounding fan-out to 2 to 3 shards for typical radius queries
•Keep per-shard indexes in memory for sub-10 ms scans; persist updates to a write-ahead log or distributed commit log, replay on restart; common in Pinot and custom geosearch services
•Use optimistic versioning with compare-and-swap for moving-object updates to avoid conflicts; tolerate seconds of staleness if acceptable, or maintain a write-through cache keyed by entity ID for sub-1-second freshness
•Monitor cells per cover, candidates per query, leaf size distribution, split and merge rates, and replication lag; alert on cells per query above 20, candidates above 5,000, or lag above 5 seconds
•Hot prefix reassignment and adaptive splitting prevent shard hotspots; monitor query and update rates per prefix and rebalance dynamically to maintain even load distribution across shards
📌 Examples
Uber dispatch system shards by H3 cell prefix with millions of moving drivers. Queries touch 2 to 4 shards on average; in-memory indexes per shard keep 99th percentile query latency under 50 ms including fan-out and merge.
Airbnb geosearch backend uses geohash prefix sharding with level 5 or 6 prefixes. Each shard holds tens of millions of listings in memory; a write-ahead log in Kafka provides durability with seconds of replication lag tolerance.