Geospatial & Location Services • Geohashing • Hard • ⏱️ ~3 min
Implementation Patterns and Performance Tuning
The canonical indexing pattern stores each point as a 64 bit geohash integer, kept as a secondary ordered key alongside the object ID and an optional timestamp. For sharding, use a short 3 to 5 character prefix as the partition key to distribute writes (32 cubed is 32,768 possible prefixes; 32 to the power of 5 is about 33.6 million), with the full geohash plus object ID as the sort key to preserve scan locality within each partition. Monitor per partition item counts continuously; when a partition exceeds capacity thresholds such as 10 million items or 10 GB, split it by lengthening the prefix by one character (for example from 4 to 5), redistributing its load across 32 child partitions.
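A minimal sketch of this key layout, assuming the full geohash has already been computed elsewhere; for readability it keeps the base32 string form rather than the packed 64 bit integer, and the names and record shape are illustrative rather than any specific database's API:

```python
from dataclasses import dataclass

PREFIX_LEN = 4  # partition key length; lengthen to 5 when splitting a hot partition


@dataclass(frozen=True)
class LocationKey:
    partition_key: str  # short geohash prefix: spreads writes across partitions
    sort_key: str       # full geohash + object ID: keeps nearby points adjacent


def make_location_key(full_geohash: str, object_id: str) -> LocationKey:
    """Build the composite key for one point (geohash assumed precomputed)."""
    return LocationKey(
        partition_key=full_geohash[:PREFIX_LEN],
        sort_key=f"{full_geohash}#{object_id}",
    )


# Example: a point whose 9 character geohash is "9q8yyk8yt"
key = make_location_key("9q8yyk8yt", "driver-42")
# key.partition_key == "9q8y", key.sort_key == "9q8yyk8yt#driver-42"
```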
Query optimization starts with precision selection: choose the coarsest precision (shortest geohash) whose cell diagonal is less than or equal to the query radius. For dynamic precision, precompute and store multiple geohash lengths (5, 6, 7, 8 characters) per point so the query time selection needs no recomputation. Execute the 9 neighbor range scans (center cell plus its 8 neighbors) in parallel when possible; in systems like DynamoDB or Cassandra this translates to 9 concurrent partition queries that complete in one network round trip. Merge results in memory, deduplicate by object ID using a hash set, then apply exact haversine distance filtering to remove false positives. Expect 10 to 50 percent candidate overfetch at a well chosen precision; if the measured overfetch ratio exceeds 2 times, move to a finer precision.
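A sketch of that fan out under stated assumptions: `neighbors(cell)` returns the 8 adjacent cells (as common geohash libraries provide) and `range_scan(cell)` runs the store's range query for one prefix; both are supplied by the caller here, while the merge, dedup, and haversine post filter are shown in full:

```python
import math
from concurrent.futures import ThreadPoolExecutor

EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Exact great circle distance in meters, used to drop false positives."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def proximity_query(center_cell, lat, lon, radius_m, neighbors, range_scan):
    """9 cell proximity search: scan center cell plus 8 neighbors in parallel,
    merge, deduplicate by object ID, then post filter by exact distance.

    `neighbors(cell)` and `range_scan(cell)` are caller supplied (geohash
    library and storage client); `range_scan` yields records with
    `object_id`, `lat`, `lon` fields.
    """
    cells = [center_cell] + list(neighbors(center_cell))    # 9 prefixes to scan
    with ThreadPoolExecutor(max_workers=len(cells)) as pool:
        batches = list(pool.map(range_scan, cells))         # one parallel round trip

    seen, results = set(), []
    for record in (r for batch in batches for r in batch):
        if record["object_id"] in seen:                     # dedupe across cells
            continue
        seen.add(record["object_id"])
        if haversine_m(lat, lon, record["lat"], record["lon"]) <= radius_m:
            results.append(record)                          # exact distance filter
    return results
```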
For aggregations and heatmaps, emit metrics keyed by the chosen geohash prefix at the precision appropriate for each zoom level: precision 5 for country/region view (4.9 kilometer cells), precision 6 for city districts (1.2 kilometer cells), precision 7 for neighborhoods (153 meter cells), precision 8 for street level (38 meter cells). With 100 million points, precision 6 has roughly 1.07 billion possible cells worldwide, but only 10,000 to 100,000 are actually populated, so the aggregation state easily fits in memory. Use sparse data structures like hash maps or tries to avoid allocating empty cells. Cache aggregated tiles with geographic keys to serve repeated map requests without recomputation.
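A sketch of the sparse bucketing, assuming each event record already carries its full precision geohash (the field name is an assumption for this sketch); only populated cells ever get an entry:

```python
from collections import Counter

HEATMAP_PRECISION = 6  # roughly 1.2 kilometer cells, suitable for a city view


def aggregate_heatmap(events, precision=HEATMAP_PRECISION):
    """Count events per geohash prefix; sparse, so empty cells cost nothing.

    `events` is any iterable of records carrying a full precision `geohash` field.
    """
    buckets = Counter()
    for event in events:
        buckets[event["geohash"][:precision]] += 1
    return buckets  # e.g. Counter({"9q8yyk": 1831, "9q8yym": 412, ...})


# The resulting tile can be cached under a geographic key and reused for
# repeated map requests instead of recomputing per request.
```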
💡 Key Takeaways
•Store the 64 bit geohash integer for compactness (8 bytes vs 16 for two doubles). Use short prefix (3 to 5 characters) as partition key for write distribution, full geohash plus ID as sort key for scan locality. 32 to the power of 4 is about 1.05 million possible 4 character prefixes, mapping to hundreds or thousands of physical partitions
•Query pattern executes 9 parallel range scans (center plus 8 neighbors) completing in one network round trip. Merge and deduplicate in memory using hash set, post filter with exact distance. Typical flow: 4 to 10 milliseconds p50, 15 to 25 milliseconds p99 with hot set in memory
•Precompute multiple precisions (5, 6, 7, 8 characters) per point to enable dynamic precision selection at query time without recomputation. Store as separate index entries or packed bit field. Increases storage by 20 to 30 percent but eliminates precision mismatch latency spikes
•Aggregation pattern uses geohash buckets at zoom appropriate precision: precision 5 for region level (4.9 kilometer), 6 for city districts (1.2 kilometer), 7 for neighborhoods (153 meter), 8 for streets (38 meter). With 100 million points, precision 6 yields 10,000 to 100,000 populated buckets fitting in memory
•Monitor continuously: overfetch ratio (target under 2 times), per partition query rate and size, p95 and p99 latency. Auto split partitions when item count exceeds 10 million or query rate exceeds 5 times the median. Adjust precision when overfetch consistently exceeds the 2 times target; see the sketch after this list
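A sketch of those monitoring triggers, using the thresholds from the bullets above; the partition statistics shape is an assumption, not a particular database's metrics API:

```python
from dataclasses import dataclass

MAX_PARTITION_ITEMS = 10_000_000   # split when a partition holds more items than this
HOT_QUERY_FACTOR = 5               # ...or serves 5x the median partition query rate
OVERFETCH_TARGET = 2.0             # candidates fetched / results kept


@dataclass
class PartitionStats:
    prefix: str                    # partition key, e.g. a 4 character geohash prefix
    item_count: int
    queries_per_sec: float


def should_split(stats: PartitionStats, median_qps: float) -> bool:
    """True when the partition should split into its 32 child prefixes."""
    return (stats.item_count > MAX_PARTITION_ITEMS
            or stats.queries_per_sec > HOT_QUERY_FACTOR * median_qps)


def needs_finer_precision(candidates_fetched: int, results_kept: int) -> bool:
    """True when measured overfetch stays above the 2x target."""
    return results_kept > 0 and candidates_fetched / results_kept > OVERFETCH_TARGET
```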
📌 Examples
Key value production setup: 500 million locations sharded by 4 character geohash prefix across 512 partitions (consistent hashing maps prefixes to nodes). Average 1 million items per partition, 10 GB per node. Proximity query scans 9 ranges in parallel, fetches 600 to 1,000 candidates in 5 to 8 milliseconds, filters to 400 results. Achieves 50,000 queries per second cluster wide with p99 under 20 milliseconds.
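One way the "consistent hashing maps prefixes to nodes" step could look, sketched as a minimal hash ring; the partition names and virtual node count are assumptions for illustration:

```python
import bisect
import hashlib


class PrefixRing:
    """Consistent hash ring mapping 4 character geohash prefixes to partitions."""

    def __init__(self, partitions, vnodes=64):
        # Each partition gets `vnodes` points on the ring so load spreads evenly
        # and moving one partition only remaps a small slice of prefixes.
        self._ring = sorted(
            (self._hash(f"{p}#{i}"), p) for p in partitions for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

    def partition_for(self, prefix: str) -> str:
        """First ring point at or after the prefix's hash, wrapping around."""
        idx = bisect.bisect(self._keys, self._hash(prefix)) % len(self._ring)
        return self._ring[idx][1]


ring = PrefixRing([f"partition-{i:03d}" for i in range(512)])
owner = ring.partition_for("9q8y")   # which partition stores prefix "9q8y"
```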
Aggregation real numbers: Heatmap of 200 million food deliveries uses precision 6 for city view. Emit geohash bucket per delivery, aggregate counts in memory hash map. Only 80,000 buckets populated globally (out of roughly 1.07 billion possible), consuming 5 MB for counts plus 20 MB for geohash keys. Single core aggregation completes in 8 to 12 seconds; cache tiles for 5 minutes to serve map requests.
Multi precision index: Rideshare driver locations stored with 5, 6, 7, 8 character geohashes as four separate index entries per driver. Storage increases from 8 bytes to 32 bytes per location (4 times overhead), but query optimizer selects optimal precision dynamically. At 1 kilometer radius, uses precision 7; at 5 kilometers, uses precision 6. Eliminates precision mismatch overfetch, reducing average candidate count from 1,200 to 600 and cutting p99 latency from 30 milliseconds to 12 milliseconds.
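A sketch of the multi precision pattern in this example: write one index entry per stored precision, then pick the coarsest precision whose cell diagonal fits the query radius. The cell dimensions are the approximate base32 geohash sizes; helper names are illustrative:

```python
import math

STORED_PRECISIONS = (5, 6, 7, 8)

# Approximate geohash cell width x height in meters at each stored precision.
CELL_DIMS_M = {5: (4_890, 4_890), 6: (1_220, 610), 7: (153, 153), 8: (38, 19)}


def index_entries(full_geohash: str, driver_id: str):
    """One index entry per stored precision: four rows per driver location."""
    return [(full_geohash[:p], driver_id) for p in STORED_PRECISIONS]


def pick_precision(radius_m: float) -> int:
    """Coarsest stored precision whose cell diagonal fits within the radius."""
    for p in STORED_PRECISIONS:                 # coarse to fine
        width, height = CELL_DIMS_M[p]
        if math.hypot(width, height) <= radius_m:
            return p
    return STORED_PRECISIONS[-1]                # radius smaller than the finest cell


# pick_precision(5_000) -> 6 and pick_precision(1_000) -> 7, matching the example.
```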