
Distribution, Sharding, and Caching Strategies

At scale, proximity search must be distributed across multiple nodes and backed by caching to meet latency Service Level Objectives (SLOs). The core challenge is minimizing cross-shard fanout while keeping load balanced. Sharding by geospatial key (geohash prefix or H3 cell range) achieves excellent locality: queries route only to shards whose key ranges overlap the search area. For example, sharding Airbnb listings by geohash prefixes of length 4 means a search in San Francisco (geohashes starting with 9q8y) hits only the 9q8y shard group, avoiding fanout to New York or London shards.

Boundary queries are the hard case. A user standing on the border between two geohash regions needs candidates from both shards. The solution is to include immediate neighbor shards in the query or to use overlapping hierarchical keys. Uber dispatch queries the rider's H3 cell plus its k-ring neighbors; since H3 provides O(1) neighbor enumeration, the system typically routes to 2 to 4 shard groups even for boundary cases. The coordinator merges results using a bounded min heap, selecting the top k by distance in O(k log s) time, where s is the number of shards queried.

Caching is critical but tricky due to staleness. For static entities like restaurants or stores, caching cell-level candidate IDs with a Time To Live (TTL) of minutes or hours is safe and yields 80% to 95% hit rates for repeated map-pan queries. For dynamic entities like drivers, a 1 to 10 second TTL balances freshness and cache utility. A common optimization is to cache precomputed top k results for popular radii (250 m, 500 m, 1 km) and snap query centers to cell centroids to normalize cache keys. Airbnb applies this to high Queries Per Second (QPS) metropolitan viewport queries, keeping backend load manageable during peak traffic.

Hotspot mitigation requires cache partitioning and admission control. Events like concerts or sales drive localized QPS spikes in a handful of cells. Without partitioning, hot-cell entries evict useful entries from other regions, tanking global hit rates. Systems partition caches by metro area or top-level geohash prefix and use token bucket admission control to limit cache churn. When ingestion lags due to backpressure, systems degrade gracefully by slightly widening the slop or search radius while enforcing hard candidate caps (for example, 10,000 points) to protect tail latency.
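As a concrete illustration, here is a minimal sketch of prefix-based shard routing with neighbor inclusion. Everything in it is an assumption for the example: the `PREFIX_TO_SHARD` map stands in for cluster routing metadata, and the caller is assumed to already have the query cell and its neighbor cells from a geohash or H3 library.

```python
from typing import Dict, Iterable, Set

# Illustrative shard map: geohash prefixes of length 4 -> shard group.
# In a real deployment this would come from cluster metadata, not a literal dict.
PREFIX_LEN = 4
PREFIX_TO_SHARD: Dict[str, str] = {
    "9q8y": "shards-us-west",   # San Francisco
    "9q8z": "shards-us-west",
    "dr5r": "shards-us-east",   # Manhattan
    "gcpv": "shards-eu-west",   # London
}

def shards_for_query(query_cell: str, neighbor_cells: Iterable[str]) -> Set[str]:
    """Route a proximity query to the shard groups whose key prefix overlaps
    the query cell or any of its immediate neighbors.

    Including neighbors handles the boundary case where the user stands near
    the edge of a cell and candidates live in the adjacent cell.
    """
    shard_groups: Set[str] = set()
    for cell in [query_cell, *neighbor_cells]:
        group = PREFIX_TO_SHARD.get(cell[:PREFIX_LEN])
        if group is not None:
            shard_groups.add(group)
    return shard_groups

# Example (cell ids are illustrative): a rider whose neighbor cells straddle
# two prefixes still resolves to a single shard group, so fanout stays bounded.
print(shards_for_query("9q8yyk8y", ["9q8yyk8v", "9q8zn0c0"]))
# {'shards-us-west'}
```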
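The cache-key normalization can be sketched without a geospatial library by snapping the query center to a fixed-grid centroid and rounding the radius up to the nearest canonical bucket. In production the snap would use the H3 or geohash cell centroid rather than the toy grid below, and the TTL constants and names here are illustrative assumptions, not values from any particular system.

```python
from typing import Tuple

# Canonical radii from the text: precomputed top k is cached only for these.
CANONICAL_RADII_M = (250, 500, 1000)

# Illustrative TTLs: long for static entities, seconds for dynamic ones like drivers.
TTL_SECONDS = {"static": 3600, "dynamic": 5}

# Toy stand-in for a cell centroid: snap to a ~0.01 degree grid (~1 km at
# mid latitudes). A real system would snap to the centroid of the same
# geohash/H3 cell used for sharding.
GRID_DEG = 0.01

def snap_to_centroid(lat: float, lon: float) -> Tuple[float, float]:
    return (round(lat / GRID_DEG) * GRID_DEG, round(lon / GRID_DEG) * GRID_DEG)

def canonical_radius(radius_m: float) -> int:
    # Round up so the cached result set is a superset of what the query needs.
    for r in CANONICAL_RADII_M:
        if radius_m <= r:
            return r
    return CANONICAL_RADII_M[-1]

def cache_key(lat: float, lon: float, radius_m: float, entity_class: str) -> str:
    slat, slon = snap_to_centroid(lat, lon)
    return f"topk:{entity_class}:{slat:.2f}:{slon:.2f}:{canonical_radius(radius_m)}"

# Two nearby map pans normalize to the same key, so the second request is a hit.
print(cache_key(37.7712, -122.4194, 400, "static"))
print(cache_key(37.7738, -122.4189, 450, "static"))
# both: topk:static:37.77:-122.42:500
```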
💡 Key Takeaways
Shard by geospatial key (geohash prefix or H3 cell range) to route queries only to shards overlapping the search area, avoiding broadcast fanout to all nodes
Boundary queries require neighbor shard inclusion; Uber queries the rider's H3 cell plus its k-ring neighbors, typically hitting 2 to 4 shard groups thanks to O(1) neighbor enumeration
Cache static entities with TTLs of minutes to hours, achieving 80% to 95% hit rates; cache dynamic entities like drivers with 1 to 10 second TTLs to balance freshness and utility
Normalize cache keys by snapping query centers to cell centroids and caching precomputed top k results for popular radii (250 m, 500 m, 1 km) to maximize hit rates during map panning
Partition caches by metro area to prevent hot cells from concerts or sales evicting useful entries globally; use token bucket admission control to limit cache churn during QPS spikes (a minimal sketch follows this list)
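A minimal token bucket admission check, assuming one bucket per cache partition (for example, per metro or top-level geohash prefix); the rate and burst parameters are illustrative.

```python
import time

class TokenBucket:
    """Limits how many new entries a cache partition may admit per second,
    so a hot cell (concert, flash sale) cannot churn the whole cache."""

    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s        # tokens refilled per second
        self.capacity = burst         # maximum burst of admissions
        self.tokens = burst
        self.last = time.monotonic()

    def allow_admission(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                  # skip caching this result; serve from backend

# One bucket per partition, keyed here by metro (illustrative).
admission = {"sf": TokenBucket(rate_per_s=100, burst=200)}
if admission["sf"].allow_admission():
    pass  # insert the freshly computed top k result into the SF partition
```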
📌 Examples
Airbnb shards 7 million listings by geohash prefix, ensuring that San Francisco searches (9q8 prefix) only query California shards, not European or Asian clusters, reducing cross-region network hops
Uber dispatch coordinators merge driver candidates from 2 to 4 H3 shard groups using a min heap sorted by distance; typical merge time is under 5 milliseconds even for boundary riders (see the merge sketch below)
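A sketch of that coordinator-side merge, assuming each shard returns its candidates already sorted by distance ascending; Python's heapq.merge keeps one entry per shard on a min heap, so selecting the top k costs O(k log s). The data and identifiers are made up for illustration.

```python
import heapq
from itertools import islice
from typing import Iterable, List, Tuple

Candidate = Tuple[float, str]  # (distance_meters, entity_id)

def merge_top_k(shard_results: List[Iterable[Candidate]], k: int) -> List[Candidate]:
    """Merge per-shard candidate lists (each sorted by distance ascending)
    and return the global top k. heapq.merge holds one entry per shard on a
    min heap, so each of the k pops costs O(log s)."""
    return list(islice(heapq.merge(*shard_results), k))

# Example: three shard groups answering a boundary query.
shard_a = [(120.0, "driver-17"), (480.0, "driver-03")]
shard_b = [(95.0, "driver-42"), (610.0, "driver-88")]
shard_c = [(300.0, "driver-29")]
print(merge_top_k([shard_a, shard_b, shard_c], k=3))
# [(95.0, 'driver-42'), (120.0, 'driver-17'), (300.0, 'driver-29')]
```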