Sharding: Scaling Vector Indexes Across Machines
WHY SHARDING
Single-machine indexes hit limits: memory (hundreds of GB max), CPU (diminishing returns past 32-64 cores), and fault tolerance (single point of failure). Sharding distributes vectors across multiple machines to scale beyond these limits.
At 1 billion vectors with 768 dimensions, raw float32 storage is about 3TB. With HNSW graph overhead, expect 4-5TB. No single machine holds this. Sharding splits the index across 10-100 machines, each holding 10-100M vectors.
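The sizing arithmetic above can be checked directly. A minimal sketch, assuming float32 vectors and an illustrative 1.5x HNSW overhead factor (the real overhead depends on graph degree and link storage):

```python
# Back-of-envelope index sizing. The 1.5x HNSW overhead factor
# is an assumption for illustration, not a measured constant.
N = 1_000_000_000      # vectors
D = 768                # dimensions
BYTES_PER_FLOAT = 4    # float32

raw_bytes = N * D * BYTES_PER_FLOAT
raw_tb = raw_bytes / 1e12          # ~3.07 TB raw
with_hnsw_tb = raw_tb * 1.5        # ~4.6 TB, in the 4-5 TB range

shards = 100
vectors_per_shard = N // shards    # 10M vectors per shard
print(f"raw: {raw_tb:.2f} TB, with HNSW: {with_hnsw_tb:.2f} TB")
print(f"{vectors_per_shard:,} vectors per shard at {shards} shards")
```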
SHARDING STRATEGIES
Random sharding: Hash vector ID to determine shard. Simple, even distribution, but queries must fan out to all shards. Query latency = max shard latency. 100 shards = 100 parallel queries per user query.
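Hash-based shard assignment is a one-liner. A sketch (the `shard_for` helper and MD5 choice are illustrative; any stable hash works):

```python
import hashlib

def shard_for(vector_id: str, num_shards: int) -> int:
    """Random sharding: a stable hash of the vector ID, mod shard count.
    Writes route to exactly one shard; reads must fan out to all shards."""
    h = int(hashlib.md5(vector_id.encode()).hexdigest(), 16)
    return h % num_shards

# Assignments are deterministic and roughly uniform across shards.
ids = [f"doc-{i}" for i in range(10)]
assignments = [shard_for(vid, 4) for vid in ids]
```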
Semantic sharding: Cluster vectors, assign each cluster to a shard. Queries route to subset of relevant shards based on query embedding. Reduces fan-out but requires accurate routing. If routing fails, recall drops.
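Routing for semantic sharding can be sketched as nearest-centroid lookup. A minimal version, assuming each shard holds one cluster and exposes its centroid; probing more than one shard (`top_shards` > 1) hedges against routing errors at the cost of extra fan-out:

```python
import numpy as np

def route_query(query: np.ndarray, centroids: np.ndarray,
                top_shards: int = 2) -> list:
    """Semantic routing sketch: send the query to the shards whose
    cluster centroid is closest to the query embedding.
    centroids has shape (num_shards, dim)."""
    dists = np.linalg.norm(centroids - query, axis=1)
    return np.argsort(dists)[:top_shards].tolist()
```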
Hybrid: Coarse semantic routing to select shard groups, then fan-out within group. Balances routing accuracy and coverage.
QUERY AGGREGATION
Each shard returns top-K candidates. Aggregator merges results: collect top-K from each shard, re-rank by distance, return global top-K. With 100 shards and K=10, aggregator processes 1000 candidates.
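The merge step above is a straightforward k-way selection. A sketch, assuming each shard returns (distance, vector_id) pairs with smaller distance meaning more similar:

```python
import heapq

def merge_topk(shard_results, k):
    """Aggregate per-shard candidate lists into a global top-K.
    shard_results is a list of lists of (distance, vector_id) pairs;
    heapq.nsmallest selects the k closest without fully sorting all
    candidates (e.g. 1000 candidates for 100 shards at K=10)."""
    all_candidates = (pair for result in shard_results for pair in result)
    return heapq.nsmallest(k, all_candidates)
```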
Latency consideration: query latency is dominated by slowest shard (tail latency). Use timeouts and graceful degradation—return partial results if some shards are slow.
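Timeout-driven scatter-gather can be sketched with a thread pool. The `shard_clients` objects and their `.search(query, k)` method are hypothetical stand-ins for real shard RPC clients; the key idea is merging whatever arrived by the deadline instead of blocking on stragglers:

```python
from concurrent.futures import ThreadPoolExecutor, wait

def scatter_gather(shard_clients, query, k, timeout_s=0.05):
    """Query all shards in parallel; after timeout_s, return results
    from the shards that responded (graceful degradation). Slow shards
    are abandoned rather than allowed to dominate tail latency."""
    pool = ThreadPoolExecutor(max_workers=len(shard_clients))
    futures = [pool.submit(c.search, query, k) for c in shard_clients]
    done, not_done = wait(futures, timeout=timeout_s)
    # Don't block on stragglers; unstarted work is cancelled (3.9+).
    pool.shutdown(wait=False, cancel_futures=True)
    return [f.result() for f in done if f.exception() is None]
```

The caller would feed the returned per-shard lists into the merge step, accepting that recall is slightly reduced when shards miss the deadline.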
REPLICATION FOR FAULT TOLERANCE
Each shard is replicated across 2-3 machines. If one replica fails, others serve traffic. Replication also enables read scaling—distribute queries across replicas.
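Replica-aware reads combine both ideas: round-robin over healthy replicas for read scaling, and skipping failed ones for fault tolerance. A sketch with hypothetical replica client objects:

```python
import itertools

class ReplicatedShard:
    """Read-path sketch for one shard replicated on 2-3 machines:
    distribute queries round-robin across replicas still marked healthy."""
    def __init__(self, replicas):
        self.replicas = replicas          # hypothetical client objects
        self.healthy = set(range(len(replicas)))
        self._counter = itertools.count()

    def mark_failed(self, idx):
        """Health checking itself is out of scope; callers flag failures."""
        self.healthy.discard(idx)

    def pick_replica(self):
        if not self.healthy:
            raise RuntimeError("no healthy replicas for this shard")
        candidates = sorted(self.healthy)
        return self.replicas[candidates[next(self._counter) % len(candidates)]]
```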