Embeddings & Similarity Search › Index Management (Building, Updating, Sharding) · Medium · ⏱️ ~2 min

Sharding: Distribution, Routing, and Aggregation

WHY SHARDING

Single-machine indexes hit limits: memory (hundreds of GB max), CPU (diminishing returns past 32-64 cores), and fault tolerance (single point of failure). Sharding distributes vectors across multiple machines to scale beyond these limits.

At 1 billion vectors with 768 dimensions (float32), raw storage is about 3 TB (10⁹ × 768 × 4 bytes). With HNSW graph overhead, expect 4-5 TB. No single machine holds this in memory. Sharding splits the index across 10-100 machines, each holding 10-100M vectors.
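The sizing above is simple arithmetic; a minimal sketch (the 1.5x HNSW overhead factor is an illustrative assumption in the 4-5 TB range stated above):

```python
# Back-of-envelope sizing for a billion-scale vector index (float32 embeddings).
NUM_VECTORS = 1_000_000_000
DIM = 768
BYTES_PER_FLOAT = 4  # float32

raw_tb = NUM_VECTORS * DIM * BYTES_PER_FLOAT / 1e12
print(f"raw storage: {raw_tb:.2f} TB")            # ~3.07 TB

# HNSW graph links add overhead; 1.5x is an assumed midpoint of the 4-5 TB range.
hnsw_tb = raw_tb * 1.5
print(f"with HNSW overhead: {hnsw_tb:.2f} TB")

# Splitting across 100 shards leaves a manageable slice per machine.
SHARDS = 100
print(f"vectors per shard: {NUM_VECTORS // SHARDS:,}")  # 10,000,000
```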

SHARDING STRATEGIES

Random sharding: Hash the vector ID to pick a shard. Simple and evenly distributed, but every query must fan out to all shards: query latency equals the slowest shard's latency, and 100 shards means 100 parallel sub-queries per user query.

Semantic sharding: Cluster vectors, assign each cluster to a shard. Queries route to a subset of relevant shards based on the query embedding. Reduces fan-out but requires accurate routing: if routing misses the shard holding a true nearest neighbor, recall drops.

Hybrid: Coarse semantic routing to select shard groups, then fan-out within group. Balances routing accuracy and coverage.
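The two routing styles can be sketched side by side. This is an illustrative toy (shard count, centroid list, and function names are assumptions, not a real library API): random sharding hashes the ID, semantic sharding ranks cluster centroids by distance to the query.

```python
import hashlib
import math

NUM_SHARDS = 4  # assumed cluster/shard count for illustration

def random_shard(vector_id: str) -> int:
    """Random sharding: hash the vector ID to pick one shard.
    Writes go to exactly one shard; reads must fan out to all."""
    digest = hashlib.md5(vector_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def semantic_shards(query, centroids, top=2):
    """Semantic routing: send the query only to the shards whose
    cluster centroids are nearest to the query embedding."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ranked = sorted(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
    return ranked[:top]
```

Probing `top > 1` shards is the usual hedge against routing errors: it trades extra fan-out for recovered recall when the query falls near a cluster boundary.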

QUERY AGGREGATION

Each shard returns its top-K candidates. The aggregator merges results: collect top-K from each shard, re-rank by distance, return the global top-K. With 100 shards and K=10, the aggregator processes 1,000 candidates.
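The merge step above is a small heap operation. A minimal sketch, assuming each shard returns `(distance, vector_id)` pairs with smaller distance meaning more similar:

```python
import heapq

def merge_topk(shard_results, k=10):
    """Merge per-shard candidate lists into the global top-k.

    shard_results: list of lists of (distance, vector_id) tuples.
    Returns the k candidates with the smallest distances overall.
    """
    all_candidates = [c for shard in shard_results for c in shard]
    return heapq.nsmallest(k, all_candidates, key=lambda c: c[0])

# Example: two shards each return their local top-2.
merged = merge_topk([[(0.1, "a"), (0.5, "b")],
                     [(0.2, "c"), (0.9, "d")]], k=3)
print(merged)  # [(0.1, 'a'), (0.2, 'c'), (0.5, 'b')]
```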

Latency consideration: query latency is dominated by slowest shard (tail latency). Use timeouts and graceful degradation—return partial results if some shards are slow.
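Graceful degradation can be sketched with a deadline on the fan-out: collect whatever shards answered in time and drop stragglers. The function and parameter names here are illustrative assumptions, not a specific framework's API:

```python
from concurrent.futures import ThreadPoolExecutor, wait

def fan_out(shard_fns, query, timeout=0.05):
    """Query every shard in parallel; return results only from shards
    that answered within the deadline (partial results beat waiting
    on the slowest shard's tail latency).

    shard_fns: list of callables, one per shard, each taking the query.
    """
    pool = ThreadPoolExecutor(max_workers=len(shard_fns))
    futures = [pool.submit(fn, query) for fn in shard_fns]
    done, _ = wait(futures, timeout=timeout)
    pool.shutdown(wait=False)  # do not block on stragglers
    return [f.result() for f in done if f.exception() is None]
```

In production the partial-result path should also be surfaced to callers (e.g. a "degraded" flag), so downstream systems know recall may be reduced.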

REPLICATION FOR FAULT TOLERANCE

Each shard is replicated across 2-3 machines. If one replica fails, others serve traffic. Replication also enables read scaling—distribute queries across replicas.
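Read scaling across replicas is often a simple round-robin that skips unhealthy copies. A minimal sketch (class and method names are assumptions for illustration):

```python
import itertools

class ShardReplicas:
    """Round-robin read balancing across one shard's replicas,
    skipping replicas currently marked unhealthy."""

    def __init__(self, replicas):
        self.replicas = replicas
        self._rr = itertools.cycle(range(len(replicas)))

    def pick(self, healthy):
        """healthy: list of booleans, one per replica."""
        for _ in range(len(self.replicas)):
            i = next(self._rr)
            if healthy[i]:
                return self.replicas[i]
        raise RuntimeError("all replicas for this shard are down")
```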

💡 Key Insight: Sharding adds complexity: coordination, routing, aggregation, and consistency. Only shard when single-machine limits are reached. Start simple, scale when needed.
💡 Key Takeaways
Shard when single machine hits limits: 100s GB RAM, 32-64 cores, single point of failure
Random sharding: simple, fan-out to all shards; semantic: route to subset, risk recall loss
Query latency = slowest shard; use timeouts and return partial results
📌 Interview Tips
1. Compare random vs semantic sharding: random is simple but has maximum fan-out; semantic reduces fan-out but needs accurate routing.
2. Explain aggregation: collect top-K from each shard, merge by distance, return the global top-K.