Sharding: Scaling Vector Indexes Across Machines
WHY SHARDING
Single-machine indexes hit limits: memory (hundreds of GB max), CPU (diminishing returns past 32-64 cores), and fault tolerance (single point of failure). Sharding distributes vectors across multiple machines to scale beyond these limits.
At 1 billion vectors with 768 dimensions, raw float32 storage is about 3TB. With HNSW graph overhead, expect 4-5TB. No single machine holds this. Sharding splits the index across 10-100 machines, each holding 10-100M vectors.
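The sizing arithmetic above can be checked directly. A minimal sketch, assuming float32 vectors and an illustrative 1.5x HNSW overhead factor (the real overhead depends on graph degree and link storage):

```python
# Back-of-envelope index sizing. The 1.5x HNSW overhead factor
# is an assumption for illustration, not a measured constant.
N = 1_000_000_000      # vectors
D = 768                # dimensions
BYTES_PER_FLOAT = 4    # float32

raw_bytes = N * D * BYTES_PER_FLOAT
raw_tb = raw_bytes / 1e12          # ~3.07 TB raw
with_hnsw_tb = raw_tb * 1.5        # ~4.6 TB, in the 4-5 TB range

shards = 100
vectors_per_shard = N // shards    # 10M vectors per shard
print(f"raw: {raw_tb:.2f} TB, with HNSW: {with_hnsw_tb:.2f} TB")
print(f"{vectors_per_shard:,} vectors per shard at {shards} shards")
```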
SHARDING STRATEGIES
Random sharding: Hash vector ID to determine shard. Simple, even distribution, but queries must fan out to all shards. Query latency = max shard latency. 100 shards = 100 parallel queries per user query.
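Hash-based shard assignment is a one-liner. A sketch (the `shard_for` helper and MD5 choice are illustrative; any stable hash works):

```python
import hashlib

def shard_for(vector_id: str, num_shards: int) -> int:
    """Random sharding: a stable hash of the vector ID, mod shard count.
    Writes route to exactly one shard; reads must fan out to all shards."""
    h = int(hashlib.md5(vector_id.encode()).hexdigest(), 16)
    return h % num_shards

# Assignments are deterministic and roughly uniform across shards.
ids = [f"doc-{i}" for i in range(10)]
assignments = [shard_for(vid, 4) for vid in ids]
```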
Semantic sharding: Cluster vectors, assign each cluster to a shard. Queries route to subset of relevant shards based on query embedding. Reduces fan-out but requires accurate routing. If routing fails, recall drops.
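Routing for semantic sharding can be sketched as nearest-centroid lookup. A minimal version, assuming each shard holds one cluster and exposes its centroid; probing more than one shard (`top_shards` > 1) hedges against routing errors at the cost of extra fan-out:

```python
import numpy as np

def route_query(query: np.ndarray, centroids: np.ndarray,
                top_shards: int = 2) -> list:
    """Semantic routing sketch: send the query to the shards whose
    cluster centroid is closest to the query embedding.
    centroids has shape (num_shards, dim)."""
    dists = np.linalg.norm(centroids - query, axis=1)
    return np.argsort(dists)[:top_shards].tolist()
```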
Hybrid: Coarse semantic routing to select shard groups, then fan-out within group. Balances routing accuracy and coverage.
QUERY AGGREGATION
Each shard returns top-K candidates. Aggregator merges results: collect top-K from each shard, re-rank by distance, return global top-K. With 100 shards and K=10, aggregator processes 1000 candidates.
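The merge step above is a straightforward k-way selection. A sketch, assuming each shard returns (distance, vector_id) pairs with smaller distance meaning more similar:

```python
import heapq

def merge_topk(shard_results, k):
    """Aggregate per-shard candidate lists into a global top-K.
    shard_results is a list of lists of (distance, vector_id) pairs;
    heapq.nsmallest selects the k closest without fully sorting all
    candidates (e.g. 1000 candidates for 100 shards at K=10)."""
    all_candidates = (pair for result in shard_results for pair in result)
    return heapq.nsmallest(k, all_candidates)
```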
Latency consideration: query latency is dominated by slowest shard (tail latency). Use timeouts and graceful degradation—return partial results if some shards are slow.
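Timeout-driven scatter-gather can be sketched with a thread pool. The `shard_clients` objects and their `.search(query, k)` method are hypothetical stand-ins for real shard RPC clients; the key idea is merging whatever arrived by the deadline instead of blocking on stragglers:

```python
from concurrent.futures import ThreadPoolExecutor, wait

def scatter_gather(shard_clients, query, k, timeout_s=0.05):
    """Query all shards in parallel; after timeout_s, return results
    from the shards that responded (graceful degradation). Slow shards
    are abandoned rather than allowed to dominate tail latency."""
    pool = ThreadPoolExecutor(max_workers=len(shard_clients))
    futures = [pool.submit(c.search, query, k) for c in shard_clients]
    done, not_done = wait(futures, timeout=timeout_s)
    # Don't block on stragglers; unstarted work is cancelled (3.9+).
    pool.shutdown(wait=False, cancel_futures=True)
    return [f.result() for f in done if f.exception() is None]
```

The caller would feed the returned per-shard lists into the merge step, accepting that recall is slightly reduced when shards miss the deadline.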
REPLICATION FOR FAULT TOLERANCE
Each shard is replicated across 2-3 machines. If one replica fails, others serve traffic. Replication also enables read scaling—distribute queries across replicas.
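Replica-aware reads combine both ideas: round-robin over healthy replicas for read scaling, and skipping failed ones for fault tolerance. A sketch with hypothetical replica client objects:

```python
import itertools

class ReplicatedShard:
    """Read-path sketch for one shard replicated on 2-3 machines:
    distribute queries round-robin across replicas still marked healthy."""
    def __init__(self, replicas):
        self.replicas = replicas          # hypothetical client objects
        self.healthy = set(range(len(replicas)))
        self._counter = itertools.count()

    def mark_failed(self, idx):
        """Health checking itself is out of scope; callers flag failures."""
        self.healthy.discard(idx)

    def pick_replica(self):
        if not self.healthy:
            raise RuntimeError("no healthy replicas for this shard")
        candidates = sorted(self.healthy)
        return self.replicas[candidates[next(self._counter) % len(candidates)]]
```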