
Sharding Vector Indexes: Balancing Load and Latency

Sharding splits the index across machines to scale beyond single-node memory and CPU limits. The sharding strategy determines latency, load balance, and failure resilience.

Hash-based sharding distributes vectors uniformly: each item ID is hashed modulo the shard count. For 500 million vectors across 16 shards, each shard holds roughly 31 million vectors. Queries fan out to all shards; each shard searches its local index in parallel and returns its top 200 candidates in 15 milliseconds median and 30 milliseconds p99, then a coordinator merges them into a global top 500. This is simple and balances writes well, but every query hits all shards. At 10,000 queries per second (QPS) system wide, each shard also serves 10,000 QPS, requiring substantial CPU.

Range-based or learned sharding routes queries to fewer shards. With Inverted File (IVF) partitioning, the query vector is assigned to the nearest 3 to 5 clusters, and only the shards holding those clusters are queried. This reduces fanout from 16 shards to 3, cutting network and CPU cost by roughly 5x. The tradeoff is skew: popular clusters become hot shards. Spotify and Pinterest use coarse clustering to route queries, achieving 20 to 40 millisecond p99 latencies with 3-shard fanout versus 60 to 80 milliseconds with full fanout.

Shard sizing is a critical capacity decision. Practitioners target roughly 100 million vectors per 64 to 128 gigabyte node when using Product Quantization at 16 bytes per vector, plus graph overhead. For text search, 30 to 50 gigabyte shards are common because recovery and rebalancing complete in minutes. Smaller shards increase coordination overhead and metadata cost; larger shards concentrate load and slow failure recovery.
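Below is a minimal, single-process sketch (Python with NumPy; the shard count, data sizes, and brute-force L2 search are illustrative stand-ins for real per-shard ANN indexes) contrasting the two strategies: hash-based placement with full fan-out and a coordinator merge, versus IVF-style placement that routes each query to only the nearest few shards.

```python
import numpy as np

NUM_SHARDS = 16
DIM = 64
rng = np.random.default_rng(0)
vectors = rng.standard_normal((20_000, DIM)).astype(np.float32)


def top_k(ids: np.ndarray, vecs: np.ndarray, query: np.ndarray, k: int):
    """Exact top-k by L2 distance within one shard (stand-in for a local ANN index)."""
    dists = np.linalg.norm(vecs - query, axis=1)
    order = np.argsort(dists)[:k]
    return list(zip(dists[order].tolist(), ids[order].tolist()))


# Strategy 1: hash-based sharding -- item ID modulo shard count decides placement.
hash_shard = np.arange(len(vectors)) % NUM_SHARDS


def full_fanout_search(query: np.ndarray, k: int):
    """Every query hits all shards; the coordinator merges per-shard results."""
    candidates = []
    for s in range(NUM_SHARDS):
        ids = np.where(hash_shard == s)[0]
        candidates += top_k(ids, vectors[ids], query, k)
    return sorted(candidates)[:k]


# Strategy 2: IVF-style routed sharding -- vectors are placed by nearest coarse
# centroid, and queries probe only the nprobe nearest centroids. (Real systems
# train the coarse quantizer with k-means; random seed centroids keep this short.)
centroids = vectors[rng.choice(len(vectors), NUM_SHARDS, replace=False)]
ivf_shard = np.argmin(
    np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2), axis=1
)


def routed_search(query: np.ndarray, k: int, nprobe: int = 3):
    """Fan out only to the nprobe shards whose centroids are closest to the query."""
    nearest = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    candidates = []
    for s in nearest:
        ids = np.where(ivf_shard == s)[0]
        candidates += top_k(ids, vectors[ids], query, k)
    return sorted(candidates)[:k]


query = rng.standard_normal(DIM).astype(np.float32)
print(full_fanout_search(query, k=5))   # touches all 16 shards
print(routed_search(query, k=5))        # touches only 3 shards
```

The routed variant trades recall for fanout: any true neighbor stored in a shard outside the probed set is missed, which is why production systems tune the number of probed clusters against a recall target.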
💡 Key Takeaways
Hash sharding provides uniform load distribution. Every shard receives equal writes and equal query traffic. Simple to implement and reason about, but requires full fanout for global top K results.
Inverted File or learned routing reduces fanout from 16 shards to 3 to 5, cutting query latency by 40 to 60 percent. Meta and Google use coarse quantization to route queries, achieving 20 millisecond p99 per query versus 50 to 80 milliseconds with full fanout.
Hot shard problems emerge with skewed data. Popular categories or trending items can concentrate 3x to 5x more traffic on certain shards. Monitor per-shard QPS and p99 latency, not just cluster averages.
Shard sizing balances recovery time and efficiency. For vector indexes, 5 to 10 million vectors per shard at 20 bytes per vector is only 100 to 200 megabytes, supporting fast replication (see the capacity sketch after this list). Text search uses 30 to 50 gigabyte shards for minute-scale recovery.
Replicas are mandatory for availability and read scaling. Two to three replicas per shard tolerate node failures and distribute query load. At 10,000 QPS per shard, three replicas handle 30,000 QPS combined.
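As a quick back-of-the-envelope check on the sizing and replica numbers above (the helper name is illustrative; the byte counts and QPS figures mirror this section):

```python
def shard_payload_gb(num_vectors: int, bytes_per_vector: int) -> float:
    """Raw vector payload per shard, excluding graph/IVF metadata overhead."""
    return num_vectors * bytes_per_vector / 1e9


# 5-10 million vectors at 20 bytes each -> 0.1-0.2 GB of payload per shard,
# small enough to re-replicate quickly after a node failure.
print(shard_payload_gb(5_000_000, 20))    # 0.1
print(shard_payload_gb(10_000_000, 20))   # 0.2

# Read capacity scales with replica count: three replicas each serving
# 10,000 QPS give roughly 30,000 QPS for that shard's key range.
replicas, qps_per_replica = 3, 10_000
print(replicas * qps_per_replica)         # 30000
```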
📌 Examples
Pinterest uses coarse clustering for visual search. Query embeddings route to 3 to 5 clusters out of 256 total, reducing shard fanout and achieving 25 millisecond p99 latency at 95 percent recall.
Meta FAISS deployments shard by Inverted File partitions. Training identifies 4,096 clusters, and queries probe the nearest 32, fanning out to 32 shards instead of all shards and reducing network and CPU cost by 10x (see the IVF sketch after these examples).
Elasticsearch log search systems use daily indexes with 30 to 50 gigabyte shards. Five primary shards with one replica each provide 10 effective shards for parallel search and tolerate a single node failure.
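The sketch below shows the IVF probe mechanics from the FAISS example on a single node (it is not the multi-shard deployment itself); the dataset size and nlist=256 / nprobe=8 values are scaled down from the 4,096-cluster, 32-probe production figures so the toy run stays fast.

```python
import numpy as np
import faiss

d = 128                                              # embedding dimensionality (illustrative)
xb = np.random.rand(100_000, d).astype("float32")    # database vectors
xq = np.random.rand(5, d).astype("float32")          # query vectors

nlist = 256                                          # number of coarse IVF clusters
quantizer = faiss.IndexFlatL2(d)                     # coarse quantizer over cluster centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)

index.train(xb)    # k-means learns the nlist coarse centroids
index.add(xb)      # each vector goes into its nearest centroid's inverted list

index.nprobe = 8   # probe only the 8 nearest clusters per query
D, I = index.search(xq, 10)   # top-10 distances and ids per query
print(I)
```

In a sharded deployment, each inverted list (or group of lists) lives on its own shard, so the probe count directly bounds how many shards a query fans out to.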