Sharding Strategies for ML Search and Ranking Systems
Sharding splits large datasets and traffic across multiple machines so that no single node becomes a bottleneck. In ML-powered search systems handling billions of vectors and hundreds of thousands of queries per second (QPS), the choice of sharding strategy directly shapes latency distribution, operational complexity, and how gracefully the system absorbs load spikes.
The three main approaches trade off different properties. Hash sharding applies a consistent hash function to a key such as user ID or item ID, distributing data evenly across shards. That simplicity comes at a cost: it breaks natural locality, so queries that need related items often hit multiple shards. Range sharding groups data by ranges such as timestamps or geographic regions, preserving locality for time-series queries or regional searches; the danger is hotspots when traffic concentrates on recent data or popular regions. Directory-based sharding uses a separate mapping service to assign keys to shards flexibly, allowing sophisticated placement and live rebalancing, but it introduces a new failure point and adds lookup latency.
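To make the routing difference concrete, here is a minimal sketch of hash routing versus range routing. The shard count and day-bucket boundaries are illustrative assumptions, not values from any of the systems discussed here.

```python
import bisect
import hashlib

# Minimal sketch of stateless hash vs. range routing.
# NUM_SHARDS and DAY_BOUNDARIES are illustrative assumptions.

NUM_SHARDS = 64

def hash_shard(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Hash sharding: even spread, but related keys land on unrelated shards."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Range sharding by day-of-month buckets: a query over one week touches one
# or two shards, but the bucket holding "today" can become a hotspot.
DAY_BOUNDARIES = [1, 8, 15, 22]  # shard 0: days 1-7, shard 1: days 8-14, ...

def range_shard(day_of_month: int) -> int:
    return bisect.bisect_right(DAY_BOUNDARIES, day_of_month) - 1

print(hash_shard("item:42"), range_shard(17))  # e.g. shard indices 37 and 2
```

Note how the range scheme keeps a time window on one or two shards, while the hash scheme scatters it; that is exactly the locality-versus-hotspot trade-off described above.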
In production, a typical large-scale recommender might use consistent hashing with 100 to 1000 virtual nodes per physical shard. Amazon and Meta report keeping shard sizes under 500 GB for solid-state drive (SSD) backed indices and under 50 GB for in-memory vector stores; this sizing allows resharding to complete in minutes rather than hours. For a one-billion-vector index compressed to 16 bytes per vector via inverted file with product quantization (IVF-PQ), you would provision roughly 60 to 80 shards at 200 to 300 GB each, holding memory occupancy near 70 percent so shards can absorb failover traffic without cascading overload.
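The consistent-hashing-with-virtual-nodes pattern can be sketched in a few lines. The vnode count and hash function below are assumptions for illustration; production routers typically use faster hashes and keep the ring in a dedicated routing service.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Illustrative consistent-hash ring with virtual nodes (vnodes).

    A sketch only: the vnode count and MD5 hash are assumptions, not a
    reference to any particular production implementation.
    """

    def __init__(self, shards, vnodes_per_shard=256):
        self._ring = []  # sorted list of (hash_position, shard_name)
        for shard in shards:
            for v in range(vnodes_per_shard):
                self._ring.append((self._hash(f"{shard}#vnode{v}"), shard))
        self._ring.sort()
        self._positions = [pos for pos, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def shard_for(self, key: str) -> str:
        """Walk clockwise from the key's hash position to the next vnode."""
        idx = bisect.bisect_right(self._positions, self._hash(key)) % len(self._positions)
        return self._ring[idx][1]

# Adding or removing a shard only remaps keys that fall on its vnodes, so
# resharding moves roughly 1/N of the data instead of reshuffling everything.
ring = ConsistentHashRing([f"shard-{i}" for i in range(60)], vnodes_per_shard=256)
print(ring.shard_for("user:12345"))
```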
The key tension is fanout versus hotspots. Routing each query to many shards amplifies tail latency because the overall 99th percentile (P99) approaches the slowest shard's P99. YouTube and TikTok limit cross-shard queries to fewer than 16 shards by using coarse spatial partitioning in the vector space. When a celebrity user creates a hot shard, hash sharding cannot easily rebalance that single key; mitigations include salting the key to spread load or dynamically splitting hot shards, but both add operational complexity. Monitor per-shard load variance: if the maximum load exceeds 1.5 times the average, investigate rebalancing or the key distribution.
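The 1.5x load-variance rule translates directly into a small monitoring check. This is a hedged sketch: `qps_by_shard` is a stand-in for whatever per-shard metric your observability stack actually exports.

```python
# Flag hot shards whose load exceeds 1.5x the mean, per the rule above.
# The metrics source and threshold are assumptions; tune to your system.

def find_hot_shards(qps_by_shard: dict[str, float], threshold: float = 1.5) -> list[str]:
    if not qps_by_shard:
        return []
    mean_qps = sum(qps_by_shard.values()) / len(qps_by_shard)
    return [shard for shard, qps in qps_by_shard.items() if qps > threshold * mean_qps]

hot = find_hot_shards({"shard-0": 2400.0, "shard-1": 1900.0, "shard-2": 5200.0})
print(hot)  # ["shard-2"]: a candidate for key salting or a dynamic split
```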
💡 Key Takeaways
•Hash sharding provides even distribution and simple routing but breaks locality, causing cross-shard fanout. Range sharding preserves locality for time or region queries but risks hotspots from skewed traffic.
•Production systems keep shard sizes under 500 GB for SSD or 50 GB for memory to enable fast resharding. Meta and Amazon report 60 to 80 shards for billion-scale indices, with 70 percent occupancy kept as failover headroom.
•Consistent hashing with 100 to 1000 virtual nodes per physical shard smooths load distribution and makes adding or removing shards incremental rather than a full reshuffle.
•Cross-shard fanout amplifies tail latency because P99 approaches the slowest shard's. Limit queries to fewer than 16 shards via spatial partitioning or coarse clustering to keep P99 within budget.
•Hot shards from celebrity users or trending items concentrate load on one partition. Mitigations include key salting to distribute load (see the sketch after this list), dynamic routing based on real-time metrics, or splitting hot shards elastically.
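A minimal sketch of the key-salting mitigation from the last takeaway, assuming a hypothetical SALT_FACTOR of 8 sub-keys: writes scatter a hot key across several shards, and reads fan out over the salt buckets and merge the results.

```python
import hashlib
import random

# Key-salting sketch for hot keys. SALT_FACTOR and NUM_SHARDS are
# illustrative assumptions, not recommended constants.

SALT_FACTOR = 8
NUM_SHARDS = 64

def shard_of(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % NUM_SHARDS

def salted_write_key(hot_key: str) -> str:
    # Writes pick one of SALT_FACTOR sub-keys at random.
    return f"{hot_key}#{random.randrange(SALT_FACTOR)}"

def salted_read_keys(hot_key: str) -> list[str]:
    # Reads must query every salt bucket and merge the results.
    return [f"{hot_key}#{i}" for i in range(SALT_FACTOR)]

# A hot user's traffic now lands on up to SALT_FACTOR distinct shards.
print({shard_of(k) for k in salted_read_keys("user:celebrity-42")})
```

The trade-off is explicit: write load spreads across up to SALT_FACTOR shards, but every read pays a small fanout to reassemble the key.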
📌 Examples
A recommender serving 120k QPS with 1 billion item vectors uses 60 shards of 16 million vectors each. Consistent hashing routes user queries, and spatial partitioning limits each query to 16 shards, achieving 20 millisecond (ms) P95 retrieval latency.
Pinterest reports HNSW-based sharding for image recommendations where hash sharding by image ID spreads writes evenly, but queries go through a routing layer that targets only the relevant partitions based on coarse embedding clusters.
YouTube handles billions of video vectors by sharding with geographic and temporal hints, keeping popular recent videos in dedicated high-memory shards to reduce cross-shard queries during peak hours.