ML-Powered Search & Ranking • Scalability (Sharding, Caching, Approximate Search) • Hard • ⏱️ ~2 min
End-to-End Latency Budget for ML-Powered Ranking at 120k QPS
Meeting a strict latency Service Level Objective (SLO) such as 150 ms P99 for a recommendation feed at 120k QPS requires decomposing the request into stages, assigning each stage a latency budget, and optimizing the critical path. The core challenge is tail latency amplification: when a request fans out to N services in parallel, it finishes only when the slowest response arrives, so the fraction of requests that hit at least one per-service P99 outlier grows with N. Each component must therefore stay well under its budget to leave headroom for variance and retries.
A typical two-stage retrieval-and-ranking pipeline breaks down as follows:

1. User embedding fetch. With 99 percent of requests hitting an in-memory cache, hits return in 1 to 2 ms; misses compute the embedding with a small model in 10 to 20 ms. To cap tail latency, precompute embeddings for the top 50 million active users every 5 minutes and push them to cache, so cold compute affects less than 1 percent of traffic.
2. Candidate retrieval. Approximate nearest neighbor search runs on a sharded vector index. A cluster of 200 hosts, each serving 600 QPS at 15 ms P95, covers the full 120k QPS; provision extra hosts above this baseline for headroom. Queries touch 16 shards on average to limit fanout.
3. Feature fetch. Hundreds of features per user and item come from a feature store. An in-memory front cache serves hot features in 1 to 3 ms; cold reads hit SSD in 5 to 15 ms. At a 95 percent cache hit rate, the median is 2 ms and P99 is 12 ms.
4. Ranking inference. The model scores 1,000 candidates per request in 10 to 25 ms with a vectorized CPU model, or 2 to 3 ms median on GPU with careful batching to avoid queueing delays.
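The per-stage figures above can be captured as a simple budget check. The numbers are copied from the text; the names are illustrative:

```python
# Per-stage median latency budget in ms (values from the text above).
BUDGET_MS = {
    "embedding_fetch": 5,   # cache-hit path; misses capped by precompute
    "retrieval": 20,        # ANN search across up to 16 shards
    "feature_fetch": 15,    # feature store with ~95% memory hit rate
    "ranking": 25,          # CPU inference over ~1,000 candidates
    "overhead": 10,         # network + orchestration
}

def remaining_headroom(p99_slo_ms: int = 150) -> int:
    """Gap between the summed median budget and the P99 SLO,
    left for variance, fanout tails, and retries."""
    return p99_slo_ms - sum(BUDGET_MS.values())

print(sum(BUDGET_MS.values()), remaining_headroom())  # 75 75
```

The 2x gap between the 75 ms median total and the 150 ms P99 SLO is the headroom that absorbs fanout variance.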
The latency budget allocates 5 ms for embedding fetch, 20 ms for retrieval, 15 ms for features, 25 ms for ranking, and 10 ms for network and orchestration overhead, totaling 75 ms median against the 150 ms P99 SLO. Fanout to multiple shards and services adds variance, so use partial timeouts and fallbacks to degrade gracefully: if retrieval exceeds 30 ms, skip the slow shards and rank fewer candidates; if feature fetch is slow, drop non-critical features rather than fail the request.
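A minimal sketch of the partial-timeout-with-fallback pattern, using Python's asyncio. The service call and the timings are hypothetical stand-ins:

```python
import asyncio

async def fetch_features(user_id: str) -> dict:
    """Stand-in for a feature-store RPC (hypothetical)."""
    await asyncio.sleep(0.05)  # simulate a slow 50 ms call
    return {"full": True}

async def features_with_fallback(user_id: str, timeout_s: float = 0.02) -> dict:
    """Serve partial features instead of failing when the store is slow."""
    try:
        return await asyncio.wait_for(fetch_features(user_id), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Degrade: drop non-critical features, keep the request alive.
        return {"full": False}

result = asyncio.run(features_with_fallback("u123"))
print(result)  # slow store, so the fallback path returns partial features
```

The same shape applies to retrieval: fire all shard queries, gather whatever returns within the deadline, and rank the candidates you have.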
Operationally, monitor per-stage metrics: QPS, P50 and P99 latency, error rate, cache hit rate, and shard load variance, and enforce the budget at each service boundary. If a service consistently breaches its budget, apply admission control or overload shedding. On overload, reduce the candidate set from 1,000 to 500, widen cache TTLs slightly to cut the miss rate, or drop expensive features; these graceful degradations keep the system serving traffic instead of failing hard. Overprovision capacity by 30 to 50 percent over the 95th-percentile daily peak to absorb spikes and leave room for deployments without impacting latency.
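An overload-shedding policy along these lines might look like the following sketch. The utilization thresholds are assumptions, not figures from any named system:

```python
def candidate_budget(cpu_utilization: float) -> int:
    """Illustrative admission policy: shrink the ranking candidate set
    as serving capacity saturates (thresholds are assumptions)."""
    if cpu_utilization < 0.7:
        return 1000          # normal operation: full candidate set
    if cpu_utilization < 0.85:
        return 750           # mild shedding: trim the tail candidates
    return 500               # overload: halve the ranking work

print(candidate_budget(0.5), candidate_budget(0.9))  # 1000 500
```

Cutting candidates roughly halves ranking compute at a small recall cost, which is usually the right trade during a spike.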
💡 Key Takeaways
• Decompose the request into stages with individual budgets: 5 ms embedding fetch, 20 ms retrieval, 15 ms features, 25 ms ranking, 10 ms overhead. Total 75 ms median, targeting 150 ms P99.
• Fanout amplifies tail latency because the request is as slow as its slowest component. Limit cross-shard queries to 16 shards or fewer and use partial timeouts with fallbacks to keep P99 under control.
• Precompute embeddings for the top 50 million active users every 5 minutes to sustain a 99 percent cache hit rate, capping cold compute at less than 1 percent of traffic and avoiding the 10 to 20 ms miss penalty at P99.
• A feature store with a 95 percent cache hit rate delivers 2 ms median and 12 ms P99. Dropping to a 90 percent hit rate pushes P99 to 18 ms and risks breaching the SLO during traffic spikes.
• On overload, degrade gracefully: reduce the candidate set from 1,000 to 500, widen cache TTLs, or drop non-critical features. This keeps the system serving rather than failing hard. Overprovision by 30 to 50 percent over peak to absorb variance.
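Why a hit-rate drop hurts P99 so much: once misses exceed 1 percent of traffic, the slowest 1 percent of requests are all misses, so the overall P99 lands at a quantile inside the miss-latency distribution. A small sketch, assuming cache hits are always faster than misses:

```python
def miss_quantile_for_p99(hit_rate: float) -> float:
    """Which quantile of the miss-latency distribution the overall P99
    corresponds to, assuming every hit is faster than every miss."""
    miss_rate = 1.0 - hit_rate
    assert miss_rate > 0.01, "otherwise the P99 tail is not all misses"
    return 1.0 - 0.01 / miss_rate

print(round(miss_quantile_for_p99(0.95), 2))  # 0.8: overall P99 = miss P80
print(round(miss_quantile_for_p99(0.90), 2))  # 0.9: overall P99 = miss P90
```

At a 90 percent hit rate the overall P99 moves from the 80th to the 90th percentile of the slower SSD-read distribution, which is why the P99 in the takeaway above jumps from 12 ms toward 18 ms.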
📌 Examples
YouTube recommendation pipeline allocates 15 ms for first stage retrieval from billions of videos using approximate search, 10 ms for feature fetch, and 25 ms for neural ranking, with 10 ms network budget to meet 100 ms P95 SLO.
TikTok For You page uses GPU batch inference for ranking to achieve 2 to 3 ms median per request, but monitors queueing delay carefully to prevent P99 spikes during traffic bursts from synchronized client refreshes.
Airbnb listing ranking fetches user and listing features in parallel with 20 ms timeout. If feature service is slow, the system serves with partial features and logs the degradation for offline analysis rather than failing the search.