
Production Failure Modes: Hot Shards, Cache Stampedes, and Recall Regressions

Scalability patterns like sharding, caching, and approximate search introduce subtle failure modes that are invisible in development but cause incidents at scale. Understanding these failures and their mitigations is critical for operating ML systems in production.

Hot shards occur when traffic or data concentrates on one partition. A celebrity user with millions of followers generates 100 times more queries than a typical user, overwhelming the shard that owns that user's data, and hash sharding hides natural locality, so you cannot easily move just the hot key. Symptoms include high P99 latency on one shard, timeouts, and cascading retries. Mitigations include key salting, which appends a random suffix to the key and replicates the data across multiple shards, spreading load at the cost of read fanout; dynamic routing, which uses real-time load metrics to shed requests from hot shards to replicas; and elastic shard splitting, which divides a hot shard in two but takes minutes and requires careful state migration.

Cache stampede happens when a hot key expires and thousands of concurrent requests miss simultaneously, flooding the backend. A recommendation cache entry for a trending topic expires at midnight, 10,000 QPS hit the feature store at once, and latency spikes from 2 ms to 500 ms. Request coalescing (single flight per key) lets only one request refresh the cache while the others wait; jittered TTLs add random offsets so keys do not expire in sync; and stale-while-revalidate serves slightly stale data for a short window while refreshing asynchronously in the background. Without these, a cold start after a deployment or cache flush can overwhelm downstream services.

Approximate Nearest Neighbor (ANN) recall regression is insidious because it degrades recommendation quality without throwing errors. After a model retraining shifts vector distributions, a fixed Hierarchical Navigable Small World (HNSW) efSearch parameter or Inverted File with Product Quantization (IVF-PQ) probe count might drop recall from 0.92 to 0.85. Users see worse recommendations and Click-Through Rate (CTR) drops by 5 percent, but logs show no failures. Monitor recall by running shadow queries on a sampled subset against an exact-search reference. Offline validation before deploying new indices catches some issues, but online monitoring is essential.

Rebuilding a billion-scale index takes hours. Use blue-green deployment: build the new index in parallel, validate it, then swap routing atomically, and keep a rollback pointer so you can revert quickly if online metrics degrade.

Memory pressure and Out-Of-Memory (OOM) failures affect both HNSW and IVF-PQ. HNSW carries graph overhead proportional to its M parameter, typically 64 to 128 bytes per node, and if memory fills, the kernel OOM killer terminates processes unpredictably. Product quantization reduces memory but causes CPU spikes during decompression, which can degrade latency under load. Apply admission control to reject traffic when memory headroom drops below 20 percent, spill rarely accessed partitions to Solid State Drive (SSD) with prefetch to hide latency, or scale horizontally by adding shards. Monitor memory per shard and per node, not just the cluster aggregate, because imbalance causes localized failures.
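The sketches that follow illustrate one mitigation for each failure mode above, starting with key salting for hot shards. This is a minimal Python sketch assuming a simple hash-to-shard mapping; the `key#salt` format and the `NUM_SALTS` value are illustrative choices, not a prescribed scheme.

```python
import hashlib
import random

NUM_SALTS = 8  # illustrative: how many copies a hot key is spread across


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard id via a stable hash (modulo for simplicity)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards


def salted_write_keys(hot_key: str) -> list[str]:
    """On write, replicate the hot key under every salt so any copy can serve a read."""
    return [f"{hot_key}#{salt}" for salt in range(NUM_SALTS)]


def salted_read_key(hot_key: str) -> str:
    """On read, pick one salted copy at random, spreading queries across shards."""
    return f"{hot_key}#{random.randrange(NUM_SALTS)}"


# A celebrity key that previously hammered a single shard now fans out:
# writes go to every key in salted_write_keys("followers:celebrity_123"),
# reads go to shard_for(salted_read_key("followers:celebrity_123"), num_shards=64).
```

The trade-off is write amplification: every update must touch all salted copies, which is why salting is usually applied only to keys that monitoring has flagged as hot.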
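The three cache-stampede mitigations can live in one small component. The sketch below assumes a single-process service with an in-memory dict (a distributed cache would enforce single flight with a lock service or a Redis-style conditional set instead); the class and parameter names are hypothetical.

```python
import random
import threading
import time


class SingleFlightCache:
    """In-process cache sketch: one refresh per key (single flight), jittered TTLs,
    and stale-while-revalidate so concurrent misses do not stampede the backend."""

    def __init__(self, loader, ttl_s=300.0, jitter_s=60.0, stale_grace_s=30.0):
        self._loader = loader            # key -> value, e.g. a feature-store fetch
        self._ttl = ttl_s
        self._jitter = jitter_s
        self._stale_grace = stale_grace_s
        self._data = {}                  # key -> (value, expires_at)
        self._locks = {}                 # key -> lock enforcing single flight
        self._guard = threading.Lock()

    def _key_lock(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key):
        now = time.monotonic()
        entry = self._data.get(key)
        if entry is not None:
            value, expires_at = entry
            if now < expires_at:
                return value                       # fresh hit
            if now < expires_at + self._stale_grace:
                # Stale-while-revalidate: serve the old value, refresh in the background.
                threading.Thread(target=self._refresh, args=(key,), daemon=True).start()
                return value
        return self._refresh(key)                  # cold miss or too stale: refresh inline

    def _refresh(self, key):
        with self._key_lock(key):
            # Double-check under the per-key lock: a concurrent caller may have
            # already refreshed this key while we were waiting.
            entry = self._data.get(key)
            if entry is not None and time.monotonic() < entry[1]:
                return entry[0]
            value = self._loader(key)              # only one caller per key hits the backend
            ttl = self._ttl + random.uniform(0.0, self._jitter)   # jittered expiry
            self._data[key] = (value, time.monotonic() + ttl)
            return value
```

With this shape, a midnight expiry of a trending key results in one backend load instead of thousands: waiters block briefly on the per-key lock and then read the refreshed entry.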
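Shadow recall monitoring can be as simple as comparing the live ANN index against a brute-force reference on a small sample of real queries. The NumPy-only sketch below treats `ann_search` as a stand-in for whatever index the service actually uses (HNSW, IVF-PQ, etc.); the 200-query sample size, the 0.90 alert threshold, and the helper names in the trailing comment are illustrative assumptions.

```python
import numpy as np


def recall_at_k(ann_ids: np.ndarray, exact_ids: np.ndarray) -> float:
    """Average fraction of the exact top-k neighbors that the ANN index also returned."""
    overlaps = [len(set(a) & set(e)) / len(e) for a, e in zip(ann_ids, exact_ids)]
    return float(np.mean(overlaps))


def shadow_recall(queries, corpus, ann_search, k=10, sample_size=200, seed=0):
    """Run a sampled subset of live queries through both the ANN index and an exact
    brute-force reference. `ann_search(batch, k)` is assumed to return the top-k
    corpus ids for each query in the batch."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(queries), size=min(sample_size, len(queries)), replace=False)
    sample = queries[idx]

    # Exact reference via the expansion ||q - x||^2 = ||q||^2 + ||x||^2 - 2 q.x;
    # affordable because only the sampled queries are scored exactly.
    sq_dists = ((sample ** 2).sum(axis=1)[:, None]
                + (corpus ** 2).sum(axis=1)[None, :]
                - 2.0 * sample @ corpus.T)
    exact_ids = np.argsort(sq_dists, axis=1)[:, :k]

    ann_ids = ann_search(sample, k)
    return recall_at_k(ann_ids, exact_ids)


# Hypothetical alerting hook, run periodically after each retrain/index deploy:
# if shadow_recall(recent_queries, corpus_vectors, index_topk) < 0.90:
#     page_oncall("ANN recall regression: retune efSearch / nprobe or rebuild the index")
```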
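Blue-green index deployment boils down to an atomic pointer swap plus a retained rollback target. A minimal sketch, assuming any index object that exposes `search(query, k)`:

```python
import threading


class IndexRouter:
    """Blue-green routing sketch: queries always hit `active`; a candidate index is
    built and validated off to the side, then swapped in atomically, with the old
    index kept around as the rollback target."""

    def __init__(self, initial_index):
        self._lock = threading.Lock()
        self._active = initial_index
        self._previous = None            # rollback pointer

    def search(self, query, k):
        return self._active.search(query, k)     # serving never blocks on a rebuild

    def promote(self, candidate, validate) -> bool:
        """Swap in `candidate` only if `validate(candidate)` passes,
        e.g. an offline recall check against an exact reference."""
        if not validate(candidate):
            return False
        with self._lock:
            self._previous, self._active = self._active, candidate
        return True

    def rollback(self) -> None:
        """Revert to the previous index if online metrics (CTR, shadow recall) degrade."""
        with self._lock:
            if self._previous is not None:
                self._active, self._previous = self._previous, self._active
```

In practice the "pointer" is usually a service-discovery or routing-config entry rather than an in-process reference, but the build-validate-swap-keep-rollback shape is the same.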
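Finally, admission control on memory headroom is a few lines wrapped around whatever memory metric the node exports. The sketch below uses `psutil` for node-level stats (an assumption; a real deployment might read cgroup limits or process RSS instead), with the 20 percent floor from the text and a hypothetical handler in the trailing comment.

```python
import psutil  # assumption: psutil is available for node memory stats

HEADROOM_FLOOR = 0.20   # reject new work when available memory drops below 20% of total


def memory_headroom() -> float:
    """Fraction of physical memory still available on this node."""
    vm = psutil.virtual_memory()
    return vm.available / vm.total


def admit_request() -> bool:
    """Shed load deliberately before the kernel OOM killer does it unpredictably."""
    return memory_headroom() >= HEADROOM_FLOOR


# In the serving path (hypothetical handler):
# if not admit_request():
#     return Response(status=503)   # callers back off or retry against another shard/replica
```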
💡 Key Takeaways
Hot shards from celebrity users or trending items concentrate traffic on one partition, causing P99 spikes and timeouts. Mitigate with key salting to spread load, dynamic routing based on real-time metrics, or elastic shard splitting.
Cache stampede floods the backend when a hot key expires and thousands of requests miss simultaneously. Use single-flight refresh per key, jittered TTLs to desynchronize expiry, and stale-while-revalidate to serve during refresh.
ANN recall regression degrades recommendation quality silently after model retraining shifts vector distributions. Recall can drop from 0.92 to 0.85 with no errors, reducing CTR by 5 percent. Monitor with shadow exact search on sampled queries.
Rebuilding a billion-scale index takes hours. Use blue-green deployment: build the new index in parallel, validate offline recall, then atomically swap routing. Keep rollback capability to revert if online metrics degrade.
Memory pressure causes OOM failures. HNSW graph overhead is 64 to 128 bytes per node. Apply admission control when memory headroom drops below 20 percent. Monitor per shard and per node, not just cluster aggregate, to catch imbalance.
📌 Examples
During a viral event, a trending TikTok creator's key drives one shard to 10 times the average load. Dynamic routing shifts 30 percent of queries to replica shards, keeping P99 under 100 ms while autoscaling provisions new capacity.
A Meta recommendation cache entry for a breaking-news topic expires, causing 20,000 simultaneous misses. Without single flight, feature-store latency spikes to 800 ms for 10 seconds until the cache warms; per-key locks cap the miss fanout at a single backend request.
Pinterest retrains its image embedding model, shifting vector distributions. An efSearch value tuned for the old model drops HNSW recall from 0.94 to 0.87. Shadow testing detects the 7-point degradation, triggering an index rebuild with retuned parameters before full rollout.