What is ML Search Scalability and Why It Matters
THE SCALABILITY CHALLENGE
A single machine cannot serve production ML search. At 1KB per embedding, a billion documents require 1TB of RAM. At 10ms per query, one machine handles roughly 100 QPS, while a major platform needs 100k QPS with p99 latency under 50ms. The solution: distribute the data (sharding), avoid repeated computation (caching), and trade a little precision for speed (approximate search).
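To make the arithmetic concrete, here is a minimal back-of-envelope sketch in Python. The inputs are the figures above; the assumption that one machine serves queries one at a time, entirely from memory and with no cache, is a simplification for illustration.

num_docs = 1_000_000_000        # 1B documents
bytes_per_embedding = 1_024     # ~1KB each (e.g. 256 float32 values)
query_latency_s = 0.010         # 10ms per exact-search query
target_qps = 100_000            # platform-wide load

ram_tb = num_docs * bytes_per_embedding / 1e12   # ~1.0 TB just for embeddings
qps_per_machine = 1 / query_latency_s            # ~100 QPS served serially
machines_needed = target_qps / qps_per_machine   # ~1,000 machines if every query
                                                 # really takes the full 10ms
print(ram_tb, qps_per_machine, machines_needed)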
THREE PILLARS OF SCALABILITY
Sharding: split the index across machines, with each shard holding a portion of the documents. Queries fan out to every shard, and the per-shard results are merged into one ranked list. Caching: keep frequently accessed embeddings and features in memory, so cache hits skip expensive computation. Approximate search: use algorithms like HNSW that find 95%+ of the true nearest neighbors in about 1ms, versus 100ms+ for exact search.
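A minimal sketch of how the three pillars fit together in one process, assuming NumPy is available. The Shard class, the encode() stub, and the brute-force dot-product scoring are hypothetical stand-ins: a real deployment would run shards on separate machines and use an ANN index such as HNSW inside each shard, while lru_cache here plays the role of the embedding/feature cache.

from functools import lru_cache
import heapq
import numpy as np

DIM = 256

def encode(text):
    # Hypothetical stand-in for an embedding model:
    # a deterministic pseudo-random vector per query string.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(DIM).astype(np.float32)

class Shard:
    """One slice of the corpus. Brute-force scoring stands in for an ANN index."""
    def __init__(self, doc_ids, embeddings):
        self.doc_ids = doc_ids        # ids of the documents in this shard
        self.embeddings = embeddings  # (n_docs, DIM) float32 matrix

    def search(self, query_vec, k):
        scores = self.embeddings @ query_vec   # dot-product relevance
        top = np.argsort(-scores)[:k]          # local top-k
        return [(float(scores[i]), self.doc_ids[i]) for i in top]

@lru_cache(maxsize=100_000)
def cached_query_embedding(query_text):
    # Caching pillar: repeated queries skip the expensive embedding call.
    return encode(query_text)

def search_all(shards, query_text, k=10):
    q = cached_query_embedding(query_text)
    # Sharding pillar: fan the query out to every shard, then merge.
    partial = [hit for shard in shards for hit in shard.search(q, k)]
    return heapq.nlargest(k, partial)          # global top-k by score

# Example: two shards of 1,000 random documents each.
shards = [Shard(list(range(i * 1000, (i + 1) * 1000)),
                np.random.rand(1000, DIM).astype(np.float32))
          for i in range(2)]
print(search_all(shards, "red running shoes", k=5))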
SCALE NUMBERS TO KNOW
Embedding size: 256-1024 floats (1-4KB).
Index size at 1B docs: 1-4TB.
Shard count: 20-100 for TB-scale indexes.
Cache hit rate target: 80-95%.
ANN recall target: 95-99%.
Query fanout overhead: 2-5ms per shard tier.
Replication factor: 3x for fault tolerance.
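These numbers can be combined into a rough cluster-sizing estimate. The sketch below is illustrative only: the 64GB-of-embeddings-per-shard budget, the per-shard replication, and the way the cache hit rate discounts backend load are assumptions, not a published sizing rule.

def cluster_size(num_docs, bytes_per_vec, ram_per_shard_gb=64,
                 replication=3, cache_hit_rate=0.90, total_qps=100_000):
    """Back-of-envelope shard and machine counts; every default is an assumption."""
    index_tb = num_docs * bytes_per_vec / 1e12
    shards = -(-num_docs * bytes_per_vec // int(ram_per_shard_gb * 1e9))  # ceiling division
    machines = shards * replication                  # 3x copies for fault tolerance
    shard_qps = total_qps * (1 - cache_hit_rate)     # only cache misses reach the shards
    return {"index_tb": index_tb, "shards": int(shards),
            "machines": int(machines), "shard_qps": shard_qps}

# 1B docs at 2KB/vector: ~2TB index, 32 shards, 96 machines, 10k QPS past the cache.
print(cluster_size(num_docs=1_000_000_000, bytes_per_vec=2_048))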