ML-Powered Search & Ranking: Scalability (Sharding, Caching, Approximate Search)

What is ML Search Scalability and Why It Matters

Definition
ML Search Scalability combines three techniques—sharding, caching, and approximate search—to handle billions of documents and 100k+ QPS while keeping latency under 100ms.

THE SCALABILITY CHALLENGE

A single machine cannot serve production ML search. At 1KB per embedding, a billion documents require 1TB of RAM. At 10ms per query, one machine handles ~100 QPS. A major platform needs 100k QPS with p99 latency under 50ms. The solution: distribute the data (sharding), reduce computation (caching), and trade precision for speed (approximate search).
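The back-of-envelope math above can be checked directly. This is an illustrative calculation using the section's assumed figures (1KB per embedding, 10ms per query), not a benchmark:

```python
# Capacity math for a single machine serving exact ML search.
DOCS = 1_000_000_000          # 1B documents
BYTES_PER_EMBEDDING = 1_024   # 1 KB per embedding (e.g. 256 float32 dims)
QUERY_LATENCY_S = 0.010       # 10 ms per exact-search query, served serially

ram_needed_tb = DOCS * BYTES_PER_EMBEDDING / 1024**4
single_machine_qps = 1 / QUERY_LATENCY_S

print(f"RAM needed: {ram_needed_tb:.2f} TB")            # ~0.93 TB, i.e. ~1 TB
print(f"Single-machine QPS: {single_machine_qps:.0f}")  # 100 QPS
```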

THREE PILLARS OF SCALABILITY

Sharding: Split the index across machines. Each shard holds a portion of the documents; queries fan out to all shards and the results are merged.
Caching: Store frequently accessed embeddings and features in memory. Cache hits avoid expensive computation.
Approximate search: Use algorithms like HNSW that find 95%+ of the true neighbors in 1ms, instead of the 100ms+ an exact search takes.

💡 Key Insight: These three techniques compound. Sharding alone gets you to 10k QPS. Add caching for 50k QPS. Add approximate search for 100k+ QPS with sub-50ms latency. You need all three at scale.
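The caching pillar mentioned above can be sketched with an in-memory LRU cache over query embedding lookups. Here `embed_query` is a hypothetical stand-in for an expensive model forward pass:

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the "expensive" path actually runs

@lru_cache(maxsize=100_000)
def embed_query(text):
    # Stand-in for a costly embedding-model call; only runs on cache misses.
    CALLS["count"] += 1
    return tuple(float(ord(c)) for c in text)  # toy "embedding"

# Three lookups, two unique queries: the repeat is served from cache.
embed_query("shoes")
embed_query("shoes")
embed_query("boots")
hit_rate = 1 - CALLS["count"] / 3
print(f"hit rate: {hit_rate:.0%}")
```

In production the cache sits in a shared tier (e.g. a distributed key-value store) rather than in-process, and skewed query distributions are what make the 80-95% hit-rate targets achievable.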

SCALE NUMBERS TO KNOW

Embedding size: 256-1024 floats (1-4KB)
Index size at 1B docs: 1-4TB
Shard count: 20-100 for TB-scale indexes
Cache hit rate target: 80-95%
ANN recall target: 95-99%
Query fanout overhead: 2-5ms per shard tier
Replication factor: 3x for fault tolerance
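These numbers combine into a cluster-sizing estimate. A minimal sketch, assuming the high end of the index size (4TB) and a hypothetical 50GB-per-shard budget:

```python
import math

index_tb = 4.0     # 1B docs at 4 KB per embedding ~ 4 TB index
shard_gb = 50.0    # assumed per-shard memory budget (hypothetical)
replication = 3    # 3x replication for fault tolerance

shards = math.ceil(index_tb * 1024 / shard_gb)  # 82 shards, within 20-100
total_machines = shards * replication           # 246 machines
print(shards, total_machines)
```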

⚠️ Key Trade-off: Each technique trades something. Sharding adds coordination latency. Caching uses memory and risks staleness. Approximate search sacrifices recall. Understand what you are giving up.
💡 Key Takeaways
Scalability combines sharding (distribute data), caching (reduce computation), and approximate search (trade precision for speed)
Single machine limit: 1TB RAM for 1B embeddings, ~100 QPS at 10ms per query
All three techniques compound—you need all at scale for 100k+ QPS
📌 Interview Tips
1. Start with the three pillars when asked about scaling search systems
2. Mention concrete numbers: 1B docs = 1TB, target 80-95% cache hit rate