What is ML Search Scalability and Why It Matters
THE SCALABILITY CHALLENGE
A single machine cannot serve production ML search. At 1KB per embedding, a billion documents require 1TB of RAM. At 10ms per query, one machine handles roughly 100 QPS, while a major platform needs 100k QPS with p99 latency under 50ms. The solution: distribute the data (sharding), avoid repeated computation (caching), and trade a little precision for speed (approximate search).
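To make the arithmetic concrete, here is a minimal back-of-envelope sketch in Python. The inputs are the figures above; the assumption that one machine serves queries one at a time, entirely from memory and with no cache, is a simplification for illustration.

num_docs = 1_000_000_000        # 1B documents
bytes_per_embedding = 1_024     # ~1KB each (e.g. 256 float32 values)
query_latency_s = 0.010         # 10ms per exact-search query
target_qps = 100_000            # platform-wide load

ram_tb = num_docs * bytes_per_embedding / 1e12   # ~1.0 TB just for embeddings
qps_per_machine = 1 / query_latency_s            # ~100 QPS served serially
machines_needed = target_qps / qps_per_machine   # ~1,000 machines if every query
                                                 # really takes the full 10ms
print(ram_tb, qps_per_machine, machines_needed)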
THREE PILLARS OF SCALABILITY
Sharding: split the index across machines, with each shard holding a portion of the documents. Queries fan out to every shard, and the per-shard results are merged into one ranked list. Caching: keep frequently accessed embeddings and features in memory, so cache hits skip expensive computation. Approximate search: use algorithms like HNSW that find 95%+ of the true nearest neighbors in about 1ms, versus 100ms+ for exact search.
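A minimal sketch of how the three pillars fit together in one process, assuming NumPy is available. The Shard class, the encode() stub, and the brute-force dot-product scoring are hypothetical stand-ins: a real deployment would run shards on separate machines and use an ANN index such as HNSW inside each shard, while lru_cache here plays the role of the embedding/feature cache.

from functools import lru_cache
import heapq
import numpy as np

DIM = 256

def encode(text):
    # Hypothetical stand-in for an embedding model:
    # a deterministic pseudo-random vector per query string.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(DIM).astype(np.float32)

class Shard:
    """One slice of the corpus. Brute-force scoring stands in for an ANN index."""
    def __init__(self, doc_ids, embeddings):
        self.doc_ids = doc_ids        # ids of the documents in this shard
        self.embeddings = embeddings  # (n_docs, DIM) float32 matrix

    def search(self, query_vec, k):
        scores = self.embeddings @ query_vec   # dot-product relevance
        top = np.argsort(-scores)[:k]          # local top-k
        return [(float(scores[i]), self.doc_ids[i]) for i in top]

@lru_cache(maxsize=100_000)
def cached_query_embedding(query_text):
    # Caching pillar: repeated queries skip the expensive embedding call.
    return encode(query_text)

def search_all(shards, query_text, k=10):
    q = cached_query_embedding(query_text)
    # Sharding pillar: fan the query out to every shard, then merge.
    partial = [hit for shard in shards for hit in shard.search(q, k)]
    return heapq.nlargest(k, partial)          # global top-k by score

# Example: two shards of 1,000 random documents each.
shards = [Shard(list(range(i * 1000, (i + 1) * 1000)),
                np.random.rand(1000, DIM).astype(np.float32))
          for i in range(2)]
print(search_all(shards, "red running shoes", k=5))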
SCALE NUMBERS TO KNOW
Embedding size: 256-1024 floats (1-4KB).
Index size at 1B docs: 1-4TB.
Shard count: 20-100 for TB-scale indexes.
Cache hit rate target: 80-95%.
ANN recall target: 95-99%.
Query fanout overhead: 2-5ms per shard tier.
Replication factor: 3x for fault tolerance.
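These numbers can be combined into a rough cluster-sizing estimate. The sketch below is illustrative only: the 64GB-of-embeddings-per-shard budget, the per-shard replication, and the way the cache hit rate discounts backend load are assumptions, not a published sizing rule.

def cluster_size(num_docs, bytes_per_vec, ram_per_shard_gb=64,
                 replication=3, cache_hit_rate=0.90, total_qps=100_000):
    """Back-of-envelope shard and machine counts; every default is an assumption."""
    index_tb = num_docs * bytes_per_vec / 1e12
    shards = -(-num_docs * bytes_per_vec // int(ram_per_shard_gb * 1e9))  # ceiling division
    machines = shards * replication                  # 3x copies for fault tolerance
    shard_qps = total_qps * (1 - cache_hit_rate)     # only cache misses reach the shards
    return {"index_tb": index_tb, "shards": int(shards),
            "machines": int(machines), "shard_qps": shard_qps}

# 1B docs at 2KB/vector: ~2TB index, 32 shards, 96 machines, 10k QPS past the cache.
print(cluster_size(num_docs=1_000_000_000, bytes_per_vec=2_048))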