
Trade-offs: Latency, Cost, Accuracy, and Freshness

Core Concept
Scaling ML search requires balancing latency, cost, accuracy, and freshness. Each technique improves one dimension at the expense of another.

LATENCY VS COST

Lower latency requires more resources. HNSW with full vectors in RAM: ~2ms queries but ~1TB of memory. IVF-PQ: ~20ms on ~64GB. A 10x latency reduction can mean a 20x cost increase, since serving cost scales with the RAM footprint. Define latency SLAs first, then optimize cost within that budget.
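A back-of-envelope sketch of where that memory gap comes from; the corpus size, dimension, graph degree, and PQ code size are illustrative assumptions, not measurements:

```python
# RAM estimate for ANN serving; all numbers are illustrative.
n, d = 300_000_000, 768          # 300M vectors, 768-dim embeddings

# HNSW over full float32 vectors: 4 bytes per dimension, plus the
# neighbor graph (~M*2 int32 links per vector at the base layer, M=32).
full_gb = n * d * 4 / 1e9
links_gb = n * 32 * 2 * 4 / 1e9
print(f"HNSW, full vectors: ~{full_gb + links_gb:,.0f} GB")   # ~1 TB

# IVF-PQ with 64-byte PQ codes plus a 4-byte id per vector.
pq_gb = n * (64 + 4) / 1e9
print(f"IVF-PQ, 64-byte codes: ~{pq_gb:,.0f} GB")             # ~20 GB
```

Exact totals depend on M, code size, and serving overhead, but this order-of-magnitude RAM gap is what drives the cost gap.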

ACCURACY VS SPEED

Approximate search trades recall for speed. 99% recall at 5ms vs 95% at 1ms. Missing 5% may be OK for recommendations but not for search. ANN parameters (HNSW M, efSearch) control this—tune based on accuracy needs.
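A minimal sketch of that tuning loop using Faiss (assumes faiss-cpu and numpy are installed); the data is synthetic and the parameter values are illustrative:

```python
import numpy as np
import faiss

d, nb, nq, k = 128, 100_000, 1_000, 10
rng = np.random.default_rng(0)
xb = rng.standard_normal((nb, d)).astype("float32")
xq = rng.standard_normal((nq, d)).astype("float32")

# Ground truth from exact brute-force search.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, gt = flat.search(xq, k)

# HNSW index; M (here 32) fixes graph degree at build time.
hnsw = faiss.IndexHNSWFlat(d, 32)
hnsw.add(xb)

# efSearch widens the query-time candidate queue: higher = more
# accurate but slower. Sweep it and measure recall@k vs ground truth.
for ef in (16, 64, 256):
    hnsw.hnsw.efSearch = ef
    _, ids = hnsw.search(xq, k)
    recall = np.mean([len(set(ids[i]) & set(gt[i])) / k
                      for i in range(nq)])
    print(f"efSearch={ef}: recall@{k} ~ {recall:.3f}")
```

efSearch is adjustable per query at serving time, while M trades build-time memory for graph quality and is fixed once the index is built.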

💡 Key Insight: Recall@100 matters more than Recall@10 for ranking. Retrieve top 100, rerank to find top 10. 95% recall + perfect reranking = near-perfect final results.
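A sketch of that retrieve-then-rerank pattern, assuming a Faiss-style index.search signature; exact_score is a hypothetical stand-in for any precise scorer (full-precision dot product, cross-encoder, etc.):

```python
import numpy as np

def retrieve_then_rerank(query_vec, ann_index, exact_score,
                         k_retrieve=100, k_final=10):
    # Stage 1: cheap ANN retrieval. Recall only needs to be high
    # at k_retrieve, not at k_final.
    _, ids = ann_index.search(query_vec[None, :].astype("float32"),
                              k_retrieve)
    candidates = ids[0]

    # Stage 2: precise rescoring of the small candidate set recovers
    # the exact ordering among whatever the ANN stage found.
    scores = np.array([exact_score(query_vec, doc_id)
                       for doc_id in candidates])
    return candidates[np.argsort(-scores)[:k_final]]
```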

FRESHNESS VS EFFICIENCY

Caching improves latency but serves stale data. For recommendations, 1-hour staleness is acceptable. For news, 5-minute freshness is required. Choose cache TTLs based on content velocity.
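A minimal sketch of per-content-type TTLs using cachetools (assumed installed); the cache sizes and compute_fn are placeholders:

```python
from cachetools import TTLCache

# TTLs keyed to content velocity (values from the text above).
rec_cache = TTLCache(maxsize=100_000, ttl=3600)  # recommendations: 1 h
news_cache = TTLCache(maxsize=100_000, ttl=300)  # news: 5 min

def cached_search(query, cache, compute_fn):
    """Serve from cache while the entry is fresh; recompute on expiry."""
    if query in cache:            # expired entries count as misses
        return cache[query]
    results = compute_fn(query)   # placeholder for the expensive call
    cache[query] = results
    return results
```

Fast-moving content gets a short TTL so staleness stays bounded; slow-moving content gets a long TTL so hit rate stays high.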

DECISION FRAMEWORK

Step 1: Define the latency SLA.
Step 2: Define the accuracy target (recall).
Step 3: Define the freshness requirement.
Step 4: Estimate cost for each candidate configuration.
Step 5: Choose the cheapest configuration that meets all requirements (a sketch follows).
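A toy encoding of the framework in Python; the configurations and all of their numbers are made up for illustration:

```python
# (name, p99 latency ms, recall@100, staleness s, monthly cost $)
configs = [
    ("hnsw-full",   2, 0.99,    0, 50_000),
    ("hnsw-cached", 3, 0.99, 3600, 12_000),
    ("ivf-pq",     20, 0.95,    0,  4_000),
]

# Steps 1-3: fix the constraints.
latency_sla_ms, recall_target, max_staleness_s = 25, 0.95, 600

# Step 4: keep only configurations that satisfy every constraint.
feasible = [c for c in configs
            if c[1] <= latency_sla_ms
            and c[2] >= recall_target
            and c[3] <= max_staleness_s]

# Step 5: minimize cost over the feasible set.
best = min(feasible, key=lambda c: c[4])
print(best[0])  # -> "ivf-pq"
```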

⚠️ Key Trade-off: Fast + accurate + fresh + cheap is impossible. Pick your constraints, accept trade-offs on the rest.
💡 Key Takeaways
10x latency reduction can mean 20x cost increase
Approximate search trades recall for speed—tune to requirements
Freshness vs efficiency: cache TTLs depend on content velocity
📌 Interview Tips
1. Use the decision framework: define latency, accuracy, and freshness targets, then minimize cost.
2. Recall@100 matters more than Recall@10: retrieve broadly, rerank precisely.