
Trade-offs: Latency, Cost, Accuracy, and Freshness

Core Concept
Scaling ML search requires balancing latency, cost, accuracy, and freshness. Each technique improves one dimension at the expense of another.

LATENCY VS COST

Lower latency requires more resources. HNSW with full vectors in RAM: ~2ms queries but ~1TB of memory. IVF-PQ: ~20ms on ~64GB. A 10x latency reduction can mean a 20x cost increase, since serving cost scales with the RAM footprint. Define latency SLAs first, then optimize cost within that budget.
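A back-of-envelope sketch of where that memory gap comes from; the corpus size, dimension, graph degree, and PQ code size are illustrative assumptions, not measurements:

```python
# RAM estimate for ANN serving; all numbers are illustrative.
n, d = 300_000_000, 768          # 300M vectors, 768-dim embeddings

# HNSW over full float32 vectors: 4 bytes per dimension, plus the
# neighbor graph (~M*2 int32 links per vector at the base layer, M=32).
full_gb = n * d * 4 / 1e9
links_gb = n * 32 * 2 * 4 / 1e9
print(f"HNSW, full vectors: ~{full_gb + links_gb:,.0f} GB")   # ~1 TB

# IVF-PQ with 64-byte PQ codes plus a 4-byte id per vector.
pq_gb = n * (64 + 4) / 1e9
print(f"IVF-PQ, 64-byte codes: ~{pq_gb:,.0f} GB")             # ~20 GB
```

Exact totals depend on M, code size, and serving overhead, but this order-of-magnitude RAM gap is what drives the cost gap.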

ACCURACY VS SPEED

Approximate search trades recall for speed. 99% recall at 5ms vs 95% at 1ms. Missing 5% may be OK for recommendations but not for search. ANN parameters (HNSW M, efSearch) control this—tune based on accuracy needs.
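A minimal sketch of that tuning loop using Faiss (assumes faiss-cpu and numpy are installed); the data is synthetic and the parameter values are illustrative:

```python
import numpy as np
import faiss

d, nb, nq, k = 128, 100_000, 1_000, 10
rng = np.random.default_rng(0)
xb = rng.standard_normal((nb, d)).astype("float32")
xq = rng.standard_normal((nq, d)).astype("float32")

# Ground truth from exact brute-force search.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, gt = flat.search(xq, k)

# HNSW index; M (here 32) fixes graph degree at build time.
hnsw = faiss.IndexHNSWFlat(d, 32)
hnsw.add(xb)

# efSearch widens the query-time candidate queue: higher = more
# accurate but slower. Sweep it and measure recall@k vs ground truth.
for ef in (16, 64, 256):
    hnsw.hnsw.efSearch = ef
    _, ids = hnsw.search(xq, k)
    recall = np.mean([len(set(ids[i]) & set(gt[i])) / k
                      for i in range(nq)])
    print(f"efSearch={ef}: recall@{k} ~ {recall:.3f}")
```

efSearch is adjustable per query at serving time, while M trades build-time memory for graph quality and is fixed once the index is built.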

💡 Key Insight: Recall@100 matters more than Recall@10 for ranking. Retrieve top 100, rerank to find top 10. 95% recall + perfect reranking = near-perfect final results.
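A sketch of that retrieve-then-rerank pattern, assuming a Faiss-style index.search signature; exact_score is a hypothetical stand-in for any precise scorer (full-precision dot product, cross-encoder, etc.):

```python
import numpy as np

def retrieve_then_rerank(query_vec, ann_index, exact_score,
                         k_retrieve=100, k_final=10):
    # Stage 1: cheap ANN retrieval. Recall only needs to be high
    # at k_retrieve, not at k_final.
    _, ids = ann_index.search(query_vec[None, :].astype("float32"),
                              k_retrieve)
    candidates = ids[0]

    # Stage 2: precise rescoring of the small candidate set recovers
    # the exact ordering among whatever the ANN stage found.
    scores = np.array([exact_score(query_vec, doc_id)
                       for doc_id in candidates])
    return candidates[np.argsort(-scores)[:k_final]]
```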

FRESHNESS VS EFFICIENCY

Caching improves latency but serves stale data. For recommendations, 1-hour staleness is acceptable. For news, 5-minute freshness is required. Choose cache TTLs based on content velocity.
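A minimal sketch of per-content-type TTLs using cachetools (assumed installed); the cache sizes and compute_fn are placeholders:

```python
from cachetools import TTLCache

# TTLs keyed to content velocity (values from the text above).
rec_cache = TTLCache(maxsize=100_000, ttl=3600)  # recommendations: 1 h
news_cache = TTLCache(maxsize=100_000, ttl=300)  # news: 5 min

def cached_search(query, cache, compute_fn):
    """Serve from cache while the entry is fresh; recompute on expiry."""
    if query in cache:            # expired entries count as misses
        return cache[query]
    results = compute_fn(query)   # placeholder for the expensive call
    cache[query] = results
    return results
```

Fast-moving content gets a short TTL so staleness stays bounded; slow-moving content gets a long TTL so hit rate stays high.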

DECISION FRAMEWORK

Step 1: Define the latency SLA.
Step 2: Define the accuracy target (recall).
Step 3: Define the freshness requirement.
Step 4: Estimate cost for each candidate configuration.
Step 5: Choose the cheapest configuration that meets all requirements (a sketch follows).
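A toy encoding of the framework in Python; the configurations and all of their numbers are made up for illustration:

```python
# (name, p99 latency ms, recall@100, staleness s, monthly cost $)
configs = [
    ("hnsw-full",   2, 0.99,    0, 50_000),
    ("hnsw-cached", 3, 0.99, 3600, 12_000),
    ("ivf-pq",     20, 0.95,    0,  4_000),
]

# Steps 1-3: fix the constraints.
latency_sla_ms, recall_target, max_staleness_s = 25, 0.95, 600

# Step 4: keep only configurations that satisfy every constraint.
feasible = [c for c in configs
            if c[1] <= latency_sla_ms
            and c[2] >= recall_target
            and c[3] <= max_staleness_s]

# Step 5: minimize cost over the feasible set.
best = min(feasible, key=lambda c: c[4])
print(best[0])  # -> "ivf-pq"
```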

⚠️ Key Trade-off: Fast + accurate + fresh + cheap is impossible. Pick your constraints, accept trade-offs on the rest.
💡 Key Takeaways
10x latency reduction can mean 20x cost increase
Approximate search trades recall for speed—tune to requirements
Freshness vs efficiency: cache TTLs depend on content velocity
📌 Interview Tips
1. Use the decision framework: define latency, accuracy, and freshness targets, then minimize cost.
2. Recall@100 matters more than Recall@10: retrieve broadly, rerank precisely.