
Trade-offs: Freshness, Recall, Latency, and Cost

Every index management decision trades off four constraints: freshness (how quickly new items become searchable), recall at K (what fraction of the true top-K neighbors you return), latency (p50 and p99 query time), and cost (CPU, memory, storage). Optimizing one degrades the others.

Freshness versus throughput is the classic trade-off. Refreshing the index every 1 second makes new items searchable almost immediately, ideal for real-time inventory or trending content. But frequent refreshes create many small segments that must be merged, increasing write amplification by 3x to 5x and spiking p99 latencies during merges. Batching updates into 30-second or 5-minute windows improves throughput from 2,000 to 10,000 writes per second per shard but increases staleness. Elasticsearch users targeting sub-second freshness often provision 2x to 3x more CPU and memory to absorb merge load.

Latency versus recall is tuned via probing parameters. For Inverted File (IVF) indexes, probing 10 lists gives 90 percent recall at 10 milliseconds per query; probing 50 lists reaches 98 percent recall but takes 35 milliseconds. For Hierarchical Navigable Small World (HNSW) graphs, exploring 32 neighbors yields 92 percent recall at 12 milliseconds, while 128 neighbors achieves 99 percent recall at 45 milliseconds. The marginal cost of the last 5 percent of recall often doubles latency. Meta and Google tune for 95 to 97 percent recall in production, accepting that 3 to 5 of the true top 100 items are missed to stay within latency budgets.

Memory versus accuracy is controlled by quantization. Product Quantization at 16 bytes per vector reduces memory by 15x versus float32 but introduces distance errors that drop recall by 1 to 3 percent. Residual quantization adds 8 bytes per vector, improving recall by 2 percent but increasing memory cost by 50 percent. At scale this is significant: 500 million vectors at 24 bytes is 12 gigabytes versus 8 gigabytes at 16 bytes, requiring 50 percent more nodes.
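The latency-versus-recall knob is easy to measure yourself. The following is a minimal sketch using FAISS on synthetic data; the dataset size, dimensionality, number of coarse lists, and the 16-byte PQ setting are illustrative assumptions rather than the production configurations quoted above, so the absolute recall and latency numbers will differ.

```python
import time
import numpy as np
import faiss  # pip install faiss-cpu

d, nb, nq, k = 128, 100_000, 1_000, 10           # toy sizes (assumptions)
rng = np.random.default_rng(0)
xb = rng.random((nb, d), dtype=np.float32)       # database vectors
xq = rng.random((nq, d), dtype=np.float32)       # query vectors

# Exact search provides the ground-truth top-k used to measure recall.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, gt = flat.search(xq, k)

def recall_at_k(pred, truth):
    # Fraction of true top-k ids that also appear in the returned top-k.
    return np.mean([len(set(p) & set(t)) / k for p, t in zip(pred, truth)])

# IVF-PQ: 256 coarse lists, 16-byte codes (16 sub-quantizers x 8 bits each).
nlist, m = 256, 16
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
ivfpq.train(xb)
ivfpq.add(xb)

for nprobe in (1, 10, 50):                       # the IVF latency/recall knob
    ivfpq.nprobe = nprobe
    t0 = time.perf_counter()
    _, pred = ivfpq.search(xq, k)
    ms = (time.perf_counter() - t0) / nq * 1000
    print(f"IVF-PQ nprobe={nprobe:3d}  recall@{k}={recall_at_k(pred, gt):.3f}  {ms:.2f} ms/query")

# HNSW: efSearch plays the same role as nprobe does for IVF.
hnsw = faiss.IndexHNSWFlat(d, 32)                # 32 graph links per node
hnsw.add(xb)
for ef in (32, 128):                             # the HNSW latency/recall knob
    hnsw.hnsw.efSearch = ef
    t0 = time.perf_counter()
    _, pred = hnsw.search(xq, k)
    ms = (time.perf_counter() - t0) / nq * 1000
    print(f"HNSW   efSearch={ef:3d}  recall@{k}={recall_at_k(pred, gt):.3f}  {ms:.2f} ms/query")
```

Sweeping nprobe or efSearch this way and plotting recall against per-query latency is how the 95 to 97 percent "sweet spot" above is typically chosen for a given corpus.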
💡 Key Takeaways
Freshness under 1 second requires 2x to 3x the compute. Frequent segment writes and merges spike CPU and IO. Elasticsearch users report that refresh intervals below 5 seconds cause constant merge activity, pushing CPU utilization from 40 percent to 80 percent.
Recall and latency follow a power-law curve. The first 90 percent of recall is cheap (10 milliseconds), but reaching 99 percent costs roughly 4x the latency (40 milliseconds). Google and Meta target 95 to 97 percent recall as the sweet spot.
Quantization saves memory but costs accuracy. Product Quantization at 8 bytes per vector achieves 93 percent recall, 16 bytes reaches 96 percent, and 32 bytes hits 98 percent. Each doubling of bytes gains 2 to 3 percent recall.
Shard count trades efficiency for isolation. 32 shards provide better failure isolation and write distribution than 8 shards, but coordination overhead increases from 5 milliseconds to 15 milliseconds due to more network round trips.
Global versus local indexes trade consistency for simplicity. A global index spanning shards returns perfect top K but is expensive to maintain under writes. Local indexes per shard are simpler and faster to update but require fanout and approximate merging.
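To make the local-index path concrete, here is a minimal scatter-gather sketch. It assumes each shard exposes a FAISS-style search(query, k) returning distances and ids that are already globally unique; shard routing, timeouts, and retries are omitted, and the function name fanout_search is just for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def fanout_search(shard_indexes, query, k):
    """Query every shard's local ANN index, then merge the per-shard top-k lists.

    The merged result is only as good as each shard's local answer: if a shard's
    approximate index misses a true neighbor, the global top-k misses it too,
    which is the recall cost of local indexes described above.
    """
    def search_one(shard):
        # Local top-k from one shard (FAISS-style API assumed).
        dists, ids = shard.search(query.reshape(1, -1), k)
        return list(zip(dists[0].tolist(), ids[0].tolist()))

    # Scatter: query all shards in parallel.
    with ThreadPoolExecutor(max_workers=len(shard_indexes)) as pool:
        partial_results = list(pool.map(search_one, shard_indexes))

    # Gather: keep the k smallest distances across all shards' candidates.
    candidates = (hit for shard_hits in partial_results for hit in shard_hits)
    return heapq.nsmallest(k, candidates, key=lambda hit: hit[0])

# Illustrative shard setup with global ids, e.g.:
#   shard = faiss.IndexIDMap(faiss.IndexFlatL2(d))
#   shard.add_with_ids(shard_vectors, global_ids)
```

The coordination overhead mentioned above comes from exactly this fan-out: more shards means more parallel round trips to wait on, and the slowest shard sets the p99.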
📌 Examples
Pinterest uses 30-second refresh intervals for pin search, balancing sub-minute freshness with manageable merge load. This supports 5,000 writes per second per shard at 25 millisecond p99 latency.
Meta FAISS deployments tune Inverted File probing to 32 lists out of 4,096, achieving 96 percent recall at 20 millisecond p99. Probing 64 lists would reach 98 percent recall but double latency to 40 milliseconds.
Spotify reported memory constraints with full precision embeddings. Switching to 16 byte Product Quantization reduced memory from 60 gigabytes to 5 gigabytes per 100 million vectors, with recall dropping from 98 percent to 96 percent.
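The memory arithmetic behind figures like these is easy to reproduce. The sketch below assumes 150-dimensional float32 embeddings (chosen so the full-precision figure lands near 60 gigabytes) plus an 8-byte id per vector; the helper name index_memory_gb is hypothetical. Real indexes add further overhead for inverted lists, graph links, and metadata, which is why reported footprints exceed the raw code size.

```python
def index_memory_gb(num_vectors, dim, code_bytes=None, id_bytes=8):
    """Back-of-envelope footprint: stored vectors (or compressed codes) plus one id each.

    code_bytes=None means full-precision float32 storage (4 bytes per dimension);
    otherwise each vector is stored only as a compressed code of that size.
    """
    per_vector = (dim * 4 if code_bytes is None else code_bytes) + id_bytes
    return num_vectors * per_vector / 1e9

n, d = 100_000_000, 150                       # assumed corpus size and dimensionality
print(index_memory_gb(n, d))                  # ~60.8 GB at full precision
print(index_memory_gb(n, d, code_bytes=16))   # ~2.4 GB of raw 16-byte PQ codes plus ids
```

The gap between the roughly 2.4 gigabytes of raw codes and a reported footprint of around 5 gigabytes is plausibly accounted for by per-vector index overhead such as inverted-list or graph structures and id maps.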