
Production Architecture: Integrating Sharding, Caching, and ANN

Core Concept
A production ML search system integrates sharding, caching, and ANN into a unified architecture with predictable latency and cost.

REQUEST FLOW

Query → L1 cache (100μs) → L2 cache (2ms) → Shard routing → Fan-out to shards → ANN per shard (5ms) → Merge → Rerank → Response. Total: 15-50ms at 100k QPS.
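A minimal sketch of this flow in Python, under assumed interfaces: the cache objects, shard clients, and reranker are hypothetical dependencies, and the query is assumed to be a NumPy vector (so its bytes can serve as a cache key).

```python
import asyncio
import hashlib

class SearchService:
    """Illustrative request flow: L1 -> L2 -> fan-out -> merge -> rerank.
    All dependencies (caches, shard clients, reranker) are assumed interfaces."""

    def __init__(self, l1_cache, l2_cache, shard_clients, reranker):
        self.l1 = l1_cache            # in-process LRU, ~100us lookups
        self.l2 = l2_cache            # Redis-like remote cache, ~2ms lookups
        self.shards = shard_clients   # one async client per shard
        self.reranker = reranker

    async def search(self, query_vec, k=10):
        key = hashlib.sha1(query_vec.tobytes()).hexdigest()

        # L1: in-process cache.
        if (hit := self.l1.get(key)) is not None:
            return hit

        # L2: remote cache; promote hits into L1.
        if (hit := await self.l2.get(key)) is not None:
            self.l1.put(key, hit)
            return hit

        # Fan out to all shards in parallel; each runs ANN locally (~5ms).
        per_shard = await asyncio.gather(
            *(s.ann_search(query_vec, k) for s in self.shards)
        )

        # Merge: keep the global top candidates by score across shards,
        # over-fetching so the reranker has room to reorder.
        candidates = sorted(
            (c for shard in per_shard for c in shard),
            key=lambda c: c.score, reverse=True,
        )[: k * 5]

        results = self.reranker.rerank(query_vec, candidates)[:k]
        await self.l2.put(key, results)
        self.l1.put(key, results)
        return results
```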

COMPONENT SIZING

Shards: 1B vectors ÷ 50M/shard = 20 shards × 3 replicas = 60 total. Cache: L1 10GB (hot), L2 Redis 100GB (warm). ANN: HNSW M=16, efSearch=64 for 98% recall at 5ms. Memory: 50GB/shard.
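The stated ANN parameters map directly onto an HNSW library such as hnswlib. A sketch of one shard's index follows; the vector dimensionality (`DIM=128`) and `ef_construction` value are assumptions not given in the text, and the toy element count stands in for the ~50M vectors a real shard holds.

```python
import numpy as np
import hnswlib

DIM = 128     # assumption: dimensionality is not specified in the text
N = 1_000     # toy size; a production shard holds ~50M vectors

index = hnswlib.Index(space="ip", dim=DIM)   # inner-product similarity
index.init_index(
    max_elements=N,
    M=16,                 # graph degree: memory vs. recall trade-off
    ef_construction=200,  # build-time beam width (assumed value)
)

vectors = np.random.rand(N, DIM).astype(np.float32)
index.add_items(vectors, np.arange(N))

# efSearch=64: query-time beam width. Higher -> better recall, slower queries;
# the text cites 98% recall at ~5ms with these settings at shard scale.
index.set_ef(64)
labels, distances = index.knn_query(vectors[:1], k=10)
```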

💡 Key Insight: Size independently, validate together. Cache hit rate affects shard load. ANN recall affects reranking. Tune holistically after deployment.

OPERATIONS

Deploy: Roll shard-by-shard. Monitor: Per-shard latency, cache hits, ANN recall vs brute-force. Scale: Replicas for QPS, shards for data. Index updates: Build offline, swap atomically.
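One common way to realize the "build offline, swap atomically" step is to serve queries through a holder object and promote the new index with a single reference assignment. A sketch, with a hypothetical holder class:

```python
import threading

class ShardIndexHolder:
    """Serve from the live index while a replacement is built offline,
    then promote it with one reference swap (atomic in CPython)."""

    def __init__(self, index):
        self._index = index
        self._lock = threading.Lock()

    def query(self, vec, k):
        # Readers grab the current reference; a swapped-out index stays
        # valid for any queries already in flight.
        return self._index.knn_query(vec, k=k)

    def swap(self, new_index):
        # Validate before promoting, e.g. spot-check recall vs. brute force.
        with self._lock:
            old, self._index = self._index, new_index
        return old  # caller frees it once in-flight queries drain
```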

COST BREAKDOWN

1B vectors at 100k QPS: the main monthly line items are compute (60 replicas), memory (~3TB RAM across shards), and the cache tier, with memory the largest share. Switching HNSW → IVF-PQ compresses vectors in memory, cutting total cost 50%+ at roughly 3x query latency.

⚠️ Key Trade-off: Memory is dominant cost. Trading latency for efficiency (HNSW → IVF-PQ) cuts costs 50%+. Evaluate against SLAs.
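A toy cost model makes the trade-off concrete. The unit prices below are placeholders, not figures from the text; substitute your cloud's actual $/replica-month and $/GB-month. The ~10x RAM reduction for IVF-PQ is an assumed compression ratio for illustration.

```python
# PLACEHOLDER unit prices, chosen only so memory dominates as stated above.
PRICE_PER_REPLICA = 150.0   # hypothetical $/replica-month (compute)
PRICE_PER_GB_RAM = 8.0      # hypothetical $/GB-month (memory)

def monthly_cost(replicas, ram_gb_per_replica, cache_gb):
    compute = replicas * PRICE_PER_REPLICA
    memory = replicas * ram_gb_per_replica * PRICE_PER_GB_RAM
    cache = cache_gb * PRICE_PER_GB_RAM
    return compute + memory + cache

hnsw = monthly_cost(replicas=60, ram_gb_per_replica=50, cache_gb=100)
# Model IVF-PQ as ~1/10 the per-replica RAM (assumed compression ratio).
ivf_pq = monthly_cost(replicas=60, ram_gb_per_replica=5, cache_gb=100)
print(f"HNSW ~${hnsw:,.0f}/mo vs IVF-PQ ~${ivf_pq:,.0f}/mo")
```

With these placeholders, memory is ~70% of the HNSW bill and the IVF-PQ variant lands at well under half the total, consistent with the 50%+ savings cited above.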
💡 Key Takeaways
- Request flow: L1 cache → L2 cache → shard routing → ANN per shard → merge → rerank
- Sizing: 1B vectors needs ~20 shards × 3 replicas, 100GB L2 cache, 50GB per shard
- Memory is the dominant cost; IVF-PQ can cut costs 50%+ with a latency trade-off
📌 Interview Tips
1. Walk through the request flow with concrete latency numbers at each stage
2. Provide a cost breakdown and the optimization path (HNSW → IVF-PQ)