Production Architecture: Integrating Sharding, Caching, and ANN
Core Concept
A production ML search system integrates sharding, caching, and ANN into a unified architecture with predictable latency and cost.
REQUEST FLOW
Query → L1 cache (100μs) → L2 cache (2ms) → Shard routing → Fan-out to shards → ANN per shard (5ms) → Merge → Rerank → Response. Total: 15-50ms at 100k QPS.
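A minimal end-to-end sketch of this flow in Python (the embedding function, cache layers, and shard interface are illustrative stand-ins, with brute-force scoring in place of a real per-shard ANN index):

```python
import numpy as np

def embed(query: str, dim: int = 64) -> np.ndarray:
    # Stand-in for a real encoder model: hash-seeded random unit vector.
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng.standard_normal(dim).astype(np.float32)
    return v / np.linalg.norm(v)

class Shard:
    """One slice of the corpus; brute force stands in for the per-shard
    ANN index (HNSW in the architecture above)."""
    def __init__(self, ids, vectors):
        self.ids, self.vectors = ids, vectors

    def ann_search(self, q, k):
        scores = self.vectors @ q
        top = np.argsort(-scores)[:k]
        return [(self.ids[i], float(scores[i])) for i in top]

def search(query, l1, l2, shards, k=10):
    if query in l1:                      # L1: in-process dict, ~100 us
        return l1[query]
    if query in l2:                      # L2: shared cache (e.g. Redis), ~2 ms
        l1[query] = l2[query]
        return l1[query]
    q = embed(query)
    merged = []
    for shard in shards:                 # fan-out; parallel in production
        merged.extend(shard.ann_search(q, k))
    merged.sort(key=lambda t: -t[1])     # merge by score
    results = merged[:k]                 # a reranker would rescore these here
    l2[query] = l1[query] = results
    return results

# Toy usage: 4 shards of 1,000 vectors each.
rng = np.random.default_rng(0)
shards = [Shard(list(range(s * 1000, (s + 1) * 1000)),
                rng.standard_normal((1000, 64)).astype(np.float32))
          for s in range(4)]
print(search("neural search", {}, {}, shards)[:3])
```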
COMPONENT SIZING
Shards: 1B vectors ÷ 50M/shard = 20 shards × 3 replicas = 60 total. Cache: L1 10GB (hot), L2 Redis 100GB (warm). ANN: HNSW M=16, efSearch=64 for 98% recall at 5ms. Memory: 50GB/shard.
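The sizing arithmetic as a back-of-envelope script (vector dimensionality and per-link cost are assumptions, chosen so the totals land near the figures above):

```python
total_vectors = 1_000_000_000
per_shard     = 50_000_000
replicas      = 3
dim, m        = 128, 16                      # assumed: 128-d float32, HNSW M=16

shards   = total_vectors // per_shard        # 20 primary shards
machines = shards * replicas                 # 60 shard replicas in total

vec_gb  = per_shard * dim * 4 / 1e9          # ~25.6 GB raw vectors per shard
link_gb = per_shard * m * 2 * 4 / 1e9        # ~6.4 GB HNSW layer-0 links
# + IDs, upper graph layers, and OS headroom -> the ~50 GB/shard above
print(shards, machines, round(vec_gb + link_gb, 1))
```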
💡 Key Insight: Size each component independently, then validate them together. Cache hit rate determines shard load, and ANN recall bounds reranking quality. Tune holistically after deployment.
OPERATIONS
Deploy: roll out shard-by-shard. Monitor: per-shard latency, cache hit rates, and ANN recall vs. brute force. Scale: add replicas for QPS, add shards for data growth. Index updates: build offline, swap atomically.
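One way to monitor ANN recall against brute force, sketched with the faiss library on a toy corpus (the HNSW parameters match those above; the corpus size and query count are illustrative, and faiss-cpu is assumed installed):

```python
import faiss
import numpy as np

d, n, nq, k = 128, 100_000, 1_000, 10
rng = np.random.default_rng(0)
xb = rng.standard_normal((n, d)).astype(np.float32)
xq = rng.standard_normal((nq, d)).astype(np.float32)

flat = faiss.IndexFlatL2(d)              # exact baseline
flat.add(xb)
_, gt = flat.search(xq, k)

hnsw = faiss.IndexHNSWFlat(d, 16)        # M=16
hnsw.hnsw.efSearch = 64
hnsw.add(xb)
_, ann = hnsw.search(xq, k)

recall = np.mean([len(set(a) & set(g)) / k for a, g in zip(ann, gt)])
print(f"recall@{k}: {recall:.3f}")       # alert if this drifts below target
```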
COST BREAKDOWN
1B vectors at 100k QPS: compute (60 replicas) ~$…k/month, memory (3TB) ~$…k/month, cache ~$…k/month; total ~$…k/month. IVF-PQ cuts this to ~$…k/month at ~3x latency.
⚠️ Key Trade-off: Memory is the dominant cost. Trading latency for memory efficiency (HNSW → IVF-PQ) cuts costs by 50%+. Evaluate against your latency SLAs.
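A rough memory comparison showing why the HNSW → IVF-PQ move pays off (128-d float32 vectors and 16-byte PQ codes are assumptions):

```python
n, dim, replicas = 1_000_000_000, 128, 3     # assumed 128-d float32
flat = n * dim * 4                           # raw vectors: ~0.51 TB
hnsw = flat + n * 16 * 2 * 4                 # + layer-0 links (M=16): ~0.64 TB
pq   = n * 16                                # 16-byte PQ codes: ~0.016 TB

for name, b in [("HNSW (flat vectors)", hnsw), ("IVF-PQ", pq)]:
    print(f"{name}: {b * replicas / 1e12:.2f} TB across {replicas} replicas")
# Memory shrinks ~40x; total cost savings are smaller because compute
# remains, consistent with the 50%+ figure above.
```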
💡 Key Takeaways
✓ Request flow: L1 cache → L2 cache → shard routing → ANN per shard → merge → rerank
✓ Sizing: 1B vectors needs ~20 shards × 3 replicas, 100GB L2 cache, 50GB per shard
✓ Memory is the dominant cost; IVF-PQ can cut costs 50%+ with a latency trade-off
📌 Interview Tips
1. Walk through the request flow with concrete latency numbers at each stage
2. Provide a cost breakdown and an optimization path (HNSW → IVF-PQ)