
Production Failure Modes and Operational Challenges

INDEX STALENESS

Indexes built on old data miss new vectors. If you rebuild weekly but add 1M items daily, up to 7M items are invisible to search just before the next rebuild. Solutions: incremental updates (HNSW supports adding vectors), partial rebuilds (update only the clusters containing new items), or hybrid search (exact search on recent items, ANN on older items). Balance rebuild frequency against infrastructure cost.
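The hybrid search option can be sketched as follows. This is a minimal illustration, not a production design: `ann_search` is a hypothetical callable standing in for whatever stale ANN index you query, `recent` is an assumed in-memory buffer of items added since the last rebuild, and L2 distance is an assumption.

```python
import heapq
import math
from typing import Callable, Dict, List, Sequence, Tuple

Hit = Tuple[float, str]  # (distance, item_id)


def l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_search(query: Sequence[float],
                 recent: Dict[str, Sequence[float]],
                 k: int) -> List[Hit]:
    """Brute-force scan over the small buffer of recently added items."""
    return heapq.nsmallest(
        k, ((l2(query, vec), item_id) for item_id, vec in recent.items())
    )


def hybrid_search(query: Sequence[float],
                  recent: Dict[str, Sequence[float]],
                  ann_search: Callable[[Sequence[float], int], List[Hit]],
                  k: int) -> List[Hit]:
    """Merge stale-index ANN hits with exact hits from the recent buffer."""
    candidates = list(ann_search(query, k)) + exact_search(query, recent, k)
    # An item may appear in both sources; keep its smallest distance.
    best: Dict[str, float] = {}
    for dist, item_id in candidates:
        if item_id not in best or dist < best[item_id]:
            best[item_id] = dist
    return heapq.nsmallest(k, ((d, i) for i, d in best.items()))
```

The brute-force scan stays cheap only while the recent buffer is small, so the buffer should be emptied into the index on each rebuild.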

DISTRIBUTION SHIFT

ANN indexes are tuned to the distribution of the data they were trained on. If new vectors come from a different distribution (new product category, different language), recall drops significantly. Monitor recall on recent queries. If recall for new items is 10% lower than for established items, the index needs retraining. Include diverse samples in index training to handle distribution changes.
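One way to monitor this is to split recall by item age, sketched below. The assumptions are loud ones: each sampled query comes with its ANN result IDs and exact (brute-force) top-k IDs, `is_new` is a hypothetical predicate you supply (for example, item age under 7 days), and the "10% lower" rule is read here as an absolute gap of 0.10.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def recall_by_age(
    results: Iterable[Tuple[List[str], List[str]]],
    is_new: Callable[[str], bool],
) -> Dict[str, Optional[float]]:
    """Recall of ANN results against exact results, split by item age.

    results: one (ann_ids, exact_ids) pair per sampled query.
    Returns recall for "new" and "old" ground-truth items (None if no samples).
    """
    found = {"new": 0, "old": 0}
    total = {"new": 0, "old": 0}
    for ann_ids, exact_ids in results:
        retrieved = set(ann_ids)
        for item_id in exact_ids:
            bucket = "new" if is_new(item_id) else "old"
            total[bucket] += 1
            if item_id in retrieved:
                found[bucket] += 1
    return {b: (found[b] / total[b] if total[b] else None) for b in total}


def needs_retraining(recalls: Dict[str, Optional[float]],
                     max_gap: float = 0.10) -> bool:
    """Flag the index when new-item recall trails old-item recall by max_gap."""
    new, old = recalls["new"], recalls["old"]
    if new is None or old is None:
        return False  # not enough data to compare
    return old - new >= max_gap
```

Computing exact results is expensive, so this check is typically run on a small sample of queries rather than on live traffic.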

⚠️ Warning: PQ codebooks learned on old data may poorly compress new distributions. Relearn codebooks periodically, especially after major catalog changes.

MEMORY FRAGMENTATION

Long-running ANN services accumulate memory fragmentation. HNSW graphs grow with additions; deletions leave holes. After months of updates, memory usage may be 2x what you expect. Schedule periodic full rebuilds or use memory-efficient allocators. Monitor actual versus expected memory consumption.
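The actual-versus-expected comparison can be sketched as below. The expected size uses a common rule of thumb for a flat HNSW index (raw float32 vector plus roughly 2 × M four-byte neighbor IDs per vector); treat it as an approximation, not an exact accounting. The function names and the 1.5x rebuild threshold are hypothetical choices for illustration.

```python
def expected_hnsw_bytes(num_vectors: int, dim: int, m: int,
                        bytes_per_float: int = 4) -> int:
    """Rough expected size: vector payload plus ~2*M 4-byte links per vector."""
    return num_vectors * (dim * bytes_per_float + m * 2 * 4)


def memory_bloat_ratio(actual_bytes: int, num_vectors: int,
                       dim: int, m: int) -> float:
    """How many times larger the process is than the rule of thumb predicts."""
    return actual_bytes / expected_hnsw_bytes(num_vectors, dim, m)


def should_rebuild(actual_bytes: int, num_vectors: int, dim: int, m: int,
                   threshold: float = 1.5) -> bool:
    """Suggest a full rebuild once bloat crosses the (assumed) threshold."""
    return memory_bloat_ratio(actual_bytes, num_vectors, dim, m) >= threshold
```

Here `actual_bytes` would come from your process metrics (for example resident set size), and `num_vectors` should count live vectors only, so that holes left by deletions show up as bloat.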

HOT SPOTS

Some queries hit popular regions of the index repeatedly while others touch rarely accessed regions. This creates uneven load: some servers are overloaded while others idle. Replicate hot regions across more servers. Monitor query latency distribution: if p99 is 10x median, you likely have hot spots.
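The p99-to-median check can be sketched with the standard library alone. This assumes you already collect per-query latencies in milliseconds for a time window; the nearest-rank percentile and the function names are illustrative choices, and the 10x threshold comes from the rule of thumb above.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile (pct in 1..100) of a non-empty sample."""
    if not values:
        raise ValueError("need at least one latency sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tail_ratio(latencies_ms: Sequence[float]) -> float:
    """Ratio of p99 latency to median latency."""
    return percentile(latencies_ms, 99) / percentile(latencies_ms, 50)


def has_hot_spot(latencies_ms: Sequence[float],
                 max_ratio: float = 10.0) -> bool:
    """True when the tail is at least max_ratio times the median."""
    return tail_ratio(latencies_ms) >= max_ratio
```

Running the same check per server or per shard, rather than only on the aggregate, helps locate which region of the index is hot.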

💡 Key Takeaways
Index staleness: weekly rebuild + 1M daily items = up to 7M invisible items; use incremental updates
Distribution shift: new categories may have 10% lower recall; monitor and retrain periodically
PQ codebooks degrade on new distributions; relearn after major catalog changes
Memory fragmentation: after months of updates, memory may be 2x expected; schedule rebuilds
Hot spots: p99 latency 10x median indicates uneven load; replicate hot regions
📌 Interview Tips
1. Describe hybrid search: exact search on items added in last hour, ANN on older items
2. Explain distribution monitoring: track recall for items < 7 days old vs older items
3. Discuss memory: HNSW starts at 51 GB, grows to 100 GB after 6 months of updates