Production Failure Modes and Operational Challenges

INDEX STALENESS
Indexes built on old data miss new vectors. If you rebuild weekly but add 1M items daily, up to 7M items are invisible to search just before the next rebuild. Solutions: incremental updates (HNSW supports adding vectors), partial rebuilds (update only the clusters that contain new items), or hybrid search (exact search on recent items, ANN on older items). Balance rebuild frequency against infrastructure cost.
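The hybrid search option can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `hybrid_search`, `ann_search`, and `recent_items` are hypothetical names, the ANN index is assumed to be any callable that returns `(distance, item_id)` pairs, and the recent buffer is assumed to be small enough for a brute-force scan.

```python
import heapq
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Hit = Tuple[float, str]  # (distance, item_id)


def l2(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hybrid_search(
    query: Vector,
    k: int,
    ann_search: Callable[[Vector, int], List[Hit]],  # hypothetical ANN wrapper
    recent_items: Dict[str, Vector],                 # items not yet in the index
) -> List[Hit]:
    """Merge ANN hits on the indexed corpus with exact hits on recent items."""
    # Older items: whatever the (possibly stale) ANN index returns.
    ann_hits = ann_search(query, k)
    # Recent items: exact scan, affordable only while the buffer stays small.
    exact_hits = [(l2(query, vec), item_id) for item_id, vec in recent_items.items()]
    # If an item shows up in both sources, keep its smaller distance.
    best: Dict[str, float] = {}
    for dist, item_id in ann_hits + exact_hits:
        if item_id not in best or dist < best[item_id]:
            best[item_id] = dist
    # Return the k closest items overall, sorted by distance.
    return heapq.nsmallest(k, ((d, i) for i, d in best.items()))
```

Both sources must use the same distance metric for the merged ranking to be meaningful. Once a rebuild absorbs the buffered items into the index, the buffer can be cleared.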
DISTRIBUTION SHIFT
ANN indexes are tuned to the distribution of the data they were trained on. If new vectors come from a different distribution (new product category, different language), recall drops significantly. Monitor recall on recent queries. If recall for new items is 10% lower than for established items, the index needs retraining. Include diverse samples in index training to handle distribution changes.
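One way to monitor this is to sample queries, compute exact neighbors offline, and compare recall per cohort. The sketch below assumes each sample is already labeled "new" or "established" and carries both the ANN result and the exact result. All function names are hypothetical, and the 10% threshold from the text is treated as an absolute gap of 0.10, which is an interpretation rather than something the text specifies.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

# (cohort, ann_ids, exact_ids); cohort is "new" or "established"
Sample = Tuple[str, Sequence[str], Sequence[str]]


def recall_at_k(ann_ids: Sequence[str], exact_ids: Sequence[str], k: int) -> float:
    """Fraction of the exact top-k neighbors that the ANN result also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(ann_ids[:k])) / len(truth)


def recall_by_cohort(samples: Iterable[Sample], k: int) -> Dict[str, float]:
    """Average recall@k for each cohort present in the samples."""
    per_cohort: Dict[str, List[float]] = defaultdict(list)
    for cohort, ann_ids, exact_ids in samples:
        per_cohort[cohort].append(recall_at_k(ann_ids, exact_ids, k))
    return {cohort: sum(vals) / len(vals) for cohort, vals in per_cohort.items()}


def needs_retraining(recalls: Dict[str, float], max_gap: float = 0.10) -> bool:
    """True when new-item recall trails established-item recall by more than max_gap."""
    return recalls["established"] - recalls["new"] > max_gap
```

Exact neighbors are expensive to compute, so this kind of check usually runs on a small sample of traffic rather than on every query.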
⚠️ Warning: PQ codebooks learned on old data may poorly compress new distributions. Relearn codebooks periodically, especially after major catalog changes.
MEMORY FRAGMENTATION
Long-running ANN services accumulate memory fragmentation. HNSW graphs grow with additions, and deletions leave holes. After months of updates, memory usage may be 2x what you expect. Schedule periodic full rebuilds or use memory-efficient allocators. Monitor actual versus expected memory consumption.
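Comparing actual against expected memory can be as simple as a ratio check. The sketch below assumes a rough per-vector estimate for an HNSW index over uncompressed float32 vectors, about `dim * 4 + m * 2 * 4` bytes (vector data plus graph links). It is an approximation, and real implementations add overhead. The function names and the 1.5x rebuild threshold are illustrative assumptions. The actual figure would come from your process metrics (for example, resident set size).

```python
def expected_hnsw_bytes(num_vectors: int, dim: int, m: int) -> int:
    """Rough estimate: float32 vector data plus about 2*m 4-byte links per vector."""
    return num_vectors * (dim * 4 + m * 2 * 4)


def memory_bloat_ratio(actual_bytes: int, expected_bytes: int) -> float:
    """How many times larger the observed footprint is than the estimate."""
    if expected_bytes <= 0:
        raise ValueError("expected_bytes must be positive")
    return actual_bytes / expected_bytes


def should_rebuild(actual_bytes: int, expected_bytes: int, max_ratio: float = 1.5) -> bool:
    """Flag a rebuild once the footprint exceeds max_ratio times the estimate.

    max_ratio=1.5 is an arbitrary illustrative threshold, not a recommendation.
    """
    return memory_bloat_ratio(actual_bytes, expected_bytes) > max_ratio
```

For 1M vectors of dimension 128 with `m=32`, the estimate is 768 MB. A process holding 1,536 MB for that index would show a ratio of 2.0, matching the 2x growth described above.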
HOT SPOTS
Some queries hit popular regions of the index repeatedly while others touch rarely accessed regions. This creates uneven load: some servers are overloaded while others sit idle. Replicate hot regions across more servers. Monitor the query latency distribution: if p99 latency is 10x the median, you likely have hot spots.
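The p99-versus-median check can be sketched with a nearest-rank percentile over a window of latency samples. The names below are hypothetical, and the 10x threshold is taken from the text. A production system would more likely read these percentiles from its metrics backend than compute them by hand.

```python
import math
from typing import List, Sequence


def percentile(sorted_vals: Sequence[float], p: int) -> float:
    """Nearest-rank percentile (p in 1..100) of an already sorted sequence."""
    if not sorted_vals:
        raise ValueError("need at least one sample")
    rank = max(1, math.ceil(p * len(sorted_vals) / 100))
    return sorted_vals[rank - 1]


def tail_ratio(latencies_ms: List[float]) -> float:
    """Ratio of p99 latency to median latency for a window of samples."""
    ordered = sorted(latencies_ms)
    return percentile(ordered, 99) / percentile(ordered, 50)


def likely_hot_spot(latencies_ms: List[float], threshold: float = 10.0) -> bool:
    """True when p99 is at least `threshold` times the median."""
    return tail_ratio(latencies_ms) >= threshold
```

A high ratio only suggests uneven load. Comparing the same ratio per server or per shard helps confirm which regions are worth replicating.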

💡 Key Takeaways

✓ Index staleness: weekly rebuild + 1M daily items = up to 7M invisible items; use incremental updates

✓ Distribution shift: new categories may have 10% lower recall; monitor and retrain periodically

✓ PQ codebooks degrade on new distributions; relearn after major catalog changes

✓ Memory fragmentation: after months of updates, memory may be 2x expected; schedule rebuilds

✓ Hot spots: p99 latency 10x median indicates uneven load; replicate hot regions

📌 Interview Tips

1. Describe hybrid search: exact search on items added in last hour, ANN on older items

2. Explain distribution monitoring: track recall for items < 7 days old vs older items

3. Discuss memory: HNSW starts at 51 GB, grows to 100 GB after 6 months of updates
