Production Failure Modes and Operational Challenges
INDEX STALENESS
An index built on old data cannot return vectors added since the last build. If you rebuild weekly but add 1M items daily, up to 7M items are invisible to search just before the next rebuild. Solutions: incremental updates (HNSW supports adding vectors to an existing graph), partial rebuilds (update only the clusters containing new items), or hybrid search (exact search on recent items, ANN on older items, with the two result sets merged). Balance rebuild frequency against infrastructure cost.
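The hybrid option can be sketched as follows. This is a minimal illustration, not a production design: `ann_search` is a hypothetical callable standing in for whatever ANN library serves the stale index, `recent_items` is an assumed in-memory buffer of vectors added since the last rebuild, and Euclidean distance is an arbitrary choice.

```python
import heapq
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_search(query, items, k):
    """Brute-force k nearest neighbours over (item_id, vector) pairs.

    Returns a list of (distance, item_id), closest first.
    """
    scored = ((l2(query, vec), item_id) for item_id, vec in items)
    return heapq.nsmallest(k, scored)


def hybrid_search(query, ann_search, recent_items, k):
    """Merge ANN hits from the stale index with exact hits on recent items.

    ann_search(query, k) is assumed to return (distance, item_id) pairs
    using the same distance metric as exact_search.
    """
    candidates = list(ann_search(query, k)) + exact_search(query, recent_items, k)
    best = {}
    for dist, item_id in candidates:
        # Keep the smallest distance if an id shows up in both sources.
        if item_id not in best or dist < best[item_id]:
            best[item_id] = dist
    merged = sorted((dist, item_id) for item_id, dist in best.items())
    return merged[:k]
```

The exact scan stays cheap only while the recent buffer is small, so the buffer size effectively determines how long a rebuild can be deferred.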
DISTRIBUTION SHIFT
ANN indexes are tuned to the distribution of the data they were built on. If new vectors come from a different distribution (a new product category, a different language), recall can drop significantly. Monitor recall on recent queries by comparing ANN results against exact search on a sample. If recall for new items is 10% lower than for established items, the index needs retraining. Include diverse samples in index training to handle distribution changes.
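A recall monitor along these lines might look like the sketch below. The cohort labels `"established"` and `"new"` are hypothetical, the exact-search ground truth is assumed to be computed offline on a sample of queries, and the 10% threshold is interpreted here as an absolute gap of 0.10 in recall, which is an assumption.

```python
def recall_at_k(ann_ids, exact_ids):
    """Fraction of the exact top-k results that the ANN index also returned."""
    if not exact_ids:
        return 1.0
    return len(set(ann_ids) & set(exact_ids)) / len(exact_ids)


def cohort_recall(samples):
    """Average recall per cohort.

    samples: iterable of (cohort, ann_ids, exact_ids) tuples, one per
    sampled query.
    """
    totals = {}
    for cohort, ann_ids, exact_ids in samples:
        running_sum, count = totals.get(cohort, (0.0, 0))
        totals[cohort] = (running_sum + recall_at_k(ann_ids, exact_ids), count + 1)
    return {cohort: s / n for cohort, (s, n) in totals.items()}


def needs_retraining(recalls, baseline="established", candidate="new", max_gap=0.10):
    """Flag the index when the candidate cohort trails the baseline by more than max_gap."""
    return recalls[baseline] - recalls[candidate] > max_gap
```

Splitting recall by cohort matters because an aggregate figure can look healthy while queries that target new items degrade.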
MEMORY FRAGMENTATION
Long-running ANN services accumulate memory fragmentation. HNSW graphs grow with additions; deletions leave holes. After months of updates, memory usage may be 2x what you expect. Schedule periodic full rebuilds or use memory-efficient allocators. Monitor actual versus expected memory consumption.
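One way to compare actual with expected memory is sketched below. The per-vector estimate (float32 vector plus roughly `2 * m` four-byte links on the base layer) is a rough approximation rather than any library's accounting, and the 1.5x rebuild threshold is an illustrative assumption. `actual_bytes` is assumed to come from your own process metrics.

```python
def expected_hnsw_bytes(num_vectors, dim, m, bytes_per_float=4, bytes_per_link=4):
    """Rough lower-bound estimate of HNSW memory.

    Counts the raw vector plus about 2 * m neighbour links per element on
    the base layer. Upper layers and allocator overhead are ignored.
    """
    per_vector = dim * bytes_per_float + 2 * m * bytes_per_link
    return num_vectors * per_vector


def bloat_ratio(actual_bytes, expected_bytes):
    """How many times larger the real footprint is than the estimate."""
    return actual_bytes / expected_bytes


def should_rebuild(actual_bytes, expected_bytes, threshold=1.5):
    """Suggest a full rebuild once the footprint exceeds threshold x the estimate."""
    return bloat_ratio(actual_bytes, expected_bytes) >= threshold
```

Tracking the ratio over time is more informative than a single reading: a steady climb after heavy delete traffic points to unreclaimed space rather than real growth.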
HOT SPOTS
Some queries hit popular regions of the index repeatedly while others touch rarely accessed regions. This creates uneven load: some servers are overloaded while others sit idle. Replicate hot regions across more servers. Monitor the query latency distribution: if p99 latency is 10x the median, you likely have hot spots.
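The p99-versus-median check can be sketched as below, assuming latencies are collected in milliseconds per server or per shard. The nearest-rank percentile is one of several common definitions, and the function names are illustrative.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tail_ratio(latencies_ms):
    """Ratio of p99 latency to median latency."""
    return percentile(latencies_ms, 99) / percentile(latencies_ms, 50)


def has_hot_spot(latencies_ms, threshold=10.0):
    """True when p99 is at least threshold times the median."""
    return tail_ratio(latencies_ms) >= threshold
```

Running the check per shard as well as globally helps identify which regions to replicate, since a global ratio only shows that a hot spot exists somewhere.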