Index Drift and Consistency Guarantees

INDEX DRIFT EXPLAINED
Index drift occurs when the index structure becomes misaligned with the underlying data. In IVF (inverted file) indexes, this happens when the cluster centroids no longer represent the actual data distribution. New vectors are distributed differently from the training data, so routing a query to its nearest centroids probes the wrong clusters and misses relevant results.
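To make the routing failure concrete, here is a minimal toy sketch of an IVF-style index, not any particular library's implementation. The class `ToyIVF`, its methods, and the `nprobe` parameter name are assumptions chosen for illustration. The centroids are fixed at "training" time, and later vectors land in a region between them.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ToyIVF:
    """Hypothetical toy IVF index: centroids are fixed when it is created."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.cells = {i: [] for i in range(len(centroids))}

    def _ranked_cells(self, vec):
        # Cell ids ordered from nearest to farthest centroid.
        return sorted(range(len(self.centroids)),
                      key=lambda i: sq_dist(self.centroids[i], vec))

    def add(self, vec_id, vec):
        # A new vector is stored in the cell of its nearest centroid.
        self.cells[self._ranked_cells(vec)[0]].append((vec_id, vec))

    def search(self, query, k, nprobe=1):
        # Only the nprobe nearest cells are scanned.
        candidates = [item
                      for cell in self._ranked_cells(query)[:nprobe]
                      for item in self.cells[cell]]
        candidates.sort(key=lambda item: sq_dist(item[1], query))
        return [vec_id for vec_id, _ in candidates[:k]]


index = ToyIVF(centroids=[[0.0, 0.0], [10.0, 0.0]])
index.add("a", [4.9, 0.0])   # new data between the centroids, lands in cell 0
index.add("b", [9.0, 0.0])   # lands in cell 1

query = [5.1, 0.0]           # true nearest neighbour is "a"
print(index.search(query, k=1, nprobe=1))  # probes only cell 1, so "a" is missed
print(index.search(query, k=1, nprobe=2))  # probing both cells finds "a"
```

Probing more cells recovers the result here but costs extra work on every query; retraining the centroids on current data addresses the mismatch itself.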
Measuring drift: sample queries and compare the index results against brute-force exact search. A Recall@100 drop from 95% to 88% over two weeks indicates significant drift. At 85% recall, quality degradation becomes visible in user-facing search.
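The measurement described above can be sketched in a few lines. This is a minimal illustration in pure Python, assuming vectors are held in a dict of id to list of floats and that `approx_search(query, k)` is a hypothetical callable wrapping whatever index is being evaluated. The 0.88 alert threshold is taken from the example figures in the text, not a universal constant.

```python
def exact_top_k(vectors, query, k):
    """Brute-force exact search: ids of the k nearest vectors (squared L2)."""
    def dist(item):
        _, vec = item
        return sum((a - b) ** 2 for a, b in zip(vec, query))
    ranked = sorted(vectors.items(), key=dist)
    return [vec_id for vec_id, _ in ranked[:k]]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate index also returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


def mean_recall(vectors, queries, k, approx_search):
    """Average Recall@k over a sample of queries.

    approx_search is a hypothetical callable: (query, k) -> list of ids.
    """
    if not queries:
        raise ValueError("need at least one sampled query")
    scores = [recall_at_k(approx_search(q, k), exact_top_k(vectors, q, k))
              for q in queries]
    return sum(scores) / len(scores)


def drift_alert(recall, threshold=0.88):
    """Flag drift when measured recall falls to the threshold or below."""
    return recall <= threshold
```

Running `mean_recall` on a schedule against a fixed query sample gives a trend line; a sustained decline is the signal, more than any single reading.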
CONSISTENCY CHALLENGES
Write-search consistency: After a vector is added to the hot index, can a query immediately find it? With async writes, there is a brief window (milliseconds to seconds) where the vector exists but is not searchable.
Cross-index consistency: During hot-to-main merges, the same vector might briefly appear in both indexes or neither. Queries during merge might return duplicates or miss items.
Delete consistency: Deleting an item requires removing it from both the hot and main indexes. If a delete propagates to the hot index but not the main index (or vice versa), the deleted item may still appear in results.
HANDLING CONSISTENCY
Idempotent operations: Design inserts and deletes to be safely repeatable. If a merge fails mid-way, retry should produce correct results.
Version tracking: Assign monotonically increasing versions to vectors. At query time, filter results so that only the newest version of each vector is returned. This handles duplicates during merges.
Tombstone records: Mark deletions with tombstones rather than physically removing vectors, and clear the tombstones during compaction. This ensures deletes propagate correctly across index tiers.
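The three techniques above can be combined at query time. The following is a minimal sketch, assuming each search hit carries a vector id, a version, and a similarity score, and that tombstones are kept in a dict of id to the version at which the item was deleted. The names `Hit`, `upsert`, `delete`, and `merge_results` are illustrative, not taken from a specific system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    vec_id: str
    version: int   # monotonic per vec_id; higher means newer
    score: float   # higher means more similar


def upsert(tier, vec_id, version, vector):
    """Idempotent write: replaying an old or identical version is a no-op."""
    current = tier.get(vec_id)
    if current is None or version > current[0]:
        tier[vec_id] = (version, vector)


def delete(tombstones, vec_id, version):
    """Record a tombstone instead of physically removing the vector."""
    tombstones[vec_id] = max(version, tombstones.get(vec_id, -1))


def merge_results(hot_hits, main_hits, tombstones, k):
    """Merge hits from both tiers, dropping deleted and outdated entries."""
    latest = {}
    for hit in [*hot_hits, *main_hits]:
        if tombstones.get(hit.vec_id, -1) >= hit.version:
            continue  # deleted at or after this version
        kept = latest.get(hit.vec_id)
        if kept is None or hit.version > kept.version:
            latest[hit.vec_id] = hit  # keep only the newest version per id
    ranked = sorted(latest.values(), key=lambda h: h.score, reverse=True)
    return ranked[:k]


hot = [Hit("a", 2, 0.90), Hit("b", 1, 0.80)]
main = [Hit("a", 1, 0.95), Hit("c", 1, 0.70)]   # stale "a", deleted "c"
tombstones = {}
delete(tombstones, "c", 2)
delete(tombstones, "c", 2)                       # a retry changes nothing
print(merge_results(hot, main, tombstones, k=10))
```

Because writes and deletes compare versions rather than blindly overwriting, a merge that fails midway can be replayed from the start without corrupting either tier.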
⚠️ Key Trade-off: Strong consistency (immediate visibility after write) requires synchronous index updates, which limits throughput. Eventual consistency (sub-second delays) enables async batching for higher throughput.
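The trade-off can be illustrated with a small single-threaded sketch. It assumes a plain dict stands in for the searchable index and counts index updates as a rough proxy for write cost. `BatchingWriter` and its method names are hypothetical; a real system would flush on a timer or background thread as well as on batch size.

```python
class BatchingWriter:
    """Hypothetical writer contrasting synchronous and batched index updates."""

    def __init__(self, index, batch_size):
        self.index = index          # dict vec_id -> vector, stands in for the index
        self.batch_size = batch_size
        self.pending = []           # accepted writes that are not yet searchable
        self.index_updates = 0      # number of times the index was modified

    def write_sync(self, vec_id, vector):
        # Strong consistency: searchable on return, one index update per write.
        self.index[vec_id] = vector
        self.index_updates += 1

    def write_async(self, vec_id, vector):
        # Eventual consistency: buffered until the batch fills.
        self.pending.append((vec_id, vector))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        self.index.update(self.pending)  # one index update for the whole batch
        self.pending.clear()
        self.index_updates += 1


writer = BatchingWriter(index={}, batch_size=3)
writer.write_async("a", [0.1])
print("a" in writer.index)       # False: accepted but not yet visible to search
writer.write_async("b", [0.2])
writer.write_async("c", [0.3])   # batch is full, so all three become visible
print(sorted(writer.index), writer.index_updates)
```

The window between `write_async` returning and the next flush is the visibility delay described under write-search consistency.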

💡 Key Takeaways

✓ Index drift: centroids misalign with the data over time; a Recall@100 drop from 95% to 88% over two weeks indicates significant drift

✓ Consistency challenges: write-search visibility, cross-index duplicates during merges, delete propagation

✓ Solutions: idempotent operations, version tracking for deduplication, tombstone records for deletes

📌 Interview Tips

1. Interview Tip: Explain how to measure drift: sample queries and compare index results to brute-force exact search.

2. Interview Tip: Discuss the consistency trade-off: strong consistency limits throughput; eventual consistency enables async batching.