Model Evolution and Dual Indexing
Changing embedding models or index parameters breaks incremental updates because new and old vectors are incompatible. If you retrain your encoder and change the output dimension from 512 to 768, or switch from a transformer to a different architecture, you cannot mix old and new embeddings in the same index: similarity scores become meaningless and recall collapses. The same applies to major index parameter changes, such as switching from HNSW to a quantized structure or significantly changing the graph degree.
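One lightweight safeguard is to tag every index with the model version and dimension it was built for and reject writes that do not match. The sketch below is illustrative only; VersionedIndex and IndexMetadata are hypothetical names, not part of any particular vector database API.

```python
# Minimal sketch (assumed names): tag each index with the model version and
# dimension it was built for, and refuse writes that do not match, so old and
# new embeddings can never end up in the same index.
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class IndexMetadata:
    model_version: str   # e.g. "encoder-v1"
    dimension: int       # e.g. 512 or 768


class VersionedIndex:
    def __init__(self, metadata: IndexMetadata):
        self.metadata = metadata
        self.vectors: dict[str, np.ndarray] = {}

    def upsert(self, item_id: str, vector: np.ndarray, model_version: str) -> None:
        # Reject vectors produced by a different model or with a different dimension.
        if model_version != self.metadata.model_version:
            raise ValueError(
                f"vector from {model_version} cannot go into an index "
                f"built for {self.metadata.model_version}"
            )
        if vector.shape[-1] != self.metadata.dimension:
            raise ValueError(
                f"expected dim {self.metadata.dimension}, got {vector.shape[-1]}"
            )
        self.vectors[item_id] = vector


# Usage: a 768-dim vector from the new encoder is rejected by the old 512-dim index.
old_index = VersionedIndex(IndexMetadata(model_version="encoder-v1", dimension=512))
old_index.upsert("item-1", np.random.rand(512).astype("float32"), "encoder-v1")  # ok
# old_index.upsert("item-2", np.random.rand(768).astype("float32"), "encoder-v2")  # raises ValueError
```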
The safe approach is dual indexing with a gradual traffic shift. Build a new index version in parallel with the old one. Dual write all updates to both indexes, applying new embeddings to the new index and old embeddings to the old. Gradually shift read traffic from 0 percent to 100 percent on the new index over hours or days, monitoring quality metrics such as click-through rate, precision at k, and null rate. If metrics degrade beyond tolerance, roll back traffic to the old index instantly. Keep the old index running until confidence checks pass, then decommission it.
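A minimal sketch of the dual-write and traffic-ramp logic, assuming both indexes expose upsert and search methods; DualIndexRouter, ramp_fraction, and the deterministic user bucketing are illustrative choices, not a prescribed design.

```python
import hashlib


class DualIndexRouter:
    def __init__(self, old_index, new_index, ramp_fraction: float = 0.0):
        self.old_index = old_index
        self.new_index = new_index
        self.ramp_fraction = ramp_fraction  # share of read traffic served by the new index

    def write(self, item_id, old_vector, new_vector):
        # Dual write: every update lands in both indexes, each with its own embedding.
        self.old_index.upsert(item_id, old_vector)
        self.new_index.upsert(item_id, new_vector)

    def search(self, old_query_vector, new_query_vector, user_id: str, k: int = 10):
        # Deterministic bucketing by user id keeps each user on one index during the
        # ramp, which makes per-cohort quality metrics comparable.
        bucket = (int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100) / 100.0
        if bucket < self.ramp_fraction:
            return self.new_index.search(new_query_vector, k)
        return self.old_index.search(old_query_vector, k)

    def set_ramp(self, fraction: float) -> None:
        # Ramp from 0.0 to 1.0 over hours or days; rollback is setting it back to 0.0.
        self.ramp_fraction = max(0.0, min(1.0, fraction))
```

Because each user is pinned to one index for a given ramp fraction, rollback only requires setting the fraction back to zero; no re-routing state has to be unwound.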
This requires significant infrastructure. You need a feature flag or configuration service to control traffic splits per query type or user cohort, and a shadow embedding service running the new model in parallel with the old, generating embeddings for both indexes. Storage and compute costs double during the transition. A typical migration for a large-scale system with 100 million items might take 1 to 3 days for the index build and 2 to 7 days for the traffic ramp with continuous quality monitoring.
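The shadow embedding path can be as simple as running both encoders on every updated item, as in this hedged sketch; encode_old and encode_new stand in for whatever inference endpoints you actually operate.

```python
from typing import Callable, Sequence

import numpy as np


def shadow_embed(
    texts: Sequence[str],
    encode_old: Callable[[Sequence[str]], np.ndarray],  # e.g. the 512-dim production model
    encode_new: Callable[[Sequence[str]], np.ndarray],  # e.g. the 768-dim candidate model
) -> list[tuple[np.ndarray, np.ndarray]]:
    # Two forward passes per updated item: this is where the doubled compute cost
    # during the transition comes from. Each pair feeds the dual write above.
    old_vectors = encode_old(texts)
    new_vectors = encode_new(texts)
    return list(zip(old_vectors, new_vectors))
```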
Training-serving skew is another failure mode. If your model was trained on batch features but serves with real-time features, accuracy drops. For example, if you train on aggregated 7-day user engagement but serve with only 24-hour fresh features, the distribution shifts and recall can drop 10 to 20 percent. Validate feature parity between training and serving before deploying. Feature stores like Tecton or Feast help ensure consistent feature computation across offline training and online serving.
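A parity check can be a sampled comparison of offline and online feature values before rollout. In the sketch below, fetch_offline_features and fetch_online_features are assumed hooks into your feature store, not a specific Tecton or Feast API, and the 5 percent tolerance is a placeholder.

```python
from typing import Callable, Mapping, Sequence


def check_feature_parity(
    entity_ids: Sequence[str],
    feature_names: Sequence[str],
    fetch_offline_features: Callable[[str], Mapping[str, float]],
    fetch_online_features: Callable[[str], Mapping[str, float]],
    rel_tolerance: float = 0.05,
) -> dict[str, float]:
    """Return, per feature, the share of sampled entities whose offline and online
    values diverge beyond the relative tolerance."""
    mismatch_counts = {name: 0 for name in feature_names}
    for entity_id in entity_ids:
        offline = fetch_offline_features(entity_id)
        online = fetch_online_features(entity_id)
        for name in feature_names:
            off, on = offline.get(name), online.get(name)
            if off is None or on is None:
                mismatch_counts[name] += 1  # missing on one side counts as skew
                continue
            denom = max(abs(off), 1e-9)
            if abs(off - on) / denom > rel_tolerance:
                mismatch_counts[name] += 1
    return {name: count / len(entity_ids) for name, count in mismatch_counts.items()}


# Deploy-gate example: block rollout if any feature diverges for more than 1% of samples.
# parity_report = check_feature_parity(sample_ids, features, fetch_offline, fetch_online)
# assert all(rate <= 0.01 for rate in parity_report.values())
```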
💡 Key Takeaways
• Changing embedding dimensions or models makes old and new vectors incompatible; mixing them in one index makes similarity scores meaningless
• Dual indexing with a gradual traffic shift takes 3 to 10 days for large systems: build the new index, dual write updates, and ramp traffic while monitoring quality
• Storage and compute costs double during migration; a shadow embedding service must generate old and new representations in parallel
• Training-serving skew causes a 10 to 20 percent recall drop when batch training features differ from real-time serving features; use feature stores for consistency
• Monitor click-through rate, precision at k, and null rate during the ramp; roll back instantly if metrics degrade beyond a 2 to 5 percent tolerance
• Keep the old index running until confidence checks pass; a typical 100 million item migration takes 1 to 3 days to build and 2 to 7 days to ramp
📌 Examples
Embedding change: upgrade from a 512-dim sentence transformer to a 768-dim model; build a new HNSW index, dual write for 3 days, and ramp from 1% to 100% over 4 days while monitoring precision
Training-serving skew: a model trained on 7-day aggregated features but served with 24-hour fresh features loses 15% recall; retrain with 24-hour features to fix
Pinterest model upgrade: a new embedding model improves engagement by 3%; dual index for 5 days, ramp traffic with an A/B test, and decommission the old index after 100% adoption