Production Rollout and Version Management

Embedding model updates require careful orchestration because embeddings from different model versions are not comparable: cosine similarity between vectors from model V1 and V2 is meaningless, making mixed-version serving a critical failure mode. A search system with 500 million documents cannot reindex overnight. Backfilling at 50,000 to 200,000 vectors per minute per worker means a single worker needs roughly 40 to 160 hours to re-embed 500 million documents. Production rollouts therefore use dual-write and dual-read phases to maintain availability and correctness.

The standard pattern involves several stages:

1. Dual write: newly created or updated documents are embedded by both the old and new models, and both vectors are stored with model version tags.
2. Background backfill: workers re-embed existing documents with the new model at controlled throughput to avoid overloading embedding services. Pinterest batches backfill jobs at 100,000 documents per batch, rate-limiting to 500 requests per second to avoid impacting live traffic.
3. Dual index: maintain two Approximate Nearest Neighbor (ANN) indexes, one per model version.
4. Shadow traffic: route a small percentage (1 to 5%) of queries to the new index and log results alongside the old index without affecting users.
5. Off-policy evaluation: compare pairwise reranking win rate, recall-at-K overlap, and per-slice metrics.
6. A/B test: route 5%, then 50%, of live traffic to the new model and measure business Key Performance Indicators (KPIs) such as Click-Through Rate (CTR) and engagement.
7. Full rollout: shift all traffic to the new model and deprecate the old one.

Version skew creates subtle bugs. If the query embedding uses model V2 but 30% of document embeddings are still V1, relevance degrades unpredictably depending on which documents a user interacts with. Google enforces that query and document embeddings must match versions, requiring the system to route queries to the index that matches the query model version. During migration, queries use the V1 index until backfill reaches 95% completeness, then switch atomically. Monitoring tracks the version distribution per index shard, and alerts fire if version skew exceeds 10% after switchover. A minimal sketch of this dual-write, version-matched routing appears below.

Cost and time dominate rollout planning. At 300 million documents and 768 dimensions, backfilling takes 24 to 96 hours with 4 to 8 workers at throttled throughput, and storing dual embeddings temporarily doubles vector storage (an extra roughly 450 GB at 2 bytes per dimension). Running shadow traffic and A/B tests for 7 to 14 days adds opportunity cost. Spotify budgets 3 to 4 weeks end-to-end for embedding model updates: 1 week of backfill, 1 week of shadow evaluation, and 2 weeks of A/B testing with a ramp from 5% to 50% to 100%. They gate rollout on no regression on any protected slice (genre, language, user tenure) and at least a 2% engagement uplift overall.
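The sketch below illustrates the dual-write and version-matched routing pieces of this pattern. It is a minimal illustration under stated assumptions, not a production design: the `EmbeddingRollout` class, the `upsert`/`search` interfaces, and the 95% cutover constant are hypothetical names standing in for whatever embedding service and ANN index a real system exposes.

```python
from dataclasses import dataclass

OLD, NEW = "v1", "v2"
CUTOVER_COMPLETENESS = 0.95  # switch query routing to the new index at 95% backfill

@dataclass
class EmbeddingRollout:
    """Hypothetical coordinator for a dual-write, version-matched rollout."""
    embedders: dict   # model version -> callable(text) -> vector (assumed interface)
    indexes: dict     # model version -> ANN index exposing upsert() and search() (assumed)
    total_docs: int
    backfilled: int = 0
    serving_version: str = OLD

    def write(self, doc_id, text):
        # Dual write: embed new/updated documents with both models and tag
        # each vector with its model version so the two spaces never mix.
        for version, embed in self.embedders.items():
            self.indexes[version].upsert(doc_id, embed(text),
                                         metadata={"model_version": version})

    def record_backfill(self, n_docs):
        # Background workers report progress as they re-embed existing documents.
        self.backfilled += n_docs
        if (self.serving_version == OLD
                and self.backfilled / self.total_docs >= CUTOVER_COMPLETENESS):
            self.serving_version = NEW  # atomic switchover to the V2 index

    def search(self, query, k=10):
        # Version-matched routing: the query is embedded by the same model
        # version as the index it searches, never a mixed pair.
        version = self.serving_version
        return self.indexes[version].search(self.embedders[version](query), k=k)
```

Shadow traffic and the A/B ramp would sit in front of `search`, deciding which coordinator (or which serving version) handles a given request and logging both result sets for comparison.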
💡 Key Takeaways
Embeddings from different model versions are not comparable: cosine similarity between V1 and V2 vectors is meaningless, requiring strict version isolation during rollout
Backfilling 500 million documents at 50k to 200k vectors per minute per worker requires 40 to 160 hours per worker, necessitating controlled multi-day rollouts (see the worked estimate after this list)
A dual-phase rollout maintains two indexes (V1 and V2), routes 1 to 5% of queries as shadow traffic for offline metrics, then A/B tests with a ramp from 5% to 50% to 100% for business KPIs
Version skew (mixed V1 and V2 documents in the same index) degrades relevance unpredictably: Google enforces that query and document versions must match, routing queries to the matching index until backfill reaches 95%
Pinterest rate-limits backfill to 500 requests per second in 100k-document batches to avoid impacting the live embedding service under production load
Spotify budgets 3 to 4 weeks for embedding updates: 1 week backfill, 1 week shadow evaluation, 2 weeks A/B testing, gated by no protected slice regression and 2% engagement uplift
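As a sanity check on the backfill arithmetic in the takeaways above, the small helper below converts a per-worker embedding rate into wall-clock hours. The function name is illustrative, and it assumes workers scale linearly with no throttling by the live embedding service.

```python
def backfill_hours(total_docs, docs_per_minute_per_worker, workers=1):
    """Estimated wall-clock hours to re-embed total_docs, assuming linear
    scaling across workers and no rate limiting by the embedding service."""
    return total_docs / (docs_per_minute_per_worker * workers * 60)

# A single worker over 500 million documents at 50k-200k vectors per minute:
print(backfill_hours(500_000_000, 200_000))  # ~41.7 hours
print(backfill_hours(500_000_000, 50_000))   # ~166.7 hours
```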
📌 Examples
At 300 million documents with 768D embeddings stored at 2 bytes per dimension (roughly 450 GB), dual storage temporarily doubles the vector footprint, adding about 450 GB and roughly $9 per month at typical cloud object storage rates during migration
Google routes queries to the V1 index until document backfill reaches 95%, then switches atomically to V2, with alerts firing if version skew exceeds 10% post-switch (a minimal skew check is sketched after these examples)
A team discovers an 8% CTR drop during rollout caused by 30% version skew: queries used V2 while many documents remained V1; the issue was resolved by reverting and enforcing strict version matching
Spotify ramps new playlist embeddings to 5% for 3 days (shadow), 50% for 7 days (A/B), then 100%, monitoring slice metrics by genre (pop, classical, hip hop) and user tenure (new vs. power users)
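The skew alert in the Google example above reduces to a simple per-shard check. The sketch below is illustrative: it assumes each stored vector carries a `model_version` tag, and uses the 10% threshold quoted in this section.

```python
from collections import Counter

SKEW_ALERT_THRESHOLD = 0.10  # alert if >10% of a shard's vectors are on the wrong version

def version_skew(vector_versions, expected_version):
    """Fraction of vectors in a shard whose model_version tag does not match
    the version the shard is expected to serve."""
    counts = Counter(vector_versions)
    total = sum(counts.values())
    return (total - counts.get(expected_version, 0)) / total if total else 0.0

# A shard that is still 30% V1 after the switch to V2 trips the alert.
shard_tags = ["v2"] * 70 + ["v1"] * 30
skew = version_skew(shard_tags, expected_version="v2")
if skew > SKEW_ALERT_THRESHOLD:
    print(f"ALERT: version skew {skew:.0%} exceeds {SKEW_ALERT_THRESHOLD:.0%}")
```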