Interleaving for Ranking Models

Production Implementation and Scale Considerations

At scale, interleaving must fit within tight latency budgets while absorbing the cost of running two rankers. Consider a search service handling 10,000 queries per second with a median latency of 120 milliseconds and a 200 millisecond P99 Service Level Objective (SLO). Running two rankers can double model inference time, which might push P99 over budget. Teams mitigate this by caching shared features so feature extraction is not duplicated: retrieval and feature computation might cost 50 milliseconds, followed by 20 milliseconds each for model A and model B inference. The team-draft merge adds under 1 millisecond, since it operates on indices rather than copying items for typical K of 10 to 50. Logging adds under 2 milliseconds amortized when writes are batched to a message queue such as Kafka. If feature caching is effective and the two inferences run in parallel, total added latency stays under 10 milliseconds.

Traffic sampling is another lever. Since interleaving needs only 400 to 2,000 samples for significance, you can run on 10 to 20 percent of traffic and still finish in days. This reduces infrastructure cost and limits the blast radius if the treatment model has a bug. Airbnb reported using 6 percent of A/B traffic and one third of the duration, suggesting that sampling at 5 to 10 percent is viable for moderate-traffic systems. However, keep the sample representative by randomizing at the query or user-session level rather than by time of day, which could introduce seasonality bias.

Logging infrastructure must capture rich metadata. Emit item id, query id, user id or session id, timestamp, slot index, team label (A, B, or neutral), and all engagement signals, including clicks, dwell time, add-to-cart, and conversions. Store events in a stream-processing system for real-time aggregation of preference margins. Thumbtack uses a streaming job that computes running preference margins and significance tests, updating dashboards every few minutes. Monitor guardrails such as per-query first-position balance within 2 percent, competitive coverage rate above 30 percent, and P99 latency regression below 5 percent. Add circuit breakers that disable experiments violating these thresholds, protecting user experience.

Experiment management should support running many interleaving matches in parallel. Bucket by query or user session with deterministic hashing on query text or user id so the same query always sees the same treatment assignment within an experiment, enabling reproducibility for debugging. Log seeds so you can replay coin flips. Use a three-stage workflow: offline NDCG to filter weak candidates, interleaving to rank-order promising ones, then A/B testing to validate KPI impact. Document edge cases for set-level rankers or product changes that alter inventory, and if uncertain, run shadow evaluation where both models log but only one serves, allowing offline analysis before committing to interleaving.
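To make the sub-millisecond claim concrete, here is a minimal Python sketch of a team-draft merge. The function name and the tie-breaking convention are illustrative assumptions rather than any particular production implementation, but the sketch shows why the merge is cheap: it only advances two per-team pointers, flips seeded coins, and never copies items.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k, seed):
    """Merge two ranked lists into one list of up to k items using team-draft
    interleaving. Returns the merged list and the team label for each slot."""
    rng = random.Random(seed)          # seeded so logged coin flips can be replayed
    lists = {"A": ranking_a, "B": ranking_b}
    ptr = {"A": 0, "B": 0}             # per-team pointer into its own ranking
    picks = {"A": 0, "B": 0}
    placed = set()
    interleaved, teams = [], []

    def next_unplaced(team):
        # Advance the team's pointer past items already placed by either team.
        lst = lists[team]
        while ptr[team] < len(lst) and lst[ptr[team]] in placed:
            ptr[team] += 1
        return lst[ptr[team]] if ptr[team] < len(lst) else None

    while len(interleaved) < k:
        # The team with fewer picks goes next; on a tie, a coin flip decides,
        # which keeps first-position assignment balanced across queries.
        if picks["A"] != picks["B"]:
            order = ["A", "B"] if picks["A"] < picks["B"] else ["B", "A"]
        else:
            order = ["A", "B"] if rng.random() < 0.5 else ["B", "A"]

        item = None
        for team in order:
            item = next_unplaced(team)
            if item is not None:
                placed.add(item)
                interleaved.append(item)
                teams.append(team)
                picks[team] += 1
                ptr[team] += 1
                break
        if item is None:               # both rankings exhausted before reaching k
            break

    return interleaved, teams
```

Logging the seed alongside the served list and team labels makes every coin flip replayable when debugging a specific request.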
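Deterministic assignment and traffic sampling are commonly implemented with a salted hash of the randomization unit. The sketch below is an assumption about one reasonable scheme (the helper names and the choice of SHA-256 are not prescribed anywhere in particular): it maps a session id, user id, or normalized query text to a stable bucket, and, when the unit is sampled in, derives a per-unit seed for the coin flips.

```python
import hashlib

def bucket(unit_id: str, experiment_salt: str) -> float:
    """Map a unit (user id, session id, or normalized query text) to a stable
    value in [0, 1) so the same unit always lands in the same bucket for a
    given experiment salt."""
    digest = hashlib.sha256(f"{experiment_salt}:{unit_id}".encode("utf-8")).hexdigest()
    return int(digest[:15], 16) / 16**15

def assign(unit_id: str, experiment_salt: str, sample_rate: float = 0.10):
    """Return None if the unit falls outside the sampled traffic, otherwise a
    deterministic per-unit seed that drives the team-draft coin flips."""
    if bucket(unit_id, experiment_salt) >= sample_rate:
        return None                    # serve the production ranker only
    # A second salted hash yields a reproducible seed for this unit.
    return int(hashlib.sha256(f"seed:{experiment_salt}:{unit_id}".encode()).hexdigest()[:8], 16)
```

In use, a `None` result means the request bypasses interleaving entirely, while a returned seed is passed to the merge and written to the log, so replaying a request reproduces the exact draft.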
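One common way to aggregate the logged events is to credit each click to the team that contributed the clicked slot, score each query as a win for A, a win for B, or a tie, and run a sign test on the wins. The stdlib-only sketch below assumes this scoring rule for illustration; it is not a description of Thumbtack's or anyone else's pipeline.

```python
from math import comb

def credit_clicks(teams, clicked_slots):
    """Count clicks credited to each ranker for one query, given the team
    label of every served slot and the indices of clicked slots."""
    a = sum(1 for s in clicked_slots if teams[s] == "A")
    b = sum(1 for s in clicked_slots if teams[s] == "B")
    return a, b

def preference_stats(per_query_credits):
    """Aggregate per-query (clicks_A, clicks_B) pairs into a preference margin
    and a two-sided sign-test p-value. Tied queries carry no signal."""
    wins_a = sum(1 for a, b in per_query_credits if a > b)
    wins_b = sum(1 for a, b in per_query_credits if b > a)
    n = wins_a + wins_b
    if n == 0:
        return 0.0, 1.0
    margin = (wins_a - wins_b) / n
    # Exact two-sided binomial test against the null of no preference (p = 0.5).
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return margin, min(1.0, 2 * tail)
```

In a streaming job, `per_query_credits` would be a windowed aggregation keyed by experiment, with the margin and p-value pushed to a dashboard every few minutes.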
💡 Key Takeaways
Dual ranker doubles inference cost, mitigated by caching shared features to keep added latency under 10 milliseconds at 10,000 queries per second scale
Team draft merge runs in linear time O(K) with under 1 millisecond latency for typical K of 10 to 50, using pointer operations rather than item copying
Traffic sampling at 10 to 20 percent provides 400 to 2,000 samples in days, reducing infrastructure cost while maintaining statistical power
Rich logging captures item id, query id, slot index, team label, neutral flag, and all engagement signals for real time streaming aggregation
Monitor guardrails including per-query first-position balance within 2 percent, competitive coverage above 30 percent, and P99 latency regression under 5 percent (a minimal check is sketched after this list)
Bucket by query or user session with deterministic hashing for reproducibility, log coin flip seeds for debugging, support parallel experiments across query types
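A minimal sketch of the guardrail circuit breaker referenced above, assuming a hypothetical GuardrailStats container fed by the same streaming aggregation that computes preference margins; the field names and thresholds simply mirror the guardrails listed in the takeaways.

```python
from dataclasses import dataclass

@dataclass
class GuardrailStats:
    first_position_share_a: float   # fraction of queries where team A won slot 0
    competitive_coverage: float     # fraction of queries where the two rankings actually differed
    p99_latency_ms: float           # current P99 latency on interleaved traffic
    baseline_p99_latency_ms: float  # P99 latency on control traffic

def should_pause(stats: GuardrailStats) -> list[str]:
    """Return the list of violated guardrails; an empty list means keep serving."""
    violations = []
    # First-position balance: team A should win slot 0 close to half the time.
    if abs(stats.first_position_share_a - 0.5) > 0.02:
        violations.append("first_position_imbalance")
    # Competitive coverage: if the rankers rarely disagree, the match is uninformative.
    if stats.competitive_coverage < 0.30:
        violations.append("low_competitive_coverage")
    # Latency: pause if interleaved P99 regresses more than 5 percent over control.
    if stats.p99_latency_ms > 1.05 * stats.baseline_p99_latency_ms:
        violations.append("p99_latency_regression")
    return violations
```

A scheduler that evaluates this check on each aggregation window and flips the experiment back to control-only serving on any violation gives the circuit-breaker behavior described above.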
📌 Examples
Google Search runs 10 to 20 interleaving experiments in parallel, bucketing by query type like navigational versus informational, each with 15 percent traffic for 3 to 5 days
Netflix caches user profile embeddings and item features in Redis for 50 millisecond P99 reads, then runs dual rankers in 20 milliseconds each, keeping total latency under 100 milliseconds
Meta builds streaming pipeline on Kafka and Flink to compute running preference margins every 5 minutes, auto pausing experiments when competitive coverage drops below 25 percent