
How Do You Implement Production Exploration to Estimate Propensities?

Propensity estimation requires data in which the same items appear in multiple positions. Under normal ranking, position is correlated with item quality, so you need controlled randomization to break that correlation. Production systems use low-rate exploration policies that balance data quality against short-term metric impact.

Common policies include RandTopN, which uniformly shuffles the top N results; RandPair, which swaps random pairs; and score perturbation, which adds small random noise to scores so that near ties move around. RandTopN provides the strongest identification and the simplest analysis but hurts short-term quality the most. Score perturbation concentrates changes on near ties, reducing user impact while still creating position variation over time.

Production exploration rates are typically 1 to 5 percent of traffic to bound quality risk. At 20,000 QPS, a 2 percent exploration cohort yields 400 QPS of randomized traffic. Over one day (86,400 seconds), that is about 34.5 million impressions with randomized positions, enough to estimate position curves per surface, device, and country with low variance. Even mild RandTop5 can cause a 0.5 to 2 percent CTR drop during exploration, which is acceptable on a small cohort for a bounded time.

Add guardrails to stop exploration if metrics degrade sharply. Monitor CTR, revenue per session, and user satisfaction hourly, and stop randomization if CTR drops more than 1 percent or revenue drops more than 0.5 percent for more than an hour.

Use stratified sampling to ensure each position and context cell gets sufficient coverage. After collecting data, fit parametric curves, for example exponential decay or piecewise linear functions, per surface and device. Smooth over positions to avoid overfitting noise in cells with fewer observations.
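The exploration policies described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the function names, the hash-based cohort assignment, and the choice to swap one adjacent pair for RandPair are all hypothetical, since the text only says that RandPair "swaps random pairs".

```python
import hashlib
import random

EXPLORATION_RATE = 0.02  # assumed: the 2 percent cohort from the traffic example


def in_exploration_cohort(request_id, rate=EXPLORATION_RATE):
    """Deterministically assign a request to the cohort by hashing its id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def rand_top_n(ranked, n, rng):
    """RandTopN: uniformly shuffle the top n results, leave the tail unchanged."""
    head = list(ranked[:n])
    rng.shuffle(head)
    return head + list(ranked[n:])


def rand_pair(ranked, rng):
    """RandPair (one assumed variant): swap one randomly chosen adjacent pair."""
    out = list(ranked)
    if len(out) >= 2:
        i = rng.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return out


def serve(request_id, ranked, n=5, rng=None):
    """Return (results, policy_name) so the policy can be logged per impression."""
    rng = rng or random.Random()
    if in_exploration_cohort(request_id):
        return rand_top_n(ranked, n, rng), f"RandTop{n}"
    return list(ranked), "production"


# Traffic arithmetic from the text: 20,000 QPS * 2 percent * 86,400 seconds.
daily_randomized = 20_000 * EXPLORATION_RATE * 86_400
```

Logging the policy name alongside each impression matters for the later analysis, because only impressions served under a known randomization policy should feed the propensity estimate.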
💡 Key Takeaways
RandTopN uniformly shuffles the top N results, giving strong identification but the largest CTR impact; score perturbation concentrates changes on near ties and reduces user impact
Exploration rates of 1 to 5 percent are typical; 2 percent of 20,000 QPS yields 400 QPS, or about 34.5 million impressions daily, for propensity estimation
Guardrails stop exploration if CTR drops more than 1 percent or revenue per session drops more than 0.5 percent for over an hour
Fit parametric curves per surface and device, with smoothing to avoid overfitting noise in low-traffic cells
Even mild RandTop5 can cause a 0.5 to 2 percent CTR drop, which is acceptable on a small cohort for a bounded time in exchange for clean, bias-corrected data
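The curve-fitting step can be illustrated with a small sketch. It assumes that clicks and impressions were logged per position under uniform shuffling (so each item is equally likely to land at each position) and that a single-parameter exponential decay, propensity(k) = exp(-lam * (k - 1)), is an adequate shape. The function names and the toy counts are hypothetical; a real pipeline would fit one curve per surface and device.

```python
import math


def relative_ctr(clicks, impressions):
    """CTR per position divided by the CTR at position 1 (index 0)."""
    ctr = [c / n for c, n in zip(clicks, impressions)]
    return [x / ctr[0] for x in ctr]


def fit_exponential_decay(clicks, impressions):
    """Fit propensity(k) = exp(-lam * (k - 1)) by weighted least squares in log space.

    Weights are impression counts, so sparse positions pull the curve less.
    """
    rel = relative_ctr(clicks, impressions)
    num = den = 0.0
    for idx, (r, w) in enumerate(zip(rel, impressions)):
        if idx == 0 or r <= 0:  # position 1 is the reference; skip zero-click cells
            continue
        num += w * idx * -math.log(r)
        den += w * idx * idx
    if den == 0:
        raise ValueError("need at least one position beyond the first with clicks")
    return num / den


def propensity(position, lam):
    """Smoothed examination propensity for a 1-indexed position."""
    return math.exp(-lam * (position - 1))


# Synthetic toy data: 100,000 randomized impressions per position, with a
# true decay rate of 0.3 and a 10 percent CTR at position 1.
impressions = [100_000] * 5
clicks = [round(100_000 * 0.1 * math.exp(-0.3 * i)) for i in range(5)]
lam = fit_exponential_decay(clicks, impressions)  # close to 0.3
```

Fitting a single parametric curve is itself a form of smoothing over positions: a noisy cell shifts the fitted rate slightly instead of producing its own erratic propensity value.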
📌 Examples
Amazon search runs RandTop3 on 3 percent of traffic for one week per quarter. They collect 50 million randomized impressions, fit position curves per device and category, and use these propensities to train click models offline before deploying them to the full ranking stack.
Google uses score perturbation that adds Gumbel noise to ranking scores. Items within 0.05 score units frequently swap positions. Over millions of queries, this creates enough position variation to estimate propensities without large quality drops.
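Score perturbation with Gumbel noise can be sketched as below. This is an illustrative sketch only and does not describe any company's actual implementation; the noise scale of 0.02, the function names, and the toy scores are assumptions chosen so that items with nearly equal scores swap often while clearly weaker items stay put.

```python
import math
import random


def gumbel_noise(rng, scale):
    """Draw Gumbel(0, scale) noise via inverse transform sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -scale * math.log(-math.log(u))


def perturbed_ranking(scored_items, scale, rng):
    """Rank (item, score) pairs by score plus Gumbel noise, highest first."""
    noisy = [(score + gumbel_noise(rng, scale), item) for item, score in scored_items]
    noisy.sort(key=lambda t: t[0], reverse=True)
    return [item for _, item in noisy]


# Toy example: "a" and "b" are a near tie, "c" is far behind.
rng = random.Random(7)
items = [("a", 0.90), ("b", 0.88), ("c", 0.50)]
rankings = [perturbed_ranking(items, scale=0.02, rng=rng) for _ in range(5000)]
b_first_share = sum(r[0] == "b" for r in rankings) / len(rankings)
```

In this toy setup "b" takes the top slot in a meaningful minority of rankings, while "c" essentially never leaves the last position, which is the behavior the text attributes to perturbation: variation concentrated on near ties.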