ML-Powered Search & Ranking · Relevance Feedback (Click Models, Position Bias) · Hard · ~3 min

How Do You Deploy Bias Correction in a Production Ranking Pipeline?

Phase 1: Instrumentation

Before any model changes, add client-side viewability logging. Record when items enter the viewport, how long they remain visible, and scroll depth per session. This typically takes 2-4 weeks to implement and validate. Run server-side and client-side logging in parallel for 1-2 weeks to understand the gap. A typical finding: server logs show 30 impressions per session, while client logs show only 8-12 truly viewable impressions.
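A minimal sketch of the gap analysis, assuming each log is a list of per-event records with hypothetical `session` and `item` fields (server events for every rendered item, client events only for items that met a viewability threshold such as 50% visible for at least one second):

```python
from collections import defaultdict

# Synthetic example data: the server rendered 30 items for session s1,
# but only 10 of them were actually viewable on the client.
server_events = [{"session": "s1", "item": f"i{k}"} for k in range(30)]
client_events = [{"session": "s1", "item": f"i{k}", "dwell_ms": 1400}
                 for k in range(10)]

def impressions_per_session(events):
    """Count unique impressed items per session."""
    seen = defaultdict(set)
    for e in events:
        seen[e["session"]].add(e["item"])
    return {s: len(items) for s, items in seen.items()}

server = impressions_per_session(server_events)
viewable = impressions_per_session(client_events)
for s in server:
    gap = server[s] - viewable.get(s, 0)
    print(s, server[s], viewable.get(s, 0), gap)  # e.g. s1 30 10 20
```

Items in the gap were logged as "shown but not clicked" on the server, yet the user never saw them; training on server impressions treats them as false negatives.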

Phase 2: Propensity Estimation

Once you have viewability data, estimate propensities from 2-4 weeks of exploration data. Route 2-3% of traffic through epsilon-greedy ranking to generate position variation. Compute the examination probability per position by aggregating clicks across items that appeared at each position. Validate that the propensity curve is monotonically decreasing (position 1 should have a higher examination probability than position 10). If data volume allows, segment by device type (mobile vs. desktop) and user tenure (new vs. returning).
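The aggregation step above can be sketched as follows. The propensity here is the per-position click-through rate over randomized traffic, normalized to position 1 — this assumes the epsilon-greedy shuffling makes average item relevance roughly uniform across positions, so CTR differences reflect examination alone:

```python
from collections import defaultdict

def estimate_propensities(logs):
    """Estimate per-position examination propensities from randomized
    (epsilon-greedy) impression logs. Each log entry is (position, clicked)."""
    clicks, views = defaultdict(int), defaultdict(int)
    for pos, clicked in logs:
        views[pos] += 1
        clicks[pos] += int(clicked)
    ctr = {p: clicks[p] / views[p] for p in views}
    top = ctr[min(ctr)]  # normalize so the top position has propensity 1.0
    return {p: ctr[p] / top for p in sorted(ctr)}

# Toy logs: position 1 gets CTR 0.5, position 2 gets CTR 0.25.
logs = [(1, 1), (1, 0), (2, 1), (2, 0), (2, 0), (2, 0)]
props = estimate_propensities(logs)
print(props)  # {1: 1.0, 2: 0.5}
```

The monotonicity check is then just `all(props[p] >= props[p + 1] for p in list(props)[:-1])`; a non-monotone curve usually means the exploration traffic was not actually randomized or the sample is too small.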

Phase 3: Model Training With IPS

Retrain your ranking model with an IPS-weighted loss. Each click example gets weight 1/propensity, capped at 10-20 to limit variance. Train on viewable impressions only, not server-side impressions. Compare offline metrics (NDCG, AUC) against the baseline. Expect slight drops, because the baseline was optimized on biased data; the true test is online.
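A sketch of the weighting, assuming a pointwise log loss in which clicked examples are up-weighted by capped inverse propensity while non-clicks keep weight 1 (one common convention; the normalization by total weight makes this a self-normalized IPS estimate):

```python
import math

def ips_weight(propensity, cap=10.0):
    """Inverse propensity weight, clipped to limit variance."""
    return min(1.0 / max(propensity, 1e-6), cap)

def ips_log_loss(examples, propensities, cap=10.0):
    """examples: iterable of (position, clicked_label, predicted_prob).
    propensities: dict mapping position -> examination propensity."""
    total, wsum = 0.0, 0.0
    for pos, y, p in examples:
        w = ips_weight(propensities[pos], cap) if y else 1.0
        p = min(max(p, 1e-6), 1.0 - 1e-6)  # guard against log(0)
        total += w * -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        wsum += w
    return total / wsum
```

For example, a click at position 5 with examination propensity 0.2 counts five times as much as a click at position 1 with propensity 1.0, compensating for how rarely position 5 is examined; the cap keeps a noisy propensity estimate of 0.01 from dominating the batch with a weight of 100.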

Phase 4: Staged Rollout

Deploy to 1% of traffic first. Monitor clicks, dwell time, and scroll depth. A bias-corrected model should increase clicks at lower positions and increase average scroll depth as users find relevant content deeper in the list. Ramp to 5%, 20%, 50%, then 100% over 2-4 weeks. At each stage, check that metrics at positions 5-10 improve without hurting positions 1-4. Full rollout should show a 3-8% improvement in total relevant item exposure.
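The stage-gate check can be sketched as a simple guardrail over position buckets; the 1% relative regression tolerance on positions 1-4 is an assumption for illustration, and real pipelines would add significance testing:

```python
def bucket_ctr(events, positions):
    """CTR over events falling in the given position bucket.
    events: iterable of (position, clicked)."""
    clicks = sum(c for p, c in events if p in positions)
    views = sum(1 for p, _ in events if p in positions)
    return clicks / views if views else 0.0

def rollout_guardrail(control, treatment):
    """Pass the stage if positions 5-10 improve while positions 1-4
    do not regress beyond a 1% relative tolerance (assumed)."""
    head, tail = range(1, 5), range(5, 11)
    head_ok = bucket_ctr(treatment, head) >= 0.99 * bucket_ctr(control, head)
    tail_ok = bucket_ctr(treatment, tail) > bucket_ctr(control, tail)
    return head_ok and tail_ok
```

Run this at each ramp stage before widening traffic; a failed head-bucket check means the correction is cannibalizing top positions and the ramp should pause.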

💡 Key Takeaways
Start with client-side viewability instrumentation (2-4 weeks). Compare server logs (~30 impressions per session) to client logs (8-12 viewable).
Estimate propensities from 2-4 weeks of 2-3% exploration traffic. Validate that curves are monotonically decreasing by position.
Train with IPS weights capped at 10-20, on viewable impressions only. Expect offline metric drops; the true test is online.
Stage the rollout from 1% to 100% over 2-4 weeks. Watch for a 3-8% improvement in relevant item exposure at lower positions.
📌 Interview Tips
1. Describe the phased rollout: instrumentation first, then propensity estimation, then model training, then staged deployment. Each phase has specific validation criteria.
2. Emphasize the gap between server and client logging: 30 vs. 8-12 impressions per session is typical. This gap represents false negatives in training data.
3. When discussing success metrics, focus on clicks and dwell time at positions 5-10 improving without hurting positions 1-4. Total relevant exposure should increase 3-8%.