How Do You Deploy Bias Correction in a Production Ranking Pipeline?
Phase 1: Instrumentation
Before any model changes, add client-side viewability logging. Record when each item enters the viewport, how long it remains visible, and the scroll depth per session. Implementing and validating this takes 2-4 weeks. Then run server-side and client-side logging in parallel for 1-2 weeks to measure the gap between them. A typical finding: server logs show 30 impressions per session, while client logs show only 8-12 truly viewable impressions.
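To make the gap measurement concrete, here is a minimal sketch of how the two log sources might be compared once they are joined per session. The SessionLog record, its field names, and the sample numbers are all hypothetical; a real pipeline would read these counts from its own logging tables.

```python
from dataclasses import dataclass


# Hypothetical record: one row per session after joining both log sources.
@dataclass
class SessionLog:
    session_id: str
    server_impressions: int    # items the server logged as returned
    viewable_impressions: int  # items the client reported as entering the viewport


def viewability_rate(sessions):
    """Share of server-logged impressions that the client confirmed as viewable."""
    served = sum(s.server_impressions for s in sessions)
    viewed = sum(s.viewable_impressions for s in sessions)
    return viewed / served if served else 0.0


# Illustrative sessions only, shaped like the typical finding described above.
sessions = [
    SessionLog("a", 30, 8),
    SessionLog("b", 30, 12),
    SessionLog("c", 30, 10),
]
rate = viewability_rate(sessions)
print(f"viewable share of server impressions: {rate:.2f}")
```

Tracking this ratio per day during the parallel-logging period is one way to confirm that the client-side instrumentation is stable before relying on it for training data.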
Phase 2: Propensity Estimation
Once you have viewability data, estimate propensities from 2-4 weeks of exploration data. Route 2-3% of traffic through epsilon-greedy exploration to generate position variation. In the exploration traffic, an item's position is largely decoupled from its relevance, so the click rate aggregated across all items shown at a position reflects how often that position is examined. Validate by checking that the propensity curve decreases monotonically (position 1 should have a higher examination probability than position 10). Segment by device type (mobile vs. desktop) and user tenure (new vs. returning) if data volume allows.
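The aggregation and the monotonicity check can be sketched as follows. This is an illustrative simplification, not a definitive estimator: it assumes the input contains only events from the exploration slice, and it normalizes each position's click rate by the top position's click rate so that position 1 has propensity 1.0. The function names and the sample counts are hypothetical.

```python
from collections import defaultdict


def estimate_propensities(events):
    """Estimate relative examination propensity per position.

    events: iterable of (position, clicked) pairs, assumed to come from
    exploration traffic only. Returns {position: propensity}, normalized
    so that the top position has propensity 1.0.
    """
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for position, clicked in events:
        shown[position] += 1
        clicks[position] += int(clicked)

    ctr = {p: clicks[p] / shown[p] for p in shown}
    top_ctr = ctr[min(ctr)]
    if top_ctr == 0:
        raise ValueError("no clicks at the top position; cannot normalize")
    return {p: ctr[p] / top_ctr for p in sorted(ctr)}


def is_monotone_decreasing(propensities):
    """True if propensity never increases as position gets deeper."""
    values = [propensities[p] for p in sorted(propensities)]
    return all(a >= b for a, b in zip(values, values[1:]))


# Made-up counts: 100 impressions per position with 20, 10, and 5 clicks.
events = []
for position, n_clicks in [(1, 20), (2, 10), (3, 5)]:
    events += [(position, True)] * n_clicks
    events += [(position, False)] * (100 - n_clicks)

propensities = estimate_propensities(events)
print(propensities, is_monotone_decreasing(propensities))
```

To segment by device type or user tenure, the same function could be run separately on each segment's events, provided each segment has enough exploration traffic to give stable click rates.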
Phase 3: Model Training With IPS
Retrain your ranking model with an inverse propensity scoring (IPS) weighted loss. Each click example gets weight 1/propensity, with the weight capped at 10-20 to limit variance. Train on viewable impressions only, not server-side impressions. Compare offline metrics (NDCG, AUC) against the baseline. Expect slight drops, because the baseline was optimized for the same biased data that the offline metrics are computed on. The true test is online.
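As a rough illustration of the weighting, the sketch below applies capped inverse propensity weights inside a pointwise log loss. Several choices here are assumptions rather than anything prescribed above: the pointwise log loss itself, the weight of 1.0 on non-clicked examples, the default cap of 10, and the function names. Examples are assumed to be viewable impressions only.

```python
import math


def ips_weight(propensity, cap=10.0):
    """Inverse propensity weight, capped to limit variance."""
    if propensity <= 0:
        raise ValueError("propensity must be positive")
    return min(1.0 / propensity, cap)


def ips_weighted_log_loss(examples, cap=10.0, eps=1e-12):
    """Mean log loss with capped IPS weights on clicked examples.

    examples: list of (predicted_click_prob, clicked, propensity) tuples,
    one per viewable impression. Non-clicked examples get weight 1.0
    (an assumption of this sketch).
    """
    if not examples:
        return 0.0
    total = 0.0
    for predicted, clicked, propensity in examples:
        p = min(max(predicted, eps), 1.0 - eps)  # keep log() finite
        if clicked:
            total += -ips_weight(propensity, cap) * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(examples)


# A click at a rarely examined position counts more than one at the top.
deep_click = ips_weighted_log_loss([(0.5, True, 0.25)])
top_click = ips_weighted_log_loss([(0.5, True, 1.0)])
print(deep_click, top_click)
```

The cap trades bias for variance: a lower cap keeps rare deep-position clicks from dominating the gradient, while a higher cap stays closer to the uncapped IPS weighting.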
Phase 4: Staged Rollout
Deploy to 1% of traffic first. Monitor clicks, dwell time, and scroll depth. A bias-corrected model should increase clicks at lower positions and increase average scroll depth as users find relevant content deeper in the list. Ramp to 5%, 20%, 50%, and then 100% over 2-4 weeks. At each stage, check that metrics for positions 5-10 improve without hurting positions 1-4. Full rollout should show a 3-8% improvement in total relevant item exposure.
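The per-stage check could be encoded as a simple ramp gate, sketched below. This is a simplified illustration: it compares raw click-through rates per position between control and treatment, whereas a production gate would normally also require statistical significance and cover dwell time and scroll depth. The function name, the dictionary inputs, and the 1% tolerance for positions 1-4 are hypothetical.

```python
RAMP_STAGES = [1, 5, 20, 50, 100]  # percent of traffic, as in the rollout plan


def next_stage(current_pct, control_ctr, treatment_ctr, tolerance=0.01):
    """Return the next traffic percentage, or current_pct to hold.

    control_ctr / treatment_ctr: {position: click-through rate} for
    positions 1-10. Advances only if positions 5-10 improve and
    positions 1-4 do not drop by more than `tolerance` (relative).
    """
    top, lower = range(1, 5), range(5, 11)

    control_top = sum(control_ctr[p] for p in top)
    treatment_top = sum(treatment_ctr[p] for p in top)
    control_lower = sum(control_ctr[p] for p in lower)
    treatment_lower = sum(treatment_ctr[p] for p in lower)

    top_ok = treatment_top >= control_top * (1.0 - tolerance)
    lower_ok = treatment_lower > control_lower
    if not (top_ok and lower_ok):
        return current_pct  # hold the ramp and investigate

    idx = RAMP_STAGES.index(current_pct)
    return RAMP_STAGES[min(idx + 1, len(RAMP_STAGES) - 1)]


# Made-up rates: deeper positions improve, top positions are unchanged.
control = {p: 0.10 if p <= 4 else 0.02 for p in range(1, 11)}
treatment = {p: 0.10 if p <= 4 else 0.03 for p in range(1, 11)}
print(next_stage(1, control, treatment))
```

Returning the current percentage when a check fails keeps the decision explicit: the ramp pauses rather than rolling back automatically, which leaves room for a human review of the per-position metrics.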