
Production Implementation: Logging, Calibration, and Monitoring

PROPENSITY LOGGING

Every impression must log a propensity score: the probability that this item would be shown at this position under the current model and randomization policy. Without logged propensities, you cannot apply inverse propensity scoring (IPS) later. The logging pipeline must capture: user context, item ID, position shown, propensity, timestamp, and eventual outcome (click, conversion). Store the propensity with high precision (at least 4 decimal places) to avoid numerical issues in IPS weights.
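A minimal sketch of such a log record in Python. The schema, the ImpressionLog name, and the log_impression helper are illustrative assumptions, not a specific library's API:

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class ImpressionLog:
    # One row per impression; field names are illustrative.
    user_context_id: str
    item_id: str
    position: int          # 1-indexed rank on the page
    propensity: float      # P(item shown at this position | logging policy)
    timestamp: float
    outcome: int = 0       # filled in later: 1 = click/conversion, 0 = none

def log_impression(record: ImpressionLog) -> str:
    # Keep 4+ decimal places: truncating 0.0847 to 0.08 distorts the
    # IPS weight (1 / propensity) by ~6%, and smaller propensities suffer worse.
    record.propensity = round(record.propensity, 6)
    return json.dumps(asdict(record))

line = log_impression(ImpressionLog(
    user_context_id="u_123", item_id="item_42",
    position=1, propensity=0.0847, timestamp=time.time()))
print(line)
print("IPS weight:", 1.0 / 0.0847)  # ~11.8; a click on this impression is reweighted by this
```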

CALIBRATION AND MONITORING

After debiasing, your model should predict the same CTR for an item regardless of the position it was shown in. Test this by comparing predicted CTR against actual CTR, bucketed by position. If position-1 predictions run 20% higher than actual, debiasing is incomplete. Recalibrate using isotonic regression or Platt scaling, and monitor calibration weekly, since user behavior and the catalog change over time.
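A hedged sketch of that bucketed check plus an isotonic recalibration step, using scikit-learn's IsotonicRegression; the synthetic data and the 20% flagging threshold are assumptions for illustration:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Synthetic logs: predicted CTR, position shown, and realized click.
n = 50_000
position = rng.integers(1, 11, size=n)
predicted = rng.uniform(0.01, 0.20, size=n)
# Simulate residual position bias that debiasing missed.
true_ctr = predicted * (1.0 - 0.02 * (position - 1))
clicked = rng.binomial(1, np.clip(true_ctr, 0.0, 1.0))

# Bucket by position: predicted vs actual CTR should match in every bucket.
for p in range(1, 11):
    mask = position == p
    pred_ctr, act_ctr = predicted[mask].mean(), clicked[mask].mean()
    gap = (pred_ctr - act_ctr) / act_ctr
    flag = "  <-- recalibrate" if abs(gap) > 0.20 else ""
    print(f"pos {p:2d}: predicted {pred_ctr:.4f}  actual {act_ctr:.4f}  gap {gap:+.1%}{flag}")

# Recalibrate: learn a monotone map from raw scores to calibrated probabilities.
iso = IsotonicRegression(out_of_bounds="clip")
calibrated = iso.fit_transform(predicted, clicked)
```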

WHOLE-PAGE OPTIMIZATION

Individual item relevance is not enough. Consider page-level effects: diversity (ten near-duplicate items perform worse than ten varied ones), context (an item might work well only after a specific preceding item), and diminishing returns (a user is less likely to click any item in position 10, regardless of relevance). Whole-page models optimize the entire slate, the ordered set of items on the page, rather than ranking each item in isolation.
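One common way to operationalize this is a greedy MMR-style (maximal marginal relevance) slate builder with a position-discounted page value; the trade-off weight lam and the DCG-style discount are illustrative assumptions, not a prescribed recipe:

```python
import numpy as np

def build_slate(relevance, embeddings, k=10, lam=0.7):
    # Greedy MMR: each pick trades item relevance against its max
    # similarity to items already chosen, keeping the page diverse.
    chosen = []
    for _ in range(k):
        best, best_score = -1, -np.inf
        for i in range(len(relevance)):
            if i in chosen:
                continue
            sim = max((float(embeddings[i] @ embeddings[j]) for j in chosen),
                      default=0.0)
            score = lam * relevance[i] - (1 - lam) * sim
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen

def slate_value(slate, relevance):
    # DCG-style discount captures diminishing returns: relevance at
    # position 10 contributes far less than the same relevance at position 1.
    return sum(relevance[i] / np.log2(pos + 2) for pos, i in enumerate(slate))

rng = np.random.default_rng(1)
emb = rng.normal(size=(50, 16))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit norm: dot = cosine sim
rel = rng.uniform(size=50)
slate = build_slate(rel, emb)
print(slate, slate_value(slate, rel))
```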

✅ Best Practice: Run continuous calibration checks. Create dashboards showing predicted vs actual CTR by position, by user segment, by item category. Drift in any dimension indicates a problem.

A/B TESTING DEBIASING CHANGES

Debiasing improves long-term metrics but may hurt short-term ones. Run experiments for at least 2 to 4 weeks to observe the full effect. Compare both engagement metrics (CTR, time spent) and diversity metrics (catalog coverage, long-tail engagement). A successful debiasing launch shows stable or improved engagement plus significantly better diversity.
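A sketch of how per-arm metrics might be computed; the function name, the head/long-tail split, and the toy data are assumptions:

```python
def experiment_metrics(impressions, clicks, catalog_size, head_items):
    # impressions/clicks are lists of item IDs logged for one experiment arm;
    # head_items is the set of popular "head" items (the split is an assumption).
    ctr = len(clicks) / len(impressions)
    coverage = len(set(impressions)) / catalog_size  # catalog coverage
    long_tail = sum(1 for i in clicks if i not in head_items) / max(len(clicks), 1)
    return {"ctr": ctr, "catalog_coverage": coverage,
            "long_tail_click_share": long_tail}

head = {"a", "b"}
control = experiment_metrics(["a", "a", "b", "c", "a", "b"], ["a", "b"],
                             catalog_size=100, head_items=head)
treatment = experiment_metrics(["a", "c", "d", "e", "b", "f"], ["a", "d"],
                               catalog_size=100, head_items=head)
# Success: treatment CTR within noise of control, diversity metrics clearly up.
print(control)
print(treatment)
```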

💡 Key Takeaways
- Log the propensity with 4+ decimal places for every impression: user context, item, position, propensity, outcome
- A debiased model should predict the same CTR regardless of position; test by bucketing predictions by position
- Recalibrate with isotonic regression (or Platt scaling) when position-1 predictions are 20%+ off from actual
- Whole-page optimization considers diversity, context, and diminishing returns across the slate
- Run A/B tests for 2-4 weeks to see the full effect of debiasing on engagement and diversity metrics
📌 Interview Tips
1. Describe propensity logging: store 0.0847, not 0.08, to avoid numerical issues in IPS weights
2. Explain the calibration check: predicted CTR by position should be a flat line if debiasing worked
3. Discuss experiment length: 1 week misses the long-term diversity gains from debiasing