Production Implementation: Logging, Calibration, and Monitoring
PROPENSITY LOGGING
Every impression must log a propensity score: the probability that the item would be shown at that position under the current model and randomization policy. Without propensities you cannot apply inverse propensity scoring (IPS) later. The logging pipeline must capture: user context, item ID, position shown, propensity, timestamp, and the eventual outcome (click, conversion). Store propensities at high precision (at least 4 decimal places): small propensities become large IPS weights, so rounding introduces error exactly where the estimator is most sensitive.
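A minimal logging sketch under these assumptions (the record fields, the ImpressionLog name, and the log_impression helper are illustrative, not a standard API):

    from dataclasses import dataclass, asdict
    import json

    @dataclass
    class ImpressionLog:
        request_id: str     # joins the impression back to its outcome event
        user_context: str   # hashed or serialized user/context features
        item_id: str
        position: int       # 1-indexed rank on the page
        propensity: float   # P(item shown here | model, randomization policy)
        timestamp_ms: int

    def log_impression(record: ImpressionLog, sink) -> None:
        # A propensity of 0 (or one rounded down to 0) makes the IPS weight
        # 1/propensity undefined, so reject it at write time.
        if not 0.0 < record.propensity <= 1.0:
            raise ValueError("propensity must be in (0, 1]")
        sink.write(json.dumps(asdict(record)) + "\n")

The IPS weight downstream is simply 1 / propensity, which makes the precision requirement concrete: rounding 0.000149 to four decimal places gives 0.0001, turning a weight of roughly 6,700 into 10,000.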
CALIBRATION AND MONITORING
After debiasing, predictions should be equally well calibrated at every position: predicted CTR should match observed CTR whether the item was shown at position 1 or position 10. Test this by bucketing impressions by position and comparing mean predicted CTR to actual CTR within each bucket. If position-1 predictions run 20% above actual, debiasing is incomplete. Recalibrate using isotonic regression or Platt scaling, and monitor calibration weekly, since user behavior and the catalog both drift.
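A sketch of that weekly calibration check, assuming numpy arrays of per-impression positions, predicted CTRs, and click labels (the function names are illustrative; the isotonic fit uses scikit-learn's IsotonicRegression):

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    def calibration_by_position(positions, predicted, clicks, max_pos=10):
        """Mean predicted vs. observed CTR per position; ratio near 1.0 is healthy."""
        report = {}
        for pos in range(1, max_pos + 1):
            mask = positions == pos
            if not mask.any():
                continue
            pred, actual = predicted[mask].mean(), clicks[mask].mean()
            report[pos] = {"pred": pred, "actual": actual,
                           "ratio": pred / max(actual, 1e-9)}
        return report

    def recalibrate(predicted, clicks):
        # Learn a monotone map from raw scores to observed click rates.
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        iso.fit(predicted, clicks)
        return iso  # later: iso.predict(new_scores)

A ratio of 1.2 in the position-1 bucket is exactly the "20% higher" failure case described above.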
WHOLE PAGE OPTIMIZATION
Individual item relevance is not enough. Consider page-level effects: diversity (ten near-identical items perform worse than ten varied ones), context (an item may work well only after a specific preceding item), and diminishing returns (the user is less likely to click anything at position 10, regardless of relevance). Whole-page models optimize the entire slate rather than each ranking in isolation.
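One common way to implement the diversity part of this is a greedy, MMR-style slate builder. The source does not prescribe a method, so the following is a sketch, with the relevance scores, pairwise similarity matrix, and lambda_div trade-off weight all assumed inputs:

    def greedy_slate(relevance, similarity, k=10, lambda_div=0.3):
        """Fill the page greedily, penalizing each candidate by its maximum
        similarity to items already placed (relevance vs. redundancy trade-off)."""
        chosen, remaining = [], set(range(len(relevance)))
        while remaining and len(chosen) < k:
            def score(i):
                redundancy = max((similarity[i][j] for j in chosen), default=0.0)
                return relevance[i] - lambda_div * redundancy
            best = max(remaining, key=score)
            chosen.append(best)
            remaining.remove(best)
        return chosen  # item indices in page order

Sequence context and position-level diminishing returns can be folded into score(i) as additional terms rather than handled as separate passes.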
A/B TESTING DEBIASING CHANGES
Debiasing tends to improve long-term metrics but can hurt short-term ones. Run experiments for at least 2 to 4 weeks to observe the full effect. Compare both engagement metrics (CTR, time spent) and diversity metrics (catalog coverage, long-tail engagement). A successful debiasing launch shows stable or improved engagement together with significantly better diversity.
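A sketch of the diversity side of that readout, assuming per-arm lists of impressed and clicked item IDs and a "head" defined as the top 1% most-impressed items (the function name and the head_fraction cutoff are illustrative choices, not a standard):

    from collections import Counter

    def diversity_metrics(impressed, clicked, catalog_size, head_fraction=0.01):
        """Catalog coverage and long-tail click share for one experiment arm."""
        coverage = len(set(impressed)) / catalog_size
        head_size = max(1, int(head_fraction * catalog_size))
        head = {item for item, _ in Counter(impressed).most_common(head_size)}
        tail_clicks = sum(1 for item in clicked if item not in head)
        return {
            "catalog_coverage": coverage,
            "long_tail_click_share": tail_clicks / max(len(clicked), 1),
        }

Compute these per arm alongside CTR and time spent; the pattern to look for is flat or improved engagement with clearly higher catalog coverage and long-tail click share in the treatment arm.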