Propensity Logging and Offline Policy Evaluation (OPE)
When you deploy a new bandit policy, you need to estimate its performance before sending it live traffic. Offline Policy Evaluation (OPE) lets you answer "how would policy B have performed on the traffic policy A saw?" using only data logged under policy A. This requires propensity logging: recording the exact probability p that policy A used to choose each action. Without correct p, OPE estimates are biased and you cannot safely compare policies.
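To make propensity logging concrete, here is a minimal Python sketch of a softmax bandit that samples an action and records the exact probability it used. The helper and field names (decide_and_log, decision_id, and so on) are illustrative assumptions, not any particular serving framework's API.

```python
import math
import random
import uuid


def softmax(scores):
    """Turn raw action scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decide_and_log(context_features, action_scores, model_version):
    """Sample an action and record the exact propensity used to pick it."""
    probs = softmax(action_scores)
    action = random.choices(range(len(probs)), weights=probs, k=1)[0]
    record = {
        "decision_id": str(uuid.uuid4()),  # idempotency key for joining the reward later
        "context": context_features,
        "action": action,
        "propensity": probs[action],       # exact float, never rounded or quantized
        "model_version": model_version,
    }
    # Append record to the decision log; reward r is joined later by decision_id.
    return action, record
```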
The logged tuple is (context x, action a, propensity p, reward r, decision ID, model version). Inverse Propensity Scoring (IPS) reweights each observed reward by policy B's probability of the logged action divided by the logged propensity p, which corrects selection bias: if policy A chose action a with probability p = 0.1 but policy B would choose it with probability 0.5, that observation gets weight 0.5/0.1 = 5 when estimating policy B's value. IPS is unbiased but high variance: rare actions (small p) get huge weights. Practical systems clip importance weights at 10-20x or use self-normalized IPS to control variance.
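The IPS arithmetic above fits in a few lines. This is a sketch under the assumption that each log entry is a (x, a, p, r) tuple and that pi_b(x, a) returns the candidate policy B's probability of choosing the logged action; the function name and clipping default are illustrative.

```python
def ips_value(logs, pi_b, clip=None, self_normalize=False):
    """Estimate policy B's value from logs collected under policy A.

    Each log entry is (x, a, p, r): context, logged action, propensity
    under policy A, and observed reward. pi_b(x, a) gives policy B's
    probability of choosing action a in context x.
    """
    weights, weighted_rewards = [], []
    for x, a, p, r in logs:
        w = pi_b(x, a) / p              # importance weight, e.g. 0.5 / 0.1 = 5
        if clip is not None:
            w = min(w, clip)            # clip at 10-20x in practice to cap variance
        weights.append(w)
        weighted_rewards.append(w * r)
    if self_normalize:
        # Self-normalized IPS: divide by the sum of weights instead of n.
        return sum(weighted_rewards) / sum(weights)
    return sum(weighted_rewards) / len(logs)
```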
Doubly Robust (DR) estimators combine IPS with a reward model: use the model to predict a baseline reward, then use IPS only to correct the residual error. DR has lower variance than pure IPS and remains unbiased as long as either the reward model or the propensities are correct. Microsoft Personalizer and Vowpal Wabbit-based systems use DR as the default OPE method. Typical workflow: log decisions for a week, train a candidate policy offline, replay the logged data through the candidate policy computing IPS or DR estimates, compare to the baseline, and promote if the estimate shows significant lift without violating guardrails.
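A sketch of the DR estimator under the same assumptions as above, plus a reward model q_hat(x, a) that predicts expected reward (the name and signature are assumptions for illustration):

```python
def dr_value(logs, pi_b, actions, q_hat, clip=None):
    """Doubly robust estimate of policy B's value.

    q_hat(x, a) is a learned reward model; the IPS term only corrects
    the residual r - q_hat(x, a), which is what lowers the variance.
    """
    total = 0.0
    for x, a, p, r in logs:
        # Model-based baseline: expected reward if policy B acted in context x.
        baseline = sum(pi_b(x, a2) * q_hat(x, a2) for a2 in actions)
        # Importance-weighted correction on the logged action's residual.
        w = pi_b(x, a) / p
        if clip is not None:
            w = min(w, clip)
        total += baseline + w * (r - q_hat(x, a))
    return total / len(logs)
```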
Operational pitfalls: if you quantize or round propensities when logging, OPE becomes biased. If the model version recorded at logging time differs from the one that actually served the decision (clock skew, deployment race), the recorded p is wrong. Use idempotent decision IDs, consistent hashing for model routing, and end-to-end observability to detect these issues. Always maintain a holdout baseline policy (5 to 10 percent of traffic) for ground-truth comparison and circuit-breaker triggers.
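One way to catch quantized propensities, version skew, or routing bugs is to recompute the propensity during OPE from the logged model version and compare it to what was logged. This is a sketch only; load_model and action_probabilities are hypothetical hooks into your model registry and scoring code.

```python
def audit_logged_propensities(logs, load_model, tolerance=1e-6):
    """Flag records whose logged propensity disagrees with a recomputation.

    load_model(version) is assumed to return the logged model version, whose
    action_probabilities(context) gives the distribution it would have produced.
    Mismatches indicate rounding at log time, version skew, or routing bugs.
    """
    mismatches = []
    for rec in logs:
        model = load_model(rec["model_version"])
        recomputed = model.action_probabilities(rec["context"])[rec["action"]]
        if abs(recomputed - rec["propensity"]) > tolerance:
            mismatches.append(rec["decision_id"])
    return mismatches
```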
💡 Key Takeaways
• Propensity p is the exact probability used to select action a. Without correct p, Inverse Propensity Scoring (IPS) estimates are biased and you cannot safely evaluate new policies offline.
• IPS reweights rewards by 1/p to correct selection bias. High variance problem: rare actions with small p get huge weights (1/0.01 = 100x). Clip at 10-20x in production.
• Doubly Robust (DR) combines IPS with a reward model for lower variance. Unbiased if either the model or the propensities are correct. Microsoft Personalizer uses DR as the default for offline evaluation.
• Logging failures cause silent bias: quantized propensities, model version mismatches, clock skew, and duplicate events all corrupt p. Use idempotent decision IDs and consistent model routing.
• Typical OPE workflow: log 1 week of decisions under policy A, train candidate policy B offline, replay the logged data computing IPS or DR, promote B if lift is significant and guardrails pass.
• Always maintain 5 to 10 percent holdout traffic on the baseline policy for ground-truth comparison and a circuit breaker. This prevents catastrophic policy deployments when OPE estimates are wrong.
📌 Examples
Yahoo LinUCB deployment: logged (x, a, p, r) for every front page decision. Offline IPS evaluation estimated 12+ percent CTR lift for new policies before live promotion, reducing risk.
Microsoft Personalizer replay: customer logs decisions for 7 days, uploads to OPE service. Service computes IPS and DR estimates for candidate policies, returns confidence intervals. Promotes if lower bound exceeds baseline.
Meta bandit systems: use self-normalized IPS with weight clipping at 20x. Typical logging: 100 million decisions per day; OPE runs daily on the previous week, evaluates 5 to 10 candidate policies, and promotes the top performer if guardrails pass.