
Exploration Policies: Contextual Bandits and New Item Boosting

Core Concept
New items cannot escape cold start without exposure. Exploration policies deliberately surface new items to collect engagement signals, even when the model has low confidence in their relevance.

Epsilon-Greedy

Reserve a fraction of recommendations for exploration. With epsilon = 0.1, 10% of slots show random or cold items. Simple to implement. Downside: exploration is untargeted. You might show a baby product to a college student.
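A minimal sketch of this slot-reservation idea (the helper name and slate layout are illustrative, not from a specific library):

```python
import random

def epsilon_greedy_slate(ranked_items, cold_pool, slate_size=10, epsilon=0.1):
    """Fill a slate mostly from the model's ranked list, reserving roughly
    an epsilon fraction of slots for randomly chosen cold-start items."""
    slate = []
    ranked = iter(ranked_items)
    for _ in range(slate_size):
        if cold_pool and random.random() < epsilon:
            # Untargeted exploration: any cold item, regardless of user context
            slate.append(random.choice(cold_pool))
        else:
            # Exploitation: take the next item from the model's ranking
            slate.append(next(ranked))
    return slate
```

Because the exploration slot ignores context entirely, this is where the baby-product-to-college-student mismatch comes from.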

Thompson Sampling

Model uncertainty around predicted scores. For new items with high uncertainty, sample from the optimistic end of the distribution. This naturally explores items where you are uncertain while exploiting confident predictions. More sophisticated than epsilon-greedy but requires probabilistic model outputs.
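One common concrete form is a Beta posterior over each item's CTR: sample a plausible CTR per item and rank by the samples. A sketch, assuming per-item click/impression counts are available (the function and data shapes are illustrative):

```python
import random

def thompson_rank(item_stats, k=5):
    """Rank items by one draw from each item's Beta(1+clicks, 1+misses)
    posterior over CTR. item_stats maps item_id -> (clicks, impressions).
    A new item with few impressions has a wide posterior, so its sample
    occasionally lands high and the item gets explored."""
    samples = {
        item: random.betavariate(1 + clicks, 1 + impressions - clicks)
        for item, (clicks, impressions) in item_stats.items()
    }
    return sorted(samples, key=samples.get, reverse=True)[:k]
```

As impressions accumulate, the posterior narrows and the item's samples concentrate around its true CTR, so exploration fades automatically.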

New Item Boosting

Explicitly boost scores for new items: boosted_score = score + boost × (1 - item_age / max_age). Fresh items get maximum boost, which decays over time. Tune boost magnitude to balance exploration against short-term engagement loss. Typical: 5-15% score boost for items under 24 hours old.
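The decay formula translates directly into code. This sketch treats `boost` as a fraction of the base score, matching the 5-15% range above (an interpretation, since the text leaves the boost's units open):

```python
def boosted_score(score, item_age, boost=0.10, max_age=24.0):
    """score + boost_term * (1 - item_age / max_age), where the boost term
    is a fraction of the base score. Items older than max_age (here, hours)
    get no boost; a brand-new item gets the full boost."""
    decay = max(0.0, 1.0 - item_age / max_age)
    return score + (boost * score) * decay
```

So a 10% boost on a brand-new item raises a score of 1.0 to 1.1, fading linearly to no boost at 24 hours.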

💡 Key Trade-off: Exploration hurts immediate metrics (CTR, conversion) to improve long-term catalog utilization. Run A/B tests to quantify the trade-off. If 10% exploration drops CTR by 2% but increases long-tail item exposure by 50%, leadership must decide if the trade-off is worth it.
💡 Key Takeaways
- Contextual bandits maintain uncertainty estimates for each item and allocate more impressions to high-uncertainty candidates, naturally reducing exploration as confidence narrows with data.
- Typical exploration budgets range from 5% to 15% of total impressions; higher budgets accelerate learning but degrade short-term CTR and user-satisfaction metrics.
- New-item boosts provide explicit ranking bonuses for a fixed trial period (commonly 200-500 impressions or 14-30 days), ensuring new catalog entries collect enough signal to compete.
- Guardrails protect user experience by imposing quality thresholds (only explore items above the 20th-percentile baseline), capping per-user exploration (2-3 uncertain items per session), and defining clear exit criteria.
- Measurement uses interleaving or counterfactual logging to isolate exploration impact, tracking exposure-normalized metrics like CTR per 100 impressions and catalog coverage percentage.
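The uncertainty-driven allocation described in the takeaways can be sketched with a UCB1-style score: empirical CTR plus a bonus that is large for lightly shown items and shrinks as impressions accumulate (the exploration constant `c` is a tunable assumption):

```python
import math

def ucb_score(clicks, impressions, total_impressions, c=2.0):
    """UCB1-style score for one item: empirical CTR plus an uncertainty
    bonus proportional to sqrt(log(N) / n). Rank items by this score and
    serve the top ones; exploration is concentrated on under-shown items."""
    if impressions == 0:
        return float("inf")  # force at least one trial per item
    mean_ctr = clicks / impressions
    bonus = math.sqrt(c * math.log(total_impressions) / impressions)
    return mean_ctr + bonus
```

Two items with identical CTRs rank differently under this score: the one with fewer impressions gets the larger bonus, which is exactly the allocation behavior the first takeaway describes.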
📌 Interview Tips
1. When asked about exploration budgets: explain dedicating 5-10% of impressions to cold items, using UCB or Thompson Sampling to balance learning with exploitation.
2. For new item boosts: describe time-limited ranking bonuses (+20-50% score for the first 7-30 days or first 100-500 impressions), tapering as signals accumulate.
3. When discussing trade-offs: mention that over-exploration hurts short-term metrics but under-exploration causes winner-take-all effects where new items never surface.