Recommendation Systems • Cold Start Problem • Medium • ⏱️ ~3 min
Exploration Policies: Contextual Bandits and New Item Boosting
Deliberate exploration is essential to break cold start cycles: without allocating traffic to uncertain users and items, the system never collects the data needed to move beyond its initial priors. Production systems use contextual bandit algorithms and explicit new item boosts to balance exploration (gathering data) against exploitation (maximizing immediate metrics). The core idea is to occasionally show items with high uncertainty, observe outcomes, and update beliefs.
Bandit algorithms such as Thompson Sampling and Upper Confidence Bound (UCB), made contextual by conditioning on user and item features, maintain probability distributions over expected rewards (CTR, conversion) for each item. Items with wide confidence intervals are sampled more often precisely because the system cannot yet rule out that they perform well. As impressions accumulate and the posteriors narrow, exploration naturally decreases. In practice, systems allocate 5 to 15% of impressions to exploration depending on the context: higher budgets speed learning but degrade short-term CTR. Spotify might reserve 10% of playlist slots for tracks with fewer than 100 listens, while Amazon dedicates 5% of recommendation carousels to products with under 50 page views.
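The core Thompson Sampling loop can be sketched with a Beta-Bernoulli model of per-item CTR. The class and field names below are illustrative, not taken from any production system:

```python
import random

class ThompsonSamplingRanker:
    """Beta-Bernoulli Thompson Sampling over per-item CTR (illustrative sketch)."""

    def __init__(self):
        # Beta(1, 1) prior for every item: maximal uncertainty before any data.
        self.alpha = {}  # 1 + observed clicks
        self.beta = {}   # 1 + observed skips

    def add_item(self, item_id):
        self.alpha.setdefault(item_id, 1)
        self.beta.setdefault(item_id, 1)

    def pick(self, candidates, k):
        # Sample a plausible CTR for each candidate from its posterior.
        # New items have wide posteriors, so they occasionally sample high
        # and win a slot; well-measured items sample near their true CTR.
        sampled = {c: random.betavariate(self.alpha[c], self.beta[c])
                   for c in candidates}
        return sorted(candidates, key=sampled.get, reverse=True)[:k]

    def update(self, item_id, clicked):
        # Posterior update: a click raises alpha, a skip raises beta,
        # narrowing the distribution and reducing future exploration.
        if clicked:
            self.alpha[item_id] += 1
        else:
            self.beta[item_id] += 1
```

As the comments note, exploration decays automatically: once an item has hundreds of observations, its posterior is tight and it is ranked almost purely by its estimated CTR.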
New item boosts provide a simpler mechanism: newly added catalog entries receive an explicit ranking bonus within relevant contexts for a fixed trial period. Airbnb gives new listings a boost in search results for the first 200 to 500 impressions or 14 to 30 days, ensuring they collect enough engagement signals to estimate quality. The boost is context-aware (it applies only in searches matching the listing's location, price range, and amenities) and is capped by quality filters to protect user experience. Once the trial budget is exhausted, the item competes on equal footing using its accumulated signals.
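A trial-period boost reduces to a small scoring rule. The sketch below uses the ranges from the text (500 impressions, 30 days) as concrete values; the item fields and the 20% bonus are assumptions for illustration, not Airbnb's actual parameters:

```python
import time

def boosted_score(base_score, item, now=None,
                  boost=0.20, max_impressions=500, max_age_days=30):
    """Apply a new-item ranking bonus during a fixed trial period.

    `item` is assumed to carry `created_at` (unix seconds), `impressions`,
    `predicted_quality`, and `quality_floor` -- hypothetical field names.
    """
    if now is None:
        now = time.time()
    age_days = (now - item["created_at"]) / 86400

    # Trial ends at whichever budget is exhausted first: impressions or age.
    in_trial = (item["impressions"] < max_impressions
                and age_days < max_age_days)

    # Quality filter: never boost items below a minimum predicted quality,
    # so the bonus cannot push junk into users' results.
    if in_trial and item["predicted_quality"] >= item["quality_floor"]:
        return base_score * (1 + boost)
    return base_score
```

After the trial, the function simply returns the learned score, so the item competes on equal footing with the rest of the catalog.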
Guardrails are critical: unconstrained exploration can severely degrade user experience. Systems impose minimum predicted CTR thresholds (only explore items above the 20th percentile of baseline quality), cap daily exploration per user (no more than 2 to 3 uncertain items per session), and define clear exit criteria from the cold-start state. Measurement uses interleaving or counterfactual logging to isolate exploration impact, tracking exposure-normalized metrics like CTR per 100 impressions and catalog coverage (the percent of items receiving any impressions).
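These guardrails amount to a filter over exploration candidates. A minimal sketch, with hypothetical field names and the thresholds from the text hard-coded as defaults:

```python
def select_exploration_slots(candidates, explored_this_session,
                             baseline_p20, max_per_session=3):
    """Filter exploration candidates through cold-start guardrails.

    Each candidate is assumed to carry `predicted_ctr`, `impressions`,
    and `cold_start_exit_impressions` (hypothetical field names).
    """
    chosen = []
    for item in candidates:
        # Per-session cap: at most max_per_session uncertain items per user.
        if explored_this_session + len(chosen) >= max_per_session:
            break
        # Quality floor: skip items below the 20th percentile of baseline CTR.
        if item["predicted_ctr"] < baseline_p20:
            continue
        # Exit criterion: items past their impression budget have graduated
        # from cold start and compete through the normal ranker instead.
        if item["impressions"] >= item["cold_start_exit_impressions"]:
            continue
        chosen.append(item)
    return chosen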
💡 Key Takeaways
•Contextual bandits maintain uncertainty estimates for each item and allocate more impressions to high-uncertainty candidates, naturally reducing exploration as confidence narrows with data
•Typical exploration budgets range from 5 to 15% of total impressions, with higher budgets accelerating learning but degrading short-term CTR and user satisfaction metrics
•New item boosts provide explicit ranking bonuses for a fixed trial period (commonly 200 to 500 impressions or 14 to 30 days), ensuring new catalog entries collect enough signals to compete
•Guardrails protect user experience by imposing quality thresholds (only explore items above the 20th-percentile baseline), capping per-user exploration (2 to 3 uncertain items per session), and defining clear exit criteria
•Measurement uses interleaving or counterfactual logging to isolate exploration impact, tracking exposure-normalized metrics like CTR per 100 impressions and catalog coverage percentage
📌 Examples
Amazon dedicates 5% of recommendation carousel slots to products with under 50 page views, using UCB to allocate impressions within that budget based on uncertainty
Airbnb new listing boost: +20% ranking score in matching searches for the first 200 impressions or 30 days, after which pure learned ranking applies; quality filters prevent spam
Spotify exploration: 10% of Discover Weekly slots go to tracks with fewer than 100 listens, sampled via Thompson Sampling within user taste clusters
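The UCB allocation described in the Amazon example can be sketched with the standard UCB1 score, mean reward plus a confidence bonus that shrinks as impressions accumulate. The fields, the exploration constant, and the budget mechanics below are assumptions for illustration, not Amazon's actual system:

```python
import math

def ucb_allocate(items, budget_slots, c=2.0):
    """Allocate a fixed exploration budget using UCB1-style scores.

    Each item is assumed to carry `clicks` and `impressions` counters.
    Returns the `budget_slots` items with the highest upper confidence bound.
    """
    total = sum(i["impressions"] for i in items) or 1

    def ucb(item):
        n = item["impressions"]
        if n == 0:
            return float("inf")  # never-shown items are always tried first
        mean_ctr = item["clicks"] / n
        # Confidence bonus: large when n is small, shrinking as data accrues.
        return mean_ctr + math.sqrt(c * math.log(total) / n)

    return sorted(items, key=ucb, reverse=True)[:budget_slots]
```

Within the reserved budget (e.g. 5% of carousel slots), items with few impressions dominate the ranking; once their counters grow, their bonus shrinks and the slots rotate to the next-most-uncertain products.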