
Slate and Ranked Bandits: Handling Multiple Positions and Positional Bias

When you show multiple recommendations simultaneously (a slate or ranked list), naive per-item bandits fail because position matters and items interact. A recommendation in position 1 gets roughly 10x more clicks than one in position 10 purely due to visibility, creating positional bias. If you naively credit position-10 items with their observed low CTR, you'll never learn their true quality. Additionally, showing the same content type in every position creates a poor user experience and ignores diversity.

Two architectural patterns address this. Per-position bandits (used by Scribd) run one independent bandit per position, treating each slot as a separate decision problem. Scribd deployed 10 position bandits × 3 user segments = 30 total bandits, each with 42 arms (row types). This controls for positional bias because each bandit learns the best content for its specific position, and you can account for context from the rows above. The tradeoff is that you need sufficient traffic per position to converge; Scribd needed about one week at their traffic volume.

Slate bandits (used by Udemy) optimize the unordered top-k set rather than a ranked list. You pick k=3 recommendation units to show, observe feedback from all k items in the slate, and update all k arms simultaneously. This provides k reward signals per impression instead of one, accelerating learning by 3x. Udemy chose k=3 based on viewport visibility (users see the top 3 without scrolling). The algorithm typically uses Thompson Sampling: sample scores for all candidate arms, pick the top k samples, and show them in an arbitrary order or one set by a secondary ranker.

Both patterns require careful reward attribution. At Scribd, rows above influence rows below (if position 1 satisfies the user, they may never scroll to position 10). Slate bandits must handle the fact that multiple items were shown simultaneously, so a click on one doesn't mean the others were bad. Contextual features (user segment, time of day, device) can be incorporated to further improve relevance, though this requires more sophisticated algorithms such as contextual Thompson Sampling or LinUCB.
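To make the slate pattern concrete, here is a minimal sketch of slate Thompson Sampling with Beta-Bernoulli arms, following the loop described above: sample a score for every candidate arm, show the top k, and update all k shown arms from the same impression. The arm names, binary click rewards, and uniform Beta(1, 1) priors are illustrative assumptions, not a reconstruction of Udemy's implementation.

```python
import random

class SlateThompsonBandit:
    """Slate bandit sketch: Beta-Bernoulli Thompson Sampling over an
    unordered top-k set, updating all k shown arms on every impression."""

    def __init__(self, arms, k=3):
        self.k = k
        # Beta(1, 1) uniform prior per arm, stored as [alpha, beta].
        self.params = {arm: [1.0, 1.0] for arm in arms}

    def select_slate(self):
        # Sample one score per arm from its posterior; keep the top k.
        sampled = {arm: random.betavariate(a, b)
                   for arm, (a, b) in self.params.items()}
        return sorted(sampled, key=sampled.get, reverse=True)[:self.k]

    def update(self, slate, clicks):
        # Every shown arm gets feedback from the same impression --
        # the k-signals-per-impression property described above.
        for arm, clicked in zip(slate, clicks):
            self.params[arm][0] += clicked      # alpha counts successes
            self.params[arm][1] += 1 - clicked  # beta counts failures

# One impression cycle with hypothetical recommendation units.
bandit = SlateThompsonBandit([f"unit_{i}" for i in range(20)], k=3)
slate = bandit.select_slate()
bandit.update(slate, clicks=[1, 0, 0])  # user clicked the first shown unit
```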
💡 Key Takeaways
Positional bias means position 1 gets 10x more clicks than position 10 purely due to visibility. Naive bandits would incorrectly learn that items shown in lower positions are low quality, preventing discovery of genuinely good content that happens to land in bad positions.
Per-position bandits (Scribd approach) run one independent bandit per slot, treating each position as a separate decision. This controls for positional bias and cross-position interactions but requires sufficient traffic per position. Scribd used 30 bandits (10 positions × 3 segments) with 42 arms each, converging in one week (see the sketch after this list).
Slate bandits (Udemy approach) optimize unordered top-k sets and observe feedback from all k items per impression. This accelerates learning by k× (Udemy chose k=3 for 3× faster convergence based on viewport visibility) but provides less control over exact ordering and cross-position effects.
Reward attribution becomes complex with slates. If a user clicks position 2, positions 1 and 3 were also shown and contributed to the context. At Scribd, rows influence each other (satisfaction at position 1 may prevent scrolling to position 10). Solutions include per-position bandits or counterfactual correction methods.
Traffic requirements scale with granularity. Scribd needed enough traffic to converge 30 simultaneous bandits of 42 arms each (1,260 arm posteriors across positions and segments). Low-traffic entities (such as Expedia's small properties) never converge and need hierarchical priors or traffic gating that runs bandits only in high-volume contexts.
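To illustrate the per-position pattern from the takeaways above, here is a sketch of 30 independent Beta-Bernoulli bandits keyed by (position, segment), with rows already shown above a slot excluded as a crude form of cross-position context. The dimensions mirror the Scribd numbers in the text (10 positions × 3 segments × 42 row types); the segment labels and the use of Thompson Sampling per slot are assumptions, not Scribd's actual design.

```python
import random

POSITIONS = range(10)
SEGMENTS = ["new", "casual", "power"]            # hypothetical segment labels
ROW_TYPES = [f"row_type_{i}" for i in range(42)]

# 10 positions x 3 segments = 30 independent bandits, 42 arms each.
bandits = {(pos, seg): {row: [1.0, 1.0] for row in ROW_TYPES}  # Beta(1, 1)
           for pos in POSITIONS for seg in SEGMENTS}

def pick_row(position, segment, already_shown):
    """Thompson-sample one row type for a slot, skipping rows shown
    above it -- a simple way to account for context from earlier rows."""
    arms = bandits[(position, segment)]
    sampled = {row: random.betavariate(a, b)
               for row, (a, b) in arms.items() if row not in already_shown}
    return max(sampled, key=sampled.get)

def build_page(segment):
    # Fill positions top to bottom so each slot sees the rows above it.
    page = []
    for pos in POSITIONS:
        page.append(pick_row(pos, segment, set(page)))
    return page

def record_click(position, segment, row, clicked):
    # Only the bandit owning this slot is updated, which is what controls
    # for positional bias: each slot learns on its own CTR scale.
    arms = bandits[(position, segment)]
    arms[row][0] += clicked
    arms[row][1] += 1 - clicked

page = build_page("casual")  # one homepage render for a hypothetical segment
```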
📌 Examples
Scribd homepage optimization: 10 position bandits × 3 user activity segments = 30 total bandits, each selecting from 42 row types. Initial randomization was followed by one week of exploitation. The system achieved +10% reads overall, and the best row saw a 4× activity uplift. The per-position approach controlled for positional bias and for rows above influencing rows below.
Udemy slate bandit: optimizes the top k=3 recommendation units visible in the viewport. Thompson Sampling samples all candidate arms, picks the top 3 samples, and shows them. A composite reward of clicks plus enrollments within 15 minutes provides 3 feedback signals per impression, accelerating convergence 3× versus single-arm feedback (a reward-attribution sketch follows these examples).
A Netflix homepage could use per-position bandits: position 1 (hero) optimized separately from position 2 ("Because You Watched"), each bandit selecting a content row type. Alternatively, a slate bandit could optimize the top 5 visible rows as an unordered set, reranked by a secondary heuristic or personalized scores.
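Here is a sketch of the composite reward attribution from the Udemy example above: clicks and enrollments within a 15-minute window are folded into one reward per shown unit, and every unit in the slate gets an update, with 0.0 for units that were shown but ignored. The relative weights and the event schema are assumptions for illustration, not Udemy's actual scoring.

```python
from dataclasses import dataclass

CLICK_WEIGHT = 1.0
ENROLL_WEIGHT = 5.0             # assumption: an enrollment outweighs a click
ATTRIBUTION_WINDOW_S = 15 * 60  # the 15-minute window from the example

@dataclass
class Event:
    unit: str             # which slate unit the event is attributed to
    kind: str             # "click" or "enroll"
    seconds_after: float  # time since the impression was served

def slate_rewards(slate, events):
    """One composite reward per shown unit. Units with no events still
    yield 0.0 -- being shown without engagement is a signal too."""
    rewards = {unit: 0.0 for unit in slate}
    for e in events:
        if e.unit in rewards and e.seconds_after <= ATTRIBUTION_WINDOW_S:
            rewards[e.unit] += CLICK_WEIGHT if e.kind == "click" else ENROLL_WEIGHT
    return rewards

# Usage: one click inside the window; the other two shown units get zeros.
print(slate_rewards(["unit_a", "unit_b", "unit_c"],
                    [Event("unit_a", "click", 42.0)]))
```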