Exploration Strategies: Epsilon Greedy, LinUCB, and Thompson Sampling
Exploration strategies determine how the bandit balances exploitation (choosing the best-known action) with exploration (trying uncertain actions to learn). Pure exploitation gets stuck on locally optimal choices; pure exploration wastes user experience on bad options. In the contextual setting, the strategy must also generalize across contexts, unlike non-contextual bandits that learn independently per arm.
Epsilon-greedy is the simplest: with probability ε (typically 0.01 to 0.05 in mature systems), choose a random action; otherwise exploit the best-known one. It wastes exploration traffic uniformly across all actions, including obviously bad ones. Use epsilon-greedy as a robust starting point with a decay schedule (start at 0.10 during cold start, decay to 0.02 at steady state). Never set ε to zero in production: you lose the ability to detect drift, and offline policy evaluation becomes impossible.
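A minimal sketch of ε-greedy selection with a linear decay schedule; the function names and decay horizon are illustrative assumptions, not from the source:

import random

def epsilon_schedule(step, eps_start=0.10, eps_floor=0.02, decay_steps=100_000):
    # Linear decay from the cold-start rate to the steady-state floor; never goes below the floor.
    frac = min(step / decay_steps, 1.0)
    return max(eps_floor, eps_start - frac * (eps_start - eps_floor))

def select_action(estimated_rewards, step):
    # estimated_rewards: dict mapping action -> current mean reward estimate.
    eps = epsilon_schedule(step)
    if random.random() < eps:
        return random.choice(list(estimated_rewards))           # explore uniformly at random
    return max(estimated_rewards, key=estimated_rewards.get)    # exploit the best-known action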
LinUCB (Linear Upper Confidence Bound) assumes rewards are approximately linear in the features and maintains per-action statistics: a d x d matrix A (a ridge-regularized design matrix whose inverse acts as the covariance of the estimate) and a d-dimensional vector b (reward-weighted feature accumulator). It scores each action as the estimated mean reward plus alpha times the uncertainty, where uncertainty shrinks as data accumulates. LinUCB naturally explores uncertain actions more and generalizes across similar contexts. Update cost is O(d²) per action for feature dimension d, which is feasible to hold in memory for d under 100. It works well when the reward truly is linear and breaks down for complex non-linear patterns. Google and Meta use UCB variants for multi-variant optimization with single-digit-millisecond policy budgets.
Thompson Sampling takes a Bayesian approach: maintain a posterior distribution over model parameters (e.g., Normal-Inverse-Gamma for linear models), sample from it on each request, and choose the action with the highest sampled reward. It balances exploration via posterior uncertainty without tuning alpha or ε, handles heterogeneity across contexts well, and empirically often matches or beats UCB. It requires careful prior initialization (optimistic priors for new actions) and can be sensitive to model misspecification. Thompson Sampling is common at Google for allocation problems and in systems using Vowpal Wabbit-style contextual bandits.
💡 Key Takeaways
•Epsilon-greedy with ε between 0.01 and 0.05 is robust and simple but wastes exploration traffic uniformly. Never set ε to zero, or offline policy evaluation breaks and drift goes undetected.
•LinUCB updates are O(d²) per action and run at microsecond scale for d under 100. Yahoo's front page used LinUCB to achieve a 12 to 20 percent CTR lift with sub-100ms latency.
•Thompson Sampling requires no manual tuning of alpha or ε but needs careful prior initialization. Use optimistic priors for new actions to accelerate cold start learning.
•Exploration hygiene in production: maintain a floor of 1 to 2 percent exploration even in mature systems to detect drift and maintain offline evaluation coverage.
•Model capacity trade-off: linear scorers enable microsecond inference and online updates; shallow neural scorers are more expressive but require cached embeddings and increase latency to 5 to 10ms.
•Non-stationarity handling: exponentially decay historical data or use recency-weighted updates (a minimal sketch follows this list). Spotify and Netflix retune priors daily and use faster update cadences (5 to 15 minute micro-batches) to track trends.
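The sketch below shows one way to apply recency weighting to LinUCB-style statistics via a per-update discount; the discount factor and the small ridge term are illustrative assumptions:

import numpy as np

def discounted_update(A, b, x, r, gamma=0.999, ridge=1e-3):
    # Decay old evidence before adding the new observation so recent data dominates.
    # The small ridge term keeps A well conditioned as old mass decays away.
    d = len(x)
    A = gamma * A + (1.0 - gamma) * ridge * np.eye(d) + np.outer(x, x)
    b = gamma * b + r * x
    return A, b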
📌 Examples
LinUCB production: maintain A (a d x d matrix, initialized to the identity) and b (a d-dimensional vector, initialized to zero) per action. On reward arrival, update A += x x^T and b += r x. Score a new context x as theta^T x + alpha * sqrt(x^T A^-1 x), where theta = A^-1 b is the ridge estimate of the weights. Typical d = 20 to 100, alpha = 0.1 to 2.0.
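A minimal runnable sketch of this per-action bookkeeping (the class name is illustrative; d and alpha follow the ranges above):

import numpy as np

class LinUCBArm:
    # Sufficient statistics for one action; A starts at the identity (ridge term).
    def __init__(self, d, alpha=1.0):
        self.A = np.eye(d)       # d x d design matrix; A^-1 acts as the estimate's covariance
        self.b = np.zeros(d)     # reward-weighted feature accumulator
        self.alpha = alpha       # exploration width

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                                   # ridge estimate of the weights
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)   # mean + alpha * uncertainty

    def update(self, x, r):
        self.A += np.outer(x, x)
        self.b += r * x

At serve time, score the current context with every arm and choose the argmax; the explicit inverse here is fine for d under 100, and can be replaced by an incremental update if it becomes a bottleneck.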
Thompson Sampling for linear rewards: maintain a Normal-Inverse-Gamma posterior over the weights and the noise variance. Each request, sample the noise variance σ² from the Inverse-Gamma marginal, then sample weights θ ~ Normal(μ, σ² Λ^-1), where Λ is the posterior precision. Score actions with θ^T x and choose the argmax. Update the posterior with the observed (x, a, r).
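A minimal sketch of that posterior bookkeeping for a single linear reward model (the class name and prior hyperparameters are illustrative assumptions):

import numpy as np

class NIGLinearTS:
    # Bayesian linear reward model with a Normal-Inverse-Gamma posterior.
    def __init__(self, d, a0=1.0, b0=1.0):
        self.Lambda = np.eye(d)   # posterior precision of the weights
        self.mu = np.zeros(d)     # posterior mean of the weights
        self.a = a0               # Inverse-Gamma shape for the noise variance
        self.b = b0               # Inverse-Gamma scale for the noise variance

    def sample_weights(self):
        sigma2 = self.b / np.random.gamma(self.a)              # sigma^2 ~ Inverse-Gamma(a, b)
        cov = sigma2 * np.linalg.inv(self.Lambda)
        return np.random.multivariate_normal(self.mu, cov)     # theta ~ Normal(mu, sigma^2 Lambda^-1)

    def update(self, x, r):
        Lambda_new = self.Lambda + np.outer(x, x)               # conjugate update for one (x, r) pair
        mu_new = np.linalg.solve(Lambda_new, self.Lambda @ self.mu + r * x)
        self.a += 0.5
        self.b += 0.5 * (r * r + self.mu @ self.Lambda @ self.mu - mu_new @ Lambda_new @ mu_new)
        self.Lambda, self.mu = Lambda_new, mu_new

Per request, call sample_weights() once (or once per action if you keep a separate model per action), score each candidate's features, and serve the argmax.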
Google uses Thompson Sampling for experiment allocation where the action set is small (5 to 20 variants) and the policy call must complete in single-digit milliseconds within a broader 50 to 200ms UI budget.