Contextual Bandits: LinUCB and Neural Linear Methods
FROM NON-CONTEXTUAL TO CONTEXTUAL
Non-contextual bandits treat all users the same: arm 3 is either globally good or not. Contextual bandits condition on user and item features. Different users may prefer different arms. This enables generalization: when a new user arrives, use their features to predict which arm they will prefer, even with zero observations for that specific user.
LINUCB: LINEAR CONTEXTUAL BANDITS
LinUCB maintains a linear model per arm. For arm a, the expected reward is θ_a · x, where x is the context vector (user features, time of day, device type). It also tracks uncertainty in θ_a through the matrix A_a = I + Σ x x^T accumulated from the contexts observed on that arm; A_a^{-1} plays the role of a posterior covariance. Each arm is scored by its upper confidence bound, score(a) = θ_a · x + α · sqrt(x^T A_a^{-1} x), so the bonus reflects the uncertainty of the prediction for this specific context rather than a global visit count.
Update: after observing reward r for context x on arm a, set A_a ← A_a + x x^T and b_a ← b_a + r x, then θ_a = A_a^{-1} b_a. Initializing A_a to the identity (ridge regularization) keeps it invertible from the first step.
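The selection and update steps above can be sketched as a minimal per-arm LinUCB in NumPy (names like LinUCBArm and the alpha default are illustrative, not from a library):

```python
import numpy as np

class LinUCBArm:
    """One arm's linear model: Gram matrix A, reward-weighted sum b."""
    def __init__(self, d, alpha=1.0):
        self.alpha = alpha
        self.A = np.eye(d)       # A_a = I + sum of x x^T (ridge-regularized)
        self.b = np.zeros(d)     # b_a = sum of r * x

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                     # theta_a = A_a^{-1} b_a
        # UCB: predicted reward + alpha * uncertainty for this context
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, r):
        self.A += np.outer(x, x)                   # A_a += x x^T
        self.b += r * x                            # b_a += r x

# Pick the arm with the highest UCB for context x, then update it.
arms = [LinUCBArm(d=4) for _ in range(3)]
x = np.array([1.0, 0.0, 0.5, 0.2])
chosen = max(range(3), key=lambda a: arms[a].score(x))
arms[chosen].update(x, r=1.0)
```

Inverting A_a on every score call is fine at this scale; production code would cache A_a^{-1} and refresh it incrementally.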
NEURAL LINEAR: DEEP FEATURES WITH LINEAR BANDIT
Training a full neural network online is unstable and expensive. Neural Linear freezes a pretrained deep feature extractor and runs a linear bandit on the embeddings. For example, use a pretrained neural network to encode user profiles and items into 128-dimensional vectors, then run Thompson Sampling or LinUCB on these embeddings. This combines deep representation power with stable online updates.
COMPUTATIONAL COST
With 50-100 features and 10 arms, maintain the per-arm inverse matrices offline or incrementally (e.g. via rank-one Sherman-Morrison updates) so serving never inverts anything. The mean term costs one dot product per arm: roughly 50-100 multiply-adds × 10 arms. The UCB bonus adds a matrix-vector product, O(d^2) per arm, but at d ≤ 100 the full scoring pass is still submillisecond. Feature extraction (a forward pass through the neural encoder) can take 2-5 ms but is often shared with other systems.
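A sketch of the serving-time split described above, with the inverse matrices treated as precomputed inputs (the dimensions and identity-matrix placeholders are illustrative):

```python
import numpy as np

d, n_arms = 64, 10
rng = np.random.default_rng(1)

# Assumed precomputed offline: per-arm inverses and coefficient vectors.
A_inv = [np.eye(d) for _ in range(n_arms)]
theta = [rng.standard_normal(d) for _ in range(n_arms)]
alpha = 1.0

def score_all(x):
    # Mean term: one length-d dot product per arm (~d multiply-adds).
    # Bonus term: one d x d matrix-vector product per arm, O(d^2).
    return np.array([theta[a] @ x + alpha * np.sqrt(x @ A_inv[a] @ x)
                     for a in range(n_arms)])

x = rng.standard_normal(d)   # context, already through feature extraction
scores = score_all(x)
best = int(np.argmax(scores))
```

At these sizes the whole loop is a few thousand floating-point operations, which is why the neural forward pass, not the bandit, dominates latency.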