
Contextual Bandits: LinUCB and Neural Linear Methods

Contextual bandits extend the basic multi-armed bandit by incorporating features, or context, that describe the user, item, and environment at decision time. Instead of learning a single reward estimate per arm, you learn a function that maps context to reward for each arm. This enables generalization: a new user immediately benefits from patterns learned from similar users, and new arms can be scored using their features.

LinUCB is the workhorse algorithm. It maintains a per-arm ridge regression model with online updates: for each arm a, LinUCB keeps a design matrix of observed contexts and a vector of observed rewards, from which it computes an estimate of the weight vector and an inverse covariance matrix. At decision time, for context x, it scores arm a as the dot product of the weight estimate with x, plus an uncertainty bonus proportional to the square root of x transpose times the inverse covariance times x. This bonus is analogous to UCB's exploration term but adapted to the contextual setting. With 50 to 100 features and 10 arms, scoring is a few dozen floating point operations per arm, easily fitting into a 5 millisecond budget. Yahoo deployed LinUCB on its front page with features including user demographics, time of day, and article categories, achieving significant lift over non-contextual methods.

For richer representations, Neural Linear methods pair a neural network feature extractor with a Bayesian linear last layer. Train a deep model offline on historical data to learn embeddings, then freeze the network and use the embeddings as the context for an online linear bandit. This combines the representational power of deep learning with the fast, stable updates of linear models; research shows the approach is competitive with fully Bayesian deep networks while being far more stable and interpretable. LinkedIn uses similar techniques for feed ranking: a pretrained transformer encodes user and post features, and the final scoring layer is updated online via contextual Thompson Sampling over linear weights.
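To make the scoring and update rules concrete, here is a minimal NumPy sketch of the per-arm (disjoint) LinUCB described above. The class and parameter names (LinUCBArm, alpha, d) are illustrative, not taken from any particular deployment.

```python
import numpy as np

class LinUCBArm:
    """One ridge-regression model per arm, as in disjoint LinUCB."""
    def __init__(self, d: int, alpha: float = 1.0):
        self.alpha = alpha          # exploration strength
        self.A = np.eye(d)          # regularized design matrix: I + sum of x x^T
        self.b = np.zeros(d)        # reward-weighted contexts: sum of r * x

    def score(self, x: np.ndarray) -> float:
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                        # ridge estimate of the weights
        mean = theta @ x                              # predicted reward for context x
        bonus = self.alpha * np.sqrt(x @ A_inv @ x)   # uncertainty term sqrt(x^T A^{-1} x)
        return mean + bonus

    def update(self, x: np.ndarray, reward: float) -> None:
        self.A += np.outer(x, x)    # accumulate context outer products
        self.b += reward * x        # accumulate reward-weighted contexts

def choose_arm(arms, x):
    """Pick the arm with the highest upper confidence bound for context x."""
    return int(np.argmax([arm.score(x) for arm in arms]))

# Usage: 10 arms, 50-dimensional context, as in the latency estimate above.
d = 50
arms = [LinUCBArm(d, alpha=1.0) for _ in range(10)]
x = np.random.rand(d)
a = choose_arm(arms, x)
arms[a].update(x, reward=1.0)   # update only the pulled arm with the observed reward
```

Each decision is a handful of matrix-vector products per arm, which is why the approach fits comfortably inside tight serving budgets.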
💡 Key Takeaways
LinUCB maintains per arm linear model with weight vector and inverse covariance matrix, scoring context x via weight transpose x plus uncertainty term
With 50 to 100 features and 10 arms, scoring requires dozens of floating point operations per arm, fitting into 5 millisecond latency budget at high throughput
Yahoo Front Page deployment with LinUCB used user demographics, time of day, and article categories, achieving significant lift over non-contextual bandits
Neural Linear pattern freezes a pretrained deep feature extractor and runs an online linear bandit on its embeddings, combining representation power with fast, stable updates (see the sketch after this list)
Contextual methods enable generalization to new users and new arms immediately by leveraging features, avoiding pure cold start problem of non-contextual bandits
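Below is a minimal sketch of the Neural Linear pattern, assuming a frozen stand-in encoder (a fixed random projection playing the role of a pretrained network) and a per-arm Bayesian linear layer chosen via Thompson Sampling. All names (FrozenEncoder, BayesianLinearArm) and dimensions are illustrative.

```python
import numpy as np

class FrozenEncoder:
    """Placeholder for a pretrained, frozen deep feature extractor."""
    def __init__(self, raw_dim: int, emb_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(emb_dim, raw_dim)) / np.sqrt(raw_dim)

    def embed(self, raw_features: np.ndarray) -> np.ndarray:
        return np.tanh(self.W @ raw_features)   # fixed nonlinear embedding, never updated online

class BayesianLinearArm:
    """Gaussian posterior over linear weights (noise variance assumed known)."""
    def __init__(self, d: int, prior_var: float = 1.0):
        self.A = np.eye(d) / prior_var   # posterior precision
        self.b = np.zeros(d)             # reward-weighted embeddings

    def sample_score(self, z: np.ndarray) -> float:
        cov = np.linalg.inv(self.A)
        mean = cov @ self.b
        theta = np.random.multivariate_normal(mean, cov)  # Thompson sample of the weights
        return float(theta @ z)

    def update(self, z: np.ndarray, reward: float) -> None:
        self.A += np.outer(z, z)
        self.b += reward * z

# Usage: encode raw features once, then run the linear bandit on the embedding.
encoder = FrozenEncoder(raw_dim=300, emb_dim=128)
arms = [BayesianLinearArm(d=128) for _ in range(10)]
z = encoder.embed(np.random.rand(300))
chosen = int(np.argmax([arm.sample_score(z) for arm in arms]))
arms[chosen].update(z, reward=0.0)
```

Because only the small linear layer is updated online, serving stays cheap and stable, while the deep encoder can be retrained offline on historical data whenever needed.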
📌 Examples
Yahoo Today Module with LinUCB: 50 dimensional context including user age, gender, location, time of day, article topic, and click history
LinkedIn feed ranking with Neural Linear: Transformer encodes user profile and post content into 128 dimensional embeddings, Bayesian linear layer updated online
Spotify playlist recommendation: song embeddings from collaborative filtering model used as context, linear bandit learns per user listening pattern weights online