Contextual Bandits: LinUCB and Neural Linear Methods
FROM NON-CONTEXTUAL TO CONTEXTUAL
Non-contextual bandits treat all users the same: arm 3 is either globally good or not. Contextual bandits condition on user and item features. Different users may prefer different arms. This enables generalization: when a new user arrives, use their features to predict which arm they will prefer, even with zero observations for that specific user.
LINUCB: LINEAR CONTEXTUAL BANDITS
LinUCB maintains a linear model per arm. For arm a, the expected reward is θ_a · x, where x is the context vector (user features, time of day, device type). It also tracks uncertainty in θ_a through the matrix A_a = I + Σ x x^T accumulated from the contexts observed on that arm; A_a^{-1} plays the role of a posterior covariance. Each arm is scored by its upper confidence bound, score(a) = θ_a · x + α · sqrt(x^T A_a^{-1} x), so the bonus reflects the uncertainty of the prediction for this specific context rather than a global visit count.
Update: after observing reward r for context x on arm a, set A_a ← A_a + x x^T and b_a ← b_a + r x, then θ_a = A_a^{-1} b_a. Initializing A_a to the identity (ridge regularization) keeps it invertible from the first step.
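The selection and update steps above can be sketched as a minimal per-arm LinUCB in NumPy (names like LinUCBArm and the alpha default are illustrative, not from a library):

```python
import numpy as np

class LinUCBArm:
    """One arm's linear model: Gram matrix A, reward-weighted sum b."""
    def __init__(self, d, alpha=1.0):
        self.alpha = alpha
        self.A = np.eye(d)       # A_a = I + sum of x x^T (ridge-regularized)
        self.b = np.zeros(d)     # b_a = sum of r * x

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                     # theta_a = A_a^{-1} b_a
        # UCB: predicted reward + alpha * uncertainty for this context
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, r):
        self.A += np.outer(x, x)                   # A_a += x x^T
        self.b += r * x                            # b_a += r x

# Pick the arm with the highest UCB for context x, then update it.
arms = [LinUCBArm(d=4) for _ in range(3)]
x = np.array([1.0, 0.0, 0.5, 0.2])
chosen = max(range(3), key=lambda a: arms[a].score(x))
arms[chosen].update(x, r=1.0)
```

Inverting A_a on every score call is fine at this scale; production code would cache A_a^{-1} and refresh it incrementally.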
NEURAL LINEAR: DEEP FEATURES WITH LINEAR BANDIT
Training a full neural network online is unstable and expensive. Neural Linear freezes a pretrained deep feature extractor and runs a linear bandit on the embeddings. For example, use a pretrained neural network to encode user profiles and items into 128-dimensional vectors, then run Thompson Sampling or LinUCB on these embeddings. This combines deep representation power with stable online updates.
COMPUTATIONAL COST
With 50-100 features and 10 arms, maintain the per-arm inverse matrices offline or incrementally (e.g. via rank-one Sherman-Morrison updates) so serving never inverts anything. The mean term costs one dot product per arm: roughly 50-100 multiply-adds × 10 arms. The UCB bonus adds a matrix-vector product, O(d^2) per arm, but at d ≤ 100 the full scoring pass is still submillisecond. Feature extraction (a forward pass through the neural encoder) can take 2-5 ms but is often shared with other systems.
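A sketch of the serving-time split described above, with the inverse matrices treated as precomputed inputs (the dimensions and identity-matrix placeholders are illustrative):

```python
import numpy as np

d, n_arms = 64, 10
rng = np.random.default_rng(1)

# Assumed precomputed offline: per-arm inverses and coefficient vectors.
A_inv = [np.eye(d) for _ in range(n_arms)]
theta = [rng.standard_normal(d) for _ in range(n_arms)]
alpha = 1.0

def score_all(x):
    # Mean term: one length-d dot product per arm (~d multiply-adds).
    # Bonus term: one d x d matrix-vector product per arm, O(d^2).
    return np.array([theta[a] @ x + alpha * np.sqrt(x @ A_inv[a] @ x)
                     for a in range(n_arms)])

x = rng.standard_normal(d)   # context, already through feature extraction
scores = score_all(x)
best = int(np.argmax(scores))
```

At these sizes the whole loop is a few thousand floating-point operations, which is why the neural forward pass, not the bandit, dominates latency.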