
Production Architecture: Sampler, Parameter Store, and Streaming Feedback

Production bandit systems decouple three components to balance low-latency serving with real-time learning.

The sampler service sits in the request path and must select an arm within the API latency budget (typically adding <10ms). It reads per-arm statistics (impression count, success count, or posterior parameters) from a parameter store, executes the bandit algorithm (an epsilon-greedy choice, a UCB calculation, or a Thompson Sampling draw), and returns the selected arm. Per-arm state is minimal: for CTR with Thompson Sampling, just two integers (clicks and impressions).

The parameter store provides high-throughput point reads and atomic writes for per-arm counters. It must sustain thousands of queries per second with p99 latency under 5ms, since every recommendation request needs current statistics. Teams use Redis, DynamoDB, or specialized key-value stores, caching hot arms in the sampler's local memory to reduce tail latency. Atomic increment operations prevent race conditions when multiple requests update the same arm simultaneously.

The streaming feedback engine is fully asynchronous and decoupled from serving. User interactions (clicks, purchases, video starts) flow through Kafka or similar infrastructure. The feedback pipeline filters bot traffic, deduplicates events (the same user clicking twice), validates attribution (was this arm actually shown?), and then atomically increments the corresponding arm's statistics in the parameter store.

Expedia's architecture explicitly separated these three layers (sampler, parameter store, and streaming ingestion), which improved resilience: sampler availability doesn't depend on logging infrastructure. The separation also enables high throughput and fast iteration. The sampler can be deployed independently to test new algorithms, the feedback pipeline can be backfilled or replayed for debugging, and the parameter store can be scaled horizontally by sharding across arms (hash the arm ID to determine the shard). Udemy achieved strong revenue lifts with this pattern, and Scribd converged 30 bandits simultaneously (10 positions × 3 segments) within one week by handling feedback asynchronously.
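To make the selection step concrete, here is a minimal Beta-Bernoulli Thompson Sampling sketch. The dict-of-counters input is a stand-in for whatever shape the parameter store actually returns, and the uniform Beta(1, 1) prior (the +1 terms) is an assumption:

```python
import random

def thompson_select(arm_stats: dict[str, tuple[int, int]]) -> str:
    """Pick one arm via Beta-Bernoulli Thompson Sampling.

    arm_stats maps arm_id -> (clicks, impressions), read from the
    parameter store just before selection. A uniform Beta(1, 1) prior
    is assumed (the +1 terms); adjust for an informative prior.
    """
    best_arm, best_draw = None, -1.0
    for arm_id, (clicks, impressions) in arm_stats.items():
        # Posterior CTR for this arm: Beta(clicks + 1, non_clicks + 1).
        draw = random.betavariate(clicks + 1, impressions - clicks + 1)
        if draw > best_draw:
            best_arm, best_draw = arm_id, draw
    return best_arm

# Example: three hero-image arms with their current counters.
print(thompson_select({"img_a": (750, 5000), "img_b": (80, 400), "img_c": (5, 20)}))
```

Taking the top-k draws instead of the single argmax yields the slate variant described in the Udemy example below.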
💡 Key Takeaways
The sampler must add less than 10ms of latency to the request path. Reading parameters and executing Thompson Sampling or UCB is O(1) per arm with minimal computation. Cache hot-arm statistics in local memory to reduce parameter-store round trips (see the cache sketch after this list).
The parameter store needs high-throughput point reads and writes (thousands of QPS) with p99 latency under 5ms. Use Redis with atomic INCR/HINCRBY operations, DynamoDB with conditional updates, or a sharded key-value store. State per arm is tiny: just two integers for Beta-Bernoulli Thompson Sampling.
The streaming feedback pipeline is fully asynchronous and decoupled from serving. User events flow through Kafka, are filtered for bots and duplicates, validated for attribution (was this arm actually shown?), and then applied to the parameter store (see the consumer sketch after this list). This separation improves resilience and allows backfill or replay.
Scribd ran 30 simultaneous bandits (10 positions × 3 user segments, each bandit with 42 arms), all of which converged within one week on this decoupled architecture. Asynchronous feedback processing handled the high event volume without blocking recommendation serving.
Expedia's architecture explicitly separated the sampler service, parameter store, and streaming ingestion into three independent components. This allowed deploying algorithm changes to the sampler without touching feedback infrastructure, and scaling each layer independently.
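Hot-arm caching from the first takeaway can be as simple as a TTL'd in-process map in front of the store. The read_from_store stub and the two-second TTL below are illustrative assumptions, not tuned values:

```python
import time

CACHE_TTL_S = 2.0  # assumed staleness budget; bandits tolerate slightly stale counts
_cache: dict[str, tuple[float, tuple[int, int]]] = {}  # arm_id -> (fetched_at, (clicks, imps))

def read_from_store(arm_id: str) -> tuple[int, int]:
    """Placeholder for a parameter-store point read (Redis, DynamoDB, ...)."""
    return (0, 0)

def get_arm_stats(arm_id: str) -> tuple[int, int]:
    """Serve hot arms from local memory, falling back to the store."""
    now = time.monotonic()
    hit = _cache.get(arm_id)
    if hit and now - hit[0] < CACHE_TTL_S:
        return hit[1]                  # cache hit: no network round trip
    stats = read_from_store(arm_id)
    _cache[arm_id] = (now, stats)
    return stats
```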
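And a sketch of the feedback-validation loop. The event fields, the in-memory dedup set, and the always-true attribution stub are stand-ins; a production system would use a Kafka consumer, a TTL'd dedup store, and a join against impression logs:

```python
import redis  # assumed client; any store with atomic increments works

r = redis.Redis(decode_responses=True)
seen_event_ids: set[str] = set()  # stand-in for a TTL'd dedup store

def was_shown(user_id: str, arm_id: str) -> bool:
    """Placeholder attribution check; production joins against impression logs."""
    return True

def process_event(event: dict) -> None:
    """Validate one feedback event, then atomically update arm counters."""
    if event.get("is_bot"):                    # filter bot traffic
        return
    if event["event_id"] in seen_event_ids:    # drop duplicates (double clicks)
        return
    seen_event_ids.add(event["event_id"])
    if not was_shown(event["user_id"], event["arm_id"]):
        return                                 # failed attribution: arm never shown
    # HINCRBY is atomic, so concurrent consumers never lose an update.
    field = "clicks" if event["type"] == "click" else "impressions"
    r.hincrby(f"arm:{event['arm_id']}", field, 1)
```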
📌 Examples
Expedia's three-component design: (1) a parameter store for low-latency reads/writes of per-image clicks and impressions; (2) a sampler service executing Thompson Sampling to select the hero image; (3) a streaming feedback engine handling filtering, deduplication, and incremental updates. Each component is deployed independently.
Udemy parameter store: each recommendation unit stores an impression count and a success count (clicks + enrollments). The sampler reads these, draws from the Beta posteriors, picks the top 3 (a slate bandit), and returns the unit IDs. Kafka streams user interactions back, incrementing the counters within 15 minutes.
Redis implementation for 1000 arms: hash each arm ID to one of 10 Redis shards (100 arms per shard). Each arm key is a Redis hash with impressions and clicks fields (e.g. impressions=5000, clicks=750); hash fields, rather than a JSON blob, let feedback use HINCRBY for atomic increments. The sampler fetches all arms with pipelined HMGET calls across the shards (<3ms), computes samples locally, and returns the winner. The sketch below shows this read path.
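A sketch of that sharded read path, assuming 10 single-node Redis shards on consecutive ports (a placeholder for real service discovery):

```python
import zlib
import redis  # assumed client library (pip install redis)

NUM_SHARDS = 10
# Hypothetical shard endpoints; real deployments resolve these from config.
shards = [redis.Redis(port=6379 + i, decode_responses=True)
          for i in range(NUM_SHARDS)]

def shard_index(arm_id: str) -> int:
    # crc32 is stable across processes, unlike Python's built-in hash().
    return zlib.crc32(arm_id.encode()) % NUM_SHARDS

def fetch_stats(arm_ids: list[str]) -> dict[str, tuple[int, int]]:
    """Fetch (clicks, impressions) for every arm, one pipelined batch per shard."""
    pipes = [shard.pipeline() for shard in shards]
    routed = []  # (shard_index, arm_id) in per-shard enqueue order
    for arm_id in arm_ids:
        idx = shard_index(arm_id)
        pipes[idx].hmget(f"arm:{arm_id}", "clicks", "impressions")
        routed.append((idx, arm_id))
    # Each pipeline returns replies in enqueue order, so we can re-associate them.
    replies = [iter(pipe.execute()) for pipe in pipes]
    stats = {}
    for idx, arm_id in routed:
        clicks, impressions = next(replies[idx])
        stats[arm_id] = (int(clicks or 0), int(impressions or 0))
    return stats
```

Batching one HMGET pipeline per shard keeps the whole fetch to one round trip per shard, which is how 1000 arms stay within a few milliseconds.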