Production Patterns: Two-Stage Retrieval and Prediction Stores

Large-scale recommendation and ranking systems universally adopt a two-stage architecture to balance quality, latency, and cost. The first stage runs offline in batch to generate candidate sets and compute embeddings at terabyte to petabyte scale. This might produce billions of predictions, scoring every item for every user or generating the top 1,000 candidates per user from a catalog of millions. These candidates are materialized into a prediction store, essentially a key-value database keyed by entity, such as user identifier, item identifier, or device identifier, with versioning and Time To Live (TTL) semantics.

The second stage happens online during the user request. A lightweight re-ranking model retrieves the precomputed candidates from the prediction store with single-digit-millisecond latency, then applies fresh contextual signals like current session behavior, device type, time of day, or recent interactions. This re-ranker operates under a strict latency budget, typically 20 to 80 milliseconds, leaving room in the overall page-render budget of 150 to 300 milliseconds p95 for network hops, feature fetches, and safety margins.

This pattern appears everywhere. YouTube generates video candidates offline using collaborative filtering and content embeddings, then re-ranks online with watch history and session signals. Pinterest precomputes board and pin recommendations in batch, then personalizes online with real-time engagement. The key insight: you get 80 to 90% of the quality from the offline stage at a fraction of the cost, then spend your expensive real-time budget only on the marginal value from fresh context.
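A minimal sketch of the two paths, assuming an in-memory stand-in for the prediction store; the names (`PredictionStore`, `rerank`), the `v42:user:` key scheme, and the 0.3 blend weight are all illustrative, not from the source:

```python
import time
from dataclasses import dataclass

# In-memory stand-in for a prediction store (in production this would be a
# key-value database with versioned keys and TTL support).
@dataclass
class Entry:
    candidates: list        # [(item_id, offline_score), ...] written by the batch job
    expires_at: float       # TTL enforcement

class PredictionStore:
    def __init__(self):
        self._data = {}

    def put(self, key: str, candidates: list, ttl_s: int) -> None:
        self._data[key] = Entry(candidates, time.time() + ttl_s)

    def get(self, key: str):
        entry = self._data.get(key)
        if entry is None or entry.expires_at < time.time():
            return None     # missing or expired -> caller falls back to a default strategy
        return entry.candidates

# --- Stage 1 (offline batch): materialize top-K candidates per user ---
store = PredictionStore()
store.put("v42:user:123", [("item_a", 0.91), ("item_b", 0.85), ("item_c", 0.80)],
          ttl_s=24 * 3600)

# --- Stage 2 (online): one key-value read, then a lightweight contextual re-rank ---
def rerank(user_id: str, session_boosts: dict, version: str = "v42", top_n: int = 2):
    candidates = store.get(f"{version}:user:{user_id}") or []
    # Blend the offline score with a fresh session signal; the 0.3 weight is illustrative.
    scored = [(item, score + 0.3 * session_boosts.get(item, 0.0)) for item, score in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_n]

print(rerank("123", session_boosts={"item_c": 0.5}))  # fresh session signal promotes item_c
```

The expensive work (scoring the full catalog) happens entirely in the batch stage; the online path is a single keyed read plus a cheap blend, which is what keeps it inside a tens-of-milliseconds budget.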
💡 Key Takeaways
Prediction stores materialize batch predictions as key-value pairs with versioning and TTL, enabling sub-10-millisecond lookups that avoid recomputation on the hot path
Two-stage architectures decouple expensive candidate generation, which can take hours and process terabytes, from lightweight re-ranking, which must complete in tens of milliseconds
LinkedIn feed ranking precomputes connection-based and content-based candidates offline, then re-ranks online with real-time engagement signals within a 50-millisecond budget at 100,000+ queries per second
Atomic version cutover prevents serving mixed predictions: the batch job writes to a new version, validates quality, then flips consumers to the new version, with rollback on anomalies (see the first sketch after this list)
Caching with short TTLs of 5 to 60 minutes blunts tail latency by avoiding repeated feature fetches and model scoring, but requires careful invalidation logic to prevent serving stale context (see the second sketch after this list)
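A minimal sketch of the cutover flow described above, assuming a plain dict as a stand-in for the key-value store; the function names, pointer key, and coverage threshold are illustrative:

```python
# A plain dict stands in for the prediction store; `active_version` stands in
# for a pointer key that every consumer reads before looking up candidates.
store: dict = {}
active_version = {"current": "v41", "previous": None}

def publish_new_version(new_version: str, batch_output: dict,
                        expected_entities: int, min_coverage: float = 0.99) -> bool:
    # 1. Write all candidates under the new version prefix; the old version stays live.
    for entity_key, candidates in batch_output.items():
        store[f"{new_version}:{entity_key}"] = candidates

    # 2. Validate before exposing the version. Coverage is the only check here;
    #    real pipelines would also inspect score distributions and spot-check rankings.
    if len(batch_output) / max(expected_entities, 1) < min_coverage:
        return False                 # consumers keep reading the old version

    # 3. Atomic cutover: a single pointer write flips every consumer at once.
    active_version["previous"] = active_version["current"]
    active_version["current"] = new_version
    return True

def rollback() -> None:
    # On downstream anomalies, re-point to the prior version; its keys are still present.
    active_version["current"] = active_version["previous"]

def lookup(entity_key: str):
    return store.get(f"{active_version['current']}:{entity_key}")

ok = publish_new_version("v42", {"user:123": [("item_a", 0.91)]}, expected_entities=1)
print(ok, lookup("user:123"))        # True [('item_a', 0.91)]
```

Because every read resolves the version through the pointer, no request sees a mix of old and new predictions, and rollback is just re-pointing.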
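And a similarly hedged sketch of the short-TTL result cache; `score_request`, the context-bucket key, and the 15-minute TTL are illustrative choices, not from the source:

```python
import time

_cache: dict = {}   # (user_id, context_bucket) -> (result, expires_at)
TTL_S = 15 * 60     # somewhere in the 5-to-60-minute range discussed above

def score_request(user_id: str, context_bucket: str) -> list:
    # Stand-in for the expensive path: feature fetches plus model scoring.
    return [("item_a", 0.91), ("item_b", 0.85)]

def cached_score(user_id: str, context_bucket: str) -> list:
    # Keying on a coarse context bucket (e.g. device type + daypart) keeps hit rates
    # high; keying on raw session state would defeat the cache entirely.
    key = (user_id, context_bucket)
    hit = _cache.get(key)
    if hit and hit[1] > time.time():
        return hit[0]                # cache hit: no feature fetch, no model scoring
    result = score_request(user_id, context_bucket)
    _cache[key] = (result, time.time() + TTL_S)
    return result

def invalidate(user_id: str, context_bucket: str) -> None:
    # Invalidate on events that change the context (e.g. a purchase); otherwise
    # the entry serves stale results until the TTL expires.
    _cache.pop((user_id, context_bucket), None)
```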
📌 Examples
Meta ads ranking: Batch generates billions of user-item pairs daily and stores them in TAO (a distributed cache); the online ranker retrieves 500 candidates per auction and scores them in under 20 milliseconds
Uber dispatch: Batch computes driver reliability scores and long-term demand patterns hourly; the online matching engine retrieves the precomputed scores and blends them with real-time driver location within 50 milliseconds p95
Spotify playlist generation: Offline jobs compute track embeddings and collaborative filtering over 100M+ tracks and materialize the top 2,000 candidates per user; the online layer personalizes with listening-session context in under 100 milliseconds