Online Feature Store Architecture and Latency Budget
Real-time personalization requires fetching and computing features within a strict latency budget, forcing a careful split between precomputed, cached, and online computed features. The entire ranking flow from retrieval to final scored results must complete in 50 to 150 milliseconds at p95 for web search scale, leaving only 1 to 5 milliseconds for user and session feature fetches and 3 to 10 milliseconds for scoring 1,000 candidates.
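A minimal sketch of how that end-to-end budget might be tracked per request is shown below; the stage names, their individual allocations, and the `DeadlineClock` helper are illustrative assumptions, not a specific system's API.

```python
# Sketch: tracking a per-request latency budget so later stages can degrade
# gracefully (e.g. skip personalization) instead of missing the deadline.
import time

# p95 budget (milliseconds) per stage; the split is an assumed illustration of
# the 50-150 ms end-to-end figure, with feature fetch and scoring from the text.
BUDGET_MS = {
    "retrieval": 20,          # candidate generation (assumed share)
    "feature_fetch": 5,       # user/session multi-get from the online store
    "scoring": 10,            # ranking ~1,000 candidates
    "merge_and_render": 15,   # assumed remainder for blending and response assembly
}

class DeadlineClock:
    def __init__(self, total_budget_ms: float):
        self.start = time.monotonic()
        self.total_budget_ms = total_budget_ms

    def remaining_ms(self) -> float:
        return self.total_budget_ms - (time.monotonic() - self.start) * 1000.0

    def can_afford(self, stage: str) -> bool:
        return self.remaining_ms() >= BUDGET_MS[stage]

# Usage: a 100 ms end-to-end deadline; if the feature-fetch budget is already
# gone, fall back to generic (non-personalized) ranking.
clock = DeadlineClock(total_budget_ms=100)
if not clock.can_afford("feature_fetch"):
    pass  # fall back to generic ranking
```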
Item features are cached in memory on search servers to avoid per candidate network calls. For a catalog of 4.5 million listings like Airbnb, this means loading embeddings, quality scores, category tags, and price into local RAM, consuming several gigabytes per server. User and session features are stored in a low latency key value store such as Redis, DynamoDB, or a custom system with Time To Live (TTL) of days to weeks. These aggregates are keyed by user ID or session ID and fetched once per request. Typical fetch latency is 1 to 5 milliseconds at p95 for a single multi get operation retrieving 20 to 50 feature values.
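A minimal sketch of this per-request fetch path, assuming a Redis-style online store; the key naming scheme and the `ITEM_CACHE` layout are illustrative, not taken from any of the systems cited here.

```python
# Sketch: one multi-get for user/session aggregates, with item features served
# from a local in-memory cache loaded at server startup.
import json
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, socket_timeout=0.01)  # 10 ms hard cap (assumed)

# Item features live in local RAM: catalog_id -> feature dict (illustrative shape).
ITEM_CACHE: dict[str, dict] = {}  # e.g. {"listing:123": {"emb": [...], "quality": 0.87}}

def fetch_user_session_features(user_id: str, session_id: str) -> dict:
    """Single multi-get for the 20-50 precomputed aggregates keyed by user and
    session; returns an empty dict on timeout so ranking can fall back to
    generic scoring instead of blocking the request."""
    keys = [f"user:{user_id}:aggregates", f"session:{session_id}:aggregates"]
    try:
        values = r.mget(keys)  # one round trip, ~1-5 ms p95 per the text
    except redis.exceptions.TimeoutError:
        return {}
    features: dict = {}
    for raw in values:
        if raw is not None:
            features.update(json.loads(raw))
    return features
```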
User events stream into Kafka or Kinesis. A stream processor sessionizes events with a 30 minute inactivity gap and updates online aggregates such as category intensity, last clicked brand, price range histogram, and short-term embedding centroids. Updates use at least once semantics with event ID deduplication to prevent double counting. The processor writes to the online store every few seconds in microbatches to balance write Queries Per Second (QPS) and freshness. At 100 million Monthly Active Users (MAU) with 10 percent daily active and 5 sessions per day, the system handles roughly 60,000 writes per second at peak.
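The sessionization and microbatch write path could look roughly like the sketch below; the event schema, aggregate shape, and flush interval are assumptions, and in production this logic would run inside a stream processing framework rather than a single process.

```python
# Sketch: sessionize with a 30-minute inactivity gap, dedupe by event ID, and
# flush dirty aggregates to the online store in microbatches.
import time
from collections import defaultdict

SESSION_GAP_S = 30 * 60      # 30-minute inactivity gap
FLUSH_INTERVAL_S = 5         # microbatch writes every few seconds

seen_event_ids: set[str] = set()      # at-least-once dedup (bounded/TTL'd in practice)
sessions: dict[str, dict] = {}        # user_id -> {"last_ts", "aggregates"}
pending_writes: dict[str, dict] = {}  # user_id -> updated aggregates awaiting flush
last_flush = time.monotonic()

def handle_event(event: dict) -> None:
    """event = {"event_id", "user_id", "ts", "category", ...} (assumed schema)."""
    if event["event_id"] in seen_event_ids:
        return  # duplicate delivery: skip to avoid double counting
    seen_event_ids.add(event["event_id"])

    uid = event["user_id"]  # simplified: one active session per user
    sess = sessions.get(uid)
    if sess is None or event["ts"] - sess["last_ts"] > SESSION_GAP_S:
        sess = {"last_ts": event["ts"], "aggregates": defaultdict(float)}
        sessions[uid] = sess  # gap exceeded: start a new session
    sess["last_ts"] = event["ts"]

    # Update online aggregates, e.g. category click intensity.
    sess["aggregates"][f"cat:{event['category']}"] += 1.0
    pending_writes[uid] = dict(sess["aggregates"])

def maybe_flush(write_to_online_store) -> None:
    """Write all dirty aggregates in one microbatch to balance write QPS and freshness."""
    global last_flush
    if time.monotonic() - last_flush >= FLUSH_INTERVAL_S and pending_writes:
        write_to_online_store(pending_writes)  # e.g. a Redis pipeline or DynamoDB batch write
        pending_writes.clear()
        last_flush = time.monotonic()
```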
Personalization features like EmbClickSim are computed on the search server to avoid remote fanout. Given a candidate embedding and the user's recent click centroid from the feature store, the server computes the dot product in under 1 millisecond for 32 to 128 dimensional vectors. This keeps the critical path local. Airbnb reports sub millisecond similarity computation with in-memory vectors for thousands of candidates.
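A minimal sketch of computing an EmbClickSim-style feature on the search server; the dimensionality, dtype, and lack of normalization here are illustrative choices, not details from Airbnb's implementation.

```python
# Sketch: dot product between each cached candidate embedding and the user's
# recent-click centroid fetched from the online store.
import numpy as np

def emb_click_sim(candidate_embs: np.ndarray, click_centroid: np.ndarray) -> np.ndarray:
    """For ~1,000 candidates at 32-128 dims this single matrix-vector product
    runs well under 1 ms on the search server, keeping the critical path local."""
    return candidate_embs @ click_centroid  # shape: (num_candidates,)

# Usage: 1,000 cached 64-dim candidate embeddings against one session centroid.
rng = np.random.default_rng(0)
candidates = rng.standard_normal((1000, 64)).astype(np.float32)
centroid = rng.standard_normal(64).astype(np.float32)
scores = emb_click_sim(candidates, centroid)  # one similarity score per candidate
```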
The tradeoff is cost versus coverage. Storing aggregates for every user at 100 million MAU with 50 feature values per user means roughly 5 billion keys, requiring 200 to 500 GB of memory across the key value cluster depending on encoding. At cloud pricing this can reach thousands of dollars per month in infrastructure. Segment based or cohort based personalization lowers cost by grouping users into thousands of cohorts, but loses granularity and can reduce conversion lift by 30 to 50 percent compared to per user features.
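The storage estimate works out as in the back-of-envelope check below; the bytes-per-value range is an assumption covering key, encoded value, and store overhead, before replication.

```python
# Sketch: back-of-envelope sizing for per-user online features.
MAU = 100_000_000
VALUES_PER_USER = 50
BYTES_PER_VALUE_LOW, BYTES_PER_VALUE_HIGH = 40, 100  # assumed key + value + overhead

total_values = MAU * VALUES_PER_USER                  # 5 billion key-value pairs
low_gb = total_values * BYTES_PER_VALUE_LOW / 1e9     # ~200 GB
high_gb = total_values * BYTES_PER_VALUE_HIGH / 1e9   # ~500 GB
print(f"{total_values / 1e9:.0f}B values, {low_gb:.0f}-{high_gb:.0f} GB before replication")
```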
💡 Key Takeaways
•User and session features fetch from online key value store in 1 to 5 milliseconds p95 using multi get for 20 to 50 values keyed by user ID or session ID
•Item features are cached in memory on search servers to avoid network calls, consuming several GB per server for catalogs with millions of items
•Similarity features like EmbClickSim are computed on the search server via dot product in under 1 millisecond for 32 to 128 dimensional vectors on thousands of candidates
•Stream processors update online aggregates every few seconds in microbatches, handling 60,000 writes per second at peak for 100 million MAU with 10 percent daily active
•Storing per user features at 100 million MAU with 50 values means roughly 5 billion keys and 200 to 500 GB memory, versus cohort based personalization at lower cost but 30 to 50 percent reduced lift
📌 Examples
Airbnb caches 4.5 million listing embeddings in memory on search servers, avoiding remote calls for each candidate and keeping similarity computation under 1 millisecond
Google uses feature collocation to keep ranking latency under 30 milliseconds by storing all required features on the same machine as the ranking service, eliminating cross datacenter calls
Amazon employs DynamoDB with TTL for session features, fetching 30 aggregates in 2 milliseconds p50 and falling back to generic ranking if latency exceeds 10 milliseconds