Serving Flow: Assembly, Latency Budgets, and Caching
Feature Serving Flow: The path from a prediction request to an assembled feature vector. A single prediction may require features from multiple entities (user, item, context), each stored separately. Assembly must complete within milliseconds while handling failures gracefully.
Assembly Pattern
Prediction request arrives with entity IDs (user_123, item_456). The serving layer issues parallel lookups to the online store: one for user features, one for item features, one for user-item interaction history. Results are assembled into a single feature vector matching the model input schema. For recommendation systems, you might fetch one user vector and hundreds of item vectors for ranking—batching these lookups is critical for performance.
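The parallel-lookup-and-assemble step can be sketched as follows. This is a minimal illustration, not a real client: the dict-backed `ONLINE_STORE`, the key scheme, and the feature names are all hypothetical stand-ins for whatever online store (Redis, DynamoDB, etc.) and schema you actually use.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for the online store.
# Real deployments would issue batched network reads instead.
ONLINE_STORE = {
    ("user", "user_123"): {"user_age_bucket": 3, "user_ctr_7d": 0.12},
    ("item", "item_456"): {"item_price": 19.99, "item_ctr_7d": 0.08},
    ("user_item", "user_123:item_456"): {"clicks_30d": 2},
}

# Field order must match the model's input schema.
MODEL_SCHEMA = ["user_age_bucket", "user_ctr_7d",
                "item_price", "item_ctr_7d", "clicks_30d"]

def lookup(namespace, key):
    return ONLINE_STORE.get((namespace, key), {})

def assemble(user_id, item_id):
    # Issue the three lookups in parallel; each would be a
    # network round-trip against the online store in production.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [
            pool.submit(lookup, "user", user_id),
            pool.submit(lookup, "item", item_id),
            pool.submit(lookup, "user_item", f"{user_id}:{item_id}"),
        ]
        merged = {}
        for f in futures:
            merged.update(f.result())
    # Order the fields to match the model input schema.
    return [merged.get(name) for name in MODEL_SCHEMA]

vector = assemble("user_123", "item_456")
```

For the ranking case, the same pattern extends to one user lookup plus a single batched multi-get for all candidate items, rather than one request per item.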
Latency Budget Allocation
If total latency budget is 50ms and model inference takes 20ms, feature serving gets 30ms. Within that: network round-trip 5ms, online store lookup 10ms, assembly 5ms, buffer for variance 10ms. Monitor p99 latency at each step. When feature count grows, lookup latency grows—plan for this by pre-aggregating features or using hierarchical caching. A single slow feature can blow the entire budget.
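Per-step p99 monitoring requires timing each stage against its slice of the budget. A minimal sketch, assuming the 5/10/5 ms allocation above; the `StepTimer` class and step names are illustrative, not a real monitoring API:

```python
import time

# Per-step budgets in milliseconds, from the allocation above.
BUDGET_MS = {"network": 5, "lookup": 10, "assembly": 5}

class StepTimer:
    """Times named steps so violations can be attributed to a stage."""
    def __init__(self):
        self.durations = {}  # step name -> elapsed ms

    def measure(self, step, fn, *args):
        start = time.perf_counter()
        result = fn(*args)
        self.durations[step] = (time.perf_counter() - start) * 1000
        return result

    def over_budget(self):
        # Steps that exceeded their slice; in production, feed these
        # durations into a histogram and alert on p99, not single requests.
        return [s for s, ms in self.durations.items()
                if ms > BUDGET_MS.get(s, 0)]

timer = StepTimer()
features = timer.measure("lookup", lambda: {"user_ctr_7d": 0.12})
slow_steps = timer.over_budget()
```

Attributing latency per step is what makes "a single slow feature blew the budget" diagnosable rather than a mystery at the request level.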
Caching Strategies
Entity-level cache: Cache entire feature vectors per entity. Effective for popular entities (trending items, active users), but cache invalidation is complex when features update.
Request-level cache: Cache assembled vectors for repeated requests. Works well when the same user-item pairs are scored multiple times (refresh, scroll).
Precomputation: For predictable access patterns, pre-compute and store final feature vectors. Eliminates serving-time assembly but increases storage and staleness.
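A common way to sidestep complex invalidation for an entity-level cache is to bound staleness with a TTL instead of invalidating on every update. A minimal sketch; the `EntityCache` class is hypothetical, and real systems would typically use a bounded-size cache (LRU) or a shared cache tier:

```python
import time

class EntityCache:
    """Per-entity feature cache where staleness is bounded by a TTL
    rather than explicit invalidation on every feature update."""

    def __init__(self, ttl_s=60.0):
        self.ttl_s = ttl_s
        self._store = {}  # entity key -> (expires_at, features)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, features = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return features

    def put(self, key, features):
        self._store[key] = (time.monotonic() + self.ttl_s, features)

    def invalidate(self, key):
        # Explicit invalidation is still available for urgent updates.
        self._store.pop(key, None)
```

The TTL becomes an explicit staleness budget: a 60-second TTL means any cached feature may be up to 60 seconds behind the online store, which must be acceptable to the model.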
Failure Handling: Missing features are inevitable (new users, cold items). Define fallback values per feature: global mean, category default, or special "unknown" embedding. Never fail the entire request because one feature is missing.
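The per-feature fallback rule can be sketched as a small fill step applied after assembly. The fallback table and feature names here are hypothetical examples of the three fallback kinds described above:

```python
# Hypothetical per-feature fallbacks: global mean, special token,
# and a category default, respectively.
FALLBACKS = {
    "user_ctr_7d": 0.05,
    "item_category": "unknown",
    "item_price": 9.99,
}

def fill_missing(features, schema, fallbacks):
    """Replace missing or None features with their per-feature fallback.
    Never raises: the request proceeds even if a lookup failed entirely."""
    filled = {}
    for name in schema:
        value = features.get(name)
        filled[name] = value if value is not None else fallbacks.get(name)
    return filled

vector = fill_missing(
    {"user_ctr_7d": 0.12},  # cold item: item features missing
    ["user_ctr_7d", "item_category", "item_price"],
    FALLBACKS,
)
```

Logging a counter per substituted feature is worth adding in practice: a sudden spike in fallback rate usually signals an upstream pipeline failure, not a wave of new users.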