Three-Lane Architecture
Production ML systems at scale use a three-lane architecture to balance freshness, latency, and cost. The batch lane computes features daily or hourly from data warehouses, achieving high throughput and low cost per feature but accepting staleness measured in hours. The nearline lane uses stream processing to update features within seconds to minutes, handling moderate-velocity signals at 10 to 100x the cost of batch. The request-time lane computes cheap features like current time, device type, or simple lookups during inference, maximizing freshness but limited to sub-millisecond computations.
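The lane assignment above can be sketched as a small feature registry. This is a minimal illustration; the class names, feature names, and SLA values are assumptions, not taken from any real system:

```python
from dataclasses import dataclass
from enum import Enum

class Lane(Enum):
    BATCH = "batch"            # recomputed daily/hourly; staleness in hours
    NEARLINE = "nearline"      # stream-updated; staleness in seconds to minutes
    REQUEST_TIME = "request"   # computed at inference; zero staleness

@dataclass
class FeatureSpec:
    name: str
    lane: Lane
    max_staleness_s: float  # freshness SLA for this feature, in seconds

# Example registry mirroring the lanes described above (illustrative values)
REGISTRY = [
    FeatureSpec("driver_lifetime_rating", Lane.BATCH, 24 * 3600),
    FeatureSpec("nearby_driver_supply", Lane.NEARLINE, 60),
    FeatureSpec("request_timestamp", Lane.REQUEST_TIME, 0),
]
```

Keeping lane and SLA in one spec lets the serving layer pick a retrieval path per feature instead of hard-coding it.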
Uber Example
Low-volatility features like driver lifetime rating and user home location are batch-computed daily. High-volatility features like nearby driver supply and surge multiplier use nearline streaming with seconds of staleness. Request-time features include current GPS coordinates, request timestamp, and device properties computed at inference with zero staleness.
Fallback Cascades
When nearline features are stale beyond the SLA, the serving layer falls back to the most recent batch value and encodes a freshness penalty in the feature vector. This lets predictions continue with graceful degradation rather than failing hard. The model is trained with stale features occasionally included so it stays robust to fallback scenarios.
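A minimal sketch of this cascade, assuming stores are simple mappings of feature name to a `(value, updated_at)` pair; the function and store names are hypothetical:

```python
import time

# Last-resort static defaults per feature (illustrative)
STATIC_DEFAULTS = {"nearby_driver_supply": 0.0}

def get_feature(name, nearline_store, batch_store, sla_s, now=None):
    """Resolve one feature with graceful degradation:
    fresh nearline value -> latest batch value with a staleness penalty
    -> static default. The staleness is returned alongside the value so
    the model can see how degraded the feature is."""
    if now is None:
        now = time.time()
    rec = nearline_store.get(name)
    if rec is not None:
        value, updated_at = rec
        if now - updated_at <= sla_s:
            return value, 0.0  # fresh within SLA: no penalty
    rec = batch_store.get(name)
    if rec is not None:
        value, updated_at = rec
        return value, now - updated_at  # stale: expose freshness penalty
    return STATIC_DEFAULTS[name], float("inf")  # hard fallback
```

Returning `(value, staleness)` rather than the value alone is what allows training-time exposure to stale features to pay off at serving time.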
Cost Allocation
A typical breakdown puts batch features at $0.01 per million feature reads, nearline at $0.10 to $1.00, and request-time features at negligible marginal cost but high fixed infrastructure cost for low-latency compute. Engineering teams assign features to lanes based on freshness-sensitivity analysis: move a feature to nearline only if offline A/B tests show a measurable quality gain that justifies the 10 to 100x cost increase.
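The arithmetic behind that allocation decision can be made concrete. The cost figures come from the breakdown above (nearline uses the midpoint of the quoted range); the promotion threshold is an assumed value a team would calibrate, not a published number:

```python
# Per-lane marginal read cost in USD per million reads, from the breakdown above.
# Nearline uses the midpoint of the quoted $0.10-$1.00 range.
COST_PER_MILLION_READS = {"batch": 0.01, "nearline": 0.55}

def monthly_read_cost(lane, reads_per_day, days=30):
    """Marginal cost of serving one feature's reads from a lane for a month."""
    return COST_PER_MILLION_READS[lane] * reads_per_day / 1e6 * days

def should_promote_to_nearline(quality_gain_pct, min_gain_pct=1.0):
    """Promote batch -> nearline only when an offline A/B test shows lift
    above a team-chosen threshold (min_gain_pct=1.0 is an assumption)."""
    return quality_gain_pct >= min_gain_pct
```

At 10M reads per day, batch costs about $3/month versus roughly $165/month for nearline, which is why a sub-1% A/B lift usually does not clear the bar.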
✓ Cost scales steeply with freshness requirements. Batch features cost $0.02 per GB-month in object storage versus $2 to $5 per GB-month for in-memory nearline stores, a 100x to 250x difference.
✓ Uber allocates 5 to 15ms p99 for feature retrieval out of a 20 to 50ms total inference budget at 100k+ QPS. This forces most features to be precomputed and limits request-time computation to sub-millisecond operations.
✓ Netflix found through A/B tests that moving user embeddings from weekly to daily refresh improved engagement by only 0.3%, which did not justify real-time infrastructure. Context features (device, time) computed at request time delivered a 2% lift at minimal cost.
✓ Burst factors of 5x to 10x are common during peak events. Provisioning for average load causes freshness SLA violations when viral content or the dinner rush hits. DoorDash provisions nearline capacity for p99 load, not average.
✓ Fallback ordering prevents total failure. If nearline is stale, use the last known batch value. If batch is unavailable, use static defaults. LinkedIn's Feathr explicitly encodes this cascade in feature definitions.
✓ Hot-key mitigation through sharding is essential. Instead of one counter per entity receiving thousands of updates per second, maintain 10 sharded counters and sum them on read, spreading the write load.
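The sharded-counter pattern from the last item can be sketched in a few lines. This is an in-process illustration; in practice each shard would be a separate key in a distributed store:

```python
import random

class ShardedCounter:
    """Spread a hot key's write load across N shards; sum them on read.
    Each writer picks a random shard, so no single slot (or partition,
    in a distributed store) absorbs all the updates."""

    def __init__(self, num_shards=10):
        self.shards = [0] * num_shards

    def incr(self, amount=1):
        self.shards[random.randrange(len(self.shards))] += amount

    def value(self):
        # Reads are rarer than writes for hot counters, so paying
        # a fan-in sum on read is the cheap side of the trade.
        return sum(self.shards)
```

The trade-off is explicit: writes become contention-free at the cost of a slightly more expensive, slightly laggier read.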
1. Uber marketplace predictions merge 70% batch features (driver stats, user history), 25% nearline features (supply density, surge signals with 60s TTL), and 5% request-time features (current distance, time of day).
2. DoorDash's stream processing computes store busy state as a 30-minute sliding window with a 5-minute watermark. During dinner peak, one popular store can generate 3000 orders per hour. They shard the counter 10 ways to avoid overwhelming a single partition.
3. LinkedIn's Venice serves features with p99 read latency under 10ms by keeping hot working sets in memory. A feature for "profile views in last 7 days" lives in nearline storage, while "total career history" is batch-loaded daily.
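The sliding window with a watermark from example 2 can be sketched as follows. The window and watermark values mirror the example; the implementation (a deque of timestamps with front eviction, assuming roughly ordered arrival) is an assumption for illustration, not how DoorDash's stream processor actually works:

```python
from collections import deque

class SlidingWindowCounter:
    """Count events within a sliding window. Events are retained for an
    extra watermark interval past the window edge so that late-arriving
    events with in-window timestamps can still be added and counted."""

    def __init__(self, window_s=30 * 60, watermark_s=5 * 60):
        self.window_s = window_s
        self.watermark_s = watermark_s
        self.events = deque()  # event timestamps, assumed roughly ordered

    def add(self, ts):
        self.events.append(ts)

    def count(self, now):
        cutoff = now - self.window_s
        # Evict only events too old even for late arrivals (watermark slack)
        while self.events and self.events[0] < cutoff - self.watermark_s:
            self.events.popleft()
        return sum(1 for t in self.events if t >= cutoff)
```

For the one-hot-store case, each `add` would go to one of the 10 counter shards and `count` would sum across them, combining this with the sharding pattern above.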