Hybrid Freshness Architecture: Batch, Nearline, and Request Time

Production ML systems at scale use a three-lane architecture to balance freshness, latency, and cost. The batch lane computes features daily or hourly from data warehouses, achieving high throughput and low cost per feature but accepting staleness measured in hours. The nearline lane uses stream processing to update features within seconds to minutes, handling moderate-velocity signals at 10x to 100x the cost of batch. The request-time lane computes cheap features like current time, device type, or simple lookups during inference, maximizing freshness but limited to sub-millisecond computations.

Uber's marketplace ML exemplifies this pattern. Low-volatility features like driver lifetime rating and user home location are batch computed daily. High-volatility features like nearby driver supply and current queue length are streamed continuously with a time-to-live (TTL) of 30 to 120 seconds. Request-time features include current local time, distance to pickup, and device network quality. The online feature assembler merges all three lanes, preferring fresher values and falling back to coarser aggregates if fresh values miss their TTL.

The cost difference is dramatic. Batch features stored in object storage cost around $0.02 per gigabyte (GB) per month. Nearline features in distributed in-memory stores like Redis or Cassandra cost $2 to $5 per GB per month (100x to 250x more expensive). Request-time computation adds CPU cost proportional to queries per second (QPS) but no storage cost. LinkedIn's Venice and Netflix's EVCache both report that hot working sets (hours to days of data) fit in memory at reasonable cost, while full history lives in cheap cold storage.

Capacity planning must account for burst factors. If 100k entities need features refreshed every minute on average, that is a baseline of roughly 1,667 updates per second. But hotspots (viral content, popular stores during the dinner rush) create 5x to 10x bursts. DoorDash handles this by sharding high-velocity counters across multiple keys and merging on read, spreading the write load. It also load-sheds non-critical updates during extreme spikes to protect freshness SLAs for critical features.
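A minimal sketch of how such an assembler can merge the three lanes is shown below. It is plain Python under stated assumptions, not Uber's actual implementation: the batch_store and nearline_store key-value clients, the TTL table, and the request-time feature names are all illustrative.

```python
import time

# Illustrative per-feature TTLs in seconds (assumption; real values are tuned per feature).
NEARLINE_TTL = {"nearby_driver_supply": 60, "queue_length": 120}

def assemble_features(entity_id, request_ctx, batch_store, nearline_store):
    """Merge batch, nearline, and request-time lanes for one entity.

    batch_store / nearline_store are hypothetical key-value clients whose
    get() returns {feature_name: (value, written_at_unix_seconds)} or None.
    """
    now = time.time()
    features = {}

    # Lane 1: batch features form the coarse fallback layer.
    batch = batch_store.get(entity_id) or {}
    for name, (value, _written_at) in batch.items():
        features[name] = value

    # Lane 2: nearline features override batch only while still within TTL.
    nearline = nearline_store.get(entity_id) or {}
    for name, (value, written_at) in nearline.items():
        ttl = NEARLINE_TTL.get(name, 60)
        if now - written_at <= ttl:
            features[name] = value  # fresh: prefer the streamed value
        # else: keep the stale-but-safe batch value already in `features`

    # Lane 3: request-time features are computed on the spot, never stored.
    features["hour_of_day"] = time.localtime(now).tm_hour
    features["device_type"] = request_ctx.get("device_type", "unknown")
    features["distance_to_pickup_km"] = request_ctx.get("distance_km", 0.0)

    return features
```

The ordering is the design choice that matters: batch values are written first and only overwritten when a nearline value is within its TTL, which yields the "fall back to the coarser aggregate" behavior described above.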
💡 Key Takeaways
Cost scales steeply with freshness requirements. Batch features cost $0.02 per GB per month in object storage versus $2 to $5 per GB per month for in-memory nearline stores, a 100x to 250x difference.
Uber allocates 5 to 15ms p99 for feature retrieval out of a 20 to 50ms total inference budget at 100k+ QPS. This forces most features to be precomputed and limits request-time computation to sub-millisecond operations.
Netflix proved through A/B tests that moving user embeddings from weekly to daily refresh improved engagement by only 0.3%, not justifying real-time infrastructure. Context features (device, time) computed at request time delivered a 2% lift at minimal cost.
Burst factors of 5x to 10x are common during peak events. Planning for average load causes freshness SLA violations when viral content or dinner rush hits. DoorDash provisions nearline capacity for p99 load, not average.
Fallback ordering prevents total failure. If nearline is stale, use the last known batch value. If batch is unavailable, use static defaults. LinkedIn's Feathr explicitly encodes this cascade in feature definitions.
Hot-key mitigation through sharding is essential. Instead of one counter per entity that gets thousands of updates per second, maintain 10 sharded counters and sum them on read, spreading the write load, as in the sketch below.
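A minimal sketch of the sharded-counter idea, assuming a Redis-like client with incrby and mget; the key naming scheme and shard count are illustrative, not any particular company's schema.

```python
import random

NUM_SHARDS = 10  # illustrative; choose based on observed hot-key write rates

def shard_key(entity_id: str, shard: int) -> str:
    return f"orders:{entity_id}:shard:{shard}"

def increment(redis, entity_id: str, amount: int = 1) -> None:
    # Writes spread across NUM_SHARDS keys, so no single key or partition
    # absorbs the full write rate of a hot entity.
    shard = random.randrange(NUM_SHARDS)
    redis.incrby(shard_key(entity_id, shard), amount)

def read_total(redis, entity_id: str) -> int:
    # Reads merge all shards; a single mget keeps the read path to one round trip.
    keys = [shard_key(entity_id, s) for s in range(NUM_SHARDS)]
    values = redis.mget(keys)
    return sum(int(v) for v in values if v is not None)
```

The shard count trades write spreading against read fan-out: 10 shards cuts the per-key write rate by roughly 10x at the cost of a 10-key read.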
📌 Examples
Uber marketplace predictions merge 70% batch features (driver stats, user history), 25% nearline features (supply density, surge signals with 60s TTL), and 5% request-time features (current distance, time of day).
DoorDash computes store busy state via stream processing as a 30-minute sliding window with a 5-minute watermark for late events (see the sliding-window sketch after these examples). During dinner peak, one popular store can generate 3,000 orders per hour, so they shard the counter 10 ways to avoid overwhelming a single partition.
LinkedIn Venice serves features with p99 read latency under 10ms by keeping hot working sets in memory. A feature like "profile views in the last 7 days" lives in nearline storage, while "total career history" is batch-loaded daily.
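A minimal sketch of the 30-minute sliding window with a 5-minute watermark, using plain Python with bucketed event-time counts rather than a real stream processor; the 1-minute bucket size and class names are illustrative assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 30 * 60    # busy-state lookback window
WATERMARK_SECONDS = 5 * 60  # how late an event may arrive and still count
BUCKET_SECONDS = 60         # 1-minute buckets (illustrative granularity)

class SlidingOrderCounter:
    """Event-time order counts per store over a 30-minute sliding window."""

    def __init__(self):
        # store_id -> {bucket_start_ts: count}
        self.buckets = defaultdict(lambda: defaultdict(int))

    def record(self, store_id: str, event_ts: float, now: float) -> None:
        # Drop events that arrive later than the watermark allows.
        if now - event_ts > WATERMARK_SECONDS:
            return
        bucket = int(event_ts // BUCKET_SECONDS) * BUCKET_SECONDS
        self.buckets[store_id][bucket] += 1

    def orders_last_window(self, store_id: str, now: float) -> int:
        cutoff = now - WINDOW_SECONDS
        # Sum only buckets inside the window; prune older ones to bound memory.
        live = {b: c for b, c in self.buckets[store_id].items() if b >= cutoff}
        self.buckets[store_id] = defaultdict(int, live)
        return sum(live.values())
```

Calling record on every order event and orders_last_window at feature-read time gives a busy-state signal whose lateness tolerance is bounded by the watermark; in production this logic would typically live in a stream processor rather than application code.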