End to End Architecture for Real Time Features at Scale
A production real time feature system combines ingestion, stream processing, state management, and low latency serving into a cohesive architecture that meets strict correctness and freshness Service Level Objectives (SLOs). The design must handle hundreds of thousands to millions of events per second, compute features over sliding windows with sub second updates, serve features with single digit millisecond latency, and maintain exactly once semantics for critical aggregates.
The ingestion layer uses a distributed log like Kafka or Pulsar to buffer events with durable replication. Events include event time timestamps, unique IDs for deduplication, and entity keys for routing. At peak load, systems handle 500K to 2M events per second, partitioned by entity to colocate related events. Each partition targets 1 MB per second or 1000 records per second to avoid hot partitions. Retention is typically 3 to 7 days to support backfills and debugging. For late arriving events, ingestion assigns server timestamps and flags events with excessive clock skew.
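The routing and event envelope described above can be sketched in a few lines. This is a minimal illustration, not a real Kafka producer: `NUM_PARTITIONS`, `MAX_CLOCK_SKEW_SEC`, and the field names are assumptions chosen to match the numbers in this section.

```python
import hashlib
import time
import uuid

NUM_PARTITIONS = 128       # sized so each partition stays well under its throughput budget
MAX_CLOCK_SKEW_SEC = 300   # events whose event time is further off get flagged

def partition_for(entity_key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Hash the entity key so all events for one entity land on one partition."""
    digest = hashlib.md5(entity_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def make_event(entity_key: str, payload: dict, event_time: float) -> dict:
    """Build an event envelope with a dedup ID, event time, and a skew flag."""
    server_time = time.time()
    return {
        "event_id": str(uuid.uuid4()),   # unique ID for downstream deduplication
        "entity_key": entity_key,        # routing key for partition assignment
        "event_time": event_time,        # producer-side event time
        "server_time": server_time,      # ingestion-assigned timestamp
        "skew_flagged": abs(server_time - event_time) > MAX_CLOCK_SKEW_SEC,
        "payload": payload,
    }
```

A stable hash (here MD5, rather than Python's per-process `hash()`) matters: the same entity must map to the same partition across producers and restarts so related events stay colocated.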
The stream processing layer reads from the log, groups events by key, assigns them to windows using event time, and computes aggregates. Popular frameworks include Apache Flink and Google Cloud Dataflow, both of which support event time semantics, watermarks, and state management. The processor maintains local state using RocksDB for large state or in memory hash maps for hot features. State is checkpointed every 1 to 5 minutes to durable storage for recovery. Watermarks lag behind the maximum observed event time by 1 to 5 minutes depending on measured delay distributions, and allowed lateness is typically 2 to 5 minutes for corrections. The processor emits updated features every hop interval, commonly 5 to 30 seconds.
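A toy version of the windowing logic makes the mechanics concrete. The sketch below implements hopping event time windows with a watermark and allowed lateness in plain Python; the constants (`WINDOW_SEC`, `HOP_SEC`, and so on) are illustrative, and a real Flink or Dataflow job would express the same logic through the framework's window and watermark APIs rather than by hand.

```python
from collections import defaultdict

WINDOW_SEC = 600            # 10 minute window
HOP_SEC = 30                # a new aggregate is emitted every 30 second hop
WATERMARK_LAG_SEC = 120     # watermark trails the max observed event time
ALLOWED_LATENESS_SEC = 180  # late events within this bound still update state

def windows_for(event_time: float) -> list:
    """Start times of all hopping windows that contain event_time."""
    starts = []
    start = int(event_time // HOP_SEC) * HOP_SEC
    while start > event_time - WINDOW_SEC:
        starts.append(start)
        start -= HOP_SEC
    return starts

class HoppingCounter:
    """Per key event counts over hopping event time windows."""

    def __init__(self):
        self.max_event_time = float("-inf")
        self.state = defaultdict(int)  # (key, window_start) -> count
        self.closed = {}               # finalized (key, window_start) -> count

    @property
    def watermark(self) -> float:
        return self.max_event_time - WATERMARK_LAG_SEC

    def process(self, key: str, event_time: float) -> bool:
        """Apply one event; return False if it arrived too late to count."""
        if event_time < self.watermark - ALLOWED_LATENESS_SEC:
            return False  # beyond allowed lateness: drop (or route to a side output)
        self.max_event_time = max(self.max_event_time, event_time)
        for start in windows_for(event_time):
            self.state[(key, start)] += 1
        self._close_expired()
        return True

    def _close_expired(self):
        """Finalize windows whose lateness horizon has passed the watermark."""
        for key_window in list(self.state):
            _, start = key_window
            if start + WINDOW_SEC + ALLOWED_LATENESS_SEC < self.watermark:
                self.closed[key_window] = self.state.pop(key_window)
```

Each event lands in `WINDOW_SEC / HOP_SEC` overlapping windows (20 here), which is exactly why per key state is bucketed rather than stored per event.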
Feature serving uses a low latency key value store like Redis, DynamoDB, or a custom system like Uber's Michelangelo Palette or Airbnb's Zipline. Features are written with composite keys encoding entity and window definition, with TTL matching window duration to automatically expire stale data. Reads target p99 latency under 5 milliseconds for online inference. For high throughput, features are cached at the application tier or precomputed and embedded in model serving infrastructure. Hot features for the top 1 percent of entities are often replicated to multiple regions or availability zones to handle read load.
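The composite key and TTL contract can be mimicked with a tiny in memory stand in. This is not Redis or DynamoDB, just a sketch of the interface under assumed names: keys encode entity and window definition, and stale features expire once the TTL elapses.

```python
import time

def feature_key(entity_id: str, window: str) -> str:
    """Composite key encoding entity and window, e.g. 'user42:clicks_10m'."""
    return f"{entity_id}:{window}"

class TTLFeatureStore:
    """Minimal in memory stand in for a TTL key value store."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def put(self, key: str, value, ttl_sec: float, now: float = None):
        now = time.time() if now is None else now
        self._data[key] = (value, now + ttl_sec)

    def get(self, key: str, now: float = None):
        now = time.time() if now is None else now
        item = self._data.get(key)
        if item is None or now >= item[1]:
            self._data.pop(key, None)  # lazily expire stale features on read
            return None
        return item[0]
```

Setting the TTL to the window duration means a feature that stops being refreshed simply disappears, so the serving layer never hands a model an aggregate older than its window.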
Capacity planning is critical. For 10 million active entities with 60 bucket sliding windows and 24 bytes per bucket, state is approximately 14 GB. At 500K events per second with a 300 millisecond p95 latency target, provision enough partitions that each handles under 5K events per second, which typically means 100 to 150 partitions. Each partition runs on a core with 2 to 4 GB memory. Feature writes to the key value store at 10K updates per second require provisioned capacity of 50K to 100K writes per second across shards to handle write amplification from hot keys and retries. Monitoring tracks end to end latency from event time to feature availability, typically targeting p95 under 500 milliseconds and p99 under 2 seconds for user facing features.
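The capacity arithmetic above is easy to double check in code. The helper names below are made up for illustration; the inputs are the figures from this paragraph.

```python
import math

def state_bytes(entities: int, buckets: int, bytes_per_bucket: int) -> int:
    """Total stream processor state for per entity sliding window buckets."""
    return entities * buckets * bytes_per_bucket

def partitions_needed(events_per_sec: int, max_per_partition: int) -> int:
    """Minimum partition count so no partition exceeds its per second budget."""
    return math.ceil(events_per_sec / max_per_partition)

# 10M entities x 60 buckets x 24 bytes = 14.4 GB of state
state_gb = state_bytes(10_000_000, 60, 24) / 1e9

# 500K events/sec at under 5K events/sec per partition -> at least 100 partitions
min_partitions = partitions_needed(500_000, 5_000)
```

The 100 figure is a floor; provisioning 100 to 150 partitions leaves headroom for skewed keys and traffic growth without repartitioning.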
💡 Key Takeaways
• Ingestion layer uses distributed logs (Kafka or Pulsar) with 3 to 7 day retention, partitioned to keep per partition load under 1 MB per second or 1000 records per second, handling 500K to 2M events per second at peak
• Stream processing with Flink or Dataflow maintains per key state in RocksDB or memory, checkpointed every 1 to 5 minutes, with watermark lag of 1 to 5 minutes and allowed lateness of 2 to 5 minutes for event time correctness
• Feature stores like Redis or DynamoDB serve features with p99 read latency under 5 milliseconds using composite keys (entity:window), TTL matching window duration, and provisioned capacity of 50K to 100K writes per second
• End to end latency targets p95 under 500 milliseconds and p99 under 2 seconds from event publication to feature availability, with ingestion adding 50 milliseconds, processing adding 200 to 500 milliseconds, and serving adding 5 to 10 milliseconds
• Capacity planning for 10 million entities with 60 bucket windows requires approximately 14 GB state, 100 to 150 partitions at 5K events per second each, and 2 to 4 GB memory per processing core
• Production examples include Uber Michelangelo computing per driver and rider features with 300 millisecond end to end latency, and Airbnb Zipline serving per listing features with sub 10 millisecond reads to ranking models
📌 Examples
Uber real time demand forecasting: Ingest 1M trip events/sec across 150 Kafka partitions. Flink computes per city zone rolling 10 min demand with 30 sec hops, 200ms p95 processing latency. Features written to Michelangelo Palette with 3ms p99 read latency, serving ETA models at 50K QPS.
Airbnb listing freshness features: Kafka ingests 200K search and click events/sec. Zipline computes per listing last hour clicks, views, and unique searchers over 60 minute hopping windows with 1 minute hops. State is 80 GB across 50 Flink workers. Features served from Cassandra with 5ms p99 latency to ranking models scoring 10K listings per search.
Amazon product trending score: Kinesis ingests 800K product interaction events/sec. Flink computes per product last 15 minute view count, cart adds, and purchases with 15 sec granularity. State uses approximate counts (HyperLogLog) to keep per product overhead at 500 bytes. Features cached in ElastiCache Redis with 2ms p99 latency, scoring 1M products per page for personalized recommendations.