Message Queues & Streaming • Kafka/Event Streaming Architecture
Stateful Stream Processing: Aggregations, Joins, and Checkpointing
Stream processing layers stateful operators on top of the event log to compute aggregations, joins, and enrichments in near real time. Unlike stateless filters or maps, stateful operators maintain local state (hash tables, time windows, join buffers) that must survive failures. The framework checkpoints this state periodically to durable storage; on crash, the operator restores from the last checkpoint and replays events from that offset, reconstructing its state deterministically.
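To ground this, here is a minimal sketch of a stateful per-user counter in Kafka Streams. The framework keeps each operator's state in a local store mirrored to a compacted changelog topic, and commit.interval.ms sets how often offsets and store flushes are committed, which is the checkpoint cadence described above. The topic name user-logins, the store name login-counts, and the broker address are illustrative assumptions, not anything prescribed here.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class LoginCounter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "login-counter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Commit (checkpoint) progress every 10 seconds: bounds replay after a crash.
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10_000);

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> logins = builder.stream("user-logins"); // keyed by user id

        // Stateful operator: per-key counts live in a local store ("login-counts")
        // that is mirrored to a compacted changelog topic for recovery.
        KTable<String, Long> counts = logins
            .groupByKey() // state is partitioned by key, one shard per input partition
            .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("login-counts"));

        counts.toStream().foreach((user, count) -> System.out.printf("%s -> %d%n", user, count));

        // On restart after a crash, the store is rebuilt from its changelog topic
        // before processing resumes from the last committed input offsets.
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```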
Event time semantics separate event timestamps (when the event occurred) from processing time (when the system sees it). Watermarks track progress through event time, signaling when the system believes all events up to a certain timestamp have arrived. This enables windowed aggregations (count logins per user per 5-minute window) that handle out-of-order delivery. Configure lateness tolerance (minutes to hours) based on expected source jitter: mobile apps reconnecting after offline periods may deliver events hours late, requiring wider lateness windows and more memory for buffering.
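Below is a hedged sketch of this event-time windowing using the Kafka Streams 3.x API: 5-minute tumbling windows per user with a 15-minute grace period for late arrivals. Kafka Streams does not emit explicit watermarks; it advances "stream time" (the maximum event timestamp seen so far) and finalizes a window once stream time passes the window end plus the grace period. Topic, store, and class names are assumptions.

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;
import org.apache.kafka.streams.state.WindowStore;

public class WindowedLoginCounts {
    public static StreamsBuilder topology() {
        StreamsBuilder builder = new StreamsBuilder();

        // Event time comes from each record's timestamp (producer-assigned, or a
        // custom TimestampExtractor), not from when the operator happens to read it.
        KStream<String, String> logins =
            builder.stream("user-logins", Consumed.with(Serdes.String(), Serdes.String()));

        // 5-minute tumbling windows, tolerating up to 15 minutes of out-of-order lateness.
        // Events arriving after window end + grace are dropped from this aggregation.
        KTable<Windowed<String>, Long> loginsPerWindow = logins
            .groupByKey()
            .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(5), Duration.ofMinutes(15)))
            .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("login-counts-5m"));

        loginsPerWindow.toStream().foreach((windowedUser, count) ->
            System.out.printf("%s @ %s -> %d%n",
                windowedUser.key(), windowedUser.window().startTime(), count));

        return builder;
    }
}
```

Widening the grace period directly extends how long each window's state must be retained, which is the memory cost the paragraph above warns about.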
The tradeoff is state size and recovery time. A stateful operator with 100 GB of in-memory state needs 100 GB of checkpoint storage and minutes to restore on failure. Throughput drops during recovery as the operator replays events. Partitioning state by key (sharding user aggregations across 100 operators) bounds per-operator state and enables horizontal scaling, but complicates cross-key operations like global aggregations or stream-to-stream joins across different partition keys.
In practice, keep operator state under 10 to 50 GB per instance for sub-minute recovery. Netflix runs stateful processors for real-time feature computation with strict lateness windows (5 to 15 minutes) to cap memory. LinkedIn materializes aggregated views by checkpointing to compacted Kafka topics, enabling recovery by replaying the changelog rather than recomputing from raw events.
💡 Key Takeaways
• Checkpointing frequency trades recovery time against overhead: checkpointing every 10,000 events limits replay to at most 10,000 events on crash but writes gigabytes to storage every few seconds. Checkpointing every 100,000 events reduces write amplification but increases recovery time.
• Event time and watermarks handle out-of-order delivery: mobile apps or cross-region replication introduce seconds to hours of event-time skew. Set lateness windows (5 to 60 minutes) based on the 99th-percentile source delay; events arriving later are dropped or sent to side outputs for reconciliation.
• State size limits scale: 100 GB of in-memory state requires 100 GB of checkpoint storage and 3 to 5 minutes of recovery time. Partition stateful operations by key and scale horizontally; keep per-operator state under 10 to 50 GB for fast recovery.
• Stream-to-stream joins amplify state: joining two streams requires buffering both sides within a time window. A 1-hour join window on 1,000 events per second per stream requires buffering 7.2 million events (2 streams × 1,000 events/s × 3,600 s), consuming tens of gigabytes. Use narrow windows (minutes) or pre-aggregate to reduce cardinality.
• Changelog topics enable faster recovery than full recomputation: write state changes (inserts, updates, deletes) to a compacted Kafka topic. On restart, replay the changelog instead of reprocessing raw input events from the beginning, reducing recovery from hours to minutes.
📌 Examples
LinkedIn uses changelog-backed state stores for user profile aggregations. Each operator writes state mutations to a compacted Kafka topic partitioned by the same key as its input. On restart, the operator replays the changelog (tens of gigabytes) in 2 to 3 minutes instead of recomputing from trillions of raw activity events.
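The changelog-backed pattern can be sketched with Kafka Streams (LinkedIn's actual stack is not shown here): every mutation to a materialized store is mirrored to a compacted changelog topic named <application.id>-<store>-changelog, and a standby replica keeps a warm copy on another instance so failover replays only the changelog tail. The topics, serdes, and the running-sum aggregation are illustrative assumptions.

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class ProfileAggregator {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "profile-aggregator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Keep one warm standby copy of each store so failover replays only the
        // changelog tail instead of rebuilding the entire store.
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("profile-updates", Consumed.with(Serdes.String(), Serdes.Long()))
            .groupByKey()
            .reduce(Long::sum, // running per-profile aggregate
                Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("profile-aggregates")
                    // Mutations are written to the compacted changelog topic
                    // "profile-aggregator-profile-aggregates-changelog"; restarts
                    // replay it rather than reprocessing raw activity events.
                    .withLoggingEnabled(Map.of("min.cleanable.dirty.ratio", "0.1")));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```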
Netflix computes real-time feature vectors for personalization with 10-minute tumbling windows and 15-minute lateness tolerance. Each operator maintains per-user state; cross-user aggregations are pre-computed in batch jobs to avoid global state in streaming.
A fraud detection pipeline at Uber joins transaction events with account history using a 5-minute window. The operator checkpoints every 50,000 events, maintaining 20 GB of join state per instance. Recovery time is 60 to 90 seconds, acceptable for their service-level objective of sub-5-minute detection latency.
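As an illustration of the join-state behavior above (a generic sketch using the Kafka Streams 3.x API, not Uber's pipeline), a windowed stream-stream join buffers both inputs in windowed state stores for the join interval, so the window size sets the state footprint directly. Topic names, value types, and the joiner are assumptions.

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.StreamJoined;

public class TransactionJoin {
    public static StreamsBuilder topology() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> transactions =
            builder.stream("transactions", Consumed.with(Serdes.String(), Serdes.String()));   // keyed by account id
        KStream<String, String> accountEvents =
            builder.stream("account-events", Consumed.with(Serdes.String(), Serdes.String())); // keyed by account id

        transactions
            // Both sides are buffered in windowed stores for the 5-minute interval;
            // doubling the window roughly doubles the join state per instance.
            .join(accountEvents,
                  (txn, account) -> txn + "|" + account, // combine matched pairs for downstream scoring
                  JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)),
                  StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()))
            .to("joined-transactions", Produced.with(Serdes.String(), Serdes.String()));

        return builder;
    }
}
```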