Stream Processing Architectures • Windowing & Time-based Aggregations
Stateful Aggregation: How Windowing Actually Works
The Architecture: Windowing is implemented using stateful operators partitioned by key. Messages are routed so all events for a specific key (user ID, card number, service name) go to the same processing task. Each task maintains a local state store holding partial aggregates per window per key.
Think of it like a distributed hash map where the key is "user123, window 12:00 to 12:05" and the value is the running aggregate (count of 47, sum of 1283.50). When a new event arrives, the engine looks up the relevant window state, updates it, and decides whether to emit results.
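A minimal sketch of that lookup-and-update path, using a plain in-memory HashMap in place of a real state backend; the class, field, and method names are illustrative, not taken from any particular engine:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: one processing task's local window state.
// Keys combine the partition key (e.g. user ID) with the window start;
// values hold the running partial aggregate for that window.
public class WindowStateSketch {

    // Running aggregate for one (key, window) pair.
    static class Aggregate {
        long count;
        double sum;
    }

    static final long WINDOW_SIZE_MS = 5 * 60 * 1000; // 5-minute tumbling windows

    // This task's slice of the "distributed hash map",
    // e.g. "user123|1700000400000" -> {count=47, sum=1283.50}
    final Map<String, Aggregate> state = new HashMap<>();

    void onEvent(String key, long eventTimeMs, double amount) {
        // Window assignment for a tumbling window is simple arithmetic on event time.
        long windowStart = eventTimeMs - (eventTimeMs % WINDOW_SIZE_MS);
        String stateKey = key + "|" + windowStart;

        // Look up the relevant window state and update it in place.
        Aggregate agg = state.computeIfAbsent(stateKey, k -> new Aggregate());
        agg.count += 1;
        agg.sum += amount;
        // A real engine would now ask the trigger whether to emit this partial result.
    }
}
```

A production engine keeps this map in a fault-tolerant state backend rather than on the heap, but the access pattern per event is the same: compute the window, look up the aggregate, update, maybe emit.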
State Management at Scale: At 1 million events per second with millions of active keys, state size becomes critical. Systems typically cap state to tens of gigabytes per task using a combination of in-memory caches and persistent backing stores such as RocksDB.
Checkpoint intervals matter enormously. With 30 to 60 second checkpoints, recovery from failure takes under 2 minutes. But longer checkpoint intervals or larger state mean slower recovery, potentially violating Service Level Agreements (SLAs).
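As an illustration, here is roughly what those targets look like as job configuration, assuming Apache Flink with the RocksDB state backend; the checkpoint storage path and exact values are placeholders, not recommendations:

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Keep large keyed state on disk via RocksDB, with incremental checkpoints
        // so only changed state files are uploaded at each interval.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // Checkpoint every 30 seconds; recovery then replays at most ~30s of events.
        env.enableCheckpointing(30_000);
        CheckpointConfig cc = env.getCheckpointConfig();
        cc.setMinPauseBetweenCheckpoints(10_000);  // leave headroom between checkpoints
        cc.setCheckpointTimeout(120_000);          // fail checkpoints that run too long
        cc.setCheckpointStorage("s3://my-bucket/checkpoints"); // hypothetical bucket
    }
}
```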
Update Semantics: When late events arrive after a window has closed, the engine can emit retractions or updates. The downstream store must handle these correctly. Some systems send a negative record to cancel the old aggregate, then a positive record with the corrected value. Others support upserts where the new value replaces the old.
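A small sketch of how a downstream store might apply those change records; the Op enum and field names are hypothetical, standing in for whatever changelog format the engine emits:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a downstream store applying changelog records
// produced when late events revise an already-emitted window result.
public class ChangelogApplySketch {

    enum Op { UPSERT, RETRACT, ADD }

    static class Change {
        Op op;
        String key;    // e.g. "user123|12:00-12:05"
        double value;  // aggregate value carried by the record
    }

    final Map<String, Double> store = new HashMap<>();

    void apply(Change c) {
        switch (c.op) {
            case UPSERT:   // new value simply replaces the old one for this window
                store.put(c.key, c.value);
                break;
            case RETRACT:  // compensating (negative) record cancels the old aggregate
                store.merge(c.key, -c.value, Double::sum);
                break;
            case ADD:      // corrected aggregate emitted after the retraction
                store.merge(c.key, c.value, Double::sum);
                break;
        }
    }
}
```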
1. Window Assigner: Given an event timestamp, determines which window or windows it belongs to. For a 5 minute tumbling window this is simple arithmetic on event time; for sliding windows, events map to multiple overlapping windows (see the sketch after this list).
2. Trigger Logic: Decides when to emit results. Event time triggers depend on watermarks: when the watermark passes window end plus allowed lateness, the trigger fires the final result. Early triggers may fire every 30 seconds for partial results.
3. Eviction Policy: Removes expired window state to bound memory usage. State retention must consider both window size and maximum allowed lateness, since late events may force reopening windows.
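The sketch below wires all three pieces together, assuming Flink's DataStream API; the PaymentEvent type and the placeholder source are invented for illustration:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.ContinuousEventTimeTrigger;

public class WindowedAggregationSketch {

    // Hypothetical event POJO; a real job would deserialize this from Kafka or similar.
    public static class PaymentEvent {
        public String cardNumber;
        public double amount;
        public long timestampMs;

        public PaymentEvent() {}
        public PaymentEvent(String cardNumber, double amount, long timestampMs) {
            this.cardNumber = cardNumber;
            this.amount = amount;
            this.timestampMs = timestampMs;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source so the sketch runs end to end.
        DataStream<PaymentEvent> events =
            env.fromElements(new PaymentEvent("4111-0000", 25.0, System.currentTimeMillis()));

        events
            // Watermarks drive event-time triggers; tolerate 2 minutes of disorder.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<PaymentEvent>forBoundedOutOfOrderness(Duration.ofMinutes(2))
                    .withTimestampAssigner((e, ts) -> e.timestampMs))
            .keyBy(e -> e.cardNumber)                                  // all events for a card go to one task
            .window(TumblingEventTimeWindows.of(Time.minutes(5)))      // 1. window assigner
            .trigger(ContinuousEventTimeTrigger.of(Time.seconds(30)))  // 2. early partial results every 30 s
            .allowedLateness(Time.minutes(10))                         // 3. keep state 10 min past window end
            .sum("amount")                                             // running sum per card per window
            .print();

        env.execute("windowed-aggregation-sketch");
    }
}
```

The allowed-lateness setting is what ties the trigger and eviction policy together: state for a window is only dropped once the watermark has passed window end plus that lateness, so late events within the bound can still update results.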
State Management Targets
State per task: tens of gigabytes
Checkpoint interval: 30 to 60 seconds
Recovery time: under 2 minutes
✓ In Practice: Netflix uses sliding and session windows to compute quality of service metrics at millions of events per second. The streaming engine maintains tens of terabytes of state across thousands of parallel tasks, with careful tuning of checkpoint intervals and state backend configuration to keep p99 latency between 2 and 5 seconds.
Production systems often pair streaming with batch recomputation. A nightly batch job recomputes the last 24 hours of aggregates from durable logs and compares them with streaming results. Differences beyond a small tolerance indicate configuration issues, giving strong correctness guarantees while retaining low-latency benefits.
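A sketch of such a reconciliation check, comparing per-window streaming aggregates against batch-recomputed values under an assumed relative tolerance; the method names and threshold are illustrative:

```java
import java.util.Map;

// Illustrative reconciliation: compare each window's streaming aggregate against
// the value recomputed by the nightly batch job over the durable log.
public class ReconciliationSketch {

    static final double TOLERANCE = 0.001; // allow 0.1% relative difference

    static boolean matches(double streaming, double batch) {
        double denom = Math.max(Math.abs(batch), 1e-9);
        return Math.abs(streaming - batch) / denom <= TOLERANCE;
    }

    static void reconcile(Map<String, Double> streamingAggs, Map<String, Double> batchAggs) {
        batchAggs.forEach((windowKey, batchValue) -> {
            double streamingValue = streamingAggs.getOrDefault(windowKey, 0.0);
            if (!matches(streamingValue, batchValue)) {
                // In production this would alert on-call or flag the window for replay.
                System.err.printf("Mismatch for %s: streaming=%.2f batch=%.2f%n",
                        windowKey, streamingValue, batchValue);
            }
        });
    }
}
```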
💡 Key Takeaways
✓ Stateful operators partition events by key so all events for a given key (user, card, service) go to the same task, which maintains local state stores holding partial aggregates per window
✓ Window assigners map event timestamps to one or more windows, triggers decide when to emit results based on watermarks, and eviction policies remove expired state to bound memory usage
✓ At 1 million events per second, systems cap state to tens of gigabytes per task with 30 to 60 second checkpoints, enabling recovery from failure in under 2 minutes to meet SLAs
✓ Late events can trigger retractions or updates to already closed windows, requiring downstream stores to support update semantics like upserts or compensating records
📌 Examples
1. A fraud detection system partitions payment events by card number, maintaining running counts and sums per 5 minute window, with state evicted 10 minutes after window close to handle late arrivals
2. Netflix maintains tens of terabytes of windowed state across thousands of tasks, computing quality metrics with p99 latency of 2 to 5 seconds at millions of events per second
3. Production systems run nightly batch recomputation of the last 24 hours to validate streaming results, catching configuration issues when differences exceed tolerance thresholds