Stream Processing Architectures • Windowing & Time-based Aggregations
Trade-offs: Choosing Window Types and Time Semantics
Latency vs Correctness: The fundamental trade-off in windowing is speed versus accuracy. Processing-time windows can deliver results in under a second because a window closes as soon as the wall clock advances past it, but network delays and client-side buffering distort the metrics. Event-time windows with watermarks give stable, reproducible metrics, but you must wait for late events.
The math is concrete. If you configure 10 minutes of allowed lateness on a 5-minute window, metrics for that window are not final until 10 minutes after the window ends, so end-to-end latency goes from potentially sub-second to 10+ minutes.
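The finalization arithmetic can be made explicit. This is a minimal, framework-agnostic sketch; `window_final_at` is a hypothetical helper, not an API from any particular stream processor.

```python
# Sketch: when is a window's aggregate final under event time
# with allowed lateness? (Illustrative helper, not a real framework API.)

def window_final_at(window_end_s: int, allowed_lateness_s: int) -> int:
    """A window's result is only final once the watermark passes
    window_end + allowed_lateness."""
    return window_end_s + allowed_lateness_s

# A 5-minute window ending at t=300s with 10 minutes (600s) of allowed
# lateness cannot be finalized before t=900s, i.e. 10 minutes after
# the window closed.
assert window_final_at(300, 600) == 900
```

The worst-case result latency is therefore bounded below by the allowed lateness, regardless of how fast the rest of the pipeline is.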
When to Choose Processing Time: Use processing time for operational dashboards and real-time alerting, where you need to know "what is happening right now." Monitoring-focused systems (as opposed to analytics-focused ones) benefit from the simplicity and speed. If your ingestion pipeline is stable and delays are minimal (p99 under 1 second), processing time works well.
When to Choose Event Time: Use event time for business-critical metrics, billing, compliance, feature engineering for machine learning, or anything that must be reproducible when replaying historical data. If events can be delayed by minutes or hours (mobile apps, IoT sensors, cross-datacenter replication), event time is essential.
Window Type Trade-offs: Tumbling windows are the simplest to reason about and the cheapest in compute and memory because each event belongs to exactly one window. But they miss patterns that span window boundaries: a spike starting at 12:04:55 and ending at 12:05:05 is split across two windows, potentially hiding it.
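The boundary-split problem falls directly out of how tumbling windows assign events. A minimal sketch (timestamps as seconds past 12:00:00; `tumbling_window` is an illustrative helper, not a framework function):

```python
# Sketch: tumbling window assignment. Each event maps to exactly one
# window via integer division on its timestamp.

def tumbling_window(ts: int, size: int) -> tuple[int, int]:
    """Return the [start, end) window containing timestamp ts."""
    start = (ts // size) * size
    return (start, start + size)

# A spike spanning 12:04:55-12:05:05 lands in two different 5-minute
# (300 s) windows, so neither window sees the full spike.
assert tumbling_window(295, 300) == (0, 300)    # 12:04:55 -> [12:00, 12:05)
assert tumbling_window(305, 300) == (300, 600)  # 12:05:05 -> [12:05, 12:10)
```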
Sliding windows capture trends smoothly but multiply costs. A 5-minute window sliding every 1 minute means each event belongs to 5 overlapping windows: compute increases 5x and state size grows proportionally. For a system processing 100,000 events per second, that is 500,000 aggregate updates per second instead of 100,000.
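The 5x factor is just window size divided by slide interval, which a short sketch makes concrete (`sliding_windows` is a hypothetical helper for illustration):

```python
# Sketch: sliding window membership. An event belongs to
# (window size / slide) overlapping windows.

def sliding_windows(ts: int, size: int, slide: int) -> list[tuple[int, int]]:
    """Return every [start, end) window containing timestamp ts."""
    last_start = (ts // slide) * slide  # latest window starting at or before ts
    starts = range(last_start - size + slide, last_start + slide, slide)
    return [(s, s + size) for s in starts if s <= ts < s + size]

# 5-minute (300 s) window, 1-minute (60 s) slide: every event is counted
# in 5 windows, so 100k events/s drive 500k aggregate updates/s.
assert len(sliding_windows(600, size=300, slide=60)) == 5
```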
Session windows are powerful for user behavior but make capacity planning harder. Session lengths vary wildly, and data skew can be extreme. If a bot or power user generates thousands of events per second in a long session, that key becomes a hotspot, causing backpressure and potential out-of-memory failures.
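The unpredictability comes from session windows being data-driven: events merge into one session whenever they fall within the inactivity gap, so a session has no fixed end. A minimal sketch of the merging rule (`sessionize` is an illustrative helper, not a framework API):

```python
# Sketch: session window assignment with an inactivity gap. Events
# closer than `gap` seconds merge into one session, so a continuously
# active key produces one unbounded, state-heavy session.

def sessionize(timestamps: list[int], gap: int) -> list[tuple[int, int]]:
    """Group sorted event timestamps into (start, end) sessions."""
    sessions: list[tuple[int, int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap:
            sessions[-1] = (sessions[-1][0], ts)  # extend the open session
        else:
            sessions.append((ts, ts))             # start a new session
    return sessions

# 30-minute (1800 s) gap: two bursts separated by >30 min of silence
# become two sessions; events within a burst merge.
events = [0, 60, 120, 5000, 5060]
assert sessionize(events, gap=1800) == [(0, 120), (5000, 5060)]
```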
Backfill and Correction Trade-off: If you want the ability to correct aggregates when events arrive hours late, you must retain window state longer and support updates in downstream stores. Some teams accept approximate or partial correction to avoid expensive state retention and compaction. The decision depends on your correctness requirements versus operational complexity and cost.
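Supporting correction downstream usually means emitting upserts keyed by window rather than append-only rows. A minimal sketch, assuming a downstream store with upsert semantics (a plain dict stands in for a real database here):

```python
# Sketch: correcting an aggregate when a late event arrives, assuming
# the downstream store supports upserts keyed by (window_start, key).
# `store` is a dict standing in for a real database.

store: dict[tuple[int, str], int] = {}

def apply_event(window_start: int, key: str, value: int) -> None:
    # Upsert: a late event re-emits the window's corrected aggregate
    # instead of appending a new row, so readers see one final total.
    store[(window_start, key)] = store.get((window_start, key), 0) + value

apply_event(0, "clicks", 10)  # on-time events
apply_event(0, "clicks", 5)   # late event corrects the same window
assert store[(0, "clicks")] == 15
```

The cost of this pattern is keeping the window's running state alive until the lateness bound expires, which is exactly the retention trade-off described above.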
Processing Time: sub-second latency, but metrics shift with ingestion delays
vs
Event Time: stable metrics, 10+ minute latency with allowed lateness
"Choose tumbling for simple aggregates and reports. Choose sliding when you need smooth trend detection despite 5x cost. Choose session windows when user behavior patterns matter more than predictable resource usage."
💡 Key Takeaways
✓Processing time delivers sub-second latency but metrics shift with ingestion delays; event time provides stable, reproducible metrics but adds 10+ minutes of latency when allowing for late arrivals
✓Tumbling windows are 1x cost (each event in one window), sliding windows are 5x cost (each event in multiple overlapping windows), making resource planning critical at scale
✓Session windows cause variable, unpredictable resource usage; a bot generating thousands of events per second in one long session can create a hotspot that causes backpressure and out-of-memory failures
✓Supporting late event correction requires retaining window state longer and implementing update semantics in downstream stores, trading operational complexity for correctness guarantees
📌 Examples
1. An operational dashboard uses processing-time tumbling windows for "requests per second" metrics, accepting that a 2-second ingestion delay might shift spikes by one window boundary
2. A billing system uses event-time sliding windows (5 min, slide 1 min) despite 5x compute cost because smooth trend detection catches usage anomalies that tumbling windows might split across boundaries
3. An e-commerce site uses session windows per user with a 30-minute inactivity timeout, but implements key splitting for power users who generate over 100 events per minute to avoid hotspots