Stream Processing: Windows, State, and Watermarks
The Fundamental Challenge: Unlike batch processing, where you operate on a complete, bounded dataset, streaming works on unbounded data. Events keep arriving forever. How do you compute aggregates like "count of purchases in the last 5 minutes" when the data never ends?
The answer is windowing. You slice the infinite stream into finite chunks.
Three Types of Windows: Tumbling windows divide time into fixed, non-overlapping intervals. A 5-minute tumbling window creates buckets from 12:00 to 12:05, then 12:05 to 12:10, with no overlap. Each event belongs to exactly one window. This works well for metrics dashboards showing requests per minute.
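As a concrete sketch (plain Python, no particular stream-processing framework; `tumbling_window` is an illustrative name), assigning an event to its tumbling window is just aligning the timestamp down to a bucket boundary:

```python
from datetime import datetime, timedelta, timezone

def tumbling_window(event_time: datetime, size: timedelta = timedelta(minutes=5)):
    """Return the [start, end) tumbling window that contains event_time."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    offset = (event_time - epoch) % size       # position inside the current bucket
    start = event_time - offset                # align down to the bucket boundary
    return start, start + size

# An event at 12:03 belongs to exactly one window: [12:00, 12:05)
event = datetime(2024, 1, 1, 12, 3, tzinfo=timezone.utc)
print(tumbling_window(event))
```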
Sliding windows overlap. A 10-minute window sliding every 1 minute produces windows from 12:00 to 12:10, then 12:01 to 12:11. An event at 12:05 appears in 10 different windows. This smooths spikes and suits moving averages.
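The same alignment idea extends to sliding windows, except each event now maps to several of them, one per slide step that still covers it. A rough sketch under the same assumptions as above:

```python
from datetime import datetime, timedelta, timezone

def sliding_windows(event_time: datetime,
                    size: timedelta = timedelta(minutes=10),
                    slide: timedelta = timedelta(minutes=1)):
    """Return every [start, end) sliding window that contains event_time."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # Latest slide-aligned window start at or before the event.
    start = event_time - ((event_time - epoch) % slide)
    windows = []
    while start > event_time - size:           # every earlier start still covering it
        windows.append((start, start + size))
        start -= slide
    return windows

# An event at 12:05 lands in size/slide = 10 overlapping windows.
event = datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)
print(len(sliding_windows(event)))  # 10
```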
Session windows are data-driven, not time-based. They group events by user activity, closing the window after, say, 30 minutes of inactivity. If a user browses products from 2:00 to 2:15, then returns at 3:00, you get two separate session windows.
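Session windows can't be computed from a timestamp alone; the processor has to compare each event against the previous one for that user. A minimal sketch over a single user's time-sorted events, with an assumed 30-minute gap:

```python
from datetime import datetime, timedelta, timezone

def sessionize(event_times, gap=timedelta(minutes=30)):
    """Group one user's events into sessions split by >= gap of inactivity."""
    sessions, current = [], []
    for t in sorted(event_times):
        if current and t - current[-1] >= gap:
            sessions.append(current)           # inactivity gap: close the session
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

day = datetime(2024, 1, 1, tzinfo=timezone.utc)
events = [day.replace(hour=14, minute=m) for m in (0, 5, 15)]  # browsing 2:00-2:15
events.append(day.replace(hour=15))                            # returns at 3:00
print(len(sessionize(events)))  # 2 separate sessions
```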
The State Problem: To compute aggregates, the stream processor must remember intermediate results. For 100 million active users, each with a count of events in the current window, you might hold 5 GB of state in memory. This state must survive failures.
⚠️ Common Pitfall: State can blow up. If you track per-user state for a 7-day window, active memory usage might exceed 50 GB. Stream processors handle this by checkpointing state to fast storage such as distributed key-value stores, accepting a small latency penalty on recovery.
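A minimal sketch of that pattern, assuming an in-memory dict of per-user counts and a hypothetical `store` object with `put`/`get` (standing in for whatever durable key-value store you use):

```python
import pickle
import time

class CountingOperator:
    """Holds per-user counts in memory; checkpoints them to durable storage."""

    def __init__(self, store, checkpoint_every_s=10.0):
        self.store = store                    # assumed API: put(key, bytes), get(key)
        self.counts = {}                      # user_id -> event count, current window
        self.interval = checkpoint_every_s
        self._last = time.monotonic()

    def on_event(self, user_id):
        self.counts[user_id] = self.counts.get(user_id, 0) + 1
        if time.monotonic() - self._last >= self.interval:
            self.store.put("checkpoint", pickle.dumps(self.counts))
            self._last = time.monotonic()

    def recover(self):
        """On restart, reload the last snapshot instead of starting from zero."""
        snapshot = self.store.get("checkpoint")
        if snapshot is not None:
            self.counts = pickle.loads(snapshot)
```

In practice, production systems make these snapshots incremental and asynchronous so a multi-gigabyte checkpoint doesn't stall event processing.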
Handling Late and Out-of-Order Events: Network delays mean events can arrive late. An event timestamped 12:03 might reach your processor at 12:08. If you already closed the 12:00 to 12:05 window, what happens?
Watermarks solve this. A watermark is a logical timestamp that says "I believe I have seen all events up to time T." When the watermark passes 12:05, the system closes the 12:00 to 12:05 window and emits results. You also configure an allowed lateness, say 2 minutes: events arriving after the watermark but within that bound still get included, while events beyond it are dropped or sent to a side output for later reconciliation. The trade-off is explicit: longer lateness windows capture more events but delay final results. If you allow 5 minutes of lateness, your "real time" dashboard shows data that is up to 5 minutes stale.
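A simplified sketch of those mechanics for one window, using the common bounded-out-of-orderness heuristic (watermark = maximum event time seen minus a fixed bound; the class and names are illustrative, not any framework's API):

```python
from datetime import datetime, timedelta, timezone

MAX_OUT_OF_ORDER = timedelta(minutes=2)    # how far the watermark trails events
ALLOWED_LATENESS = timedelta(minutes=2)    # grace period after the window closes

class WindowedCounter:
    """Counts events in one [start, end) window; the watermark decides closure."""

    def __init__(self, start: datetime, end: datetime):
        self.start, self.end = start, end
        self.count = 0
        self.watermark = datetime.min.replace(tzinfo=timezone.utc)
        self.late_side_output = []           # events too late to include

    def on_event(self, event_time: datetime):
        # Advance the watermark: assume events are at most MAX_OUT_OF_ORDER late.
        self.watermark = max(self.watermark, event_time - MAX_OUT_OF_ORDER)
        if not (self.start <= event_time < self.end):
            return                           # belongs to some other window
        if self.watermark < self.end + ALLOWED_LATENESS:
            self.count += 1                  # on time, or late but within lateness
        else:
            self.late_side_output.append(event_time)  # beyond lateness: reconcile later

    def is_final(self) -> bool:
        # First results can be emitted once the watermark passes end; they only
        # become immutable after the allowed-lateness grace period as well.
        return self.watermark >= self.end + ALLOWED_LATENESS
```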
Impact of Late Events: ON TIME 95% → LATE < 2 min 4% → DROPPED 1%
💡 Key Takeaways
✓Tumbling windows create fixed, non-overlapping intervals (5-minute buckets); sliding windows overlap for smoothing; session windows group by activity with inactivity timeouts
✓Stream processors maintain state in memory or fast storage, checkpointing periodically; for 100M users with per-user counts, expect 5 to 50 GB of active state
✓Watermarks indicate "I have seen all events up to time T" and trigger window closure; configured lateness (typically 1 to 5 minutes) allows slightly late events before dropping
✓At scale, 95% of events arrive on time, 4% arrive within allowed lateness, and 1% arrive too late and must be dropped or reconciled later
✓Longer lateness windows (5 minutes vs 30 seconds) capture more events but delay results, creating an explicit trade-off between completeness and freshness
📌 Examples
1. A real-time metrics dashboard using 1-minute tumbling windows with 30-second allowed lateness completes each window 90 seconds after its end time
2. Session windows with a 30-minute inactivity timeout track user browsing: a user active 2:00 to 2:15 and again 3:00 to 3:10 generates two separate sessions
3. A stream processor holding per-user state for 50M active users with 5-minute windows requires checkpointing 8 GB to fast storage every 10 seconds
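A back-of-envelope check on example 3 (the ~160 bytes/user figure is an assumption implied by the stated numbers, not a measurement):

```python
# Back-of-envelope sizing for example 3 (assumed figures, not measurements).
users = 50_000_000
bytes_per_user = 160                       # key + count + window metadata (assumed)
state_gb = users * bytes_per_user / 1e9
print(f"active state: ~{state_gb:.0f} GB")                              # ~8 GB

checkpoint_interval_s = 10
print(f"snapshot rate: ~{state_gb / checkpoint_interval_s:.1f} GB/s")   # ~0.8 GB/s
```

Sustaining nearly a gigabyte per second of snapshot writes is exactly why incremental checkpoints matter at this scale.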