
How Watermarks Track Progress and Finalize Windows

The Mechanism: Stream processors maintain per-partition (or per-task) tracking of the maximum event time observed. Each executor processes events from its assigned partitions and records the highest event_timestamp it has seen. These local maxima are periodically aggregated, typically every few seconds, to compute a global watermark across the entire job. The standard policy is conservative: the global watermark equals the minimum of all local maxima minus the configured allowed lateness. Taking the minimum prevents data loss, because no partition has advanced past that point, so events older than the watermark are unlikely to still be in flight.

Window State Management: For a tumbling 1-minute window computing click counts, the engine maintains state keyed by (window_start, user_id). On each watermark update, the system evaluates which windows have end times before the current watermark. Those windows are considered complete: the engine emits their aggregated results downstream and deletes their state from the state store. This deletion is critical at high throughput. Processing 500,000 events per second with millions of unique keys means gigabytes of state accumulate every minute; without cleanup, memory exhausts in hours.
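A minimal Python sketch of this mechanism, assuming a simplified single-process model with in-memory dictionaries; the names (`WatermarkTracker`, `WindowStore`, `ALLOWED_LATENESS_SEC`) are illustrative, not any framework's API:

```python
from collections import defaultdict

ALLOWED_LATENESS_SEC = 300   # illustrative: 5 minutes of allowed lateness
WINDOW_SIZE_SEC = 60         # tumbling 1-minute windows

class WatermarkTracker:
    """Tracks the highest event time seen per partition and derives a global watermark."""
    def __init__(self, num_partitions):
        self.max_event_time = {p: float("-inf") for p in range(num_partitions)}

    def observe(self, partition, event_ts):
        # Record the highest event timestamp seen on this partition.
        if event_ts > self.max_event_time[partition]:
            self.max_event_time[partition] = event_ts

    def global_watermark(self):
        # Conservative policy: minimum of the local maxima, minus allowed lateness.
        return min(self.max_event_time.values()) - ALLOWED_LATENESS_SEC

class WindowStore:
    """Click counts keyed by (window_start, user_id); completed windows are emitted and deleted."""
    def __init__(self):
        self.counts = defaultdict(int)

    def add(self, event_ts, user_id):
        window_start = (event_ts // WINDOW_SIZE_SEC) * WINDOW_SIZE_SEC
        self.counts[(window_start, user_id)] += 1

    def finalize(self, watermark):
        # Emit and delete every window whose end time is before the watermark.
        closed = [k for k in self.counts if k[0] + WINDOW_SIZE_SEC <= watermark]
        return [(k, self.counts.pop(k)) for k in closed]
```

On each periodic watermark update, the job would call `tracker.global_watermark()` and pass the result to `store.finalize()`, emitting the returned aggregates downstream while the corresponding state is dropped.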
State Management Impact: without watermarks, window state grows unbounded; with a 5-minute watermark, it stays on the order of 50GB.
Multi-Partition Coordination: The challenge emerges with uneven partitions. Imagine a Kafka topic with 16 partitions processing user events. Partitions 0 through 14 are seeing events with timestamps around 13:45:00, but partition 15 has been idle for 20 minutes, its last event at 13:25:00. If you compute the global watermark as the minimum across all partitions, it stays stuck at 13:20:00 (13:25:00 minus the 5-minute allowed lateness). This blocks all window completion across the entire job, even though 15 of 16 partitions are making progress, and state accumulates indefinitely. Frameworks handle this with idle partition detection: if a partition receives no new records for a configurable timeout (typically 1 to 5 minutes), it is excluded from the minimum calculation, and the watermark can advance based on the active partitions. When the idle partition resumes, it rejoins the calculation and may temporarily slow watermark progress until it catches up.

Practical Configuration: Choosing allowed lateness requires understanding your data's delay distribution. If 95% of events arrive within 1 minute but 5% take up to 10 minutes due to mobile client buffering, you face a decision. Setting allowed lateness to 2 minutes gives you low latency (windows close quickly) but drops that 5%. Setting it to 15 minutes captures everything but triples memory usage and delays results.
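A hedged sketch of idle partition detection folded into the watermark calculation; the `IDLE_TIMEOUT_SEC` threshold and the two input dictionaries are assumptions for illustration, not a specific framework's configuration:

```python
import time

IDLE_TIMEOUT_SEC = 120  # illustrative: exclude partitions idle for more than 2 minutes

def global_watermark(max_event_time, last_arrival, allowed_lateness_sec, now=None):
    """Compute a global watermark, skipping partitions with no recent records.

    max_event_time: dict of partition -> highest event timestamp observed
    last_arrival:   dict of partition -> wall-clock time of the last record received
    """
    now = now if now is not None else time.time()
    active = [
        ts for p, ts in max_event_time.items()
        if now - last_arrival[p] <= IDLE_TIMEOUT_SEC
    ]
    if not active:
        return None  # every partition is idle; hold the watermark where it is
    # Conservative minimum over active partitions only, minus allowed lateness.
    return min(active) - allowed_lateness_sec
```

With partition 15 from the example excluded after its idle timeout, the minimum is taken over partitions 0 through 14, so the watermark can advance toward 13:45:00 instead of staying pinned at 13:20:00.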
⚠️ Common Pitfall: Reducing allowed lateness during a config change can instantly classify in-flight events as late, causing a sudden spike in dropped data and metric undercounts that look like a production incident.
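As one concrete way to express this trade-off, Spark Structured Streaming's `withWatermark` sets the delay that plays the role of allowed lateness here. A sketch using the built-in rate source as a stand-in for a real Kafka reader, with `user_id` synthesized purely for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("click-counts").getOrCreate()

# Illustrative source: the rate source provides a timestamp column we rename to
# event_time and a monotonically increasing value we map to a synthetic user_id.
events = (
    spark.readStream.format("rate").option("rowsPerSecond", 100).load()
    .withColumnRenamed("timestamp", "event_time")
    .withColumn("user_id", F.col("value") % 1000)
)

# A 2-minute watermark delay closes windows quickly but drops slower stragglers;
# raising it to "15 minutes" captures them at the cost of holding far more window state.
counts = (
    events
    .withWatermark("event_time", "2 minutes")
    .groupBy(F.window("event_time", "1 minute"), "user_id")
    .count()
)

query = counts.writeStream.outputMode("append").format("console").start()
```

In append mode, a window's counts are emitted only after the watermark passes its end, so increasing the delay both postpones results and grows retained state, which is exactly the latency-versus-completeness trade-off described above.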
💡 Key Takeaways
Stream processors track maximum event time per partition, then aggregate using the minimum across all partitions minus allowed lateness to compute a conservative global watermark
Windows are finalized and their state deleted when their end time falls before the current watermark, which is essential for bounding memory at 100,000+ events per second
Idle partition detection excludes partitions with no recent data from watermark calculation, preventing one slow partition from blocking progress across the entire streaming job
Allowed lateness configuration trades off completeness against latency and memory: 2 minutes captures 95% of events with low lag, while 15 minutes captures 99.9% but triples state size
📌 Examples
1. A fraud detection system processing 300,000 transactions per second uses 3-minute allowed lateness. This keeps state under 80GB across 32 workers while finalizing windows with p99 latency of 4 minutes, meeting their 5-minute SLA for alerting.
2. When an analytics pipeline increased allowed lateness from 5 to 20 minutes to capture more mobile events, their RocksDB state store grew from 120GB to 410GB per worker, requiring instance upgrades from 256GB to 512GB RAM.