What is Watermarking in Stream Processing?

Definition
A watermark is a moving threshold in event time that tells a streaming system when it can safely finalize results for a time window, balancing completeness against memory usage and latency.
The Core Problem:
When processing millions of events per second from mobile apps, IoT devices, or web browsers, events never arrive in perfect timestamp order. Mobile clients buffer when offline, network retries delay packets, and message brokers reorder for performance. If you compute aggregates like "clicks per minute" based on when your server processes events (processing time), your metrics will be completely wrong whenever delays occur.

Event Time vs Processing Time:
Every event has two timestamps. The event_timestamp is when something actually happened ("user clicked at 12:05:30"). The processing_timestamp is when your server sees it ("server processed at 12:10:45"). That 5 minute gap is normal.

Without watermarking, you face an impossible choice: close your 1 minute windows immediately and miss delayed events, or keep all windows open indefinitely and consume unlimited memory.

How Watermarks Work:
A watermark is calculated as "maximum event time seen so far minus allowed lateness." If your system has observed events up to 12:10:00 and you configure 5 minutes of allowed lateness, the watermark sits at 12:05:00. This tells the system: "I believe I have probably seen all events before 12:05:00."

Any window ending before that watermark can be finalized. Its results get emitted downstream, and its state gets deleted to free memory. Events arriving with timestamps older than the watermark are considered "late data" and require special handling.

✓ In Practice: Systems like Apache Flink, Google Cloud Dataflow, and Spark Structured Streaming all use watermarking to process billions of events daily while keeping memory bounded and latency predictable.

💡 Key Takeaways

✓Watermarks solve the fundamental problem that events arrive out of order in distributed systems, making processing time based windows produce incorrect results

✓A watermark represents the event time threshold below which the system believes it has seen most events, calculated as maximum observed event time minus configured allowed lateness

✓Windows ending before the current watermark can be safely finalized and their state deleted, preventing unbounded memory growth at scale

✓Late data refers to events whose timestamps fall before the current watermark, requiring explicit policies for dropping, correcting, or routing to side channels

📌 Interview Tips

1At 200,000 events per second with 5 minute allowed lateness, watermarking prevents keeping gigabytes of state by finalizing old windows while accepting 99% of events within the latency bound

2A mobile analytics pipeline processes user clicks with event timestamps from devices. With network delays averaging 30 seconds but sometimes reaching 10 minutes, a watermark of 15 minutes captures 99.9% of events while keeping memory under 50GB for millions of user sessions

← Back to Watermarking & Late Data Handling Overview