
Watermark Configuration Trade-offs and When to Choose Alternatives

The Fundamental Trade-off: Watermarking trades latency against completeness and state size. Every configuration choice moves you along this spectrum, and there is no universally correct answer.
Short watermark (1 to 2 min): fast results, smaller state, drops 3 to 5% of mobile events, p99 window latency under 3 minutes.

Long watermark (15 to 20 min): captures 99.9% of events, 3x larger state, p99 window latency 18 to 22 minutes.
Longer allowed lateness increases memory and storage roughly linearly: if your state is 50GB with a 5-minute watermark, expect about 150GB at 15 minutes, assuming a uniform event distribution. At 200,000 events per second across millions of keys, this can push you from memory-based state stores to disk-backed RocksDB, degrading trigger latency from 50ms to 200ms or more. Shorter watermarks reduce cost and latency but drop more late data. If your data shows 95% of events arriving within 2 minutes and 99% within 10 minutes, a 3-minute watermark captures most events with minimal lag; a 12-minute watermark adds correctness but triples resource usage for only about 4% more events.

Decision Framework by Use Case: Choose watermarking with short lateness (1 to 3 minutes) when you need real-time user-facing metrics, can tolerate small undercounts, and want predictable low latency. Examples include product dashboards, operational monitoring, and real-time personalization features. The 2 to 5% of dropped events rarely justify the cost and complexity of longer windows.

Choose watermarking with medium lateness (5 to 10 minutes) for internal analytics, fraud detection, or compliance reporting where accuracy matters more than freshness. This captures 98 to 99% of events at reasonable cost. Use side outputs to preserve the remaining late data for nightly reconciliation (a sketch of this pattern follows below).

Choose watermarking with long lateness (15+ minutes), or allowed lateness after window close, only when contractual or regulatory requirements demand near-perfect accuracy in real time, such as financial settlement or SLA billing. Accept significantly higher infrastructure cost and operational complexity.

When NOT to Use Watermarking: If your use case cares only about arrival order and processing time, skip event time and watermarks entirely. Simple processing-time windows are faster, simpler, and use less state. This works for operational metrics like "requests per second as seen by the server" or rate limiting.

For use cases where you can tolerate hours of delay and need exact counts, micro-batch or nightly batch processing is often cheaper and simpler than streaming with watermarks. Many organizations use a hybrid: streaming with aggressive watermarks for product features, then batch jobs overnight for finance, compliance, and ML training datasets.

If events from a single source can arrive weeks late due to offline sync or regulatory data delivery, watermarking breaks down: you cannot keep state for weeks. Instead, use a lambda architecture pattern: streaming for recent data with best-effort accuracy, and batch reprocessing for historical corrections.
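Spark Structured Streaming has no built-in late-data side output; rows older than the watermark are silently dropped. One common approximation of the side-output pattern is to fork the stream on a processing-time threshold that mirrors the watermark delay. This sketch continues the `events` stream from the earlier example; the paths are hypothetical, and the threshold is a wall-clock proxy for the engine's internal watermark, not the watermark itself:

```python
# Fork the stream: rows that would (approximately) miss the 3-minute watermark
# go to a reconciliation sink instead of being silently dropped. This compares
# event time against wall-clock time as a proxy for the true watermark.
late_threshold = F.current_timestamp() - F.expr("INTERVAL 3 MINUTES")

on_time = events.filter(F.col("event_time") >= late_threshold)
late_events = events.filter(F.col("event_time") < late_threshold)

# Aggregate the on-time branch exactly as before; persist the late branch
# for the nightly reconciliation job. Both paths are hypothetical.
late_query = (
    late_events.writeStream
    .format("parquet")
    .option("path", "/data/late-events")               # hypothetical path
    .option("checkpointLocation", "/chk/late-events")  # hypothetical path
    .start()
)
```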
"Watermarking is the right tool when event time correctness matters and delays are bounded to minutes or tens of minutes. Beyond that, architecture matters more than configuration."
Multi-Stream Joins Add Complexity: Watermarking becomes significantly harder with joins. Consider joining a stream of payment events with a stream of user risk scores for fraud detection. Both streams have independent delays and watermarks. If you use the minimum watermark across both streams to decide when to drop unmatched join state, the slower stream blocks progress: if payment events are delayed by 5 minutes but risk scores by only 2, you keep join state for 5 minutes even though most risk scores arrive within 2. At 50,000 joins per second, this can balloon state by 2.5x compared to single-stream aggregation. Some teams handle this by setting independent watermarks per stream and accepting that some legitimate matches are missed when events arrive beyond their respective watermarks. Others buffer high-value joins in a separate store and reconcile offline. There is no clean solution: multi-stream joins with out-of-order data always involve compromise. A concrete sketch follows.
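To make the trade-off concrete, here is a sketch of a payments-to-risk-scores join with per-stream watermarks in PySpark. The rate sources stand in for real Kafka topics and all column names are illustrative assumptions; `spark.sql.streaming.multipleWatermarkPolicy` is Spark's actual knob for choosing between the "slowest stream wins" default and faster progress at the cost of missed late matches:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("payments-risk-join").getOrCreate()

# Default is "min": the query watermark trails the slowest stream, so payment
# delays force longer state retention. "max" lets the faster stream drive
# progress, dropping matches that arrive behind it.
spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "max")

# Toy rate sources standing in for real payment and risk-score topics.
payments = (
    spark.readStream.format("rate").option("rowsPerSecond", 100).load()
    .select(
        F.col("timestamp").alias("pay_time"),
        (F.col("value") % 1000).alias("user_id"),
    )
    .withWatermark("pay_time", "5 minutes")    # payments up to ~5 min late
    .alias("p")
)

scores = (
    spark.readStream.format("rate").option("rowsPerSecond", 100).load()
    .select(
        F.col("timestamp").alias("score_time"),
        (F.col("value") % 1000).alias("user_id"),
        ((F.col("value") % 10) / 10.0).alias("risk_score"),
    )
    .withWatermark("score_time", "2 minutes")  # scores up to ~2 min late
    .alias("s")
)

# The time-range condition bounds how long unmatched rows are buffered.
joined = payments.join(
    scores,
    F.expr(
        """
        p.user_id = s.user_id AND
        score_time BETWEEN pay_time - INTERVAL 5 MINUTES
                       AND pay_time + INTERVAL 2 MINUTES
        """
    ),
)

query = joined.writeStream.outputMode("append").format("console").start()
```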
💡 Key Takeaways
Longer allowed lateness scales state size linearly: a 5-minute watermark uses 50GB where a 15-minute watermark uses 150GB at the same throughput, often forcing migration from memory-based to disk-backed stores
Choose short watermarks (1 to 3 min) for user-facing features where 95% completeness and sub-5-minute latency matter more than perfect counts, accepting that 2 to 5% of late mobile events are dropped
Skip watermarking entirely when only processing time matters, as in server-side rate limiting or operational metrics, saving significant complexity and resource overhead
Multi-stream joins with watermarking can balloon state 2 to 3x because the slower stream's delays force longer retention, often requiring independent watermarks and accepting missed matches
📌 Examples
1. A social media company uses 90-second watermarks for real-time engagement metrics shown to users, accepting 97% accuracy. Their data science team runs nightly Spark batch jobs on the complete dataset for accurate reporting and model training
2. An ad tech platform tried 20-minute watermarks for billing accuracy, but state grew to 800GB per worker. They switched to 3-minute watermarks with side outputs, processing 99.5% of events in real time and reconciling the 0.5% late data in a daily batch job that updates invoices
3. When joining click streams (2-minute delay) with purchase streams (8-minute delay) at 100k events per second, one team set independent 3-minute and 10-minute watermarks, accepting that the roughly 1% of clicks matching late purchases would be missed rather than keeping all click state for 10 minutes
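
The nightly reconciliation half of that hybrid pattern is an ordinary batch job. A sketch, assuming the late events were preserved as Parquet (as in the side-output sketch earlier) and that the real-time aggregates share the same window and event_type columns; all paths are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("late-event-reconciliation").getOrCreate()

# Re-aggregate the preserved late events into the same windows the streaming
# job used, then fold them into the stored real-time aggregates.
late = spark.read.parquet("/data/late-events")               # hypothetical
corrections = (
    late.groupBy(F.window("event_time", "1 minute"), "event_type")
    .count()
    .withColumnRenamed("count", "late_count")
)

realtime = spark.read.parquet("/data/realtime-aggregates")   # hypothetical
reconciled = (
    realtime.join(corrections, ["window", "event_type"], "left")
    .withColumn(
        "count",
        F.col("count") + F.coalesce(F.col("late_count"), F.lit(0)),
    )
    .drop("late_count")
)

reconciled.write.mode("overwrite").parquet("/data/reconciled-aggregates")
```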