Stream Processing Architectures › Event Time vs Processing Time · Medium · ⏱️ ~3 min

State Management and Watermark Strategies

The State Challenge: Event time processing requires maintaining per-key, per-window state until the watermark progresses past the end of the window. If you allow 15 minutes of lateness and have 10 million active keys, you might need to store 15 minutes' worth of aggregates. At large scale, this translates to hundreds of gigabytes of state that must be kept in memory or on fast storage. That state must also be checkpointed periodically for fault tolerance. Checkpointing writes state snapshots to durable storage, which adds I/O overhead and can hurt tail latencies, especially as state grows large.

How Watermarks Work: A watermark is the mechanism that tells the system "I believe I have seen all events up to time T, except possibly a small fraction of late arrivals." The simplest strategy is a fixed-lag watermark: if the maximum observed event time is E, the watermark is E − L, where L is your allowed lateness (for example, 5 minutes). When the watermark passes the end of a window, the system can emit final results for that window and drop the associated state. This is how you balance accuracy (waiting for late events) against resource efficiency (not holding state forever).
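The fixed-lag strategy above can be sketched in a few lines. This is a minimal illustration, not any particular engine's API; the class name, the 5-minute lateness value, and the example timestamps are assumptions chosen to mirror the text.

```python
from dataclasses import dataclass

@dataclass
class FixedLagWatermark:
    """Fixed-lag watermark: watermark = E - L, where E is the maximum
    observed event time and L is the allowed lateness."""
    lateness_s: float                       # L, in seconds
    max_event_time: float = float("-inf")   # E, updated as events arrive

    def observe(self, event_time: float) -> None:
        # Track the maximum event time seen so far (events may be out of order).
        self.max_event_time = max(self.max_event_time, event_time)

    def current(self) -> float:
        # Watermark = E - L.
        return self.max_event_time - self.lateness_s

# 5 minutes of allowed lateness, with out-of-order event times (in seconds).
wm = FixedLagWatermark(lateness_s=300.0)
for t in (1000.0, 1010.0, 1005.0):
    wm.observe(t)

print(wm.current())  # 710.0  (1010 - 300)

# A window ending at 700 can now emit final results and drop its state;
# a window ending at 720 must keep waiting.
can_finalize = wm.current() >= 700.0  # True
```

Note that the watermark only ever moves forward with the maximum observed event time, which is exactly why dropped state never needs to be resurrected for on-time data.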
State Memory Requirements (10 million active keys)

  Lateness    State Size
  5 min       50 GB
  15 min      150 GB
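The figures above follow from a back-of-the-envelope calculation: active keys × open windows per key × bytes per aggregate. The 1 KB-per-aggregate size and 1-minute window length below are assumptions chosen so the arithmetic reproduces the article's numbers, not measured values.

```python
def state_size_gb(active_keys: int,
                  lateness_min: int,
                  window_min: int = 1,
                  bytes_per_aggregate: int = 1000) -> float:
    """Rough state sizing: keys x open windows x per-aggregate bytes.

    The number of windows that must stay open per key scales with
    allowed lateness divided by window length (assumed 1-minute windows,
    ~1 KB per key-window aggregate).
    """
    open_windows = lateness_min / window_min
    return active_keys * open_windows * bytes_per_aggregate / 1e9

small = state_size_gb(10_000_000, lateness_min=5)    # 50.0 GB
large = state_size_gb(10_000_000, lateness_min=15)   # 150.0 GB
```

The linear relationship is the point: tripling allowed lateness triples the open-window count per key, and therefore triples total state.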
Adaptive Watermarks: More sophisticated strategies use histograms of observed delays to set the watermark adaptively. If your system notices that 99 percent of events arrive within 2 seconds but 1 percent arrive 2 to 5 minutes late, you can set the watermark aggressively for the common case while still handling outliers through a separate late-data path.

Handling Very Late Data: Events arriving after the watermark has passed are considered "too late." Many architectures write these to a side output stream. A separate batch or micro-batch job periodically reads these late events and applies corrections to downstream stores, such as adjusting precomputed aggregates in a data warehouse. This is how companies reconcile strict latency Service Level Agreements (SLAs) for dashboards (for example, 99th percentile under 5 seconds) with eventual accounting correctness over hours or days.
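Both ideas can be sketched together: an adaptive watermark whose lag is the chosen percentile of recently observed delays, plus side-output routing for events that arrive behind it. This is an illustrative sketch under assumed parameters (a p99 percentile, a sliding sample of recent delays), not a production implementation.

```python
from collections import deque

class AdaptiveWatermark:
    """Watermark whose lag tracks a percentile of observed arrival delays,
    rather than a fixed worst-case bound."""

    def __init__(self, percentile: float = 0.99, sample_size: int = 10_000):
        self.percentile = percentile
        self.delays = deque(maxlen=sample_size)  # recent (arrival - event) delays
        self.max_event_time = float("-inf")

    def observe(self, event_time: float, arrival_time: float) -> None:
        self.delays.append(max(0.0, arrival_time - event_time))
        self.max_event_time = max(self.max_event_time, event_time)

    def current(self) -> float:
        if not self.delays:
            return float("-inf")
        # Lag = delay at the chosen percentile: if 99% of events arrive
        # within 2 s, the watermark trails max event time by ~2 s, and the
        # rare multi-minute stragglers go to the late-data path instead.
        ordered = sorted(self.delays)
        idx = int(self.percentile * (len(ordered) - 1))
        return self.max_event_time - ordered[idx]

# 99 events arrive 2 s late, one straggler arrives 300 s late.
wm = AdaptiveWatermark(percentile=0.99)
base = 10_000.0
for i in range(99):
    wm.observe(base + i, base + i + 2.0)
wm.observe(base + 99, base + 99 + 300.0)
print(wm.current())  # 10097.0  (max event time 10099 minus p99 delay of 2 s)

# Side-output routing: events behind the watermark go to the late stream.
main, late = [], []
for event_time in (10098.0, 10090.0):
    (late if event_time < wm.current() else main).append(event_time)
```

A separate correction job would then consume the `late` stream and patch downstream aggregates, as described above.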
⚠️ Common Pitfall: If one partition lags behind due to a slow node or an unbalanced key distribution, the global watermark is held back by that slow partition, delaying results for all keys. A design that synchronizes watermarks too strictly across partitions can see its 99th percentile latency dominated by a few slow shards.
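The pitfall comes from how per-partition watermarks combine: to stay correct, the global watermark must be the minimum across partitions, so one straggler stalls everyone. A minimal sketch (partition names and timestamps are made up for illustration):

```python
# Per-partition watermarks, e.g. one per Kafka partition or shard.
partition_watermarks = {
    "p0": 1000.0,
    "p1": 1002.0,
    "p2": 400.0,   # slow node or hot-key partition lagging far behind
}

# Correctness requires the minimum: any event time above the global
# watermark might still arrive on the slowest partition.
global_watermark = min(partition_watermarks.values())
print(global_watermark)  # 400.0 -- every key's results wait on p2
```

Some engines mitigate this by letting sources declare a partition idle so it is excluded from the minimum; the details are engine-specific and not covered by the text above.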
💡 Key Takeaways
Longer allowed lateness windows increase memory requirements linearly: 15 minutes of lateness with 10 million keys can require 150 gigabytes of state compared to 50 gigabytes with 5 minutes
Watermarks use the formula: watermark equals maximum observed event time minus allowed lateness, enabling the system to emit results and drop state once the watermark passes a window boundary
Very late events (arriving after watermark) are written to side outputs and processed by separate batch jobs that apply corrections to ensure eventual correctness without blocking real time results
Adaptive watermarks using delay histograms can optimize for the common case (99 percent arrive within 2 seconds) while still handling outliers through a separate late data correction mechanism
📌 Examples
1. A system with 10 million active keys and 15-minute lateness tolerance maintains aggregates for all keys across all windows in that time range, requiring checkpointing of 100 to 200 gigabytes of state to disk every few minutes for fault tolerance.
2. When computing hourly revenue metrics, an event with event time 2:30 PM arriving at 2:47 PM (17 minutes late), after the watermark has passed 2:45 PM, is written to a correction stream and reconciled overnight in batch jobs.