
Failure Modes: Clock Skew, Sparse Streams, and Recovery

Client Clock Skew: Watermarks assume event timestamps are reasonably accurate and not too far ahead of actual time. When mobile clients or IoT devices have incorrect clocks, this assumption breaks catastrophically. If a client's clock is 30 minutes fast and generates events with timestamps in the future, the observed maximum event time jumps forward by 30 minutes. This instantly advances the watermark by 30 minutes. Now all legitimate events with correct timestamps, which were previously on time, suddenly appear to be 30 minutes late and get dropped or sidelined. This happened to a major IoT platform when a firmware bug caused devices to set their clocks to January 1, 2038 (just before the 32-bit Unix timestamp overflow date). The watermark jumped decades into the future, and every real event was classified as extremely late and dropped. The pipeline appeared healthy to monitoring but was silently discarding all data until operators noticed zero output and manually reset state.
❗ Remember: Always bound check event timestamps. Reject or clamp timestamps more than a threshold (like 1 hour) ahead of processing time to prevent clock skew from corrupting watermarks.
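A minimal sketch of this bound check in Python; the threshold, the clamping choice, and all names are illustrative rather than any framework's API:

```python
from datetime import datetime, timedelta, timezone

# Illustrative bound: tune to your clients' worst plausible clock skew.
MAX_ALLOWED_FUTURE = timedelta(minutes=30)

def sanitize_event_time(event_ts: datetime) -> datetime:
    """Clamp timestamps that are implausibly far in the future so one
    skewed client cannot drag the watermark forward for everyone.
    event_ts is expected to be timezone-aware (UTC)."""
    now = datetime.now(timezone.utc)
    if event_ts > now + MAX_ALLOWED_FUTURE:
        # Alternative: route the raw event to a dead letter queue instead
        # of clamping, so the clock skew can be investigated.
        return now
    return event_ts
```

Clamping keeps the event in the pipeline at the cost of a distorted timestamp; the dead letter alternative preserves the original value for debugging.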
Robust implementations compare event_timestamp against processing_timestamp. If event time is more than max_allowed_future (commonly 5 to 60 minutes) ahead, treat the event as suspicious: either clamp its timestamp to the current processing time or route it to a dead letter queue for investigation.

Sparse and Bursty Streams: Watermarks struggle when event arrival is irregular. Consider a stream of high value transactions that occur only a few times per hour. Partition 5 might see its last event at 14:30:00, then go silent for 90 minutes. Without idle detection, the global watermark (the minimum across partitions) stalls at 14:25:00 (14:30:00 minus a 5 minute out-of-orderness bound) indefinitely, blocking all window finalization. State accumulates on the busy partitions even though they are making progress, because their windows can never be finalized. Memory grows unbounded and eventually the job fails with Out Of Memory errors. Frameworks address this with idle timeouts, typically 1 to 5 minutes: if a partition sees no events for this duration, it is temporarily excluded from the minimum calculation, and the watermark can advance based on the active partitions. When the idle partition receives data again, it rejoins and may temporarily slow progress until it catches up. But this creates a subtle race condition. If the idle partition receives a burst of old events after being excluded, those events might have timestamps before the now advanced global watermark and get classified as late, even though they arrived in source order. The solution is to track the last included watermark per partition and apply special handling to partitions rejoining after idleness (a simplified sketch appears below).

Failure Recovery and State Consistency: When a streaming job crashes and recovers from a checkpoint, watermark state must be restored consistently with window state. If the watermark is restored too low (earlier than it should be), replayed events that were already processed and emitted might be counted again, causing double counting. If the watermark is restored too high (later than actual progress), events in the replay that should be accepted get classified as late and dropped, causing undercounting. This is particularly dangerous with exactly once processing semantics: the output appears exactly once, but the counts are wrong. Production systems checkpoint watermark state and window state atomically. Apache Flink snapshots the current watermark as part of each checkpoint barrier, so on recovery the watermark resumes from the checkpointed value. This ensures consistency but requires careful tuning: checkpoint intervals that are too long (over 5 minutes) increase recovery time and potential data loss, while intervals that are too short (under 30 seconds) increase overhead and can cause backpressure.
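Returning to the sparse and bursty stream case, here is a framework-agnostic Python sketch of per-partition watermark tracking with idle-partition exclusion. The class name, timeout, and out-of-orderness values are illustrative, and the rejoin handling is simplified to never letting the emitted watermark regress:

```python
import time

IDLE_TIMEOUT_SECONDS = 120       # illustrative: exclude partitions silent for 2 minutes
OUT_OF_ORDERNESS_SECONDS = 300   # illustrative: 5 minute lateness bound

class WatermarkTracker:
    """Tracks per-partition watermarks and derives a global watermark from
    the minimum over partitions that are not currently idle."""

    def __init__(self, partitions):
        self.watermarks = {p: float("-inf") for p in partitions}
        self.last_seen = {p: time.monotonic() for p in partitions}
        self.last_emitted = float("-inf")

    def on_event(self, partition, event_ts: float) -> None:
        # Per-partition watermark: max event time seen minus the lateness bound.
        self.watermarks[partition] = max(
            self.watermarks[partition], event_ts - OUT_OF_ORDERNESS_SECONDS
        )
        self.last_seen[partition] = time.monotonic()

    def global_watermark(self) -> float:
        now = time.monotonic()
        active = [
            wm for p, wm in self.watermarks.items()
            if now - self.last_seen[p] < IDLE_TIMEOUT_SECONDS
        ]
        # If every partition is idle, fall back to the overall minimum.
        candidate = min(active) if active else min(self.watermarks.values())
        # Never let the emitted watermark move backwards when an idle
        # partition rejoins with older data (the race described above).
        self.last_emitted = max(self.last_emitted, candidate)
        return self.last_emitted
```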
Recovery Impact: checkpoints every 5 minutes carry a high loss risk, while checkpoints every 30 seconds are a balanced choice.
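The atomicity requirement can be illustrated with a sketch in which the watermark and window state travel in a single snapshot that is written all-or-nothing. This is an illustration of the idea, not Flink's actual checkpoint format:

```python
import json
import os
from dataclasses import dataclass, field, asdict

@dataclass
class Checkpoint:
    """Watermark and window state snapshotted as one unit. Restoring one
    without the other is what causes double counting or silent drops."""
    watermark: float
    window_state: dict = field(default_factory=dict)

def write_checkpoint(path: str, checkpoint: Checkpoint) -> None:
    # Write to a temp file and rename so the checkpoint becomes visible
    # all-or-nothing, never a half-written mix of old and new state.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(asdict(checkpoint), f)
    os.replace(tmp_path, path)

def restore_checkpoint(path: str) -> Checkpoint:
    with open(path) as f:
        return Checkpoint(**json.load(f))
```

Because the restored watermark and the restored windows always come from the same snapshot, they describe the same point in the stream, which is the consistency property the checkpoint barrier mechanism provides.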
Configuration Changes are Dangerous: Reducing allowed lateness in a live system is a common source of production incidents. Suppose you deploy a config change that reduces allowed lateness from 10 minutes to 3 minutes. Events already in flight with timestamps between 3 and 10 minutes old are suddenly reclassified as late. This can cause a huge spike in dropped events: if 5% of your traffic is normally delayed between 3 and 10 minutes, that 5% instantly becomes late data. Dashboards show sudden drops in metrics, alerting fires, and teams scramble thinking there is a data loss bug. Safe configuration changes require migration strategies: deploy the new config to a canary pipeline first, monitor late data metrics for several hours, then gradually roll out. Some teams temporarily run dual pipelines with the old and new lateness settings and compare outputs before switching traffic.
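One way to make the canary step concrete is a guardrail that compares the canary's late-data rate against the baseline before promoting the tighter allowed-lateness setting; the threshold and metric names here are illustrative:

```python
def safe_to_promote(baseline_late_ratio: float,
                    canary_late_ratio: float,
                    max_absolute_increase: float = 0.01) -> bool:
    """Return True only if the canary (running the reduced allowed lateness)
    is not dropping meaningfully more events than the baseline pipeline."""
    return (canary_late_ratio - baseline_late_ratio) <= max_absolute_increase

# Example: the baseline sidelines 0.4% of events as late; the canary with the
# tighter setting sidelines 5.4%, so the rollout should be halted.
assert safe_to_promote(0.004, 0.054) is False
```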
💡 Key Takeaways
Client clock skew can cause watermarks to jump forward by hours or days, instantly classifying all legitimate events as late and causing silent data loss until operators notice zero output
Sparse streams without idle partition detection cause global watermark stalls, blocking window finalization across the entire job and leading to unbounded state growth and Out Of Memory failures
Failure recovery requires atomic checkpointing of watermark state with window state, or replayed events will be double counted (watermark too low) or incorrectly dropped (watermark too high)
Reducing allowed lateness in production reclassifies in flight events as late, causing sudden metric drops that look like data loss incidents, requiring canary deployments and gradual rollout
📌 Examples
1. An IoT platform lost 6 hours of data when a firmware bug set device clocks to 2038, advancing the watermark decades forward and dropping all real events as extremely late until engineers manually reset job state.
2. A payments stream whose partition 8 went idle for 2 hours caused the global watermark to stall, accumulating 400GB of state across the active partitions until the job crashed with Out Of Memory; an idle timeout was configured at 5 minutes but was not taking effect due to a bug.
3. When a team reduced allowed lateness from 15 minutes to 5 minutes, their dashboards suddenly showed an 8% drop in transaction volume, triggering an incident. Investigation revealed the 8% were legitimate transactions delayed between 5 and 15 minutes that were now being dropped.