Stream Processing Architectures • Event Time vs Processing Time
Failure Modes and Multi Region Complexity
Late and Out of Order Events: The most common failure mode is events arriving much later than expected. In a ride sharing system, most events might arrive within 2 seconds of event time, but 1 percent arrive 2 to 5 minutes late due to mobile connectivity issues. This is not rare: at Google or Amazon scale, even 0.1 percent late out of 1 billion events per day means 1 million correction events.
If your event time window closes after 1 minute of allowed lateness, that 1 percent is considered "too late." Depending on your design, you might drop those events (losing data), update a separate corrections table (adding complexity), or attempt to retroactively adjust aggregates (risking double counting).
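As a rough illustration, here is a minimal plain-Python sketch (all names and constants are hypothetical, not tied to any particular framework) of the side-output pattern: events that arrive after a window has been finalized are diverted to a corrections collection rather than mutating an aggregate that was already emitted.

```python
from collections import defaultdict

WINDOW_SIZE = 60          # 1-minute tumbling windows (seconds)
ALLOWED_LATENESS = 60     # 1 minute of allowed lateness (seconds)

live_counts = defaultdict(int)   # window_start -> count, served to dashboards
corrections = []                 # too-late events, reconciled by a batch job

def process(event_ts, watermark):
    """Assign an event to its event-time window or divert it to corrections."""
    window_start = event_ts - (event_ts % WINDOW_SIZE)
    window_close = window_start + WINDOW_SIZE + ALLOWED_LATENESS
    if watermark >= window_close:
        # Window already finalized: record for offline reconciliation
        # instead of double counting into an aggregate that was emitted.
        corrections.append((window_start, event_ts))
    else:
        live_counts[window_start] += 1

# An event 5 minutes behind the watermark lands in corrections;
# a slightly late one still makes it into the live aggregate.
process(event_ts=1_000, watermark=1_300)
process(event_ts=1_290, watermark=1_300)
print(len(corrections), dict(live_counts))
```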
Clock Skew and Bad Timestamps: Another edge case is incorrect timestamps on client devices. If a client clock is off by 10 minutes, that user's events will fall into the wrong windows. Without validation or server side timestamping, event time semantics can be corrupted at the source.
Some companies capture both client event time and server receipt time, then constrain or adjust event time when the skew exceeds a threshold (for example, if client time is more than 5 minutes ahead of server time, clamp it to server time). This adds logic but prevents a single misconfigured device from polluting your analytics.
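A minimal sketch of that clamping rule, assuming timestamps in epoch seconds and the 5 minute skew threshold mentioned above:

```python
MAX_SKEW_SECONDS = 5 * 60  # clamp when client time runs > 5 min ahead of server time

def effective_event_time(client_ts: float, server_receipt_ts: float) -> float:
    """Use client event time unless it is implausibly ahead of server receipt time."""
    if client_ts - server_receipt_ts > MAX_SKEW_SECONDS:
        # Client clock is too far ahead; fall back to server receipt time so one
        # misconfigured device cannot place its events in future windows.
        return server_receipt_ts
    return client_ts

# A client clock 10 minutes fast gets clamped to the server timestamp.
print(effective_event_time(client_ts=1_600, server_receipt_ts=1_000))  # -> 1000
print(effective_event_time(client_ts=1_050, server_receipt_ts=1_000))  # -> 1050
```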
Backfills and Replays: When you replay historical data through a stream processor, processing time is far in the future relative to event time. If your logic accidentally uses processing time anywhere (for example, in logging or side effects), backfills will generate nonsensical windows.
Even in purely event time based code, watermarks and lateness policies might need adjustment for replays. If you use a fixed 5 minute lateness in production but replay 7 days of data in 2 hours, the replay will see massive compression of processing time relative to event time, and watermarks may close windows prematurely, dropping events that would have been considered on time in the original run.
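One common mitigation is to derive watermarks purely from observed event time with a bounded out-of-orderness, and to widen that bound when the job runs in backfill mode. A sketch, where the mode flag and the 15 minute backfill bound are assumptions for illustration:

```python
import os

# Hypothetical knobs: production tolerates 5 minutes of out-of-orderness,
# backfills get a wider bound (15 minutes) so the compressed replay does not
# close windows before late events from the original timeframe arrive.
IS_BACKFILL = os.environ.get("JOB_MODE") == "backfill"
MAX_OUT_OF_ORDERNESS = (15 if IS_BACKFILL else 5) * 60  # seconds

class BoundedOutOfOrderWatermark:
    """Watermark driven only by observed event time, never by wall-clock time."""
    def __init__(self, max_out_of_orderness: int):
        self.max_out_of_orderness = max_out_of_orderness
        self.max_event_ts = float("-inf")

    def observe(self, event_ts: float) -> float:
        self.max_event_ts = max(self.max_event_ts, event_ts)
        return self.max_event_ts - self.max_out_of_orderness

wm = BoundedOutOfOrderWatermark(MAX_OUT_OF_ORDERNESS)
for ts in [100, 400, 250, 900]:   # out-of-order replayed events
    print(ts, "-> watermark", wm.observe(ts))
```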
Operational Failures and Tail Latency: Node crashes and checkpoint restores can affect how watermarks advance. If one partition is lagging behind due to a slow disk or garbage collection pause, the global watermark may be held back by that slow partition, delaying results for all keys. A design that is too strict about synchronizing watermarks across all partitions can have 99th percentile latencies dominated by a few slow shards.
Some frameworks allow per partition watermarks with policies like "advance global watermark when 90 percent of partitions have advanced," accepting a small amount of late data in exchange for not being blocked by stragglers.
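A sketch of such a quorum policy in plain Python (the 90 percent threshold and function name are illustrative): the global watermark advances to the point that 90 percent of partitions have already passed, so a single straggler cannot hold everything back.

```python
import math

def global_watermark(partition_watermarks: list, quorum: float = 0.90) -> float:
    """Advance the global watermark once `quorum` of partitions have reached it,
    leaving the slowest stragglers behind at the cost of a little extra late data."""
    if not partition_watermarks:
        return float("-inf")
    ranked = sorted(partition_watermarks)
    # Number of slow partitions we are allowed to leave behind.
    stragglers = len(ranked) - math.ceil(quorum * len(ranked))
    return ranked[stragglers]

# 10 partitions; one is stuck behind a GC pause. A strict min() would report 120,
# while a 90 percent quorum lets the global watermark move on to 980.
wms = [980, 1_000, 990, 1_005, 995, 1_010, 1_002, 998, 1_001, 120]
print(min(wms), global_watermark(wms))
```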
Impact of Late Events at Scale: a 1 percent late rate on 1 billion events per day means roughly 10 million correction events.
❗ Remember: In multi region architectures, events generated in one region and processed in another add hundreds of milliseconds to seconds of network transit delay. If windows are keyed and aggregated globally, all regions must agree on event time interpretation and maximum lateness tolerance, or the same user action may be counted in different windows in different systems.
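That agreement boils down to every region deriving window membership from the same timestamp. A tiny plain-Python illustration of why keying on local receipt time breaks the agreement near a window boundary, while keying on the shared event time preserves it:

```python
WINDOW_SIZE = 3_600  # hourly windows, in seconds

def window_for(ts: int) -> int:
    """Deterministic hourly window assignment from a single agreed timestamp."""
    return ts - (ts % WINDOW_SIZE)

# A user action 1 second before the top of the hour, processed in two regions.
event_ts = 7_199                 # client event time (after skew validation)
local_receipt = event_ts         # ingested immediately in the origin region
remote_receipt = event_ts + 2    # arrives after cross-region network transit

# Keying on receipt time puts the same action in two different hourly windows;
# keying on the shared event time keeps every region in agreement.
print(window_for(local_receipt), window_for(remote_receipt))  # 3600 7200
print(window_for(event_ts), window_for(event_ts))             # 3600 3600
```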
Testing and Observability: You need explicit testing for time semantics. This includes metrics for event time versus processing time lag distributions, counts of late and dropped events, watermark progression per partition, and divergence between live streaming aggregates and offline recomputations.
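A sketch of what those signals might look like in code (the counters and the divergence check are illustrative; in practice they would feed a metrics backend such as Prometheus and a scheduled reconciliation job):

```python
import time
from collections import Counter

metrics = Counter()              # late/dropped and total event counts
lag_samples = []                 # processing time minus event time, per event
partition_watermarks = {}        # partition id -> latest watermark seen

def record(event_ts: float, partition: int, watermark: float, dropped: bool):
    """Capture per-event observability signals for time semantics."""
    now = time.time()
    lag_samples.append(now - event_ts)            # event-time lag distribution
    partition_watermarks[partition] = watermark   # watermark progression per partition
    metrics["events_total"] += 1
    if dropped:
        metrics["events_dropped_late"] += 1

def divergence(streaming_counts: dict, offline_counts: dict) -> dict:
    """Per-window difference between live aggregates and an offline recompute."""
    windows = streaming_counts.keys() | offline_counts.keys()
    return {w: streaming_counts.get(w, 0) - offline_counts.get(w, 0) for w in windows}

record(event_ts=time.time() - 3.0, partition=7, watermark=time.time() - 60, dropped=False)
print(dict(metrics), round(max(lag_samples), 1))

# A systematic mismatch concentrated in one hourly window is the signature of
# events being assigned to the wrong hour by a misconfigured watermark.
print(divergence({3600: 100, 7200: 98}, {3600: 102, 7200: 98}))
```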
In a technical interview, you should be able to explain how you would detect a bug where 2 percent of events are systematically assigned to the wrong hour due to a watermark misconfiguration, and how you would safely reprocess to fix the historical data without double counting.
💡 Key Takeaways
✓ At scale, even 1 percent late events translates to millions of corrections daily: 1 billion events per day with a 1 percent late rate means 10 million events requiring special handling through side outputs or batch reconciliation
✓ Client clock skew can corrupt event time semantics if not validated: capturing both client event time and server receipt time enables clamping when skew exceeds thresholds like 5 minutes
✓ Backfills and replays require watermark policy adjustments because processing time is compressed relative to event time, potentially closing windows prematurely and dropping events that would have been on time originally
✓ Multi region architectures add hundreds of milliseconds of network transit, requiring global agreement on event time interpretation and lateness tolerances to avoid the same user action being counted in different windows across regions
📌 Examples
1. A mobile app with 1 billion events per day sees 1 percent of events delayed 2 to 5 minutes due to connectivity issues. With 1 minute of allowed lateness, 10 million events daily are routed to a side output for batch correction processing overnight.
2. During a backfill replaying 7 days of data in 2 hours, a system using the 5 minute production lateness sees watermarks advance too quickly relative to event time, closing windows before late events from the original timeframe arrive, requiring a separate backfill specific lateness configuration of 15 minutes.
3. A global platform processing events from mobile clients in Asia but aggregating in US data centers sees 200 to 500 milliseconds of additional network latency, requiring coordination on whether to use client event time or US server receipt time for window assignments.