Event Time, Watermarks, and Handling Late Data
Event time is the timestamp embedded in an event when it was generated, while processing time is when your system receives it. The difference matters because networks introduce variable delays. Mobile events can arrive minutes late when devices reconnect. Batch uploads might be delayed by hours. If you window by processing time, late events land in the wrong window: past periods are undercounted, the current period is overcounted, and nothing signals the error.
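To make the miscount concrete, here is a minimal sketch that buckets the same events by event time and by processing time. The event list, the 5-minute window size, and the helper name `window_start` are illustrative assumptions, not taken from any particular system.

```python
from collections import Counter

WINDOW_SECONDS = 300  # 5-minute tumbling windows (illustrative choice)


def window_start(ts: int, size: int = WINDOW_SECONDS) -> int:
    """Return the start of the tumbling window that contains timestamp ts."""
    return ts - (ts % size)


# Hypothetical events as (event_time, processing_time), in seconds.
events = [
    (10, 12),    # on time
    (120, 125),  # on time
    (290, 610),  # generated in window 0, received two windows later
    (310, 315),  # on time
]

by_event_time = Counter(window_start(e) for e, _ in events)
by_processing_time = Counter(window_start(p) for _, p in events)

print("event time:     ", dict(by_event_time))
print("processing time:", dict(by_processing_time))
```

With event-time windowing, window 0 holds three events. With processing-time windowing it holds two, and the delayed event is credited to window 600 instead.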
Watermarks solve the completeness problem. A watermark is a timestamp assertion that says all events with event time less than W have been observed. When the watermark advances past a window's end time, you treat the window as complete and emit the final aggregate. In practice the assertion is a heuristic, so watermarks involve a tradeoff: setting the watermark too aggressively (close to the current processing time) means many events arrive after their window has closed, while setting it too conservatively (far behind processing time) increases end-to-end latency.
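The finalization rule can be sketched as a small tumbling-window counter that emits a window only once the watermark reaches its end. The class `TumblingWindowCounter` and its method names are hypothetical; real stream processors expose this through their own APIs.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per event-time window; emits a window once the watermark reaches its end."""

    def __init__(self, size: int):
        self.size = size
        self.open_windows = defaultdict(int)  # window start -> event count
        self.watermark = float("-inf")

    def add(self, event_time: int) -> bool:
        """Add an event. Returns False if its window was already finalized (the event is late)."""
        start = event_time - (event_time % self.size)
        if start + self.size <= self.watermark:
            return False
        self.open_windows[start] += 1
        return True

    def advance_watermark(self, watermark: int) -> list[tuple[int, int]]:
        """Move the watermark forward and return (window_start, count) for each completed window."""
        self.watermark = max(self.watermark, watermark)
        done = sorted(s for s in self.open_windows if s + self.size <= self.watermark)
        return [(s, self.open_windows.pop(s)) for s in done]


counter = TumblingWindowCounter(size=60)
for t in (5, 42, 61):
    counter.add(t)
print(counter.advance_watermark(60))  # [(0, 2)]: window [0, 60) is complete
print(counter.add(30))                # False: event time 30 is behind the watermark
```

This sketch simply rejects late events; what to do with them is a separate policy decision.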
The standard approach is to measure your delay distribution empirically. For web traffic with good connectivity, p99 delay might be 30 seconds, so you set the watermark to current processing time minus 1 minute, which leaves a safety buffer. For mobile apps with offline users, p99 delay could be 2 to 3 minutes, requiring a watermark lag of 3 to 5 minutes. You accept that up to about 1% of events will arrive after the watermark and design a strategy for them.
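One way to derive a lag from measurements is sketched below, assuming you have logged per-event delays (processing time minus event time). The function names, the nearest-rank percentile, the `buffer_factor` of 2, and the synthetic delay sample are all assumptions chosen to mirror the 30-second p99 and 1-minute lag mentioned above.

```python
import math


def percentile(sorted_values: list[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]; expects values sorted ascending."""
    rank = math.ceil(q * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def watermark_lag(delays_s: list[float], q: float = 99.0, buffer_factor: float = 2.0) -> float:
    """Watermark lag in seconds: the q-th percentile delay scaled by a safety buffer."""
    return percentile(sorted(delays_s), q) * buffer_factor


def late_fraction(delays_s: list[float], lag_s: float) -> float:
    """Fraction of observed events whose delay exceeded the chosen lag."""
    return sum(d > lag_s for d in delays_s) / len(delays_s)


# Synthetic sample of 100 delays in seconds (made up for illustration).
delays = [1.0] * 90 + [10.0] * 8 + [30.0] + [600.0]

lag = watermark_lag(delays)
print(lag)                         # 60.0
print(late_fraction(delays, lag))  # 0.01
```

Re-running this on fresh delay logs at regular intervals is a reasonable way to notice when connectivity patterns change and the lag needs adjusting.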
Late data handling has three options. First, drop late events entirely, which is simple but loses data and causes metrics drift. Second, allow a bounded lateness period (for example, 2 minutes after the watermark passes the window's end) during which you accept late events, update the window, and emit corrections. This requires downstream systems to handle updates and retractions, usually via upserts keyed by window and entity. Third, route very late events to a separate backfill process that recomputes historical aggregates offline. Production systems often combine all three: accept lateness up to 2 minutes with corrections, then route older events to backfill, and drop events older than 1 hour as unrecoverable.
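The combined policy can be sketched as a routing function. The `Route` enum and `route_event` are hypothetical names, and measuring lateness as how far the watermark has moved past the event's window end is one possible convention; the 2-minute and 1-hour thresholds are the ones quoted above.

```python
from enum import Enum


class Route(Enum):
    ON_TIME = "on_time"    # window still open: aggregate normally
    UPDATE = "update"      # within allowed lateness: update the window, emit a correction
    BACKFILL = "backfill"  # too late for streaming corrections: recompute offline
    DROP = "drop"          # beyond the recovery horizon


def route_event(
    window_end: int,
    watermark: int,
    allowed_lateness: int = 120,  # 2 minutes, in seconds
    max_age: int = 3600,          # 1 hour, in seconds
) -> Route:
    """Decide what to do with an event, given the end of its window and the current watermark."""
    if watermark < window_end:
        return Route.ON_TIME
    lateness = watermark - window_end  # how long ago the window was finalized
    if lateness <= allowed_lateness:
        return Route.UPDATE
    if lateness <= max_age:
        return Route.BACKFILL
    return Route.DROP


print(route_event(window_end=300, watermark=200))   # Route.ON_TIME
print(route_event(window_end=300, watermark=360))   # Route.UPDATE
print(route_event(window_end=300, watermark=900))   # Route.BACKFILL
print(route_event(window_end=300, watermark=4300))  # Route.DROP
```

Keeping the thresholds as parameters makes it straightforward to tune them per pipeline without touching the routing logic.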
The implementation complexity is significant. You need per-key state to track which windows are still open for updates. You must version window outputs to prevent out-of-order writes from overwriting newer values in the feature store. Downstream consumers must handle retractions or additive updates correctly. Google's Dataflow model formalized these concepts and showed they are necessary for correctness, but many teams initially underestimate the operational burden. The payoff is accurate metrics even with unreliable event delivery, which is critical for billing, SLAs, and model training data quality.
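Versioned writes can be sketched with an in-memory stand-in for a feature store. The `FeatureStore` class, its `upsert` signature, and the integer version counter are assumptions for illustration; a real store would enforce the same check with a conditional write.

```python
from typing import Optional


class FeatureStore:
    """In-memory stand-in for a feature store keyed by (entity, window_start)."""

    def __init__(self) -> None:
        self._rows: dict[tuple[str, int], tuple[int, float]] = {}  # key -> (version, value)

    def upsert(self, entity: str, window_start: int, version: int, value: float) -> bool:
        """Write value only if version is newer than the stored one. Returns True if applied."""
        key = (entity, window_start)
        current = self._rows.get(key)
        if current is not None and current[0] >= version:
            return False  # stale or duplicate write: keep the newer value
        self._rows[key] = (version, value)
        return True

    def get(self, entity: str, window_start: int) -> Optional[float]:
        row = self._rows.get((entity, window_start))
        return None if row is None else row[1]


store = FeatureStore()
# The correction (version 2) reaches the store before the original emit (version 1).
print(store.upsert("driver_1", 600, version=2, value=41.0))  # True
print(store.upsert("driver_1", 600, version=1, value=40.0))  # False: older version rejected
print(store.get("driver_1", 600))                            # 41.0
```

Keying by entity and window start matches the upsert pattern described for late corrections, and the version check makes replays idempotent.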
💡 Key Takeaways
•Event time uses the timestamp from when an event occurred and handles out-of-order delivery correctly, while processing time uses the system receive time, which is simpler but miscounts when delays vary
•Watermarks indicate completeness: a watermark at time W asserts that all events with event time below W have arrived. Sizing the lag from the measured p99 delay lets windows be finalized with about 99% of their events present
•Typical watermark lags are 1 to 2 minutes for web traffic (p99 delay 30 seconds plus buffer) and 3 to 5 minutes for mobile apps (p99 delay 2 to 3 minutes due to offline periods)
•Allowed lateness extends watermarks with bounded update periods, commonly 2 to 5 minutes, during which late events trigger window updates and corrections via versioned upserts to feature stores
•Production systems route very late events (older than 5 to 10 minutes) to separate backfill pipelines and drop events beyond recovery windows (1 hour typical) to bound state and prevent indefinite corrections
📌 Examples
Uber trip metrics: Mobile GPS events have a p99 delay of 3 minutes. The watermark lags by 5 minutes, so the window [10:00, 10:05) closes at 10:10 processing time. Allow 2 minutes of lateness until 10:12, then route later events to an hourly backfill.
Airbnb booking funnel: Track search-to-book conversion in session windows with a 30 minute inactivity gap. Events from flaky hotel WiFi arrive up to 10 minutes late. The watermark lags by 12 minutes and allowed lateness is 3 minutes, after which sessions are marked complete.
Amazon ad billing: Clicks must be counted exactly for billing SLAs. Use event-time windows with a 1 minute watermark lag and 5 minutes of allowed lateness. Late events beyond 5 minutes go to a daily reconciliation batch job that issues credits.
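The timeline arithmetic in the trip metrics example can be checked with a short sketch. It assumes the watermark is simply processing time minus a fixed lag; `window_deadlines` is a hypothetical helper and the calendar date is arbitrary.

```python
from datetime import datetime, timedelta


def window_deadlines(
    window_end: datetime, watermark_lag: timedelta, allowed_lateness: timedelta
) -> tuple[datetime, datetime]:
    """Return (processing time the window first fires, processing time it stops accepting updates)."""
    fires_at = window_end + watermark_lag
    closes_at = fires_at + allowed_lateness
    return fires_at, closes_at


# Window [10:00, 10:05), 5 minute watermark lag, 2 minutes of allowed lateness.
end = datetime(2024, 1, 1, 10, 5)
fires_at, closes_at = window_deadlines(end, timedelta(minutes=5), timedelta(minutes=2))
print(fires_at.time(), closes_at.time())  # 10:10:00 10:12:00
```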