Event Time, Watermarks, and Handling Late Data
Event Time vs Processing Time: Event time is when something actually happened (user clicked at 10:00:00). Processing time is when the system handles the event (processed at 10:00:05). The gap, caused by network delays, batching, or retries, creates the late data problem that watermarks address.
Why Event Time Matters
Processing time windows give inconsistent results. If your server slows down, events queue up and arrive late, so a 5-minute window might contain events spanning 15 minutes of real activity. Event time windows give deterministic results: the 10:00-10:05 window always contains events from those 5 minutes, regardless of when they are processed. For ML features, this reproducibility is essential, because training and serving must compute identical features.
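To make the difference concrete, here is a minimal sketch that counts the same three events in 5-minute tumbling windows, once keyed by event time and once by processing time. The field names, the helper functions, and the sample timestamps (seconds from an arbitrary origin) are assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute tumbling windows


def window_start(timestamp, size=WINDOW_SECONDS):
    """Return the start of the tumbling window that contains `timestamp`."""
    return timestamp - (timestamp % size)


def count_by_window(events, time_field):
    """Count events per window, keyed by the chosen time field."""
    counts = defaultdict(int)
    for event in events:
        counts[window_start(event[time_field])] += 1
    return dict(counts)


# Three clicks that all happened inside the first window (event times 10, 120, 290),
# but a slow server handled the last two after the window boundary at 300.
events = [
    {"event_time": 10, "processing_time": 15},
    {"event_time": 120, "processing_time": 310},
    {"event_time": 290, "processing_time": 650},
]

by_event_time = count_by_window(events, "event_time")
by_processing_time = count_by_window(events, "processing_time")

print(by_event_time)       # all three clicks land in the window starting at 0
print(by_processing_time)  # the same clicks are scattered over three windows
```

Replaying the same events with different delays changes the processing time counts but never the event time counts, which is the reproducibility property that feature pipelines rely on.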
Watermarks: Tracking Progress
A watermark is a timestamp assertion: all events with event time less than the watermark are believed to have arrived. If the watermark is 10:05:00, the system believes all events before 10:05 have arrived. Watermarks lag behind real time by the expected maximum lateness. With a 30-second expected delay, when the wall clock shows 10:05:30, the watermark is at 10:05:00. When the watermark passes the end of a window, that window can be finalized. Advancing the watermark too aggressively causes late events to be dropped; advancing it too conservatively delays results.
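One way to picture watermark tracking is the sketch below. It assumes a simple heuristic: the watermark is the largest event time seen so far minus a fixed maximum lateness. That differs slightly from lagging the wall clock, and it is only one of several possible strategies. The class name, the `hms` helper, and the 30-second bound are illustrative assumptions, not part of any specific framework.

```python
def hms(hours, minutes, seconds=0):
    """Convert a time of day to seconds since midnight."""
    return hours * 3600 + minutes * 60 + seconds


class BoundedLatenessWatermark:
    """Watermark = max event time observed so far minus a fixed lateness bound."""

    def __init__(self, max_lateness):
        self.max_lateness = max_lateness
        self.max_event_time = None

    def observe(self, event_time):
        # Out-of-order events never move the watermark backwards.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

    def current(self):
        if self.max_event_time is None:
            return None  # no events yet, so no assertion can be made
        return self.max_event_time - self.max_lateness


def window_is_final(window_end, watermark):
    """A window [start, end) can be finalized once the watermark reaches its end."""
    return watermark is not None and watermark >= window_end


wm = BoundedLatenessWatermark(max_lateness=30)
window_end = hms(10, 5)  # the 10:00-10:05 window

wm.observe(hms(10, 5, 10))
early_watermark = wm.current()  # 10:04:40, window still open
early_final = window_is_final(window_end, early_watermark)

wm.observe(hms(10, 4, 50))      # out-of-order event, watermark unchanged
wm.observe(hms(10, 5, 30))
later_watermark = wm.current()  # 10:05:00, window can be finalized
later_final = window_is_final(window_end, later_watermark)
```

Raising `max_lateness` makes the watermark more conservative: fewer events arrive behind it, but every window waits longer before it can be finalized.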
Late Data Strategies
Drop late data: Simplest. Once window closes, late arrivals discarded. Acceptable if late data is rare (under 0.1%).
Allowed lateness: Keep windows open for additional time after watermark passes. More accurate but uses more memory.
Retractions: Emit preliminary results, then emit corrections when late data arrives. Complex but necessary for high-accuracy applications.
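The three strategies can be compared in a small sketch. It assumes a per-window event counter with 5-minute tumbling windows; the class, its parameters (`allowed_lateness`, `retract`), and the output tuple format are hypothetical and chosen only for illustration. With `allowed_lateness=0` late events are dropped; with a positive value, window state is retained for that long after the watermark passes; with `retract=True`, the previously emitted result is withdrawn before the corrected one is emitted.

```python
from collections import defaultdict

WINDOW = 300  # 5-minute tumbling windows, in seconds


class WindowedCounter:
    def __init__(self, allowed_lateness=0, retract=False):
        self.allowed_lateness = allowed_lateness
        self.retract = retract
        self.watermark = float("-inf")
        self.counts = defaultdict(int)  # window start -> count (retained state)
        self.emitted = {}               # window start -> last emitted count
        self.dropped = 0
        self.output = []                # (kind, window_start, count) tuples

    def on_event(self, event_time):
        start = event_time - event_time % WINDOW
        if self.watermark >= start + WINDOW + self.allowed_lateness:
            self.dropped += 1           # state for this window is already gone
            return
        self.counts[start] += 1
        if start in self.emitted:
            # Late event for an already emitted window: publish a correction.
            if self.retract:
                self.output.append(("retract", start, self.emitted[start]))
            self.output.append(("result", start, self.counts[start]))
            self.emitted[start] = self.counts[start]

    def on_watermark(self, watermark):
        self.watermark = watermark
        for start in sorted(self.counts):
            if watermark >= start + WINDOW and start not in self.emitted:
                self.output.append(("result", start, self.counts[start]))
                self.emitted[start] = self.counts[start]
        # Free state for windows that are past their allowed lateness.
        expired = [s for s in self.counts
                   if watermark >= s + WINDOW + self.allowed_lateness]
        for start in expired:
            del self.counts[start]


def run(counter):
    """Two on-time events, the watermark passes the window, then one late event."""
    counter.on_event(10)
    counter.on_event(120)
    counter.on_watermark(300)
    counter.on_event(290)  # arrives after the watermark passed its window
    return counter


drop = run(WindowedCounter(allowed_lateness=0))
lateness = run(WindowedCounter(allowed_lateness=60))
retractions = run(WindowedCounter(allowed_lateness=60, retract=True))
```

In this run the drop policy leaves the window count at 2 and records one dropped event, while the other two policies correct it to 3. The cost of the correction is visible in the code: `counts` keeps window state alive for the extra 60 seconds, and downstream consumers must be able to handle more than one result per window.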
Warning: Late data handling must be consistent between training and serving. If training includes late data but serving drops it, feature values diverge—a classic source of training-serving skew.