Stream Processing Core Model: Unbounded Logs, Partitions, and Event Time
Stream processing treats data as an unbounded, continuously arriving log of events rather than as finite batches. The foundational abstraction is a partitioned, ordered log, where partitioning by key enables parallel processing and state locality. Order guarantees exist only within a single partition, not across partitions. This design trades global ordering for horizontal scalability.
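To make the routing concrete, here is a minimal Java sketch of key-to-partition assignment. Kafka's default partitioner applies the same idea using murmur2 hashing over the serialized key bytes; this sketch uses `hashCode` purely for brevity, and all names are illustrative.

```java
public class KeyPartitioning {
    // Map a key to a partition index. Every event carrying the same key lands
    // on the same partition, which is what yields per-key ordering and lets an
    // operator keep that key's state locally.
    static int partitionFor(String key, int numPartitions) {
        int hash = key.hashCode() & 0x7fffffff; // mask the sign bit so the index is never negative
        return hash % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 12;
        // The same key maps to the same partition both times, so all events
        // for "user-42" are consumed in order by a single operator instance.
        System.out.println(partitionFor("user-42", partitions));
        System.out.println(partitionFor("user-42", partitions)); // same value again
        System.out.println(partitionFor("user-7", partitions));  // possibly a different partition
    }
}
```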
Event time processing decouples correctness from when data physically arrives. Pipelines reason about when something happened (event time) rather than when it was seen (processing time). For example, a mobile app might generate a click event at 10:00 AM but send it at 10:05 AM due to network issues; event time processing correctly attributes it to the 10:00 AM window. Watermarks provide the mechanism: monotonically advancing signals indicating that the system believes no events earlier than time T will arrive. Late data beyond the watermark is handled via allowed lateness windows, side outputs, or retractions.
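The sketch below wires these pieces together against the Flink 1.x DataStream API: a bounded-out-of-orderness watermark strategy, event-time tumbling windows, allowed lateness, and a side output for stragglers. The Click POJO, the 5-minute bound, and the 1-minute windows are illustrative assumptions, not values from the source.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class EventTimeClickCounts {
    // Hypothetical event: a click stamped with the time it happened on the device.
    public static class Click {
        public String userId;
        public long eventTimeMillis;
        public Click() {}
        public Click(String userId, long eventTimeMillis) {
            this.userId = userId;
            this.eventTimeMillis = eventTimeMillis;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The watermark trails the max observed event time by 5 minutes,
        // i.e. the pipeline tolerates up to 5 minutes of out-of-order arrival.
        WatermarkStrategy<Click> watermarks = WatermarkStrategy
                .<Click>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                .withTimestampAssigner((click, recordTs) -> click.eventTimeMillis);

        // Events arriving after watermark + allowed lateness are diverted to a
        // side output instead of being silently dropped.
        final OutputTag<Click> lateClicks = new OutputTag<Click>("late-clicks") {};

        SingleOutputStreamOperator<Long> counts = env
                .fromElements(
                        new Click("u1", 0L),
                        new Click("u1", 30_000L),
                        new Click("u2", 45_000L))
                .assignTimestampsAndWatermarks(watermarks)
                .keyBy(click -> click.userId)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                .allowedLateness(Time.minutes(1)) // stragglers within 1 min re-fire the window with updated results
                .sideOutputLateData(lateClicks)
                .aggregate(new AggregateFunction<Click, Long, Long>() {
                    @Override public Long createAccumulator() { return 0L; }
                    @Override public Long add(Click c, Long acc) { return acc + 1; }
                    @Override public Long getResult(Long acc) { return acc; }
                    @Override public Long merge(Long a, Long b) { return a + b; }
                });

        counts.print();                           // per-user counts per 1-minute event-time window
        counts.getSideOutput(lateClicks).print(); // clicks too late even for the allowed lateness

        env.execute("event-time click counts");
    }
}
```

Bounded out-of-orderness is the common default strategy; for skewed or intermittently idle sources, Flink also offers `WatermarkStrategy.withIdleness` and fully custom watermark generators.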
Throughput and latency are explicitly traded through batching, compression, and buffering. A stateless transform might achieve 100,000 to 1,000,000 events per second on a cluster of tens of vCPUs with p99 operator latency of 1 to 10 milliseconds. Windowing operations like tumbling (fixed, non-overlapping intervals), sliding (overlapping intervals), or session windows (activity-based gaps) bound infinite streams into finite computations that can actually complete and emit results.
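The three window shapes map directly onto Flink's window assigners. A brief sketch, reusing the keyed Click stream from the event-time example above; the durations are illustrative.

```java
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowShapes {
    // Assumes the keyed stream of Click events from the sketch above.
    static void assignWindows(KeyedStream<EventTimeClickCounts.Click, String> clicks) {
        // Tumbling: fixed, non-overlapping 5-minute buckets.
        clicks.window(TumblingEventTimeWindows.of(Time.minutes(5)));

        // Sliding: 10-minute windows advancing every minute, so each event
        // belongs to ten overlapping windows.
        clicks.window(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(1)));

        // Session: one window per burst of activity, closed after a 30-minute gap.
        clicks.window(EventTimeSessionWindows.withGap(Time.minutes(30)));
    }
}
```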
💡 Key Takeaways
• Partitioning by key enables parallelism and co-locates state with computation; order is guaranteed only within a partition, not globally across partitions
• Event time decouples correctness from network delays and processing order; a click at 10:00 AM arriving at 10:05 AM is correctly attributed to the 10:00 window
• Watermarks signal progress through event time (for example, a watermark at 10:00 means no events before that time are expected); allowed lateness handles events beyond the watermark without data loss
• Throughput on a mid-sized cluster ranges from roughly 100k to 1M events per second for stateless transforms down to 10k to 200k events per second for stateful aggregations
• Windowing (tumbling, sliding, session) converts infinite streams into finite computations, with emission triggers based on event or processing time
📌 Examples
LinkedIn processes 7+ trillion messages per day, with peaks of 10+ million messages per second, using Kafka partitioned by user ID for feed ranking and anomaly detection.
Google Dataflow (the Beam model) combines large windows of minutes to hours with low-latency incremental updates for YouTube analytics, using watermarks and late-data triggers for correctness.