Exactly-Once at Production Scale
The Real World Challenge:
Implementing exactly-once semantics in production means handling massive throughput, keeping latency low, and staying reliable during failures, all at the same time. A streaming payments system at a fintech might process 20,000 transactions per second at peak, requiring median end-to-end latency under 200 milliseconds and 99th percentile (p99) latency below 1 second. Any duplicate transaction creates customer complaints and potential financial liability.
The entire pipeline must maintain exactly-once guarantees: events arrive in a durable log like Kafka, a stateful stream processor enriches and validates them, updates account balances in a state store, and publishes derived events to downstream services and data warehouses. Every component in this chain participates in the exactly-once protocol.
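The heart of that protocol is committing the consumed offset and the produced output as one atomic unit, so a crash between processing and committing cannot double-apply an event. Below is a minimal toy model of that read-process-write loop; the class and function names are illustrative, not a real Kafka client API.

```python
# Toy model of the exactly-once read-process-write protocol: the consumed
# offset and the produced outputs commit atomically, so a crash-and-retry
# between processing and committing cannot create duplicates.

class TransactionalSink:
    """Commits (offset, outputs) as one atomic unit."""
    def __init__(self):
        self.committed_offset = -1   # last input offset durably applied
        self.outputs = []            # durably committed derived events

    def commit(self, offset, batch):
        if offset <= self.committed_offset:
            return False             # already applied: replay is a no-op
        self.outputs.extend(batch)   # outputs and offset move together
        self.committed_offset = offset
        return True

def run_pipeline(log, sink, crash_at=None):
    """Read from the last committed offset, process, commit atomically."""
    for offset in range(sink.committed_offset + 1, len(log)):
        derived = [log[offset] * 2]  # stand-in for enrichment/validation
        if offset == crash_at:
            return                   # simulated crash before the commit
        sink.commit(offset, derived)

log = [10, 20, 30, 40]
sink = TransactionalSink()
run_pipeline(log, sink, crash_at=2)      # fails mid-stream
run_pipeline(log, sink)                  # restart: replay and finish
assert sink.outputs == [20, 40, 60, 80]  # each event applied exactly once
```

The restart re-reads from the last committed offset, and the atomic commit guarantees the replayed event lands exactly once in the output.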
Real System Architecture:
Yelp processes hundreds of millions of business interactions and over 100,000 photos daily through their streaming lakehouse on AWS. They use Apache Flink as the compute engine with Kafka as the message bus. Flink provides exactly-once semantics so that derived tables in the lakehouse remain consistent even when jobs restart or nodes fail.
Their architecture checkpoints state to S3 every few seconds. Each checkpoint covers operator state (aggregations, joins, windows) and source offsets. Outputs write to the lakehouse using transactional protocols. When a Flink task manager fails, recovery loads the most recent successful checkpoint and replays from those offsets. Because partial outputs after the checkpoint were never committed to the lakehouse, replaying produces identical results that now commit successfully.
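The recovery mechanic described above can be sketched in a few lines, assuming deterministic operators: a checkpoint captures operator state together with the source offset, and replaying from that offset reproduces exactly the state a failure-free run would have reached.

```python
# Sketch of checkpoint-based recovery: a checkpoint pairs operator state
# with the source offset it covers, and replay from that offset yields
# results identical to an uninterrupted run (operators are deterministic).

import copy

def process(state, event):
    state["count"] += 1
    state["sum"] += event
    return state["sum"]                      # derived output

events = list(range(1, 11))                  # the durable source log
state = {"count": 0, "sum": 0}
checkpoint = None

for offset, ev in enumerate(events):
    if offset == 5:
        # periodic checkpoint: operator state + source offset, together
        checkpoint = (copy.deepcopy(state), offset)
    process(state, ev)
    if offset == 7:
        break                                # simulated task-manager failure

# Recovery: load the last successful checkpoint, replay from its offset.
state, resume_offset = checkpoint
outputs = [process(state, ev) for ev in events[resume_offset:]]

assert state == {"count": 10, "sum": 55}     # same as a failure-free run
```

Events 5 through 7 were processed before the crash, but because their outputs were never committed, replaying them is safe rather than duplicative.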
The lakehouse storage provides time travel capability as additional safety. If something goes wrong despite exactly-once guarantees in the streaming layer, they can roll back to a prior consistent version of derived tables.
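Time travel works because lakehouse commits append immutable table versions rather than overwriting data; this is roughly how Iceberg and Delta snapshots behave conceptually. The sketch below is illustrative, not a real table-format API.

```python
# Minimal sketch of lakehouse "time travel": every commit appends an
# immutable snapshot, so a bad write is undone by pointing the table
# back at an earlier version id. Illustrative only, not a real API.

class VersionedTable:
    def __init__(self):
        self.versions = [{}]          # version 0 is the empty table
        self.current = 0

    def commit(self, updates):
        snap = dict(self.versions[self.current])
        snap.update(updates)
        self.versions.append(snap)    # older snapshots are never mutated
        self.current = len(self.versions) - 1
        return self.current           # version id for later rollback

    def rollback(self, version):
        self.current = version

    def read(self):
        return self.versions[self.current]

t = VersionedTable()
good = t.commit({"acct_1": 100})      # consistent derived state
t.commit({"acct_1": -999})            # bad write slips past the pipeline
t.rollback(good)                      # restore the prior consistent version
assert t.read() == {"acct_1": 100}
```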
Performance at Scale:
At companies processing 1 million events per second across a cluster of stream processors, maintaining exactly-once semantics while hitting aggressive latency targets requires careful tuning. Typical production deployments achieve p50 latencies under 50 milliseconds for aggregation pipelines and p99 under 500 milliseconds.
Checkpoint intervals balance recovery time against overhead. Shorter intervals (every 2 seconds) mean faster recovery but more frequent writes to durable storage. Longer intervals (every 30 seconds) reduce overhead but increase the amount of data to replay on failure. Most systems settle on 5 to 10 second checkpoints as a sweet spot.
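The tradeoff is simple arithmetic: worst-case replay volume is roughly the event rate times the checkpoint interval, while snapshot overhead scales with how often checkpoints fire. A quick back-of-envelope, using the 20,000 events/sec figure from earlier:

```python
# Checkpoint interval tradeoff: longer intervals mean fewer snapshot
# uploads per hour, but more events to replay after a failure
# (replay volume ~= event rate x interval).

def checkpoint_tradeoff(events_per_sec, interval_sec):
    replay_events = events_per_sec * interval_sec    # worst-case replay
    snapshots_per_hour = 3600 // interval_sec        # upload frequency
    return replay_events, snapshots_per_hour

assert checkpoint_tradeoff(20_000, 2) == (40_000, 1800)    # fast recovery
assert checkpoint_tradeoff(20_000, 30) == (600_000, 120)   # low overhead
assert checkpoint_tradeoff(20_000, 10) == (200_000, 360)   # middle ground
```

At a 10-second interval, a failure costs at most 200,000 replayed events, about 10 seconds of catch-up at full throughput, which is why 5 to 10 seconds is a common sweet spot.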
State backend choice matters enormously. RocksDB embedded in each processing node handles gigabytes of local state with millisecond access times. Checkpointing this state to S3 uses incremental snapshots, uploading only changed data to minimize I/O.
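The incremental idea is to track which keys changed since the last snapshot and upload only those, analogous to RocksDB-backed state uploading only new SST files. A minimal sketch, with a plain dict standing in for S3:

```python
# Sketch of incremental checkpointing: only entries modified since the
# last snapshot are uploaded. The "remote" dict stands in for S3; the
# class is illustrative, not a real state-backend API.

class IncrementalState:
    def __init__(self):
        self.state = {}
        self.dirty = set()            # keys changed since last checkpoint

    def put(self, key, value):
        self.state[key] = value
        self.dirty.add(key)

    def checkpoint(self, remote):
        delta = {k: self.state[k] for k in self.dirty}
        remote.update(delta)          # upload only the changed entries
        self.dirty.clear()
        return len(delta)             # proxy for upload size

s3 = {}
st = IncrementalState()
st.put("acct_1", 100)
st.put("acct_2", 250)
assert st.checkpoint(s3) == 2         # first snapshot uploads everything
st.put("acct_1", 175)                 # only one key touched afterwards
assert st.checkpoint(s3) == 1         # so only one entry is uploaded
assert s3 == {"acct_1": 175, "acct_2": 250}
```

When most of a multi-gigabyte state is cold between checkpoints, this keeps per-checkpoint I/O proportional to the write rate rather than the total state size.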
Production Scale Metrics:
- Peak throughput: 20K/sec
- Median latency: 200ms
- P99 latency: 1 sec
"The payoff is simpler reasoning: the streaming path becomes the source of truth. This enables a Kappa architecture, which replaces separate batch and speed layers with a single exactly-once streaming pipeline."
💡 Key Takeaways
✓ Production systems handle 20,000 to 1 million events per second with exactly-once semantics while maintaining p99 latency under 1 second
✓ Checkpoint intervals typically range from 5 to 10 seconds, balancing recovery speed (less replay) against I/O overhead (frequent snapshots)
✓ State backends like RocksDB store gigabytes of operator state locally with millisecond access, checkpointing incrementally to S3 or distributed filesystems
✓ Real deployments at Yelp process hundreds of millions of interactions daily, using Flink with Kafka to maintain lakehouse consistency across failures
✓ Exactly-once streaming enables Kappa architecture where the streaming pipeline replaces both batch and speed layers as the single source of truth
📌 Examples
1. A fintech processes 20,000 payment transactions per second with median latency of 200ms. Exactly-once guarantees prevent duplicate charges even during node failures or network partitions.
2. Yelp checkpoints Flink state to S3 every 5 seconds while ingesting 100,000 photos per day. On task manager failure, recovery replays only those 5 seconds of input, roughly 500,000 events at their scale.
3. A fraud detection pipeline processes 1 million events per second with p50 latency of 50ms. Each checkpoint writes incremental RocksDB snapshots, uploading only changed state to minimize overhead.