Checkpointing and Fault Tolerance: Consistent Snapshots at Scale

The Consistency Challenge: When processing 1 million events per second across 50 machines, each maintaining local state, how do you create a consistent backup? You can't simply pause all processing, snapshot state, then resume: that would violate latency Service Level Agreements (SLAs). You also can't snapshot each machine independently at arbitrary times, because machines process events at different rates, leading to an inconsistent global state.

Flink solves this with checkpoint barriers, a technique based on the Chandy-Lamport distributed snapshot algorithm. The JobManager periodically injects a barrier marker into the event stream at the sources. This barrier flows through the dataflow graph along with regular events. When an operator receives the barrier on all of its input channels, it snapshots its state and forwards the barrier downstream.

How It Works in Practice: Imagine a pipeline: Kafka Source → Filter → Aggregate → Sink. At time T=0, the JobManager tells the sources to inject barrier 42. The source immediately snapshots its Kafka offsets (operator state) and emits the barrier. The filter processes events until it sees barrier 42, snapshots any state, and forwards the barrier. The aggregate does the same, snapshotting its per-key aggregates. When the sink receives barrier 42 and completes its snapshot, the JobManager marks checkpoint 42 as complete.
1. Barrier injection: the JobManager triggers a barrier at the sources every 30 to 300 seconds.
2. Barrier alignment: each operator waits for the barrier on all inputs before snapshotting.
3. Asynchronous write: the state snapshot is written to S3/HDFS without blocking processing.
4. Completion: once all operators finish, the checkpoint is marked complete and safe for recovery.
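The lifecycle above is enabled with a few lines in the DataStream API. A minimal sketch, assuming Flink 1.13+ Java APIs; the class name, the S3 bucket path, and the stand-in source are hypothetical placeholders for the Kafka Source → Filter → Aggregate → Sink pipeline described earlier:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Step 1: barrier injection -- trigger a checkpoint every 60 s.
        // EXACTLY_ONCE selects aligned barriers (step 2); steps 3-4 run inside the runtime.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Step 3: snapshots are uploaded asynchronously to durable storage
        // (hypothetical bucket name; S3 access requires the filesystem plugin).
        CheckpointConfig config = env.getCheckpointConfig();
        config.setCheckpointStorage("s3://my-bucket/flink-checkpoints");

        // Stand-in for the Kafka Source -> Filter -> Aggregate -> Sink pipeline:
        // barriers flow through these operators just like regular events.
        env.fromSequence(0, Long.MAX_VALUE)
           .filter(n -> n % 2 == 0)
           .print();

        env.execute("checkpointed-pipeline");
    }
}
```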
The key insight: because barriers flow in order with events, and operators snapshot when they see a barrier, the resulting snapshots represent a consistent cut through the distributed computation. If you recover from checkpoint 42, every operator restarts with state that reflects all events before barrier 42 and none after, even though the actual snapshot times differed by milliseconds across machines.

Incremental and Asynchronous Optimizations: Full checkpoints of 500 gigabytes of state every 60 seconds would be prohibitive. Modern Flink uses incremental checkpoints with RocksDB, writing only the state that changed since the last checkpoint. This reduces checkpoint size from hundreds of gigabytes to tens of gigabytes, keeping durations under 10 to 30 seconds. Checkpoints are also asynchronous: when an operator receives a barrier, it takes a local snapshot (often a copy-on-write or filesystem snapshot) and continues processing. The actual upload to S3 happens in the background, without blocking the operator. This keeps checkpoint overhead below 5 to 10% of processing time.
Checkpoint Performance: a full checkpoint writes roughly 500 GB, while an incremental checkpoint writes roughly 30 GB.
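Incremental checkpointing is a state-backend choice. A hedged sketch of switching it on, assuming the flink-statebackend-rocksdb dependency and Flink 1.13+ (class and method names below are the standard API; the helper class itself is illustrative):

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalCheckpointSetup {
    public static void configure(StreamExecutionEnvironment env) {
        // RocksDB keeps keyed state on local disk. Passing `true` enables incremental
        // checkpoints: only files added since the last checkpoint are uploaded, and
        // the upload runs in the background while the operator keeps processing.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
    }
}
```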
Recovery Process: When a TaskManager fails, the JobManager detects the failure (usually within 10 to 30 seconds via heartbeats), identifies the latest completed checkpoint, and restarts all affected tasks on healthy machines. State is restored from the checkpoint, and the Kafka sources rewind to the offsets stored in that checkpoint. Processing resumes exactly where it left off, with no data loss or duplication.
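Recovery itself is automatic once checkpointing is enabled, but the restart behavior can be made explicit. A sketch under the same Flink 1.x API assumptions; the attempt count and delay are illustrative values, not recommendations:

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RecoverySetup {
    public static void configure(StreamExecutionEnvironment env) {
        // On TaskManager failure, the JobManager restarts the affected tasks, restores
        // their state from the latest completed checkpoint, and rewinds sources to the
        // checkpointed Kafka offsets. This strategy allows 3 attempts, 10 seconds apart.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));
    }
}
```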
💡 Key Takeaways
Checkpoint barriers flow through the dataflow with events, ensuring operators snapshot at consistent logical points without pausing processing globally
Incremental checkpoints with RocksDB reduce snapshot size from 500 GB to 30 GB by writing only changed data, keeping checkpoint durations under 30 seconds at high throughput
Asynchronous snapshots let operators continue processing while uploading state to S3/HDFS in the background, limiting overhead to 5 to 10% of processing capacity
Recovery restores state from the latest complete checkpoint and rewinds sources to recorded offsets, resuming processing with exactly once guarantees
Checkpoint interval (30 to 300 seconds) is a tradeoff between recovery time (longer interval means more replay) and overhead (shorter interval means more I/O and coordination cost), as the tuning sketch below shows
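One way to express that tradeoff in code, under the same Flink 1.x API assumptions as the earlier sketches; the numbers are illustrative starting points, not recommendations:

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointIntervalTuning {
    public static void configure(StreamExecutionEnvironment env) {
        env.enableCheckpointing(60_000);                  // trigger a checkpoint every 60 s
        CheckpointConfig config = env.getCheckpointConfig();
        config.setMinPauseBetweenCheckpoints(30_000);     // leave processing headroom between checkpoints
        config.setCheckpointTimeout(10 * 60_000);         // abandon checkpoints slower than 10 min
        config.setTolerableCheckpointFailureNumber(3);    // don't fail the job on one slow upload
    }
}
```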
📌 Examples
1. A job with 1 TB of state across 50 TaskManagers using RocksDB incremental checkpoints every 60 seconds, with each checkpoint writing 20 to 40 GB of changes to S3
2. A Kafka source operator snapshotting committed offsets for 100 partitions as part of checkpoint 42, allowing a precise rewind on recovery
3. A recovery scenario where a TaskManager crashes after processing 10 million events past the last checkpoint; upon restart, those 10 million events are replayed from Kafka to restore exactly-once semantics