Stream Processing Architectures • Apache Flink Architecture & State Management
Failure Modes and Edge Cases: What Breaks at Scale
Checkpoint Instability Under Load:
The most common failure mode in production Flink is checkpoint instability. When processing 1 million events per second with 1 terabyte of state, a checkpoint every 30 seconds must snapshot and upload tens of gigabytes to S3 or HDFS without disrupting processing.
If the checkpoint takes too long, the next checkpoint is due before the current one completes. This creates a vicious cycle: operators spend more time snapshotting and less time processing, causing event queues to fill, which increases state size, which slows the next checkpoint further. You end up with checkpoint durations growing from 30 seconds to 5 minutes until they eventually time out entirely.
Mitigation requires either increasing the checkpoint interval (accepting longer recovery times), optimizing state size (reducing what you store per key), or scaling the cluster (more parallelism means less state per TaskManager). In extreme cases, moving RocksDB onto faster local disks (or switching state backends entirely) or tuning RocksDB compaction helps.
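The relevant knobs live on the job's CheckpointConfig. Below is a minimal sketch of how they fit together using the standard DataStream API; the concrete values are illustrative, not recommendations.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuning {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 30 seconds with exactly-once semantics.
        env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);

        // Guarantee a pause after each checkpoint so operators spend time
        // processing events rather than snapshotting back to back.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(15_000);

        // Abort any checkpoint that runs longer than 5 minutes instead of
        // letting it block the checkpoints queued behind it.
        env.getCheckpointConfig().setCheckpointTimeout(300_000);

        // Never run two checkpoints concurrently.
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
    }
}
```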
Backpressure Cascades:
Backpressure occurs when a downstream operator (often a sink writing to a database or search index) slows down. Flink propagates this back upstream to avoid overwhelming the slow component. Buffers between operators fill, and eventually sources stop reading from Kafka.
This is correct behavior, but it has consequences. If your sink can only write 50 thousand records per second but your source produces 100 thousand, backpressure is constant. End-to-end latency grows from milliseconds to seconds or minutes as events wait in buffers. Checkpoint durations increase because buffered events count as in-flight state.
The root cause is usually an under-provisioned sink or a downstream system that cannot scale. The Flink-specific failure mode is that sustained backpressure combined with long checkpoint durations can prevent the system from ever completing a checkpoint, leaving no recent checkpoint to recover from.
Checkpoint Death Spiral: NORMAL (30 sec) → OVERLOAD (90 sec) → TIMEOUT (300+ sec)
❗ Remember: Backpressure is a symptom, not a disease. It reveals that your pipeline capacity does not match your input rate. Tuning Flink configuration will not fix this; you must scale or optimize the bottleneck component.
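One Flink-side mitigation for the "never completes a checkpoint" failure mode is unaligned checkpoints (available since Flink 1.11): barriers overtake buffered records, and those records are persisted as part of the checkpoint instead of delaying it. This does not remove the bottleneck; it only keeps checkpoints completing while you fix it. A minimal sketch, with illustrative values:

```java
import java.time.Duration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnalignedCheckpoints {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(30_000);

        // Let checkpoint barriers overtake in-flight buffers; buffered records
        // are stored with the checkpoint instead of blocking it.
        env.getCheckpointConfig().enableUnalignedCheckpoints();

        // Start each checkpoint aligned and switch to unaligned only if
        // alignment takes longer than 30 seconds, i.e. under real backpressure.
        env.getCheckpointConfig().setAlignedCheckpointTimeout(Duration.ofSeconds(30));
    }
}
```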
Skewed Key Distribution:
Real-world data is rarely uniform. In clickstream analytics, 1 percent of users might generate 80 percent of events (power users, bots, automated scripts). If you partition by user_id, a small number of TaskManagers handle the hot keys while others sit idle.
These hot TaskManagers experience higher CPU usage, memory pressure, and longer state access times. Worse, they can fall behind, creating backpressure that affects the entire job. Checkpoints are only complete when ALL operators finish, so the slowest TaskManager (the one with hot keys) determines overall checkpoint duration.
Mitigation strategies include salting hot keys (adding a random suffix to distribute them across multiple partitions), pre-aggregation (reducing events per key before the skewed operation), or using custom partitioning logic that detects and redistributes hot keys dynamically. None of these are trivial, and they all require application-level changes.
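A minimal sketch of the salting approach, assuming events arrive as (userId, count) Tuple2 pairs with timestamps and watermarks already assigned upstream; the bucket count, window size, and field layout are illustrative:

```java
import java.util.concurrent.ThreadLocalRandom;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SaltedAggregation {
    private static final int SALT_BUCKETS = 16; // illustrative; size to your parallelism

    /** Two-stage count per user: partial counts keyed by (userId, salt), then a merge by userId. */
    public static DataStream<Tuple2<String, Long>> countPerUser(DataStream<Tuple2<String, Long>> events) {
        return events
            // Stage 1: attach a random salt so one hot userId spreads over up to
            // SALT_BUCKETS parallel subtasks instead of landing on a single one.
            .map(e -> Tuple3.of(e.f0, ThreadLocalRandom.current().nextInt(SALT_BUCKETS), e.f1))
            .returns(Types.TUPLE(Types.STRING, Types.INT, Types.LONG))
            .keyBy(t -> t.f0 + "#" + t.f1)
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            .reduce((a, b) -> Tuple3.of(a.f0, a.f1, a.f2 + b.f2))
            // Stage 2: merge the partial counts back onto the original userId. Each
            // key now sees at most SALT_BUCKETS records per window, not the raw stream.
            .map(t -> Tuple2.of(t.f0, t.f2))
            .returns(Types.TUPLE(Types.STRING, Types.LONG))
            .keyBy(t -> t.f0)
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));
    }
}
```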
Watermarks and Late Data:
Flink uses watermarks to track progress in event time. A watermark with timestamp T asserts that no more events with timestamps less than T are expected to arrive (late stragglers can still show up). When a watermark passes the end of a window, Flink triggers the window computation and potentially discards state for that window.
If watermarks are too aggressive (advancing too quickly), late events are dropped or counted separately, leading to incorrect results. If too conservative (advancing slowly), windows stay open longer, accumulating more state and delaying results.
In multi-region deployments or mobile applications, events can arrive hours late due to network partitions or offline devices. Configuring allowed lateness (how long to keep window state after the watermark passes the end of the window) and late-event side outputs (to track what was discarded) becomes critical. Getting this wrong means either memory exhaustion from keeping too much state or data loss from closing windows too early.
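A minimal sketch of these settings on a stream of (userId, eventTimeMillis) pairs; the 30-second out-of-orderness bound, 1-hour allowed lateness, window size, and placeholder aggregation are illustrative, not recommendations:

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class LatenessHandling {
    // Side output for events that arrive after watermark + allowed lateness.
    static final OutputTag<Tuple2<String, Long>> LATE = new OutputTag<Tuple2<String, Long>>("late-events") {};

    public static DataStream<Tuple2<String, Long>> lastEventPerUser(DataStream<Tuple2<String, Long>> events) {
        SingleOutputStreamOperator<Tuple2<String, Long>> result = events
            // Watermark trails the maximum observed timestamp by 30 seconds.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                    .withTimestampAssigner((e, ts) -> e.f1))
            .keyBy(e -> e.f0)
            .window(TumblingEventTimeWindows.of(Time.minutes(5)))
            // Keep window state around for one more hour so stragglers can still
            // update the result (each late arrival re-fires the window).
            .allowedLateness(Time.hours(1))
            // Anything later than that is routed here instead of silently dropped.
            .sideOutputLateData(LATE)
            // Placeholder aggregation: keep the latest timestamp seen per user.
            .reduce((a, b) -> Tuple2.of(a.f0, Math.max(a.f1, b.f1)));

        // Route the discarded events somewhere observable (here just printed).
        DataStream<Tuple2<String, Long>> dropped = result.getSideOutput(LATE);
        dropped.print();

        return result;
    }
}
```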
Sink Exactly-Once Failures:
Flink guarantees exactly-once processing internally, but sinks must cooperate. If a sink does not support idempotent writes or transactional commits aligned with checkpoints, you can get duplicates or lost data on recovery.
For example, writing to a database without transactions: Flink commits events to the database, then checkpoints. If the job fails after the database write but before checkpoint completion, recovery restarts from the old checkpoint and writes the same events again, creating duplicates.
The solution is two-phase commit: sinks prepare writes, wait for checkpoint completion, then commit. On failure, uncommitted transactions are rolled back. But not all external systems support this, forcing you to choose between at-least-once (possible duplicates) or at-most-once (possible data loss) semantics at the sink boundary.
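Where the external system does support it, Flink ships sinks that implement this two-phase commit pattern; the Kafka connector is the common example. A minimal sketch assuming flink-connector-kafka is on the classpath; the brokers, topic, and transactional ID prefix are placeholders:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TransactionalSinkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Exactly-once sinks only make sense with checkpointing enabled:
        // the Kafka transaction is committed when the checkpoint completes.
        env.enableCheckpointing(30_000);

        DataStream<String> results = env.fromElements("a", "b", "c"); // stand-in for the real pipeline

        KafkaSink<String> sink = KafkaSink.<String>builder()
            .setBootstrapServers("broker-1:9092")                  // placeholder
            .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                .setTopic("enriched-events")                       // placeholder
                .setValueSerializationSchema(new SimpleStringSchema())
                .build())
            // Writes go into a Kafka transaction that is pre-committed at
            // checkpoint time and committed once the checkpoint is confirmed.
            .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
            .setTransactionalIdPrefix("flink-enrichment")          // must be unique per job
            .build();

        results.sinkTo(sink);
        env.execute("exactly-once-sink");
    }
}
```

Note that downstream Kafka consumers need isolation.level=read_committed, otherwise they will also read records from transactions that are later rolled back.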
💡 Key Takeaways
✓ Checkpoint instability creates a death spiral where long checkpoint durations cause processing lag, increasing state size and slowing future checkpoints further until timeouts occur
✓ Backpressure from slow sinks cascades upstream, filling buffers and increasing latency from milliseconds to minutes; fixing it requires scaling or optimizing the bottleneck, not tuning Flink
✓ Skewed key distributions leave a few hot TaskManagers handling 80 percent of traffic while others sit idle, creating bottlenecks that limit overall job throughput and checkpoint completion
✓ Watermark tuning is critical: too aggressive drops late events (data loss), too conservative keeps windows open too long (memory exhaustion), especially in multi-region or mobile scenarios
✓ Sink exactly-once requires external system support for transactional commits aligned with checkpoints; without this, you must accept duplicates or data loss at the sink boundary
📌 Examples
1. Production job processing 1 million events per second sees checkpoint duration grow from 30 seconds to 5 minutes under a traffic spike, eventually timing out and preventing recovery
2. E-commerce analytics job partitioned by user_id where the top 1 percent of users (bots, power users) generate 80 percent of events, causing 2 of 50 TaskManagers to bottleneck the entire pipeline
3. Mobile app sending events hours late after an offline period; watermarks configured with 1 hour of allowed lateness keep 3 hours of window state in memory, consuming 200 GB more than expected