Stream Processing Failure Modes: Backpressure, Hot Keys, and Checkpoint Stalls
Backpressure occurs when a slow sink or downstream operator cannot keep up with the incoming data rate. Buffers fill, upstream operators stall, and end-to-end latency balloons from milliseconds to seconds or minutes. Symptoms include increasing consumer lag, growing in-memory buffers, and missed Service Level Agreements (SLAs). Hot keys amplify this: if 10 percent of traffic hashes to one partition, that task saturates its CPU while the others sit idle. Mitigations include key salting (appending a small deterministic suffix to spread load), rate limiting at ingestion, and autoscaling driven by consumer lag and operator busy-time metrics rather than CPU alone.
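Below is a minimal sketch of the salting idea in plain Java (not tied to Flink or Kafka Streams): the salt is derived deterministically from a per-event field so replays land on the same sub-key, and a second stage merges the partial aggregates back onto the original key. The names saltedKey and SALT_CARDINALITY, and the use of an event ID as the salt source, are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Two-stage aggregation with key salting: stage 1 fans a hot key out across
// SALT_CARDINALITY sub-keys, stage 2 re-aggregates the partial results.
public class KeySalting {
    static final int SALT_CARDINALITY = 4;

    // Stage 1: derive a salted key so events for "user_123" spread across
    // user_123_0 .. user_123_3 instead of all hashing to one partition.
    static String saltedKey(String key, String eventId) {
        int salt = Math.floorMod(eventId.hashCode(), SALT_CARDINALITY);
        return key + "_" + salt;
    }

    // Stage 2: strip the salt and merge partial counts back onto the original key.
    static Map<String, Long> reAggregate(Map<String, Long> partialCounts) {
        Map<String, Long> merged = new HashMap<>();
        partialCounts.forEach((salted, count) -> {
            String original = salted.substring(0, salted.lastIndexOf('_'));
            merged.merge(original, count, Long::sum);
        });
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> partials = new HashMap<>();
        partials.merge(saltedKey("user_123", "evt-1"), 10L, Long::sum);
        partials.merge(saltedKey("user_123", "evt-2"), 7L, Long::sum);
        System.out.println(reAggregate(partials)); // {user_123=17}
    }
}
```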
Checkpoint stalls happen when state size or storage bandwidth pushes checkpoint duration past its target. Barrier alignment in Flink halts fast sources until slow operators complete their snapshots, compounding latency. A job with 200 GB of state per task checkpointing to network storage might take 30 to 60 seconds at p99, during which sources pause. On failure, rebuilding state from changelogs can take minutes to hours depending on state size and replay rate; during recovery, consumer lag accrues and latency SLAs are violated. Mitigations include incremental checkpoints, faster storage such as Solid State Drives (SSDs) or high-bandwidth object stores, state compaction via Time to Live (TTL), and standby replicas that maintain warm local state for sub-second failover.
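A hedged sketch of how some of these mitigations map onto Flink's checkpointing configuration (API names as of roughly Flink 1.13+ and version-dependent; the intervals, timeouts, and values are illustrative, not recommendations from this text):

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuning {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB state backend with incremental checkpoints: only changed SST files
        // are uploaded, so checkpoint size tracks the delta rather than the full state.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        env.enableCheckpointing(60_000);                                  // checkpoint every 60 s
        env.getCheckpointConfig().setCheckpointTimeout(10 * 60_000);      // fail checkpoints stuck past 10 min
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);  // let the job drain between snapshots
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);

        // Unaligned checkpoints let barriers overtake buffered records, reducing the
        // barrier-alignment stall under backpressure at the cost of larger checkpoints.
        env.getCheckpointConfig().enableUnalignedCheckpoints();

        // ... build the job graph, then env.execute("job");
    }
}
```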
Late data and out-of-order arrivals break correctness when events arrive after watermarks have closed their windows. Clock skew at producers, network partitions, or mobile offline queues cause arrivals minutes to hours late. Without late-data handling, you silently drop events or mis-aggregate. Allowed lateness extends windows beyond the watermark, side outputs route late events to correction streams, and retractions or upserts fix previously emitted results. Production systems calibrate allowed lateness to empirical skew (commonly seconds to a few minutes) and design sinks that support updates or idempotent writes keyed by deterministic event IDs.
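A hedged sketch of late-data handling with Flink's windowing API (exact names vary by version; the Event and Count types, the 30-second out-of-orderness bound, and the 2-minute allowed lateness are illustrative assumptions in the spirit of the paragraph above):

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class LateDataHandling {

    public static class Event {
        public String key;
        public long eventTimeMillis;
    }

    public static class Count {
        public String key;
        public long count;
        public Count(String key, long count) { this.key = key; this.count = count; }
    }

    // Side-output tag for events arriving after the allowed lateness has expired.
    static final OutputTag<Event> LATE = new OutputTag<Event>("late-events") {};

    static SingleOutputStreamOperator<Count> countPerWindow(DataStream<Event> events) {
        return events
            // Watermarks tolerate ~30 s of out-of-orderness, calibrated to observed skew.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                    .withTimestampAssigner((e, ts) -> e.eventTimeMillis))
            .keyBy(e -> e.key)
            .window(TumblingEventTimeWindows.of(Time.minutes(5)))
            .allowedLateness(Time.minutes(2))   // window state stays open 2 min past the watermark
            .sideOutputLateData(LATE)           // anything later is routed to a correction stream
            .process(new ProcessWindowFunction<Event, Count, String, TimeWindow>() {
                @Override
                public void process(String key, Context ctx, Iterable<Event> in, Collector<Count> out) {
                    long n = 0;
                    for (Event ignored : in) n++;
                    out.collect(new Count(key, n)); // re-fired with updated counts for late arrivals
                }
            });
    }
    // Downstream: countPerWindow(events).getSideOutput(LATE) can feed a correction path
    // that upserts into a sink keyed by deterministic event IDs.
}
```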
💡 Key Takeaways
• Backpressure from slow sinks stalls upstream operators and increases end-to-end latency from milliseconds to seconds; hot keys (10 percent of keys causing 50 percent of load) saturate some tasks while others idle
• Checkpoint stalls occur when 200 GB of state takes 30 to 60 seconds to snapshot; barrier alignment pauses sources, and recovery from failure can take minutes to hours while consumer lag accrues
• Late data beyond watermarks causes silent drops or incorrect aggregates; clock skew and mobile offline queues produce arrivals minutes to hours late, requiring allowed lateness windows or side outputs
• Key salting spreads hot keys by appending a deterministic suffix (for example, user_123 becomes user_123_0, user_123_1) and re-aggregating downstream; it reduces single-task saturation by a factor of the salt cardinality
• Incremental checkpoints and standby replicas reduce recovery time from minutes to seconds; faster Solid State Drive (SSD) storage cuts checkpoint duration from 60 seconds to under 10 seconds at the same state size
📌 Examples
A fraud detection pipeline with a hot merchant ID consuming 40 percent of traffic salts the key with a per-event suffix (a hash of another event field mod 4), distributing load across 4 tasks and reducing p99 latency from 800 ms to 120 ms
An analytics job with 150 GB of local state switches from 60-second checkpoints on a Hard Disk Drive (HDD) to incremental snapshots on SSD, cutting checkpoint time to 8 seconds and recovery time from 12 minutes to 90 seconds