Failure Modes and Edge Cases in Production
Out of Order and Late Arriving Events:
Network delays, retries, and clock skew mean events rarely arrive in perfect order. An event timestamped 10:00:10 might arrive at 10:00:25, well after events timestamped 10:00:20. If your system has already closed the 10:00 to 10:05 window based on a watermark, what happens to this late event?
The system has three options. First, drop the event entirely. Second, send it to a separate late data output for offline reconciliation. Third, allow a lateness buffer, keeping windows open past their logical end time. A 10 minute lateness buffer means the 10:00 to 10:05 window does not actually close until the watermark reaches 10:15.
The trade-off is concrete. Allowing 10 minutes of lateness increases state size by 2 to 3 times because you are keeping multiple windows open simultaneously. It also delays results: you cannot emit final aggregates until 10 minutes after the window ends. In a system processing millions of events, this means holding gigabytes of extra state. But aggressive watermarks (1 minute tolerance) might drop 5 to 10 percent of events in environments with high variance in network latency.
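The sketch below shows one way a lateness buffer and a late data side output could fit together in a simple tumbling window counter. It is a minimal illustration, not any specific engine's API; the names WindowedCounter, WINDOW_SIZE, and ALLOWED_LATENESS are assumptions for this example.

```python
# Minimal sketch of a tumbling window counter with an allowed lateness buffer
# and a late data side output. Names are illustrative, not from any engine.
from collections import defaultdict

WINDOW_SIZE = 300        # 5 minute windows, in seconds
ALLOWED_LATENESS = 600   # keep windows open 10 minutes past their logical end

class WindowedCounter:
    def __init__(self):
        self.windows = defaultdict(int)   # window_start -> event count
        self.watermark = 0                # simplified: max event time seen so far
        self.late_output = []             # events too late even for the buffer

    def _window_start(self, event_time):
        return event_time - (event_time % WINDOW_SIZE)

    def process(self, event_time):
        start = self._window_start(event_time)
        if self.watermark >= start + WINDOW_SIZE + ALLOWED_LATENESS:
            # Window already finalized: route to the late side output
            # (or drop, if that is the chosen policy).
            self.late_output.append(event_time)
            return
        self.windows[start] += 1
        self.watermark = max(self.watermark, event_time)

    def finalize_closed_windows(self):
        # Emit and evict windows whose end plus allowed lateness is behind the watermark.
        closed = [s for s in self.windows
                  if s + WINDOW_SIZE + ALLOWED_LATENESS <= self.watermark]
        return {s: self.windows.pop(s) for s in closed}
```

Every extra second of ALLOWED_LATENESS keeps more entries alive in self.windows, which is exactly where the 2 to 3 times state growth comes from.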
Key Skew and Hot Partitions:
Real data is never uniformly distributed. In a payment system, 1 percent of merchants might account for 80 percent of transactions. In a social platform, 1 percent of users might generate 90 percent of events. When you partition by merchant ID or user ID, these hot keys land on a single task, creating a bottleneck.
The symptoms are dramatic. The hot partition's CPU spikes to 100 percent while other partitions idle at 20 percent. State for the hot partition grows to 50 gigabytes while others hold 2 gigabytes. Checkpoints on the hot partition take 10 minutes while others finish in 30 seconds. The entire pipeline slows down to the speed of the slowest partition.
Mitigation techniques include key salting (splitting hot keys across multiple partitions with a random suffix), hierarchical aggregation (preaggregate at a finer level then combine), or custom partitioning logic that routes hot keys differently. Each adds complexity. Interviewers often probe: "What if one merchant has 10 times more transactions than the next highest? How would you handle that?"
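As a concrete illustration of key salting, here is a minimal two stage sketch: stage one spreads a known hot key across several salted partitions, and stage two strips the salt and combines the partial counts. HOT_KEYS, SALT_BUCKETS, and the helper names are assumptions for this example, not part of any particular framework.

```python
# Minimal sketch of key salting with two stage aggregation.
import random

SALT_BUCKETS = 8                     # how many ways to split each hot key
HOT_KEYS = {"merchant_42"}           # assumed known from skew metrics

def salted_key(key: str) -> str:
    # Stage 1 partition key: hot keys get a random suffix so their traffic
    # spreads across SALT_BUCKETS partitions; normal keys pass through.
    if key in HOT_KEYS:
        return f"{key}#{random.randrange(SALT_BUCKETS)}"
    return key

def unsalted_key(key: str) -> str:
    # Stage 2 partition key: strip the salt so partial aggregates for the
    # same merchant land on one task for the final rollup.
    return key.split("#", 1)[0]

def combine(partials: dict) -> dict:
    # Final rollup: merge per salt partial counts into one count per real key.
    totals = {}
    for key, count in partials.items():
        real = unsalted_key(key)
        totals[real] = totals.get(real, 0) + count
    return totals
```

The price is a second aggregation stage and the need to know, or detect at runtime, which keys are hot.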
Exactly Once Semantics Edge Cases:
Many stateful engines provide exactly once guarantees within the stream pipeline: each event updates state exactly once, even if tasks fail and restart. But this guarantee often breaks at the boundaries. When writing to an external system (database, search index, API), you typically get at least once delivery.
The edge case is during recovery. Suppose a task processes an event, updates its state, writes to the database, and then crashes before the next checkpoint completes. On restart, the system restores state from the last successful checkpoint and replays the event. The state stays consistent because it is rolled back with the checkpoint, but the database write happens twice unless you implement deduplication.
Common solutions include idempotent writes (using a unique event ID as the database key so duplicate writes are no op), transactional writes (where the external system supports two phase commit with the stream engine), or accepting eventual consistency (allowing brief periods where counts are slightly off).
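A minimal sketch of the idempotent write option follows, assuming a Postgres style table whose primary key is the unique event ID and a psycopg2 style cursor; the table and column names are made up for illustration.

```python
# Idempotent sink sketch: the event ID is the primary key, so replayed writes
# become no ops. Table and column names are hypothetical.
IDEMPOTENT_INSERT = """
    INSERT INTO merchant_aggregates (event_id, merchant_id, amount)
    VALUES (%s, %s, %s)
    ON CONFLICT (event_id) DO NOTHING;
"""

def write_event(cursor, event):
    # After a crash and replay, the same event_id hits the primary key and the
    # duplicate insert is silently skipped, so the restart cannot double count.
    cursor.execute(IDEMPOTENT_INSERT,
                   (event["event_id"], event["merchant_id"], event["amount"]))
```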
Impact of Key Skew: checkpoint on the hot partition takes 10 minutes, versus 30 seconds on a normal partition.
❗ Remember: Exactly once inside the stream engine does not guarantee exactly once to external systems. If you write aggregates to a database without an idempotent key, a single restart can double count results.
State Growth and Resource Exhaustion:
The most common production failure is unbounded state growth. If you keep per user state forever without Time To Live (TTL) policies, a system that worked at 10 million users fails at 100 million. Symptoms include increasing checkpoint times (from 2 minutes to 20 minutes), out of memory errors, disk full alerts, and eventually cascading failures where tasks restart but never complete recovery.
The fix requires setting TTLs on state entries, for example removing per user data that has not been updated in 30 days, or using compaction strategies that merge old state into more compact representations. In an interview, showing awareness of "what happens when state grows 10x" demonstrates production experience.
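As a rough illustration of the TTL approach, here is a small sketch of a per user state map with a periodic expiry sweep. UserState and TTL_SECONDS are illustrative names; production engines usually expose state TTL as a built in configuration rather than hand rolled cleanup code.

```python
# Per user state with a TTL sweep, matching the 30 day rule described above.
import time

TTL_SECONDS = 30 * 24 * 3600   # evict entries not updated in 30 days

class UserState:
    def __init__(self):
        self.state = {}            # user_id -> (value, last_updated)

    def update(self, user_id, value):
        self.state[user_id] = (value, time.time())

    def sweep(self):
        # Periodic cleanup: drop entries whose last update is older than the TTL.
        cutoff = time.time() - TTL_SECONDS
        expired = [u for u, (_, ts) in self.state.items() if ts < cutoff]
        for u in expired:
            del self.state[u]
        return len(expired)
```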
💡 Key Takeaways
✓ Late events are common due to network delays and retries; allowing a 10 minute lateness buffer increases state size by 2 to 3 times but catches more events, while a 1 minute tolerance may drop 5 to 10 percent of late data
✓ Key skew (where 1 percent of keys generate 80 to 90 percent of traffic) creates hot partitions with 10 minute checkpoint times versus 30 seconds for normal partitions, requiring techniques like key salting or hierarchical aggregation
✓ Exactly once semantics work within the stream engine but typically become at least once when writing to external systems; without idempotent writes using unique event IDs, restarts can cause double counting
✓ Unbounded state growth (keeping per user data forever) causes checkpoint times to grow from 2 minutes to 20 minutes and eventually leads to out of memory and disk full failures as the user base grows 10x
✓ Recovery during partial checkpoints can leave inconsistent state if the system does not properly roll back or verify snapshot completeness, potentially causing correctness issues after restart
📌 Examples
1. Payment system with 1 percent of merchants generating 80 percent of transactions sees hot partition CPU at 100 percent while others idle; solution uses hierarchical aggregation to preprocess before final rollup
2. Social platform allows a 5 minute lateness buffer, catching 95 percent of late events but holding 2.5x more state than a 1 minute buffer; monitors p99 event lateness to tune watermark aggressiveness
3. Analytics pipeline writes counts to a database without deduplication; a task restart causes double counting until the team adds unique event IDs as database keys to make writes idempotent