Spark Memory Management & Tuning

Failure Modes and Edge Cases

Executor Out of Memory Cascade: The most visible failure is executor out of memory during a wide shuffle or join. Symptoms: many tasks fail repeatedly on the same executor with out of memory errors. Root causes fall into three categories. First, data skew: one partition receives 50 GB while others get 1 GB. That task exhausts executor memory even though average memory per task looks fine. Second, misconfigured unified memory leaves too little room for execution, causing aggressive spilling that eventually overflows. Third, user code materializes large collections, perhaps a 10 GB list built incrementally in a User Defined Function, which bypasses Spark's memory management entirely. At production scale, a single hot partition can kill every executor that attempts it. Spark retries the task on different executors, but if the data itself is the problem, all executors fail. Eventually the job aborts after exhausting retry attempts. This is why companies invest heavily in skew detection and mitigation: salting keys, two phase aggregations, or custom partitioners.
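For the skew mitigation mentioned above, a minimal key-salting sketch in Scala might look like the following. The table names (events, user_profiles), the join key user_id, and the salt factor of 16 are hypothetical; the idea is to spread each hot key across several salt buckets on the skewed side and replicate the smaller side so the join still matches.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Minimal key-salting sketch. Assumes a large table skewed on user_id
// joined against a much smaller table on the same key.
val spark = SparkSession.builder.appName("salted-join-sketch").getOrCreate()

val saltBuckets = 16 // hypothetical salt factor; tune to the observed skew

val facts = spark.table("events")        // skewed side: a few hot user_ids
val dims  = spark.table("user_profiles") // smaller side

// Spread each hot key across saltBuckets partitions on the skewed side...
val saltedFacts = facts.withColumn("salt", (rand() * saltBuckets).cast("int"))

// ...and replicate every row of the smaller side once per salt value.
val saltedDims = dims.withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))

// The join key now includes the salt, so no single task owns an entire hot key.
val joined = saltedFacts.join(saltedDims, Seq("user_id", "salt")).drop("salt")
```

The cost is that the smaller side is replicated once per salt bucket, so salting only pays off when that side stays comfortably small relative to executor memory.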
GC Storm Timeline: NORMAL (2% GC time) → PRESSURE (15% GC time) → STORM (40% GC time)
Garbage Collection Storms: When heaps are large and the object creation rate is high, executors can enter garbage collection storms. In the normal state, executors spend 2% of their time in garbage collection and 98% running tasks. Under memory pressure, this shifts to 15% garbage collection, slowing throughput noticeably. In a garbage collection storm, executors spend 40% or more of their time in garbage collection: multi-second pauses every few seconds. Tasks that should take 2 seconds take 10 seconds, and cluster throughput collapses. In batch jobs, this inflates p99 task durations and causes cascading backpressure on shuffle services and storage. In streaming jobs with sub-minute microbatches, long pauses miss processing deadlines: a pipeline targeting p95 end to end latency of 10 seconds cannot tolerate executors disappearing for 5 to 10 seconds under garbage collection. The result is missed Service Level Objectives (SLOs) and potential data loss if the system cannot keep up with the incoming event rate.
Storage vs Execution Contention: Subtler issues arise when cached data and execution compete for unified memory. Suppose many datasets are cached uncompressed, and storage memory grows until it hits its fraction of unified memory (say, 50%). A heavy shuffle then starts and needs more than its initial 50% execution share. Execution can evict cached blocks, but only until storage shrinks to its reserved minimum; if that is still insufficient, execution spills to disk. The job still succeeds, but runtime unexpectedly doubles or triples. Worse, if user code also maintains large in-memory caches outside Spark's management, the combined pressure pushes the executor over the edge into out of memory. This failure is hard to diagnose because Spark's metrics show storage and execution memory each within bounds, while the memory held by user code is invisible to them, so the true total exceeds the available heap.
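The knobs behind the storage versus execution split, and behind GC visibility, are ordinary Spark configuration properties. The sketch below is illustrative only: the specific values are assumptions, not recommendations, and depend on workload and heap size.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative settings only; tune to the actual workload and heap size.
val spark = SparkSession.builder
  .appName("memory-tuning-sketch")
  // Fraction of (heap minus reserved memory) shared by execution and storage.
  .config("spark.memory.fraction", "0.6")
  // Portion of unified memory where cached blocks are immune to eviction;
  // lowering it lets heavy shuffles reclaim more cached blocks before spilling.
  .config("spark.memory.storageFraction", "0.3")
  // Surface GC activity in executor logs and prefer G1 on large heaps,
  // which tends to shorten the long pauses behind GC storms.
  .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC -verbose:gc")
  .getOrCreate()
```

In practice these usually live in spark-defaults.conf or spark-submit --conf flags rather than application code, since executor JVM options cannot change after the executors launch.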
❗ Remember: Cluster wide imbalances matter. If a few nodes have slightly less available memory due to other system processes, those executors hit memory limits first. Spark retries failed tasks on other executors, but under sustained load, this degrades availability even when average resource configuration looks sufficient.
Broadcast Edge Cases: Broadcast variables create unique failure modes. The driver must collect the entire dataset, serialize it, and then distribute it to all executors. If a dimension table expected at 2 GB grows to 15 GB, the driver may run out of memory during collection. Even if the driver survives, executors must hold the full 15 GB in memory. On a 32 GB heap executor running 8 tasks concurrently, this leaves only 17 GB divided among 8 tasks: about 2 GB per task. If the shuffle also needs memory, tasks start failing. One company experienced cascading failures when an oversized broadcast caused the driver to run out of memory repeatedly, triggering job retries. Each retry attempted the same oversized broadcast, failing again. The retry loop destabilized the cluster for hours until the root cause was identified and the broadcast disabled.
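One common defense is to cap the planner's automatic broadcast joins and to gate any explicit broadcast behind a size check. The sketch below assumes hypothetical tables events and user_profiles and an illustrative row-count threshold; it is a sketch of the pattern, not a production heuristic.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder.appName("broadcast-guard-sketch").getOrCreate()

// Cap automatic broadcast joins (or disable them with -1) so a dimension table
// that silently grows past its expected size cannot be broadcast by the planner.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 512L * 1024 * 1024) // 512 MB cap

val facts = spark.table("events")
val dims  = spark.table("user_profiles")

// Hypothetical guard: only request an explicit broadcast when the row count
// suggests the table is still small; otherwise fall back to a shuffle join.
val dimRows = dims.count()
val joined =
  if (dimRows < 10000000L) // illustrative threshold, not a rule
    facts.join(broadcast(dims), "user_id")
  else
    facts.join(dims, "user_id")
```

Counting the smaller table costs an extra job, so teams often rely on table statistics instead; the point is simply that an explicit broadcast should be guarded rather than assumed to stay small.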
💡 Key Takeaways
A single hot partition with 50 GB while others have 1 GB causes repeated out of memory on every executor that attempts it, failing the job despite sufficient average resources
Garbage collection storms occur when executors spend over 40% of time in pauses, inflating task durations from 2 seconds to 10 seconds and collapsing throughput
Storage and execution contention can cause unexpected 2x to 3x runtime increases when cached data prevents execution from acquiring needed memory without aggressive spilling
Broadcast variables that grow from 2 GB to 15 GB can cause driver out of memory during collection, or executor failures when the broadcast exhausts available memory
Cluster imbalances where some nodes have less memory due to co-located processes cause those executors to fail first, creating cascading retries that degrade overall availability
📌 Examples
1. A streaming job targeting p95 latency of 10 seconds breaks when executors hit 5 to 10 second garbage collection pauses during high memory pressure phases
2. An executor with 32 GB heap running 8 concurrent tasks and holding a 15 GB broadcast has only about 2 GB per task, causing failures when shuffle operations need more memory
3. A production cluster experienced hours of instability when an oversized broadcast caused repeated driver out of memory, with each retry attempting the same failing broadcast in a loop