
How Memory Regions Work Together

The Borrowing Mechanism: When a Spark executor starts, it sizes unified memory as a fraction of the heap left after a fixed reserve. By default, 60% of (heap minus 300 MB reserved) becomes unified memory, initially split roughly 50/50 between execution and storage; the remaining 40% is user memory for your own data structures. Here's where it gets interesting: these boundaries are soft, not hard. Imagine an executor with a 16 GB heap. After the 300 MB reserve, roughly 9 GB becomes unified memory: 4.5 GB for execution, 4.5 GB for storage. Now a heavy shuffle stage starts, needing 7 GB to hold hash tables for a join. Execution can expand into storage's share, evicting cached blocks as needed; storage shrinks, but cached blocks inside its protected minimum region (set by spark.memory.storageFraction) are immune to eviction.

The Eviction Rules: Execution evicts storage blocks aggressively when it needs memory, down to that protected floor. Evicted blocks either spill to disk (if persisted with a disk fallback) or are recomputed when accessed again. Storage, however, cannot evict execution data. This asymmetry protects active computation from being interrupted by caching: if storage tries to grow beyond its share while execution is holding the memory, it simply cannot, so cache operations never destabilize running tasks (the block falls back to disk or isn't cached at all).
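To make the sizing concrete, here is a minimal back-of-the-envelope sketch in Scala (the object name is illustrative; it only reproduces the arithmetic and does not call Spark, assuming the defaults spark.memory.fraction = 0.6 and spark.memory.storageFraction = 0.5 with the 300 MB reserve). It yields about 9.4 GB of unified memory, which the example above rounds to 9 GB:

```scala
// Back-of-the-envelope arithmetic for the 16 GB executor example above,
// mirroring Spark's default settings. Plain Scala; it does not call Spark.
object UnifiedMemorySketch extends App {
  val GiB = 1024L * 1024 * 1024

  val heap            = 16 * GiB            // executor heap (-Xmx)
  val reserved        = 300L * 1024 * 1024  // fixed 300 MB system reserve
  val memoryFraction  = 0.6                 // spark.memory.fraction (default)
  val storageFraction = 0.5                 // spark.memory.storageFraction (default)

  val unified      = ((heap - reserved) * memoryFraction).toLong
  val userMemory   = heap - reserved - unified           // remaining heap for user data structures
  val storageFloor = (unified * storageFraction).toLong  // cached blocks below this are immune to eviction

  // A shuffle requests 7 GiB of execution memory. Execution may claim storage's
  // free memory plus any cached blocks sitting above the protected floor.
  val shuffleRequest  = 7 * GiB
  val maxIfCacheEmpty = unified                // nothing cached: execution can take it all
  val maxIfFloorFull  = unified - storageFloor // protected floor fully cached: execution stops here

  def gib(b: Long) = f"${b.toDouble / GiB}%.1f GiB"
  println(s"unified=${gib(unified)}  user=${gib(userMemory)}  protected storage floor=${gib(storageFloor)}")
  println(s"shuffle request=${gib(shuffleRequest)}: execution ceiling is ${gib(maxIfCacheEmpty)} " +
          s"with an empty cache, or ${gib(maxIfFloorFull)} once the protected floor is full")
}
```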
Memory state transitions: IDLE (50/50 split), SHUFFLE (~80/20 favoring execution), CACHING (~30/70 favoring storage)
Serialization's Hidden Impact: Object layout profoundly affects how much data fits in memory. Consider a DataFrame with 100 million rows of simple records. Stored as plain deserialized JVM objects, each record carries object-header overhead, pointer indirection, and padding. The raw data might be 8 GB, yet in memory those objects could consume 24 GB: a 3x multiplier. Serialized storage using Kryo or the Tungsten binary format can pack the same data into roughly 10 GB, cutting memory usage by more than half. The tradeoff is CPU time to serialize and deserialize on access, but at large scale, reducing memory pressure and garbage collection overhead often improves end-to-end performance despite the CPU cost.

Garbage Collection Interaction: Spark workloads create millions of temporary objects per executor: intermediate results, deserialized rows, hash map entries. Poor memory management floods the JVM with objects that get promoted from the young generation to the old generation. When the old generation fills, a full garbage collection pause can freeze the executor for seconds, during which no tasks progress. At 200 executors, if even 10% hit simultaneous multi-second pauses, cluster throughput collapses.
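A minimal sketch of how serialized storage is typically enabled (the Event case class and app name are invented for illustration): switch the serializer to Kryo and persist the RDD at a serialized storage level. DataFrames cached with .cache() already use Spark's compact in-memory columnar format, so this matters most for RDD-heavy jobs:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical record type, used only to illustrate Kryo registration.
case class Event(userId: Long, itemId: Long, ts: Long)

object SerializedCacheSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("serialized-cache-sketch")
      .setMaster("local[*]") // local master for a quick test; drop when submitting to a cluster
      // Replace default Java serialization with Kryo for shuffled and cached data.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes avoids writing full class names alongside every record.
      .registerKryoClasses(Array(classOf[Event]))

    val spark = SparkSession.builder().config(conf).getOrCreate()
    val sc    = spark.sparkContext

    // Persist in serialized form: slower to read back (deserialization cost), but a
    // far smaller footprint and far fewer long-lived objects for the GC to track.
    val events = sc.parallelize(1L to 1000000L).map(i => Event(i % 1000, i % 50000, i))
    events.persist(StorageLevel.MEMORY_ONLY_SER)
    println(s"cached ${events.count()} events in serialized form")

    spark.stop()
  }
}
```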
✓ In Practice: Netflix processes 20 TB nightly ETL jobs with 200 executors at 32 GB heap each. Early stages with narrow transformations create billions of short-lived objects. Later stages with multi-way joins need execution memory to hold hash tables of several gigabytes per task. The unified memory model dynamically adapts across these phases without manual intervention.
💡 Key Takeaways
Unified memory defaults to 60% of heap after reserved space, initially split 50/50 between execution and storage with soft boundaries
Execution can borrow from storage and aggressively evict cached blocks down to storage's protected minimum, but storage cannot evict execution data, which protects active tasks
Serialized storage reduces memory footprint by 2x to 5x compared to deserialized object format, trading CPU time on access for a smaller footprint and less garbage collection pressure
A 2 TB fact table joined with a 200 GB dimension table generates terabytes of shuffle data; execution memory determines how much of it stays in memory versus spilling to disk
Full garbage collection pauses of multiple seconds can occur when millions of objects promote to old generation, freezing executors and collapsing throughput
📌 Examples
1. An executor with a 16 GB heap allocates roughly 9 GB to unified memory. During a shuffle needing 7 GB, execution borrows from storage, shrinking cached data from 4.5 GB to 2 GB (assuming those cached blocks sit above the protected storage floor)
2. A DataFrame with 100 million rows consumes 24 GB stored as deserialized Java objects (3x the 8 GB raw data) but only 10 GB in the Tungsten binary format
3. At 200 executors, if 20 executors hit 3-second garbage collection pauses simultaneously, effective cluster parallelism drops from 200 to 180 for those seconds, creating throughput valleys
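To check whether pauses like those in example 3 are actually biting, per-task GC time can be surfaced with a Spark listener. A minimal sketch (the listener name and the 10% threshold are arbitrary choices for illustration):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Logs any task that spent more than 10% of its run time in JVM garbage collection.
class GcTimeListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null && metrics.executorRunTime > 0) {
      val gcMs  = metrics.jvmGCTime
      val runMs = metrics.executorRunTime
      if (gcMs.toDouble / runMs > 0.10) {
        println(f"task ${taskEnd.taskInfo.taskId} spent $gcMs ms in GC " +
                f"(${100.0 * gcMs / runMs}%.0f%% of $runMs ms run time)")
      }
    }
  }
}

// Register it on an existing session:
//   spark.sparkContext.addSparkListener(new GcTimeListener)
```

The same metric also appears as the GC Time column in the task table of the Spark UI's stage pages.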