Distributed Data Processing • Spark Memory Management & Tuning
What is Spark Memory Management?
Definition
Spark Memory Management controls how Apache Spark allocates and uses memory within each executor's JVM (Java Virtual Machine), so it can process data in memory without crashing or stalling on garbage collection.
❗ Remember: Without proper memory management, a job processing 5 TB might run fine while the same code on 10 TB fails repeatedly with executor crashes, not because of logic errors, but because memory footprints scale non-linearly when data is skewed.
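The split described in the definition is governed by a handful of Spark configuration properties. The fragment below shows the relevant keys with their documented defaults; the 32g heap size is just an illustrative value, not a recommendation:

```properties
# Per-executor JVM heap size (illustrative value)
spark.executor.memory           32g
# Fraction of (heap - 300 MB reserved) given to unified memory; default 0.6
spark.memory.fraction           0.6
# Share of unified memory protected for storage (caching); default 0.5
spark.memory.storageFraction    0.5
```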
💡 Key Takeaways
✓ Spark divides the executor heap into reserved memory (a fixed 300 MB), user memory (your own data structures), and unified memory (execution plus storage)
✓ Execution memory handles shuffles, sorts, joins, and aggregations while processing; storage memory holds cached datasets and broadcast variables
✓ Unified memory lets execution and storage dynamically borrow from each other, avoiding the waste of rigid static pools as workloads change (execution can evict cached blocks, but only down to the protected storage fraction)
✓ Memory pressure patterns vary wildly: a wide shuffle stage might consume 20x more memory than a narrow map stage with identical input
✓ Without proper management, jobs scale fine to 5 TB but crash at 10 TB due to non-linear memory growth from data skew
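The heap split above can be reproduced with a little arithmetic. This is a minimal sketch assuming Spark's documented defaults (300 MB reserved, `spark.memory.fraction = 0.6`, `spark.memory.storageFraction = 0.5`); the function name and return shape are illustrative, not a Spark API:

```python
def memory_regions(heap_mb: float,
                   memory_fraction: float = 0.6,
                   storage_fraction: float = 0.5) -> dict:
    """Approximate Spark's on-heap memory regions for one executor.

    Mirrors the unified memory model: a fixed 300 MB is reserved,
    unified memory takes `memory_fraction` of the remainder, and
    `storage_fraction` of that is protected for cached blocks.
    """
    reserved = 300.0                       # fixed safeguard, in MB
    usable = heap_mb - reserved
    unified = usable * memory_fraction     # execution + storage pools
    storage = unified * storage_fraction   # protected for caching
    execution = unified - storage          # shuffles, sorts, joins, aggs
    user = usable - unified                # your own data structures
    return {"reserved": reserved, "user": user,
            "execution": execution, "storage": storage}

# For a 32 GB executor heap (32 * 1024 MB): unified memory is roughly
# 19 GB, split evenly between execution and storage; user memory gets
# about 12.7 GB.
regions = memory_regions(32 * 1024)
```

Because the pools borrow from each other at runtime, these numbers are starting budgets rather than hard walls.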
📌 Examples
1. A cluster with 200 executors at 32 GB heap each provides about 6.4 TB of total executor memory for processing 20 TB of raw data
2. A dimension table expected to broadcast at 2 GB suddenly grows to 15 GB, causing a driver out-of-memory error because the driver must collect and serialize the entire broadcast
3. Young-generation sizing errors cause frequent minor garbage collection pauses of about 100 milliseconds each, which across 2000 tasks aggregate into minutes of overhead
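Examples 1 and 3 are simple enough to sanity-check with arithmetic; a quick sketch using only the numbers given above:

```python
# Example 1: total executor memory across the cluster.
executors = 200
heap_gb = 32
total_gb = executors * heap_gb             # 6400 GB, i.e. about 6.4 TB
raw_data_ratio = 20_000 / total_gb         # 20 TB of raw data is ~3.1x memory

# Example 3: aggregate cost of frequent minor GC pauses.
pause_ms = 100
tasks = 2000
overhead_minutes = pause_ms * tasks / 1000 / 60   # 200 s, about 3.3 minutes
```

The ratio in example 1 is why Spark spills to disk rather than requiring the whole dataset to fit in memory at once.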