
Stages, Tasks, and Partitions: How Spark Executes Work

The Execution Model: Spark transforms your high-level code into a DAG of stages, where each stage represents a unit of work that can be executed without shuffling data across the network. Understanding this transformation is crucial because it directly impacts performance and cost. A transformation like map or filter that processes each record independently creates a narrow dependency; these operations stay within the same stage because no data needs to move between executors. Operations like groupBy, or a join on a non-partition key, require shuffling data by key across the cluster, creating a stage boundary.

Tasks and Parallelism: Within each stage, Spark creates one task per partition of data. If you have 1,000 partitions, you'll have 1,000 tasks in that stage. These tasks run in parallel across your executor cores: with 200 executors and 8 cores each, you have 1,600 slots to run tasks simultaneously. Most of a stage's tasks may finish quickly, but the stage isn't complete until ALL of them finish, and this is where stragglers become painful. If 999 tasks complete in 5 seconds but one skewed partition takes 3 minutes due to data imbalance, the entire stage waits 3 minutes: your p50 task latency might be 5 seconds, but your p99 is 180 seconds.
Figure: Task processing time distribution (p50 task: 5 sec, p99 task: 180 sec)
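To make this concrete, here is a minimal PySpark sketch (the local session, dataset size, and column names are illustrative assumptions, not taken from the text above): a filter stays within one stage, a groupBy introduces an Exchange (shuffle) in the physical plan, and the partition count determines how many tasks each stage runs.

```python
from pyspark.sql import SparkSession, functions as F

# Illustrative local session; a real job would run on a cluster.
spark = SparkSession.builder.master("local[*]").appName("stage-demo").getOrCreate()

df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 100)

# One task per partition: the first stage runs this many tasks in parallel.
print(df.rdd.getNumPartitions())

# Narrow dependencies: each partition is processed independently and no data
# moves between executors, so these operations stay inside the same stage.
narrow = df.filter(F.col("id") > 10).select((F.col("id") * 2).alias("doubled"))
narrow.explain()   # physical plan contains no Exchange node

# Wide dependency: rows must be regrouped by key across the cluster, so Spark
# inserts an Exchange (shuffle) here and a new stage begins after it.
wide = df.groupBy("bucket").count()
wide.explain()     # physical plan shows an Exchange (hash partitioning on bucket)

# The post-shuffle stage runs as many tasks as there are shuffle partitions
# (spark.sql.shuffle.partitions, 200 by default, unless AQE coalesces them).
print(spark.conf.get("spark.sql.shuffle.partitions"))
```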
Shuffle Operations: When Spark hits a stage boundary, it writes intermediate data to disk on each executor. The next stage's tasks read this shuffle data over the network. A shuffle of 10 TB with tens of millions of partitions stresses disk, network, and the shuffle service. If executors don't have enough memory, they spill to disk heavily, increasing garbage collection pauses and IO waits.

Adaptive Query Execution (AQE): Modern Spark versions can adjust plans at runtime. If AQE detects that one side of a join is smaller than expected (say, under 10 MB after filtering), it converts a shuffle join into a broadcast join. This eliminates an entire shuffle stage, reducing job time from tens of seconds to just a few seconds. AQE can also coalesce small partitions dynamically and handle skewed joins by splitting hot partitions.
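The AQE behavior described above is governed by a handful of Spark SQL settings. The sketch below shows one way to enable them on a SparkSession; the specific threshold values are illustrative assumptions, not recommendations from this text.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aqe-demo").getOrCreate()

# Enable AQE (on by default since Spark 3.2) and its sub-features.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Coalesce many small post-shuffle partitions into fewer, larger ones.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")

# Split skewed (hot) partitions of a join into smaller tasks.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# If one side of a join turns out to be below this size at runtime,
# AQE can replace the shuffle join with a broadcast join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10m")
```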
⚠️ Common Pitfall: Setting shuffle partitions too low (few large tasks) causes memory pressure and spills. Setting them too high (many tiny tasks) increases scheduling overhead. Target 100 to 300 MB per partition for optimal balance.
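One way to apply this guideline is to derive the shuffle partition count from the expected shuffle size. The helper below is a hypothetical sketch under that assumption; the function name, the 256 MB target, and the 2 TB example are not from the original.

```python
import math

from pyspark.sql import SparkSession


def suggested_shuffle_partitions(shuffle_bytes: int,
                                 target_bytes: int = 256 * 1024 * 1024) -> int:
    """Pick a partition count so each shuffle task handles roughly 100-300 MB."""
    return max(1, math.ceil(shuffle_bytes / target_bytes))


spark = SparkSession.builder.appName("partition-sizing").getOrCreate()

# Example: a job expected to shuffle about 2 TB of data.
expected_shuffle_bytes = 2 * 1024**4
num_partitions = suggested_shuffle_partitions(expected_shuffle_bytes)  # 8192

spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))
```

With this setting, a 2 TB shuffle runs as roughly 8,192 tasks of about 256 MB each, inside the 100 to 300 MB target.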
💡 Key Takeaways
Narrow dependencies like map and filter stay within one stage and execute in parallel, while wide dependencies like groupBy and join create shuffle boundaries between stages
Each stage generates one task per data partition, and the stage completes only when ALL tasks finish, making skewed partitions a major source of p99 latency degradation
Shuffle operations write intermediate data to disk and read over the network, with 10 TB shuffles stressing disk, network, and memory, often causing spills and GC pauses
Adaptive Query Execution can convert shuffle joins to broadcast joins at runtime when one side is small, eliminating entire shuffle stages and reducing job time from tens of seconds to seconds
📌 Examples
1. A 1,000-task stage where 999 tasks finish in 5 seconds but one skewed partition takes 3 minutes results in a p50 of 5 seconds but a p99 of 180 seconds, with the entire stage waiting for the slowest task
2. Tuning shuffle partitions so each task processes 100 to 300 MB of data balances memory pressure against scheduling overhead, preventing both spills from oversized partitions and slowdown from excessive tiny tasks