MapReduce Execution Model: Master-Worker Architecture
How Jobs Actually Execute:
When you submit a MapReduce job, you provide the input location (for example, a directory in the Hadoop Distributed File System containing 50 TB of logs), your map and reduce functions, and the number of reducers (for example, 2,000). A central master node (called the JobTracker in Hadoop) orchestrates everything.
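Conceptually, the submission carries just these few pieces of information. The sketch below models it as a hypothetical JobSpec in Python; the type and field names are illustrative, not Hadoop's actual submission API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class JobSpec:
    input_path: str      # e.g. an HDFS directory holding ~50 TB of logs
    map_fn: Callable     # map(record) -> iterable of (key, value) pairs
    reduce_fn: Callable  # reduce(key, values) -> iterable of (key, value) pairs
    num_reducers: int    # R, the number of reduce partitions

job = JobSpec(
    input_path="hdfs:///logs/2024/",
    map_fn=lambda record: [(record.split()[0], 1)],        # count by first field
    reduce_fn=lambda key, values: [(key, sum(values))],
    num_reducers=2000,
)
# The master takes this spec, splits the input into blocks,
# and creates one map task per block.
```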
The master queries the distributed file system to see where data blocks are physically stored. If your 50 TB input is split into 400,000 blocks of 128 MB each, the master creates 400,000 map tasks. Here's the critical optimization: each map task is scheduled on a worker node that already has a local copy of its input block. This data locality means 90 to 95% of map input is read from local disk at 100 to 150 MB/s instead of over the network at 10 to 20 MB/s.
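A rough sketch of that arithmetic and the locality preference, assuming hypothetical block_locations and free_slots structures rather than Hadoop's real scheduler state:

```python
# Back-of-the-envelope numbers from the paragraph above.
INPUT_MB = 50 * 1024 * 1024            # 50 TB expressed in MB
BLOCK_MB = 128
num_map_tasks = INPUT_MB // BLOCK_MB
print(num_map_tasks)                   # 409,600 -> roughly 400,000 map tasks

print(BLOCK_MB / 100, BLOCK_MB / 15)   # ~1.3 s local read vs ~8.5 s over the network

def schedule_map_task(block_id, block_locations, free_slots):
    """Prefer a worker that already holds a replica of the block."""
    for node in block_locations[block_id]:   # typically a few replica locations
        if node in free_slots:
            return node                      # data-local assignment
    return next(iter(free_slots))            # fall back to a remote read
```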
The Three Phase Lifecycle:
Execution happens in distinct phases. During the map phase, workers read input blocks, run your map function on each record, and write intermediate key-value pairs to local disk. These pairs are partitioned (usually by hashing the key) into R partitions, where R is the number of reducers.
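A minimal sketch of that partitioning step; the partition function below is illustrative, though Hadoop's default partitioner behaves essentially like hash(key) mod R:

```python
import hashlib

def partition(key: str, num_reducers: int) -> int:
    # Use a stable hash (not Python's per-process randomized hash()) so
    # every map task agrees on which reducer owns a given key.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_reducers

# Each map task keeps R output buffers; pair (k, v) goes into buffer
# partition(k, R), and each buffer is sorted by key before spilling to disk.
print(partition("user_42", 2000))   # always routes this key to the same reducer
```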
On a 2,000 node cluster with 5 map slots per node, you can run 10,000 concurrent map tasks. The 400,000 total maps complete in waves over perhaps 15 to 20 minutes for a 50 TB job. Each map writes its intermediate data (sorted by key within each partition) to local disk and reports completion to the master.
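The wave arithmetic, spelled out (the same figures reappear in example 1 below):

```python
nodes, slots_per_node = 2000, 5
concurrent_maps = nodes * slots_per_node       # 10,000 maps at a time
total_maps = 400_000
waves = total_maps / concurrent_maps           # 40 waves
seconds_per_map = 25                           # assumed average task runtime
print(waves * seconds_per_map / 60)            # ~16.7 minutes of pure map time
```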
The shuffle phase begins while maps are still running. Each of the 2,000 reducers pulls its partition from all 400,000 completed map tasks over the network. If intermediate data totals 30 TB (after map-side compression and combining), each reducer fetches about 15 GB in small pieces from 400,000 sources. Reducers use multiple parallel fetch threads, write incoming data to disk, and perform external merge sorts to combine the sorted segments.
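A sketch of the reducer-side merge, using small in-memory lists in place of the on-disk spill files a real reducer would merge:

```python
# Each fetched segment of map output is already sorted by key, so the
# reducer can stream-merge them instead of holding ~15 GB in memory.
import heapq

def merge_sorted_segments(segments):
    """segments: iterables of (key, value) pairs, each sorted by key."""
    return heapq.merge(*segments, key=lambda kv: kv[0])   # lazy k-way merge

seg_a = [("apple", 1), ("cat", 2)]
seg_b = [("apple", 3), ("dog", 1)]
print(list(merge_sorted_segments([seg_a, seg_b])))
# [('apple', 1), ('apple', 3), ('cat', 2), ('dog', 1)]
```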
Once all maps finish and the shuffle completes, the reduce phase runs. Each reducer now has a fully sorted stream of key-value pairs. It iterates through this stream, grouping consecutive records that share the same key, and invokes your reduce function with each key and its value iterator. Final output is written directly to distributed storage with atomic commits.
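A sketch of that grouping step, assuming the hypothetical run_reduce helper below and a word-count style reduce function:

```python
# The merged stream is sorted by key, so consecutive records with the same
# key can be grouped lazily and handed to the user's reduce one key at a time.
from itertools import groupby
from operator import itemgetter

def run_reduce(sorted_pairs, reduce_fn):
    """sorted_pairs: (key, value) pairs already sorted by key."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        values = (v for _, v in group)        # a value iterator, never a full list
        yield from reduce_fn(key, values)     # in practice, written to HDFS

wc_reduce = lambda key, values: [(key, sum(values))]    # word-count reduce
stream = [("apple", 1), ("apple", 3), ("cat", 2)]
print(list(run_reduce(stream, wc_reduce)))    # [('apple', 4), ('cat', 2)]
```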
Typical Job Execution Time: map phase 15-20 min, shuffle 10-15 min, reduce 5-10 min.
⚠️ Common Pitfall: The master is a single logical coordinator. In early Hadoop, if the JobTracker crashed mid-job, all progress was lost and jobs had to be resubmitted. High-availability masters were added later, but master scalability remains a bottleneck for clusters running 10,000+ concurrent jobs.
Fault Tolerance Through Re-execution:
Workers send periodic heartbeats to the master. If a worker stops responding for long enough, the master marks it dead and reschedules all of its tasks. Map tasks on a failed node must be re-run even if they had completed, because their intermediate output lived on local disk that is now lost. Reduce tasks only need re-running if they were in progress when the worker died; completed reduce output is already safely in distributed storage. This simple retry mechanism works because map and reduce functions are required to be deterministic and side-effect free, so re-running a task produces identical output.
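A minimal sketch of that rescheduling decision; the task dictionaries and state names are illustrative, not Hadoop's internals:

```python
def tasks_to_reschedule(dead_worker, tasks):
    """tasks: dicts with 'type' ('map' or 'reduce'), 'state', and 'worker' fields."""
    rerun = []
    for t in tasks:
        if t["worker"] != dead_worker:
            continue
        if t["type"] == "map":
            # Map output lived on the dead node's local disk, so even
            # completed maps must be recomputed from replicated input.
            rerun.append(t)
        elif t["state"] == "in_progress":
            # Completed reduce output is already on the DFS with an
            # atomic commit; only in-progress reduces restart.
            rerun.append(t)
    return rerun
```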
💡 Key Takeaways
✓ Master creates one map task per input block (128 MB typical), scheduling each task on a node with a local replica to achieve 90 to 95% data locality and avoid network bottlenecks
✓ Map phase runs in waves across available slots: 2,000 nodes with 5 slots each can run 10,000 concurrent maps out of 400,000 total, completing in 15 to 20 minutes for 50 TB
✓ Shuffle transfers all intermediate data over the network: with 30 TB of intermediate output and 2,000 reducers, each reducer fetches about 15 GB from hundreds of thousands of map tasks
✓ Fault tolerance uses simple re-execution: failed map tasks are recomputed from replicated input, failed reducers restart by re-fetching intermediate data, and no complex distributed checkpointing is needed
✓ Master is the single coordination point: it tracks task status, schedules work, and handles failures, but becomes a scalability bottleneck at 10,000+ concurrent jobs
📌 Examples
1. A 2,000-node cluster processing 50 TB: 400,000 map tasks (128 MB each) run in 40 waves at 10,000 concurrent tasks, taking 15 to 20 minutes if each map completes in about 25 seconds
2. Data locality impact: reading 128 MB from local disk at 100 MB/s takes about 1.3 seconds, but over the network at 15 MB/s it takes about 8.5 seconds, making locality critical for map-phase performance
3. Shuffle bottleneck: transferring 30 TB across 2,000 reducers at an aggregate cluster bandwidth of 50 GB/s takes 10 minutes, often the longest phase in network-bound jobs