
What is MapReduce? Core Model and Execution Flow

MapReduce is a programming model for processing massive datasets by breaking work into two distributed phases, map and reduce, connected by a shuffle. The fundamental idea is to push computation to where data already lives rather than moving terabytes across the network. Mappers read input partitions in parallel, transform records, and emit intermediate key-value pairs. The shuffle groups all values sharing the same key together across the cluster. Reducers then aggregate per-key state to produce final outputs.

This abstraction hides distribution complexity from developers. You write pure functions (map and reduce) over records, and the framework handles parallelization, failure recovery, and data movement. Tasks within each phase are embarrassingly parallel, meaning they don't need to coordinate. If a mapper fails, the system simply reruns it on another machine, since the operations are deterministic and idempotent.

Google published the original MapReduce paper based on its internal system, which by 2008 was running 100,000 jobs per day and processing 20 petabytes daily. The model proved so effective for batch ETL, log analysis, and index building that it spawned the entire Hadoop ecosystem. The key insight was trading latency for throughput: jobs take minutes to hours, but can process petabytes on commodity hardware with automatic fault tolerance.
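The map → shuffle → reduce flow can be illustrated with a minimal single-process sketch in Python. The function below runs the map step over each partition, simulates the shuffle by grouping intermediate pairs by key, and then applies the reduce function per key. The name run_mapreduce and its signature are illustrative assumptions, not any framework's API; a real cluster runs these steps across many machines, with the shuffle spilling to disk and moving data over the network.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple

# Minimal single-process sketch of the map -> shuffle -> reduce flow.
# run_mapreduce, map_fn, and reduce_fn are illustrative names, not a real framework API.
def run_mapreduce(
    partitions: List[List[Any]],
    map_fn: Callable[[Any], Iterable[Tuple[Any, Any]]],
    reduce_fn: Callable[[Any, List[Any]], Any],
) -> Dict[Any, Any]:
    # Map phase: each partition is processed independently
    # (in a real cluster these run in parallel on different machines).
    intermediate: List[Tuple[Any, Any]] = []
    for partition in partitions:
        for record in partition:
            intermediate.extend(map_fn(record))

    # Shuffle phase: group all intermediate values by key.
    # In a real system this is the network- and disk-intensive global grouping step.
    groups: Dict[Any, List[Any]] = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: each key's values are aggregated independently.
    return {key: reduce_fn(key, values) for key, values in groups.items()}
```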
💡 Key Takeaways
Map phase processes input partitions in parallel, emitting intermediate key-value pairs without coordination between mappers
Shuffle phase is the expensive global sort-and-group operation that routes all values for each key to the same reducer (network and disk intensive); see the partitioning sketch after this list
Reduce phase aggregates per-key state, with each reducer handling a disjoint subset of the key space independently
Fault tolerance via deterministic re-execution: failed map or reduce tasks are simply rerun on different machines since the functions are pure
Optimized for throughput over latency: typical jobs take minutes to hours but can process terabytes to petabytes on commodity clusters
Google's production system ran 100,000 jobs daily processing 20 PB per day by 2008, demonstrating scale for web indexing and log analysis
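The routing mentioned in the shuffle takeaway is typically done with a partition function that every mapper applies identically. The sketch below assumes the common default of hash partitioning; the function name is an illustrative assumption, and real frameworks also allow custom partitioners (for example, range partitioning to produce globally sorted output).

```python
import zlib

# Sketch of the partitioning step inside the shuffle, assuming simple hash partitioning.
def partition_for_key(key: str, num_reducers: int) -> int:
    # A stable hash is required so every mapper, on every machine, routes a given
    # key to the same reducer; each reducer then owns a disjoint slice of the key space.
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```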
📌 Examples
Word count: Mappers emit (word, 1) for each occurrence. Shuffle groups all counts per word. Reducers sum counts to produce (word, total_count). (See the sketch after these examples.)
Log analysis: Mappers parse logs and emit (user_id, bytes_transferred). Shuffle co-locates all bytes per user. Reducers sum to get total bandwidth per user.
Inverted index building: Mappers emit (term, document_id). Shuffle groups documents per term. Reducers build posting lists for search indexing.
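As a concrete illustration, here is word count expressed against the run_mapreduce sketch from the first code block above. The input lines and their split into partitions are made up for illustration.

```python
# Word count using the run_mapreduce sketch defined earlier.
def wordcount_map(line: str):
    for word in line.split():
        yield (word.lower(), 1)      # emit (word, 1) for each occurrence

def wordcount_reduce(word: str, counts: List[int]) -> int:
    return sum(counts)               # total occurrences per word

partitions = [
    ["the quick brown fox", "the lazy dog"],   # input partition 1
    ["the fox jumps over the lazy dog"],       # input partition 2
]
print(run_mapreduce(partitions, wordcount_map, wordcount_reduce))
# {'the': 4, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 2, 'dog': 2, ...}
```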