
HDFS Architecture: Master/Worker Separation

Hadoop Distributed File System (HDFS) splits responsibilities between a small metadata service, the NameNode (master), and a large fleet of DataNodes (workers). The NameNode keeps the entire namespace in memory, including directories, file-to-block mappings, permissions, and leases, while DataNodes store the actual data blocks on local disks. This separation is crucial for scaling: metadata operations stay fast with in-memory lookups, while storage capacity grows horizontally by adding more worker nodes.

Files are divided into large immutable blocks, typically 128 MB or 256 MB, far larger than the 4 KB blocks of traditional filesystems. This design maximizes sequential I/O throughput and minimizes metadata overhead: a 10 GB file creates only 40 to 80 block entries in the NameNode instead of millions. Each block is replicated across multiple DataNodes (3 copies by default) with rack-aware placement: one replica on the writer's node, a second on a different node in the same rack, and a third on a node in another rack, so data survives both individual machine failures and entire rack failures.

The consistency model is write-once, read-many with single-writer semantics. Only one client can write to a file at a time, enforced through leases granted by the NameNode. Writers stream data to a pipeline of replica DataNodes, and acknowledgments chain back only after all replicas persist the data with checksums. Readers ask the NameNode for block locations, then stream directly from DataNodes, preferring nodes on the same host or rack to minimize network traffic and maximize aggregate throughput across thousands of parallel readers.
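To make the metadata arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python (the 300-bytes-per-block figure is an approximation in line with the takeaways below; real per-object overhead varies by Hadoop version):

```python
GiB = 1024 ** 3

def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of block entries the NameNode must track for one file."""
    return -(-file_size_bytes // block_size_bytes)  # ceiling division

file_size = 10 * GiB
for label, block_size in [("4 KB (ext4-style)", 4 * 1024),
                          ("128 MB (HDFS default)", 128 * 1024 ** 2),
                          ("256 MB", 256 * 1024 ** 2)]:
    print(f"{label:>22}: {block_count(file_size, block_size):>9,} block entries for a 10 GB file")

# Rough NameNode heap attributable to this one file's blocks, assuming
# ~300 bytes per block object (an approximation, not an exact figure).
blocks = block_count(file_size, 128 * 1024 ** 2)
print(f"~{blocks * 300:,} bytes of heap for its block objects at 128 MB blocks")
```

Running it shows roughly 2.6 million entries at 4 KB blocks versus 80 at 128 MB and 40 at 256 MB, which is where the "40 to 80 block entries instead of millions" figure comes from.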
💡 Key Takeaways
The NameNode stores the entire namespace in RAM, consuming roughly 300 to 500 bytes per file and per block; a 256 GB NameNode can manage approximately 200 to 300 million objects in total.
DataNodes send heartbeats every 3 seconds and full block reports every 6 hours. A node that misses heartbeats for about 30 seconds is treated as stale; after roughly 10 minutes of silence the NameNode marks it dead and triggers re-replication of its blocks.
A block size of 128 MB to 256 MB shrinks the metadata footprint roughly 30,000-fold compared to 4 KB blocks and enables sequential disk throughput of 100 to 300 MB/s per task.
Rack-aware placement with replication factor 3 ensures at least one replica survives a rack failure while keeping two of the three replica writes within a single rack, reducing cross-rack write bandwidth by about 33% (a simplified placement sketch follows this list).
A single-writer lease prevents concurrent modification. The lease soft limit is typically 60 seconds; if a writer crashes, lease recovery blocks new writers until the lease expires.
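The rack-aware placement above can be sketched as a toy policy. This is a simplified model of the description in this card, not Hadoop's actual BlockPlacementPolicyDefault, which also weighs node load, free space, and topology depth:

```python
import random

def place_replicas(writer_node, topology, replication=3):
    """Pick replica nodes: the writer's node, another node on the same rack,
    then a node on a different rack. topology: {rack_id: [node, ...]}."""
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]
    replicas = [writer_node]

    # Second replica: another node on the writer's rack, if one exists.
    same_rack = [n for n in topology[local_rack] if n != writer_node]
    if replication >= 2 and same_rack:
        replicas.append(random.choice(same_rack))

    # Third replica: any node on a different rack, so the block survives
    # the loss of the writer's entire rack.
    other_racks = [n for rack, nodes in topology.items()
                   if rack != local_rack for n in nodes]
    if replication >= 3 and other_racks:
        replicas.append(random.choice(other_racks))
    return replicas

topology = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5", "dn6"]}
print(place_replicas("dn2", topology))  # e.g. ['dn2', 'dn1', 'dn5']
```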
📌 Examples
Yahoo's early deployments ran thousands of DataNodes in petabyte-scale HDFS clusters, achieving multi-GB/s aggregate read throughput by parallelizing 100 to 200 MB/s per-task disk streams across thousands of map tasks.
Meta's Hive-based data warehouse uses hundreds of petabytes of HDFS storage. Managing a 50 PB logical dataset with 3x replication requires 150 PB of physical capacity versus approximately 70 PB with 10+4 erasure coding, saving roughly $5 to $10 million in storage hardware costs annually (see the worked arithmetic below).
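A worked version of the replication-versus-erasure-coding arithmetic in the second example (the dollar figure here assumes a hypothetical cost per petabyte chosen only to land in the range quoted above):

```python
logical_pb = 50                      # logical dataset size in PB
replication_factor = 3
ec_data, ec_parity = 10, 4           # RS(10,4) erasure-coding layout

physical_3x = logical_pb * replication_factor                # 150 PB raw
physical_ec = logical_pb * (ec_data + ec_parity) / ec_data   # 70 PB raw
saved_pb = physical_3x - physical_ec                         # 80 PB saved

cost_per_pb_millions = 0.1           # hypothetical $0.1M per PB of raw capacity
print(f"3x replication: {physical_3x} PB, RS(10,4): {physical_ec:.0f} PB, "
      f"saving {saved_pb:.0f} PB (~${saved_pb * cost_per_pb_millions:.0f}M/year at the assumed price)")
```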