Write Pipeline and Consistency Guarantees
HDFS enforces single writer semantics through a lease mechanism managed by the NameNode. When a client opens a file for writing, the NameNode grants an exclusive lease, typically valid for 60 seconds and renewable with heartbeats. Only the lease holder can write or append to the file. If the client crashes, the lease eventually expires and the NameNode performs lease recovery, allowing a new writer to take over. This prevents concurrent writes that could corrupt data but introduces availability delays during recovery, typically 30 to 60 seconds before another client can proceed.
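The sketch below is a hypothetical Java model of that lease bookkeeping, not actual NameNode code: an exclusive grant per path, renewal on client heartbeat, and an expiry check that allows recovery once the 60 second window lapses. Class and method names are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of NameNode-style lease tracking; the names and the
// 60-second soft limit mirror the description above, not HDFS internals.
class LeaseTable {
    private static final long SOFT_LIMIT_MS = 60_000; // lease validity window

    record Lease(String holder, long lastRenewedMs) {}

    private final Map<String, Lease> leasesByPath = new ConcurrentHashMap<>();

    /** Grant an exclusive write lease, or reject if another live writer holds it. */
    synchronized boolean grant(String path, String client, long nowMs) {
        Lease current = leasesByPath.get(path);
        if (current != null && !isExpired(current, nowMs)
                && !current.holder().equals(client)) {
            return false;                              // single writer semantics
        }
        leasesByPath.put(path, new Lease(client, nowMs));
        return true;
    }

    /** Called on each client heartbeat to keep the lease alive. */
    synchronized void renew(String path, String client, long nowMs) {
        Lease current = leasesByPath.get(path);
        if (current != null && current.holder().equals(client)) {
            leasesByPath.put(path, new Lease(client, nowMs));
        }
    }

    boolean isExpired(Lease lease, long nowMs) {
        return nowMs - lease.lastRenewedMs() > SOFT_LIMIT_MS; // triggers lease recovery
    }
}
```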
The write path forms a pipeline across replica DataNodes. The client streams a block to the first DataNode, which simultaneously writes to local disk and forwards packets to the second DataNode, which forwards to the third. Each DataNode validates checksums and acknowledges upstream only after data is persisted. The client considers the write successful when it receives the final acknowledgment from the tail of the pipeline, ensuring all replicas have durably stored the data. This daisy chain pattern overlaps network transmission and disk writes, achieving throughput close to a single disk stream while guaranteeing consistency.
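A minimal Java sketch of this daisy chain follows; it is illustrative rather than the real DataNode implementation, but the acknowledgment flow matches the description: each node persists the packet, forwards it downstream, and acknowledges upstream only after the rest of the chain has acknowledged, so the client treats the write as durable only when the head's call returns true.

```java
import java.io.IOException;

// Simplified model of the replication pipeline described above.
interface ReplicaNode {
    /** Returns true only when this node and all downstream nodes have persisted the packet. */
    boolean receive(byte[] packet) throws IOException;
}

class PipelineNode implements ReplicaNode {
    private final ReplicaNode downstream;      // null for the tail of the pipeline

    PipelineNode(ReplicaNode downstream) { this.downstream = downstream; }

    @Override
    public boolean receive(byte[] packet) throws IOException {
        verifyChecksum(packet);                // reject corrupted packets early
        persistToLocalDisk(packet);            // real HDFS overlaps this with forwarding
        boolean downstreamAck = (downstream == null) || downstream.receive(packet);
        return downstreamAck;                  // ack travels tail -> head -> client
    }

    private void verifyChecksum(byte[] packet) throws IOException { /* CRC check per chunk */ }
    private void persistToLocalDisk(byte[] packet) throws IOException { /* write + flush */ }
}
```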
Readers obtain block locations from the NameNode and read directly from DataNodes, preferring the closest replica by network topology. HDFS provides read after close consistency: once a file is closed, all subsequent readers see the complete data. During an open write session, readers may not see the latest appended data until the writer explicitly calls hflush or hsync, which forces pipeline acknowledgments and makes data visible. This relaxed model trades off real time visibility for higher write throughput, suitable for batch processing workloads where files are written once and read many times after completion.
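For concreteness, here is roughly what that looks like with the Hadoop client API (FSDataOutputStream.hflush and hsync are real methods); the NameNode URI and file path are placeholders.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf); // placeholder URI
        try (FSDataOutputStream out = fs.create(new Path("/logs/events.txt"))) {  // placeholder path
            out.write("event-1\n".getBytes(StandardCharsets.UTF_8));
            out.hflush();   // pushes the packet through the pipeline; new readers can now see event-1
            out.write("event-2\n".getBytes(StandardCharsets.UTF_8));
            out.hsync();    // like hflush, but also asks DataNodes to sync to disk
        }                   // close() completes the file; read after close consistency applies
    }
}
```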
💡 Key Takeaways
•A lease timeout of 60 seconds means a writer failure causes a 30 to 60 second unavailability window during recovery before a new writer can proceed, impacting latency sensitive append workloads.
•Pipeline acknowledgment ensures all 3 replicas persist data before client success. Total write latency is roughly single disk latency (5 to 10 ms) plus network round trips (1 to 2 ms per hop), typically 10 to 20 ms end to end.
•Checksum validation per 512 byte chunk detects corruption. On mismatch, the client automatically retries from a different replica (see the checksum sketch after this list). If all replicas are corrupt, the data is lost, making periodic scrubbing critical.
•Single writer semantics prevent concurrent modifications but create bottlenecks for append heavy workloads. Multiple writers to different files achieve parallelism, so consider sharding output across files (see the sharding sketch after this list).
•Read after close consistency means readers must wait for file close to see complete data. For streaming use cases needing visibility during writes, explicit hflush calls make data visible but reduce throughput by 20 to 40% due to frequent pipeline flushes.
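The checksum takeaway, sketched in illustrative (non-HDFS) Java: verify each 512 byte chunk against its stored checksum and fall back to another replica on mismatch. The CRC32 choice and the interface names are assumptions for the sketch.

```java
import java.util.List;
import java.util.zip.CRC32;

// Illustrative sketch of per-chunk checksum verification with replica fallback.
class ChecksummedReader {
    static final int CHUNK_SIZE = 512;   // bytes covered by each stored checksum

    interface Replica {
        byte[] read(long offset, int length);
        long storedChecksum(long chunkIndex);
    }

    /** Reads a chunk, trying each replica until one passes checksum verification. */
    static byte[] readChunk(List<Replica> replicas, long chunkIndex) {
        for (Replica replica : replicas) {
            byte[] data = replica.read(chunkIndex * CHUNK_SIZE, CHUNK_SIZE);
            CRC32 crc = new CRC32();
            crc.update(data);
            if (crc.getValue() == replica.storedChecksum(chunkIndex)) {
                return data;                          // first healthy replica wins
            }
            // mismatch: report corruption and fall through to the next replica
        }
        throw new IllegalStateException("all replicas corrupt: data lost without scrubbing");
    }
}
```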
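And the sharding workaround, sketched with the Hadoop client API but with an arbitrary shard count and placeholder output paths: each file has its own lease and its own pipeline, so the writers proceed in parallel.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShardedWriter {
    public static void main(String[] args) throws Exception {
        int shards = 4;                                         // arbitrary example
        FileSystem fs = FileSystem.get(new Configuration());
        ExecutorService pool = Executors.newFixedThreadPool(shards);
        for (int i = 0; i < shards; i++) {
            final Path shardPath = new Path("/output/part-" + i); // placeholder paths
            pool.submit(() -> {
                try (FSDataOutputStream out = fs.create(shardPath)) {
                    // each file has its own lease and pipeline, so writes run in parallel
                    out.writeBytes("records for " + shardPath + "\n");
                }
                return null;
            });
        }
        pool.shutdown();
    }
}
```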
📌 Examples
A MapReduce job writing 10 GB of output streams roughly eighty 128 MB blocks, each through its own replica pipeline. With 100 MB/s disk throughput, writing one block takes approximately 1.3 seconds. Pipeline overlap keeps all three DataNodes writing simultaneously, maintaining near single disk speed.
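A quick back-of-the-envelope check of those numbers, using only the figures quoted in the example:

```java
public class BlockWriteMath {
    public static void main(String[] args) {
        double outputMB = 10 * 1024;          // 10 GB of job output
        double blockMB = 128;                 // HDFS block size from the example
        double diskMBps = 100;                // per-disk sequential throughput

        double blocks = outputMB / blockMB;            // 80 blocks
        double secondsPerBlock = blockMB / diskMBps;   // ~1.28 s per block
        System.out.printf("%.0f blocks, %.2f s per block%n", blocks, secondsPerBlock);
    }
}
```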
Google File System (GFS), the precursor to HDFS, used primary lease holders and a similar pipelined replication scheme. The published GFS paper cites a lease duration of 60 seconds with automatic renewal, and recovery taking tens of seconds after a client failure.