
Replication vs Erasure Coding Trade-offs

Standard HDFS replication with a replication factor of 3 (RF=3) stores three full copies of every block across different DataNodes and racks. This gives fast writes (the client pipelines through three nodes), fast reads (choose the closest replica), and fast recovery (copy one surviving replica to a new node). The cost is 200% storage overhead: 100 TB of logical data consumes 300 TB of physical disk. For hot, frequently accessed data this trade-off is acceptable because read performance and availability justify the cost.

Erasure coding (EC) splits a block into data chunks and parity chunks and stores them across multiple nodes. A 10+4 EC scheme divides a block into 10 data chunks and 4 parity chunks; any 4 chunks can be lost and the original block can still be reconstructed. Storage overhead drops to approximately 40% (14 chunks stored for 10 units of data), reducing a 100 TB dataset from 300 TB physical under RF=3 to roughly 140 TB under EC, a 53% capacity saving. At hyperscale, this translates to millions of dollars in disk, power, and rack-space savings.

The downside is computational and I/O cost. Encoding requires CPU cycles to compute parity (Reed-Solomon or similar codes). Decoding during normal reads is free (the data chunks are read directly), but degraded reads after a node failure must fetch multiple chunks and reconstruct the missing data, often reading 1.4x to 2x more data than RF=3 and consuming 10x to 20x more CPU. Write latency increases because the client or a coordinator must compute parity and write to more nodes (14 instead of 3). Recovery after failures is also slower: reconstructing a lost chunk requires reading 10 chunks instead of copying a single replica.

Production deployments therefore use hybrid policies: replicate hot, frequently accessed data for performance and availability, and apply erasure coding to warm or cold data that is accessed infrequently and can tolerate lower performance. Migration policies monitor access patterns and automatically transition data from RF=3 to EC after 30 or 90 days of inactivity.
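The capacity arithmetic above can be checked with a short back-of-the-envelope calculation. The sketch below is illustrative only; the function names and sizes are assumptions for the example, not HDFS APIs.

```python
# Illustrative comparison of RF=3 replication vs. a k+m erasure coding scheme
# (e.g. 10+4). Pure arithmetic; no HDFS code involved.

def replication_physical(logical_tb: float, replicas: int = 3) -> float:
    """Physical capacity needed when every block is fully replicated."""
    return logical_tb * replicas

def ec_physical(logical_tb: float, k: int = 10, m: int = 4) -> float:
    """Physical capacity needed when each block is split into k data chunks
    plus m parity chunks (overhead factor = (k + m) / k)."""
    return logical_tb * (k + m) / k

logical = 100.0  # TB of logical data, as in the example above

rf3 = replication_physical(logical)       # 300 TB
ec = ec_physical(logical, k=10, m=4)      # 140 TB
print(f"RF=3 physical:    {rf3:.0f} TB (overhead {rf3 / logical - 1:.0%})")
print(f"10+4 EC physical: {ec:.0f} TB (overhead {ec / logical - 1:.0%})")
print(f"Capacity saved:   {rf3 - ec:.0f} TB ({(rf3 - ec) / rf3:.0%} of RF=3)")
```

Running this reproduces the figures in the paragraph: 300 TB versus roughly 140 TB physical, a 53% reduction.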
💡 Key Takeaways
Erasure coding reduces storage cost by approximately 50% but increases write latency by 2x to 4x due to parity computation and writing to more nodes. Best for cold data written once and rarely read.
Degraded reads (reading after node failure) with EC require reconstructing data from parity, consuming 10x to 20x more CPU and reading 1.4x to 2x more data. Replication simply reads from another replica with no overhead.
Network bandwidth during recovery: re-replicating one 128 MB block copies 128 MB. Reconstructing one lost EC chunk reads 10 surviving chunks totaling approximately 1.28 GB and recomputes the missing data, consuming roughly 10x the network and storage I/O (see the recovery sketch after this list).
At Meta scale, switching 50 PB from RF=3 (150 PB physical) to 10+4 EC (70 PB physical) saves 80 PB capacity, approximately $160 to $400 million in total cost of ownership over 3 years including disks, power, cooling, and rack space.
Hybrid policies mitigate the downsides: use RF=3 for data accessed in the last 30 days (fast access, higher cost) and automatically migrate to EC for data older than 30 days (slower access, lower cost). Monitor access patterns to tune migration thresholds; a minimal policy sketch follows this list.
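To make the recovery-bandwidth takeaway concrete, the sketch below compares bytes moved when re-replicating a lost block versus reconstructing a lost EC chunk. It assumes 128 MB chunks and a 10+4 layout, matching the numbers above; it is arithmetic only, not HDFS code.

```python
# Bytes moved to recover a single failed unit, assuming the EC chunk size
# equals the 128 MB block size and a 10+4 scheme, as in the takeaway above.

MB = 1024 * 1024
CHUNK_SIZE = 128 * MB   # assumed size of one block / one EC chunk
K = 10                  # data chunks per block group

# Replication: copy one surviving replica of the lost block.
replication_read = CHUNK_SIZE

# Erasure coding: read any K surviving chunks, then recompute the lost chunk.
ec_degraded_read = K * CHUNK_SIZE

print(f"Replication recovery reads {replication_read / MB:.0f} MB")
print(f"EC reconstruction reads    {ec_degraded_read / MB:.0f} MB "
      f"({ec_degraded_read / replication_read:.0f}x more I/O)")
```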
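The hybrid policy itself reduces to a simple age-based rule. The function and tier names below (choose_storage_policy, REPLICATION_RF3, EC_RS_10_4) are illustrative assumptions about how such a tiering decision might be encoded, not an HDFS API.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical age-based tiering rule: keep recently accessed data at RF=3,
# move anything colder than the threshold to a 10+4 erasure coding tier.
HOT_WINDOW = timedelta(days=30)

def choose_storage_policy(last_access: datetime, now: Optional[datetime] = None) -> str:
    """Return the desired storage tier for a file given its last access time."""
    now = now or datetime.now()
    if now - last_access <= HOT_WINDOW:
        return "REPLICATION_RF3"   # hot: fast reads and recovery, 200% overhead
    return "EC_RS_10_4"            # cold: ~40% overhead, slower degraded reads

# Example: a file last read 90 days ago would be migrated to erasure coding.
print(choose_storage_policy(datetime.now() - timedelta(days=90)))
```

In practice, HDFS applies erasure coding policies at the directory level, so a migration job of this kind typically relocates aging files into an EC-enabled directory rather than converting them in place.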
📌 Examples
Meta f4 warm blob storage uses erasure coding for infrequently accessed photos and videos. Published case studies showed 10+4 EC reduced physical storage overhead from 3.0x to approximately 1.4x, cutting capacity requirements roughly in half for tens of petabytes, with acceptable recovery performance for cold reads.
Google Colossus evolved from GFS by adopting pervasive erasure coding for cold data. This enabled exabyte scale clusters with better utilization. For example, storing 1 EB logical with RF=3 requires 3 EB physical, while EC at 1.4x requires only 1.4 EB, saving 1.6 EB of disks and corresponding power and cooling.