Erasure Coding Read and Write Path: Performance Trade-offs
The write path for EC differs significantly based on write size. Full stripe writes are optimal: compute all p parity shards from the k data shards and write all n shards in parallel, achieving the best throughput. For partial updates affecting fewer than k data shards, you must perform a Read Modify Write (RMW): read the stripe's k data shards, recompute the parity shards, then write back the modified data plus the new parity. This RMW inflates IO by 2x to 3x and increases tail latency dramatically. Systems mitigate this by buffering small writes and coalescing them into full stripes, or by using log structured approaches.
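A rough sketch of that IO accounting, assuming the read-all-k-data-shards RMW strategy described above; the function names and the 6+3, 8 MB parameters are illustrative, not tied to any particular system.

```python
# Back-of-envelope IO accounting for EC writes: full stripe write vs.
# read-modify-write partial update. Assumes the RMW reads all k data
# shards, recomputes parity in memory, and writes back the dirty data
# shards plus all p parity shards.

def full_stripe_write_io(k: int, p: int, shard_bytes: int) -> dict:
    """Full stripe write: compute p parity shards, write all n = k + p shards."""
    return {"read": 0, "write": (k + p) * shard_bytes}

def rmw_partial_write_io(k: int, p: int, shard_bytes: int, dirty_data_shards: int) -> dict:
    """Partial update: read k data shards, recompute parity (CPU only),
    write back the modified data shards plus all p parity shards."""
    return {
        "read": k * shard_bytes,
        "write": (dirty_data_shards + p) * shard_bytes,
    }

if __name__ == "__main__":
    k, p, shard = 6, 3, 8 * 2**20            # 6+3 stripe, 8 MB shards
    full = full_stripe_write_io(k, p, shard)
    rmw = rmw_partial_write_io(k, p, shard, dirty_data_shards=1)
    payload = 1 * 2**20                       # a 1 MB application write
    amp = (rmw["read"] + rmw["write"]) / payload
    print(f"full stripe: write {full['write'] / 2**20:.0f} MB")
    print(f"RMW:         read {rmw['read'] / 2**20:.0f} MB + "
          f"write {rmw['write'] / 2**20:.0f} MB "
          f"({amp:.0f}x the 1 MB payload)")
```

With these parameters the sketch prints 48 MB of reads plus 32 MB of writes to persist a 1 MB update, the same accounting as the RMW example later in this section.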
The read path has two modes. Healthy reads with systematic MDS codes like Reed Solomon read the data shards directly without decoding, matching replication performance. Degraded reads, triggered when shards are missing or slow, must read any k available shards and reconstruct the missing data by decoding. This creates a max of k fanout problem: the read completes only when the slowest of the k fetches returns, so composite latency behaves like the maximum of k independent samples of per node latency. To hit a composite p99 target, each node must meet that target at roughly log10(k) additional nines of its own latency distribution. For k=10, you need per node p99.9 to achieve composite p99.
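A minimal sketch of the max of k percentile math, assuming independent and identically distributed per node latencies; the helper name is illustrative.

```python
# Minimal sketch: which per-node percentile must stay under the latency
# budget so that the composite (the max of k parallel shard reads) still
# meets a target percentile. Assumes independent, identically distributed
# per-node latencies.

def required_per_node_quantile(composite_quantile: float, k: int) -> float:
    """If max-of-k must be under the budget with probability
    composite_quantile, each of the k independent reads must be under it
    with probability composite_quantile ** (1 / k)."""
    return composite_quantile ** (1.0 / k)

if __name__ == "__main__":
    for k in (4, 10, 16):
        q = required_per_node_quantile(0.99, k)
        print(f"k={k:2d}: composite p99 needs per-node ~p{q * 100:.2f}")
    # k=10 prints ~p99.90, i.e. roughly one extra nine per node
```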
Repair bandwidth directly impacts durability. Replacing 10 TB of data at 1 Gigabit per second (Gbps), about 125 Megabytes per second (MB/s), takes 22 to 24 hours of sustained transfer. At 10 Gbps, this drops to 2 to 2.5 hours. Longer repair windows rapidly increase the probability that p+1 overlapping failures accumulate before repair completes, sharply reducing durability. To maintain 11 nines durability for 100 Petabytes (PB) with 0.5% AFR, you need enough bandwidth to rebuild approximately 0.5 PB per day in steady state, plus surge capacity for failure bursts.
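A quick conversion sketch for sizing repair windows, assuming decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second) and a fully utilized link; the function and its parameters are illustrative.

```python
# How long a rebuild takes at a given sustained network rate, plus the
# aggregate rate implied by a daily rebuild budget. Uses decimal units.

def rebuild_hours(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours to move data_tb terabytes at link_gbps gigabits/s of usable bandwidth."""
    bytes_total = data_tb * 1e12
    bytes_per_s = link_gbps * 1e9 / 8 * efficiency
    return bytes_total / bytes_per_s / 3600

if __name__ == "__main__":
    print(f"10 TB @  1 Gbps: {rebuild_hours(10, 1):.1f} h")    # ~22 h
    print(f"10 TB @ 10 Gbps: {rebuild_hours(10, 10):.1f} h")   # ~2.2 h
    # Aggregate budget from the text: 0.5 PB/day of rebuild traffic
    print(f"0.5 PB/day ≈ {0.5e15 / 86400 / 1e9:.1f} GB/s aggregate")
```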
Shard placement across failure domains is critical. Place the shards of a stripe in distinct failure domains: different disks, controllers, nodes, racks, or Availability Zones (AZs). Cross AZ or cross region placement improves durability against correlated failures like rack outages or power events, but increases read latency and egress costs. Many production systems place at least one shard per AZ across three or more AZs. The tradeoff is durability and blast radius isolation versus latency and cross domain bandwidth costs.
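A placement sketch, assuming a simple round robin policy across hypothetical AZ names; it spreads the n shards of one stripe and checks how many whole AZ losses the stripe survives.

```python
# Spread the n = k + p shards of one stripe round-robin across AZs, then
# check how many whole-AZ failures the stripe tolerates given p parity
# shards. AZ names and the round-robin policy are illustrative only.

from collections import Counter

def place_round_robin(n_shards: int, zones: list[str]) -> list[str]:
    """Assign shard i to zone i mod len(zones)."""
    return [zones[i % len(zones)] for i in range(n_shards)]

def az_losses_survivable(placement: list[str], parity: int) -> int:
    """Worst case: losing the most-loaded AZs first, how many whole AZs
    can fail before more than `parity` shards are gone?"""
    per_az = sorted(Counter(placement).values(), reverse=True)
    lost, azs_down = 0, 0
    for shards_in_az in per_az:
        if lost + shards_in_az > parity:
            break
        lost += shards_in_az
        azs_down += 1
    return azs_down

if __name__ == "__main__":
    k, p = 8, 4
    placement = place_round_robin(k + p, ["az-a", "az-b", "az-c", "az-d"])
    print(Counter(placement))                          # 3 shards per AZ
    print("whole-AZ failures survivable:",
          az_losses_survivable(placement, parity=p))   # 1
```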
💡 Key Takeaways
•Full stripe writes compute p parity shards and write all n shards in parallel for best throughput; partial updates require Read Modify Write (RMW): reading k data shards, recomputing parity, and writing back the modified data plus new parity
•Healthy reads with systematic codes access data shards directly; degraded reads must fan out to k shards and decode, amplifying tail latency from per node p99 to approximately the max of k samples
•For k=10 reconstruction, achieving composite p99 latency requires each node to meet roughly p99.9 tail performance due to max of k fanout amplification
•Repair bandwidth directly controls durability: 10 TB rebuild at 1 Gbps takes 22 to 24 hours versus 2 to 2.5 hours at 10 Gbps; longer windows exponentially increase p+1 failure risk
•Shard placement across failure domains (disks, nodes, racks, AZs) improves durability against correlated failures but increases cross domain latency and egress bandwidth costs
•Production systems place at least one shard per AZ across three or more AZs; wider placement trades latency and cost for better blast radius isolation
📌 Examples
RMW example: updating 1 MB in a 6+3 stripe with 8 MB shards requires reading all 6 data shards (48 MB), recomputing the 3 parity shards in memory, and writing back 1 data + 3 parity shards (32 MB), about 80 MB of IO to persist a 1 MB update
Tail latency: in a 10+4 EC scheme where each node has a 50ms p99, a degraded read fanning out to 10 nodes sees a composite p99 closer to 80 to 100ms because it must wait for the slowest response (see the simulation sketch after these examples)
Repair bandwidth budget: maintaining 11 nines for 100 PB with 0.5% AFR requires rebuilding approximately 0.5 PB per day or 6 GB/s aggregate repair bandwidth across the cluster
Cross AZ placement: placing 8+4 shards across 4 AZs (3 shards per AZ) tolerates losing one full AZ (3 shards lost, 1 shard of tolerance remaining) but incurs cross AZ latency of 1 to 5ms and egress costs
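A Monte Carlo sketch of the 10+4 tail latency example above, assuming a lognormal per node latency with a 10ms median tuned so the per node p99 is about 50ms; the distribution and its parameters are assumptions for illustration, not measurements.

```python
# Simulate a degraded read that waits for the slowest of k shard fetches.
# Per-node latency is modeled as lognormal (assumption) with median 10 ms
# and sigma chosen so the per-node p99 is ~50 ms.

import math
import random

def composite_p99(k: int, trials: int = 200_000, seed: int = 1) -> float:
    rng = random.Random(seed)
    median_ms = 10.0
    sigma = math.log(50.0 / median_ms) / 2.326    # force per-node p99 ≈ 50 ms
    maxima = []
    for _ in range(trials):
        # the degraded read completes only when the slowest of k fetches returns
        maxima.append(max(median_ms * math.exp(rng.gauss(0.0, sigma))
                          for _ in range(k)))
    maxima.sort()
    return maxima[int(0.99 * trials)]

if __name__ == "__main__":
    print(f"composite p99 for k=10: {composite_p99(10):.0f} ms")  # roughly 80-100 ms
```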