Erasure Coding Read and Write Path: Performance Trade-offs
The Write Path: Encoding and Distribution
Writing with erasure coding involves three steps. First, buffer enough data to fill a stripe (the set of fragments that form one encoding unit): for a 10+4 scheme with 1MB fragments, that is 10MB. Second, compute the 4 parity fragments by running the data through Reed-Solomon encoding. Third, write all 14 fragments to 14 different nodes in parallel.
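The three steps can be sketched as follows. This is a simplified illustration: a real Reed-Solomon code produces M independent parity fragments via GF(2^8) matrix math, whereas the single XOR parity here (repeated M times) is only a stand-in to keep the structure readable.

```python
from functools import reduce

K, M = 10, 4  # 10 data fragments + 4 parity fragments per stripe

def encode_stripe(data: bytes, fragment_size: int) -> list[bytes]:
    """Return K data fragments plus M parity fragments for one stripe."""
    stripe_size = K * fragment_size
    assert len(data) <= stripe_size
    # Step 1: buffer a full stripe, padding if the data falls short.
    data = data.ljust(stripe_size, b"\x00")
    frags = [data[i * fragment_size:(i + 1) * fragment_size] for i in range(K)]
    # Step 2: compute parity. Real Reed-Solomon yields M *independent*
    # parity fragments; one XOR parity repeated M times is a stand-in.
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*frags))
    # Step 3: the caller ships all K+M fragments to distinct nodes in parallel.
    return frags + [parity] * M
```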
Write latency equals the slowest acknowledged fragment write plus encoding time. If 13 nodes respond in 5ms but one takes 50ms, the write takes 50ms: tail latency dominates. Some systems wait for only k fragment confirmations (10 of 14) and write the remaining fragments asynchronously, trading durability guarantees for lower latency.
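The effect of quorum acknowledgment on tail latency can be shown with a small model; the per-node latencies and encoding time below are illustrative numbers, not measurements.

```python
def write_latency(node_latencies_ms, k, encode_ms):
    """Latency when only the k fastest fragment writes must confirm."""
    return sorted(node_latencies_ms)[k - 1] + encode_ms

# 13 fast nodes and one straggler, with an assumed 2 ms encoding cost.
latencies = [5] * 13 + [50]
full = write_latency(latencies, k=14, encode_ms=2)    # wait for all 14
quorum = write_latency(latencies, k=10, encode_ms=2)  # wait for 10 of 14
```

With these numbers, waiting for all 14 fragments costs 52ms while a 10-of-14 quorum costs 7ms: the straggler drops out of the critical path.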
The Read Path: Parallel Fragment Fetching
Reading starts by requesting fragments from nodes. The minimum is k fragments (10 of 14). Optimized systems request from all 14 nodes and use the first 10 responses. This hedges against slow nodes: if 3 nodes are slow, the other 11 still complete the read quickly.
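The hedging math is the mirror image of the quorum write: the read completes when the k-th fastest response arrives, so stragglers beyond position k are irrelevant. Latencies below are illustrative.

```python
def read_latency(node_latencies_ms, k):
    """A hedged read completes when the k-th fastest fragment arrives."""
    return sorted(node_latencies_ms)[k - 1]

# 3 slow nodes out of 14: the 10th-fastest response still arrives in 5 ms.
latencies = [5] * 11 + [80, 80, 80]
hedged = read_latency(latencies, k=10)
```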
When all k data fragments arrive intact, no decoding is needed. Just concatenate them. When some data fragments are missing, decode by inverting the encoding matrix, solving for the missing fragments from any k survivors. Decoding has similar CPU cost to encoding: 100-500ms per gigabyte.
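For the general case, decoding means matrix inversion over GF(2^8). In the special case of a single XOR parity (the simplified stand-in used above, not real Reed-Solomon), recovery of one lost data fragment reduces to XOR-ing the survivors with the parity:

```python
from functools import reduce

def xor_frags(frags):
    """XOR a list of equal-length byte strings together."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*frags))

# Parity is the XOR of all data fragments, so XOR-ing the nine survivors
# with the parity yields the one missing fragment.
data = [bytes([i]) * 4 for i in range(10)]
parity = xor_frags(data)
lost = data[3]
recovered = xor_frags(data[:3] + data[4:] + [parity])
```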
Small Object Inefficiency
Erasure coding requires a minimum stripe size. A 10+4 scheme with 64KB minimum fragments means the smallest data stripe is 640KB. Storing a 1KB file wastes 639KB of padding. Small files either get padded (wasting space) or bundled into archives before erasure coding.
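The padding arithmetic above, as a small helper:

```python
def padding_overhead(object_size, k=10, min_fragment=64 * 1024):
    """Bytes of pure padding when an object is stored in one k-fragment stripe."""
    min_stripe = k * min_fragment          # smallest data stripe: 640 KB here
    padded = max(object_size, min_stripe)
    return padded - object_size

# A 1 KB object padded to a 640 KB stripe wastes 639 KB.
waste = padding_overhead(1024)
```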
Systems optimize by using different schemes for different sizes. Small objects use 3x replication (simpler, no minimum size). Large objects use erasure coding. The crossover point depends on fragment size and replication factor, typically 1-10MB.
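A size-based scheme selector might look like this; the 1MB crossover is an assumed value picked from the 1-10MB range mentioned above, not a universal constant.

```python
CROSSOVER_BYTES = 1 * 1024 * 1024  # assumed crossover point

def choose_scheme(object_size: int) -> str:
    """Pick replication for small objects, erasure coding for large ones."""
    if object_size < CROSSOVER_BYTES:
        return "3x-replication"      # simpler, no minimum stripe size
    return "erasure-coding-10+4"     # 1.4x storage overhead vs. 3x
```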
Throughput Versus Latency
Erasure coding optimizes for throughput over latency. Writing 10GB as one operation amortizes encoding cost across all data. Writing 10,000 x 1MB files incurs encoding overhead per file. Sequential bulk workloads (backups, data lakes, media storage) fit perfectly. Random small write workloads do not.
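The amortization argument can be made concrete with a simple cost model. Both the 300ms/GB encoding rate and the 1ms fixed per-operation overhead are assumptions for illustration only.

```python
ENCODE_MS_PER_GB = 300  # assumed mid-range encoding cost
FIXED_MS_PER_OP = 1     # assumed per-write setup overhead

def total_encode_ms(num_ops, bytes_per_op):
    """Total encoding time: per-byte cost plus fixed per-operation cost."""
    gb = num_ops * bytes_per_op / 1e9
    return gb * ENCODE_MS_PER_GB + num_ops * FIXED_MS_PER_OP

bulk = total_encode_ms(1, 10e9)        # one 10 GB write
small = total_encode_ms(10_000, 1e6)   # 10,000 x 1 MB writes, same total bytes
```

Same total bytes, but the 10,000 small writes pay the fixed overhead 10,000 times, so per-operation cost dominates.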