Data Storage Formats & Optimization • Compression Algorithms Trade-offs
How Compression Algorithms Work: Building Blocks and Common Codecs
The Two Core Techniques: Most compression algorithms combine two fundamental approaches. First, dictionary, or LZ-style, compression finds repeated sequences in data and encodes them as references to earlier occurrences. When you see "the quick brown fox" multiple times, instead of storing it repeatedly, you store it once and use a pointer back to the first occurrence for each subsequent instance.
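To make the pointer idea concrete, here is a toy, purely illustrative LZ-style tokenizer (greedy search, 1 KB look-back, no bit packing; real codecs use hash tables, much larger windows, and compact binary output):

```python
# Toy LZ-style tokenizer: repeated text becomes (distance, length) pointers
# to an earlier occurrence. Illustration only, not how any production codec works.
def toy_lz_tokens(text: str, min_match: int = 4):
    tokens, i = [], 0
    while i < len(text):
        best_len, best_dist = 0, 0
        # Search the already-seen prefix (capped at 1 KB here) for the longest match.
        for j in range(max(0, i - 1024), i):
            k = 0
            while i + k < len(text) and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= min_match:
            tokens.append(("ref", best_dist, best_len))  # pointer to earlier text
            i += best_len
        else:
            tokens.append(("lit", text[i]))              # literal character
            i += 1
    return tokens

print(toy_lz_tokens("the quick brown fox, the quick brown fox"))
# 21 literal tokens, then a single ("ref", 21, 19) pointer covering the repeated phrase
```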
Second, entropy coding assigns shorter codes to frequently appearing symbols and longer codes to rare ones. Methods like Huffman coding, arithmetic coding, and Asymmetric Numeral Systems (ANS) implement this principle. If the letter "e" appears 100 times and "z" appears twice, "e" gets a very short code and "z" a long one.
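A small Huffman sketch shows the principle; it computes only the code lengths, and real entropy coders such as arithmetic coding or ANS go further by spending fractional bits per symbol:

```python
import heapq
from collections import Counter

def huffman_code_lengths(text: str) -> dict[str, int]:
    """Return the bit length each symbol would get in a Huffman tree."""
    freq = Counter(text)
    # Heap entries: (total weight, tiebreak id, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

sample = "e" * 100 + "z" * 2 + "the quick brown fox jumps"
lengths = huffman_code_lengths(sample)
print(lengths["e"], lengths["z"])  # 'e' gets one of the shortest codes, 'z' one of the longest
```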
The Codec Spectrum: Different algorithms make different trade-offs on top of this foundation. Understanding where each codec sits on the speed-versus-ratio spectrum helps you choose correctly.
Why Zstandard Stands Out: Zstd represents modern codec design by combining wide windows (many megabytes versus zlib's 32 KB), branchless decoding that reduces CPU branch mispredictions, and ANS-based entropy coding. The result: at the same ratio as zlib, Zstd compresses 3 to 5 times faster and decompresses roughly 2 times faster.
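You can see the window-size effect with a small experiment. This is a sketch assuming the third-party `zstandard` package is installed; the exact ratios depend on level, data, and machine. It repeats an incompressible block at a distance well beyond zlib's 32 KB window:

```python
import os
import zlib
import zstandard  # third-party: pip install zstandard (assumed available)

# An incompressible 100 KB block repeated 8 times: each repeat sits ~100 KB
# back in the stream, outside zlib's 32 KB window but within Zstd's reach.
block = os.urandom(100_000)
data = block * 8

zlib_out = zlib.compress(data, 9)
zstd_out = zstandard.ZstdCompressor(level=19).compress(data)

print(f"zlib -9 : {len(data) / len(zlib_out):4.1f}x")   # stays near 1x
print(f"zstd -19: {len(data) / len(zstd_out):4.1f}x")   # approaches 8x
```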
Practical Example: A logging pipeline ingesting 5 GB/s might use Snappy at the producer level, which compresses at roughly 400 MB/s per core and adds under 1 millisecond to p99 latency. This halves network traffic from producers to brokers. For long-term storage, the same data is recompressed with Zstd at a higher level, achieving a 4x ratio and cutting storage costs substantially. The data is written once with high compression, then read many times, which is exactly where fast decompression matters.
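A minimal sketch of that two-stage flow, assuming the third-party `python-snappy` and `zstandard` packages; the function names and batch handling are illustrative, not any specific broker's API:

```python
import snappy      # third-party: pip install python-snappy (assumed)
import zstandard   # third-party: pip install zstandard (assumed)

def producer_encode(record: bytes) -> bytes:
    """Hot path: cheap, fast Snappy compression before the record leaves the producer."""
    return snappy.compress(record)

def archive_batch(snappy_records: list[bytes], level: int = 19) -> bytes:
    """Cold path: unpack the Snappy records and recompress the whole batch once with Zstd."""
    raw = b"".join(snappy.decompress(r) for r in snappy_records)
    return zstandard.ZstdCompressor(level=level).compress(raw)

def read_archive(blob: bytes) -> bytes:
    """Read path: Zstd decompression stays fast even when a high compression level was used."""
    return zstandard.ZstdDecompressor().decompress(blob)
```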
1. Speed-Focused (LZ4, Snappy): Compress at hundreds of MB/s per core with ratios around 1.5x to 2.5x. Perfect when CPU budget is tight and latency matters.
2. Balanced (zlib/gzip, Zstd): Classic zlib delivers a 3x ratio with moderate CPU cost. Zstandard (Zstd) achieves zlib-level ratios while running 3 to 5 times faster, or better ratios at the same speed.
3. Ratio-Focused (XZ, bzip2): Achieve 5x to 10x on text and logs but consume high CPU and take seconds for gigabyte-scale data. Used for cold archives.
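Published numbers only go so far; the quickest way to place codecs on this spectrum for your own workload is a rough single-core measurement. Here is a sketch, assuming the third-party `lz4` and `zstandard` packages and any large text-like file of yours in place of the path below:

```python
import time
import zlib
import lz4.frame   # third-party: pip install lz4 (assumed)
import zstandard   # third-party: pip install zstandard (assumed)

def bench(name, compress, data):
    start = time.perf_counter()
    out = compress(data)
    secs = time.perf_counter() - start
    print(f"{name:10s} ratio {len(data) / len(out):5.1f}x  "
          f"speed {len(data) / secs / 1e6:7.0f} MB/s")

# Substitute any large, text-like file you actually have (logs, JSON, CSV...).
data = open("sample.log", "rb").read()

bench("lz4", lz4.frame.compress, data)
bench("zlib -6", lambda d: zlib.compress(d, 6), data)
bench("zstd -3", zstandard.ZstdCompressor(level=3).compress, data)
bench("zstd -19", zstandard.ZstdCompressor(level=19).compress, data)
```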
"The codec choice isn't about finding the 'best' algorithm. It's about matching the algorithm's profile to your workload's CPU budget, latency constraints, and read/write ratio."
💡 Key Takeaways
✓ Dictionary compression finds repeated sequences and encodes them as references, while entropy coding assigns shorter codes to frequent symbols
✓ LZ4 and Snappy prioritize speed (hundreds of MB/s per core) with roughly 2x ratios, adding under 1 ms to p99 latency
✓ Zstandard achieves zlib-level 3x to 4x ratios while running 3 to 5 times faster on compression and 2 times faster on decompression
✓ Zstd uses wide windows (many MB versus 32 KB), branchless decoding, and ANS entropy coding for better CPU efficiency
✓ XZ and bzip2 reach 5x to 10x ratios but take seconds per gigabyte, suitable only for cold archives where read frequency is low
📌 Examples
1. A 5 GB/s logging pipeline uses Snappy at producers (400 MB/s per core, 2x ratio) to halve network traffic with minimal latency impact
2. The same pipeline recompresses to Zstd for storage, achieving a 4x ratio and turning 200 TB daily into 50 TB, saving 150 TB per day
3. Zlib with a 32 KB window limits historical context, while Zstd with multi-MB windows can reference patterns from much earlier in the stream