Compression Algorithms Trade-offs

What is Compression and Why Does It Matter at Scale?

Definition
Compression reduces data size by encoding information more efficiently. It trades CPU cycles for smaller storage and faster network transfers, critical when dealing with petabytes of data.
The Core Problem: At large scale, a single product can generate tens of petabytes of logs monthly and serve millions of requests per second. Without compression, three things become prohibitive: storage costs, replication traffic, and query latency.

Think of it this way: if you're storing 200 terabytes of raw event data per day and replicating it three times for durability, that's 600 TB daily for just one pipeline. Over a year, you're paying for storage and network capacity for 219 petabytes.

How Compression Helps: By shrinking data before storage or transmission, you reduce all three bottlenecks. A 4x compression ratio turns that 219 petabytes into roughly 55 petabytes, a massive cost difference.

The Three Key Metrics: Every compression algorithm is measured along three dimensions. First, compression ratio: original size divided by compressed size, typically ranging from 2x to 10x in practice. A 1 GB file compressed to 250 MB has a 4x ratio. Second, compression speed: how many megabytes per second (MB/s) one CPU core can compress. Third, decompression speed: how fast you can expand the data back, also in MB/s per core.
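A quick way to make these three metrics concrete is to benchmark a codec yourself. The sketch below uses Python's standard-library zlib purely for illustration; production codecs such as zstd or LZ4 will produce different numbers, but the measurement approach is the same.

```python
# Minimal sketch: measuring the three codec metrics (ratio, compression speed,
# decompression speed) with Python's built-in zlib. Illustrative only; real
# pipelines would benchmark production codecs on representative data.
import time
import zlib

def benchmark(data: bytes, level: int = 6) -> dict:
    t0 = time.perf_counter()
    compressed = zlib.compress(data, level)
    t1 = time.perf_counter()
    restored = zlib.decompress(compressed)
    t2 = time.perf_counter()
    assert restored == data  # lossless round trip

    mb = len(data) / 1e6
    return {
        "ratio": len(data) / len(compressed),   # original size / compressed size
        "compress_mb_per_s": mb / (t1 - t0),    # MB/s on one core
        "decompress_mb_per_s": mb / (t2 - t1),  # MB/s on one core
    }

# Repetitive log-like text compresses well; random bytes barely compress at all.
sample = b"timestamp=1700000000 user=42 action=click page=/home\n" * 50_000
print(benchmark(sample))
```

The exact figures depend heavily on the input data and compression level, which is why codecs are always compared on workload-representative samples.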
Typical Compression Impact
Ratio range: 2x to 10x
Network savings: 70%
The fundamental insight is that compression lets you move the bottleneck. If storage and network capacity grow slower than your data volume, compression becomes essential infrastructure, not optional optimization.
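To make the earlier numbers concrete, here is the back-of-envelope arithmetic from the 200 TB/day example as a small script. The daily volume, replication factor, and 4x ratio are the section's illustrative figures, not measurements.

```python
# Back-of-envelope capacity math from the example above (illustrative figures).
DAILY_RAW_TB = 200        # raw event data ingested per day
REPLICATION = 3           # copies kept for durability
DAYS = 365
COMPRESSION_RATIO = 4.0   # original size / compressed size

yearly_raw_pb = DAILY_RAW_TB * REPLICATION * DAYS / 1000   # TB -> PB
yearly_compressed_pb = yearly_raw_pb / COMPRESSION_RATIO

print(f"Replicated, uncompressed: {yearly_raw_pb:.0f} PB/year")        # ~219 PB
print(f"With 4x compression:      {yearly_compressed_pb:.0f} PB/year") # ~55 PB
```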
💡 Key Takeaways
Compression ratio divides original size by compressed size, typically achieving 2x to 10x reduction in practice
Three core metrics define any codec: compression ratio, compression speed (MB/s per core), and decompression speed
At scale, compression directly reduces storage costs, replication traffic, and query latency by shrinking data before it moves
Trading CPU cycles for smaller data becomes essential when storage and network capacity grow slower than data volume
A large pipeline generating 200 TB daily can save over 160 petabytes annually with 3x replication and 4x compression
📌 Examples
1. A social network ingesting 5 GB/s of raw events (5 million events per second) can halve network traffic with 2x compression, reducing infrastructure from 40 Gbit/s to 20 Gbit/s links (see the sketch after this list)
2. Storing 200 TB daily with 4x compression saves 150 TB per day, which, replicated 3 times and retained for a year, adds up to a massive cost difference
3. A 1 GB file compressed to 250 MB demonstrates a 4x compression ratio, reducing both storage footprint and transfer time proportionally
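As a sanity check on the first example's unit conversion, a tiny sketch (the 5 GB/s ingest rate and 2x ratio are the example's illustrative figures):

```python
# Unit check for the first example: bytes/s -> bits/s, then apply compression.
INGEST_GB_PER_S = 5        # raw event ingest, gigabytes per second
COMPRESSION_RATIO = 2.0

raw_gbit_per_s = INGEST_GB_PER_S * 8                      # 5 GB/s = 40 Gbit/s
compressed_gbit_per_s = raw_gbit_per_s / COMPRESSION_RATIO

print(f"Raw:        {raw_gbit_per_s:.0f} Gbit/s")         # 40 Gbit/s
print(f"Compressed: {compressed_gbit_per_s:.0f} Gbit/s")  # 20 Gbit/s
```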