
Production Scale: BigQuery, Snowflake, and Data Lake Encoding

The Scale Challenge: Consider a data warehouse with 5 petabytes of logical data and daily ingestion of 5 terabytes. Typical interactive BI workloads might run 500 to 1,000 queries per minute, with a service level objective (SLO) of p50 latency under 2 seconds and p95 under 5 seconds. Full scans of raw data at this scale are impossible within that latency, even with hundreds of nodes. The warehouse must reduce the bytes scanned per query, often by 5 to 20 times. Columnar storage is the first step, since most analytical queries touch only 10 to 20 percent of columns. Within each column, the encodings covered here (dictionary, RLE, delta) reduce size further.

Real World Numbers: For a country column with 200 possible values across a billion rows, dictionary encoding can reduce storage by 10 to 50 times compared to raw strings. If the column is also sorted by country, RLE on top of the dictionary indices can add another 2 to 5 times reduction, because you now encode runs like (ID 7, count 50 million).
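As a minimal illustration of those two layers, the Python sketch below dictionary-encodes a toy country column and then run-length-encodes the resulting integer codes. The data and helper names are illustrative, not any warehouse's actual API.

```python
def dictionary_encode(values):
    """Map each distinct string to a small integer code (dictionary encoding)."""
    dictionary = {}
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)   # assign the next available ID
        codes.append(dictionary[v])
    return dictionary, codes

def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return [(c, n) for c, n in runs]

# Toy column: imagine a billion rows with ~200 distinct countries, sorted by country.
column = ["DE", "DE", "DE", "FR", "FR", "US", "US", "US", "US"]
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)
print(dictionary)   # {'DE': 0, 'FR': 1, 'US': 2}
print(runs)         # [(0, 3), (1, 2), (2, 4)] -- runs like (ID, count) instead of raw strings
```

At warehouse scale the same idea yields runs like (ID 7, count 50 million), which is where the extra 2 to 5 times reduction on top of the dictionary comes from.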
Typical Query Performance Impact: a physical read rate of roughly 100 to 500 MB/s per core from disk or object storage, versus an effective logical scan rate of several GB/s per core once encoding and compression are accounted for.
On numeric time series, such as metrics at Meta or Netflix, delta encoding plus bit packing and an optional general-purpose compressor like Zstandard can achieve over 90 percent reduction. Facebook's Gorilla-style compression for time series uses delta-of-delta encoding for timestamps and XOR encoding for floating point values. That allowed them to store billions of metrics in memory at approximately 1 byte per point for timestamps and a few bytes per value, serving queries with p95 latencies in the tens of milliseconds.

Cost Implications: At cloud storage prices around 20 dollars per terabyte per month, improving compression from 2 times to 8 times on 5 petabytes of raw data saves hundreds of thousands of dollars per year: the physical footprint drops from about 2,500 TB to about 625 TB, and the roughly 1,875 TB difference works out to about 450,000 dollars annually. That is why vendors invest heavily in sophisticated encoding strategies.

In the query path, the engine reads encoded column chunks from disk or object storage at 100 to 500 megabytes per second per core. Because the data is compressed and encoded, the effective logical scan rate can be several gigabytes per second per core. Vectorized execution engines evaluate predicates and aggregations directly on the encoded integers, often without a full decode, which keeps CPU cycles and cache misses low.
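The sketch below illustrates the two Gorilla-style ideas mentioned above, delta-of-delta for timestamps and XOR of consecutive float bit patterns, on a toy series. The real system then bit-packs these mostly-zero residuals with variable-length codes, which is omitted here; this is a simplified sketch, not the production algorithm.

```python
import struct

def delta_of_delta(timestamps):
    """Store the first value, the first delta, then deltas of deltas.
    Regularly spaced series (e.g. one point per minute) yield mostly zeros,
    which a bit packer or general compressor shrinks dramatically."""
    if not timestamps:
        return []
    out = [timestamps[0]]
    prev, prev_delta = timestamps[0], None
    for t in timestamps[1:]:
        delta = t - prev
        out.append(delta if prev_delta is None else delta - prev_delta)
        prev, prev_delta = t, delta
    return out

def xor_floats(values):
    """XOR each float's 64-bit pattern with its predecessor's.
    Slowly changing metrics produce results with long runs of zero bits."""
    out = []
    prev_bits = 0
    for v in values:
        bits = struct.unpack("<Q", struct.pack("<d", v))[0]
        out.append(bits ^ prev_bits)
        prev_bits = bits
    return out

ts = [1700000000, 1700000060, 1700000120, 1700000180]   # one point per minute
vals = [72.0, 72.0, 72.5, 72.5]                         # a slowly changing metric
print(delta_of_delta(ts))   # [1700000000, 60, 0, 0]
print(xor_floats(vals))     # first entry is the raw bit pattern; repeated values XOR to 0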
✓ In Practice: This is how systems can scan billions of rows and still render a dashboard chart in under 1 second for common cases. The combination of columnar layout, smart encoding, and vectorized execution makes interactive analytics at petabyte scale economically viable.
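As a toy illustration of that vectorized execution on encoded data, the sketch below rewrites a string predicate against the dictionary once and then evaluates it directly on packed integer codes with NumPy. The column chunk and values are made up for the example; real engines operate on their own encoded page formats.

```python
import numpy as np

# Hypothetical encoded column chunk: a dictionary plus int32 codes, no strings in memory.
country_dict = {"DE": 0, "FR": 1, "US": 2}
country_codes = np.array([0, 0, 1, 2, 2, 2, 0, 1], dtype=np.int32)
revenue = np.array([10.0, 12.0, 7.0, 30.0, 25.0, 40.0, 11.0, 9.0])

# Rewrite the predicate country = 'US' against the dictionary once,
# then evaluate it on the packed integer codes -- no string decode needed.
target_code = country_dict["US"]
mask = country_codes == target_code   # vectorized comparison over int32 codes
print(revenue[mask].sum())            # aggregate only matching rows -> 95.0
```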
💡 Key Takeaways
Data warehouses with 5 petabytes and 500 to 1,000 queries per minute need 5 to 20 times byte reduction to meet p95 latency SLOs under 5 seconds
Dictionary plus RLE on sorted categorical columns achieves 20 to 250 times total compression: 10 to 50 times from dictionary, 2 to 5 times from RLE on top
Facebook Gorilla compression for time series uses delta of delta and XOR encoding to achieve approximately 1 byte per timestamp point and p95 latencies in tens of milliseconds
Improving compression from 2x to 8x on 5 petabytes at 20 dollars per terabyte per month saves hundreds of thousands of dollars annually in storage costs
Vectorized engines process encoded integers directly for filters and aggregations without full decode, achieving effective scan rates of several gigabytes per second per core
📌 Examples
1. BigQuery country column: 200 values across 1 billion rows compressed 10 to 50 times via dictionary, then 2 to 5 times more via RLE after sorting
2. Meta metrics storage: billions of time series points stored at approximately 1 byte per timestamp using delta of delta encoding
3. Parquet data lake: typical workload scans 10 to 20 percent of columns at 100 to 500 MB/s per core physical rate but several GB/s effective logical rate
4. Snowflake warehouse: 5 petabyte dataset with 5 terabyte daily ingestion serving 500 to 1,000 queries per minute under 2 second p50 latency