
How Do Columnar Encodings and Compression Codecs Achieve 10x or Greater Data Reduction?

Columnar systems apply multiple layers of compression tailored to each column's characteristics.

The first layer is encoding, which exploits semantic patterns in the data. Dictionary encoding replaces repeated string values with small integer IDs: a Country column with millions of "United States" entries stores the string once in a dictionary and references it everywhere with a 2-byte ID, often yielding 5 to 50 times reduction for low cardinality columns like country codes, device types, or status enums. Run Length Encoding (RLE) compresses consecutive repeated values into a single entry plus a count. When data is sorted by timestamp or user identifier, columns like subscription status or region become highly repetitive within blocks: instead of storing "premium" ten thousand times, RLE stores ("premium", count = 10,000) in a few bytes. Frame of Reference and delta encoding target numeric columns: if order identifiers increment from 1,000,000 to 1,001,000, store the base (1,000,000) once and represent the remaining values as tiny offsets or deltas that bit-pack into 4 to 16 bits each.

The second layer applies general purpose compression codecs to the encoded bytes. LZ4 and Snappy decompress at 1 to 5 gigabytes per second per core, making them ideal for hot data where the Central Processing Unit (CPU) is the bottleneck. Zstandard (ZSTD) achieves better compression ratios (often 1.5 to 2 times better than LZ4) but decompresses more slowly, making it better suited for cold storage or scans limited by network bandwidth. ClickHouse commonly processes billions of rows per second per node using LZ4 on well encoded columns, delivering aggregate scan throughput of 10 to 100 gigabytes per second across clusters.

In practice, systems auto-select encodings per column based on sampled statistics. Apache Pinot, used at LinkedIn and Uber, combines dictionary encoding plus bitmaps for dimensions with delta encoding for metrics, serving user facing analytics with p95 latencies under 100 milliseconds while scanning millions to billions of rows. The compound effect of encoding plus codec compression routinely delivers 3 to 10 times overall reduction: high cardinality or poorly sorted columns drop toward 1 to 2 times, while low cardinality, sorted columns reach 10 to 100 times.
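A minimal sketch of the dictionary encoding described above, in Python. The `dictionary_encode` helper, the Country values, and the 2-byte-per-ID cost are illustrative assumptions, not the layout of any particular engine.

```python
# Sketch of dictionary encoding for a low cardinality string column.
# Values and the 2-byte id cost are illustrative only.

def dictionary_encode(values):
    """Map each distinct string to a small integer id; store one id per row."""
    dictionary = {}  # value -> id
    ids = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        ids.append(dictionary[v])
    return dictionary, ids

column = ["United States"] * 7 + ["Canada"] * 2 + ["Mexico"]
dictionary, ids = dictionary_encode(column)

raw_bytes = sum(len(v.encode()) for v in column)                         # every string stored
encoded_bytes = sum(len(v.encode()) for v in dictionary) + 2 * len(ids)  # dictionary once + 2-byte ids

print(ids)                       # [0, 0, 0, 0, 0, 0, 0, 1, 1, 2]
print(raw_bytes, encoded_bytes)  # raw string bytes vs. dictionary + id bytes
```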
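A similarly minimal sketch of Run Length Encoding over a sorted column; the `rle_encode` helper and the subscription-status data are hypothetical.

```python
# Sketch of run-length encoding: consecutive repeats collapse to (value, count).

def rle_encode(values):
    """Collapse consecutive repeated values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(run) for run in runs]

status = ["premium"] * 10_000 + ["free"] * 5_000  # sorted, highly repetitive block
print(rle_encode(status))                         # [('premium', 10000), ('free', 5000)]
```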
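A sketch of Frame of Reference plus delta encoding on the order-identifier example from the text, along with the bit width a packer would need per value. The `delta_encode` helper and the 64-bit raw-integer comparison are illustrative assumptions.

```python
# Sketch of frame-of-reference + delta encoding for slowly incrementing integers.

def delta_encode(values):
    base = values[0]                                      # frame of reference, stored once
    deltas = [b - a for a, b in zip(values, values[1:])]  # consecutive differences
    return base, deltas

order_ids = list(range(1_000_000, 1_001_001))  # 1,000,000 .. 1,001,000
base, deltas = delta_encode(order_ids)

bits_per_delta = max(1, max(deltas).bit_length())  # width needed for bit packing
print(base, deltas[:3], bits_per_delta)            # 1000000 [1, 1, 1] 1
# 1,000 one-bit deltas bit-pack into ~125 bytes, vs. 8,008 bytes as raw 64-bit integers.
```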
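To illustrate the second layer, the sketch below runs a fast codec (LZ4) and a higher-ratio codec (ZSTD) over the same synthetic "encoded" bytes. It assumes the third-party `lz4` and `zstandard` Python packages are installed; the payload and compression level are arbitrary choices, not tuned settings from any real system.

```python
# Sketch of layering general purpose codecs over already-encoded column bytes.
# Requires: pip install lz4 zstandard
import lz4.frame
import zstandard

encoded = bytes(i % 16 for i in range(1_000_000))  # synthetic, repetitive encoded column

lz4_out = lz4.frame.compress(encoded)                            # fast codec for hot data
zstd_out = zstandard.ZstdCompressor(level=3).compress(encoded)   # higher ratio for cold data

print(len(encoded), len(lz4_out), len(zstd_out))  # ZSTD is usually smaller; LZ4 decompresses faster
assert lz4.frame.decompress(lz4_out) == encoded
assert zstandard.ZstdDecompressor().decompress(zstd_out) == encoded
```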
💡 Key Takeaways
Dictionary encoding replaces repeated strings with integer IDs, achieving 5 to 50 times reduction for low cardinality columns like country, device type, or status
Run Length Encoding (RLE) stores consecutive repeated values as a single entry plus count, highly effective when data is sorted by dimensions like timestamp or userId
Frame of Reference and delta encoding reduce numeric columns to small deltas (4 to 16 bits with bit packing) when values increment slowly, common for order IDs or timestamps
General purpose codecs layer on top: LZ4 and Snappy decompress at 1 to 5 gigabytes per second per core for hot data; Zstandard (ZSTD) provides 1.5 to 2 times better ratios for cold storage
The compound effect typically delivers 3 to 10 times compression, with well sorted low cardinality columns reaching 10 to 100 times and high cardinality unsorted columns dropping to 1 to 2 times
📌 Examples
LinkedIn Apache Pinot: dictionary encoding plus bitmaps on dimension columns and delta encoding on metrics enable p95 query latencies under 100 milliseconds while scanning billions of rows
ClickHouse with LZ4: vectorized columnar engine processes billions of rows per second per node, delivering 10 to 100 gigabytes per second aggregate scan throughput when data is sorted for maximal RLE and zone map pruning
Meta Presto on ORC: per-stripe dictionary and RLE encodings yield 3 to 8 times compression; queries see over 90% stripe pruning on selective predicates, converting multi-minute scans to seconds