Encoding Strategies (Dictionary, RLE, Delta)
How Dictionary and RLE Encoding Work at Scale
The Implementation Mechanics:
These encodings are applied per column chunk or page, not globally. A typical pattern in Parquet or Snowflake breaks each column into pages of 64 KB to 1 MB uncompressed. For each page, the system profiles the data, decides which encoding to use, then encodes and optionally compresses it.
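As a rough illustration, here is a minimal sketch of that per-page loop in Python. The page size constant, the decision rule, and all function names are hypothetical placeholders, not taken from Parquet or Snowflake:

```python
# Hypothetical per-page encoding driver: split a column chunk into pages,
# profile each page, and pick an encoding. The size budget and the
# decision heuristic are illustrative placeholders.

PAGE_BYTES = 1 << 20  # ~1 MB uncompressed page budget (assumed)

def split_into_pages(values, value_size=8):
    """Yield slices of the column that fit the page budget."""
    per_page = max(1, PAGE_BYTES // value_size)
    for i in range(0, len(values), per_page):
        yield values[i:i + per_page]

def encode_column_chunk(values):
    encoded = []
    for page in split_into_pages(values):
        distinct = len(set(page))          # profile the page
        encoding = "DICTIONARY" if distinct <= 65_536 else "PLAIN"
        encoded.append((encoding, page))   # a real encoder would serialize here
    return encoded

print([enc for enc, _ in encode_column_chunk(["US", "DE"] * 100_000)])
# ['DICTIONARY', 'DICTIONARY']
```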
Dictionary Encoding Process:
The encoder makes a first pass over the page to collect distinct values up to a size limit. Values go into a hash table. Once the dictionary is built, it gets serialized. A second pass replaces each value with its dictionary index. The index bit width adapts to dictionary size: 8 bits for up to 256 distinct values, 16 bits for up to 65,536. These indices are bit packed to avoid wasting bytes.
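A minimal sketch of those two passes, assuming the page is just a Python list; the packing layout below is illustrative, not Parquet's actual on-disk format:

```python
# Two-pass dictionary encoder sketch (illustrative layout).

def bit_width(n_distinct):
    """Smallest bit width that can index n_distinct values
    (e.g., 8 bits for up to 256, 16 bits for up to 65,536)."""
    width = 1
    while (1 << width) < n_distinct:
        width += 1
    return width

def dict_encode(page):
    # Pass 1: collect distinct values into a hash table, assigning IDs in
    # order of first appearance.
    dictionary = {}
    for v in page:
        dictionary.setdefault(v, len(dictionary))
    # Pass 2: replace each value with its dictionary index.
    indices = [dictionary[v] for v in page]
    width = bit_width(len(dictionary))
    # Bit pack the indices so no bytes are wasted on narrow IDs.
    packed = 0
    for i, idx in enumerate(indices):
        packed |= idx << (i * width)
    payload = packed.to_bytes((len(indices) * width + 7) // 8, "little")
    return list(dictionary), width, payload

dictionary, width, payload = dict_encode(["US", "US", "DE", "US", "FR", "DE"])
print(dictionary, width, len(payload))  # ['US', 'DE', 'FR'] 2 2
```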
RLE Hybrid Scheme:
RLE is often combined with bit packing. The encoder scans the sequence (often dictionary indices) and detects runs. For runs longer than a threshold (typically 4 to 8 values), it emits an RLE segment containing the value and run length. For scattered values, it emits a bit packed block of raw values.
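A simplified version of that hybrid in Python; the segment tuples and the threshold value are illustrative (the real Parquet format frames each run or bit packed group with a length header):

```python
# RLE/bit-pack hybrid sketch: runs at or above RUN_THRESHOLD become
# (value, count) segments; everything else is buffered into literal
# blocks that a real encoder would bit pack.

RUN_THRESHOLD = 4  # assumed threshold, within the 4-8 range above

def rle_hybrid_encode(indices):
    segments = []   # ("RLE", value, count) or ("LITERAL", [values...])
    literals = []
    i = 0
    while i < len(indices):
        # Measure the run starting at position i.
        j = i
        while j < len(indices) and indices[j] == indices[i]:
            j += 1
        run_len = j - i
        if run_len >= RUN_THRESHOLD:
            if literals:                      # flush pending scattered values
                segments.append(("LITERAL", literals))
                literals = []
            segments.append(("RLE", indices[i], run_len))
        else:
            literals.extend(indices[i:j])     # too short to be worth a run
        i = j
    if literals:
        segments.append(("LITERAL", literals))
    return segments

print(rle_hybrid_encode([5, 5, 5, 5, 5, 1, 2, 1, 7, 7, 7, 7]))
# [('RLE', 5, 5), ('LITERAL', [1, 2, 1]), ('RLE', 7, 4)]
```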
Dictionary Encoding in Four Steps:
1. Profile: scan the column page to collect distinct values (e.g., 150 unique country codes).
2. Build dictionary: store each unique value once with an ID (8 bits per ID suffices for 150 values).
3. Replace: substitute each occurrence with its bit packed integer ID.
4. Store: serialize the dictionary plus the packed ID stream together in the page.
Compression Impact: dictionary encoding alone typically delivers 10-50x; RLE on top adds another 2-5x.
Query Time Optimization:
Query engines exploit these encodings without always fully decoding. For dictionary encoding, filters like `country = 'US'` are translated into `dictionary_id IN {7}`, and the scan checks integer equality. Aggregations like COUNT DISTINCT operate on dictionary IDs and dictionary size instead of values. For RLE, a filter like `status = 'ACTIVE'` can skip entire runs of other statuses with a single comparison.
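To make the predicate translation concrete, here is a sketch that counts matches directly against the dictionary and RLE segments from the earlier sketches; all structures and names are hypothetical:

```python
# Evaluate a predicate on encoded data: the string comparison happens once,
# against the dictionary; the scan itself only touches integers and run lengths.

def count_matching(dictionary, segments, predicate_value):
    # Translate e.g. status = 'ACTIVE' into a single dictionary ID.
    try:
        target_id = dictionary.index(predicate_value)
    except ValueError:
        return 0  # value absent from the dictionary: nothing can match
    matches = 0
    for segment in segments:
        if segment[0] == "RLE":
            _, value, count = segment
            # One integer comparison accepts or skips the entire run.
            if value == target_id:
                matches += count
        else:
            _, values = segment
            matches += sum(1 for v in values if v == target_id)
    return matches

dictionary = ["EXPIRED", "ACTIVE"]
segments = [("RLE", 1, 50_000_000), ("RLE", 0, 30_000_000)]
print(count_matching(dictionary, segments, "ACTIVE"))  # 50000000
```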
At larger scale, systems implement adaptive encoding selection. During ingestion, the engine samples part of each column, evaluates candidate encodings by estimating their compressed sizes, then picks the best. This decision can be revisited during later compaction, when more data is available.
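A toy version of that selection loop, with deliberately crude size estimates standing in for a real engine's cost model:

```python
# Adaptive encoding selection sketch: sample the column, estimate each
# candidate's encoded size, and keep the smallest. The size formulas are
# rough assumptions, not any engine's actual cost model.

def estimate_sizes(sample, value_size=8):
    n = len(sample)
    distinct = set(sample)
    runs = 1 + sum(1 for a, b in zip(sample, sample[1:]) if a != b)
    id_bits = max(1, (len(distinct) - 1).bit_length())
    return {
        "PLAIN": n * value_size,
        "DICTIONARY": len(distinct) * value_size + (n * id_bits + 7) // 8,
        "RLE": runs * (value_size + 4),  # (value, 32-bit count) per run
    }

def choose_encoding(column, sample_size=10_000):
    sample = column[:sample_size]  # a real engine would sample more carefully
    sizes = estimate_sizes(sample)
    return min(sizes, key=sizes.get)

print(choose_encoding(["ACTIVE"] * 900 + ["EXPIRED"] * 100))  # RLE
```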
💡 Key Takeaways
✓ Dictionary encoding uses two passes: a first to build the dictionary hash table, a second to replace values with bit packed integer IDs whose width matches the dictionary size
✓ RLE hybrid encoding switches between run segments (value, count) for repetitive sequences and bit packed blocks for scattered values, based on a run length threshold of 4 to 8
✓ Bit packing adjusts integer width dynamically: 8 bits for up to 256 distinct values, 16 bits for up to 65,536, avoiding wasted bytes
✓ Query engines can filter on encoded data directly: translating `country = 'US'` to integer ID checks without a full decode improves scan performance
✓ Adaptive encoding selection during ingestion samples columns and estimates compression ratios to choose the best strategy per page
📌 Examples
1. Dictionary for 200 unique countries in 1 billion rows: store 200 strings once, then 1 billion 8 bit IDs instead of 1 billion full strings
2. RLE on a sorted status column: (ACTIVE, 50000000) (EXPIRED, 30000000) stores 2 segments instead of 80 million individual values
3. Combined approach: dictionary encode the country column to IDs, sort by country, then RLE the IDs for multiplicative compression
4. Filter pushdown: `WHERE status = 'ACTIVE'` becomes `WHERE dict_id = 5`, checked against RLE segments so non matching runs are skipped