
Choosing Encoding Strategies: When to Use What

The Decision Framework: Using these encodings is not free. You buy storage and I/O savings with additional CPU and complexity, and sometimes with reduced flexibility. The right choice depends on your data characteristics, query patterns, and operational constraints.
Dictionary Encoding: moderate cardinality (under 50,000 distinct values), stable distribution
vs.
Plain Encoding: high cardinality (millions of distinct values), uniform distribution
Dictionary Encoding Trade-offs: Dictionary encoding works best when cardinality is moderate and stable. If a column has millions of distinct values and the distribution is close to uniform, the dictionary is large and the integer IDs need many bits; at some point the cost of building and storing the dictionary exceeds the benefit. Some engines switch away from dictionary encoding when cardinality exceeds a threshold, for example 50,000 or 100,000 distinct values per page. Others keep a local per-page dictionary rather than a global one to limit memory. Dictionary thrashing occurs in streaming or frequently updated data: if new distinct values keep arriving (for example new SKUs or feature flags), the dictionary can grow beyond CPU cache and degrade performance. Some systems cap dictionary size and eventually spill rare values to a secondary structure or switch encoding modes for new data segments.

RLE Trade-offs: RLE trades update flexibility for read and storage efficiency. It is ideal for mostly append-only, sorted data. If you frequently update individual rows in the middle of a run stored as (value, count = 1,000,000), a single change may require splitting that run into multiple segments. That leads to write amplification and fragmentation. Many analytical systems accept this because they favor bulk appends and periodic compaction over row-level updates.

Delta Encoding Trade-offs: Delta encoding assumes correlation between adjacent values. If the sequence is noisy or unordered, the deltas may not be smaller than the original values, and they can even be larger when expressed relative to a base. That wastes bits and reduces compressibility. Time-series systems therefore sometimes reorder data or batch similar series together to preserve monotonic patterns.
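To make the three techniques concrete, here is a minimal sketch in plain Python. It is not tied to any particular format or engine, and the function names are illustrative; it only shows why cardinality and ordering drive the choice.

```python
from typing import Dict, List, Tuple

def dictionary_encode(values: List[str]) -> Tuple[Dict[str, int], List[int]]:
    """Map each distinct value to a small integer ID.
    Pays off when the dictionary stays small relative to the column;
    a real engine would abandon it past a per-page cardinality limit."""
    dictionary: Dict[str, int] = {}
    ids: List[int] = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        ids.append(dictionary[v])
    return dictionary, ids

def rle_encode(values: List[str]) -> List[Tuple[str, int]]:
    """Collapse consecutive repeats into (value, run_length) pairs.
    Only wins when the data is sorted or naturally runs-heavy."""
    runs: List[Tuple[str, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def delta_encode(values: List[int]) -> List[int]:
    """Store the first value, then differences between neighbours.
    Small deltas need few bits; noisy or unordered data defeats this."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

# Example: a low-cardinality status column and near-monotonic timestamps.
statuses = ["ACTIVE"] * 5 + ["INACTIVE"] * 3
timestamps = [1_700_000_000, 1_700_000_001, 1_700_000_003, 1_700_000_006]

d, ids = dictionary_encode(statuses)
print(d, ids)                    # {'ACTIVE': 0, 'INACTIVE': 1} [0, 0, 0, 0, 0, 1, 1, 1]
print(rle_encode(statuses))      # [('ACTIVE', 5), ('INACTIVE', 3)]
print(delta_encode(timestamps))  # [1700000000, 1, 2, 3]
```

The same inputs with a billion distinct user IDs or shuffled timestamps would invert the picture: the dictionary would dwarf the raw column and the deltas would be as wide as the originals.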
"The decision is not 'use dictionary everywhere.' It is: what is my cardinality, am I sorted, and what is my write pattern?"
Compared to General-Purpose Compressors: These domain-specific encodings are usually cheaper to decode than general-purpose compressors like Gzip or Zstandard, but they do not remove all redundancy. Most columnar formats first apply these encodings, then apply a generic compressor on top. The remaining trade-off is your CPU budget: for very CPU-constrained systems, you may choose lighter encodings and a faster but weaker compressor, such as LZ4, to keep p99 latency within your SLO.

When NOT to Use These Encodings: Skip dictionary encoding when cardinality per page exceeds 50,000 to 100,000 distinct values or when the estimated compression ratio falls below 1.1 times. Skip RLE when data is unsorted or values alternate frequently. Skip delta encoding when values are unordered or variance is too high. In these cases, fall back to plain encoding or lightweight generic compression.
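One way to express these "when not to use" rules is as a selection pass over a sample of each column chunk during ingestion. The sketch below is a rough heuristic, not the logic of any specific engine; the thresholds, size estimates, and function names are assumptions chosen to mirror the numbers quoted above.

```python
from typing import List

# Illustrative thresholds only, mirroring the heuristics in the text.
MAX_DICT_CARDINALITY = 50_000
MIN_COMPRESSION_RATIO = 1.1

def choose_encoding(sample: List[str], avg_value_bytes: float) -> str:
    """Pick an encoding for a column chunk from a sample of its values."""
    n = len(sample)
    distinct = len(set(sample))

    # Dictionary: compare plain size vs. dictionary + fixed-width 4-byte IDs.
    dict_bytes = distinct * avg_value_bytes + n * 4
    plain_bytes = n * avg_value_bytes
    dict_ratio = plain_bytes / dict_bytes if dict_bytes else 0.0

    # RLE: count how many runs the sample collapses into.
    runs = 1 + sum(1 for a, b in zip(sample, sample[1:]) if a != b) if n else 0

    # Delta would be evaluated similarly for numeric, near-monotonic columns.
    if distinct <= MAX_DICT_CARDINALITY and dict_ratio >= MIN_COMPRESSION_RATIO:
        return "DICTIONARY"
    if runs and n / runs >= 4:   # long average runs => sorted or repetitive data
        return "RLE"
    return "PLAIN"               # fall back and rely on generic compression

print(choose_encoding(["ACTIVE"] * 900 + ["INACTIVE"] * 100, avg_value_bytes=8.0))
# DICTIONARY: 2 distinct values and a ~2x size estimate clear both thresholds;
# an unsorted high-cardinality ID column would fall through to PLAIN instead.
```

The same check can be re-run during compaction, when more of the segment is visible, so a chunk that started as DICTIONARY can be rewritten as PLAIN if the initial cardinality estimate turns out to have been too optimistic.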
💡 Key Takeaways
Dictionary encoding should be skipped when cardinality exceeds 50,000 to 100,000 distinct values per page or when the compression ratio falls below 1.1 times
RLE is ideal for append-only, sorted data but creates write amplification when updating individual rows in the middle of runs stored as (value, count = 1,000,000)
Delta encoding fails on unordered or noisy sequences where deltas are not smaller than the original values, wasting bits and reducing compressibility
Domain-specific encodings are cheaper to decode than general-purpose compressors like Gzip or Zstandard; most formats apply them first and then a generic compressor, choosing a lighter one such as LZ4 on CPU-constrained systems
Adaptive selection during ingestion samples columns and estimates compression ratios, revisiting decisions during compaction when more data is available
📌 Examples
1. User ID column with 1 billion unique values: the dictionary would be larger than the raw data, so use plain encoding instead
2. Status column frequently updated mid-run: a single change to (ACTIVE, 1,000,000) requires splitting it into multiple segments, causing write amplification (see the sketch after this list)
3. Timestamps arriving out of order due to clock skew: deltas become large and irregular, breaking delta encoding's assumptions
4. CPU-constrained system with a p99 latency SLO under 3 seconds: use lighter encodings plus LZ4 instead of heavier Zstandard compression
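To make example 2 concrete, here is a hedged sketch of what a single in-place update does to a run-length-encoded segment. The representation and function name are illustrative, not a real storage engine's API.

```python
from typing import List, Tuple

Run = Tuple[str, int]  # (value, run_length)

def update_row(runs: List[Run], row: int, new_value: str) -> List[Run]:
    """Change one logical row inside an RLE segment.
    A single update can split one run into up to three pieces, which is
    the write amplification described in example 2."""
    out: List[Run] = []
    offset = 0
    for value, length in runs:
        if offset <= row < offset + length and value != new_value:
            before = row - offset
            after = length - before - 1
            if before:
                out.append((value, before))
            out.append((new_value, 1))
            if after:
                out.append((value, after))
        else:
            out.append((value, length))
        offset += length
    return out

# One run covering a million ACTIVE rows...
segment = [("ACTIVE", 1_000_000)]
# ...and a single mid-run update fragments it into three runs.
print(update_row(segment, 500_000, "SUSPENDED"))
# [('ACTIVE', 500000), ('SUSPENDED', 1), ('ACTIVE', 499999)]
```

Repeated point updates keep fragmenting the segment like this, which is why RLE-heavy systems prefer bulk appends followed by periodic compaction that rewrites the runs contiguously.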