Failure Modes and Edge Cases in Encoding Strategies
Dictionary Encoding Breakdown:
Dictionary encoding breaks down on very high cardinality or adversarial data. Imagine a user_id column with 1 billion unique values in a batch, or a free text field: the dictionary must store almost every value, plus an ID per row, plus mapping overhead. In the worst case, the encoded representation is larger than the raw data, especially once you account for dictionary metadata and alignment.
To avoid this, engines often sample the column first and fall back to plain encoding when the estimated compression ratio is below a threshold, such as 1.1x. The sampling adds ingestion latency but prevents catastrophic space expansion.
❗ Remember: Dictionary thrashing occurs when new distinct values keep arriving in streaming data. The dictionary grows beyond the CPU cache, degrading performance. Systems cap dictionary size and switch encoding modes for new segments.
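A minimal sketch of this sample-then-decide fallback, assuming a toy cost model (the function names, sample size, and 1.1x threshold are illustrative, not taken from any particular engine):

```python
import random

def estimated_dict_ratio(column, sample_size=10_000):
    """Estimate the dictionary compression ratio from a sample of the column."""
    sample = random.sample(column, min(sample_size, len(column)))
    distinct = len(set(sample))
    # Rough cost model: plain storage needs one slot per value; dictionary
    # storage needs one entry per distinct value plus a small ID per value.
    plain_cost = len(sample)
    dict_cost = distinct + 0.25 * len(sample)   # assume IDs ~4x smaller than values
    return plain_cost / dict_cost

def choose_encoding(column, threshold=1.1):
    """Fall back to plain encoding when the estimated ratio is below the threshold."""
    return "dictionary" if estimated_dict_ratio(column) >= threshold else "plain"

# Low-cardinality column: dictionary pays off. High-cardinality IDs: it does not.
print(choose_encoding(["US", "DE", "US", "FR"] * 1000))   # dictionary
print(choose_encoding(list(range(1_000_000))))            # plain
```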
Delta Encoding Pitfalls:
Delta encoding misbehaves when values are not ordered by time or by magnitude. For example, if timestamps arrive out of order due to clock skew or network reordering, the deltas become large and irregular. That breaks the assumption of small differences, hurts compression, and can overflow unless deltas are stored in wider integer types.
Engines either buffer and reorder within a time window or fall back to plain encoding when variance is too high. The buffering adds memory pressure and latency, creating a trade-off between compression quality and ingestion throughput.
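A small sketch of why out-of-order timestamps defeat delta encoding, and of the buffer-and-reorder mitigation (delta_encode is a hypothetical helper; real engines operate on bit-packed buffers rather than Python lists):

```python
def delta_encode(values):
    """Store the first value, then differences from the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

# In-order timestamps: deltas are small and need only a few bits each.
print(delta_encode([1000, 1001, 1003, 1004]))   # [1000, 1, 2, 1]

# Clock skew / network reordering: deltas are large, irregular, and signed,
# so they need wide integer types and compress poorly.
print(delta_encode([1000, 5000, 2000, 6000]))   # [1000, 4000, -3000, 4000]

# One mitigation: buffer a window of arrivals and sort before encoding,
# trading memory and ingestion latency for better compression.
window = sorted([1000, 5000, 2000, 6000])
print(delta_encode(window))                      # [1000, 1000, 3000, 1000]
```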
Compatibility and Evolution:
A subtle failure mode is compatibility and evolution. Once data is written with a particular encoding, changing encodings requires either on-the-fly decode and re-encode or keeping multiple formats readable. If the encoding implementation has a bug, it can corrupt huge swaths of historical data.
Production systems therefore version encodings and maintain strict backward-compatibility rules, which adds complexity to the storage layer. For example, Parquet has shipped multiple versions of its RLE and bit-packing schemes, and queries must handle mixed encodings across segments written at different times.
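A hedged sketch of how a reader keeps old segments decodable: each segment records the encoding version it was written with, and decoders are registered per version and never removed. The string tags and toy payload formats below are invented for illustration; they are not Parquet's actual wire format.

```python
def decode_plain(payload):
    # Toy "plain" format: comma-separated integers.
    return [int(x) for x in payload.split(",")]

def decode_rle_v1(payload):
    # Toy v1 RLE format: "value*count" pairs, e.g. "7*3,9*2" -> [7, 7, 7, 9, 9].
    out = []
    for pair in payload.split(","):
        value, count = pair.split("*")
        out.extend([int(value)] * int(count))
    return out

# Old decoders stay registered forever; new writers only add new versions.
DECODERS = {"plain": decode_plain, "rle_v1": decode_rle_v1}

def scan_column(segments):
    """Each segment carries the encoding version it was written with."""
    rows = []
    for encoding, payload in segments:
        decoder = DECODERS.get(encoding)
        if decoder is None:
            raise ValueError(f"unknown encoding {encoding!r}: refuse rather than guess")
        rows.extend(decoder(payload))
    return rows

# Segments written at different times, decoded in one scan.
print(scan_column([("rle_v1", "7*3,9*2"), ("plain", "7,8,9")]))
# -> [7, 7, 7, 9, 9, 7, 8, 9]
```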
Monitoring and Operational Reality:
To keep p99 query latency within, say, 3 seconds, you track per-column compression ratio, CPU time spent in decode, cache hit rates, and the percentage of data using each encoding. When decode CPU starts to dominate or compression ratios fall below target, you consider changing sort keys, partitioning, or encoding strategies.
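One way to picture that monitoring loop is a per-column health check like the sketch below (the metric fields and thresholds are illustrative assumptions, not standard values):

```python
from dataclasses import dataclass

@dataclass
class ColumnStats:
    name: str
    compression_ratio: float   # raw bytes / encoded bytes
    decode_cpu_share: float    # fraction of scan CPU spent decoding this column

def needs_attention(stats, min_ratio=2.0, max_decode_cpu=0.30):
    """Flag columns whose encoding is no longer paying for itself."""
    reasons = []
    if stats.compression_ratio < min_ratio:
        reasons.append("compression ratio below target; revisit sort key or encoding")
    if stats.decode_cpu_share > max_decode_cpu:
        reasons.append("decode CPU dominating; consider a cheaper encoding")
    return reasons

for col in [ColumnStats("event_type", 8.5, 0.05),
            ColumnStats("user_id", 1.05, 0.45)]:
    print(col.name, needs_attention(col) or "ok")
```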
RLE Worst Case:
RLE has a classic worst case: alternating values like ABABAB. Instead of compressing, the RLE representation becomes larger, because you store a count for every element. In practice, columnar systems apply RLE only when runs exceed a minimum length, for example 4 or 8 consecutive values.
Another failure mode is unsorted data. If you apply RLE before sorting by the RLE-candidate column, you rarely get long runs. This is why sort or clustering keys are critical design choices in warehouses: a poorly chosen clustering key can forfeit a factor of 5 to 10 in compression.
RLE Compression Variance: roughly 5-10x on sorted data versus about 1.1x on unsorted data.
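As a concrete illustration of the minimum-run-length rule and the sorted-versus-unsorted gap above, here is a toy run-length encoder (the output format and the threshold of 4 are illustrative):

```python
from itertools import groupby

def rle_encode(values, min_run=4):
    """Emit (value, run_length) only for runs >= min_run; short runs stay literal."""
    out = []
    for value, group in groupby(values):
        run = len(list(group))
        if run >= min_run:
            out.append(("run", value, run))
        else:
            out.extend(("lit", value) for _ in range(run))
    return out

unsorted_status = ["A", "B"] * 8            # ABAB... worst case: no runs at all
sorted_status = sorted(unsorted_status)     # AAAAAAAABBBBBBBB: two long runs

print(len(rle_encode(unsorted_status)))     # 16 literal entries -> no savings
print(len(rle_encode(sorted_status)))       # 2 run entries -> large savings
```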
⚠️ Common Pitfall: Encoding bugs can corrupt historical data at massive scale. Production systems version encodings and enforce backward compatibility to allow safe evolution and rollback.
This operational aspect is what differentiates a theoretical design from a production-ready encoding layer in a FAANG-scale data system. You need continuous monitoring, adaptive thresholds, and graceful fallbacks to handle the edge cases that inevitably appear at petabyte scale.

💡 Key Takeaways
✓ Dictionary encoding's worst case, such as 1 billion unique user IDs, produces an encoded representation larger than the raw data due to dictionary metadata and alignment overhead
✓ RLE on unsorted data achieves only about 1.1x compression versus 5 to 10x on sorted data, making the clustering key choice critical
✓ Timestamps arriving out of order due to clock skew cause large, irregular deltas that break compression assumptions and risk integer overflow
✓ Engines sample columns during ingestion and fall back to plain encoding when the estimated compression ratio is below 1.1x to avoid space expansion
✓ Encoding versioning and backward-compatibility rules are mandatory because implementation bugs can corrupt petabytes of historical data with no easy recovery
📌 Examples
1. Free text field in dictionary: each unique comment creates a new dictionary entry, causing the dictionary to exceed the original data size
2. Alternating status values ABABAB: RLE stores (A,1)(B,1)(A,1)(B,1), which is larger than plain encoding
3. Distributed logs with clock skew: timestamps 1000, 5000, 2000, 6000 produce deltas [0, +4000, −3000, +4000], which are no smaller than the raw values
4. Mixed encoding segments: a query must decode Parquet v1 RLE from 2020 alongside v2 hybrid RLE from 2023 in the same column scan