Data Storage Formats & Optimization • Encoding Strategies (Dictionary, RLE, Delta)Easy⏱️ ~3 min
Understanding Encoding Strategies: Dictionary, RLE, and Delta
Definition
Encoding strategies are compression techniques that exploit patterns in data to reduce storage size and speed up queries. The three fundamental types are Dictionary, Run Length Encoding (RLE), and Delta encoding.
status or country, creating long runs of the same value naturally.
Delta Encoding targets ordered numeric values that change gradually. Instead of storing 1000000, 1000010, 1000013, you store base 1000000 and deltas [0, 10, 3]. These smaller numbers fit in fewer bits and compress better. This pattern dominates time series and monotonically increasing IDs like timestamp or order_id.
✓ In Practice: Modern columnar systems like Parquet and Snowflake combine these encodings to get multiplicative benefits. A country column might use dictionary encoding, then apply RLE to the dictionary IDs after sorting.
These encodings are not just academic. They are how petabyte scale warehouses make queries fast enough to render dashboard charts in under 1 second while keeping storage costs manageable.💡 Key Takeaways
✓Dictionary encoding replaces repeated values with small integer IDs that reference a dictionary, saving space when cardinality is moderate (less than 50,000 distinct values per page)
✓Run Length Encoding (RLE) stores consecutive identical values as (value, count) pairs, most effective on sorted or clustered columns where long runs naturally occur
✓Delta encoding stores a base value plus small differences for ordered numeric sequences, reducing bit width and improving compression for time series and monotonic IDs
✓Modern systems combine these encodings multiplicatively: dictionary plus RLE on sorted categorical columns can achieve 10 to 50 times compression
✓The primary goal is reducing I/O bytes scanned, not CPU savings, since analytical workloads are bottlenecked by memory bandwidth and disk throughput
📌 Examples
1Dictionary: Country column with 200 unique values across 1 billion rows compresses 10 to 50 times compared to raw strings
2RLE: Sorted status column with runs like (ACTIVE, 50 million) (EXPIRED, 30 million) instead of storing each individual value
3Delta: Timestamps 1000000, 1000010, 1000013 become base 1000000 plus deltas [0, 10, 3] using fewer bits
4Combined: Dictionary encode country IDs, sort by country, then RLE the dictionary IDs for 2 to 5 times additional compression