
Understanding Encoding Strategies: Dictionary, RLE, and Delta

Definition
Encoding strategies are compression techniques that exploit patterns in data to reduce storage size and speed up queries. The three fundamental types are Dictionary, Run Length Encoding (RLE), and Delta encoding.
The Core Problem: Large analytical systems need to scan terabytes of data to answer dashboard queries within 1 to 3 seconds. The bottleneck is typically not CPU but I/O and memory bandwidth, so reading fewer bytes from disk or object storage translates directly into faster queries and lower costs.
Three Pattern-Exploiting Approaches:
Dictionary Encoding targets repeated values, especially in categorical columns. Instead of storing "United States" 500 million times as a string, you store it once in a dictionary and reference it with a small integer like 7. This works because strings take more space and are expensive to compare, while integers take fixed space and are cache friendly to compare.
Run Length Encoding targets consecutive sequences of identical values. Instead of storing A, A, A, A, you store (A, 4). This is powerful when data is sorted by a column like status or country, because sorting naturally creates long runs of the same value.
Delta Encoding targets ordered numeric values that change gradually. Instead of storing 1000000, 1000010, 1000013, you store base 1000000 and deltas [0, 10, 3], where each delta is the difference from the previous value (the first is 0 relative to the base). These smaller numbers fit in fewer bits and compress better. This pattern dominates time series and monotonically increasing columns like timestamp or order_id.
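The three encodings described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular engine implements them; the function names (dictionary_encode, rle_encode, delta_encode, delta_decode) are made up for this example, and real systems work on bit-packed binary pages rather than Python lists.

```python
from itertools import groupby


def dictionary_encode(values):
    """Return (dictionary, ids): each distinct value is stored once."""
    dictionary = []   # position in this list is the value's integer ID
    index = {}        # value -> ID, for fast lookup while encoding
    ids = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        ids.append(index[v])
    return dictionary, ids


def rle_encode(values):
    """Collapse consecutive identical values into (value, count) pairs."""
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]


def delta_encode(values):
    """Return (base, deltas); the first delta is 0 relative to the base."""
    if not values:
        return None, []
    deltas = [0] + [b - a for a, b in zip(values, values[1:])]
    return values[0], deltas


def delta_decode(base, deltas):
    """Rebuild the original sequence with a running sum over the deltas."""
    out, current = [], base
    for d in deltas:
        current += d
        out.append(current)
    return out
```

Calling delta_encode([1000000, 1000010, 1000013]) gives the base 1000000 and deltas [0, 10, 3] from the text, and delta_decode reverses it, which shows that the encoding is lossless.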
✓ In Practice: Modern columnar formats and warehouses such as Parquet and Snowflake combine these encodings to get multiplicative benefits. A country column might be dictionary encoded first, and then, once the data is sorted by country, RLE is applied to the resulting dictionary IDs.
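A rough sketch of that combination, under simplifying assumptions: the helper dictionary_then_rle is hypothetical, and it sorts a single column in isolation, whereas a real table would reorder whole rows by the sort key. It is not meant to mirror the internals of Parquet or Snowflake.

```python
from itertools import groupby


def dictionary_then_rle(values):
    """Sort a column, dictionary encode it, then run-length encode the IDs."""
    ordered = sorted(values)              # sorting makes equal values adjacent
    dictionary = sorted(set(ordered))     # each distinct value stored once
    index = {v: i for i, v in enumerate(dictionary)}
    ids = [index[v] for v in ordered]     # small integers instead of strings
    runs = [(i, len(list(g))) for i, g in groupby(ids)]  # (id, run length)
    return dictionary, runs


countries = ["US", "DE", "US", "FR", "DE", "US"]
dictionary, runs = dictionary_then_rle(countries)
print(dictionary)  # ['DE', 'FR', 'US']
print(runs)        # [(0, 2), (1, 1), (2, 3)]
```

Six strings shrink to a three-entry dictionary plus three (id, count) pairs. The number of runs can never exceed the number of distinct values once the column is sorted, which is where the multiplicative benefit comes from.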
These encodings are not just academic. They are how petabyte scale warehouses make queries fast enough to render dashboard charts in under 1 second while keeping storage costs manageable.
💡 Key Takeaways
Dictionary encoding replaces repeated values with small integer IDs that reference a dictionary, saving space when cardinality is low to moderate (as a rule of thumb, fewer than about 50,000 distinct values; the exact cutoff depends on the system and its dictionary size limit)
Run Length Encoding (RLE) stores consecutive identical values as (value, count) pairs, most effective on sorted or clustered columns where long runs naturally occur
Delta encoding stores a base value plus small differences for ordered numeric sequences, reducing bit width and improving compression for time series and monotonic IDs
Modern systems combine these encodings multiplicatively: dictionary plus RLE on sorted categorical columns can achieve 10 to 50 times compression
The primary goal is reducing I/O bytes scanned, not CPU savings, since analytical workloads are bottlenecked by memory bandwidth and disk throughput
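To make the bit-width claim for delta encoding concrete, here is a small calculation on the timestamps used in this lesson. It is only a sketch: bits_needed is a hypothetical helper, and it assumes non-negative deltas (a monotonically increasing column); handling negative deltas would need an extra step such as a sign bit.

```python
def bits_needed(n):
    """Bits required to represent the non-negative integer n (minimum 1)."""
    return max(1, n.bit_length())


timestamps = [1000000, 1000010, 1000013]
deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]  # [10, 3]

raw_width = max(bits_needed(t) for t in timestamps)   # widest raw value
delta_width = max(bits_needed(d) for d in deltas)     # widest delta

print(raw_width, delta_width)  # 20 4
```

Each raw value needs 20 bits, while each delta fits in 4 bits, so after the one-time cost of storing the base, every further value costs about a fifth of the space before any general-purpose compression is applied.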
📌 Examples
1. Dictionary: Country column with 200 unique values across 1 billion rows compresses 10 to 50 times compared to raw strings
2. RLE: Sorted status column stored as runs like (ACTIVE, 50 million), (EXPIRED, 30 million) instead of each individual value
3. Delta: Timestamps 1000000, 1000010, 1000013 become base 1000000 plus deltas [0, 10, 3], using fewer bits
4. Combined: Dictionary encode the country column, sort by country, then RLE the dictionary IDs for 2 to 5 times additional compression
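As a back-of-the-envelope check on the dictionary example, the estimate below compares raw string bytes with bit-packed dictionary IDs. The average string length of 10 bytes is an assumption chosen for illustration, not a figure from this lesson, and the function dictionary_ratio is hypothetical. It ignores per-string length prefixes, page headers, and any further compression.

```python
def dictionary_ratio(rows, distinct, avg_bytes):
    """Estimate raw size / dictionary-encoded size for a string column."""
    id_bits = max(1, (distinct - 1).bit_length())  # bits per bit-packed ID
    raw_size = rows * avg_bytes                    # every row stores the string
    encoded_size = rows * id_bits / 8 + distinct * avg_bytes  # IDs + dictionary
    return raw_size / encoded_size


# 200 distinct countries, 1 billion rows, assumed 10-byte average name
ratio = dictionary_ratio(rows=1_000_000_000, distinct=200, avg_bytes=10)
print(round(ratio, 2))  # 10.0
```

With 200 distinct values each ID fits in 8 bits, so the ratio is close to the average string length in bytes. Longer strings push the estimate toward the upper end of the 10 to 50 times range quoted above.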