How ORC Stripe Architecture Works
The Core Mechanism:
ORC organizes data into stripes, which are self-contained units of typically 64 MB to 256 MB of uncompressed data. Within each stripe, ORC divides rows into smaller row groups of 10,000 to 20,000 rows. The key insight is that statistics at both the stripe level and the row group level enable multi-level pruning.
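To make the layout concrete, here is a minimal sketch using PyArrow's ORC reader to inspect a file's stripes; the file name events.orc is hypothetical, and the exact set of exposed attributes depends on your PyArrow version.

```python
# Minimal sketch: inspect stripe layout with PyArrow's ORC reader.
# "events.orc" is a hypothetical file path used only for illustration.
import pyarrow.orc as orc

reader = orc.ORCFile("events.orc")

# Each stripe is a self-contained unit; the reader exposes how many
# stripes the file holds and the total row count.
print("stripes:", reader.nstripes)
print("rows:", reader.nrows)

# A single stripe can be read independently of the rest of the file.
first_stripe = reader.read_stripe(0)
print(first_stripe.num_rows, "rows in stripe 0")
```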
When you write data to ORC, the writer buffers incoming records until it accumulates enough rows to fill a stripe. For each column, it analyzes the data characteristics and selects an encoding strategy. A string column with only 100 distinct values across 5 million rows gets dictionary encoding: the 100 unique strings are stored once, and each row stores a small integer reference. An integer column with long runs of repeated values gets run-length encoding, storing just the value and count.
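A minimal writing sketch, assuming PyArrow is available; the column names and row counts are made up to mimic the low-cardinality and repeated-run cases described above. Which encoding the writer actually picks is an internal decision, so the comments describe the typical outcome rather than a guarantee.

```python
# Sketch (assumed setup): write a table whose columns are natural
# candidates for dictionary and run-length encoding.
import pyarrow as pa
import pyarrow.orc as orc

table = pa.table({
    # A handful of distinct country codes repeated across many rows:
    # an ORC writer typically dictionary-encodes a column like this.
    "country_code": ["US", "DE", "IN", "US"] * 250_000,
    # Long runs of an identical integer suit run-length encoding.
    "status": [200] * 1_000_000,
})

orc.write_table(table, "orders.orc")
```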
Statistics at Multiple Granularities:
ORC computes and stores statistics at two levels. First, stripe-level statistics cover all rows in the stripe: minimum value, maximum value, total count, and null count for each column. Second, row-group statistics provide the same metrics for each segment of 10,000 to 20,000 rows within the stripe.
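To make the pruning decision concrete, here is a small, self-contained sketch of the min/max test an engine can run at either level; the Stats class and the numbers are illustrative, not an ORC API.

```python
# Illustrative sketch: how min/max statistics let a reader skip work.
# The Stats class and the values below are hypothetical.
from dataclasses import dataclass

@dataclass
class Stats:
    minimum: int
    maximum: int

def can_skip(stats: Stats, lower_bound: int) -> bool:
    """True if every value falls at or below the predicate's lower bound,
    so this stripe or row group cannot match `column > lower_bound`."""
    return stats.maximum <= lower_bound

stripe_stats = Stats(minimum=1_000, maximum=5_000)
row_group_stats = Stats(minimum=4_200, maximum=4_900)

# Predicate: user_id > 10000. The stripe-level check already rules the
# stripe out; otherwise the same test would run per row group inside it.
print(can_skip(stripe_stats, 10_000))      # True -> skip entire stripe
print(can_skip(row_group_stats, 10_000))   # True -> skip this row group
```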
Real Performance Numbers:
At Meta, ORC optimizations on a 600 million row dataset reduced single-column query wall time by 3.5 to 4 times compared to older readers. CPU time dropped by about 4 times. Scaling to a dataset 10,000 times larger, the improvements held: 3.5 to 4.5 times faster wall time and 4.5 to 6.5 times lower CPU usage.
The combination of column pruning, predicate pushdown, and optimized encoding delivered these gains. Some workloads with aggressive filtering saw up to 30 times effective speedup because the engine avoided decoding most data entirely.
The Pruning Sequence:
1. Stripe pruning: the query engine reads the file footer and checks stripe statistics. A stripe whose timestamp range is 2024-01-01 to 2024-01-15 is skipped entirely if the query filters for dates after 2024-02-01.
2. Row group pruning: for each remaining stripe, the engine reads row group statistics. A row group with a user_id range of 1000 to 5000 is skipped if the query filters for user_id greater than 10000.
3. Data reading: only for the surviving row groups does the engine decompress and decode the actual column data (sketched in code below).
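A compact sketch of this flow using PyArrow's dataset API, assuming a hypothetical events.orc file with user_id and revenue columns. Whether the filter is evaluated against ORC statistics before decoding or applied after the scan depends on the reader version, so treat this as the shape of the call rather than a performance guarantee.

```python
# Sketch: column pruning plus a pushed-down predicate over an ORC file.
# File name and column names are assumptions for illustration.
import pyarrow.dataset as ds

dataset = ds.dataset("events.orc", format="orc")

table = dataset.to_table(
    columns=["user_id", "revenue"],        # column pruning
    filter=ds.field("user_id") > 10_000,   # predicate pushdown
)
print(table.num_rows)
```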
Meta ORC Performance Gains: 4x faster wall time, 6.5x lower CPU.
💡 Key Takeaways
✓ Stripes are 64 to 256 MB self-contained units; row groups within stripes are 10,000 to 20,000 rows, enabling a two-level pruning hierarchy
✓ Dictionary encoding stores unique values once and references them with small integers; it is effective when a column has 50 to 1000 distinct values across millions of rows
✓ Stripe-level statistics enable coarse pruning; row-group statistics enable fine-grained pruning within selected stripes, together skipping 80 to 95 percent of the data
✓ Meta measured 3.5 to 6.5 times performance improvements on real datasets ranging from 600 million to 6 trillion rows
📌 Examples
1. A stripe with order_date min 2024-01-01 and max 2024-01-15 is skipped when the query filters order_date >= 2024-03-01
2. A string column country_code with 200 distinct values is dictionary encoded: the 200 strings are stored once, and 10 million rows store 1-byte integer references
3. A query requesting columns user_id, revenue, and timestamp from a 200-column table reads only 3 column streams per stripe
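For example 3, a column-pruned read might look like the following sketch, assuming a hypothetical wide_table.orc that contains those columns.

```python
# Sketch: read only three column streams from a wide ORC file.
# "wide_table.orc" and its column names are assumptions.
import pyarrow.orc as orc

wide = orc.ORCFile("wide_table.orc")
subset = wide.read(columns=["user_id", "revenue", "timestamp"])
print(subset.schema)
```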