ORC Format & Optimization

What is ORC Format?

Definition
Optimized Row Columnar (ORC) is a columnar file format designed for efficient analytics on massive datasets in data lakes, used by query engines like Presto, Trino, Hive, and Spark.
The Problem It Solves

Imagine a table with 10 billion rows and 200 columns stored in a traditional row oriented format. When an analyst queries just 3 columns, the system still reads all 200 columns from disk, wasting enormous amounts of input/output (I/O), network bandwidth, and CPU time. Worse, row oriented formats compress poorly because adjacent data has mixed types and patterns.

How ORC Works

ORC flips the storage model. Instead of storing rows together, it stores each column separately within large chunks called stripes. A typical stripe contains 64 MB to 256 MB of uncompressed data. Within each stripe, ORC groups rows into smaller units, called row groups, of 10,000 to 20,000 rows each. Because each column contains homogeneous data, ORC applies specialized compression per column: string columns with low cardinality get dictionary encoding, and integer columns with repeated values get run length encoding. This achieves much better compression ratios than generic row compression.

The Key Innovation

ORC stores rich statistics at multiple levels: minimum, maximum, count, and null count for each column in every stripe and row group. Query engines use these statistics to skip entire stripes that cannot possibly match a query's filter conditions. This is called predicate pushdown.
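The stripe-skipping logic behind predicate pushdown can be sketched in a few lines. This is a simplified model, not ORC's actual reader: the stripe statistics here are hypothetical, and real engines decode them from the ORC file footer.

```python
# Sketch of stripe-level predicate pushdown using per-stripe min/max
# statistics (hypothetical values; real ORC stores these in file metadata).
stripes = [
    {"min": "2024-01-01", "max": "2024-01-15", "rows": 500_000},
    {"min": "2024-01-16", "max": "2024-01-31", "rows": 500_000},
    {"min": "2024-02-01", "max": "2024-02-14", "rows": 500_000},
]

def stripes_to_read(stripes, lower_bound):
    """Keep only stripes whose max value could satisfy `ts > lower_bound`.

    Any stripe whose maximum timestamp is <= the filter's lower bound
    cannot contain a matching row, so it is skipped without being read.
    """
    return [s for s in stripes if s["max"] > lower_bound]

# A filter for dates after 2024-02-01 eliminates the first two stripes.
kept = stripes_to_read(stripes, "2024-02-01")
print(len(kept))  # 1
```

ISO 8601 date strings compare correctly as plain strings, which keeps the sketch short; a real reader compares decoded timestamp values.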
Column Pruning Impact: a query touching 3 of 200 columns reads only those 3 column streams with ORC, versus all 200 columns in a row oriented layout.
When you read only 3 columns from a 200 column table, you avoid reading 98.5% of the data. Combined with stripe skipping from predicate pushdown, queries can avoid 80 to 95 percent of disk reads entirely.
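The 98.5% figure above is simple arithmetic, sketched here with the numbers from the running example:

```python
# Back-of-envelope I/O savings from column pruning,
# assuming all columns are roughly the same size on disk.
total_cols = 200
read_cols = 3

fraction_read = read_cols / total_cols  # 0.015, i.e. 1.5% of the data
savings = 1 - fraction_read             # fraction of I/O avoided

print(f"{savings:.1%}")  # 98.5%
```

The equal-column-size assumption is a simplification; in practice wide string columns dominate, so pruning them can save even more than the column count suggests.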
💡 Key Takeaways
ORC stores columns separately within large stripes (64 to 256 MB), enabling query engines to read only requested columns instead of entire rows
Each column uses specialized encoding: dictionary encoding for strings, run length encoding for repeated integers, achieving better compression than row formats
Statistics (min, max, count, null count) stored per stripe and row group enable predicate pushdown to skip 80 to 95 percent of irrelevant data
Designed for read heavy analytics on petabyte scale data lakes where storage, CPU, and network are all bottlenecks
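The two encodings named in the takeaways can be sketched minimally. These are illustrative implementations, not ORC's on-disk encodings; a real ORC writer selects and tunes encodings per column from the data itself.

```python
def dictionary_encode(values):
    """Replace repeated strings with small integer references
    into a dictionary of distinct values (low-cardinality columns)."""
    dictionary, refs, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        refs.append(index[v])
    return dictionary, refs

def run_length_encode(values):
    """Collapse runs of repeated values into [value, count] pairs
    (integer columns with repeated values)."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

print(dictionary_encode(["US", "DE", "US", "US"]))  # (['US', 'DE'], [0, 1, 0, 0])
print(run_length_encode([7, 7, 7, 3, 3]))           # [[7, 3], [3, 2]]
```

Both encodings exploit the homogeneity of columnar storage: within a single column, values share a type and repeat often, which row oriented layouts cannot exploit.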
📌 Examples
1. A query requesting 3 columns from a 200 column table reads only 1.5% of the data instead of 100%
2. A stripe with timestamp range 2024-01-01 to 2024-01-15 is entirely skipped when the query filters for dates after 2024-02-01
3. A string column with 50 distinct values across 10 million rows compresses to a dictionary of 50 entries plus integer references
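A rough size estimate for the dictionary encoding example can be worked out directly. The byte counts here are assumptions for illustration (20-byte average strings, 1-byte references), not ORC's exact on-disk layout, and real ORC further compresses both streams.

```python
# Rough size estimate for example 3: 50 distinct strings, 10 million rows.
rows = 10_000_000
distinct = 50
avg_str_bytes = 20  # assumed average encoded string length

raw_bytes = rows * avg_str_bytes                         # ~200 MB of raw strings
encoded_bytes = distinct * avg_str_bytes + rows * 1      # dictionary + 1-byte refs
                                                         # (1 byte suffices for 50 values)

print(raw_bytes // encoded_bytes)  # ~20x smaller before general compression
```

Because the reference stream is itself full of repeats, a general-purpose codec applied afterward typically shrinks it much further.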