What is ORC Format?
Definition
Optimized Row Columnar (ORC) is a columnar file format designed for efficient analytics over very large datasets in data lakes. It is supported by query engines such as Presto, Trino, Hive, and Spark.
Column Pruning Impact: without ORC, all 200 columns are read; with ORC, only the 3 requested columns are read.
💡 Key Takeaways
✓ ORC stores columns separately within large stripes (typically 64 to 256 MB), so query engines can read only the requested columns instead of entire rows
✓ Each column uses a type-specific encoding, such as dictionary encoding for strings and run-length encoding for repeated integers, which typically compresses better than row-oriented formats
✓ Statistics (min, max, value count, and whether nulls are present) are stored per stripe and per row group; with predicate pushdown, engines use them to skip data that cannot match a filter, which can avoid reading 80 to 95 percent of the data for selective queries on well-clustered data
✓ Designed for read-heavy analytics on petabyte-scale data lakes, where storage, CPU, and network can all be bottlenecks
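To make the encoding takeaway concrete, here is a minimal Python sketch of dictionary encoding and run-length encoding. It is a toy illustration only: the function names are hypothetical, and ORC's actual encodings are considerably more elaborate than this.

```python
from itertools import groupby


def dictionary_encode(values):
    """Store each distinct value once and replace rows with integer codes."""
    dictionary = sorted(set(values))
    index = {value: code for code, value in enumerate(dictionary)}
    codes = [index[value] for value in values]
    return dictionary, codes


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


# A low-cardinality string column: two distinct values across six rows.
countries = ["US", "US", "DE", "US", "DE", "DE"]
dictionary, codes = dictionary_encode(countries)

# An integer column with long runs of the same value.
status_codes = [200, 200, 200, 200, 404, 404, 200]
runs = run_length_encode(status_codes)
```

In this sketch, `countries` becomes a two-entry dictionary plus small integer codes, and `status_codes` shrinks from seven values to three runs. Both transformations are lossless, which is why they suit columns with few distinct values or long repeats.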
📌 Examples
1. A query requesting 3 columns from a 200-column table reads roughly 1.5% of the data (3/200) instead of 100%, assuming the columns are of similar size
2. A stripe whose timestamps range from 2024-01-01 to 2024-01-15 is skipped entirely when the query filters for dates after 2024-02-01
3. A string column with 50 distinct values across 10 million rows is stored as a dictionary of 50 entries plus one small integer reference per row
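The three examples above can be checked with a short Python sketch. The `StripeStats` class and `can_skip_for_greater_than` helper are hypothetical stand-ins for ORC's per-stripe statistics, not part of any ORC library. The size estimate assumes equally sized columns and simple bit-packed dictionary references before any further compression.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass
class StripeStats:
    """Hypothetical stand-in for the min/max statistics kept per stripe."""
    min_value: date
    max_value: date


def can_skip_for_greater_than(stats, threshold):
    """A filter `column > threshold` cannot match if the stripe max is <= threshold."""
    return stats.max_value <= threshold


# Example 1: fraction of data read with column pruning (equal-size columns assumed).
fraction_read = 3 / 200

# Example 2: the stripe covers early January; the query wants dates after 2024-02-01.
stripe = StripeStats(min_value=date(2024, 1, 1), max_value=date(2024, 1, 15))
skip_stripe = can_skip_for_greater_than(stripe, date(2024, 2, 1))

# Example 3: rough size of the integer references for a 50-value dictionary.
distinct_values = 50
rows = 10_000_000
bits_per_reference = math.ceil(math.log2(distinct_values))
reference_bytes = rows * bits_per_reference // 8

print(f"fraction read: {fraction_read:.1%}")
print(f"skip stripe: {skip_stripe}")
print(f"bits per reference: {bits_per_reference}, total bytes: {reference_bytes:,}")
```

Under these assumptions, each row needs only 6 bits to reference one of 50 dictionary entries, so 10 million references take about 7.5 MB before general-purpose compression. Storing the full string in every row would cost far more.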