ORC in Production Data Pipelines
The Production Flow:
In real analytics stacks at companies like Meta, Netflix, and LinkedIn, ORC fits into a multi-stage pipeline. Events arrive at high velocity, 100,000 to 1 million events per second, through streaming systems like Kafka. They are initially written in flexible formats like JSON or Avro for fast ingestion and easy schema evolution. Batch jobs then compact and transform this raw data into ORC for query optimization.
Consider a typical daily batch process. Overnight, a Spark or Hive job reads 24 hours of raw Avro data, applies transformations, partitions by date and region, and writes ORC files to object storage such as Amazon S3 or Google Cloud Storage (GCS), or to the Hadoop Distributed File System (HDFS). Each partition might contain 50 to 200 ORC files, each 128 MB to 512 MB compressed. Query engines like Presto or Trino then read directly from these ORC files when analysts run queries.
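A minimal PySpark sketch of such a daily batch job is below. The bucket paths, column names (event_time, user_id, region), and compression choice are illustrative assumptions, not a specific company's pipeline.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Illustrative sketch only; paths and columns are hypothetical.
spark = (
    SparkSession.builder
    .appName("daily-avro-to-orc")
    .config("spark.sql.orc.compression.codec", "snappy")  # or zlib/zstd
    .getOrCreate()
)

# Read the previous day's raw Avro events (requires the spark-avro package).
raw = spark.read.format("avro").load("s3://raw-events/ingest_date=2024-03-01/")

# Light transformation: derive partition columns, drop malformed rows.
curated = (
    raw
    .withColumn("event_date", F.to_date("event_time"))
    .filter(F.col("user_id").isNotNull())
)

# Write ORC partitioned by date and region; control the number of output files
# per partition so each lands in the 128-512 MB range.
(
    curated
    .repartition("event_date", "region")
    .write
    .mode("overwrite")
    .partitionBy("event_date", "region")
    .orc("s3://warehouse/events_orc/")
)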
Scale and Impact:
The numbers are massive. A single large table might have 10 billion rows per day, 365 partitions per year, and 200 columns. That's 3.65 trillion rows annually. Storing this naively in row format would consume petabytes and make queries impossibly slow. With ORC, the same data compresses 3 to 5 times better, and queries that previously took minutes complete in seconds.
✓ In Practice: A Presto cluster handling 500 queries per minute at a p95 latency of 30 seconds can, with optimized ORC usage, support 1,500 queries per minute at similar latency, or deliver the same throughput on a cluster 2 to 3 times smaller.
Lazy Reads and Two-Phase Execution:
Advanced query engines implement lazy reads, which dramatically amplify ORC benefits. In the first phase, the engine reads only columns needed for filters and joins: typically identifiers like user_id, timestamps like event_time, and maybe category codes. It evaluates predicates and eliminates 90 to 99 percent of rows.
Only in the second phase does it read heavy payload columns like long text fields, JSON blobs, or wide arrays for the surviving rows. Meta reported that lazy reads alone yielded up to 18 times speedup on some workloads. Combined with predicate pushdown, synthetic benchmarks showed up to 80 times total speedup.
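The two-phase pattern can be sketched outside a query engine with PyArrow's ORC reader. The file name, column names, and predicate here are assumptions (they mirror the example further below); a real engine applies this per stripe and row group inside the reader rather than per whole file.

import datetime

import pyarrow.compute as pc
import pyarrow.orc as orc

# Sketch of a two-phase (lazy) read; file and column names are hypothetical,
# and event_date is assumed to be a date-typed column.
f = orc.ORCFile("events.orc")

# Phase 1: read only the cheap filter columns and evaluate the predicate.
filters = f.read(columns=["event_date", "user_id"])
mask = pc.and_(
    pc.greater_equal(filters["event_date"], datetime.date(2024, 3, 1)),
    pc.greater(filters["user_id"], 1_000_000),
)

# Phase 2: read the heavy payload column only, then keep surviving rows.
payload = f.read(columns=["event_payload"])
survivors = payload.filter(mask)
print(f"kept {survivors.num_rows} of {payload.num_rows} rows")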
Operational Considerations:
You need compaction processes. Streaming jobs that flush ORC files every few minutes create thousands of tiny files, which inflates metadata and explodes the number of scheduled tasks. Compaction jobs periodically rewrite roughly 1,000 small files into 10 large files, preserving sort order when it benefits range pruning.
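A compaction pass can be a small Spark job along these lines; the paths, target file count, and user_id sort key are hypothetical, and in practice you would write to a staging location and swap it in atomically rather than overwrite live data in place.

from pyspark.sql import SparkSession

# Illustrative compaction sketch; paths, file counts, and sort key are hypothetical.
spark = SparkSession.builder.appName("orc-compaction").getOrCreate()

src = "s3://warehouse/events_orc/event_date=2024-03-01/region=us/"
staging = "s3://warehouse/_staging/events_orc/event_date=2024-03-01/region=us/"

small_files = spark.read.orc(src)

# Rewrite ~1000 small files into ~10 large ones. sortWithinPartitions keeps rows
# ordered by user_id inside each output file so min/max statistics stay tight
# and range pruning keeps working after compaction.
(
    small_files
    .repartition(10)
    .sortWithinPartitions("user_id")
    .write
    .mode("overwrite")
    .orc(staging)
)
# A separate step would then atomically swap the staging directory over src.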
Partitioning strategy matters. Partitioning by high-cardinality keys like user_id creates far too many partitions. Partitioning by low-cardinality keys like date or region provides coarse pruning at the directory level, and ORC statistics then handle fine-grained pruning within each partition.
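In PySpark terms, the division of labor looks roughly like this. Paths and column names continue the earlier hypothetical example; spark.sql.orc.filterPushdown (enabled by default in Spark 3.x) is what pushes the user_id predicate down into the ORC reader.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pruning-demo").getOrCreate()

events = spark.read.orc("s3://warehouse/events_orc/")  # hypothetical table path

# Coarse pruning: the filters on the low-cardinality partition columns mean only
# the event_date=2024-03-01/region=us directory is listed and scanned.
# Fine pruning: the user_id predicate is pushed into the ORC reader, which skips
# stripes and row groups whose min/max statistics cannot satisfy it.
result = (
    events
    .filter(
        (F.col("event_date") == "2024-03-01")
        & (F.col("region") == "us")
        & (F.col("user_id") > 1_000_000)
    )
    .select("user_id", "event_payload")
)
result.explain()  # the physical plan lists PartitionFilters and PushedFilters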
Query Throughput Impact: 500 qpm before → 1,500 qpm after ORC optimization
💡 Key Takeaways
✓ Production pipelines use a dual-format strategy: Avro or JSON for raw ingestion (flexible, fast writes), then ORC for curated, query-optimized tables (efficient reads)
✓ Lazy reads execute in two phases: first read only filter columns to eliminate 90 to 99 percent of rows, then read heavy payload columns only for survivors, yielding 18 to 80 times speedup
✓ Compaction is critical: streaming jobs creating thousands of 1 MB files cause metadata and scheduling overhead; rewrite them into tens of 128 to 512 MB files
✓ Proper partitioning by low-cardinality keys (date, region) provides coarse directory-level pruning; ORC statistics handle fine-grained pruning within partitions
📌 Examples
1. Daily batch job reads 24 hours of Avro events (10 billion rows), transforms them, partitions by date and region, and writes 200 ORC files per partition to S3
2. A query filtering on event_date >= '2024-03-01' and user_id > 1000000 first reads only the event_date and user_id columns, eliminates 95% of rows, then reads event_payload for the remaining 5%
3. Compaction job rewrites 5,000 ORC files (average 2 MB each) from hourly streaming into 50 files (200 MB each) overnight