Query Planning: How Metadata Pruning Accelerates Reads
The Metadata Leverage Effect:
At petabyte scale with millions of files, reading the table of contents is often more expensive than reading the actual data. Iceberg inverts the typical data lake problem by storing rich statistics in manifest files so query engines can eliminate most files without touching them. This transforms query planning from minutes to milliseconds.
Statistics Per Data File:
Each manifest entry for a data file includes detailed metadata. First, basic metrics: row count and total size in bytes. Second, per-column statistics: null count, minimum value, and maximum value. Third, partition values if the table is partitioned. A manifest file groups thousands of these entries together.
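To make the shape of that metadata concrete, here is a minimal Python sketch of one manifest entry. The field names and values are illustrative, not Iceberg's exact Avro schema.

```python
from dataclasses import dataclass, field

@dataclass
class ColumnStats:
    null_count: int
    min_value: object   # lower bound for this column in this data file
    max_value: object   # upper bound for this column in this data file

@dataclass
class ManifestEntry:
    file_path: str
    record_count: int                # row count
    file_size_bytes: int             # total size in bytes
    partition: dict                  # partition values, e.g. {"event_date": "2024-01-15"}
    column_stats: dict = field(default_factory=dict)   # column name -> ColumnStats

# One entry per data file; a manifest file groups thousands of these entries.
entry = ManifestEntry(
    file_path="s3://warehouse/events/date=2024-01-15/part-00042.parquet",
    record_count=2_500_000,
    file_size_bytes=512 * 1024 * 1024,
    partition={"event_date": "2024-01-15"},
    column_stats={"user_id": ColumnStats(null_count=0, min_value=1_203_441, max_value=1_398_760)},
)
```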
When a query arrives, the engine reads the table metadata file (a few kilobytes) to find the current snapshot, then reads the manifest list (tens of kilobytes) to find relevant manifest files, then reads those manifest files (megabytes total) to discover which data files to scan. This metadata traversal happens before any data file is opened.
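A simplified sketch of that traversal, with plain dictionaries standing in for the metadata file, manifest list, and manifest files, and two caller-supplied callables standing in for Iceberg's expression evaluators:

```python
# Planning walk: table metadata -> manifest list -> manifest files -> data files.
# No data file is opened; everything below reads only metadata structures.

def plan_files(table_metadata, may_match_manifest, may_match_file):
    # 1. Table metadata (a few KB): locate the current snapshot.
    snapshot = table_metadata["snapshots"][table_metadata["current_snapshot_id"]]

    # 2. Manifest list (tens of KB): keep manifests whose partition summaries
    #    could still satisfy the query's filter.
    manifests = [m for m in snapshot["manifest_list"] if may_match_manifest(m)]

    # 3. Manifest files (MBs total): keep data files whose partition values and
    #    per-column min/max bounds could still match.
    return [
        entry["file_path"]
        for manifest in manifests
        for entry in manifest["entries"]
        if may_match_file(entry)
    ]
```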
Concrete Pruning Example:
Consider a table with 1 million Parquet files totaling 100 TB, storing click events over two years. The table is partitioned by date. You query for events on 2024-01-15 where user_id is between 1000000 and 2000000.
First, partition pruning eliminates files not in the date range. With daily partitions over 730 days, you keep only 1,400 files for that single day. Second, column statistics pruning checks user_id min and max in each manifest entry. Files where max user_id is below 1000000 or min user_id is above 2000000 are skipped. This eliminates 90% of the remaining files, leaving 140 files to actually scan.
Pruning cascade: 1M files initially → 1,400 files after the date filter → 140 files after the user_id filter.
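An engine or client that understands Iceberg metadata drives this whole cascade from a single filter expression. A hedged sketch using PyIceberg; the catalog name, table name, and column names are assumptions about the environment, and the exact filter-string syntax may vary by version:

```python
from pyiceberg.catalog import load_catalog

# Assumes a catalog named "default" is configured (e.g. in .pyiceberg.yaml)
# and a table "analytics.click_events" with event_ts and user_id columns exists.
catalog = load_catalog("default")
table = catalog.load_table("analytics.click_events")

# One filter; PyIceberg prunes manifests and data files using metadata alone.
scan = table.scan(
    row_filter=(
        "event_ts >= '2024-01-15T00:00:00' and event_ts < '2024-01-16T00:00:00' "
        "and user_id >= 1000000 and user_id <= 2000000"
    )
)

# plan_files() yields one task per surviving data file; no Parquet has been read yet.
tasks = list(scan.plan_files())
print(f"{len(tasks)} data files survive metadata pruning")
```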
The query engine reads perhaps 50 MB of manifest data (manifest files often compress well and cache effectively) in 100 to 200 milliseconds, then scans 140 files of 512 MB each (72 GB total) from object storage. With 100 parallel readers at 100 MB/sec each, data scan completes in under 10 seconds.
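The back-of-the-envelope arithmetic behind that estimate, using the numbers above:

```python
files, file_size_mb = 140, 512          # surviving files and their size
readers, per_reader_mb_s = 100, 100     # parallel readers and per-reader throughput

total_gb = files * file_size_mb / 1000                 # ~72 GB to scan
aggregate_mb_s = readers * per_reader_mb_s             # 10,000 MB/s across the cluster
scan_seconds = files * file_size_mb / aggregate_mb_s   # ~7.2 seconds
print(f"~{total_gb:.0f} GB scanned in ~{scan_seconds:.1f} s")
```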
Hidden Partitioning Advantage:
Unlike Hive tables where users must explicitly filter on partition columns, Iceberg applies partition pruning automatically. You can partition by date internally, but query just by timestamp. Iceberg transforms the timestamp filter to a date filter and prunes accordingly. You can even evolve partitioning over time: old data partitioned daily, new data partitioned hourly. Queries work seamlessly across both.
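Conceptually, the engine rewrites the user's timestamp predicate into a predicate on the partition transform. A minimal sketch of the day-transform idea (simplified; Iceberg's real transforms operate on days-since-epoch ordinals):

```python
from datetime import datetime, timezone, date

def day_transform(ts: datetime) -> date:
    """Iceberg-style 'day' transform: map a timestamp to its UTC calendar day."""
    return ts.astimezone(timezone.utc).date()

def partition_range_for(ts_lo: datetime, ts_hi: datetime) -> tuple:
    """Rewrite a timestamp-range filter into a day-partition-range filter."""
    return day_transform(ts_lo), day_transform(ts_hi)

# The user filters by raw timestamp and never mentions the partition column.
lo = datetime(2024, 1, 15, 8, 30, tzinfo=timezone.utc)
hi = datetime(2024, 1, 15, 17, 45, tzinfo=timezone.utc)
print(partition_range_for(lo, hi))   # both bounds map to 2024-01-15, so only that day's partitions are read
```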
⚠️ Common Pitfall: If you write millions of tiny files (1 to 10 MB each), manifest files grow huge because each file needs an entry. Query planning slows from sub-second to 10+ seconds as engines read hundreds of megabytes of manifest data. Run compaction jobs to merge small files into 512 MB to 1 GB target sizes.
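The grouping step inside such a compaction job is essentially bin-packing small files up to the target size. A toy sketch of that grouping only (a real rewrite, such as Iceberg's rewrite-data-files action, also rewrites the files and commits a new snapshot):

```python
TARGET_BYTES = 512 * 1024 * 1024   # aim for ~512 MB output files

def plan_compaction_groups(file_sizes):
    """Greedy bin-packing: bundle small files until a group reaches the target size."""
    groups, current, current_bytes = [], [], 0
    for size in sorted(file_sizes):
        if current and current_bytes + size > TARGET_BYTES:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        groups.append(current)
    return groups

# 10,000 tiny 5 MB files collapse into about 100 groups of roughly 510 MB each.
groups = plan_compaction_groups([5 * 1024 * 1024] * 10_000)
print(len(groups))
```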
The Scale Numbers:
Netflix operates Iceberg tables with billions of rows and petabytes of data. Query planning for typical analytical queries completes in 50 to 500 milliseconds, pruning 95% to 99% of files. Apple reports similar pruning ratios, achieving interactive query latency even on datasets with hundreds of thousands of files. This performance requires disciplined file sizing and regular manifest compaction to keep metadata overhead in check.
💡 Key Takeaways
✓Manifest files store min and max values for every column in every data file, enabling aggressive pruning during query planning without reading actual data files.
✓Typical pruning eliminates 95% to 99% of files at petabyte scale, reducing a scan of 1 million files to 10,000 or fewer files in 100 to 200 milliseconds of metadata reading.
✓Hidden partitioning decouples the physical layout from the query interface. Users query by any column, and Iceberg automatically applies partition- and statistics-based pruning.
✓Metadata bloat is the enemy. With millions of small files, manifest files grow to hundreds of megabytes, pushing planning time from milliseconds to seconds or worse. Compaction is essential.
✓Columnar statistics work best for sorted or clustered data. If <code>user_id</code> is randomly distributed in each file, min and max span the full range, and statistics pruning has no effect.
📌 Examples
1. A table with 500,000 files stores transaction data partitioned by date and clustered by <code>customer_id</code>. Query for transactions on a specific date where <code>customer_id</code> equals 42. Date pruning reduces this to 700 files. Because the data is clustered, each file covers a narrow range of roughly 100 customer IDs, so statistics pruning on <code>customer_id</code> reduces it further to 7 files. Total planning time: 80 milliseconds.
2. An e-commerce table with 2 million files (200 TB) is queried for orders in the last 7 days where <code>order_total</code> exceeds 1000 dollars. Partition pruning by timestamp reduces this to 10,000 files. Column statistics on <code>order_total</code> eliminate files where the max is below 1000, leaving 500 files. The resulting data scan of roughly 50 GB completes in about 5 seconds with 100 parallel readers.