Production Scale: Partitioning and Small File Problem
The Small File Explosion:
When you ingest data continuously, you naturally create many small files. A streaming job writing every 5 minutes generates 288 batches per day. If each batch creates 10 files, you have 2,880 files daily, or 1 million files per year for a single table. Object storage handles this fine, but query planning suffers. Reading metadata for 1 million files can take 5 to 30 seconds, pushing your dashboard query p99 latency from 2 seconds to 20+ seconds.
Columnar formats like Parquet also perform poorly with small files. Parquet uses row groups (typically 128 MB to 1 GB uncompressed) for efficient column scanning. A 10 MB file holds at most one undersized row group, which hurts compression ratios and adds per-file scan overhead. At query time, opening 1,000 small files means 1,000 S3 API calls and context switches, versus opening 10 large files.
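A back-of-the-envelope sketch of that growth, using the section's illustrative numbers (the per-file metadata cost is an assumed range, not a measurement):

```python
# Back-of-the-envelope math for small-file growth and query-planning overhead,
# using this section's illustrative numbers (not measurements).
batch_interval_min = 5            # streaming micro-batch every 5 minutes
files_per_batch = 10              # files produced per batch

batches_per_day = 24 * 60 // batch_interval_min      # 288
files_per_day = batches_per_day * files_per_batch    # 2,880
files_per_year = files_per_day * 365                 # ~1.05 million

# If query planning spends roughly 5-30 microseconds of metadata work per file,
# a million-file table burns seconds of planning time before reading any data.
for us_per_file in (5, 30):
    planning_sec = files_per_year * us_per_file / 1_000_000
    print(f"{files_per_year:,} files @ {us_per_file} us/file ≈ {planning_sec:.0f} s of planning")
```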
Compaction Strategy:
[Chart: Query planning performance: 1M small files ≈ 20 sec → 10K optimized files ≈ 800 ms]
All three formats provide compaction or optimization commands. Delta Lake has OPTIMIZE, which rewrites small files into larger ones, targeting 512 MB to 1 GB per file. Iceberg and Hudi have similar compaction jobs. You typically run these on a schedule: hourly for hot partitions, daily for warm partitions, weekly for cold data.
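As a minimal sketch of what a scheduled compaction job might look like from PySpark, with placeholder table and catalog names; exact commands and options vary by format version and catalog setup:

```python
# Sketch: a scheduled compaction job (placeholder table/catalog names; options vary by version).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nightly-compaction").getOrCreate()

# Delta Lake: OPTIMIZE bin-packs small files into larger ones. Restricting it to
# recent partitions keeps the rewrite bill focused on hot data.
spark.sql("OPTIMIZE events WHERE event_date = '2024-06-01'")  # date typically parameterized per run

# Apache Iceberg: the rewrite_data_files procedure does the equivalent bin-packing;
# the target size comes from the table's write.target-file-size-bytes property.
spark.sql("CALL my_catalog.system.rewrite_data_files(table => 'db.events')")

# Apache Hudi: merge-on-read tables are usually compacted inline or by an async job,
# driven by writer options such as hoodie.compact.inline and
# hoodie.compact.inline.max.delta.commits.
```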
The trade-off is cost. Compaction reads and rewrites data, consuming compute. For a 100 TB table with 50% daily churn, you might rewrite 50 TB per day: 50 TB read plus 50 TB written is 100 TB of I/O. At 10 cents per TB scanned and written, that's 10 dollars daily. But this keeps query performance acceptable, which may save you 100 dollars in dashboard query costs and the engineer time spent debugging slow queries.
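Spelled out, the cost math looks like this (all figures are the section's example numbers):

```python
# Compaction cost vs. query savings, using the section's example figures.
table_tb = 100
daily_churn = 0.5          # fraction of the table rewritten each day
price_per_tb = 0.10        # dollars per TB scanned or written

rewritten_tb = table_tb * daily_churn        # 50 TB rewritten per day
io_tb = rewritten_tb * 2                     # read once + write once = 100 TB of I/O
compaction_cost = io_tb * price_per_tb       # 10 dollars per day
print(f"Compaction ≈ ${compaction_cost:.0f}/day vs the ~$100 of slow-query cost it avoids")
```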
Partitioning Design:
Partitioning determines how data is physically organized. A poorly designed partition scheme amplifies the small file problem. Partitioning by user_id for 10 million users means 10 million partition directories, each with potentially small files. This makes metadata operations slow and query planning expensive.
Better partition schemes align with query patterns. For time-series data, partition by date or hour: queries filtering by date skip entire partitions. For multi-dimensional queries, consider composite partitions like region and date, or use techniques like Z-ordering (Delta) or hidden partitioning (Iceberg), where the format manages partition transforms internally.
⚠️ Common Pitfall: Over-partitioning creates more problems than it solves. Aim for partitions with at least 1 GB of data each. If your daily data volume is 100 GB, partitioning by hour (24 partitions, roughly 4 GB each) is reasonable. Partitioning by minute (1,440 partitions, roughly 70 MB each) guarantees small file chaos.
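A sketch of partition layouts that follow these guidelines; the catalog, table, and column names are placeholders:

```python
# Sketch: partition layouts aligned with query patterns (placeholder names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Iceberg hidden partitioning: partition by a transform of the timestamp plus region.
# Queries filter on event_ts and region directly; the transform lives in table metadata.
spark.sql("""
    CREATE TABLE my_catalog.db.events (
        event_ts TIMESTAMP,
        region   STRING,
        user_id  BIGINT,
        payload  STRING
    )
    USING iceberg
    PARTITIONED BY (days(event_ts), region)
""")

# Delta Lake: partition coarsely by date, then Z-order within files so queries
# filtering on region still skip most data without exploding the partition count.
spark.sql("OPTIMIZE delta_db.events ZORDER BY (region)")
```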
Hidden Partitioning (Iceberg):
Iceberg supports partition evolution, meaning you can change partition schemes without rewriting data. If you initially partition by day but later realize hour is better, Iceberg can apply the new transform to future writes while old data remains in daily partitions. Queries work transparently because partition logic lives in metadata, not hardcoded in file paths.
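A sketch of that evolution using Iceberg's Spark SQL DDL (placeholder table name):

```python
# Sketch: Iceberg partition evolution (placeholder table name).
# Existing files keep their daily partitioning; only new writes use the hourly transform.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("ALTER TABLE my_catalog.db.events ADD PARTITION FIELD hours(event_ts)")
spark.sql("ALTER TABLE my_catalog.db.events DROP PARTITION FIELD days(event_ts)")
```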
💡 Key Takeaways
✓ Streaming ingestion creates small files: 288 daily batches × 10 files each = 2,880 files daily, leading to 1 million+ files per year and query planning times of 5 to 30 seconds
✓ Compaction rewrites small files into 512 MB to 1 GB files, improving query planning from 20 seconds to under 1 second, but costs compute to rewrite data
✓ Over-partitioning (e.g., by high-cardinality columns like user_id) creates millions of small partitions; aim for 1+ GB per partition for optimal performance
✓ Iceberg supports partition evolution: change partition schemes without rewriting data, with transforms stored in metadata rather than hardcoded file paths
📌 Examples
1. A 100 TB table with 50% daily churn requires rewriting 50 TB per day for compaction. At 10 cents per TB of I/O (50 TB read plus 50 TB written), that is 10 dollars daily in compute, but it saves roughly 100 dollars in slow-query costs
2. Partitioning a 100 GB daily dataset by hour creates 24 partitions at roughly 4 GB each (good). Partitioning by minute creates 1,440 partitions at roughly 70 MB each (bad: it guarantees small files)
3. Uber uses Hudi compaction on a schedule: hourly for hot partitions seeing constant updates, daily for warm data, weekly for cold archives