Data Storage Formats & Optimization • File-level Partitioning Strategies
The Small File Problem and Compaction
What Breaks at Scale: The small file problem is the most common operational failure in partitioned data lakes. It arises when streaming ingestion or frequent micro-batches write thousands of tiny files, typically 1 to 10 MB each, across many partitions. Even when total data volume is modest, the sheer file count becomes the bottleneck.
Here is why this matters. Modern query engines like Presto, Spark, or Trino issue one metadata request per file to get statistics and schema. If a partition contains 10,000 small files instead of 40 optimally sized files, the planning phase alone can take 30 seconds instead of 1 second. Object stores like Amazon S3 charge per request, so listing operations on millions of files can add hundreds of dollars per day in unexpected costs.
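To see where the time and the object store bill go, a rough back-of-the-envelope calculation helps. The sketch below is purely illustrative: the per-file metadata latency and the per-1,000-request list price are assumptions (the list price is in the ballpark of S3 Standard pricing but varies by region), and real engines cache and batch some of these calls.

```python
# Back-of-the-envelope: planning overhead and list-request cost for small files.
# All constants are illustrative assumptions, not measured values.

MS_PER_METADATA_CALL = 3        # assumed per-file footer/statistics fetch (ms)
LIST_PRICE_PER_1000 = 0.005     # assumed per-1,000 LIST request price (USD)
KEYS_PER_LIST_PAGE = 1_000      # S3 returns at most 1,000 keys per LIST call

def planning_seconds(files_per_partition: int) -> float:
    """Rough planning time if the engine touches every file's metadata."""
    return files_per_partition * MS_PER_METADATA_CALL / 1000

def daily_list_cost(total_files: int, listings_per_day: int) -> float:
    """Cost of repeatedly listing a prefix holding total_files objects."""
    pages_per_listing = -(-total_files // KEYS_PER_LIST_PAGE)   # ceiling division
    requests = pages_per_listing * listings_per_day
    return requests / 1_000 * LIST_PRICE_PER_1000

print(planning_seconds(40))                 # ~0.1 s: file count barely matters
print(planning_seconds(10_000))             # ~30 s: metadata fetches dominate
print(daily_list_cost(20_000_000, 5_000))   # ~$500/day of pure LIST requests
```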
How Compaction Works: Compaction is a background process that periodically scans partitions for small files and rewrites them into larger, optimally sized files. A typical compaction job runs hourly or daily and targets partitions with file counts above a threshold, often 200 to 500 files.
The process reads all small files in a partition, merges their data while preserving sort order if applicable, then writes out new files of 128 to 512 MB each. The old small files are marked for deletion but kept temporarily to support time travel queries. After a grace period (typically 7 days), they are permanently removed.
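A minimal version of this rewrite can be sketched in PySpark. Everything here is a simplified assumption rather than a production recipe: the paths, the 256 MB target, and the staging-then-swap layout are placeholders, and table formats such as Iceberg or Delta ship their own rewrite/OPTIMIZE actions that also handle concurrent writers and the deletion grace period.

```python
from typing import Optional
from pyspark.sql import SparkSession

TARGET_FILE_BYTES = 256 * 1024 * 1024     # assumed target output size: 256 MB

spark = SparkSession.builder.appName("compaction-sketch").getOrCreate()

def compact_partition(partition_path: str, staging_path: str,
                      partition_bytes: int, sort_column: Optional[str] = None) -> None:
    """Rewrite one partition's small files into a handful of larger files.

    partition_bytes is the partition's total on-disk size (e.g. summed from an
    object-store listing). The caller swaps staging_path in for the old files
    and deletes them only after the retention grace period has passed.
    """
    df = spark.read.parquet(partition_path)

    # One output file per ~256 MB of input, never fewer than one.
    num_files = max(1, partition_bytes // TARGET_FILE_BYTES)

    if sort_column:
        # Preserve sort order within each output file if the table relies on one.
        df = df.repartitionByRange(num_files, sort_column) \
               .sortWithinPartitions(sort_column)
    else:
        df = df.repartition(num_files)

    # Write to a staging location; the old small files stay readable until the swap.
    df.write.mode("overwrite").parquet(staging_path)

# Example: ~10,000 x 1 MB files collapse into ~40 files of ~256 MB each.
# compact_partition("s3://lake/events/dt=2024-05-01/",
#                   "s3://lake/_staging/events/dt=2024-05-01/",
#                   partition_bytes=10_000 * 1024 * 1024,
#                   sort_column="event_time")
```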
Real Production Strategy: Airbnb runs compaction jobs every 6 hours on active partitions (last 3 days) and daily on older partitions. They target a 256 MB compressed file size and keep file counts under 300 per partition. This holds query planning under 2 seconds for typical queries while managing a multi-petabyte data lake.
The write path also matters. Streaming sinks buffer data for 5 to 15 minutes before flushing files, balancing latency against file size. Some systems use a bucketing strategy where each writer handles a fixed set of partition combinations, reducing the number of simultaneously open files from thousands to dozens.
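Most streaming engines expose this flush interval directly. The sketch below uses Spark Structured Streaming only as one concrete example; the 10-minute trigger, the rate test source, and the output paths are placeholder assumptions, and sinks such as Flink or Kafka Connect have equivalent file-rollover settings.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("buffered-sink-sketch").getOrCreate()

# Placeholder source: the built-in rate source stands in for Kafka, Kinesis, etc.
events = (
    spark.readStream
    .format("rate")
    .option("rowsPerSecond", 100)
    .load()
    .withColumn("dt", F.to_date("timestamp"))   # daily partition column
)

# Trigger every 10 minutes instead of every micro-batch, so each daily partition
# receives a handful of larger files per hour rather than hundreds of tiny ones.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3://lake/_sketch/events/")              # placeholder path
    .option("checkpointLocation", "s3://lake/_sketch/_chk/")  # placeholder path
    .partitionBy("dt")
    .trigger(processingTime="10 minutes")
    .start()
)

query.awaitTermination()
```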
Query Planning Impact: 40 files of 256 MB each → about 1 second of planning; 10,000 files of 1 MB each → about 30 seconds.
⚠️ Common Pitfall: Teams often notice the small file problem only after scale grows 10x. A system that writes 100 files per hour works fine initially but creates 876,000 files per year. Without compaction, queries degrade from seconds to minutes as metadata overhead dominates.
Monitoring Key Metrics: Track files per partition (alert above 500), average file size (alert below 64 MB), and partition planning time (alert above 5 seconds). Also monitor compaction lag: the delay between data write and compaction completion. If compaction cannot keep up with ingestion rate, you need to increase compaction parallelism or adjust write buffering.
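These checks are easy to script against an object-store listing. The sketch below is a hedged illustration using boto3: the bucket and prefix layout are placeholders, the thresholds mirror the alert values above, and a real deployment would emit the results to a metrics system rather than return a dict.

```python
import boto3

# Alert thresholds taken from the guidance above.
MAX_FILES_PER_PARTITION = 500
MIN_AVG_FILE_MB = 64

def check_partition(bucket: str, prefix: str) -> dict:
    """List one partition's objects and flag small-file alert conditions."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    file_count, total_bytes = 0, 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            file_count += 1
            total_bytes += obj["Size"]

    avg_file_mb = (total_bytes / file_count) / (1024 * 1024) if file_count else 0.0
    return {
        "partition": prefix,
        "file_count": file_count,
        "avg_file_mb": round(avg_file_mb, 1),
        "too_many_files": file_count > MAX_FILES_PER_PARTITION,
        "files_too_small": file_count > 0 and avg_file_mb < MIN_AVG_FILE_MB,
    }

# Example (placeholder bucket and partition prefix):
# print(check_partition("my-lake-bucket", "events/dt=2024-05-01/"))
```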
💡 Key Takeaways
✓ Small files (under 10 MB) cause query planning to dominate execution time, turning 1 second plans into 30 second waits when file counts reach thousands per partition
✓ Compaction rewrites small files into 128 to 512 MB optimized files hourly or daily, keeping file counts under 300 to 500 per partition for sub-2-second planning
✓ Streaming sinks buffer data for 5 to 15 minutes before flushing to balance write latency against file size, preventing micro-batches from creating thousands of tiny files
✓ Object storage charges per request, so 10,000 list operations per day across millions of files can add hundreds of dollars per month in unexpected costs
📌 Examples
1. Airbnb compacts partitions every 6 hours for recent data and daily for older data, maintaining 256 MB files and under 300 files per partition across petabytes of data
2. A ride-sharing platform initially wrote roughly 100 small 1 MB files per hour, accumulating 876,000 files yearly and degrading query planning from 2 seconds to 45 seconds before hourly compaction was introduced