Storage Layout and Compaction: File Sizing, Partitioning, and the Small Files Problem
Storage layout directly impacts query performance and cost at scale. The goal is to write large, columnar, compressed files in well-partitioned directories while avoiding the small files problem that degrades metadata operations and query planning.
Columnar formats like Parquet store each column separately, enabling projection pruning. If analysts query 10 out of 200 columns, columnar layouts reduce Input/Output (I/O) by 90 to 95 percent compared to row-oriented formats. Combine this with compression (Snappy or Zstd) and you achieve a 5 to 10 times size reduction from raw JSON. Partitioning directories by date, hour, and high-cardinality business dimensions (e.g., hashed tenant or region) spreads load and enables partition pruning. For example, querying the last 24 hours with hourly partitions scans 24 directories instead of months of data, cutting scan costs proportionally.
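The sketch below illustrates both pruning mechanisms with PyArrow; the dataset path, column names, and partition values are hypothetical, and the point is only that the reader touches two columns and one day's directories rather than the whole dataset.

```python
# Minimal sketch of projection pruning + partition pruning with PyArrow.
# Schema, paths, and values are hypothetical.
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Write a small Hive-partitioned Parquet dataset: dt=<date>/hour=<hour>/part-*.parquet
events = pa.table({
    "dt": ["2024-06-01", "2024-06-01", "2024-06-02"],
    "hour": [0, 1, 0],
    "tenant_bucket": [7, 12, 7],              # e.g. hash(tenant_id) % 64
    "event_type": ["click", "view", "click"],
    "payload": ["a", "b", "c"],
})
pq.write_to_dataset(events, root_path="events_parquet",
                    partition_cols=["dt", "hour"], compression="zstd")

# Projection pruning: only 2 of the columns are read.
# Partition pruning: only dt=2024-06-01 directories are listed and opened.
dataset = ds.dataset("events_parquet", format="parquet", partitioning="hive")
subset = dataset.to_table(
    columns=["event_type", "tenant_bucket"],
    filter=ds.field("dt") == "2024-06-01",
)
print(subset.num_rows)  # 2 rows, without touching the other day or the other columns
```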
File sizing is critical. Writing thousands of sub-10 megabyte files per partition causes the small files problem: query engines spend more time listing and opening files than reading data. Metadata overhead explodes and BI queries scan millions of objects, each with per-file open latency. Aim for 128 to 512 megabyte files. If a micro-batch receives 30 million events over 5 minutes at 100,000 events per second and roughly 1 kilobyte per event, that is about 30 gigabytes raw. After 8x compression to Parquet, you write 3 to 4 gigabytes. With a 256 megabyte target file size, that yields 12 to 16 files per partition per window, which is healthy.
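As a quick sanity check, the arithmetic behind those numbers can be wrapped in a small helper; the function name and defaults are illustrative, using the ~1 KB per event and ~8x compression assumptions from the example.

```python
# Back-of-the-envelope file-count estimate for one micro-batch window.
# Defaults encode the assumptions used in the text: ~1 KB raw events,
# ~8x JSON -> Parquet compression, ~256 MB target files.
def files_per_partition(events_per_sec: float,
                        window_sec: float,
                        avg_event_bytes: float = 1_000,
                        compression_ratio: float = 8.0,
                        target_file_bytes: float = 256e6) -> dict:
    raw_bytes = events_per_sec * window_sec * avg_event_bytes
    parquet_bytes = raw_bytes / compression_ratio
    return {
        "raw_gb": raw_bytes / 1e9,
        "parquet_gb": parquet_bytes / 1e9,
        "files": max(1, round(parquet_bytes / target_file_bytes)),
    }

# 100k events/s over a 5-minute window: ~30 GB raw, ~3.75 GB Parquet, ~15 files.
print(files_per_partition(100_000, 300))
```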
Compaction merges small files into larger ones. Run compaction every few hours or daily depending on write rate. Modern table formats like Apache Iceberg, Hudi, and Delta provide transaction logs and atomic commits, allowing compaction to run concurrently with reads and writes without corrupting snapshots. They also support snapshot isolation, time travel, and schema evolution. Without these, coordinating writers and managing partial failures is brittle.
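A bare-bones version of compaction, without a table format, is sketched below in PySpark (paths and sizes are hypothetical). It rewrites one partition's small files into a handful of roughly 256 megabyte files; the missing atomic swap at the end is precisely the gap that Iceberg, Hudi, and Delta close with transactional commits (Iceberg, for example, exposes this as a rewrite-data-files maintenance procedure).

```python
# Plain-Spark compaction sketch for one partition (hypothetical paths).
# Without a table format, the final swap is NOT atomic for concurrent
# readers/writers -- that coordination is what Iceberg/Hudi/Delta provide.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-partition").getOrCreate()

partition_path = "s3://bucket/silver/events/dt=2024-06-01/hour=03"   # hypothetical
compacted_path = partition_path + "_compacted"

# Assume an object-store listing told us this partition holds ~3.8 GB of Parquet
# spread over thousands of small files; aim for ~256 MB outputs -> ~15 files.
num_output_files = 15

(spark.read.parquet(partition_path)     # open all the small files once
      .repartition(num_output_files)    # shuffle into ~256 MB chunks
      .write
      .mode("overwrite")
      .parquet(compacted_path))

# Final step (not shown): swap compacted_path over partition_path. Doing this
# safely while queries run is exactly what table-format commits are for.
```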
In practice, Amazon teams land raw data to a bronze zone as small append-only files, then compact to silver as columnar Parquet with target file sizes of 256 to 512 megabytes. Gold zones pre-aggregate or denormalize for serving. Periodic compaction keeps file counts low and query latency predictable, even under bursty write patterns during peak traffic.
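To round out the flow, here is a hedged sketch of the silver-to-gold step (bucket names, columns, and the aggregation itself are hypothetical): read the compacted silver Parquet, pre-aggregate per day and tenant, and write a small, denormalized gold table for serving.

```python
# Hypothetical silver -> gold pre-aggregation in a bronze/silver/gold layout.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("silver-to-gold").getOrCreate()

silver = spark.read.parquet("s3://bucket/silver/events/")   # compacted Parquet

daily_by_tenant = (
    silver.groupBy("dt", "tenant_id", "event_type")
          .agg(F.count(F.lit(1)).alias("events"),
               F.sum("revenue").alias("revenue"))
)

# Gold is small and query-shaped: partitioned only by date, ready for BI serving.
(daily_by_tenant.write
                .mode("overwrite")
                .partitionBy("dt")
                .parquet("s3://bucket/gold/daily_tenant_activity/"))
```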
💡 Key Takeaways
•Columnar formats with projection pruning reduce I/O by 90 to 95 percent when querying a subset of columns. Combine with compression for 5 to 10 times size reduction from raw JSON.
•Partition by date, hour, and high-cardinality dimensions (e.g., hashed tenant mod 64) to spread load and enable partition pruning, cutting scan costs proportionally.
•Target 128 to 512 MB file sizes. Thousands of sub-10 MB files cause the small files problem: metadata overhead and per-file open latency dominate, degrading query performance.
•Modern table formats like Apache Iceberg, Hudi, and Delta provide transaction logs and atomic commits, enabling concurrent compaction, snapshot isolation, and schema evolution.
•Amazon pattern: land raw to bronze as append-only files, compact to silver Parquet on a periodic schedule with 256 to 512 MB targets, and aggregate to gold for serving.
📌 Examples
Micro-batch sizing: 100k events/s × 300 seconds = 30M events at 1 KB per event = 30 GB raw. After 8x Parquet compression, write 3.75 GB. With a 256 MB target, produce about 15 files per partition per 5-minute window.
Query efficiency: scanning 10 columns out of 200 with columnar layout cuts I/O by ~95%. Partition pruning on last 24 hours (24 partitions) instead of months reduces scan by 10 to 100 times, saving dollars per query at scale.