Lakehouse Performance Optimization: Compaction, Partitioning, and Data Skipping
Achieving warehouse-like performance in a lakehouse requires careful optimization of physical data layout, metadata management, and query execution patterns. Three core techniques drive performance: compaction, strategic partitioning, and data skipping via statistics.
Compaction merges small files into larger, optimized files, targeting 256 MB to 1 GB per file to match distributed Input/Output (I/O) throughput while minimizing per-file overhead. Copy-on-write strategies rewrite entire files during upserts, yielding simple, fast reads but expensive writes. Merge-on-read appends delta logs and defers merging to background compaction, trading some read complexity for lower write latency and reduced write amplification. Under heavy ingestion (hundreds of TB per day, as seen at Uber), background compaction can fall behind, increasing read amplification because queries must merge base files with multiple delta layers. Adaptive scheduling prioritizes hot partitions and autoscales compaction compute to prevent backlog.
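The planning half of compaction can be sketched as a simple bin-packing pass over file metadata: collect files below a small-file threshold and group them into rewrite batches that approach the target output size. The snippet below is a minimal Python sketch under assumed thresholds and file names; the plan_compaction helper is illustrative, not the API of any particular table format.

```python
# Minimal sketch of size-based compaction planning: bin-pack small files
# into rewrite groups that approach the target output size.
# Thresholds and file names are illustrative assumptions; real table formats
# (Hudi, Delta, Iceberg) drive this from their own commit metadata.

TARGET_BYTES = 512 * 1024 * 1024          # aim for ~512 MB output files
SMALL_FILE_THRESHOLD = 128 * 1024 * 1024  # only compact files below this

def plan_compaction(files):
    """files: list of (path, size_bytes); returns groups of paths to rewrite together."""
    small = sorted((f for f in files if f[1] < SMALL_FILE_THRESHOLD),
                   key=lambda f: f[1], reverse=True)
    groups, current, current_size = [], [], 0
    for path, size in small:
        if current and current_size + size > TARGET_BYTES:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Example: many 8-32 MB streaming output files collapse into a handful of groups.
files = [(f"part-{i:05d}.parquet", (8 + i % 25) * 1024 * 1024) for i in range(200)]
for group in plan_compaction(files):
    print(f"rewrite {len(group)} files into one ~512 MB file")
```

In practice, table formats such as Hudi and Iceberg perform this planning from their own metadata and execute the rewrites as transactional commits, so readers never see a half-compacted state.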
Partitioning strategy directly impacts query performance and parallelism. Time-only partitions like date=2025-01-15 concentrate load on recent data and create imbalanced parallelism. High-cardinality keys like user_id generate millions of partitions, overwhelming metadata systems. Composite partitioning combines time with a hash of the high-cardinality key (for example, date plus hash of user_id modulo 256) to distribute load and improve pruning. Partition evolution allows changing partition keys over time without rewriting historical data, keeping partitioning in table metadata rather than strictly encoded in directory paths. Clustering or Z-ordering within partitions co-locates rows with correlated column values, improving range scans and join performance by 5x to 10x for selective predicates.
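A composite partition key of this kind is easy to sketch: derive the date from the event timestamp and a bucket from a stable hash of the high-cardinality key. The snippet below is a minimal illustration; the column names, bucket count, and partition_key helper are assumptions for the example, not any engine's built-in API.

```python
# Minimal sketch of composite partitioning: derive a (date, bucket) partition
# key from an event, hashing the high-cardinality user_id into a fixed number
# of buckets. Column names and the bucket count (256) are illustrative.

import hashlib
from datetime import datetime, timezone

NUM_BUCKETS = 256

def partition_key(event_time_ms: int, user_id: int) -> tuple[str, int]:
    date = datetime.fromtimestamp(event_time_ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d")
    # Use a stable hash (not Python's randomized hash()) so bucket assignment
    # is deterministic across writers and restarts.
    digest = hashlib.sha1(str(user_id).encode()).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return date, bucket

print(partition_key(1736899200000, 12345))  # -> ('2025-01-15', <bucket in 0..255>)
```

Bucketing bounds the partition count at days × 256 regardless of how many distinct users exist, and a query that filters on both date and user_id can still prune to a single bucket.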
Data skipping uses per-file statistics (min, max, null counts, Bloom filters) stored in manifests to prune files without reading them. Columnar formats like Parquet embed statistics per column, per row group. At query time, a predicate like WHERE user_id = 12345 eliminates files whose user_id range excludes 12345, reducing scanned bytes from terabytes to gigabytes. This converts per-TB-scanned pricing in serverless SQL engines from tens of dollars per query to single dollars. Statistics must be refreshed during compaction or at defined intervals to remain accurate as data evolves.
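Min/max pruning itself is just a range check against each file's manifest entry, as sketched below. The FileStats shape, file names, and sizes are assumptions for illustration; real manifests carry richer per-column statistics such as null counts and Bloom filters.

```python
# Minimal sketch of min/max data skipping: prune files whose user_id range
# cannot contain the predicate value. Manifest contents are illustrative.

from dataclasses import dataclass

@dataclass
class FileStats:
    path: str
    user_id_min: int
    user_id_max: int
    size_bytes: int

def prune(manifest: list[FileStats], user_id: int) -> list[FileStats]:
    # Keep only files whose [min, max] range could contain the value.
    return [f for f in manifest if f.user_id_min <= user_id <= f.user_id_max]

manifest = [
    FileStats("part-00000.parquet", 1, 9_999, 512 * 2**20),
    FileStats("part-00001.parquet", 10_000, 19_999, 512 * 2**20),
    FileStats("part-00002.parquet", 20_000, 29_999, 512 * 2**20),
]

survivors = prune(manifest, 12345)
scanned = sum(f.size_bytes for f in survivors)
print(f"scan {len(survivors)} of {len(manifest)} files, {scanned / 2**30:.1f} GiB")
# -> scan 1 of 3 files, 0.5 GiB
```

Because serverless engines bill on bytes scanned, the same mechanism at table scale is what turns a multi-terabyte scan into a few hundred gigabytes and a tens-of-dollars query into a few dollars.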
💡 Key Takeaways
•Target file sizes of 256 MB to 1 GB to match distributed I/O throughput while minimizing per-file metadata overhead, which is critical for scan performance
•Composite partitioning (date plus hash of user_id modulo 256) distributes load and improves pruning, unlike time-only or high-cardinality keys that create hotspots or millions of partitions
•Data skipping via per-file statistics (min, max, null counts, Bloom filters) reduces scanned bytes from TB to GB, converting query costs from tens of dollars to single dollars
•Copy-on-write rewrites files (fast reads, expensive writes) while merge-on-read appends deltas (fast writes, slower reads that require merging); choose based on the read/write ratio
•Clustering or Z-ordering co-locates rows with correlated column values within partitions, improving range scans and joins by 5x to 10x for selective predicates
📌 Examples
Uber's petabyte-scale lakehouse: background compaction merges small files from streaming ingestion (hundreds of TB per day), and adaptive scheduling prioritizes hot partitions to prevent read amplification
Query with WHERE user_id = 12345 on a 10 TB table: with statistics it prunes 95% of files, scans 500 GB, and costs $2.50 at $5 per TB vs $50 without pruning
Netflix Iceberg migration: composite partitioning (date plus hash modulo 128) replaced millions of single-date partitions, reduced hotspots, and improved query parallelism by 10x