Data Storage Formats & Optimization • File-level Partitioning Strategies
Failure Modes: Skew, Late Data, and Evolution
Data Skew Breaking Parallelism: Partition skew occurs when data distribution across partitions is highly uneven. Suppose you partition by region and 70% of traffic comes from "US". The US partitions will be 3x to 5x larger than the others. When a query processes all regions, execution time is dominated by the slowest partition, and parallelism becomes useless.
The math is brutal. If you have 100 workers processing 100 partitions, but one partition contains 500 GB while the others average 50 GB, that single partition takes 10x longer to process. Your 100-worker cluster effectively becomes a 1-worker system for that query, because 99 workers finish early and sit idle waiting for the straggler.
[Figure: Skewed Partition Impact. 99 workers go idle within 2 minutes while 1 worker runs for 20 minutes on the oversized partition.]
Mitigation strategies include hybrid partitioning, where you split large partitions further. For example, region=US can be subdivided into region=US/subregion=West, region=US/subregion=East, and so on. Another approach is to add hash bucketing within skewed partitions: region=US/bucket=047, where bucket is hash(user_id) mod 64. This distributes the US traffic across 64 buckets while keeping other regions as single partitions.
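A minimal PySpark sketch of this bucketing idea is shown below. The table paths, column names (region, user_id), and the choice of 64 buckets are illustrative assumptions, not a prescribed layout.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-bucketing").getOrCreate()

# Hypothetical input path and schema with region and user_id columns.
events = spark.read.parquet("s3://warehouse/events/raw/")

# Hash user_id into 64 buckets, but only for the oversized region;
# all other regions collapse into bucket=0 so their layout is unchanged.
bucketed = events.withColumn(
    "bucket",
    F.when(F.col("region") == "US",
           F.abs(F.hash("user_id")) % 64)
     .otherwise(F.lit(0)),
)

# Produces directories like region=US/bucket=47/ alongside region=EU/bucket=0/.
(bucketed.write
    .mode("overwrite")
    .partitionBy("region", "bucket")
    .parquet("s3://warehouse/events/bucketed/"))
```

Queries that filter on region still prune to the US subtree; the 64 buckets simply give the scheduler 64 independent splits to parallelize instead of one giant partition.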
Late-Arriving Data: Time-based partitions assume data arrives promptly. In reality, mobile devices go offline, upstream systems lag, or batch jobs retry. If your partition is dt=2024-12-25 and events for that date trickle in for 3 days, you face a choice: reopen and rewrite the partition, or maintain a separate late-data partition.
Reopening partitions breaks immutability assumptions. Downstream incremental jobs that already processed dt=2024-12-25 must detect and reprocess it. This requires tracking partition modification timestamps and complicates dependency management. At scale, continuous partition rewrites also interfere with compaction jobs and increase storage churn.
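To make the reprocessing requirement concrete, here is a small sketch of how a downstream incremental job might detect reopened partitions by comparing file modification times against its last-run watermark. The table root, the stored watermark, and the use of local POSIX paths are assumptions; an object store would use its listing API instead.

```python
from pathlib import Path
from datetime import datetime, timezone

WAREHOUSE = Path("/warehouse/events")                           # hypothetical table root
LAST_RUN = datetime(2024, 12, 26, 3, 0, tzinfo=timezone.utc)    # watermark from the previous run

def modified_partitions(root: Path, since: datetime) -> list[str]:
    """Return partition directories containing files written after `since`."""
    changed = []
    for part_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        newest = max(
            (datetime.fromtimestamp(f.stat().st_mtime, tz=timezone.utc)
             for f in part_dir.rglob("*.parquet")),
            default=None,
        )
        if newest is not None and newest > since:
            changed.append(part_dir.name)  # e.g. "dt=2024-12-25"
    return changed

# Any partition returned here, including a reopened dt=2024-12-25, must be
# reprocessed even though an earlier run already consumed it.
print(modified_partitions(WAREHOUSE, LAST_RUN))
```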
The alternative is a dedicated late-data strategy. Some systems maintain dt=2024-12-25/late=true partitions and periodically merge them into the main partitions. Others accept eventual consistency and document that queries must union current and late partitions for complete accuracy. Airbnb uses a 3-day grace period: data arriving within 3 days goes to the main partition, older arrivals go to a backfill partition that is processed separately.
❗ Remember: Late data is not an edge case at scale. In systems processing billions of mobile events daily, 2% to 5% of events arrive over 24 hours late, and 0.5% arrive over 1 week late.
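The grace-period routing described above can be expressed as a small PySpark sketch. The 3-day window, the column names (event_ts, ingest_ts), and the output paths are illustrative assumptions under the pattern described here, not a specific production pipeline.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("late-data-routing").getOrCreate()

# Hypothetical incoming batch with an event timestamp and an ingest timestamp.
incoming = (spark.read.parquet("s3://warehouse/events/incoming/")
    .withColumn("dt", F.to_date("event_ts"))
    .withColumn("delay_days",
                F.datediff(F.to_date("ingest_ts"), F.to_date("event_ts"))))

# Within the grace period: append into the main table's dt partitions.
on_time = incoming.where(F.col("delay_days") <= 3)
# Older arrivals: land in a separate backfill table processed independently.
late = incoming.where(F.col("delay_days") > 3)

(on_time.drop("delay_days").write.mode("append")
    .partitionBy("dt").parquet("s3://warehouse/events/main/"))
(late.drop("delay_days").write.mode("append")
    .partitionBy("dt").parquet("s3://warehouse/events/backfill/"))
```

The key property is that the main table's partitions are never rewritten after the grace period closes, so downstream incremental jobs can treat them as immutable.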
Partition Evolution Challenges: After operating for a year, teams often realize the original partition scheme was wrong. Maybe you partitioned only by date but now need hourly granularity. Or you partitioned by user_id directly and metadata exploded to 10 million partitions.
Repartitioning 500 TB is not trivial. A full rewrite takes days to weeks of compute time and doubles storage during the migration. Systems like Apache Iceberg support partition evolution, where new data uses the new scheme and old data keeps the old scheme. Queries transparently handle mixed layouts by applying the appropriate pruning logic to each segment. This avoids rewriting everything but adds complexity to query planning.
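As a sketch of what partition evolution looks like in practice, the snippet below changes an Iceberg table from daily to hourly partitioning from Spark. It assumes the Iceberg runtime and its SQL extensions are on the classpath and that a catalog named "prod" is already configured; the table and column names are illustrative.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("partition-evolution")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Catalog configuration for "prod" (warehouse location, catalog impl) omitted.
    .getOrCreate())

# Only data written after this statement uses hours(event_ts); existing files
# keep the old daily layout and are still pruned by their original spec.
spark.sql("""
    ALTER TABLE prod.db.events
    REPLACE PARTITION FIELD days(event_ts) WITH hours(event_ts)
""")
```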
A common failure mode is planning timeouts when partition schemas mix. If you have 50,000 old hourly partitions plus 10,000 new daily partitions and a query spans both, the planner must build a unified execution plan across heterogeneous layouts. This can push planning time from seconds to tens of seconds, especially in systems with naive metadata handling. The solution is to gradually backfill old data using background jobs, retiring mixed schemas within a bounded time window such as 90 days.

💡 Key Takeaways
✓ Skewed partitions where one partition is 5x to 10x larger than others destroy parallelism, causing 99 workers to sit idle while 1 worker processes the oversized partition for 10x longer
✓ Late-arriving data affects 2% to 5% of mobile events at scale, forcing a choice between reopening partitions (breaking immutability) or maintaining separate late-data partitions with 3-day grace periods
✓ Partition evolution after realizing the initial scheme was wrong requires either expensive full rewrites of 500+ TB or complex mixed-schema handling where old and new layouts coexist during gradual migration
✓ Mixed partition schemas during evolution can push query planning from under 1 second to over 10 seconds when spanning 50,000 old partitions plus 10,000 new partitions
📌 Examples
1. A dataset partitioned by region where 70% of 500 GB daily data lands in region=US creates a 350 GB partition that takes 20 minutes to process, while 99 other regional partitions finish in 2 minutes
2. Airbnb uses a 3-day grace period for late data: events arriving within 3 days merge into the main partition, older arrivals go to a separate backfill partition processed independently to avoid continuous rewrites