Data Storage Formats & Optimization • File-level Partitioning Strategies
Choosing Partitions: Trade-Offs and Decision Framework
The Core Trade-Off: Every partitioning decision is a balance between query pruning efficiency and operational overhead. Finer partitions (hourly instead of daily) improve pruning and enable granular retention policies, but explode partition counts, metadata size, and operational complexity. Coarser partitions simplify management but force queries to scan more irrelevant data.
The math matters. Consider a dataset that grows by 100 GB per day. With daily partitions, a query filtering for 1 specific hour scans all 100 GB, wasting 95.8% of its I/O. Hourly partitions reduce the scan to roughly 4.2 GB, but create 24x more directories: over 2 years, that is 17,520 hourly partitions versus 730 daily ones. Metadata catalogs like Hive Metastore or AWS Glue struggle above 50,000 to 100,000 partitions, causing listing operations to time out.
Daily Partitions: 365 partitions/year, simpler ops, but a 1 hour query scans 24 hours of data
vs
Hourly Partitions: 8,760 partitions/year, precise pruning, but 24x metadata overhead
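As a sanity check on those figures, the arithmetic fits in a few lines of Python; the 100 GB/day volume and 2 year retention are the assumptions stated above, not measurements.

```python
# Back-of-the-envelope pruning math for daily vs. hourly partitions
# (100 GB/day and 2 years of retention, as assumed in the text).
DAILY_VOLUME_GB = 100
RETENTION_DAYS = 730
HOUR_OF_DATA_GB = DAILY_VOLUME_GB / 24

for scheme, parts_per_day in [("daily", 1), ("hourly", 24)]:
    partition_size_gb = DAILY_VOLUME_GB / parts_per_day
    total_partitions = RETENTION_DAYS * parts_per_day
    # A query for exactly 1 hour still has to scan one whole partition.
    wasted_pct = 100 * (partition_size_gb - HOUR_OF_DATA_GB) / partition_size_gb
    print(f"{scheme:6s}: {total_partitions:6d} partitions, "
          f"{partition_size_gb:5.1f} GB scanned per 1-hour query, "
          f"{wasted_pct:4.1f}% wasted I/O")
```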
Range vs Hash Decision Criteria: Choose range partitioning when query patterns filter by ranges (dates, ID ranges, numeric values) and you can tolerate hot partitions. Choose hash partitioning when write load must be perfectly balanced and most queries scan full datasets anyway.
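To make the two layouts concrete, here is a sketch of how each might be written out with Spark's DataFrameWriter; the paths, table, and column names (events, event_date, user_id) are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-layouts").getOrCreate()
events = spark.read.parquet("s3://my-bucket/raw/events/")  # hypothetical source

# Date-partitioned layout: one directory per event_date, so date-range
# filters can prune whole directories.
(events.write
    .partitionBy("event_date")
    .mode("overwrite")
    .parquet("s3://my-bucket/curated/events_by_date/"))

# Hash layout: 128 buckets on user_id, spreading write load evenly but
# defeating range predicates on user_id.
(events.write
    .bucketBy(128, "user_id")
    .sortBy("user_id")
    .mode("overwrite")
    .saveAsTable("curated.events_bucketed"))
```

Note that bucketBy only works with saveAsTable, since the bucket spec lives in the catalog rather than in the directory layout.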
The hidden cost of hash partitioning is query performance. If you hash partition by user_id into 128 buckets, a query filtering for users 1000 to 2000 must scan all 128 buckets because those users are randomly distributed. This turns a selective query into a full scan. Range partitioning by user_id would allow scanning only relevant ID ranges, but creates severe skew if some users generate 100x more events.
Cardinality Sweet Spot: Aim for 50 to 500 unique values per partition dimension. Below 50 creates significant skew risk; above 500 explodes metadata and file counts. For high cardinality dimensions like user_id or device_id, always use hash bucketing to cap unique values at 64, 128, or 256 buckets.
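To see why the user_id 1000 to 2000 filter above cannot skip any buckets, here is a toy illustration; CRC32 is used as a stand-in for whatever hash function your engine actually applies.

```python
import zlib

NUM_BUCKETS = 128

def bucket_for(user_id: int) -> int:
    # Stand-in for the engine's bucketing hash: hash the key, modulo bucket count.
    return zlib.crc32(str(user_id).encode()) % NUM_BUCKETS

# Users 1000..2000 are contiguous in ID space but scatter across the buckets.
touched = {bucket_for(uid) for uid in range(1000, 2001)}
print(f"user_id 1000-2000 touches {len(touched)} of {NUM_BUCKETS} buckets")
# Expect ~128 of 128: the selective range filter cannot skip any bucket.
```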
Decision Framework: Start by analyzing query logs. If 90% of queries filter by date ranges, make date your primary partition. If 70% also filter by a categorical dimension with moderate cardinality (50 to 300 values), add it as secondary. Avoid more than 2 to 3 partition levels unless you have strong evidence they are all heavily used.
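A minimal sketch of that query log analysis, assuming the set of filter columns per query has already been extracted from the engine's logs (the parsing step itself varies by engine):

```python
from collections import Counter

# Hypothetical input: one set of filter columns per historical query.
query_filters = [
    {"event_date", "country"},
    {"event_date"},
    {"event_date", "user_id"},
    {"country"},
    # ... thousands more, parsed from the engine's query log
]

column_counts = Counter(col for filters in query_filters for col in filters)
total = len(query_filters)
for col, hits in column_counts.most_common():
    print(f"{col:12s} appears in {100 * hits / total:5.1f}% of queries")
# Columns near the top (e.g. a date column at ~90%) are primary partition candidates.
```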
For write-heavy workloads where queries rarely filter by specific dimensions, prefer date plus hash bucket. This balances write load while keeping partition counts manageable. A system ingesting 100,000 writes per second can distribute load across 128 hash buckets per date, giving roughly 780 writes per second per bucket, which is easily handled by modern streaming sinks.
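A sketch of how a record might be routed to such a date plus hash bucket key, with hypothetical field names; the per-bucket rate is simply the ingest rate divided by the bucket count.

```python
import zlib
from datetime import datetime, timezone

NUM_BUCKETS = 128
WRITES_PER_SECOND = 100_000  # figure from the text

def partition_key(device_id: str, event_ts: float) -> tuple[str, int]:
    """Route a record to a (date, hash bucket) partition."""
    event_date = datetime.fromtimestamp(event_ts, tz=timezone.utc).strftime("%Y-%m-%d")
    bucket = zlib.crc32(device_id.encode()) % NUM_BUCKETS
    return event_date, bucket

print(partition_key("device-42", 1_700_000_000))  # ('2023-11-14', <bucket>)
print(f"~{WRITES_PER_SECOND / NUM_BUCKETS:.0f} writes/second per bucket")  # the ~780 above
```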
"The decision is not 'partition by everything users might filter on.' It is: what filters appear in 80% of queries, and what is the minimum partition count that captures those patterns?"
When to Change Strategy: Repartitioning is expensive but sometimes necessary. Signs you need to repartition include query planning time exceeding 10 seconds, partition counts above 50,000, or discovering 80% of queries filter by a dimension you did not partition on. Modern table formats like Iceberg support partition evolution, allowing mixed layouts where old data uses one scheme and new data uses another, avoiding full rewrites.
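As a sketch of what partition evolution can look like with Iceberg's Spark SQL extensions (catalog, table, and column names are hypothetical), switching new writes from daily to hourly partitions without rewriting existing files:

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg runtime jar and its Spark SQL extensions are on the
# classpath and that `analytics.events` is an existing Iceberg table.
spark = SparkSession.builder.appName("partition-evolution").getOrCreate()

# Drop the old daily field and add an hourly one. Files already written keep
# the daily layout; only data written from now on uses hourly partitions.
spark.sql("ALTER TABLE analytics.events DROP PARTITION FIELD days(event_ts)")
spark.sql("ALTER TABLE analytics.events ADD PARTITION FIELD hours(event_ts)")
```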
💡 Key Takeaways
✓ Finer partitions (hourly vs daily) improve pruning from 100 GB to 4.2 GB per query but create 24x metadata overhead, with 17,520 partitions over 2 years versus 730
✓ Hash partitioning balances write load perfectly across 64 to 128 buckets but turns selective queries into full scans because data is randomly distributed across all buckets
✓ Choose partition dimensions appearing in 80% of queries with moderate cardinality (50 to 500 unique values), avoiding both skew risk (under 50) and metadata explosion (over 500)
✓ Repartitioning becomes necessary when partition counts exceed 50,000, query planning exceeds 10 seconds, or 80% of queries filter by non-partitioned dimensions
📌 Examples
1. A query filtering 1 hour from a daily partitioned 100 GB dataset scans all 100 GB (95.8% waste), while hourly partitions reduce the scan to 4.2 GB but create 8,760 partitions yearly
2. A system with 100,000 writes/second using date plus 128 hash buckets distributes load to roughly 780 writes/second per bucket, avoiding hot partition bottlenecks while keeping partition counts manageable