How Data Layout Reduces Query Costs
The Problem: Without intelligent data organization, every query scans the entire dataset. If you store 2 years of events (730 days, 200 TB total) in a single monolithic table, a dashboard showing "last 7 days of US traffic" must read all 200 TB to find the relevant rows. At $5 per TB scanned, that simple dashboard costs $1,000 per refresh.
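The cost model behind that number, used throughout this section, is simply:

query cost = data scanned × price per TB = 200 TB × $5/TB = $1,000 per refresh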
The Solution: Strategic Data Layout
Data layout strategies physically organize data to minimize what queries must read. Three techniques dominate: partitioning, clustering, and columnar storage.
Real Impact Example: A dashboard queries "last 30 days of US traffic" from the 200 TB table above, which spans 730 days and 20 regions (14,600 possible partition/region combinations).
Without partitioning: Scans all 200 TB, costs $1,000 per query.
With date partitioning only: Scans 30 days' worth, approximately 8 TB (30 out of 730 days), costs about $41.
With date AND region clustering: The query engine skips non-US data blocks within those 30 partitions, scanning roughly 0.4 TB (1 out of 20 regions) at a cost of about $2.
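A minimal sketch of that dashboard query in BigQuery-style SQL, assuming a hypothetical analytics.events table partitioned on an event_date column and clustered by region:

```sql
-- Partition pruning: the event_date filter limits the scan to 30 daily
-- partitions. Block skipping: because rows are clustered by region,
-- the engine reads only the blocks that contain US rows.
SELECT
  event_date,
  COUNT(*) AS events
FROM analytics.events
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
  AND region = 'US'
GROUP BY event_date
ORDER BY event_date;
```

Both filters line up with the physical layout; a filter on a column that is neither the partition key nor a cluster key would still read all 30 partitions in full.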
1. Partitioning: Split tables into physically separate chunks based on a key dimension, most commonly date or ingestion time. If you ingest 1 TB per day, date partitioning creates 730 partitions for 2 years of data.
2. Clustering: Within each partition, sort and co-locate rows by secondary dimensions like customer_id or region. This improves range scans and joins without creating massive partition counts.
3. Columnar format: Store data by column instead of by row. Queries selecting 5 columns from a 50-column table read only 10 percent of the data, and compression is 3x to 10x better because similar values sit together.
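As a concrete sketch, the first two techniques might be declared like this in BigQuery-style DDL (the table name and schema are illustrative; BigQuery's native storage is already columnar, so the third technique requires no declaration):

```sql
-- One physical partition per calendar day (730 for 2 years of data);
-- within each partition, rows are co-located by region, then customer_id.
CREATE TABLE analytics.events (
  event_timestamp TIMESTAMP,
  region          STRING,
  customer_id     STRING,
  payload         JSON
)
PARTITION BY DATE(event_timestamp)
CLUSTER BY region, customer_id;
```

Cluster column order matters: put the column you filter on most often first, because later cluster keys only narrow the data within runs of the earlier ones.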
Query Cost Reduction: $1,000 (unpartitioned) → ~$41 (date partitioned) → ~$2 (date + clustered)
Columnar compression is equally powerful. A typical analytics table might have 50 columns while queries touch an average of 8, so columnar storage reads only 16 percent of the raw data. Combined with compression ratios of 5x to 10x, your actual I/O and storage costs drop by 30x to 60x compared to uncompressed row storage.

⚠️ Common Pitfall: Over-partitioning by high cardinality keys like user_id creates millions of tiny files. Query planners must touch thousands of files, actually increasing cost and latency. Keep partition counts in the hundreds, not millions.
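One way to catch this pitfall early is to inspect partition counts directly. A sketch against BigQuery's INFORMATION_SCHEMA.PARTITIONS view (the dataset name is illustrative):

```sql
-- Tables at the top of this list with thousands of partitions are
-- candidates for re-partitioning on a coarser key such as date.
SELECT
  table_name,
  COUNT(*) AS partition_count,
  ROUND(SUM(total_logical_bytes) / POW(1024, 4), 2) AS logical_tb
FROM mydataset.INFORMATION_SCHEMA.PARTITIONS
GROUP BY table_name
ORDER BY partition_count DESC;
```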
💡 Key Takeaways
✓ Date partitioning on a 200 TB table with 2 years of data creates 730 daily partitions of roughly 275 GB each, reducing typical query scans from 200 TB to single-digit TB
✓ Clustering by secondary dimensions like region or customer within partitions cuts scanned data by another 10x to 20x without exploding partition counts
✓ Columnar storage with compression achieves 3x to 10x space savings and allows queries to read only needed columns, cutting I/O by 5x to 10x
✓ Over-partitioning by high cardinality keys creates millions of tiny files, forcing query planners to open thousands of files and actually increasing cost
📌 Examples
1. A production table storing 1 TB per day for 2 years uses date partitioning. A query for the last 7 days scans 7 TB instead of 730 TB, saving $3,615 per query at $5 per TB.
2. BigQuery recommends keeping partition counts in the hundreds. A table partitioned by user_id with 5 million users creates 5 million partitions, causing query planning overhead that can exceed scan costs.