How Data Layout Reduces Query Costs
The Problem: Without intelligent data organization, every query scans the entire dataset. If you store 2 years of events (730 days, 200 TB total) in a single monolithic table, a dashboard showing "last 7 days of US traffic" must read all 200 TB to find the relevant rows. At $5 per TB scanned, that simple dashboard costs $1,000 per refresh.
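The cost model behind that number, used throughout this section, is simply:

query cost = data scanned × price per TB = 200 TB × $5/TB = $1,000 per refresh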
The Solution: Strategic Data Layout
Data layout strategies physically organize data to minimize what queries must read. Three techniques dominate: partitioning, clustering, and columnar storage.
Real Impact Example: A dashboard queries "last 30 days of US traffic" from the 200 TB table above, which spans 730 days and 20 regions (14,600 possible partition/region combinations).
Without partitioning: Scans all 200 TB, costs $1,000 per query.
With date partitioning only: Scans 30 days' worth, approximately 8 TB (30 out of 730 days), costs about $41.
With date AND region clustering: The query engine skips non-US data blocks within those 30 partitions, scanning roughly 0.4 TB (1 out of 20 regions) at a cost of about $2.
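A minimal sketch of that dashboard query in BigQuery-style SQL, assuming a hypothetical analytics.events table partitioned on an event_date column and clustered by region:

```sql
-- Partition pruning: the event_date filter limits the scan to 30 daily
-- partitions. Block skipping: because rows are clustered by region,
-- the engine reads only the blocks that contain US rows.
SELECT
  event_date,
  COUNT(*) AS events
FROM analytics.events
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
  AND region = 'US'
GROUP BY event_date
ORDER BY event_date;
```

Both filters line up with the physical layout; a filter on a column that is neither the partition key nor a cluster key would still read all 30 partitions in full.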
1. Partitioning: Split tables into physically separate chunks based on a key dimension, most commonly date or ingestion time. If you ingest 1 TB per day, date partitioning creates 730 partitions for 2 years of data.
2. Clustering: Within each partition, sort and co-locate rows by secondary dimensions like customer_id or region. This improves range scans and joins without creating massive partition counts.
3. Columnar format: Store data by column instead of by row. Queries selecting 5 columns from a 50-column table read only 10 percent of the data, and compression is 3x to 10x better because similar values sit together.
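As a concrete sketch, the first two techniques might be declared like this in BigQuery-style DDL (the table name and schema are illustrative; BigQuery's native storage is already columnar, so the third technique requires no declaration):

```sql
-- One physical partition per calendar day (730 for 2 years of data);
-- within each partition, rows are co-located by region, then customer_id.
CREATE TABLE analytics.events (
  event_timestamp TIMESTAMP,
  region          STRING,
  customer_id     STRING,
  payload         JSON
)
PARTITION BY DATE(event_timestamp)
CLUSTER BY region, customer_id;
```

Cluster column order matters: put the column you filter on most often first, because later cluster keys only narrow the data within runs of the earlier ones.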
Query Cost Reduction: $1,000 (unpartitioned) → ~$41 (date partitioned) → ~$2 (date + clustered)
Columnar compression is equally powerful. A typical analytics table might have 50 columns while queries touch an average of 8, so columnar storage reads only 16 percent of the raw data. Combined with compression ratios of 5x to 10x, your actual I/O and storage costs drop by 30x to 60x compared to uncompressed row storage.

⚠️ Common Pitfall: Over-partitioning by high cardinality keys like user_id creates millions of tiny files. Query planners must touch thousands of files, actually increasing cost and latency. Keep partition counts in the hundreds, not millions.
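One way to catch this pitfall early is to inspect partition counts directly. A sketch against BigQuery's INFORMATION_SCHEMA.PARTITIONS view (the dataset name is illustrative):

```sql
-- Tables at the top of this list with thousands of partitions are
-- candidates for re-partitioning on a coarser key such as date.
SELECT
  table_name,
  COUNT(*) AS partition_count,
  ROUND(SUM(total_logical_bytes) / POW(1024, 4), 2) AS logical_tb
FROM mydataset.INFORMATION_SCHEMA.PARTITIONS
GROUP BY table_name
ORDER BY partition_count DESC;
```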
💡 Key Takeaways
✓ Date partitioning on a 200 TB table with 2 years of data creates 730 daily partitions of roughly 275 GB each, reducing typical query scans from 200 TB to single-digit TB
✓ Clustering by secondary dimensions like region or customer within partitions cuts scanned data by another 10x to 20x without exploding partition counts
✓ Columnar storage with compression achieves 3x to 10x space savings and allows queries to read only needed columns, cutting I/O by 5x to 10x
✓ Over-partitioning by high cardinality keys creates millions of tiny files, forcing query planners to open thousands of files and actually increasing cost
📌 Examples
1. A production table storing 1 TB per day for 2 years uses date partitioning. A query for the last 7 days scans 7 TB instead of 730 TB, saving $3,615 per query at $5 per TB.
2. BigQuery recommends keeping partition counts in the hundreds. A table partitioned by user_id with 5 million users creates 5 million partitions, causing query planning overhead that can exceed scan costs.