
Query Optimization at Warehouse Scale

The Scale Challenge: Consider a product analytics warehouse at a large company storing 5 to 10 petabytes of event data in a columnar system like BigQuery, Snowflake, or Redshift. A typical dashboard loads 10 to 20 charts, each firing 1 to 3 queries. With hundreds of concurrent analysts, you might see thousands of concurrent queries and tens of thousands per minute during peak hours. The performance target is often a p50 latency of 1 to 2 seconds and a p95 under 5 seconds. Raw full scans would never meet this, so the system stacks multiple optimization layers that work together.

Partitioning and Pruning: Data is partitioned by date and sometimes by customer or region. When a query filters for the last 7 days from a table with 3 years of history, the optimizer prunes over 99% of partitions immediately by checking partition metadata. Instead of scanning roughly 1,000 days' worth of data, it scans only 7 days. This single technique often provides a 100x to 1000x speedup.
Partition Pruning Impact: without pruning, the query touches roughly 1,000 partitions; with pruning, it touches 7.
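To make the mechanics concrete, here is a minimal Python sketch of metadata-based partition pruning. The partition layout, sizes, and metadata structure are hypothetical stand-ins, not the internals of any particular warehouse, which would keep this information in a dedicated metadata service.

```python
from datetime import date, timedelta

# Hypothetical metadata: one entry per daily partition of an events table,
# covering roughly 3 years of history.
partitions = [
    {"partition_date": date(2022, 1, 1) + timedelta(days=i), "size_gb": 9000}
    for i in range(1095)
]

def prune_partitions(partitions, start_date, end_date):
    """Return only partitions whose date can match the filter.

    Pruned partitions are never opened, so the cost of this step is
    proportional to the metadata, not the data.
    """
    return [p for p in partitions
            if start_date <= p["partition_date"] <= end_date]

# A dashboard query filtering the last 7 days touches 7 of ~1,095 partitions.
end = partitions[-1]["partition_date"]
start = end - timedelta(days=6)
survivors = prune_partitions(partitions, start, end)
print(f"scanning {len(survivors)} of {len(partitions)} partitions")
```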
Columnar Storage and Predicate Pushdown: Data is stored column-wise with heavy compression, and only the columns referenced in a query are read from storage. A query selecting 3 columns from a 50-column table reads 6% of the data. Metadata services store min and max values per column per file, enabling data skipping: if a file's maximum value for price is 50 and the query filters for price > 100, the entire file is skipped without reading a single byte.

Result Caching: Before any optimization begins, the engine checks a result cache. For high-reuse dashboards accessed by multiple analysts, cache hits take p50 latency from 2 seconds down to under 50 milliseconds. The cache key includes the query text and the data version, ensuring correctness while dramatically improving repeat access patterns.

Distributed Execution: Systems like Snowflake and Redshift use massively parallel processing (MPP). A large query might run across 16 to 128 worker nodes; each worker scans a subset of the data and computes partial aggregates, and the coordinator combines the results. With proper work distribution, this parallelism turns a 60-second single-node query into a 2-second distributed query.
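The min/max file skipping described under predicate pushdown can be sketched in a few lines of Python. The per-file statistics and the price filter below are illustrative assumptions; real engines read these statistics from formats such as Parquet footers or their own metadata stores.

```python
# Hypothetical per-file column statistics, as a columnar engine might keep them.
files = [
    {"path": "events/part-000.parquet", "price_min": 1.0,   "price_max": 50.0},
    {"path": "events/part-001.parquet", "price_min": 20.0,  "price_max": 80.0},
    {"path": "events/part-002.parquet", "price_min": 120.0, "price_max": 400.0},
]

def files_to_scan(files, predicate_min):
    """Keep only files whose max value could satisfy `price > predicate_min`."""
    return [f for f in files if f["price_max"] > predicate_min]

# For the filter price > 100, only the third file can contain matches;
# the other two are skipped without reading any of their data.
print([f["path"] for f in files_to_scan(files, 100.0)])
```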
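The result cache lookup can be illustrated as a hash over the normalized query text plus a data version token. The key composition here is an assumed simplification for illustration; each engine has its own normalization and invalidation rules.

```python
import hashlib

# In-memory stand-in for a result cache; real engines persist this service-side.
result_cache = {}

def cache_key(query_text: str, data_version: str) -> str:
    """Key on normalized query text plus the table's data version, so a cached
    result is reused only when neither the query nor the data has changed."""
    normalized = " ".join(query_text.lower().split())
    return hashlib.sha256(f"{normalized}|{data_version}".encode()).hexdigest()

def run_query(query_text: str, data_version: str, compute):
    key = cache_key(query_text, data_version)
    if key in result_cache:            # cache hit: milliseconds
        return result_cache[key]
    result = compute()                 # cache miss: full execution, seconds
    result_cache[key] = result
    return result

# Repeated dashboard refreshes against unchanged data hit the cache.
q = "SELECT region, COUNT(*) FROM orders GROUP BY region"
run_query(q, data_version="v42", compute=lambda: {"US": 10, "EU": 7})
print(run_query(q, data_version="v42", compute=lambda: {"US": 10, "EU": 7}))
```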
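Finally, the scatter/gather pattern behind MPP execution, where each worker computes a partial aggregate and the coordinator merges them, can be sketched as follows. The worker count and the thread pool are stand-ins for real cluster scheduling and data assignment.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def worker_partial_count(rows):
    """Each worker aggregates only its own slice of the data (count by region)."""
    return Counter(row["region"] for row in rows)

def distributed_count_by_region(all_rows, num_workers=4):
    # Split rows into per-worker slices; a real engine assigns file ranges instead.
    slices = [all_rows[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = pool.map(worker_partial_count, slices)
    # Coordinator step: merge the partial aggregates into the final result.
    total = Counter()
    for partial in partials:
        total += partial
    return dict(total)

rows = [{"region": r} for r in ["US", "EU", "US", "APAC", "EU", "US"]]
print(distributed_count_by_region(rows))
```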
✓ In Practice: At companies like Meta and Google, internal warehouse systems achieve sub-second p95 latency on petabyte-scale tables by combining date partitioning (pruning 99% of the data), columnar storage (reading only the needed columns), result caching (50ms cache hits), and distributed execution across hundreds of workers.
💡 Key Takeaways
Warehouse-scale systems handle thousands of concurrent queries against petabytes of data by stacking optimization layers: partitioning, columnar storage, caching, and distributed execution
Partition pruning eliminates 99% or more of the data for time-range queries by checking partition metadata, turning potential petabyte scans into gigabyte scans
Columnar storage with predicate pushdown reads only the necessary columns and skips entire files using min/max statistics when filters cannot match, reducing I/O by 90% to 99%
Result caching for high-reuse dashboards can improve p50 latency from 2 seconds to under 50 milliseconds, while MPP execution across 16 to 128 workers parallelizes the remaining work
📌 Examples
1. A Netflix-style analytics warehouse with 10 petabytes of events: a dashboard query filtering the last 7 days from 3 years of data prunes from 1,000 partitions to 7, scans only 3 of 50 columns thanks to the columnar format, and completes in 1.5 seconds at p95 across 32 workers
2. E-commerce order analytics: a query counting orders by region hits the result cache on repeated dashboard access, returning in 45ms instead of recomputing a 2-second aggregation across 500 million orders