
Query Optimization at Warehouse Scale

The Scale Challenge: Consider a product analytics warehouse at a large company storing 5 to 10 petabytes of event data in a columnar system like BigQuery, Snowflake, or Redshift. A typical dashboard loads 10 to 20 charts, each firing 1 to 3 queries. With hundreds of concurrent analysts, you might see thousands of concurrent queries and tens of thousands per minute during peak hours. The performance target is often a p50 latency of 1 to 2 seconds and a p95 under 5 seconds. Raw full scans would never meet this, so the system stacks multiple optimization layers that work together.

Partitioning and Pruning: Data is partitioned by date and sometimes by customer or region. When a query filters for the last 7 days from a table with 3 years of history, the optimizer prunes over 99% of partitions immediately by checking partition metadata. Instead of scanning roughly 1,000 days' worth of data, it scans only 7 days. This single technique often provides a 100x to 1000x speedup.
Partition Pruning Impact: without pruning, the query touches roughly 1,000 partitions; with pruning, it touches 7.
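To make the mechanics concrete, here is a minimal Python sketch of metadata-based partition pruning. The partition layout, sizes, and metadata structure are hypothetical stand-ins, not the internals of any particular warehouse, which would keep this information in a dedicated metadata service.

```python
from datetime import date, timedelta

# Hypothetical metadata: one entry per daily partition of an events table,
# covering roughly 3 years of history.
partitions = [
    {"partition_date": date(2022, 1, 1) + timedelta(days=i), "size_gb": 9000}
    for i in range(1095)
]

def prune_partitions(partitions, start_date, end_date):
    """Return only partitions whose date can match the filter.

    Pruned partitions are never opened, so the cost of this step is
    proportional to the metadata, not the data.
    """
    return [p for p in partitions
            if start_date <= p["partition_date"] <= end_date]

# A dashboard query filtering the last 7 days touches 7 of ~1,095 partitions.
end = partitions[-1]["partition_date"]
start = end - timedelta(days=6)
survivors = prune_partitions(partitions, start, end)
print(f"scanning {len(survivors)} of {len(partitions)} partitions")
```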
Columnar Storage and Predicate Pushdown: Data is stored column-wise with heavy compression, and only the columns referenced in a query are read from storage. A query selecting 3 columns from a 50-column table reads 6% of the data. Metadata services store min and max values per column per file, enabling data skipping: if a file's maximum value for price is 50 and the query filters for price > 100, the entire file is skipped without reading a single byte.

Result Caching: Before any optimization begins, the engine checks a result cache. For high-reuse dashboards accessed by multiple analysts, cache hits take p50 latency from 2 seconds down to under 50 milliseconds. The cache key includes the query text and the data version, ensuring correctness while dramatically improving repeat access patterns.

Distributed Execution: Systems like Snowflake and Redshift use massively parallel processing (MPP). A large query might run across 16 to 128 worker nodes; each worker scans a subset of the data and computes partial aggregates, and the coordinator combines the results. With proper work distribution, this parallelism turns a 60-second single-node query into a 2-second distributed query.
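The min/max file skipping described under predicate pushdown can be sketched in a few lines of Python. The per-file statistics and the price filter below are illustrative assumptions; real engines read these statistics from formats such as Parquet footers or their own metadata stores.

```python
# Hypothetical per-file column statistics, as a columnar engine might keep them.
files = [
    {"path": "events/part-000.parquet", "price_min": 1.0,   "price_max": 50.0},
    {"path": "events/part-001.parquet", "price_min": 20.0,  "price_max": 80.0},
    {"path": "events/part-002.parquet", "price_min": 120.0, "price_max": 400.0},
]

def files_to_scan(files, predicate_min):
    """Keep only files whose max value could satisfy `price > predicate_min`."""
    return [f for f in files if f["price_max"] > predicate_min]

# For the filter price > 100, only the third file can contain matches;
# the other two are skipped without reading any of their data.
print([f["path"] for f in files_to_scan(files, 100.0)])
```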
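The result cache lookup can be illustrated as a hash over the normalized query text plus a data version token. The key composition here is an assumed simplification for illustration; each engine has its own normalization and invalidation rules.

```python
import hashlib

# In-memory stand-in for a result cache; real engines persist this service-side.
result_cache = {}

def cache_key(query_text: str, data_version: str) -> str:
    """Key on normalized query text plus the table's data version, so a cached
    result is reused only when neither the query nor the data has changed."""
    normalized = " ".join(query_text.lower().split())
    return hashlib.sha256(f"{normalized}|{data_version}".encode()).hexdigest()

def run_query(query_text: str, data_version: str, compute):
    key = cache_key(query_text, data_version)
    if key in result_cache:            # cache hit: milliseconds
        return result_cache[key]
    result = compute()                 # cache miss: full execution, seconds
    result_cache[key] = result
    return result

# Repeated dashboard refreshes against unchanged data hit the cache.
q = "SELECT region, COUNT(*) FROM orders GROUP BY region"
run_query(q, data_version="v42", compute=lambda: {"US": 10, "EU": 7})
print(run_query(q, data_version="v42", compute=lambda: {"US": 10, "EU": 7}))
```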
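Finally, the scatter/gather pattern behind MPP execution, where each worker computes a partial aggregate and the coordinator merges them, can be sketched as follows. The worker count and the thread pool are stand-ins for real cluster scheduling and data assignment.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def worker_partial_count(rows):
    """Each worker aggregates only its own slice of the data (count by region)."""
    return Counter(row["region"] for row in rows)

def distributed_count_by_region(all_rows, num_workers=4):
    # Split rows into per-worker slices; a real engine assigns file ranges instead.
    slices = [all_rows[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = pool.map(worker_partial_count, slices)
    # Coordinator step: merge the partial aggregates into the final result.
    total = Counter()
    for partial in partials:
        total += partial
    return dict(total)

rows = [{"region": r} for r in ["US", "EU", "US", "APAC", "EU", "US"]]
print(distributed_count_by_region(rows))
```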
✓ In Practice: At companies like Meta and Google, internal warehouse systems achieve sub-second p95 latency on petabyte-scale tables by combining date partitioning (pruning 99% of the data), columnar storage (reading only the needed columns), result caching (50ms cache hits), and distributed execution across hundreds of workers.
💡 Key Takeaways
Warehouse-scale systems handle thousands of concurrent queries against petabytes of data by stacking optimization layers: partitioning, columnar storage, caching, and distributed execution
Partition pruning eliminates 99% or more of the data for time-range queries by checking partition metadata, turning potential petabyte scans into gigabyte scans
Columnar storage with predicate pushdown reads only the necessary columns and skips entire files using min/max statistics when filters cannot match, reducing I/O by 90% to 99%
Result caching for high-reuse dashboards can improve p50 latency from 2 seconds to under 50 milliseconds, while MPP execution across 16 to 128 workers parallelizes the remaining work
📌 Examples
1. A Netflix-style analytics warehouse with 10 petabytes of events: a dashboard query filtering the last 7 days from 3 years of data prunes from 1,000 partitions to 7, scans only 3 of 50 columns thanks to the columnar format, and completes in 1.5 seconds at p95 across 32 workers
2. E-commerce order analytics: a query counting orders by region hits the result cache on repeated dashboard access, returning in 45ms instead of recomputing a 2-second aggregation across 500 million orders