Data Warehousing Fundamentals • Query Optimization Techniques • Easy • ⏱️ ~2 min
What is Query Optimization?
Definition
Query Optimization is the process of transforming a database query into the most efficient execution plan possible, minimizing resource consumption while returning the same correct results.
Consider a simple query such as SELECT * FROM orders WHERE customer_id = 123. There are many ways the database could execute it. It could scan every single row in the table. It could use an index on customer_id. It could even scan a materialized view. Each approach gives the same answer but at dramatically different costs.
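Most engines will show you which plan they chose through an EXPLAIN command. A minimal sketch, assuming PostgreSQL and a hypothetical index named idx_orders_customer_id (other databases expose similar facilities):

-- Create an index the optimizer can consider as an access path.
CREATE INDEX idx_orders_customer_id ON orders (customer_id);

-- Ask the optimizer for its chosen plan for the query above.
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
-- With the index and fresh statistics this typically reports an Index Scan
-- rather than a Seq Scan over every row of orders.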
Without optimization, a naive plan would scan a 10 TB table on cloud storage, taking tens of seconds to minutes. Product teams typically expect interactive analytics under 1 to 3 seconds at p95 for dashboards and under 100 milliseconds for operational reads. The gap between these expectations and raw scan performance is what query optimization bridges.
The Three-Stage Pipeline:
Query Processing Stages: Stage 1 (Parse) → Stage 2 (Logical Optimization) → Stage 3 (Physical Optimization)
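To make the middle stage concrete, here is a hedged sketch of one common logical rewrite, predicate pushdown, written by hand as two equivalent queries (table and column names such as customers, name, and order_date are illustrative, not from the text):

-- Before: the date filter sits above the join, so the join sees every order.
SELECT c.name, o.amount
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2024-01-01';

-- After pushdown: the filter is applied while scanning orders, so only recent
-- rows ever reach the join. Stage 3 then picks the physical details, such as
-- hash join vs. merge join and index scan vs. full scan.
SELECT c.name, o.amount
FROM customers c
JOIN (SELECT customer_id, amount
      FROM orders
      WHERE order_date >= DATE '2024-01-01') o
  ON o.customer_id = c.customer_id;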
💡 Key Takeaways
✓ Query optimization transforms a logical query into the most efficient physical execution plan, bridging the gap between tens-of-seconds raw scan times and sub-second interactive requirements
✓ The three-stage pipeline consists of parsing into a logical plan, rewriting it for logical efficiency (pushing filters down, simplifying expressions), and then choosing physical algorithms and access paths
✓ Cost-based optimizers build statistics on data distribution (row counts, distinct values, histograms) and estimate resource costs (I/O, CPU, memory, network) to select minimum-cost plans (see the statistics sketch after this list)
✓ Without optimization, scanning a 10 TB warehouse table could take minutes, but optimized plans with proper partitioning and indexing achieve 1 to 3 second response times
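As a sketch of the statistics a cost-based optimizer works from (assuming PostgreSQL, where they are exposed in the pg_stats view; other warehouses publish similar metadata under different names):

-- Refresh the optimizer's statistics for the table: row counts, distinct
-- value estimates, most-common values, and histograms.
ANALYZE orders;

-- Inspect what the optimizer now knows about the customer_id distribution.
SELECT attname, n_distinct, most_common_vals, histogram_bounds
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'customer_id';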
📌 Examples
1. A query filtering recent orders from a 10 TB table: a naive full scan takes 60+ seconds, but with date partitioning the optimizer prunes 99% of the data and completes in under 2 seconds (see the partitioning sketch after this list)
2. Joining two tables of 1 million and 100 rows: the optimizer estimates costs and chooses to broadcast the small table to all workers rather than shuffle the large table, reducing network transfer from gigabytes to megabytes (see the broadcast hint sketch below)
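A minimal sketch of example 1, assuming PostgreSQL declarative range partitioning (partition names and date ranges are illustrative; cloud warehouses offer equivalent partitioning or clustering clauses):

-- Partition the fact table by order date so old data can be pruned.
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    amount      NUMERIC
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2024_q4 PARTITION OF orders
    FOR VALUES FROM ('2024-10-01') TO ('2025-01-01');
CREATE TABLE orders_2025_q1 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2025-04-01');

-- The date filter lets the planner skip every partition except the one that
-- covers the requested range, instead of scanning the whole table.
EXPLAIN SELECT * FROM orders
WHERE order_date >= DATE '2025-02-01' AND order_date < DATE '2025-03-01';

For example 2, distributed engines normally pick a broadcast join on their own from table statistics; a hedged Spark SQL sketch of forcing the choice with a hint (small_dim, region_id, and region are illustrative names):

-- Broadcast the tiny dimension table to every worker instead of shuffling
-- the large fact table across the network.
SELECT /*+ BROADCAST(d) */ f.amount, d.region
FROM orders f JOIN small_dim d ON f.region_id = d.region_id;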