Data Warehousing Fundamentals • Query Optimization Techniques • Easy • ⏱️ ~2 min
What is Query Optimization?
Definition
Query Optimization is the process of transforming a database query into the most efficient execution plan possible, minimizing resource consumption while returning the same correct results.
Consider a simple query such as SELECT * FROM orders WHERE customer_id = 123. There are many ways the database could execute it. It could scan every single row in the table. It could use an index on customer_id. It could even scan a materialized view. Each approach gives the same answer but at dramatically different costs.
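Most engines will show you which plan they chose through an EXPLAIN command. A minimal sketch, assuming PostgreSQL and a hypothetical index named idx_orders_customer_id (other databases expose similar facilities):

-- Create an index the optimizer can consider as an access path.
CREATE INDEX idx_orders_customer_id ON orders (customer_id);

-- Ask the optimizer for its chosen plan for the query above.
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
-- With the index and fresh statistics this typically reports an Index Scan
-- rather than a Seq Scan over every row of orders.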
Without optimization, a naive plan would scan a 10 TB table on cloud storage, taking tens of seconds to minutes. Product teams typically expect interactive analytics under 1 to 3 seconds at p95 for dashboards and under 100 milliseconds for operational reads. The gap between these expectations and raw scan performance is what query optimization bridges.
The Three-Stage Pipeline:
Query Processing Stages: Stage 1 (Parse) → Stage 2 (Logical Optimization) → Stage 3 (Physical Optimization)
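To make the middle stage concrete, here is a hedged sketch of one common logical rewrite, predicate pushdown, written by hand as two equivalent queries (table and column names such as customers, name, and order_date are illustrative, not from the text):

-- Before: the date filter sits above the join, so the join sees every order.
SELECT c.name, o.amount
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2024-01-01';

-- After pushdown: the filter is applied while scanning orders, so only recent
-- rows ever reach the join. Stage 3 then picks the physical details, such as
-- hash join vs. merge join and index scan vs. full scan.
SELECT c.name, o.amount
FROM customers c
JOIN (SELECT customer_id, amount
      FROM orders
      WHERE order_date >= DATE '2024-01-01') o
  ON o.customer_id = c.customer_id;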
💡 Key Takeaways
✓ Query optimization transforms a logical query into the most efficient physical execution plan, bridging the gap between tens-of-seconds raw scan times and sub-second interactive requirements
✓ The three-stage pipeline consists of parsing into a logical plan, rewriting it for logical efficiency (pushing filters down, simplifying expressions), and then choosing physical algorithms and access paths
✓ Cost-based optimizers build statistics on data distribution (row counts, distinct values, histograms) and estimate resource costs (I/O, CPU, memory, network) to select minimum-cost plans (see the statistics sketch after this list)
✓ Without optimization, scanning a 10 TB warehouse table could take minutes, but optimized plans with proper partitioning and indexing achieve 1 to 3 second response times
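As a sketch of the statistics a cost-based optimizer works from (assuming PostgreSQL, where they are exposed in the pg_stats view; other warehouses publish similar metadata under different names):

-- Refresh the optimizer's statistics for the table: row counts, distinct
-- value estimates, most-common values, and histograms.
ANALYZE orders;

-- Inspect what the optimizer now knows about the customer_id distribution.
SELECT attname, n_distinct, most_common_vals, histogram_bounds
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'customer_id';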
📌 Examples
1. A query filtering recent orders from a 10 TB table: a naive full scan takes 60+ seconds, but with date partitioning the optimizer prunes 99% of the data and completes in under 2 seconds (see the partitioning sketch after this list)
2. Joining two tables of 1 million and 100 rows: the optimizer estimates costs and chooses to broadcast the small table to all workers rather than shuffle the large table, reducing network transfer from gigabytes to megabytes (see the broadcast hint sketch below)
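A minimal sketch of example 1, assuming PostgreSQL declarative range partitioning (partition names and date ranges are illustrative; cloud warehouses offer equivalent partitioning or clustering clauses):

-- Partition the fact table by order date so old data can be pruned.
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    amount      NUMERIC
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2024_q4 PARTITION OF orders
    FOR VALUES FROM ('2024-10-01') TO ('2025-01-01');
CREATE TABLE orders_2025_q1 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2025-04-01');

-- The date filter lets the planner skip every partition except the one that
-- covers the requested range, instead of scanning the whole table.
EXPLAIN SELECT * FROM orders
WHERE order_date >= DATE '2025-02-01' AND order_date < DATE '2025-03-01';

For example 2, distributed engines normally pick a broadcast join on their own from table statistics; a hedged Spark SQL sketch of forcing the choice with a hint (small_dim, region_id, and region are illustrative names):

-- Broadcast the tiny dimension table to every worker instead of shuffling
-- the large fact table across the network.
SELECT /*+ BROADCAST(d) */ f.amount, d.region
FROM orders f JOIN small_dim d ON f.region_id = d.region_id;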