
How Query Optimizers Choose Execution Plans

Statistics Drive Everything: The optimizer cannot choose a good plan without understanding your data. It needs to know how many rows live in each table, how values are distributed across columns, and how selective your filters will be. Systems maintain statistics like number of rows, number of distinct values per column, min and max values per partition, histograms showing value distribution, and correlation between commonly joined columns. Consider a table with 10 million user records. If a query filters by status = 'active', the optimizer needs to estimate how many rows match. If statistics show 9.5 million users are active, that filter is not selective at all, and the optimizer might choose a full table scan. But if only 50,000 users are active (0.5%), an index lookup becomes far cheaper.
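To make that estimate concrete, here is a minimal Python sketch of selectivity estimation from statistics. The ColumnStats structure and the uniform-distribution fallback are illustrative assumptions, not any real system's implementation; production optimizers keep richer histograms and handle nulls, ranges, and correlated columns.

```python
from dataclasses import dataclass

@dataclass
class ColumnStats:
    # Hypothetical per-column statistics, loosely mirroring the text above.
    total_rows: int
    distinct_values: int
    histogram: dict[str, int]  # frequent value -> estimated row count

def estimate_matching_rows(stats: ColumnStats, value: str) -> int:
    """Estimate rows matching `column = value`."""
    if value in stats.histogram:
        # A histogram entry gives a direct estimate for frequent values.
        return stats.histogram[value]
    # Fallback assumption: values are uniformly distributed, so each
    # distinct value accounts for roughly total_rows / distinct_values rows.
    return stats.total_rows // stats.distinct_values

# The 10M-user example from the text: 'active' matches 95% of rows,
# so the filter is barely selective and a full scan is reasonable.
users_status = ColumnStats(
    total_rows=10_000_000,
    distinct_values=3,
    histogram={"active": 9_500_000, "suspended": 450_000, "deleted": 50_000},
)
est = estimate_matching_rows(users_status, "active")
print(f"estimated rows: {est}, selectivity: {est / users_status.total_rows:.1%}")
```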
Enumerating Physical Plans: After logical rewrites, the optimizer generates candidate physical plans. For a three-table join, there are multiple join orders to consider: should it join A with B first, then add C, or B with C first, then A? Each order has different intermediate result sizes and therefore different costs (see the sketch after the list below). The main physical decisions are:
1. Join Algorithm Selection: Choose between a hash join (fast for equality joins), a merge join (efficient when inputs are sorted), or a nested loop join (best when one side is very small).
2. Access Path Selection: For each table, decide between a full scan, an index scan, or reading from a materialized view.
3. Distribution Strategy: In distributed systems, decide whether to broadcast a small table to all workers or repartition both sides by the join key.
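As a rough illustration of the join-order point, the sketch below estimates the intermediate result size of each possible first join for three tables. All row counts and join selectivities are invented for the example; a real optimizer derives them from statistics and uses dynamic programming to search larger join spaces.

```python
from itertools import combinations

# Assumed table sizes and join selectivities (fraction of the cross
# product that survives the join predicate). Purely illustrative numbers.
rows = {"A": 1_000_000, "B": 100_000, "C": 10_000}
selectivity = {
    frozenset({"A", "B"}): 1e-5,
    frozenset({"B", "C"}): 1e-4,
    frozenset({"A", "C"}): 1e-6,
}

# Estimate each candidate first join; a cost-based optimizer favors
# orders whose early joins produce small intermediate results.
for t1, t2 in combinations(rows, 2):
    size = rows[t1] * rows[t2] * selectivity[frozenset({t1, t2})]
    print(f"join {t1} with {t2} first: ~{size:,.0f} intermediate rows")
```

With these numbers, joining A with C first yields about 10 intermediate rows while joining A with B first yields about a million, which is exactly the kind of gap that makes join-order search worthwhile.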
The Cost Model in Action: Each candidate plan gets a cost estimate. For a hash join between a 1 million row table A and a 100,000 row table B, the optimizer estimates reading both tables (I/O cost), building a hash table for B (CPU and memory cost), then probing with A's rows (CPU cost). If table A has an index on the join key and statistics show the filter is highly selective, an index nested loop join might be cheaper despite its worse per-row cost.
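A toy version of such a cost model might look like the following sketch. The weights, the per-lookup cost, and the 1,000-row filtered outer side are assumptions for illustration, not any particular system's formulas.

```python
# Toy cost model: each plan's cost is a weighted sum of I/O and CPU work.

def hash_join_cost(build_rows, probe_rows, io_weight=1.0, cpu_weight=0.01):
    io = build_rows + probe_rows    # read both inputs once
    cpu = build_rows + probe_rows   # hash-table inserts plus probes
    return io_weight * io + cpu_weight * cpu

def index_nl_join_cost(outer_rows, lookup_cost=4.0):
    # One index probe into the inner table per qualifying outer row.
    return outer_rows * lookup_cost

# Full hash join: build on B (100k rows), probe with A (1M rows).
print("hash join:     ", hash_join_cost(build_rows=100_000, probe_rows=1_000_000))

# If a selective filter leaves only 1,000 outer rows (an assumed figure),
# probing A's join-key index per row can beat scanning everything.
print("index NL join: ", index_nl_join_cost(outer_rows=1_000))
```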
Plan Comparison Example: Plan A (cost 15,000) vs. Plan B (cost 3,500).
The optimizer picks Plan B with cost 3500 over Plan A with cost 15000. These cost units typically represent a weighted combination of expected I/O operations, CPU instructions, and data volume transferred. The actual execution time difference might be 8 seconds versus 400 milliseconds.
⚠️ Common Pitfall: Statistics can become stale as data grows or changes distribution. If the optimizer thinks a filter is highly selective based on old statistics but it now matches millions of rows, the chosen plan can be catastrophically slow. Most systems require periodic statistics updates via background jobs or explicit commands.
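The sketch below illustrates the failure mode: the access path is chosen from the stale estimate, but the runtime cost is paid on the actual row count. The 5% threshold and the per-row costs are invented for illustration.

```python
# Why stale statistics hurt: plan choice uses the estimate,
# execution pays for the actual rows.

def choose_access_path(estimated_rows, total_rows):
    # Rule of thumb (assumed): index scans win only for selective filters.
    return "index scan" if estimated_rows / total_rows < 0.05 else "full scan"

def runtime_cost(path, actual_rows, total_rows):
    # Index scans pay per matching row; full scans pay for the whole table.
    return actual_rows * 4.0 if path == "index scan" else total_rows * 1.0

total = 10_000_000
stale_estimate, actual = 50_000, 6_000_000  # statistics predate a data shift

path = choose_access_path(stale_estimate, total)
print(path, "-> cost", runtime_cost(path, actual, total))
# 'index scan' -> cost 24,000,000: far worse than the 10,000,000 full scan
# that fresh statistics would have led the optimizer to choose.
```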
💡 Key Takeaways
Optimizers rely on statistics like row counts, distinct value counts, and histograms to estimate how selective filters are and how large intermediate results will be
Physical plan enumeration considers join order permutations, join algorithms (hash, merge, nested loop), access paths (scan vs index), and distribution strategies (broadcast vs repartition)
Cost models translate each plan into estimated resource consumption, typically combining I/O operations, CPU cycles, memory usage, and network transfer into a single cost metric
Stale statistics are the most common cause of optimization failures: the optimizer chooses a plan suited to the old data distribution, and that plan performs catastrophically on the current data
📌 Examples
1. Joining customers (1M rows) with orders (10M rows) on customer ID: the optimizer sees that the customers table is much smaller, chooses to build the hash table from customers (200 MB of memory), then probes with orders, completing the join in 4 seconds
2. Query with a filter on an indexed column where statistics show 0.1% selectivity: the optimizer chooses an index scan touching 10,000 rows instead of a full table scan of 10 million rows, reducing execution from 12 seconds to 300 milliseconds
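The build-side memory figure in the first example follows directly from row count times average row width. The sketch below assumes a 200-byte average customer row, an invented figure chosen to be consistent with the 200 MB stated above.

```python
# Back-of-the-envelope build-side sizing for example 1.
customer_rows = 1_000_000
avg_row_bytes = 200  # assumption: average materialized row width
hash_table_mb = customer_rows * avg_row_bytes / 1_000_000
print(f"hash table build side: ~{hash_table_mb:.0f} MB")  # ~200 MB
```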