Search & Ranking SystemsQuery Parsing & OptimizationMedium⏱️ ~2 min

Cost Based Optimization: Join Order and Access Path Selection

Cost Based Optimization

Cost based optimizers enumerate execution plans and select the lowest cost option. Consider this query: SELECT o.id, c.name, p.title FROM orders o JOIN customers c ON o.customer_id = c.id JOIN products p ON o.product_id = p.id WHERE c.country = 'US'. With orders at 10 million rows, customers at 1 million rows, and products at 50,000 rows, the optimizer must decide: join orders to customers first, or customers to products first? The cost model estimates CPU cycles, memory, disk I/O, and network transfer for each candidate plan.

Join Order Impact

Join order dramatically affects intermediate result sizes. For the query above, if we filter customers by country = 'US' first (reducing 1M to 300K rows), then join to orders, we process fewer rows. Bad order: orders JOIN customers JOIN products processes 10M rows before any filtering. Good order: (customers WHERE country='US') JOIN orders JOIN products starts with 300K rows. For n tables, there are n factorial possible orders: 5 tables means 120 orders, 10 tables means 3.6 million. Optimizers prune to 10,000 to 100,000 candidates to keep optimization under 100 milliseconds.

Access Path Selection

Access path selection chooses how to read each table. For SELECT * FROM orders WHERE customer_id = 12345 on a 10 million row table with an index on customer_id, the optimizer compares: index scan reads ~100 matching rows directly; sequential scan reads all 10 million rows. The decision depends on selectivity. At 0.001% selectivity (100 of 10M rows), index scan wins. At 30% selectivity (3M rows), sequential scan is faster because random I/O from index lookups costs more than reading sequentially. The crossover is typically 5% to 15% depending on storage type.

Join Algorithm Selection

Three join algorithms compete. Nested loop: for each row in outer table, scan inner table. Works well when inner has an index: SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id with index on customers.id completes in 50 to 200 milliseconds. Hash join: build hash table from smaller input, probe with larger. Optimal for equi joins without indexes. Merge join: requires both inputs sorted on join key, efficient when data is pre sorted. Without an index, nested loop on 10M rows takes 10+ seconds versus 200 milliseconds for hash join.

💡 Key Takeaways
Cost based optimization evaluates CPU, memory, I/O, and network for each plan; for SELECT ... FROM orders JOIN customers JOIN products, join order determines if we process 10M or 300K rows first
For n tables, n factorial join orders exist (10 tables = 3.6 million); optimizers prune to 10,000 to 100,000 candidates to stay under 100 milliseconds optimization time
Access path depends on selectivity: index scan for WHERE customer_id = 12345 on 10M rows (0.001% selectivity) beats sequential scan; crossover at 5% to 15%
Nested loop with index completes in 50 to 200 milliseconds; without index on 10M rows takes 10+ seconds versus 200 milliseconds for hash join
Hash join builds hash table from smaller input then probes; optimal for equi joins like ON o.customer_id = c.id without indexes
📌 Interview Tips
1Walk through join order decision: orders (10M) JOIN customers (1M) JOIN products (50K) with WHERE customers.country = 'US'. Filter customers first (1M to 300K), then join to orders.
2Explain selectivity crossover: SELECT * FROM orders WHERE status = 'pending' on 10M rows. If 0.1% match (10K rows), use index. If 30% match (3M rows), sequential scan is faster.
3Compare join algorithms with the same query: nested loop with index on customers.id = 200ms, hash join without index = 200ms, nested loop without index = 10+ seconds.
← Back to Query Parsing & Optimization Overview
Cost Based Optimization: Join Order and Access Path Selection | Query Parsing & Optimization - System Overflow