Cost Based Optimization: Join Order and Access Path Selection
Cost Based Optimization
Cost based optimizers enumerate execution plans and select the lowest cost option. Consider this query: SELECT o.id, c.name, p.title FROM orders o JOIN customers c ON o.customer_id = c.id JOIN products p ON o.product_id = p.id WHERE c.country = 'US'. With orders at 10 million rows, customers at 1 million rows, and products at 50,000 rows, the optimizer must decide: join orders to customers first, or customers to products first? The cost model estimates CPU cycles, memory, disk I/O, and network transfer for each candidate plan.
Join Order Impact
Join order dramatically affects intermediate result sizes. For the query above, if we filter customers by country = 'US' first (reducing 1M to 300K rows), then join to orders, we process fewer rows. Bad order: orders JOIN customers JOIN products processes 10M rows before any filtering. Good order: (customers WHERE country='US') JOIN orders JOIN products starts with 300K rows. For n tables, there are n factorial possible orders: 5 tables means 120 orders, 10 tables means 3.6 million. Optimizers prune to 10,000 to 100,000 candidates to keep optimization under 100 milliseconds.
Access Path Selection
Access path selection chooses how to read each table. For SELECT * FROM orders WHERE customer_id = 12345 on a 10 million row table with an index on customer_id, the optimizer compares: index scan reads ~100 matching rows directly; sequential scan reads all 10 million rows. The decision depends on selectivity. At 0.001% selectivity (100 of 10M rows), index scan wins. At 30% selectivity (3M rows), sequential scan is faster because random I/O from index lookups costs more than reading sequentially. The crossover is typically 5% to 15% depending on storage type.
Join Algorithm Selection
Three join algorithms compete. Nested loop: for each row in outer table, scan inner table. Works well when inner has an index: SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id with index on customers.id completes in 50 to 200 milliseconds. Hash join: build hash table from smaller input, probe with larger. Optimal for equi joins without indexes. Merge join: requires both inputs sorted on join key, efficient when data is pre sorted. Without an index, nested loop on 10M rows takes 10+ seconds versus 200 milliseconds for hash join.