Search & Ranking SystemsQuery Parsing & OptimizationMedium⏱️ ~2 min

Cost Based Optimization: Join Order and Access Path Selection

Cost based optimizers enumerate alternative execution plans and select one based on a cost model that estimates Central Processing Unit (CPU), memory, Input/Output (I/O), and network resources. Join order dominates cost in multi join queries because different orders can produce dramatically different intermediate result sizes. For a query joining tables A, B, and C, joining selective tables first minimizes downstream work. The optimizer uses dynamic programming to explore join orders, often capping enumeration at 10 to 12 joins due to exponential complexity and switching to greedy heuristics beyond that threshold. Each join node requires access path decisions: should the system scan the full table, use an index seek, or leverage a hash probe? This depends on selectivity estimates derived from statistics like histograms, Number of Distinct Values (NDV) sketches, and multi column correlation data. Google Spanner changed a production join order to keep the large table local and probe a remote small dimension, cutting latency by approximately 5x by avoiding expensive cross region Remote Procedure Calls (RPCs) in the inner loop. In distributed systems, the optimizer must model data locality, shuffle volume, and memory constraints. Amazon Redshift co locates joins when distribution keys match, avoiding multi gigabyte network shuffles. Misspecified distribution keys force repartition joins that can slow queries by 10 to 100x, turning second scale queries into minute scale operations on 1 to 10 terabyte (TB) fact tables. Google BigQuery broadcasts small dimension tables (tens to hundreds of megabytes (MB)) to all workers to avoid terabyte scale shuffles; misestimating the small side forces expensive repartitioning and adds minutes of latency.
💡 Key Takeaways
Join order dominates cost in multi table queries. Google Spanner optimized a production join to keep large tables local, reducing latency by approximately 5x by avoiding cross region RPCs
Dynamic programming explores join orders up to 10 to 12 tables before switching to greedy heuristics due to exponential complexity, trading optimality for bounded planning time
Amazon Redshift co located joins avoid shuffles but misspecified distribution keys cause 10 to 100x slowdowns (seconds to minutes) on 1 to 10 TB tables due to multi gigabyte network transfers
Google BigQuery broadcasts small sides (under hundreds of MB) to avoid terabyte shuffles. Misestimation forces repartition and adds minutes of latency
Access path selection chooses between full scan, index seek, or hash probe based on selectivity from histograms and NDV statistics. Cardinality misestimates of 10 to 1000x cause plan disasters
📌 Examples
Star schema query: Join fact table (1 TB) with dimensions (10 MB each). Broadcast dimensions and scan fact once, completing in 10 to 30 seconds. Repartitioning the fact table would shuffle terabytes and take minutes.
Amazon Redshift case: Fact table distributed on order_id, dimension on user_id. Join on user_id forces repartition. Redistributing the fact table on user_id fixed the distribution key mismatch and dropped query time from 5 minutes to 15 seconds.
Google Spanner production: Changed join order to (local_large JOIN remote_small) instead of (remote_small JOIN local_large), keeping heavy probes local and cutting latency from 50 milliseconds (ms) to under 10 ms per query.
← Back to Query Parsing & Optimization Overview
Cost Based Optimization: Join Order and Access Path Selection | Query Parsing & Optimization - System Overflow