Distributed Query Optimization: Locality, Shuffles, and Skew Handling
Distributed query optimization extends cost-based optimization with models for data locality, network bandwidth, shuffle costs, and skew-induced stragglers. The optimizer must decide whether to move computation to data (pushdown filters, co-located joins) or move data to computation (broadcast or repartition joins), factoring in partition sizes, replication, and cross-zone or cross-region latency.
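To make the move-data-versus-move-computation decision concrete, here is a minimal cost-model sketch in Python. The formulas, thresholds, and parameter names are illustrative assumptions, not any engine's actual planner.

```python
def choose_join_strategy(build_bytes, probe_bytes, num_workers,
                         net_bytes_per_sec, worker_mem_bytes,
                         colocated=False):
    """Toy planner: pick a co-located, broadcast, or repartition join.

    All formulas and thresholds are illustrative assumptions, not a
    production optimizer's real cost model.
    """
    if colocated:
        # Matching partition/distribution keys: the join runs where the data lives.
        return "colocated", 0.0

    # Broadcast join: copy the small (build) side to every worker.
    broadcast_cost = (build_bytes * num_workers) / net_bytes_per_sec
    # Repartition join: shuffle both sides across the network once.
    repartition_cost = (build_bytes + probe_bytes) / net_bytes_per_sec

    # Broadcasting is only safe if the build side comfortably fits in worker memory.
    fits_in_memory = build_bytes < 0.5 * worker_mem_bytes
    if fits_in_memory and broadcast_cost < repartition_cost:
        return "broadcast", broadcast_cost
    return "repartition", repartition_cost


# Example: a 200 MB dimension joined to a 2 TB fact across 100 workers on a 10 Gbit/s network.
strategy, cost_s = choose_join_strategy(
    build_bytes=200e6, probe_bytes=2e12, num_workers=100,
    net_bytes_per_sec=1.25e9, worker_mem_bytes=64e9)
print(strategy, round(cost_s, 1))  # broadcast wins: ~16 s of copying vs ~1,600 s of shuffling
```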
Google Spanner optimizes for data locality to minimize cross-region Remote Procedure Call (RPC) latency, which adds 10 to 50 milliseconds (ms) per remote operator. By reordering joins to keep large tables local and probe small remote dimensions, production queries dropped from 50 ms to under 10 ms. Amazon Redshift relies on co-located joins when distribution keys match, avoiding multi-gigabyte (GB) shuffles. Misspecified distribution keys force repartition joins that shuffle both sides across the network, slowing queries by 10 to 100x (seconds to minutes on 1 to 10 terabyte (TB) tables).
Skew is the dominant distributed failure mode. A single hot key can funnel gigabytes to one task, causing out-of-memory (OOM) errors or straggler tasks that stretch overall query latency by 10 to 100x. Google BigQuery and Amazon Athena use adaptive repartitioning to detect and split heavy keys at runtime, and apply broadcast joins for small, bounded tables (a few hundred megabytes (MB) or less) to avoid shuffles entirely. The trade-off is memory risk: broadcast replicates the small side to every worker, so underestimating its size triggers OOM errors and query retries.
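BigQuery and Athena make the broadcast-versus-shuffle choice internally. As an external illustration of the same decision, here is a hedged PySpark sketch, since Spark exposes the broadcast hint and size threshold directly; the table names and the 100 MB threshold are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (
    SparkSession.builder
    .appName("broadcast-vs-shuffle")
    # Let the planner auto-broadcast anything it estimates under ~100 MB
    # (illustrative threshold; Spark's default is 10 MB).
    .config("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))
    .getOrCreate()
)

orders = spark.table("orders")  # large fact table (hypothetical name)
users = spark.table("users")    # small dimension, assumed to fit in each worker's memory

# Explicit hint: ship `users` to every worker instead of shuffling `orders`.
# If `users` turns out larger than estimated, this is exactly the OOM risk described above.
joined = orders.join(broadcast(users), on="user_id", how="inner")
```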
💡 Key Takeaways
• Google Spanner locality-aware optimization keeps large tables local and probes small remote sides, cutting cross-region join latency by approximately 5x (50 ms to under 10 ms per query)
• Amazon Redshift co-located joins avoid shuffles when distribution keys match. Misspecified keys force multi-GB repartition, causing 10 to 100x slowdowns (seconds to minutes on 1 to 10 TB tables)
• Google BigQuery broadcasts dimensions under a few hundred MB to all workers, avoiding terabyte shuffles. Misestimation forces expensive repartition and adds minutes of latency
• Skew causes straggler tasks: a single hot key funnels gigabytes to one worker, triggering out-of-memory (OOM) errors or extending query latency by 10 to 100x. Adaptive repartitioning splits heavy keys at runtime
• Trade-off: broadcast saves shuffle cost but risks worker OOM if the small side is larger than expected; repartition scales but adds network and sort overhead, especially under skew
📌 Examples
Amazon Redshift distribution key fix: The fact table was distributed on order_id and the dimension on user_id. A join on user_id shuffled 5 TB and took 5 minutes. Redistributing the fact table on user_id enabled a co-located join and dropped runtime to 15 seconds.
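A sketch of that fix, assuming hypothetical connection details and table names (the example does not name its tables); ALTER TABLE ... ALTER DISTSTYLE KEY DISTKEY is Redshift's syntax for changing a table's distribution key in place.

```python
import psycopg2  # standard PostgreSQL driver, which Redshift speaks

# Hypothetical connection parameters.
conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com", port=5439,
    dbname="analytics", user="admin", password="***")
conn.autocommit = True

with conn.cursor() as cur:
    # Redistribute the fact table on the join key so the user_id join becomes co-located.
    cur.execute("ALTER TABLE orders_fact ALTER DISTSTYLE KEY DISTKEY user_id;")
    # Sanity check: the plan should now show DS_DIST_NONE (no redistribution) for the join.
    cur.execute("EXPLAIN SELECT * FROM orders_fact o JOIN users u ON o.user_id = u.user_id;")
    print("\n".join(row[0] for row in cur.fetchall()))
```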
Google Spanner cross-region join: The original plan joined the small remote table first and then the large local table, incurring 20 milliseconds (ms) of cross-region RPC latency per row. Reordering to (local JOIN remote) kept probes local and reduced per-query latency from 50 ms to under 10 ms.
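A sketch of the reordered query using the google-cloud-spanner Python client and Spanner's FORCE_JOIN_ORDER statement hint, which pins the written join order so the large local table stays the probe-driving side; the instance, database, table, and column names are assumptions.

```python
from google.cloud import spanner

# Hypothetical instance/database identifiers.
client = spanner.Client()
database = client.instance("prod-instance").database("orders-db")

# The statement hint pins the written join order: the large local table drives the join,
# and only the small remote dimension is probed across regions.
sql = """
@{FORCE_JOIN_ORDER=TRUE}
SELECT o.order_id, d.region_name
FROM local_orders AS o
JOIN remote_region_dim AS d ON o.region_id = d.region_id
"""

with database.snapshot() as snapshot:
    for row in snapshot.execute_sql(sql):
        print(row)
```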
BigQuery skew handling: An e-commerce query joined orders (1 TB) with users (100 MB) on user_id. A single power user had 10 GB of orders, causing one task to run 100x longer than the rest. Adaptive repartitioning split that user across 10 tasks, reducing straggler time from 10 minutes to under 1 minute.
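Engines without runtime adaptivity can approximate the same split manually with key salting. Below is a hedged PySpark sketch of that pattern; the salt count and table names are assumptions, and this is a manual stand-in for what BigQuery's adaptive repartitioning does automatically.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("manual-skew-salting").getOrCreate()

orders = spark.table("orders")  # 1 TB fact, heavily skewed on user_id (hypothetical name)
users = spark.table("users")    # 100 MB dimension (hypothetical name)

NUM_SALTS = 10  # split each hot key across this many tasks (assumption)

# Scatter the skewed side: a 10 GB power user now lands in ~10 different partitions.
orders_salted = orders.withColumn("salt", (F.rand() * NUM_SALTS).cast("long"))

# Replicate the small side once per salt value so every salted order row still matches.
salts = spark.range(NUM_SALTS).withColumnRenamed("id", "salt")
users_salted = users.crossJoin(salts)

# Join on (user_id, salt); the hot key's work is now spread over NUM_SALTS tasks.
joined = orders_salted.join(users_salted, ["user_id", "salt"]).drop("salt")
```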