Distributed Query Optimization: Locality, Shuffles, and Skew Handling
Why Distributed Queries Are Slow
Distributed query optimization is chiefly about minimizing network transfers and data movement. For SELECT * FROM orders o JOIN products p ON o.product_id = p.id, with orders (10 GB) spread across 50 nodes and products (100 MB) on a separate set of nodes, the optimizer must decide: shuffle both tables, or broadcast products? Shuffling 10 GB over a 1 Gbps network takes about 80 seconds; broadcasting 100 MB takes about 4 seconds. The wrong choice turns a 5-second query into a 90-second one.
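The arithmetic above can be sketched as a tiny cost model. The function name, and the assumption that transfer time is simply size divided by bandwidth, are illustrative; a real optimizer's cost function is far richer.

```python
# Hypothetical cost model (illustrative only): time to move data over
# the network, used to compare shuffle vs broadcast strategies.
def transfer_seconds(size_gb: float, bandwidth_gbps: float) -> float:
    # size_gb * 8 converts gigabytes to gigabits, matching Gbps bandwidth.
    return size_gb * 8 / bandwidth_gbps

shuffle_cost = transfer_seconds(10, 1.0)     # shuffle 10 GB -> 80.0 s
broadcast_cost = transfer_seconds(0.1, 1.0)  # send 100 MB once -> ~0.8 s
```

In practice the broadcast also fans out to all 50 nodes, so the real figure depends on topology and pipelining; the point is the order-of-magnitude gap between the two strategies.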
Broadcast vs Shuffle Latency
A broadcast join replicates the smaller table to every node. For SELECT * FROM orders o JOIN dim_products p ON o.product_id = p.id, broadcasting the 100 MB products table completes in about 4 seconds, after which each node joins locally. A shuffle join repartitions both tables on the join key: shuffling 10 GB takes 30 to 60 seconds once coordination overhead is included. Broadcast wins when the dimension table is small, typically under 100 to 500 MB. Co-located joins avoid the network entirely: when both tables are already partitioned by product_id, the join executes with zero network wait.
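What one worker does after the broadcast can be sketched as a hash join over its local partition; the dict-of-rows representation and field names here are illustrative assumptions, not any engine's actual API.

```python
# Sketch of the per-node work in a broadcast hash join: the small
# dimension table has already been replicated to this node, so the
# build and probe phases both run with no further network traffic.
def broadcast_hash_join(local_orders, broadcast_products):
    # Build side: hash the (small) broadcast table on its join key.
    by_id = {p["id"]: p for p in broadcast_products}
    # Probe side: stream the node's local orders partition.
    return [
        {**o, "product_name": by_id[o["product_id"]]["name"]}
        for o in local_orders
        if o["product_id"] in by_id
    ]
```

Each of the 50 nodes runs this independently on its own orders partition; the results are simply concatenated, which is why no shuffle is needed.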
Skew Creates Stragglers
Data skew is a common cause of slow distributed queries. For SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id, if 80% of orders belong to 1% of customers, hash partitioning sends 80% of the rows to a handful of nodes. With 100 nodes and an even distribution, the query finishes in 10 seconds. With skew, 99 nodes finish in about 2 seconds and then sit idle while one straggler processes 80% of the data for 800 seconds; the query is only as fast as its slowest node. Key salting fixes this: grouping by customer_id || '_' || rand(0,9) spreads each hot key across 10 partitions, and a second aggregation pass merges the partial results, reducing 800 seconds to roughly 80.
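The two-phase salted aggregation can be sketched as follows; the phase names, NUM_SALTS constant, and row layout are assumptions for illustration, mirroring the rand(0,9) salt in the SQL above.

```python
import random
from collections import defaultdict

NUM_SALTS = 10  # matches rand(0,9): each hot key fans out to 10 partitions

def phase1_partial_sums(rows):
    # Phase 1 (runs in parallel per partition): aggregate on the
    # salted key, so a hot customer_id is split across NUM_SALTS groups.
    partial = defaultdict(float)
    for r in rows:
        salt = random.randrange(NUM_SALTS)
        partial[(r["customer_id"], salt)] += r["amount"]
    return partial

def phase2_merge(partial):
    # Phase 2: strip the salt and merge the partial sums. This pass
    # touches at most NUM_SALTS rows per customer, so it is cheap.
    final = defaultdict(float)
    for (customer_id, _salt), total in partial.items():
        final[customer_id] += total
    return dict(final)
```

Salting works here because SUM decomposes into sums of partial sums; the same trick applies to COUNT, MIN, and MAX, but not directly to non-decomposable aggregates such as exact medians.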
Cross Region Latency
Cross-region queries add 50 to 150 ms of latency per network round trip. For SELECT * FROM us_east.orders JOIN eu_west.products ON ..., 10 coordination round trips add 500 ms to 1.5 seconds before any data processing starts. Transferring 10 GB across regions at 100 Mbps takes 800 seconds, versus 80 seconds within a region at 1 Gbps. Push the join to the larger table's region and replicate small dimension tables locally.
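The reasoning for pushing the join to the larger table's region can be made concrete with a back-of-the-envelope cost function; the function name and default parameters (100 Mbps inter-region bandwidth, 100 ms RTT, 10 round trips) are assumptions taken from the figures above.

```python
# Hypothetical cross-region cost estimate: coordination round trips
# plus bulk transfer time for shipping size_gb between regions.
def cross_region_seconds(size_gb, bandwidth_mbps=100, rtt_ms=100, round_trips=10):
    coordination = round_trips * rtt_ms / 1000   # handshake latency, seconds
    transfer = size_gb * 8000 / bandwidth_mbps   # GB -> megabits -> seconds
    return coordination + transfer

ship_orders = cross_region_seconds(10)    # move 10 GB of orders -> 801.0 s
ship_products = cross_region_seconds(0.1)  # move 100 MB of products -> 9.0 s
```

Shipping the 100 MB dimension table to the orders region is roughly two orders of magnitude cheaper than the reverse, which is exactly the replicate-small-tables-locally strategy.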