
Distributed Query Optimization: Locality, Shuffles, and Skew Handling

Why Distributed Queries Are Slow

Distributed query optimization minimizes network transfers and data movement. For SELECT * FROM orders o JOIN products p ON o.product_id = p.id, with orders (10GB) spread across 50 nodes and products (100MB) on separate nodes, the optimizer must decide: shuffle both tables, or broadcast products? Shuffling 10GB over a 1 Gbps network takes 80 seconds; broadcasting 100MB takes 4 seconds. The wrong choice turns a 5-second query into a 90-second one.
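The back-of-the-envelope arithmetic above can be sketched as a tiny cost model (a simplification: real optimizers also account for serialization, coordination, and broadcast fan-out, which is why the 100MB broadcast costs ~4 seconds rather than its raw 0.8-second wire time):

```python
def transfer_seconds(size_gb: float, bandwidth_gbps: float) -> float:
    """Raw wire time to move size_gb of data over a bandwidth_gbps link."""
    return size_gb * 8 / bandwidth_gbps  # GB -> gigabits, then divide by Gbps

# Shuffling the 10GB orders table over a 1 Gbps network:
print(transfer_seconds(10, 1.0))   # 80.0 seconds
# Raw transfer for the 100MB (0.1GB) products table:
print(transfer_seconds(0.1, 1.0))  # 0.8 seconds, before broadcast overhead
```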

Broadcast vs Shuffle Latency

A broadcast join replicates the smaller table to every node. For SELECT * FROM orders o JOIN dim_products p ON o.product_id = p.id, broadcasting the 100MB dim_products table completes in about 4 seconds, after which each node joins locally. A shuffle join repartitions both tables: shuffling 10GB takes 30 to 60 seconds once coordination overhead is included. Broadcast wins when the dimension table is under roughly 100 to 500MB. Co-located joins avoid the network entirely: when both tables are already partitioned by product_id, the join executes with zero network wait.
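This decision rule can be sketched as a small planner helper (the function name and the 300MB default threshold are illustrative, not taken from any particular engine):

```python
def choose_join_strategy(left_mb: float, right_mb: float,
                         copartitioned: bool,
                         broadcast_limit_mb: float = 300) -> str:
    """Pick a join strategy from table sizes and partitioning, cheapest first."""
    if copartitioned:
        return "colocated"   # both sides hashed on the join key: zero network wait
    if min(left_mb, right_mb) <= broadcast_limit_mb:
        return "broadcast"   # replicate the small side to every node
    return "shuffle"         # repartition both sides on the join key

print(choose_join_strategy(10_000, 100, copartitioned=False))    # broadcast
print(choose_join_strategy(10_000, 8_000, copartitioned=False))  # shuffle
```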

Skew Creates Stragglers

Data skew causes slow queries. For SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id, if 80% of orders belong to 1% of customers, hash partitioning sends 80% of the rows to a few nodes. With 100 nodes and even distribution, the query finishes in 10 seconds. With skew, 99 nodes finish in 2 seconds and sit idle while one node processes 80% of the data for 800 seconds; the query is only as fast as its slowest node. Key salting fixes this: GROUP BY customer_id || '_' || rand(0,9) spreads each hot key across 10 partitions, then a second pass aggregates the partial results, reducing 800 seconds to ~80 seconds.
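The two-phase salted aggregation can be sketched in plain Python (a single-process stand-in for the distributed version, with the `partial` dict playing the role of the salted GROUP BY output on each partition):

```python
import random
from collections import defaultdict

def salted_sum(rows, num_salts=10):
    """Two-phase SUM(amount) GROUP BY customer_id with key salting."""
    # Phase 1: salt each key so a hot customer spreads over num_salts partitions.
    partial = defaultdict(float)
    for customer_id, amount in rows:
        salted_key = (customer_id, random.randrange(num_salts))
        partial[salted_key] += amount
    # Phase 2: strip the salt and merge partial sums into final totals.
    totals = defaultdict(float)
    for (customer_id, _salt), subtotal in partial.items():
        totals[customer_id] += subtotal
    return dict(totals)

# One hot customer dominating the data still aggregates correctly:
rows = [("hot_customer", 1.0)] * 1000 + [("c42", 2.0)] * 10
print(salted_sum(rows))  # {'hot_customer': 1000.0, 'c42': 20.0}
```

Salting trades one aggregation pass for two, but the second pass touches at most `num_salts` rows per key, so it is cheap compared to the straggler it removes.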

Cross-Region Latency

Cross-region queries add 50 to 150ms per network round trip. For SELECT * FROM us_east.orders JOIN eu_west.products ON ..., 10 coordination round trips add 500ms to 1.5 seconds before data processing even starts. Transferring 10GB across regions at 100 Mbps takes 800 seconds, versus 80 seconds within a region. Push joins to the larger table's region and replicate small dimension tables locally.
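A rough latency estimate for such a query adds coordination round trips to bulk-transfer time (a simplified model; the example numbers assume the 100ms RTT and 100 Mbps cross-region link from the text):

```python
def cross_region_seconds(round_trips: int, rtt_ms: float,
                         data_gb: float, bandwidth_mbps: float) -> float:
    """Coordination latency plus bulk-transfer time for a cross-region query."""
    coordination = round_trips * rtt_ms / 1000  # serial round trips, in seconds
    transfer = data_gb * 8000 / bandwidth_mbps  # GB -> megabits, then Mbps
    return coordination + transfer

# 10 round trips at 100ms, plus 10GB over a 100 Mbps link:
print(cross_region_seconds(10, 100, 10, 100))  # 801.0 seconds
```

The transfer term dominates, which is why the text's advice is to move the computation (push the join to the larger table's region) rather than the data.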

💡 Key Takeaways
Shuffling 10GB over 1 Gbps takes 80 seconds; broadcasting 100MB takes 4 seconds; wrong join strategy turns 5 second query into 90 seconds
Broadcast threshold typically 100 to 500MB; co-located joins (both tables partitioned by the join key) execute with zero network wait
Skew causes stragglers: 80% of data to 1 node means query takes 800 seconds (waiting for slowest) instead of 10 seconds with even distribution
Key salting with GROUP BY customer_id || '_' || rand(0,9) spreads hot keys across 10 partitions, reducing 800 seconds to ~80 seconds
Cross region adds 50 to 150ms per round trip; 10GB at 100 Mbps takes 800 seconds cross region versus 80 seconds within region
📌 Interview Tips
1. Explain the straggler math: 100 nodes, 10 seconds expected. One node gets 80% of the work at 1/100 of cluster capacity = 800 seconds. The other 99 nodes sit idle after 2 seconds, waiting for the slow one.
2. Compare broadcast vs shuffle: broadcasting 100MB = 4 seconds; shuffling 10GB = 30 to 60 seconds due to coordination. Broadcast wins for small dimension tables.
3. Walk through key salting: GROUP BY customer_id with a hot customer creates one huge partition. Append || '_' || rand(0,9) to split it across 10 partitions, then SUM the partial results.