Broadcast Hash Join: Trading Memory for Speed
How It Works: Broadcast hash join is the simplest and often fastest distributed join strategy. The query optimizer identifies a small table (the dimension side) and sends a complete copy of it to every worker node in the cluster. Each worker builds an in memory hash table from that broadcast table, keyed by the join column. Then each worker processes its local partition of the large table (the fact side), probing the hash table for each row without any network communication.
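To see the build and probe phases concretely, here is a minimal single-worker sketch in plain Python; the tables, column names, and row values are hypothetical stand-ins for whatever data the engine actually ships to each worker.

```python
# Minimal sketch of what each worker does in a broadcast hash join.
# The broadcast side and the local fact partition are hypothetical data.

# Dimension table copy received via broadcast: (product_id, product_name)
broadcast_dim = [
    (1, "widget"),
    (2, "gadget"),
    (3, "gizmo"),
]

# This worker's local partition of the fact table: (product_id, sale_amount)
local_fact_partition = [
    (2, 19.99),
    (1, 5.00),
    (2, 7.50),
    (4, 3.25),   # no matching dimension row -> dropped by an inner join
]

# Build phase: hash table keyed by the join column, built once per worker.
hash_table = {product_id: name for product_id, name in broadcast_dim}

# Probe phase: stream local fact rows and look each one up; no network I/O.
joined = [
    (product_id, amount, hash_table[product_id])
    for product_id, amount in local_fact_partition
    if product_id in hash_table
]

print(joined)  # [(2, 19.99, 'gadget'), (1, 5.0, 'widget'), (2, 7.5, 'gadget')]
```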
The magic is in what you avoid: no shuffling the large table. If you have a 2 terabyte table partitioned across 100 nodes, each node already has roughly 20 gigabytes. Those 20 gigabytes stay put. You only broadcast the small 5 gigabyte dimension once to all 100 nodes.
The Performance Win: On a 100 node cluster with a 40 gigabit per second network, broadcasting 5 gigabytes to each node means roughly 500 gigabytes of total transfer. That completes in under 2 minutes. The join stage that follows is embarrassingly parallel: each worker joins its 20 gigabyte partition with the 5 gigabyte hash table independently. Typical completion time for this join stage is 2 to 5 minutes at p95 latency.
Compare this to shuffle hash join where both the 2 terabyte and 5 gigabyte tables get hash partitioned and shuffled. You move approximately 2 terabytes across the network, which on the same cluster takes 10 to 15 minutes just for data transfer, before the join even starts.
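A quick back-of-the-envelope sketch, using the article's assumptions (100 nodes, 40 gigabit per second links, a 5 gigabyte dimension, a 2 terabyte fact table), shows where the gap comes from; the figures are illustrative arithmetic, not benchmarks.

```python
# Rough transfer-volume comparison for the two strategies above.
# All figures follow the article's assumptions and are illustrative only.

NODES = 100
LINK_GBPS = 40        # per-node network bandwidth, gigabits per second
DIM_GB = 5            # small dimension table
FACT_GB = 2_000       # 2 TB fact table, already partitioned across the nodes

# Broadcast: every node receives its own copy of the dimension table.
broadcast_transfer_gb = DIM_GB * NODES          # 500 GB total

# Shuffle: roughly the whole fact table (plus the dimension) is repartitioned,
# counting each byte once as it crosses the network. Counting both send and
# receive sides doubles this to ~4 TB, which is what the chart below shows.
shuffle_transfer_gb = FACT_GB + DIM_GB          # ~2 TB moved

# If one sender pushes all broadcast copies at line rate:
# 40 Gbps = 5 GB/s, so 500 GB / 5 GB/s = 100 s, i.e. the "under 2 minutes"
# figure in the text. Real shuffles also pay serialization and disk costs,
# which is why the article quotes 10 to 15 minutes for the 2 TB shuffle.
link_gb_per_s = LINK_GBPS / 8
broadcast_send_time_s = broadcast_transfer_gb / link_gb_per_s

print(f"broadcast total: {broadcast_transfer_gb} GB")
print(f"shuffle total:   ~{shuffle_transfer_gb} GB")
print(f"broadcast send time at line rate: {broadcast_send_time_s:.0f} s")
```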
Memory is the Constraint: The tradeoff is cluster wide memory consumption. Broadcasting a 5 gigabyte table to 100 workers creates 500 gigabytes of aggregate memory footprint. If each worker has 64 gigabytes of RAM with 70 percent allocated to the execution engine, that is roughly 45 gigabytes available per worker. A 5 gigabyte broadcast fits comfortably, but a 15 gigabyte broadcast on the same hardware would consume one third of available memory per node, leaving little room for task execution buffers, garbage collection, or concurrent queries.
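In a Spark-style engine, the broadcast size cutoff and an explicit hint might look roughly like the sketch below; the 2 gigabyte threshold, file paths, and column name are illustrative assumptions, not recommendations.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# Let the optimizer auto-broadcast anything below ~2 GB (value in bytes).
# Pick this relative to per-executor memory, not just the table's size.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(2 * 1024**3))

# Paths and the join column are hypothetical.
fact = spark.read.parquet("/data/sales_fact")    # large table, stays put
dim = spark.read.parquet("/data/product_dim")    # small dimension table

# Explicit hint: force a broadcast of the dimension side regardless of stats.
joined = fact.join(broadcast(dim), on="product_id", how="inner")

joined.explain()   # physical plan should show a broadcast hash join
```

Sizing the threshold against per-executor memory, rather than table size alone, is what keeps the aggregate footprint described above under control.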
Network Transfer Comparison (chart): roughly 500 GB for broadcast (5 GB × 100 nodes) versus roughly 4 TB when both sides are shuffled.
⚠️ Common Pitfall: Stale table statistics can be catastrophic. If the optimizer thinks a table is 2 gigabytes but it is actually 20 gigabytes, broadcasting it causes out of memory errors and executor crashes across the entire cluster.
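One practical guard, assuming a Spark-style engine with catalog-managed tables, is to refresh statistics and check which join the planner actually picked before running the query at scale; the database and table names below are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stats-check-sketch").getOrCreate()

# Refresh size statistics so the broadcast decision uses current table sizes,
# not stale estimates (database and table names here are hypothetical).
spark.sql("ANALYZE TABLE sales.product_dim COMPUTE STATISTICS")

# Inspect the plan before running the join for real. If the "small" side has
# quietly grown past the broadcast threshold, the plan should fall back to a
# shuffle or sort-merge join instead of an oversized broadcast.
spark.sql("""
    SELECT f.order_id, f.amount, d.product_name
    FROM sales.orders_fact f
    JOIN sales.product_dim d
      ON f.product_id = d.product_id
""").explain()
```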
When to Use: Broadcast hash join is ideal for star schema queries joining large fact tables with small dimension tables. If your dimension tables are under 2 gigabytes and you have adequate cluster memory, broadcast delivers 2 to 5 times lower latency than shuffle based alternatives. Interactive SQL workloads, dashboards, and ad hoc analytics rely heavily on broadcast joins to keep query times in the seconds to tens of seconds range rather than minutes.

💡 Key Takeaways
✓ Broadcast join sends a complete copy of the small table to every worker, creating an aggregate memory footprint of table size times number of nodes (5 gigabytes times 100 nodes equals 500 gigabytes total)
✓ Avoids shuffling the large table entirely, reducing network transfer from terabytes to hundreds of gigabytes and cutting join latency by 2 to 5 times compared to shuffle joins
✓ Each worker builds an in memory hash table from the broadcast table and probes it with local partition rows, making the join embarrassingly parallel with no network communication
✓ Memory constraint is the critical limit: a 15 gigabyte broadcast on workers with 45 gigabytes of available memory consumes one third of capacity, leaving little room for other operations
✓ Stale statistics are catastrophic: if the optimizer thinks a table is 2 gigabytes but it is actually 20 gigabytes, broadcasting it causes cluster wide out of memory failures
📌 Examples
1. Star schema join at Databricks: a 50 gigabyte fact table with a 500 megabyte dimension completes in under 5 seconds p95 with broadcast, versus 20 to 30 seconds with shuffle
2. Netflix hourly ETL: broadcasting 1 gigabyte of device metadata and 5 gigabytes of user profiles to 100 nodes (600 gigabytes total transfer) completes in under 2 minutes on a 40 gigabit per second network
3. Interactive dashboard query: joining 100 gigabytes of sales data with a 3 gigabyte product catalog using broadcast keeps query time under 10 seconds for end user responsiveness