
How Do You Choose Between Data Parallelism, Tensor Parallelism, and Pipeline Parallelism?

Choosing the right parallelism strategy depends on your model size, hardware topology, and performance goals. Each approach has distinct trade-offs in simplicity, memory efficiency, communication cost, and scaling limits.

Data parallelism is the simplest. Each worker holds a full model copy and processes a different mini-batch, and gradients are synchronized with an all-reduce at the end of the backward pass. This scales well until communication time approaches compute time. For a 7 billion parameter model on 8 GPUs with strong interconnects, data parallelism alone can achieve near linear scaling. Communication cost is one all-reduce of gradients per step, roughly 28 GB for fp32 gradients. At 50 GB per second effective bandwidth, this takes about 0.56 seconds; if compute per step is 5 seconds, communication overhead is only about 11 percent. Use pure data parallelism when your model fits comfortably on one device and network bandwidth is strong.

Tensor parallelism splits individual operations within layers across devices. This removes per-layer memory limits and can reduce per-step time by parallelizing large matrix multiplications. However, it demands very high bandwidth, low latency interconnects, because activations and partial results flow between devices within each layer. Tensor parallelism works best within a single node over NVLink or NVSwitch, which provide roughly 600 GB per second of aggregate bandwidth. Across nodes on 100 Gbps Ethernet, tensor parallelism bottlenecks quickly. Use tensor parallelism when a single layer exceeds device memory and you have fast intra-node links.

Pipeline parallelism assigns layers to different stages and streams micro-batches through the pipeline. This scales across nodes more gracefully than tensor parallelism because communication only happens between adjacent stages, not within every operation. Pipeline efficiency is roughly m / (m + p - 1), where m is the number of micro-batches and p is the number of stages; with 8 stages and 32 micro-batches, efficiency is about 82 percent. The trade-off is pipeline bubbles and increased activation memory or recomputation cost. Use pipeline parallelism when model depth is large, you can balance the stages, and you have enough micro-batches to keep the pipeline full.

In practice, large models use 3D parallelism, combining all three. A 70 billion parameter model on 256 A100 GPUs might use tensor parallel size 8 within each node to leverage NVLink, pipeline parallel size 8 across nodes to divide layers, and data parallel size 4 to increase effective batch size. This maps parallelism to hardware topology, keeping high bandwidth communication on fast links and coarser synchronization on slower inter-node links. Google TPU v4 pods and NVIDIA Megatron both adopt this pattern, tuning the parallelism degrees to match interconnect capabilities and memory constraints.
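As a quick back-of-envelope check on these figures, the short Python sketch below reproduces the communication-overhead and pipeline-efficiency estimates. The 5-second compute step, 50 GB per second bandwidth, and fp32 gradient size are the illustrative assumptions from the discussion above, not measurements.

```python
# Back-of-envelope estimates for the figures quoted above.
# All inputs are illustrative assumptions, not measurements.

params = 7e9                   # 7B-parameter model
bytes_per_grad = 4             # fp32 gradients
grad_bytes = params * bytes_per_grad          # ~28 GB
bandwidth = 50e9               # 50 GB/s effective all-reduce bandwidth (assumed)
compute_per_step = 5.0         # seconds of compute per training step (assumed)

allreduce_time = grad_bytes / bandwidth       # ~0.56 s
overhead = allreduce_time / compute_per_step  # ~11% communication overhead
print(f"all-reduce: {allreduce_time:.2f} s, overhead: {overhead:.0%}")

def pipeline_efficiency(micro_batches: int, stages: int) -> float:
    """Fraction of time the pipeline does useful work: m / (m + p - 1)."""
    return micro_batches / (micro_batches + stages - 1)

print(f"pipeline efficiency: {pipeline_efficiency(32, 8):.0%}")  # ~82%
```

The same arithmetic also shows when pure data parallelism stops paying off: as compute per step shrinks or the model grows, the fixed all-reduce time becomes a larger fraction of each step.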
💡 Key Takeaways
Data parallelism is simplest and scales well when the model fits on one device and communication overhead stays under 20 percent, requiring only one all-reduce of gradients per step
Tensor parallelism splits layers across devices with fast interconnects like NVLink at 600 GB per second, but bottlenecks on slower inter-node links like 100 Gbps Ethernet (see the sketch after this list)
Pipeline parallelism divides layers into stages and achieves 82 percent efficiency with 32 micro-batches and 8 stages, best when the model is deep but individual layers fit on a device
3D parallelism combines tensor, pipeline, and data parallelism, mapping to hardware topology with high bandwidth operations on fast intra-node links and coarse synchronization across nodes
Real systems like Google TPU v4 and NVIDIA Megatron tune parallelism degrees to match interconnect capabilities, for example tensor parallel 8 within nodes and pipeline parallel 8 across nodes
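To make the idea of splitting a single layer concrete, here is a minimal NumPy sketch of a column-parallel linear layer, with each "device" simulated as a plain array shard. The shapes and device count are made up for illustration, and the concatenation stands in for the all-gather a real framework such as Megatron performs over NVLink.

```python
import numpy as np

# Toy column-parallel linear layer: the weight matrix is split column-wise
# across "devices", each device computes its slice of the output, and the
# slices are concatenated (an all-gather in a real system).
# Shapes and device count are illustrative assumptions.

num_devices = 8
d_in, d_out = 1024, 4096

rng = np.random.default_rng(0)
x = rng.standard_normal((16, d_in))          # activations, replicated on every device
w = rng.standard_normal((d_in, d_out))       # full weight, kept only to check the result

# Each device holds one column shard of the weight matrix.
w_shards = np.split(w, num_devices, axis=1)  # 8 shards of shape (1024, 512)

# Local matmul on each device, then concatenate (stands in for all-gather).
partial_outputs = [x @ shard for shard in w_shards]
y_parallel = np.concatenate(partial_outputs, axis=1)

assert np.allclose(y_parallel, x @ w)        # matches the unsplit computation
```

Because every device needs the full activations and the sharded outputs must be gathered at every layer, this communication pattern is what makes fast intra-node links a prerequisite for tensor parallelism.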
📌 Examples
A 7 billion parameter model fits on one 80 GB A100; use pure data parallelism across 8 GPUs, with about 11 percent communication overhead for the 28 GB fp32 gradient all-reduce at 50 GB per second
A 70 billion parameter model uses pipeline parallelism with 8 stages and 32 micro-batches, achieving 82 percent efficiency with each stage holding roughly 9 billion parameters
Tensor parallelism splits a 20 GB attention layer across 8 GPUs over NVLink; each GPU holds 2.5 GB and exchanges 5 GB of activations in under 10 milliseconds
Meta trains a 175 billion parameter model with tensor parallel 8, pipeline parallel 16, and data parallel 4 on 512 A100 GPUs, matching parallelism degrees to node topology; the sketch below shows how such a rank-to-group mapping can work
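As a sketch of how such parallelism degrees map onto global ranks, the snippet below partitions 512 ranks into tensor, pipeline, and data parallel coordinates with the tensor dimension innermost so that it stays within a node. The ordering convention and contiguous rank-per-node placement are assumptions for illustration; real frameworks choose their own layouts.

```python
# Partition global ranks into 3D parallel groups.
# Degrees follow the 175B example above: TP=8, PP=16, DP=4 on 512 GPUs.
# Ordering (tensor innermost, then data, then pipeline) is one plausible
# convention, assumed here for illustration; frameworks differ in details.

TP, PP, DP = 8, 16, 4
WORLD = TP * PP * DP   # 512

def coords(rank):
    """Map a global rank to (pipeline, data, tensor) coordinates."""
    tp = rank % TP
    dp = (rank // TP) % DP
    pp = rank // (TP * DP)
    return pp, dp, tp

# Tensor-parallel group: ranks that differ only in the tensor coordinate.
# Assuming 8 GPUs per node and contiguous rank placement, these share a node.
tensor_group = [r for r in range(WORLD) if coords(r)[:2] == (0, 0)]
print(tensor_group)        # [0, 1, 2, ..., 7] -> all within one node

# Pipeline-parallel group: same tensor and data coordinates, different stage.
pipeline_group = [r for r in range(WORLD) if (coords(r)[1], coords(r)[2]) == (0, 0)]
print(pipeline_group)      # [0, 32, 64, ..., 480] -> spans nodes, as stages do
```

With 8 GPUs per node, every tensor-parallel group lands entirely on one node's NVLink, while pipeline and data parallel traffic crosses the slower inter-node network, which is exactly the topology mapping described above.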