Model (Tensor) Parallelism: Splitting Layers Across Devices
What Tensor Parallelism Does
Model parallelism, in its most common form tensor parallelism (TP), partitions individual layers across multiple devices. Instead of replicating the entire model, each GPU holds a slice of each weight matrix and computes its portion of the matrix multiplications. For example, in a transformer attention layer with a hidden size of 12,288, each of the 8 GPUs in a tensor-parallel group holds a 12,288 × 1,536 slice of the query projection matrix and computes its slice of the output in parallel.
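The column-parallel split described above can be sketched with numpy. This is a single-process simulation, not a real multi-GPU implementation: the toy dimensions (hidden = 16 split 4 ways, standing in for 12,288 split 8 ways) and variable names are illustrative assumptions.

```python
import numpy as np

# Toy dimensions standing in for the real case (hidden = 12,288 split
# across 8 GPUs -> 1,536 columns each). Small sizes so this runs anywhere.
hidden, tp_degree = 16, 4

rng = np.random.default_rng(0)
x = rng.standard_normal((2, hidden))         # activations: (batch, hidden)
w_q = rng.standard_normal((hidden, hidden))  # full query projection weight

# Column-parallel split: "rank" r holds its own contiguous block of columns.
shards = np.split(w_q, tp_degree, axis=1)

# Each rank computes its slice of the output independently...
partial_outputs = [x @ shard for shard in shards]

# ...and an all-gather along the hidden dimension reconstructs the result.
q = np.concatenate(partial_outputs, axis=1)

assert np.allclose(q, x @ w_q)  # identical to the unsharded computation
```

The concatenation at the end plays the role of the all-gather collective that a real framework would issue between ranks.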
Communication Pattern Challenges
The key challenge is the communication pattern. Each transformer layer requires two collective operations: an all-gather or all-reduce after the forward matrix multiply to combine partial results, and another during backpropagation to combine gradients. For a sequence length of 2,048 tokens, hidden size 12,288, and batch size 8, the FP16 activation tensor is 8 × 2,048 × 12,288 × 2 bytes ≈ 402 MB per layer per collective. A 96-layer model performs 192 collectives per training step, moving tens of gigabytes. This is why tensor parallelism demands extremely high-bandwidth, low-latency interconnects such as NVLink (600 GB/s between GPUs) or NVSwitch fabrics.
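The arithmetic above is easy to verify directly. A minimal sketch, using the figures from the text (FP16 = 2 bytes, decimal megabytes/gigabytes):

```python
# Per-layer activation tensor exchanged by one collective (FP16 = 2 bytes).
batch, seq_len, hidden = 8, 2048, 12288
bytes_per_elem = 2

activation_bytes = batch * seq_len * hidden * bytes_per_elem
print(activation_bytes / 1e6)  # ~402.7 MB per layer per collective

# Two collectives per layer (forward + backward) across 96 layers.
layers, collectives_per_layer = 96, 2
step_bytes = layers * collectives_per_layer * activation_bytes
print(step_bytes / 1e9)  # ~77.3 GB moved per training step
```

The ~77 GB per step is what "tens of gigabytes" cashes out to for this configuration.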
Production Implementation
NVIDIA's Megatron-LM pioneered production tensor parallelism, demonstrating 76 percent scaling efficiency on 512 V100 GPUs by keeping tensor-parallel groups within NVLink-connected sets of 8 GPUs. When tensor-parallel communication crosses slower network links (InfiniBand at 200 Gbps, i.e. 25 GB/s), per-layer latency spikes and GPU utilization collapses. The rule of thumb: confine tensor parallelism to the fastest hardware domain, typically 4 to 8 GPUs on the same node or blade.
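A quick back-of-the-envelope comparison makes the interconnect gap concrete. This sketch divides one ~402 MB collective payload by the peak bandwidths quoted in the text; real collectives add latency and algorithm overhead, so these are optimistic lower bounds.

```python
payload_gb = 0.403              # one ~402 MB activation tensor, per the text
nvlink_bw, ib_bw = 600.0, 25.0  # GB/s: NVLink vs InfiniBand (200 Gbps)

t_nvlink_ms = payload_gb / nvlink_bw * 1e3
t_ib_ms = payload_gb / ib_bw * 1e3
print(round(t_nvlink_ms, 2), round(t_ib_ms, 2))  # ~0.67 ms vs ~16.12 ms
```

A 24× slowdown per collective, repeated 192 times per step, is why crossing the node boundary collapses utilization.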
When to Use Tensor Parallelism
Tensor parallelism shines when individual layers exceed device memory. A single feed-forward layer in a large transformer can have weight matrices of 12,288 × 49,152 elements (approximately 1.2 GB in FP16), and the activations for long sequences or large batches blow past memory limits. Splitting layers across devices makes training feasible. The trade-off is that you reduce the data-parallel degree (fewer replicas means a smaller effective batch size or more gradient-accumulation steps) and add communication overhead that pays off only when memory constraints force your hand.
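The memory saving from sharding that feed-forward matrix can be checked with the same kind of arithmetic, assuming the 8-way tensor-parallel degree used earlier in the section:

```python
# Weight memory for the feed-forward matrix from the text (FP16 = 2 bytes),
# before and after 8-way tensor-parallel sharding.
hidden, ffn = 12288, 49152
bytes_per_elem = 2

full_gb = hidden * ffn * bytes_per_elem / 1e9
print(round(full_gb, 2))  # ~1.21 GB for the unsharded matrix

tp_degree = 8
print(round(full_gb / tp_degree, 3))  # ~0.151 GB per GPU with 8-way TP
```

Each GPU holds one eighth of the weights (and of the corresponding activations), which is what makes otherwise-infeasible layer sizes trainable.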