Model (Tensor) Parallelism: Splitting Layers Across Devices
What Tensor Parallelism Does
Model parallelism, in its most common form tensor parallelism (TP), partitions individual layers across multiple devices. Instead of replicating the entire model, each GPU holds a slice of each weight matrix and computes its portion of the matrix multiplications. For example, in a transformer attention layer with a hidden size of 12,288, each of the 8 GPUs in a tensor-parallel group holds a 12,288 × 1,536 slice of the query projection matrix and computes its slice of the output in parallel.
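The column-parallel split described above can be sketched with numpy. This is a single-process simulation, not a real multi-GPU implementation: the toy dimensions (hidden = 16 split 4 ways, standing in for 12,288 split 8 ways) and variable names are illustrative assumptions.

```python
import numpy as np

# Toy dimensions standing in for the real case (hidden = 12,288 split
# across 8 GPUs -> 1,536 columns each). Small sizes so this runs anywhere.
hidden, tp_degree = 16, 4

rng = np.random.default_rng(0)
x = rng.standard_normal((2, hidden))         # activations: (batch, hidden)
w_q = rng.standard_normal((hidden, hidden))  # full query projection weight

# Column-parallel split: "rank" r holds its own contiguous block of columns.
shards = np.split(w_q, tp_degree, axis=1)

# Each rank computes its slice of the output independently...
partial_outputs = [x @ shard for shard in shards]

# ...and an all-gather along the hidden dimension reconstructs the result.
q = np.concatenate(partial_outputs, axis=1)

assert np.allclose(q, x @ w_q)  # identical to the unsharded computation
```

The concatenation at the end plays the role of the all-gather collective that a real framework would issue between ranks.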
Communication Pattern Challenges
The key challenge is the communication pattern. Each transformer layer requires two collective operations: an all-gather or all-reduce after the forward matrix multiply to combine partial results, and another during backpropagation to combine gradients. For a sequence length of 2,048 tokens, hidden size 12,288, and batch size 8, the FP16 activation tensor is 8 × 2,048 × 12,288 × 2 bytes ≈ 402 MB per layer per collective. A 96-layer model performs 192 collectives per training step, moving tens of gigabytes. This is why tensor parallelism demands extremely high-bandwidth, low-latency interconnects such as NVLink (600 GB/s between GPUs) or NVSwitch fabrics.
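The arithmetic above is easy to verify directly. A minimal sketch, using the figures from the text (FP16 = 2 bytes, decimal megabytes/gigabytes):

```python
# Per-layer activation tensor exchanged by one collective (FP16 = 2 bytes).
batch, seq_len, hidden = 8, 2048, 12288
bytes_per_elem = 2

activation_bytes = batch * seq_len * hidden * bytes_per_elem
print(activation_bytes / 1e6)  # ~402.7 MB per layer per collective

# Two collectives per layer (forward + backward) across 96 layers.
layers, collectives_per_layer = 96, 2
step_bytes = layers * collectives_per_layer * activation_bytes
print(step_bytes / 1e9)  # ~77.3 GB moved per training step
```

The ~77 GB per step is what "tens of gigabytes" cashes out to for this configuration.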
Production Implementation
NVIDIA's Megatron-LM pioneered production tensor parallelism, demonstrating 76 percent scaling efficiency on 512 V100 GPUs by keeping tensor-parallel groups within NVLink-connected sets of 8 GPUs. When tensor-parallel communication crosses slower network links (InfiniBand at 200 Gbps, i.e. 25 GB/s), per-layer latency spikes and GPU utilization collapses. The rule of thumb: confine tensor parallelism to the fastest hardware domain, typically 4 to 8 GPUs on the same node or blade.
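A quick back-of-the-envelope comparison makes the interconnect gap concrete. This sketch divides one ~402 MB collective payload by the peak bandwidths quoted in the text; real collectives add latency and algorithm overhead, so these are optimistic lower bounds.

```python
payload_gb = 0.403              # one ~402 MB activation tensor, per the text
nvlink_bw, ib_bw = 600.0, 25.0  # GB/s: NVLink vs InfiniBand (200 Gbps)

t_nvlink_ms = payload_gb / nvlink_bw * 1e3
t_ib_ms = payload_gb / ib_bw * 1e3
print(round(t_nvlink_ms, 2), round(t_ib_ms, 2))  # ~0.67 ms vs ~16.12 ms
```

A 24× slowdown per collective, repeated 192 times per step, is why crossing the node boundary collapses utilization.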
When to Use Tensor Parallelism
Tensor parallelism shines when individual layers exceed device memory. A single feed-forward layer in a large transformer can have weight matrices of 12,288 × 49,152 elements (approximately 1.2 GB in FP16), and the activations for long sequences or large batches blow past memory limits. Splitting layers across devices makes training feasible. The trade-off is that you reduce the data-parallel degree (fewer replicas means a smaller effective batch size or more gradient-accumulation steps) and add communication overhead that pays off only when memory constraints force your hand.
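The memory saving from sharding that feed-forward matrix can be checked with the same kind of arithmetic, assuming the 8-way tensor-parallel degree used earlier in the section:

```python
# Weight memory for the feed-forward matrix from the text (FP16 = 2 bytes),
# before and after 8-way tensor-parallel sharding.
hidden, ffn = 12288, 49152
bytes_per_elem = 2

full_gb = hidden * ffn * bytes_per_elem / 1e9
print(round(full_gb, 2))  # ~1.21 GB for the unsharded matrix

tp_degree = 8
print(round(full_gb / tp_degree, 3))  # ~0.151 GB per GPU with 8-way TP
```

Each GPU holds one eighth of the weights (and of the corresponding activations), which is what makes otherwise-infeasible layer sizes trainable.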