Model Parallelism: Tensor and Pipeline Parallelism Explained
When One GPU Is Not Enough
Large models do not fit on a single GPU. A 70B parameter model in float16 requires 140GB of memory for its weights alone. The largest consumer GPUs have 24GB; even datacenter A100s max out at 80GB. To run these models, you must split them across multiple GPUs. This is model parallelism: distributing different parts of the same model across devices.
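The arithmetic behind that 140GB figure is just parameter count times bytes per parameter. A minimal Python sketch (the helper name and the 8-way split are illustrative, and it counts weights only, ignoring activations and the KV cache):

```python
# Rough memory estimate for the weights alone (ignores activations and KV cache).
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Gigabytes needed to hold the model weights."""
    return num_params * bytes_per_param / 1e9

print(weight_memory_gb(70e9))      # 140.0 -- float16 weights of a 70B model
print(weight_memory_gb(70e9) / 8)  # 17.5  -- per-GPU share when split 8 ways
```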
There are two main approaches. Tensor parallelism splits individual layers across GPUs. Each GPU holds a portion of every layer's weights and computes part of every operation. Pipeline parallelism splits the model by layers: GPU 1 has layers 1-20, GPU 2 has layers 21-40, etc. Data flows through GPUs sequentially, like an assembly line.
Tensor Parallelism
Tensor parallelism divides individual matrix operations across GPUs. A matrix multiply with a 4096x4096 weight matrix split across 4 GPUs becomes four 1024x4096 operations: each GPU computes a partial result, and a collective operation combines the partial results into the full output. This requires a high-bandwidth interconnect between GPUs (NVLink at 600GB/s, not PCIe at 64GB/s) because every layer needs communication.
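Here is a minimal sketch of that row-wise split in PyTorch, simulating the four GPU shards on a single device; `torch.chunk` and the final sum stand in for the scatter and all-reduce collectives a real multi-GPU implementation would issue over the interconnect.

```python
import torch

torch.manual_seed(0)
hidden, world_size = 4096, 4     # hidden size and number of simulated GPUs

x = torch.randn(1, hidden)       # one token's activations
W = torch.randn(hidden, hidden)  # the full 4096x4096 weight matrix

# Row-wise split: each "GPU" holds a 1024x4096 shard of W plus the
# matching 1024-wide slice of the input.
w_shards = torch.chunk(W, world_size, dim=0)
x_shards = torch.chunk(x, world_size, dim=1)

# Each GPU computes its partial 4096-dim result independently...
partials = [xi @ wi for xi, wi in zip(x_shards, w_shards)]

# ...then an all-reduce sums the partial results into the final output.
y = torch.stack(partials).sum(dim=0)

assert torch.allclose(y, x @ W, atol=1e-3)
```

Real implementations such as Megatron-LM typically pair a column-wise split with a row-wise split across consecutive matrix multiplies so a transformer block needs only a couple of collectives, but that per-block communication is still why the fast interconnect matters.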
The benefit is low latency: all GPUs work simultaneously on each token. The cost is communication overhead, which grows with the number of GPUs. Tensor parallelism typically scales efficiently up to 8 GPUs; beyond that, communication dominates.
Pipeline Parallelism
Pipeline parallelism assigns different layers to different GPUs. Token processing flows through the pipeline: GPU 1 computes layers 1-20, sends activations to GPU 2, which computes layers 21-40, and so on. The advantage is lower communication overhead: only the activations at stage boundaries cross between GPUs, rather than a collective for every layer. The disadvantage is latency and utilization: the stages run one after another, so a single request still waits for every stage in turn, per-token latency does not drop the way it does with tensor parallelism, and at any moment only one GPU is busy unless several requests are in flight to keep the pipeline full.
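A minimal sketch of the layer-wise split, again in PyTorch; the toy two-stage model and the device placement comments are illustrative, and on real hardware each stage would live on its own GPU with a `.to(...)` transfer of the activations between them.

```python
import torch
from torch import nn

# A toy eight-layer model split into two pipeline stages.
# In a real deployment each stage would sit on its own GPU, e.g. "cuda:0" and "cuda:1".
def make_stage(num_layers: int, width: int = 256) -> nn.Sequential:
    return nn.Sequential(*[nn.Linear(width, width) for _ in range(num_layers)])

stage0 = make_stage(4)   # layers 1-4, would live on GPU 1
stage1 = make_stage(4)   # layers 5-8, would live on GPU 2

def pipeline_forward(x: torch.Tensor) -> torch.Tensor:
    activations = stage0(x)     # GPU 1 runs its layers...
    # ...and ships only these activations onward; with real devices this
    # would be `activations = activations.to("cuda:1")`.
    return stage1(activations)  # GPU 2 finishes the forward pass

out = pipeline_forward(torch.randn(8, 256))
print(out.shape)  # torch.Size([8, 256])
```

Serving and training systems keep the pipeline full by feeding it several requests or micro-batches at once, so stage 1 starts on the next input while stage 2 is still finishing the previous one.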