Model Parallelism: Tensor and Pipeline Parallelism Explained
When One GPU Is Not Enough
Large models do not fit on a single GPU. A 70B parameter model in float16 requires 140GB of memory for its weights alone. The largest consumer GPUs have 24GB; even datacenter A100s max out at 80GB. To run these models, you must split them across multiple GPUs. This is model parallelism: distributing different parts of the same model across devices.
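The arithmetic behind that 140GB figure is just parameter count times bytes per parameter. A minimal Python sketch (the helper name and the 8-way split are illustrative, and it counts weights only, ignoring activations and the KV cache):

```python
# Rough memory estimate for the weights alone (ignores activations and KV cache).
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Gigabytes needed to hold the model weights."""
    return num_params * bytes_per_param / 1e9

print(weight_memory_gb(70e9))      # 140.0 -- float16 weights of a 70B model
print(weight_memory_gb(70e9) / 8)  # 17.5  -- per-GPU share when split 8 ways
```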
There are two main approaches. Tensor parallelism splits individual layers across GPUs. Each GPU holds a portion of every layer's weights and computes part of every operation. Pipeline parallelism splits the model by layers: GPU 1 has layers 1-20, GPU 2 has layers 21-40, etc. Data flows through GPUs sequentially, like an assembly line.
Tensor Parallelism
Tensor parallelism divides individual matrix operations across GPUs. A matrix multiply with a 4096x4096 weight matrix split across 4 GPUs becomes four 1024x4096 operations: each GPU computes a partial result, and a collective operation combines the partial results into the full output. This requires a high-bandwidth interconnect between GPUs (NVLink at 600GB/s, not PCIe at 64GB/s) because every layer needs communication.
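Here is a minimal sketch of that row-wise split in PyTorch, simulating the four GPU shards on a single device; `torch.chunk` and the final sum stand in for the scatter and all-reduce collectives a real multi-GPU implementation would issue over the interconnect.

```python
import torch

torch.manual_seed(0)
hidden, world_size = 4096, 4     # hidden size and number of simulated GPUs

x = torch.randn(1, hidden)       # one token's activations
W = torch.randn(hidden, hidden)  # the full 4096x4096 weight matrix

# Row-wise split: each "GPU" holds a 1024x4096 shard of W plus the
# matching 1024-wide slice of the input.
w_shards = torch.chunk(W, world_size, dim=0)
x_shards = torch.chunk(x, world_size, dim=1)

# Each GPU computes its partial 4096-dim result independently...
partials = [xi @ wi for xi, wi in zip(x_shards, w_shards)]

# ...then an all-reduce sums the partial results into the final output.
y = torch.stack(partials).sum(dim=0)

assert torch.allclose(y, x @ W, atol=1e-3)
```

Real implementations such as Megatron-LM typically pair a column-wise split with a row-wise split across consecutive matrix multiplies so a transformer block needs only a couple of collectives, but that per-block communication is still why the fast interconnect matters.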
The benefit is low latency: all GPUs work simultaneously on each token. The cost is communication overhead, which grows with the number of GPUs. Tensor parallelism typically scales efficiently up to 8 GPUs; beyond that, communication dominates.
Pipeline Parallelism
Pipeline parallelism assigns different layers to different GPUs. Token processing flows through the pipeline: GPU 1 computes layers 1-20, sends activations to GPU 2, which computes layers 21-40, and so on. The advantage is lower communication overhead: only the activations at stage boundaries cross between GPUs, rather than a collective for every layer. The disadvantage is latency and utilization: the stages run one after another, so a single request still waits for every stage in turn, per-token latency does not drop the way it does with tensor parallelism, and at any moment only one GPU is busy unless several requests are in flight to keep the pipeline full.
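A minimal sketch of the layer-wise split, again in PyTorch; the toy two-stage model and the device placement comments are illustrative, and on real hardware each stage would live on its own GPU with a `.to(...)` transfer of the activations between them.

```python
import torch
from torch import nn

# A toy eight-layer model split into two pipeline stages.
# In a real deployment each stage would sit on its own GPU, e.g. "cuda:0" and "cuda:1".
def make_stage(num_layers: int, width: int = 256) -> nn.Sequential:
    return nn.Sequential(*[nn.Linear(width, width) for _ in range(num_layers)])

stage0 = make_stage(4)   # layers 1-4, would live on GPU 1
stage1 = make_stage(4)   # layers 5-8, would live on GPU 2

def pipeline_forward(x: torch.Tensor) -> torch.Tensor:
    activations = stage0(x)     # GPU 1 runs its layers...
    # ...and ships only these activations onward; with real devices this
    # would be `activations = activations.to("cuda:1")`.
    return stage1(activations)  # GPU 2 finishes the forward pass

out = pipeline_forward(torch.randn(8, 256))
print(out.shape)  # torch.Size([8, 256])
```

Serving and training systems keep the pipeline full by feeding it several requests or micro-batches at once, so stage 1 starts on the next input while stage 2 is still finishing the previous one.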