
What is Model Parallelism and Why Do We Need It?

Model parallelism splits a single neural network across multiple devices so that the combined memory and compute of many accelerators can handle models too large for one chip. Unlike data parallelism, where each worker holds a complete copy of the model, model parallelism divides the model itself into pieces. Consider a 70 billion parameter transformer in bfloat16: the parameters alone occupy roughly 140 GB. Add fp32 master weights, gradients, and Adam's two moment vectors, roughly 20 bytes per parameter in total, and the memory footprint can exceed 1.4 TB per replica. No single GPU can hold this. Model parallelism breaks the bottleneck by distributing layers, operations, or expert modules across devices.

Three common patterns emerge. Tensor parallelism splits individual operations within a single layer, for example dividing a matrix multiplication by columns or rows across devices. Pipeline parallelism assigns entire layers to different stages and streams micro-batches through them like an assembly line. Mixture of Experts (MoE) routes each token to a small subset of expert layers, reducing the active parameters and compute per forward pass.

The choice depends on your bottleneck. If a single attention layer exceeds device memory, use tensor parallelism. If the model is deep but individual layers fit, use pipeline parallelism. If you want to scale capacity faster than cost, consider MoE. In practice, large scale training, such as Google's TPU v4 pods or Meta's LLaMA runs, uses 3D parallelism, combining tensor, pipeline, and data parallelism simultaneously with careful topology mapping so that the highest-bandwidth communication stays on fast interconnects like NVLink or custom fabrics.
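To make the column-split idea and the memory arithmetic concrete, here is a minimal NumPy sketch. The device count, tensor shapes, and byte-per-parameter estimate are illustrative assumptions, not the configuration of any real system or framework.

```python
# Minimal NumPy sketch of tensor (column) parallelism for one linear layer.
# Shard count and dimensions are toy values chosen for illustration.
import numpy as np

num_devices = 4                      # hypothetical tensor-parallel group size
batch, d_in, d_out = 8, 1024, 4096   # toy layer dimensions

x = np.random.randn(batch, d_in).astype(np.float32)
W = np.random.randn(d_in, d_out).astype(np.float32)

# Column parallelism: each "device" holds a vertical slice of W and computes
# its slice of the output independently; results are concatenated (or fed
# into a row-parallel layer that all-reduces) afterwards.
W_shards = np.split(W, num_devices, axis=1)
partial_outputs = [x @ W_k for W_k in W_shards]   # one matmul per device
y_parallel = np.concatenate(partial_outputs, axis=1)

assert np.allclose(y_parallel, x @ W, atol=1e-4)  # same result as the unsharded matmul

# Back-of-envelope memory math from the text: 70B parameters in bfloat16.
params = 70e9
print(f"bf16 weights: {params * 2 / 1e9:.0f} GB")            # ~140 GB
# fp32 master weights + gradients + Adam m and v add roughly 18-20 bytes/param.
print(f"with optimizer state: {params * 20 / 1e12:.1f} TB")  # ~1.4 TB per replica
```

The key property is that each shard's matmul is independent, which is why tensor parallelism works well inside a node where NVLink-class bandwidth makes the eventual concatenation or all-reduce cheap.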
💡 Key Takeaways
Model parallelism distributes a single model across devices when it exceeds single device memory, unlike data parallelism which replicates the entire model
Tensor parallelism splits individual operations like matrix multiplications across devices, requiring high bandwidth interconnects like NVLink with 600 GB per second throughput
Pipeline parallelism divides layers into stages; with m micro-batches and s stages, efficiency is m / (m + s - 1), about 82% for 32 micro-batches and 8 stages (worked calculation after this list)
Mixture of Experts routes each token to a small subset of experts, cutting active FLOPs by up to 97% but adding router overhead and potential load-imbalance hotspots (a minimal routing sketch follows the Examples section)
Real systems combine techniques in 3D parallelism, for example Meta-style 70B training with tensor-parallel 8, pipeline-parallel 8, and data-parallel 4 on a 256-GPU cluster (8 × 8 × 4 = 256)
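The pipeline-efficiency figure above can be checked directly. This small Python snippet assumes the standard GPipe-style bubble model with ideal, communication-free stages:

```python
# Pipeline-parallel bubble efficiency: with s stages and m micro-batches,
# the pipeline is only full for m of the (m + s - 1) timesteps it occupies.
def pipeline_efficiency(micro_batches: int, stages: int) -> float:
    return micro_batches / (micro_batches + stages - 1)

print(f"{pipeline_efficiency(32, 8):.0%}")  # 82% -- the figure quoted above
print(f"{pipeline_efficiency(8, 8):.0%}")   # 53% -- too few micro-batches leaves stages idle
```

The formula also explains the usual tuning advice: raising the micro-batch count shrinks the bubble, but only up to the point where per-micro-batch overhead and activation memory start to dominate.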
📌 Examples
Google TPU v4 pods scale to thousands of chips using layout-aware 3D parallelism over custom interconnects, maintaining near-linear scaling
A 70 billion parameter model requires 140 GB for bfloat16 parameters and well over 1 TB once optimizer states are included, forcing tensor-parallel size 8 within an 8× A100 node connected by NVSwitch
NVIDIA Megatron uses tensor parallelism to split transformer attention and feedforward layers by columns and rows across GPUs sharing high speed NVLink fabric
OpenAI GPT models use pipeline parallelism with carefully balanced stages, because a single stage that takes 2× as long creates chronic idle time across the entire pipeline
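To illustrate the routing step behind the Mixture of Experts takeaway, here is a minimal NumPy sketch. The expert count, top-k value, and tensor shapes are assumptions chosen to reproduce the roughly 97% reduction in active expert parameters; real routers also add load-balancing losses and capacity limits.

```python
# Minimal sketch of Mixture-of-Experts top-k routing (toy, framework-free).
import numpy as np

num_experts, top_k, d_model = 64, 2, 512
tokens = np.random.randn(16, d_model).astype(np.float32)            # 16 toy tokens
router_w = np.random.randn(d_model, num_experts).astype(np.float32)  # router weights

logits = tokens @ router_w                               # (16, 64) routing scores
top_experts = np.argsort(logits, axis=1)[:, -top_k:]     # top-2 expert indices per token
print("token 0 routed to experts:", top_experts[0])

# Each token only runs through its top-k experts, so the active expert
# parameters per token are top_k / num_experts of the total expert weights.
active_fraction = top_k / num_experts
print(f"active expert params per token: {active_fraction:.1%}")  # ~3%, i.e. ~97% reduction
```

The 97% figure depends entirely on the expert count and top-k choice; skewed routing distributions can still overload individual experts, which is why production routers pair this selection step with auxiliary balancing objectives.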