Pipeline Parallelism: Scaling Model Depth Across Devices
Pipeline Parallelism (PP) partitions a model vertically, by layers, into stages, each assigned to a device or group of devices. Training data flows through these stages like an assembly line. To avoid idle time (called pipeline bubbles), the mini-batch is split into multiple micro-batches that are processed concurrently: while stage 1 works on micro-batch 3, stage 2 processes micro-batch 2, and stage 3 handles micro-batch 1.
The critical metric is bubble time, the fraction of time devices sit idle waiting for work. The bubble fraction is approximately (p − 1) / (m + p − 1), where p is the number of pipeline stages and m is the number of micro-batches. With 8 stages and only 8 micro-batches, the bubble is 7/15, roughly 47 percent waste. Increasing to 64 micro-batches drops the bubble to 7/71, about 10 percent. Google GPipe demonstrated an 11x speedup on 16 accelerators by aggressively micro-batching, though at the cost of higher memory for storing micro-batch activations.
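To make the arithmetic concrete, here is a minimal, framework-agnostic sketch that evaluates the bubble formula for the two configurations above (the function name is ours, not from any library):

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Approximate idle fraction of a naive pipeline schedule:
    (p - 1) / (m + p - 1), with p stages and m micro-batches."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

print(f"{bubble_fraction(8, 8):.0%}")   # ~47% idle
print(f"{bubble_fraction(8, 64):.0%}")  # ~10% idle
```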
Stage balance is equally critical. If one stage takes twice as long as the others, it becomes the bottleneck and the other stages idle waiting for it. For transformers, attention layers and feedforward layers have different compute profiles, and embedding layers can be memory-bound. Production systems carefully profile each layer and adjust stage boundaries, sometimes splitting heavy layers across multiple devices in a hybrid pipeline-plus-tensor-parallel configuration.
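As an illustration of how stage boundaries might be chosen from profiled costs, the sketch below greedily groups consecutive layers so each stage's total cost stays near the per-stage average. The helper name and the cost numbers are hypothetical; production planners typically minimize the maximum stage cost with dynamic programming or search:

```python
def partition_layers(layer_costs: list[float], num_stages: int) -> list[list[int]]:
    """Greedy split of consecutive layers into num_stages stages, aiming to
    keep each stage's profiled cost near the per-stage average."""
    n = len(layer_costs)
    target = sum(layer_costs) / num_stages
    stages, current, current_cost = [], [], 0.0
    for i, cost in enumerate(layer_costs):
        current.append(i)
        current_cost += cost
        layers_left = n - i - 1
        stages_left = num_stages - len(stages) - 1  # stages still to fill after this one
        # Close the stage once it reaches the target cost, or when the remaining
        # layers are only just enough to give each later stage one layer.
        if stages_left > 0 and (current_cost >= target or layers_left == stages_left):
            stages.append(current)
            current, current_cost = [], 0.0
    stages.append(current)
    return stages

# Hypothetical profiled per-layer costs (ms) for a 12-layer model, 4 stages:
costs = [1.0, 1.2, 1.1, 1.0, 1.3, 1.2, 1.1, 1.0, 1.4, 1.2, 1.1, 2.0]
print(partition_layers(costs, 4))  # e.g. [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10], [11]]
```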
Pipeline parallelism excels when model depth is large, stage boundaries are easy to balance, and cross-node bandwidth is limited. Meta used pipeline parallelism across nodes for OPT-175B, exploiting the fact that pipeline stages communicate less frequently than tensor-parallel layers (only activation and gradient tensors at stage boundaries, not every layer). The trade-off is the complexity of managing micro-batch scheduling, gradient accumulation across micro-batches, and activation memory for all in-flight micro-batches.
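The sketch below shows, under a GPipe-style schedule (all forwards, then all backwards), what one stage's training step could look like: activations and gradients cross stage boundaries only once per micro-batch, gradients accumulate across micro-batches, and a single optimizer step closes the mini-batch. The send_*/recv_* callables are stand-ins for point-to-point transfers such as torch.distributed.send/recv; this is a schematic sketch, not any framework's actual pipeline API:

```python
import torch

def train_step_on_stage(stage_module, optimizer, micro_batches, targets, loss_fn,
                        is_first, is_last,
                        send_to_next, recv_from_prev,
                        send_grad_to_prev, recv_grad_from_next):
    """Schematic GPipe-style step for one pipeline stage: run all forwards,
    then all backwards, accumulating gradients over every micro-batch before
    a single optimizer step. targets and loss_fn are only used on the last stage."""
    optimizer.zero_grad()
    saved_inputs, saved_outputs = [], []

    # Forward phase: one pass per micro-batch, output shipped downstream.
    for mb in micro_batches:
        x = mb if is_first else recv_from_prev()
        if not is_first:
            x.requires_grad_(True)  # so dL/dx can be returned to the previous stage
        y = stage_module(x)
        saved_inputs.append(x)
        saved_outputs.append(y)
        if not is_last:
            send_to_next(y.detach())

    # Backward phase: gradients accumulate in .grad across micro-batches.
    num_mb = len(micro_batches)
    for i in reversed(range(num_mb)):
        if is_last:
            # Dividing by num_mb makes the accumulated gradient equal to the
            # average over the full mini-batch.
            loss = loss_fn(saved_outputs[i], targets[i]) / num_mb
            loss.backward()
        else:
            saved_outputs[i].backward(recv_grad_from_next())
        if not is_first:
            send_grad_to_prev(saved_inputs[i].grad)

    optimizer.step()  # exactly one weight update per mini-batch
```

Scaling each micro-batch loss by the number of micro-batches is what keeps the accumulated gradient equal to the full mini-batch average, the correct-averaging point noted in the takeaways below.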
💡 Key Takeaways
•Bubble fraction formula: approximately (p − 1) / (m + p − 1); with 8 stages and 64 micro-batches, the bubble is 7/71, about 10 percent idle time
•Micro-batch trade-off: More micro-batches reduce the bubble but increase memory for storing activations of all in-flight micro-batches (see the sketch after this list); Google GPipe used aggressive micro-batching for an 11x speedup on 16 devices
•Stage balance is critical: A single slow stage gates throughput; transformers require careful profiling because attention and feedforward layers have different compute characteristics
•Communication pattern: Pipeline stages exchange activations and gradients only at stage boundaries, not per layer, making PP suitable for slower cross-node links (e.g., 200 Gbps InfiniBand)
•Meta OPT-175B used pipeline parallelism across nodes (slower links) combined with tensor parallelism within nodes (fast NVLink) and data parallelism for replicas
•Gradient accumulation: Gradients from all micro-batches are accumulated before the weight update; correct averaging and synchronization across pipeline stages must be ensured
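To see why the 1F1B schedule mentioned in the examples below saves memory relative to an all-forwards-first schedule, the rough sketch below counts peak in-flight micro-batch activations per stage; the p − s peak for 1F1B is the standard back-of-the-envelope estimate and is used here purely for illustration:

```python
def peak_inflight_microbatches(num_stages: int, num_microbatches: int,
                               schedule: str = "1f1b") -> list[int]:
    """Peak number of micro-batch activations each stage holds at once.
    GPipe (all forwards first): every stage stores all m micro-batches.
    1F1B: stage s runs only (p - s - 1) warmup forwards before its first
    backward, so at most p - s activations are live -- independent of m."""
    p, m = num_stages, num_microbatches
    if schedule == "gpipe":
        return [m] * p
    return [min(m, p - s) for s in range(p)]

print(peak_inflight_microbatches(8, 64, "gpipe"))  # [64, 64, 64, 64, 64, 64, 64, 64]
print(peak_inflight_microbatches(8, 64, "1f1b"))   # [8, 7, 6, 5, 4, 3, 2, 1]
```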
📌 Examples
Google GPipe: Achieved a 3.5x speedup on 4 accelerators and 11x on 16 for deep networks by splitting into pipeline stages and using micro-batch counts of 8 to 16
Microsoft DeepSpeed Pipeline: Implements a 1F1B (one-forward-one-backward) schedule to reduce activation memory by interleaving forward and backward passes instead of running all forwards first
Meta OPT-175B: 992 A100 GPUs with 8-way tensor parallelism within nodes, 16-way pipeline parallelism across nodes, and 8-way data parallelism for replicas (8 × 16 × 8 = 1024 in the logical topology)