Pipeline Parallelism: Scaling Model Depth Across Devices
How Pipeline Parallelism Works
Pipeline Parallelism (PP) partitions a model vertically by layers into stages, each assigned to a device or group of devices. Training data flows through these stages like an assembly line. To avoid idle time (called pipeline bubbles), each mini-batch is split into multiple micro-batches that are processed concurrently: while stage 1 works on micro-batch 3, stage 2 processes micro-batch 2, and stage 3 handles micro-batch 1.
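The assembly-line staggering can be made concrete with a small simulation. This is a minimal sketch of the forward-pass portion of a GPipe-style schedule (function name and representation are illustrative, not any framework's API): at each time step, stage `s` works on the micro-batch that entered the pipeline `s` steps ago, and `None` marks a bubble slot.

```python
def gpipe_schedule(num_stages, num_microbatches):
    """For each time step, list which micro-batch each stage processes
    during the forward pass (None = idle bubble slot)."""
    total_steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(total_steps):
        row = []
        for stage in range(num_stages):
            mb = t - stage  # micro-batch reaching this stage at time t
            row.append(mb if 0 <= mb < num_microbatches else None)
        schedule.append(row)
    return schedule

# With 3 stages and 3 micro-batches, at t=2 stage 0 works on micro-batch 2,
# stage 1 on micro-batch 1, and stage 2 on micro-batch 0 -- the fully
# overlapped steady state described above.
print(gpipe_schedule(3, 3)[2])  # [2, 1, 0]
```

The `None` entries at the start (pipeline fill) and end (pipeline drain) are exactly the bubbles analyzed in the next section.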
Bubble Time Analysis
The critical metric is bubble time, the fraction of time devices sit idle waiting for work. The bubble fraction is approximately (p - 1) / (m + p - 1), where p is the number of pipeline stages and m is the number of micro-batches. With 8 stages and only 8 micro-batches, the bubble is 7/15, roughly 47 percent waste. Increasing to 64 micro-batches drops the bubble to 7/71, about 10 percent. Google's GPipe demonstrated an 11x speedup on 16 accelerators by aggressively micro-batching, though at the cost of higher memory for storing micro-batch activations.
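The bubble formula is easy to check numerically. A minimal sketch (the function name is ours):

```python
def bubble_fraction(p, m):
    """Idle fraction of an ideal pipeline with p stages and m micro-batches:
    (p - 1) / (m + p - 1)."""
    return (p - 1) / (m + p - 1)

print(round(bubble_fraction(8, 8), 3))   # 0.467 -> ~47% waste
print(round(bubble_fraction(8, 64), 3))  # 0.099 -> ~10% waste
```

Note the diminishing returns: the bubble only vanishes as m grows much larger than p, which is why more micro-batches always help throughput but cost activation memory.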
Stage Balance Criticality
Stage balance is equally critical. If one stage takes twice as long as the others, it becomes the bottleneck and the other stages idle waiting for it. For transformers, attention layers and feed-forward layers have different compute profiles, and embedding layers can be memory-bound. Production systems carefully profile each layer and adjust stage boundaries, sometimes splitting heavy layers across multiple devices in a hybrid pipeline-plus-tensor-parallel configuration.
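The profile-then-partition step can be sketched as a simple greedy split: given per-layer costs from profiling, cut the layer list into contiguous stages whenever a stage reaches its ideal share of the total. This is an illustrative heuristic only (production systems typically use profiling plus dynamic programming or solver-based partitioning); all names here are our own.

```python
def partition_layers(costs, num_stages):
    """Greedily split per-layer costs into contiguous stages, cutting when a
    stage reaches the ideal per-stage share of total cost. Returns a list of
    stages, each a list of layer indices."""
    target = sum(costs) / num_stages
    stages, current, running = [], [], 0.0
    for i, cost in enumerate(costs):
        current.append(i)
        running += cost
        stages_left = num_stages - len(stages)
        layers_left = len(costs) - i - 1
        # Cut when this stage has reached its share, or when we must cut now
        # to leave at least one layer for every remaining stage.
        if (stages_left > 1 and layers_left >= stages_left - 1
                and (running >= target or layers_left == stages_left - 1)):
            stages.append(current)
            current, running = [], 0.0
    stages.append(current)
    return stages

# Six equal-cost layers split evenly across three stages:
print(partition_layers([2, 2, 2, 2, 2, 2], 3))  # [[0, 1], [2, 3], [4, 5]]
```

With uneven costs, e.g. a heavy final layer, the greedy cut places the heavy layer alone in its own stage, which is precisely the rebalancing the paragraph above describes.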
When Pipeline Parallelism Excels
Pipeline parallelism excels when model depth is large, stage boundaries are easy to balance, and cross-node bandwidth is limited. Meta used pipeline parallelism across nodes for OPT-175B, exploiting the fact that pipeline stages communicate less frequently than tensor-parallel layers (only activation and gradient tensors at stage boundaries, not every layer). The trade-off is the complexity of managing micro-batch scheduling, gradient accumulation across micro-batches, and activation memory for all in-flight micro-batches.
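Of these bookkeeping tasks, gradient accumulation is the simplest to illustrate: each stage sums the gradients produced by every micro-batch before taking a single optimizer step, so the update matches what processing the full mini-batch at once would produce. A framework-free sketch (function names and the toy `grad_fn` are ours):

```python
def accumulate_gradients(grad_fn, microbatches):
    """Sum per-micro-batch gradients, as a pipeline stage does before its
    weight update. grad_fn maps one micro-batch to a list of gradients."""
    total = None
    for mb in microbatches:
        grads = grad_fn(mb)
        total = grads if total is None else [t + g for t, g in zip(total, grads)]
    return total

# Toy example: a "gradient" that is just the sum of the micro-batch values.
# Two micro-batches [1, 2] and [3] accumulate to the same result as the
# full mini-batch [1, 2, 3].
print(accumulate_gradients(lambda mb: [sum(mb)], [[1.0, 2.0], [3.0]]))  # [6.0]
```

This additivity of gradients over examples is the property that makes micro-batching mathematically transparent: splitting the mini-batch changes the schedule, not the update.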