Training Infrastructure & Pipelines: Distributed Training (Data/Model/Pipeline Parallelism)

What is Distributed Training and Why Do We Need It?

Distributed training splits deep learning workloads across multiple devices (GPUs or TPUs) because modern neural networks have grown beyond what a single device can handle. A single NVIDIA A100 GPU has 80 GB of memory, but a 175 billion parameter model like GPT-3 requires roughly 350 GB just for the model weights in half precision (2 bytes per parameter), not counting optimizer states or activations.

The memory problem gets worse when you account for the full training state. With the Adam optimizer in mixed precision, you need roughly 16 bytes per parameter: 2 bytes for FP16 weights, 2 bytes for gradients, 4 bytes for the FP32 master weights, and 8 bytes for the two FP32 momentum states. A 10 billion parameter model therefore requires approximately 160 GB just for these states, far exceeding single-device capacity.

Distributed training solves this through three complementary strategies. Data Parallelism (DP) replicates the model on multiple devices, each processing different training examples, then synchronizes gradients. Model Parallelism (also called Tensor Parallelism, TP) splits individual layers across devices, with each device computing part of the matrix operations. Pipeline Parallelism (PP) partitions the model vertically by layers, flowing micro-batches through stages like an assembly line. Production systems combine all three, a technique called 3D parallelism, to train models with hundreds of billions of parameters.
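As a sanity check on these numbers, here is a minimal back-of-the-envelope calculation in Python, assuming the 16-bytes-per-parameter accounting described above; the helper name `training_state_gb` is purely illustrative.

```python
# Back-of-the-envelope training-state memory, assuming the standard
# mixed-precision Adam recipe described above (illustrative, not measured).
BYTES_FP16_WEIGHTS = 2   # half-precision model weights
BYTES_FP16_GRADS = 2     # half-precision gradients
BYTES_FP32_MASTER = 4    # FP32 master copy of the weights
BYTES_ADAM_MOMENTS = 8   # two FP32 moment estimates (4 bytes each)

BYTES_PER_PARAM = (BYTES_FP16_WEIGHTS + BYTES_FP16_GRADS
                   + BYTES_FP32_MASTER + BYTES_ADAM_MOMENTS)  # = 16

def training_state_gb(num_params: float) -> float:
    """Total training-state memory in GB (10^9 bytes) for a dense model."""
    return num_params * BYTES_PER_PARAM / 1e9

print(training_state_gb(10e9))    # 10B params  -> ~160 GB of training state
print(training_state_gb(175e9))   # 175B params -> ~2,800 GB of training state
print(175e9 * 2 / 1e9)            # FP16 weights alone for 175B -> ~350 GB
```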
💡 Key Takeaways
Memory bottleneck: a 10B parameter model needs approximately 160 GB for weights, gradients, and optimizer states (16 bytes per parameter), exceeding single-GPU capacity
Data Parallelism replicates the entire model across devices; each device processes different data, and gradients are synchronized via all-reduce collective communication (first sketch after this list)
Model (Tensor) Parallelism splits individual layers across devices; each device computes a slice of the matrix multiplications, requiring two collective operations per transformer layer (second sketch after this list)
Pipeline Parallelism divides the model by layers into stages; micro-batches flow through concurrently to shrink the idle bubble, with bubble fraction approximately (p - 1) / (m + p - 1) for p stages and m micro-batches (third sketch after this list)
Production systems use 3D parallelism combining all three strategies: Meta trained OPT-175B on 992 A100 GPUs using tensor parallelism within nodes, pipeline parallelism across nodes, and data-parallel replicas
Topology awareness is critical: keep high-frequency tensor-parallel communication within fast NVLink domains (400 to 600 GB/s) and use slower InfiniBand (200 Gbps) for the less frequent pipeline-stage transfers
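To make the data-parallel case concrete, here is a minimal sketch using PyTorch's DistributedDataParallel, which keeps a full model replica on each GPU and all-reduces gradients during the backward pass. The toy linear model, batch shape, and `torchrun` launch command are assumptions for illustration only.

```python
# Minimal data-parallel sketch with PyTorch DistributedDataParallel (DDP).
# Launch with: torchrun --nproc_per_node=<num_gpus> train_dp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()     # placeholder model
    model = DDP(model, device_ids=[local_rank])    # replicate + hook gradient all-reduce
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device="cuda")   # each rank sees different data
        loss = model(x).pow(2).mean()
        loss.backward()                            # DDP all-reduces gradients here
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```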
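For tensor (model) parallelism, the sketch below simulates a column-parallel split of one linear layer inside a single process: tensor slices stand in for per-GPU shards, and the final `torch.cat` plays the role of the all-gather (for a row-parallel split it would be an all-reduce) that a real implementation such as Megatron-LM issues over NCCL.

```python
# Column-parallel split of one linear layer, simulated on a single process.
import torch

torch.manual_seed(0)
tp_size = 4                       # number of tensor-parallel "ranks" (assumed)
x = torch.randn(8, 1024)          # activations, replicated on every rank
w = torch.randn(1024, 4096)       # full weight matrix of one linear layer

# Each rank holds one column shard of W and computes a slice of the output.
shards = torch.chunk(w, tp_size, dim=1)
partial_outputs = [x @ shard for shard in shards]

# An all-gather along the hidden dimension reconstructs the full activation.
y_tp = torch.cat(partial_outputs, dim=1)
assert torch.allclose(y_tp, x @ w, atol=1e-5)
```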
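And for pipeline parallelism, the bubble fraction from the takeaway above, (p - 1) / (m + p - 1), can be tabulated directly; the p = 8 stage count and micro-batch counts below are arbitrary example values.

```python
# Pipeline bubble fraction: p pipeline stages, m micro-batches per batch.
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# More micro-batches amortize the pipeline fill/drain bubble:
for m in (1, 4, 16, 64):
    print(f"p=8, m={m:>3}: bubble = {bubble_fraction(8, m):.2%}")
# p=8, m=  1: bubble = 87.50%
# p=8, m=  4: bubble = 63.64%
# p=8, m= 16: bubble = 30.43%
# p=8, m= 64: bubble = 9.86%
```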
📌 Examples
NVIDIA Megatron-LM trained an 8.3B parameter model on 512 V100 GPUs, achieving 15.1 PFLOP/s at 76% scaling efficiency using tensor plus data parallelism
Microsoft and NVIDIA trained MT-NLG 530B on 4,480 A100 80GB GPUs using 3D parallelism, with tensor-parallel groups confined to NVLink-connected devices
Google's PaLM 540B trained on 6,144 TPU v4 chips using 2D and 3D device-mesh sharding, mapping the highest-bandwidth links to the most frequent collectives