What is Distributed Training and Why Do We Need It?
The Memory Problem
The memory problem gets worse when you account for the full training state. With the Adam optimizer in mixed precision, you need roughly 16 bytes per parameter: 2 bytes for FP16 weights, 2 bytes for FP16 gradients, 4 bytes for FP32 master weights, and 8 bytes for the two FP32 Adam moment states (momentum and variance). A 10 billion parameter model therefore requires approximately 160 GB just for these states, far beyond the memory of any single accelerator.
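The arithmetic above can be sketched as a small calculator. The per-parameter byte counts come straight from the text; the function name and the 1 GB = 10^9 bytes convention are illustrative choices, not a standard API.

```python
# Mixed-precision Adam training-state estimate, per the breakdown above:
# 2 B FP16 weights + 2 B FP16 gradients + 4 B FP32 master weights
# + 8 B for the two FP32 Adam moments (momentum and variance) = 16 B/param.

BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_adam_moments": 8,  # two FP32 states: momentum + variance
}

def training_state_gb(num_params: int) -> float:
    """Total training state in GB (using 1 GB = 1e9 bytes)."""
    total_bytes = num_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / 1e9

print(training_state_gb(10_000_000_000))  # 160.0 GB for a 10B-parameter model
```

Plugging in 10 billion parameters reproduces the 160 GB figure, which is why the full state cannot fit on one device.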
Three Complementary Strategies
Distributed training addresses this through three complementary strategies. Data Parallelism (DP) replicates the model across multiple devices, each processing different training examples, then synchronizes gradients. Model Parallelism (also called Tensor Parallelism, or TP) splits individual layers across devices, with each device computing part of the matrix operations. Pipeline Parallelism (PP) partitions the model vertically by layers, flowing micro-batches through stages like an assembly line.
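The data-parallel step can be illustrated with a toy pure-Python sketch: each "device" computes a gradient on its own shard of the batch, and the gradients are then averaged, standing in for the all-reduce that a real framework would perform. The one-weight model and all the numbers here are illustrative, not from the text.

```python
# Toy data parallelism: model is a single weight w with loss
# L = mean((w*x - y)**2), so dL/dw = mean(2*(w*x - y)*x) per shard.

def local_gradient(w, xs, ys):
    """Gradient of the mean-squared-error loss on one device's shard."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def all_reduce_mean(grads):
    """Average gradients across replicas, as an all-reduce would."""
    return sum(grads) / len(grads)

# Two replicas, each holding a different shard of the same batch (y = 2x).
w = 0.5
shard0 = ([1.0, 2.0], [2.0, 4.0])
shard1 = ([3.0, 4.0], [6.0, 8.0])

g = all_reduce_mean([local_gradient(w, *shard0), local_gradient(w, *shard1)])
w -= 0.01 * g  # every replica applies the same averaged update in lockstep
```

Because every replica applies the identical averaged gradient, the model copies stay bit-for-bit synchronized, which is the defining invariant of data parallelism.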
3D Parallelism in Production
Production systems combine all three strategies, a technique called 3D parallelism, to train models with hundreds of billions of parameters. The key insight is matching each parallelism strategy to the appropriate hardware topology: tensor parallelism within high-bandwidth NVLink domains, pipeline parallelism across nodes, and data parallelism for throughput scaling.
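One consequence of the 3D layout is a simple constraint: the cluster size must factor as the product of the three degrees, world_size = TP × PP × DP. The sketch below checks that constraint and derives the data-parallel degree; the function name and the 1024-GPU example sizes are hypothetical, not from the text.

```python
# Sketch of how a 3D-parallel layout factors a cluster.
# The three degrees must multiply to the total device count.

def mesh_3d(world_size: int, tp: int, pp: int) -> dict:
    """Derive the data-parallel degree from the tensor/pipeline degrees."""
    assert world_size % (tp * pp) == 0, "tp * pp must divide world_size"
    dp = world_size // (tp * pp)
    return {"tensor": tp, "pipeline": pp, "data": dp}

# Hypothetical 1024-GPU cluster: TP=8 inside an NVLink domain,
# PP=16 across nodes, leaving DP=8 replicas for throughput scaling.
layout = mesh_3d(world_size=1024, tp=8, pp=16)
```

Keeping the tensor-parallel degree no larger than one NVLink domain is what makes this factorization match the hardware topology described above.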