What is Distributed Training and Why Do We Need It?
The Memory Problem
The memory problem gets worse when you account for the full training state. With the Adam optimizer in mixed precision, you need roughly 16 bytes per parameter: 2 bytes for FP16 weights, 2 bytes for FP16 gradients, 4 bytes for FP32 master weights, and 8 bytes for the two FP32 Adam moment states (momentum and variance). A 10 billion parameter model therefore requires approximately 160 GB just for these states, far beyond the memory of any single accelerator.
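The arithmetic above can be sketched as a small calculator. The per-parameter byte counts come straight from the text; the function name and the 1 GB = 10^9 bytes convention are illustrative choices, not a standard API.

```python
# Mixed-precision Adam training-state estimate, per the breakdown above:
# 2 B FP16 weights + 2 B FP16 gradients + 4 B FP32 master weights
# + 8 B for the two FP32 Adam moments (momentum and variance) = 16 B/param.

BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_adam_moments": 8,  # two FP32 states: momentum + variance
}

def training_state_gb(num_params: int) -> float:
    """Total training state in GB (using 1 GB = 1e9 bytes)."""
    total_bytes = num_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / 1e9

print(training_state_gb(10_000_000_000))  # 160.0 GB for a 10B-parameter model
```

Plugging in 10 billion parameters reproduces the 160 GB figure, which is why the full state cannot fit on one device.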
Three Complementary Strategies
Distributed training addresses this through three complementary strategies. Data Parallelism (DP) replicates the model across multiple devices, each processing different training examples, then synchronizes gradients. Model Parallelism (also called Tensor Parallelism, or TP) splits individual layers across devices, with each device computing part of the matrix operations. Pipeline Parallelism (PP) partitions the model vertically by layers, flowing micro-batches through stages like an assembly line.
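The data-parallel step can be illustrated with a toy pure-Python sketch: each "device" computes a gradient on its own shard of the batch, and the gradients are then averaged, standing in for the all-reduce that a real framework would perform. The one-weight model and all the numbers here are illustrative, not from the text.

```python
# Toy data parallelism: model is a single weight w with loss
# L = mean((w*x - y)**2), so dL/dw = mean(2*(w*x - y)*x) per shard.

def local_gradient(w, xs, ys):
    """Gradient of the mean-squared-error loss on one device's shard."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def all_reduce_mean(grads):
    """Average gradients across replicas, as an all-reduce would."""
    return sum(grads) / len(grads)

# Two replicas, each holding a different shard of the same batch (y = 2x).
w = 0.5
shard0 = ([1.0, 2.0], [2.0, 4.0])
shard1 = ([3.0, 4.0], [6.0, 8.0])

g = all_reduce_mean([local_gradient(w, *shard0), local_gradient(w, *shard1)])
w -= 0.01 * g  # every replica applies the same averaged update in lockstep
```

Because every replica applies the identical averaged gradient, the model copies stay bit-for-bit synchronized, which is the defining invariant of data parallelism.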
3D Parallelism in Production
Production systems combine all three strategies, a technique called 3D parallelism, to train models with hundreds of billions of parameters. The key insight is matching each parallelism strategy to the appropriate hardware topology: tensor parallelism within high-bandwidth NVLink domains, pipeline parallelism across nodes, and data parallelism for throughput scaling.
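One consequence of the 3D layout is a simple constraint: the cluster size must factor as the product of the three degrees, world_size = TP × PP × DP. The sketch below checks that constraint and derives the data-parallel degree; the function name and the 1024-GPU example sizes are hypothetical, not from the text.

```python
# Sketch of how a 3D-parallel layout factors a cluster.
# The three degrees must multiply to the total device count.

def mesh_3d(world_size: int, tp: int, pp: int) -> dict:
    """Derive the data-parallel degree from the tensor/pipeline degrees."""
    assert world_size % (tp * pp) == 0, "tp * pp must divide world_size"
    dp = world_size // (tp * pp)
    return {"tensor": tp, "pipeline": pp, "data": dp}

# Hypothetical 1024-GPU cluster: TP=8 inside an NVLink domain,
# PP=16 across nodes, leaving DP=8 replicas for throughput scaling.
layout = mesh_3d(world_size=1024, tp=8, pp=16)
```

Keeping the tensor-parallel degree no larger than one NVLink domain is what makes this factorization match the hardware topology described above.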