Data Parallelism for Training: Gradient Sync and Scaling
Scaling Training with Data Parallelism
Training a large model on billions of examples can take weeks on a single GPU. Data parallelism speeds this up by running multiple copies of the model on different GPUs, each processing different batches of data. With 8 GPUs you process 8 batches simultaneously, cutting training time by up to roughly 8x (before communication overhead).
The workflow: each GPU holds a complete copy of the model. A large batch is split into mini-batches, one per GPU. Each GPU runs the forward pass, computes the loss, and computes gradients on its mini-batch. Gradients are then synchronized across GPUs (typically averaged), and all GPUs apply the same weight update. Because every GPU starts with identical weights and applies identical updates, the copies stay synchronized.
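A minimal sketch of this loop, using PyTorch's DistributedDataParallel as one concrete implementation; the model, dataset, and hyperparameters are placeholders, and the script is assumed to be launched with torchrun (one process per GPU):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")                  # one process per GPU
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(1024, 10).cuda(), device_ids=[local_rank])
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Toy dataset; DistributedSampler gives each GPU a different shard of the data.
dataset = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)                             # reshuffle shards each epoch
    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()                                  # DDP all-reduces (averages) gradients here
        optimizer.step()                                 # identical update on every rank
```

Note that no explicit synchronization call appears in the loop: DDP hooks into backward() and averages gradients across ranks automatically.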
Gradient Synchronization
The synchronization step is the bottleneck. After each mini-batch, gradients must be exchanged among all GPUs. For a 7B-parameter model with float16 gradients, that is about 14 GB of gradient data per step. With 8 GPUs on 25 Gbps Ethernet (roughly 3 GB/s per link), pushing 14 GB over a link takes ~4.5 seconds per step - likely longer than the computation itself.
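The arithmetic behind those numbers, as a quick sanity check (assumed values, not measurements):

```python
params = 7e9                             # 7B parameters
grad_bytes = params * 2                  # float16 gradients: ~14 GB per step
link_bytes_per_s = 25e9 / 8              # 25 Gbps Ethernet ~= 3.1 GB/s

seconds_per_sync = grad_bytes / link_bytes_per_s
print(f"{grad_bytes / 1e9:.0f} GB of gradients, ~{seconds_per_sync:.1f} s to move over one link")
# -> 14 GB of gradients, ~4.5 s to move over one link
```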
All-reduce algorithms optimize this communication. Instead of every GPU sending its full gradients to a central coordinator, GPUs exchange partial sums in a ring or tree pattern. With ring all-reduce, each GPU sends and receives only about twice the gradient size in total, regardless of how many GPUs participate, and the load is spread evenly across all links. Libraries like NCCL implement efficient all-reduce for NVIDIA GPUs.
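To make the ring pattern concrete, here is a toy single-process simulation of ring all-reduce (a reduce-scatter phase followed by an all-gather phase) over NumPy arrays. It only illustrates the data movement; it is not how NCCL is actually invoked.

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate ring all-reduce: `grads` holds one array per simulated GPU;
    every 'GPU' ends up with the elementwise sum of all of them."""
    n = len(grads)
    chunks = [list(np.array_split(g.astype(float), n)) for g in grads]

    # Reduce-scatter: after n-1 steps, rank r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, chunks[r][(r - step) % n].copy()) for r in range(n)]
        for r, c, payload in sends:
            chunks[(r + 1) % n][c] += payload

    # All-gather: circulate the reduced chunks until every rank has all of them.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n].copy()) for r in range(n)]
        for r, c, payload in sends:
            chunks[(r + 1) % n][c] = payload

    return [np.concatenate(ch) for ch in chunks]

grads = [np.random.randn(12) for _ in range(4)]
result = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in result)
```

Each simulated rank sends only one chunk per step, so its total traffic is about 2(n-1)/n times the gradient size, independent of how many GPUs join the ring.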
Practical Considerations
Effective batch size scales with GPU count: 8 GPUs with a per-GPU batch size of 32 gives an effective batch size of 256. Very large batches can hurt model quality, so learning rate adjustments and warmup schedules become critical. Monitor training loss curves carefully when scaling; sudden divergence often indicates batch size issues.
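One common heuristic when growing the batch is the linear learning-rate scaling rule combined with a warmup ramp. A sketch with illustrative (assumed) numbers:

```python
base_lr = 3e-4                                        # tuned for a single-GPU batch of 32
base_batch = 32
effective_batch = 8 * 32                              # 8 GPUs x batch 32 = 256

scaled_lr = base_lr * effective_batch / base_batch    # linear scaling rule
warmup_steps = 1000

def lr_at(step):
    """Ramp linearly from ~0 to the scaled LR over warmup, then hold it."""
    if step < warmup_steps:
        return scaled_lr * (step + 1) / warmup_steps
    return scaled_lr
```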