
Data Parallelism for Training: Gradient Sync and Scaling

Scaling Training with Data Parallelism

Training a large model on billions of examples takes weeks on a single GPU. Data parallelism speeds this up by running multiple copies of the model on different GPUs, each processing different batches of data. With 8 GPUs, you process 8 batches simultaneously, reducing training time by roughly 8x.

The workflow: each GPU has a complete copy of the model. A large batch is split into mini-batches, one per GPU. Each GPU computes forward pass, loss, and gradients on its mini-batch. Gradients are synchronized across GPUs (typically averaged), and all GPUs update their model weights identically. Because all GPUs start with identical weights and apply identical updates, they stay synchronized.
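A minimal sketch of this workflow using PyTorch DistributedDataParallel (DDP). The model, dataset, batch size, and learning rate here are placeholders, and the launch command assumes one process per GPU via torchrun (e.g. `torchrun --nproc_per_node=8 train.py`):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model, dataset, epochs=1):
    dist.init_process_group(backend="nccl")            # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])          # set by torchrun
    torch.cuda.set_device(local_rank)

    model = DDP(model.cuda(local_rank), device_ids=[local_rank])  # full model copy per GPU
    sampler = DistributedSampler(dataset)                # each rank gets a different shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)                         # reshuffle shards each epoch
        for inputs, targets in loader:
            inputs, targets = inputs.cuda(local_rank), targets.cuda(local_rank)
            loss = torch.nn.functional.cross_entropy(model(inputs), targets)
            loss.backward()                              # DDP all-reduces gradients here
            optimizer.step()                             # identical update on every rank
            optimizer.zero_grad()

    dist.destroy_process_group()
```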

Gradient Synchronization

The synchronization step is the bottleneck. After each mini-batch, gradients must be communicated between all GPUs. For a 7B parameter model, that is 14GB of gradient data (float16) transferred every step. With 8 GPUs on 25Gbps ethernet, synchronization takes ~4.5 seconds per step - likely longer than the computation itself.
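The estimate above is just gradient bytes divided by link bandwidth, ignoring the overlap and pipelining that real all-reduce implementations provide. A back-of-the-envelope version of the arithmetic:

```python
# Naive sync-time estimate for a 7B-parameter model on 25 Gbps ethernet.
params = 7e9                                  # 7B parameters
grad_bytes = params * 2                       # float16 gradients ≈ 14 GB
bandwidth_bytes_per_s = 25e9 / 8              # 25 Gbps ≈ 3.1 GB/s
sync_seconds = grad_bytes / bandwidth_bytes_per_s
print(f"~{sync_seconds:.1f} s per step")      # ≈ 4.5 s
```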

All-reduce algorithms optimize this communication. Instead of each GPU sending its full gradients to a central coordinator, GPUs exchange partial sums in a tree or ring pattern. This reduces total data transfer and spreads load across the network. Libraries like NCCL implement efficient all-reduce for NVIDIA GPUs.
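A rough sketch of what the synchronization step amounts to, expressed with PyTorch's `torch.distributed.all_reduce` (backed by NCCL on NVIDIA GPUs). DDP performs this automatically and more efficiently during the backward pass; the manual version below is only to make the sum-then-average step explicit, and assumes the process group from the earlier snippet is already initialized:

```python
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all ranks (what DDP does under the hood)."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Ring/tree all-reduce sums this gradient tensor across all ranks in place...
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # ...then divide so every rank holds the same averaged gradient.
            param.grad /= world_size
```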

⚠️ Key Trade-off: Data parallelism scales training throughput nearly linearly with GPUs, but only if network bandwidth supports gradient synchronization. Slow interconnects make adding GPUs counterproductive.

Practical Considerations

Effective batch size scales with GPU count. 8 GPUs with batch size 32 each means effective batch 256. Very large batches can hurt model quality - learning rate adjustments and warmup schedules become critical. Monitor training loss curves carefully when scaling; sudden divergence often indicates batch size issues.
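One common heuristic for adapting the learning rate is the linear scaling rule plus warmup: scale the base learning rate by the growth in effective batch size, and ramp up to that peak over the first steps. The sketch below is illustrative, with hypothetical constants rather than recommended values:

```python
def scaled_lr(step: int,
              base_lr: float = 3e-4,      # tuned for a single-GPU batch of 32
              per_gpu_batch: int = 32,
              world_size: int = 8,
              warmup_steps: int = 1000) -> float:
    # Linear scaling rule: lr grows with effective batch (8 * 32 = 256 -> 8x base_lr).
    effective_batch = world_size * per_gpu_batch
    peak_lr = base_lr * effective_batch / per_gpu_batch
    if step < warmup_steps:
        # Ramp from near zero to peak_lr to avoid early divergence at large batch.
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr
```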

💡 Key Takeaways
Data parallelism runs model copies on different GPUs processing different data batches - 8 GPUs gives roughly 8x speedup
Gradient sync is the bottleneck: 14GB gradient transfer for 7B model on 25Gbps ethernet takes ~4.5 seconds per step
All-reduce algorithms (ring, tree) exchange partial sums between GPUs rather than centralizing, reducing network load
Effective batch size scales with GPU count - very large batches can hurt quality without learning rate adjustments
📌 Interview Tips
1. Describe the data parallel workflow: split batch across GPUs, compute gradients locally, synchronize gradients, update weights identically.
2. Calculate the sync bottleneck: model size in bytes divided by network bandwidth equals sync time. This dominates at scale.
3. Warn about large batch effects: 8 GPUs at batch 32 = effective batch 256. Learning rate schedules must adapt.