Data Parallelism: Scaling Training Throughput
How Data Parallelism Works
Data Parallelism (DP) is the simplest and most widely used distributed training strategy. Each GPU holds a complete replica of the model and processes a different subset of the training batch. After computing gradients locally, all replicas synchronize via an all-reduce operation that averages gradients across devices, ensuring every replica applies identical weight updates.
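The per-step mechanics can be sketched with a toy one-parameter model, modelling the all-reduce as a plain average across replicas. This is illustrative pseudocode-made-runnable, not a real framework API:

```python
# Toy simulation of synchronous data parallelism (no GPUs needed).
# Each "replica" computes a gradient on its own shard of the batch for the
# model y = w * x with squared-error loss; the all-reduce is modelled as a
# plain average, so every replica applies the same update.

def local_gradient(w, shard):
    # d/dw mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    # every replica ends up with the same averaged gradient
    return sum(grads) / len(grads)

def dp_step(w, shards, lr=0.1):
    grads = [local_gradient(w, s) for s in shards]  # computed in parallel
    g = all_reduce_mean(grads)                      # synchronization point
    return w - lr * g                               # identical update everywhere

# Two replicas, each holding a two-sample shard of data from the line y = 3x
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(50):
    w = dp_step(w, shards)
# w converges to 3.0, exactly as a single replica on the full batch would
```

In a real framework (e.g. PyTorch DistributedDataParallel) the averaging happens inside `loss.backward()` via NCCL, but the semantics are the same: gradients in, identical updates out.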
Communication Cost Analysis
The communication cost scales with model size and device count. For a 1-billion-parameter model with FP16 gradients (2 bytes per parameter), each training step must all-reduce approximately 2 GB. With a ring all-reduce across N = 8 GPUs, each device transfers roughly 2(N - 1)/N * S bytes, where S is the gradient size, or about 3.5 GB here. On 200 Gbps links (25 GB/s of bandwidth), this takes a minimum of 140 milliseconds just for gradient exchange. Upgrading to 400 Gbps fabrics halves this to 70 milliseconds, but as models grow to 10B or 100B parameters, communication becomes the bottleneck.
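These figures follow from a short back-of-envelope calculation; the function names below are illustrative, not a library API:

```python
# Ring all-reduce traffic and minimum transfer time, per the numbers above.
# S = gradient bytes, N = devices; each device moves 2 * (N - 1) / N * S bytes.

def ring_all_reduce_bytes(param_count, bytes_per_param, n_devices):
    s = param_count * bytes_per_param
    return 2 * (n_devices - 1) / n_devices * s

def transfer_time_ms(nbytes, link_gbps):
    bandwidth_bytes_per_s = link_gbps / 8 * 1e9  # Gbps -> bytes/s
    return nbytes / bandwidth_bytes_per_s * 1e3

traffic = ring_all_reduce_bytes(1e9, 2, 8)  # 1B params, FP16, 8 GPUs
print(traffic / 1e9)                        # 3.5 (GB per device)
print(transfer_time_ms(traffic, 200))       # 140.0 (ms on 200 Gbps)
print(transfer_time_ms(traffic, 400))       # 70.0 (ms on 400 Gbps)
```

Note this is a bandwidth-only lower bound: latency per ring step and protocol overhead push the real number higher.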
FSDP and ZeRO Extensions
Data parallelism shines when models fit comfortably in single-device memory and network bandwidth can handle gradient aggregation. PyTorch Fully Sharded Data Parallel (FSDP) and DeepSpeed ZeRO extend basic DP by sharding optimizer states, gradients, and even parameters across devices, reducing per-device memory from roughly 16 bytes per parameter (FP16 weights and gradients plus FP32 Adam state) to 16/N bytes per parameter when all three are fully sharded across N devices, e.g. 2 bytes at N = 8. This enables 10B+ parameter models on 40 to 80 GB GPUs without switching to model or pipeline parallelism.
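The memory accounting behind these figures can be sketched as follows, using the common 16-bytes-per-parameter breakdown for mixed-precision Adam. The staged sharding loosely mirrors ZeRO stages 1 through 3; activation memory is ignored, and this is an estimate, not a DeepSpeed or FSDP API:

```python
# Per-device bytes per parameter under mixed-precision Adam training:
# 2 (FP16 weights) + 2 (FP16 gradients) + 12 (FP32 master copy, momentum,
# variance). Sharding divides each component across n_devices.

def bytes_per_param(n_devices, shard_optimizer=False, shard_grads=False,
                    shard_params=False):
    params, grads, opt = 2.0, 2.0, 12.0
    if shard_optimizer:
        opt /= n_devices
    if shard_grads:
        grads /= n_devices
    if shard_params:
        params /= n_devices
    return params + grads + opt

print(bytes_per_param(8))                    # 16.0  -- plain DP, full replica
print(bytes_per_param(8, True))              # 5.5   -- optimizer sharded (ZeRO-1)
print(bytes_per_param(8, True, True))        # 3.75  -- + gradients (ZeRO-2)
print(bytes_per_param(8, True, True, True))  # 2.0   -- + parameters (ZeRO-3/FSDP)
```

At 8 devices, full sharding lands on the 2 bytes per parameter cited above; larger device counts push it lower still, at the cost of extra parameter all-gathers during forward and backward passes.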
Straggler Amplification Failure Mode
The key failure mode is straggler amplification. Synchronous all-reduce waits for the slowest replica, so a single GPU that is thermal throttling or experiencing PCIe errors gates the entire training step. As you scale from 8 to 64 to 512 GPUs, the probability that any given step contains a straggler rises, and utilization drops accordingly. Hierarchical all-reduce (first within nodes over fast NVLink, then across nodes over InfiniBand) and gradient compression can mitigate this, but careful system monitoring is essential.
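The scaling effect can be quantified under a simple independence assumption: if each GPU straggles on a given step with probability p, the chance that a synchronous step is gated by at least one straggler is 1 - (1 - p)^N. Real slowdowns are often correlated (shared cooling, shared switches), so treat this as a rough model rather than a prediction:

```python
# Probability that a synchronous step of n_gpus is slowed by at least one
# straggler, assuming each GPU independently straggles with probability p.

def step_straggler_prob(p, n_gpus):
    return 1 - (1 - p) ** n_gpus

for n in (8, 64, 512):
    print(n, round(step_straggler_prob(0.001, n), 3))
# 8   0.008
# 64  0.062
# 512 0.401
```

Even a 0.1% per-GPU straggle rate means roughly 40% of steps at 512 GPUs run at the straggler's pace, which is why hierarchical all-reduce and per-device monitoring matter at scale.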