
Data Parallelism: Scaling Training Throughput

Data Parallelism (DP) is the simplest and most widely used distributed training strategy. Each GPU holds a complete replica of the model and processes a different subset of the training batch. After computing gradients locally, all replicas synchronize via an all-reduce operation that averages gradients across devices, so every replica applies an identical weight update.

The communication cost scales with model size and device count. For a 1-billion-parameter model in FP16 (2 bytes per parameter), each training step must all-reduce roughly S = 2 GB of gradients. With a ring all-reduce across N = 8 GPUs, each device transfers about 2(N − 1)/N × S ≈ 3.5 GB. On 200 Gbps links (25 GB/s), that takes at least 140 milliseconds just for gradient exchange. Upgrading to 400 Gbps fabrics halves this to 70 milliseconds, but as models grow to 10B or 100B parameters, communication becomes the bottleneck.

Data parallelism shines when the model fits comfortably in a single device's memory and the network can absorb gradient aggregation. Meta's PyTorch Fully Sharded Data Parallel (FSDP) and Microsoft's DeepSpeed ZeRO extend basic DP by sharding optimizer states, gradients, and even parameters across devices, cutting per-device memory from 16 bytes per parameter to roughly 16/N bytes when fully sharded across N devices (about 2 bytes per parameter at 8-way sharding). This enables 10B+ parameter models on 40 to 80 GB GPUs without switching to model or pipeline parallelism.

The key failure mode is straggler amplification. Synchronous all-reduce waits for the slowest replica, so a single GPU that thermal-throttles or hits PCIe errors gates the entire training step. As you scale from 8 to 64 to 512 GPUs, the probability of encountering a straggler in any given step increases and utilization drops. Hierarchical all-reduce (first within nodes over fast NVLink, then across nodes over InfiniBand) and gradient compression mitigate this, but careful system monitoring remains essential.
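To make the bandwidth arithmetic above concrete, here is a back-of-the-envelope sketch. The parameter count, precision, GPU count, and link speeds are the ones quoted in the text; the function name and structure are illustrative.

```python
def ring_all_reduce_seconds(params: float, bytes_per_param: int,
                            num_gpus: int, link_gbps: float) -> float:
    """Bandwidth-bound lower bound on ring all-reduce time (ignores latency)."""
    grad_bytes = params * bytes_per_param                      # gradient volume per replica
    per_device = 2 * (num_gpus - 1) / num_gpus * grad_bytes    # bytes each device sends/receives
    bytes_per_sec = link_gbps / 8 * 1e9                        # Gbps -> bytes per second
    return per_device / bytes_per_sec

# 1B params in FP16 across 8 GPUs: 2 GB of gradients, ~3.5 GB moved per device
print(ring_all_reduce_seconds(1e9, 2, 8, 200))  # ~0.14 s on 200 Gbps links
print(ring_all_reduce_seconds(1e9, 2, 8, 400))  # ~0.07 s on 400 Gbps links
```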
💡 Key Takeaways
Communication cost example: a 1B-parameter model requires an all-reduce of 2 GB of gradients; with a ring algorithm across 8 GPUs on 200 Gbps links, gradient sync alone takes a minimum of 140 milliseconds per step
Scaling efficiency degrades as communication time approaches compute time: typically acceptable when gradient sync takes less than 20 to 30 percent of step time, problematic beyond that
ZeRO optimization shards optimizer states, gradients, and parameters across devices, reducing per-device memory from 16 bytes per parameter to roughly 16/N bytes (about 2 bytes per parameter at 8-way sharding) when fully sharded
Hierarchical all-reduce exploits topology: first aggregate within nodes over NVLink (600 GB/s), then across nodes over InfiniBand (200 Gbps), reducing cross-node traffic and latency (a hand-rolled sketch of this pattern follows this list)
Straggler amplification: synchronous updates wait for the slowest GPU; thermal throttling or PCIe errors on one device can reduce overall throughput by 10 to 20 percent in large clusters
Global batch size scales linearly with replicas: 8 GPUs with local batch 32 yields global batch 256; requires learning rate scaling and warmup to maintain convergence quality (see the scaling sketch after this list)
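As referenced above, the hierarchical pattern can be hand-rolled with PyTorch process groups. In practice NCCL already picks topology-aware algorithms, so treat this as a sketch of the idea under assumed conventions: node-major rank layout and one leader (local rank 0) per node.

```python
import torch
import torch.distributed as dist

def build_groups(gpus_per_node: int):
    """Create this rank's intra-node group and the leaders-only inter-node group.

    Assumes world_size is a multiple of gpus_per_node and ranks are node-major
    (ranks 0..G-1 on node 0, G..2G-1 on node 1, ...).
    """
    world_size, rank = dist.get_world_size(), dist.get_rank()
    node = rank // gpus_per_node

    intra_group = None
    for n in range(world_size // gpus_per_node):
        # new_group must be called by every rank, for every group, in the same order
        g = dist.new_group(list(range(n * gpus_per_node, (n + 1) * gpus_per_node)))
        if n == node:
            intra_group = g

    leader_ranks = list(range(0, world_size, gpus_per_node))
    inter_group = dist.new_group(leader_ranks)
    return intra_group, inter_group, leader_ranks[node]

def hierarchical_all_reduce(grad: torch.Tensor, intra_group, inter_group, leader: int):
    """Reduce within the node, all-reduce across node leaders, broadcast back."""
    dist.reduce(grad, dst=leader, op=dist.ReduceOp.SUM, group=intra_group)
    if dist.get_rank() == leader:
        dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=inter_group)
    dist.broadcast(grad, src=leader, group=intra_group)
    grad /= dist.get_world_size()  # average across all replicas
```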
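And a minimal sketch of linear learning-rate scaling with warmup for the global-batch takeaway; the base learning rate, base batch size, and warmup length below are illustrative values, not figures from the article.

```python
def scaled_lr(step: int,
              base_lr: float = 0.1,      # tuned for base_batch (illustrative)
              base_batch: int = 256,
              global_batch: int = 2048,  # e.g. 64 GPUs x local batch 32
              warmup_steps: int = 500) -> float:
    """Linear scaling rule: target LR grows with global batch, reached via linear warmup."""
    target_lr = base_lr * global_batch / base_batch
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps  # ramp up from near zero
    return target_lr
```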
📌 Examples
PyTorch Distributed Data Parallel (DDP): Standard implementation using NCCL all-reduce; Meta reports near-linear scaling up to 64 GPUs for ResNet-50 and BERT-Base with fast interconnects (a minimal DDP loop is sketched below)
DeepSpeed ZeRO Stage 2: Shards optimizer states and gradients across devices; enables training 10B-parameter models on 16x 32 GB V100s without model parallelism
NVIDIA NCCL-optimized all-reduce: Achieves 280 GB/s effective bandwidth across 8 A100 GPUs on NVSwitch, enabling sub-10 ms gradient sync for 1B-parameter models
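For reference, a minimal DDP training loop of the kind the first example describes; the model, batch size, and hyperparameters are placeholders. Launched with torchrun (for example, torchrun --nproc_per_node=8 train.py), which sets the rank environment variables that init_process_group reads.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")               # reads RANK/WORLD_SIZE set by torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda(local_rank)  # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])           # gradients all-reduced during backward
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 4096, device=local_rank)      # local batch; global batch = 32 x world_size
        loss = model(x).pow(2).mean()                      # dummy loss for illustration
        loss.backward()                                    # NCCL all-reduce overlaps with backward
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```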