
3D Parallelism and Topology Aware Mapping in Production

3D parallelism combines Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) when no single parallelism strategy can fit the model or reach the target throughput. You need 3D when individual layers exceed single-GPU memory (requiring TP), the full model depth is too large (requiring PP), and you still need multiple replicas for throughput (requiring DP). The design challenge is deciding how to allocate your total GPU count across these three dimensions.

The core principle is topology-aware mapping: match communication frequency to interconnect bandwidth. Tensor-parallel groups perform two collectives per layer (high frequency), so confine them to the fastest links, such as NVSwitch at 600 GB/s within a node. Pipeline stages exchange activations once per micro-batch (medium frequency), so they can span slower InfiniBand links at 200 Gbps across nodes. Data-parallel replicas all-reduce gradients once per mini-batch (lowest frequency), so they tolerate even higher latency across the cluster.

For a 175-billion-parameter model on 512 GPUs, you might choose TP=8, PP=8, DP=8. The reasoning: TP=8 fits within an 8-GPU node connected by NVSwitch, keeping the high-frequency collectives fast. PP=8 splits the model into 8 stages because the full depth exceeds what TP alone can handle memory-wise. DP=8 uses the remaining dimension to maintain reasonable throughput, giving an effective batch size 8 times larger than a single replica. Adding ZeRO sharding across the DP dimension reduces per-device memory from 16 bytes per parameter to 2 to 4 bytes, enabling this configuration on 80 GB GPUs.

The trade-offs are complexity versus capability. 3D parallelism requires careful stage balancing for PP, micro-batch scheduling, and checkpoint sharding across all three dimensions. Communication overhead increases as you add dimensions, reducing efficiency below the theoretical peak. However, without 3D, models beyond 50 to 100 billion parameters become infeasible on current hardware. Meta trained OPT-175B using this approach on 992 A100 GPUs, achieving roughly 147 TFLOP/s per GPU (47 percent of peak), demonstrating that carefully tuned 3D parallelism makes large-scale training practical despite the complexity.
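To make the topology-aware mapping concrete, here is a minimal sketch in plain Python (framework-free, so it runs anywhere) that maps a global rank to (DP, PP, TP) coordinates with TP as the fastest-varying dimension, so each tensor-parallel group occupies contiguous ranks inside one node. The function names, the assumption that launchers assign ranks node by node, and the TP=PP=DP=8 numbers are illustrative, not a prescribed implementation.

```python
# Minimal sketch: topology-aware rank -> (dp, pp, tp) mapping.
# Assumption: ranks are assigned node by node (ranks 0-7 on node 0, ranks 8-15
# on node 1, ...), which is the common launcher convention but not guaranteed.

def rank_to_coords(rank: int, tp: int, pp: int, dp: int) -> tuple[int, int, int]:
    """Map a global rank to (dp, pp, tp) with TP as the innermost dimension.

    Because TP varies fastest, every TP group is a block of `tp` consecutive
    ranks, which lands inside a single node whenever tp <= gpus_per_node.
    """
    assert 0 <= rank < tp * pp * dp
    tp_rank = rank % tp                 # innermost: NVSwitch-speed collectives, twice per layer
    pp_rank = (rank // tp) % pp         # middle: activations once per micro-batch
    dp_rank = rank // (tp * pp)         # outermost: gradient all-reduce once per mini-batch
    return dp_rank, pp_rank, tp_rank


def tp_group_members(rank: int, tp: int) -> list[int]:
    """Return the global ranks in this rank's tensor-parallel group."""
    start = (rank // tp) * tp
    return list(range(start, start + tp))


if __name__ == "__main__":
    TP, PP, DP = 8, 8, 8          # 512 GPUs total, as in the example above
    GPUS_PER_NODE = 8             # assumed node size with NVSwitch
    assert TP <= GPUS_PER_NODE, "keep TP collectives on the intra-node fabric"

    for rank in (0, 7, 8, 63, 64, 511):
        dp_r, pp_r, tp_r = rank_to_coords(rank, TP, PP, DP)
        node = rank // GPUS_PER_NODE
        print(f"rank {rank:3d} -> node {node:2d}, dp={dp_r}, pp={pp_r}, tp={tp_r}, "
              f"tp group {tp_group_members(rank, TP)}")
```

Running it shows, for example, that ranks 0 through 7 form one TP group on node 0, ranks 8 through 15 form the next TP group on node 1, and the DP coordinate only changes every 64 ranks, matching the frequency-to-bandwidth ordering described above.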
💡 Key Takeaways
When to use 3D: Individual layers too large for a single GPU (need TP), total model depth too large (need PP), still need throughput (need DP); without all three, 100B+ parameter models are infeasible
Topology mapping principle: Match communication frequency to bandwidth; TP (2 collectives/layer) on fast NVSwitch at 600 GB/s, PP (once per micro-batch) on InfiniBand at 200 Gbps, DP (once per mini-batch) cluster-wide
Degree selection example: TP limited by fast interconnect topology (typically 4 to 8 GPUs per node), PP by model depth and stage balance, DP uses remaining GPUs for throughput scaling
Memory optimization: ZeRO shards optimizer states and gradients across the DP dimension, reducing 16 bytes per parameter to 2 to 4 bytes; this enables 175B models on 80 GB GPUs with TP=8, PP=8, DP=8 on 512 devices (see the memory sketch after this list)
Trade-off: 3D adds complexity (stage balancing, micro-batch scheduling, checkpoint sharding) and communication overhead, reducing efficiency to 40 to 50 percent of peak, but it makes 100B+ models trainable
Production validation: Meta's OPT-175B, trained on 992 A100 GPUs with 3D parallelism, achieved roughly 147 TFLOP/s per GPU (47 percent of the 312 TFLOP/s peak), demonstrating practical feasibility at scale
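As a rough illustration of the memory numbers above, here is a back-of-the-envelope sketch. It assumes mixed-precision Adam (2-byte weights and gradients plus 12 bytes of fp32 optimizer state per parameter) and ignores activations, buffers, and fragmentation; the function name and exact byte counts are illustrative assumptions, not a framework API.

```python
# Back-of-the-envelope per-GPU memory for model and optimizer states only.
# Assumes mixed-precision Adam: 2 B fp16 weight + 2 B fp16 grad + 12 B fp32
# state (master weight, momentum, variance) = 16 B per parameter.
# Activations, buffers, and fragmentation are deliberately ignored.

GB = 1e9  # decimal gigabytes for simplicity

def per_gpu_state_gb(params: float, tp: int, pp: int, dp: int, zero_stage: int) -> float:
    params_per_gpu = params / (tp * pp)   # TP and PP split the parameters themselves
    weight, grad, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if zero_stage >= 1:
        optim /= dp                       # ZeRO-1: shard optimizer states over DP
    if zero_stage >= 2:
        grad /= dp                        # ZeRO-2: also shard gradients
    if zero_stage >= 3:
        weight /= dp                      # ZeRO-3: also shard weights
    return params_per_gpu * (weight + grad + optim) / GB

if __name__ == "__main__":
    P = 175e9                             # 175B parameters, TP=8 x PP=8 x DP=8 = 512 GPUs
    for stage in (0, 1, 2, 3):
        print(f"ZeRO-{stage}: {per_gpu_state_gb(P, tp=8, pp=8, dp=8, zero_stage=stage):6.1f} GB/GPU")
```

Under these assumptions the sketch gives roughly 44 GB/GPU without ZeRO, dropping to about 10 GB (ZeRO-2, ~3.75 bytes/parameter) or 5.5 GB (ZeRO-3, 2 bytes/parameter), which is how the 2 to 4 bytes per parameter figure fits within 80 GB devices.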
📌 Examples
Deciding TP degree: If you have 8-GPU nodes with NVSwitch, set TP=8 to keep collectives within the node; crossing to slower inter-node links can drop efficiency from roughly 90 percent to 30 percent
Deciding PP degree: A 96-layer transformer might use PP=8 (12 layers per stage) to fit memory after applying TP=8; verify that stages are balanced within 10 to 20 percent of each other's compute time (see the balance-check sketch after these examples)
Deciding DP degree: With 512 total GPUs and TP=8, PP=8, the remaining dimension is DP=8; this gives an effective batch size 8x larger than a single replica, for throughput
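To illustrate the stage-balancing check from the PP example, here is a short sketch that splits 96 layers into 8 contiguous stages and reports the compute imbalance. The per-layer costs, the extra weight lumped onto the first and last layers for the embedding and output head, and the helper names are made-up placeholders; in practice you would profile real per-layer forward and backward times.

```python
# Sketch: check that contiguous pipeline stages are compute-balanced.
# Per-layer costs below are illustrative placeholders; profile real values.

def even_stages(costs: list[float], num_stages: int) -> list[list[float]]:
    """Split layer costs into num_stages contiguous chunks of (nearly) equal layer count."""
    base, extra = divmod(len(costs), num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)
        stages.append(costs[start:start + size])
        start += size
    return stages

def imbalance(stages: list[list[float]]) -> float:
    """Relative gap between the slowest and fastest stage; the slowest stage paces the pipeline."""
    loads = [sum(stage) for stage in stages]
    return (max(loads) - min(loads)) / max(loads)

if __name__ == "__main__":
    # 96 uniform transformer layers; the embedding and LM head are lumped onto
    # the first and last layers as rough, assumed extra costs.
    costs = [1.0] * 96
    costs[0] += 0.5
    costs[-1] += 0.8

    stages = even_stages(costs, num_stages=8)   # 12 layers per stage, as in the example
    imb = imbalance(stages)
    print(f"layers per stage: {[len(s) for s in stages]}")
    print(f"imbalance: {imb:.1%}  ({'OK' if imb <= 0.20 else 'rebalance needed'})")
```

With these placeholder costs the even split lands around 6 percent imbalance, inside the 10 to 20 percent rule of thumb; if a real profile pushed it outside that band, you would shift layers between neighboring stages before committing to PP=8.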