World Size Agnostic Checkpoints and Elastic Recovery
The Brittleness Problem
Traditional checkpoints embed the number of GPUs (world size) and the parallelism strategy directly into the checkpoint format by saving tensors under rank-specific filenames or with rank-specific shapes. This creates a brittle coupling: if you trained on 256 GPUs with 8-way tensor parallelism and need to resume on 128 GPUs with 4-way parallelism, you face a costly offline reshape and repartition of the entire checkpoint.
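The brittle pattern can be sketched as follows. This is a minimal illustration, not any framework's actual format; the function names and file layout are hypothetical:

```python
import os
import pickle


def save_rank_indexed(ckpt_dir, rank, local_shard):
    # Brittle pattern: each rank writes its shard under a rank-indexed
    # filename, baking the world size into the checkpoint layout.
    os.makedirs(ckpt_dir, exist_ok=True)
    with open(os.path.join(ckpt_dir, f"model_rank_{rank}.pt"), "wb") as f:
        pickle.dump(local_shard, f)


def load_rank_indexed(ckpt_dir, rank, world_size):
    # Restore assumes exactly one file per rank: resuming at a different
    # world size leaves some ranks with no matching file (or with shards
    # shaped for the old partitioning).
    path = os.path.join(ckpt_dir, f"model_rank_{rank}.pt")
    if not os.path.exists(path):
        raise FileNotFoundError(
            f"no shard for rank {rank}: checkpoint was written at a "
            f"different world size than {world_size}"
        )
    with open(path, "rb") as f:
        return pickle.load(f)
```

A checkpoint written by 2 ranks loads fine at world size 2, but rank 3 of a 4-GPU job finds no file at all, which is exactly the coupling the next section removes.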
World Size Agnostic Design
World-size-agnostic checkpoints solve this by storing logical parameter names and partition metadata instead of rank-indexed state, enabling elastic recovery: training can resume on a different number of GPUs without rewriting the checkpoint. Each parameter tensor is saved under a global logical name together with metadata describing how it was partitioned.
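A shard record in this design might carry the following fields. The field names are illustrative assumptions, not a real on-disk schema:

```python
def make_shard_record(param_name, shard, axis, shard_index, num_shards, full_shape):
    # One saved shard, keyed by the parameter's global logical name plus
    # enough partition metadata to reassemble or repartition it later.
    return {
        "name": param_name,          # global logical name, e.g. "layers.0.attn.qkv.weight"
        "axis": axis,                # which dimension was partitioned
        "shard_index": shard_index,  # this shard's position along that axis
        "num_shards": num_shards,    # total shards at save time
        "full_shape": full_shape,    # unpartitioned tensor shape
        "data": shard,               # the shard's values
    }
```

Because the record names the parameter logically and describes the partitioning explicitly, no rank identity is needed to interpret it at restore time.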
Resharding on Restore
On restore, a resharding algorithm reads these logical tensors and repartitions them for the new world size and parallelism configuration. For example, a weight matrix split 8 ways by tensor parallelism on 256 GPUs is stored as 8 logical shards; on resume with 4-way tensor parallelism on 128 GPUs, the loader merges pairs of shards into 4 chunks.
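The merge-or-split step can be sketched as below, assuming the old and new shard counts divide one another evenly. Shards are plain Python lists split along the partition axis; a real implementation would operate on tensors and honor the axis recorded in the metadata:

```python
def reshard(shards, new_num_shards):
    """Regroup shards saved at one tensor-parallel degree for a new degree."""
    old = len(shards)
    if old % new_num_shards == 0:
        # Merge: concatenate contiguous groups (e.g. 8 shards -> 4 chunks).
        group = old // new_num_shards
        return [sum(shards[i * group:(i + 1) * group], [])
                for i in range(new_num_shards)]
    if new_num_shards % old == 0:
        # Split: cut each saved shard into equal contiguous pieces.
        split = new_num_shards // old
        out = []
        for s in shards:
            n = len(s) // split
            out.extend(s[i * n:(i + 1) * n] for i in range(split))
        return out
    raise ValueError("old and new shard counts must divide one another")
```

For the example in the text, `reshard` over 8 saved shards with `new_num_shards=4` concatenates adjacent pairs, yielding the 4 chunks each new tensor-parallel group loads.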
Implementation Examples
Meta's FSDP and NVIDIA's Megatron both implement this pattern. FSDP checkpoints include a parameter flattening map that records original tensor shapes and shard boundaries independently of the rank count, so the restore path can reshard checkpoints across different parallelism configurations. Elastic recovery is not free: it adds 5 to 10 minutes to recovery time objective (RTO) for large models, and optimizer state resharding requires careful mapping to preserve numerical consistency.
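A flattening map in the spirit of what the text describes for FSDP can be sketched as follows. FSDP's actual on-disk format differs; this only illustrates why recording shapes and offsets, rather than ranks, makes shard boundaries recomputable at any world size:

```python
def build_flattening_map(params):
    """params: dict of logical name -> original shape (tuple of ints).

    Returns a map of each parameter's shape and its [start, end) offsets
    in a single flat buffer, plus the total flat length. No rank count
    appears anywhere in the map.
    """
    flat_map, offset = {}, 0
    for name, shape in params.items():
        numel = 1
        for d in shape:
            numel *= d
        flat_map[name] = {"shape": shape, "start": offset, "end": offset + numel}
        offset += numel
    return flat_map, offset


def shard_boundaries(total, world_size):
    # Even split of the flat buffer across ranks, recomputed fresh for
    # whatever world size the restored job happens to have.
    per = -(-total // world_size)  # ceiling division
    return [(r * per, min((r + 1) * per, total)) for r in range(world_size)]
```

Because `shard_boundaries` is a pure function of the flat length and the current world size, the same saved map serves a 256-GPU run and a 128-GPU resume equally well.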