Training Infrastructure & Pipelines › Model Checkpointing & Recovery · Medium · ⏱️ ~3 min

Snapshot and Persist: The Two-Phase Checkpointing Protocol

Two-Phase Design

Production checkpointing systems split the operation into two distinct phases to minimize GPU idle time: snapshot and persist. The snapshot phase runs synchronously and briefly stalls training (typically 10 to 30 seconds) to create a consistent in-memory copy of all training state at a global barrier. The persist phase then streams this snapshot to durable storage asynchronously while the GPUs resume training. This separation is critical because writing an 18 TB checkpoint to remote storage at a realistic 100 GB/s aggregate throughput takes 180 seconds, yet training pays only the roughly 20 seconds of snapshot stall.
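The split can be sketched in a few lines. This is a minimal, hypothetical skeleton (the names `checkpoint`, `train_state`, and `persist_fn` are assumptions, not a real library API): the synchronous copy is the only part the training loop waits on, while a background thread handles the slow durable write.

```python
import threading

def checkpoint(train_state, persist_fn):
    """Minimal sketch of the two-phase split (hypothetical API).

    train_state: dict of rank-local, bytes-like state buffers.
    persist_fn:  callable that durably writes one snapshot
                 (e.g. to a parallel filesystem or object store).
    """
    # Phase 1 -- snapshot: synchronous; training stalls only for this copy.
    # Real systems copy GPU state into pinned host buffers at a barrier.
    snapshot = {name: bytes(buf) for name, buf in train_state.items()}

    # Phase 2 -- persist: a background thread streams the snapshot to
    # durable storage while the training loop resumes immediately.
    writer = threading.Thread(target=persist_fn, args=(snapshot,))
    writer.start()
    return writer  # caller should join() before taking the next snapshot
```

The returned thread handle matters in practice: joining it before the next snapshot bounds how many in-flight snapshot buffers can pile up in host RAM.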

Snapshot Consistency

The snapshot must be atomic and consistent across all distributed ranks. Every GPU enters a barrier after completing its optimizer step and all-reduce operations, ensuring every rank captures state at exactly the same global step counter. During snapshot, each rank serializes its partition of parameters and optimizer state into a pinned memory buffer, computes checksums for integrity, and records metadata such as tensor shapes and partition boundaries.
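A per-rank snapshot step might look like the following sketch. The function name and field layout are hypothetical; the point is that each rank emits both its serialized partition and the integrity metadata (checksum, size, global step, partition contents) that the manifest will later record. A real system would serialize into pinned (page-locked) host memory rather than a Python bytes object.

```python
import hashlib
import pickle

def snapshot_rank(params, optim_state, global_step):
    """Sketch of one rank's snapshot step (hypothetical names).

    Called after the optimizer step, once all ranks have entered the
    barrier, so every rank records the same global_step.
    """
    # Serialize this rank's partition of model and optimizer state.
    payload = pickle.dumps({"params": params, "optim": optim_state})
    meta = {
        "global_step": global_step,                     # identical on every rank
        "num_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),  # integrity check
        "partition": sorted(params),                    # tensors this rank owns
    }
    return payload, meta
```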

Persist Phase

Persist happens in the background while training continues. Each rank launches async I/O threads that write its snapshot shard to a distributed filesystem or object storage under a unique checkpoint directory. The key to atomicity is writing a manifest file last. This manifest lists all shards with their checksums, sizes, global step, and schema version. Only after every shard is durably written does the system update a lightweight "latest" pointer to reference the new checkpoint.
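The manifest-last commit described above can be sketched as follows. The file layout (`manifest.json`, a tiny `latest` pointer file) is an assumption for illustration; the mechanism is what matters: write the manifest only after all shards are durable, then swing the pointer with an atomic rename so readers never observe a half-written checkpoint.

```python
import json
import os

def commit_checkpoint(ckpt_dir, shard_infos, global_step, latest_path):
    """Write the manifest last, then atomically update the 'latest' pointer.

    Hypothetical layout: shard_infos maps shard filename -> {sha256, bytes}.
    Assumes every shard file is already durably written under ckpt_dir.
    """
    manifest = {
        "global_step": global_step,
        "schema_version": 1,
        "shards": shard_infos,
    }
    # The manifest is the commit record: recovery ignores any checkpoint
    # directory that lacks one. Write to a temp name, fsync, then rename.
    tmp = os.path.join(ckpt_dir, "manifest.json.tmp")
    with open(tmp, "w") as f:
        json.dump(manifest, f)
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, os.path.join(ckpt_dir, "manifest.json"))

    # The 'latest' pointer is a tiny file naming the committed directory;
    # rename() makes the update atomic on POSIX filesystems.
    ptr_tmp = latest_path + ".tmp"
    with open(ptr_tmp, "w") as f:
        f.write(ckpt_dir)
        f.flush()
        os.fsync(f.fileno())
    os.rename(ptr_tmp, latest_path)
```

On object stores without rename, the same effect is achieved by making the manifest upload itself the commit point, since the "latest" pointer is only ever updated after the manifest PUT succeeds.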

Storage Backend Choice

Parallel filesystems like Lustre or GPFS offer high bandwidth (tens to hundreds of GB/s aggregate), but their capacity is limited and expensive. Object storage provides effectively unlimited capacity at lower cost but adds per-request overhead. Production systems optimize object-storage writes with multipart uploads using large parts and by writing many shards in parallel.
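One of the object-storage optimizations mentioned in the takeaways below, batching many small tensors into a few large blobs, can be sketched without any storage SDK. This is a hypothetical helper (the name `pack_blobs` and the index layout are assumptions): it coalesces serialized tensors into blobs of roughly a target size, and keeps a `(blob_id, offset, length)` index so individual tensors remain addressable on read.

```python
def pack_blobs(tensors, target_blob_bytes=512 * 1024 * 1024):
    """Batch small serialized tensors into large blobs (hypothetical sketch).

    Coalescing means object storage pays per-request overhead once per blob
    instead of once per tensor. tensors: dict name -> bytes. Returns
    (blobs, index) where index maps name -> (blob_id, offset, length).
    """
    blobs, index = [], {}
    current = bytearray()
    for name, data in tensors.items():
        # Start a new blob if appending would exceed the target size.
        if current and len(current) + len(data) > target_blob_bytes:
            blobs.append(bytes(current))
            current = bytearray()
        index[name] = (len(blobs), len(current), len(data))
        current.extend(data)
    if current:
        blobs.append(bytes(current))
    return blobs, index
```

Each resulting blob would then be written as one multipart upload; the index travels with the manifest so recovery can issue ranged reads for just the tensors a rank needs.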

💡 Key Takeaways
- Snapshot phase stalls training for 10 to 30 seconds to create a consistent in-memory copy at a global barrier; the persist phase writes to storage asynchronously over 2 to 5 minutes while training continues
- Atomic commits require writing the manifest file last, after all shards are verified; incomplete checkpoints without manifests are ignored on recovery, preventing corruption from partial writes
- Each rank in a 64-GPU cluster writes its 11 GB shard in parallel; at 100 GB/s aggregate throughput (about 1.5 GB/s per rank), the entire ~720 GB checkpoint persists in under 10 seconds of wall time
- Object storage optimizations: multipart uploads with 100 MB to 1 GB parts, parallel writes from all nodes, and batching small tensors into larger blobs to reduce metadata overhead and request latency
- Background persist can cause resource contention: async writes may exhaust host RAM (snapshot buffers) or saturate the network, requiring throttling or serialization of concurrent checkpoint operations
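The recovery-side counterpart of the manifest rule is worth seeing concretely. In this hypothetical sketch (the `latest` pointer file and `manifest.json` layout are assumptions), a checkpoint directory without a manifest is treated as an incomplete write to skip, never as corruption to crash on:

```python
import json
import os

def load_latest(latest_path):
    """Recovery sketch: follow the 'latest' pointer to a committed manifest.

    Returns the manifest dict, or None if no complete checkpoint exists.
    A directory lacking manifest.json is an abandoned partial write.
    """
    if not os.path.exists(latest_path):
        return None  # no checkpoint has ever been committed
    with open(latest_path) as f:
        ckpt_dir = f.read().strip()
    manifest_file = os.path.join(ckpt_dir, "manifest.json")
    if not os.path.exists(manifest_file):
        return None  # partial write: the manifest is the commit record
    with open(manifest_file) as f:
        return json.load(f)
```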
📌 Interview Tips
1. NVIDIA Megatron checkpointing: 8-way tensor parallel, 16-way pipeline parallel across 128 GPUs; each GPU writes a 140 GB shard to Lustre in 90 seconds at 1.5 GB/s per node, and the snapshot barrier adds a 20 s stall
2. Meta PyTorch FSDP: 256 A100 GPUs checkpoint a 175B-parameter model (1.4 TB total) by writing 5.6 GB per rank to S3 using 500 MB multipart uploads, achieving 80 GB/s aggregate (about 320 MB/s per rank)
3. Google JAX on TPU v4: checkpoint every 15 minutes to GCS; snapshot takes 12 s, persist 3 minutes; the manifest includes shard checksums and is written only after all 512 hosts confirm upload success