
Snapshot and Persist: The Two-Phase Checkpointing Protocol

Production checkpointing systems split the operation into two distinct phases to minimize GPU idle time: snapshot and persist. The snapshot phase happens synchronously and briefly stalls training (typically 10 to 30 seconds) to create a consistent in-memory copy of all training state at a global barrier, usually at the end of a training step after all gradient synchronization completes. The persist phase then streams this snapshot to durable storage asynchronously while the GPUs resume training on the next batch. This separation is critical: writing an 18 TB checkpoint to remote storage at a realistic 100 GB/s takes 180 seconds, but the job only pays roughly 20 seconds of actual stall time.

The snapshot must be atomic and consistent across all distributed ranks. Every GPU enters a barrier after completing its optimizer step and all-reduce operations, ensuring every rank captures state at exactly the same global step counter. During the snapshot, each rank serializes its partition of parameters and optimizer state into a pinned-memory buffer (a GPU-to-CPU transfer path optimized for bandwidth), computes checksums for integrity, and records metadata such as tensor shapes and partition boundaries. For a 70B-parameter model sharded across 64 GPUs, each rank snapshots roughly 11 GB in 15 to 25 seconds.

Persist happens in the background while training continues. Each rank launches async I/O threads that write its snapshot shard to a distributed filesystem or object storage under a unique checkpoint directory (identified by global step and timestamp). The key to atomicity is writing a manifest file last. This manifest lists all shards with their checksums, sizes, global step, and schema version. Only after every shard is durably written and verified does the system update a lightweight "latest" pointer (often a small file or symlink) to reference the new checkpoint. If a failure occurs mid-write, the incomplete checkpoint lacks a manifest and is simply ignored during recovery.

Storage backend choice dramatically affects throughput and cost. Parallel filesystems like Lustre or GPFS offer high aggregate bandwidth (tens to hundreds of GB/s) and POSIX semantics, making them ideal for frequent checkpoints, but capacity is limited and expensive. Object storage like S3 or Google Cloud Storage provides effectively unlimited capacity at lower cost, but carries per-request overhead and eventual-consistency challenges. Production systems optimize object-storage writes by using multipart uploads with large parts (100 MB to 1 GB) and writing many shards in parallel to saturate network bandwidth. Meta and OpenAI report sustaining 50 to 150 GB/s checkpoint writes to object storage from large GPU clusters.
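To make the protocol concrete, here is a minimal per-rank sketch of snapshot-then-persist in PyTorch-style Python. The function and file names (snapshot_and_persist, manifest.json, LATEST) and the filesystem-polling commit are illustrative assumptions, not the API of any particular framework, and the sketch assumes all ranks write to a shared filesystem such as Lustre.

```python
import hashlib
import json
import os
import threading
import time

import torch
import torch.distributed as dist


def snapshot_and_persist(model, optimizer, global_step, ckpt_root):
    """Two-phase checkpoint for one rank: brief synchronous snapshot,
    then asynchronous persist while training continues."""
    rank = dist.get_rank()
    world = dist.get_world_size()

    # --- Phase 1: snapshot (synchronous, short stall at a global barrier) ---
    dist.barrier()  # every rank is now at the same global step
    pinned = {}
    for name, t in model.state_dict().items():
        buf = torch.empty(t.shape, dtype=t.dtype, device="cpu", pin_memory=True)
        buf.copy_(t.detach(), non_blocking=True)  # fast GPU->CPU copy via pinned memory
        pinned[name] = buf
    opt_state = optimizer.state_dict()
    torch.cuda.synchronize()  # make sure all async copies have landed
    # Training can resume as soon as the persist thread below is launched.

    # --- Phase 2: persist (background thread, no GPU involvement) ---
    def _persist():
        ckpt_dir = os.path.join(ckpt_root, f"step_{global_step:09d}")
        os.makedirs(ckpt_dir, exist_ok=True)
        shard_path = os.path.join(ckpt_dir, f"shard_{rank:05d}.pt")
        torch.save({"model": pinned, "optim": opt_state, "step": global_step},
                   shard_path)
        with open(shard_path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        with open(shard_path + ".sha256", "w") as f:
            f.write(digest)  # per-shard integrity record

        # Manifest-last commit: rank 0 waits until every shard has reported a
        # checksum, then writes manifest.json and flips the LATEST pointer.
        # A crash before this point leaves no manifest, so recovery simply
        # ignores the half-written directory.
        if rank == 0:
            while sum(p.endswith(".sha256") for p in os.listdir(ckpt_dir)) < world:
                time.sleep(1)
            manifest = {
                "step": global_step,
                "num_shards": world,
                "shards": sorted(p for p in os.listdir(ckpt_dir) if p.endswith(".pt")),
            }
            with open(os.path.join(ckpt_dir, "manifest.json"), "w") as f:
                json.dump(manifest, f)
            tmp = os.path.join(ckpt_root, "LATEST.tmp")
            with open(tmp, "w") as f:
                f.write(ckpt_dir)
            os.replace(tmp, os.path.join(ckpt_root, "LATEST"))  # atomic pointer update

    threading.Thread(target=_persist, daemon=True).start()
```

A production version would reuse the pinned buffers across checkpoints and throttle concurrent persists, which is exactly the contention concern raised in the takeaways below.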
💡 Key Takeaways
Snapshot phase stalls training for 10 to 30 seconds to create a consistent in-memory copy at a global barrier; persist phase writes to storage asynchronously over 2 to 5 minutes while training continues
Atomic commits require writing the manifest file last, after all shards are verified; incomplete checkpoints without manifests are ignored on recovery, preventing corruption from partial writes (a recovery-side sketch follows the examples below)
Each rank in a 64-GPU cluster writes its ~11 GB shard in parallel; at 100 GB/s aggregate throughput the 720 GB checkpoint persists in under 10 seconds of wall time (the 18 TB example above takes roughly 180 seconds at the same rate)
Object storage optimizations: multipart uploads with 100 MB to 1 GB parts, parallel writes from all nodes, and batching small tensors into larger blobs to reduce metadata overhead and request latency (see the upload sketch after this list)
Background persist can cause resource contention: async writes may exhaust host RAM (snapshot buffers) or saturate network, requiring throttling or serialization of concurrent checkpoint operations
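The object-storage knobs above map directly onto boto3's transfer settings. In the sketch below, the bucket name, key layout, and the 512 MB part size / 16-stream concurrency are illustrative assumptions; the general pattern is simply large parts plus many parallel streams from every node.

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Large parts amortize per-request overhead; multiple concurrent part uploads
# per shard (on top of every rank uploading in parallel) help saturate the NIC.
cfg = TransferConfig(
    multipart_threshold=256 * 1024 * 1024,   # switch to multipart above 256 MB
    multipart_chunksize=512 * 1024 * 1024,   # 512 MB parts (within the 100 MB-1 GB range)
    max_concurrency=16,                      # parallel part uploads per shard
)


def upload_shard(local_path: str, step: int, rank: int) -> None:
    """Upload one rank's checkpoint shard to object storage (hypothetical bucket/key layout)."""
    key = f"checkpoints/step_{step:09d}/shard_{rank:05d}.pt"
    s3.upload_file(local_path, "my-training-ckpts", key, Config=cfg)
```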
📌 Examples
NVIDIA Megatron checkpointing: 8-way tensor parallel, 16-way pipeline parallel across 128 GPUs; each rank writes a 140 GB shard to Lustre in 90 seconds at 1.5 GB/s per node, and the snapshot barrier adds a 20 s stall
Meta PyTorch FSDP: 256 A100 GPUs checkpoint a 175B model (1.4 TB total) by writing 5.6 GB per rank to S3 using 500 MB multipart uploads, achieving 80 GB/s aggregate (about 320 MB/s per rank)
Google JAX on TPU v4: checkpoint every 15 minutes to GCS; the snapshot takes 12 s and the persist about 3 minutes; the manifest includes shard checksums and is written only after all 512 hosts confirm upload success
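On restart, recovery mirrors the manifest-last commit rule: a checkpoint counts only if its manifest exists and its shard checksums verify. The sketch below assumes the step_*/manifest.json/LATEST layout from the write-side sketch earlier; the names are illustrative, not any framework's API.

```python
import hashlib
import json
import os


def find_latest_complete_checkpoint(ckpt_root):
    """Return the newest checkpoint directory with a complete, verified
    manifest, or None if no usable checkpoint exists."""
    candidates = []
    latest = os.path.join(ckpt_root, "LATEST")
    if os.path.exists(latest):
        with open(latest) as f:
            candidates.append(f.read().strip())
    # Fall back to scanning step_* directories, newest first, in case the
    # pointer itself is stale or missing.
    candidates += sorted(
        (os.path.join(ckpt_root, d) for d in os.listdir(ckpt_root)
         if d.startswith("step_")),
        reverse=True,
    )
    for ckpt_dir in candidates:
        manifest_path = os.path.join(ckpt_dir, "manifest.json")
        if not os.path.exists(manifest_path):
            continue  # no manifest => persist never finished; ignore this directory
        with open(manifest_path) as f:
            manifest = json.load(f)
        ok = True
        for shard in manifest["shards"]:
            path = os.path.join(ckpt_dir, shard)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            with open(path + ".sha256") as f:
                if f.read().strip() != digest:
                    ok = False  # corrupted shard; fall back to an older checkpoint
                    break
        if ok:
            return ckpt_dir
    return None
```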