Snapshot and Persist: The Two-Phase Checkpointing Protocol
Two-Phase Design
Production checkpointing systems split the operation into two distinct phases to minimize GPU idle time: snapshot and persist. The snapshot phase happens synchronously and briefly stalls training (typically 10 to 30 seconds) to create a consistent in-memory copy of all training state at a global barrier. The persist phase then streams this snapshot to durable storage asynchronously while GPUs resume training. This separation is critical because writing an 18 TB checkpoint to remote storage at a realistic aggregate throughput of 100 GB/s takes 180 seconds, but you only pay roughly 20 seconds of actual stall time.
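The control flow above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the `snapshot`, `persist`, and `checkpoint` names are hypothetical, a `deepcopy` stands in for the device-to-host copy, and a plain thread stands in for the async I/O machinery.

```python
import copy
import threading

def snapshot(state):
    # Synchronous phase: training is stalled while we make a consistent
    # in-memory copy of the training state (stand-in for a GPU-to-host copy).
    return copy.deepcopy(state)

def persist(snap, path):
    # Asynchronous phase: stream the snapshot to durable storage
    # while the GPUs have already resumed training.
    with open(path, "w") as f:
        f.write(repr(snap))

def checkpoint(state, path):
    snap = snapshot(state)                 # brief stall (seconds)
    t = threading.Thread(target=persist, args=(snap, path))
    t.start()                              # training resumes immediately
    return t                               # join before the next checkpoint
```

The caller pays only for `snapshot`; the expensive `persist` overlaps with subsequent training steps, which is exactly the 20-seconds-stall versus 180-seconds-write split described above.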
Snapshot Consistency
The snapshot must be atomic and consistent across all distributed ranks. Every GPU enters a barrier after completing its optimizer step and all-reduce operations, ensuring that every rank captures state at exactly the same global step counter. During the snapshot, each rank serializes its partition of parameters and optimizer state into a pinned memory buffer, computes checksums for integrity, and records metadata such as tensor shapes and partition boundaries.
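A per-rank snapshot along these lines might look like the sketch below. All names are illustrative, an in-memory `BytesIO` buffer stands in for pinned host memory, and `pickle` stands in for a real tensor serializer; the essential parts are the shared global step, the per-shard checksum, and the recorded metadata.

```python
import hashlib
import io
import pickle

def snapshot_rank(rank, step, params, opt_state):
    # Serialize this rank's partition of parameters and optimizer state
    # into a staging buffer (stand-in for a pinned-memory buffer).
    buf = io.BytesIO()
    pickle.dump({"params": params, "opt_state": opt_state}, buf)
    payload = buf.getvalue()
    return {
        "rank": rank,
        "step": step,  # identical on every rank thanks to the barrier
        "checksum": hashlib.sha256(payload).hexdigest(),
        # Stand-in metadata: record per-tensor sizes (shapes in a real system).
        "shapes": {name: len(values) for name, values in params.items()},
        "payload": payload,
    }
```

The checksum lets the persist phase (and any later restore) verify shard integrity end to end, and the metadata lets a loader reassemble partitions without reading every payload.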
Persist Phase
Persist happens in the background while training continues. Each rank launches async I/O threads that write its snapshot shard to a distributed filesystem or object storage under a unique checkpoint directory. The key to atomicity is writing a manifest file last. This manifest lists all shards with their checksums, sizes, global step, and schema version. Only after every shard is durably written does the system update a lightweight "latest" pointer to reference the new checkpoint.
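The manifest-last ordering can be sketched with a local filesystem standing in for distributed storage. The function names are hypothetical; the load-bearing details are that every shard is fsynced before the manifest is written, and that the "latest" pointer is flipped with an atomic rename only after the manifest is durable.

```python
import hashlib
import json
import os

def write_shard(ckpt_dir, rank, payload):
    # Each rank writes its shard under the unique checkpoint directory.
    name = f"shard_{rank}.bin"
    with open(os.path.join(ckpt_dir, name), "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())               # make the shard durable first
    return {"file": name, "size": len(payload),
            "checksum": hashlib.sha256(payload).hexdigest()}

def finalize(ckpt_dir, step, shard_entries, latest_path):
    # The manifest is written last: its presence implies every shard
    # it lists is already durable.
    manifest = {"step": step, "schema_version": 1, "shards": shard_entries}
    with open(os.path.join(ckpt_dir, "manifest.json"), "w") as f:
        json.dump(manifest, f)
        f.flush()
        os.fsync(f.fileno())
    # Flip the lightweight "latest" pointer via atomic rename.
    tmp = latest_path + ".tmp"
    with open(tmp, "w") as f:
        f.write(ckpt_dir)
    os.replace(tmp, latest_path)
```

A reader that follows the "latest" pointer therefore never observes a half-written checkpoint: either the rename happened and all shards plus the manifest exist, or the previous checkpoint is still referenced.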
Storage Backend Choice
Parallel filesystems like Lustre or GPFS offer high bandwidth (tens to hundreds of GB/s aggregate), but capacity is limited and expensive. Object storage provides effectively unlimited capacity at lower cost but carries per-request overhead. Production systems optimize object storage writes by using multipart uploads with large parts and by writing many shards in parallel.
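The parallel-shard-write pattern can be illustrated as below. `upload_shards` and `put_object` are hypothetical names, and the injected `put_object` callable stands in for a real object-store client call (which in production would itself be a multipart upload with large parts to amortize the per-request overhead).

```python
import concurrent.futures

def upload_shards(shards, put_object, max_workers=8):
    # shards: mapping of object key -> bytes payload.
    # put_object(key, data): hypothetical backend write; a real client would
    # use multipart uploads with large parts for each shard.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(put_object, key, data): key
                   for key, data in shards.items()}
        for fut in concurrent.futures.as_completed(futures):
            fut.result()  # re-raise any upload failure so persist can retry
```

Issuing many shard writes concurrently keeps aggregate throughput high even though any single object-store request has noticeable latency.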