Checkpoint Failure Modes and Atomic Commit Guarantees
Partial Checkpoint Writes
Checkpointing at scale introduces numerous failure modes that can corrupt training state or cause silent data loss. The most common is the partial checkpoint write: a job is preempted or a node crashes mid-checkpoint, leaving some shards written and others missing. Without careful design, the next recovery attempt may load an inconsistent mix of old and new state, causing immediate divergence or subtle accuracy degradation.
Atomic Commit Protocol
The standard atomic commit pattern writes all shards to a temporary staging location, then writes a manifest file listing every shard with its checksum. Only after verifying that all shards exist and their checksums match does the system "commit" by updating a small "latest" pointer, typically via an atomic rename. Recovery logic always reads the latest pointer first and loads only checkpoints with a valid manifest. This two-phase commit ensures that a failure at any point during the write leaves the previous checkpoint intact and discoverable.
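The pattern above can be sketched in plain Python. The function and file names (`commit_checkpoint`, `manifest.json`, `latest`) are illustrative, not a specific framework's API; the key ingredients are fsync-before-manifest ordering and `os.replace`, which is an atomic rename on POSIX filesystems.

```python
import hashlib
import json
import os
import tempfile

def _sha256(path):
    """Hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def commit_checkpoint(root, step, shards):
    """Phase 1: persist shards + manifest to a staging dir.
    Phase 2: commit by atomically swapping the 'latest' pointer.
    `shards` maps shard filename -> bytes."""
    staging = os.path.join(root, f"step_{step}")
    os.makedirs(staging, exist_ok=True)

    # Phase 1: write every shard durably, then a manifest with checksums.
    manifest = {"step": step, "shards": {}}
    for name, payload in shards.items():
        shard_path = os.path.join(staging, name)
        with open(shard_path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        manifest["shards"][name] = _sha256(shard_path)
    with open(os.path.join(staging, "manifest.json"), "w") as f:
        json.dump(manifest, f)
        f.flush()
        os.fsync(f.fileno())

    # Verify all shards exist and match their checksums before commit.
    for name, digest in manifest["shards"].items():
        assert _sha256(os.path.join(staging, name)) == digest

    # Phase 2: os.replace is atomic on POSIX, so a crash here leaves
    # either the old pointer or the new one, never a torn file.
    fd, tmp = tempfile.mkstemp(dir=root)
    with os.fdopen(fd, "w") as f:
        f.write(staging)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, os.path.join(root, "latest"))

def load_latest(root):
    """Recovery: read the pointer, then load only checksum-valid shards."""
    with open(os.path.join(root, "latest")) as f:
        ckpt_dir = f.read()
    with open(os.path.join(ckpt_dir, "manifest.json")) as f:
        manifest = json.load(f)
    out = {}
    for name, digest in manifest["shards"].items():
        path = os.path.join(ckpt_dir, name)
        if _sha256(path) != digest:
            raise IOError(f"corrupt shard {name}")
        with open(path, "rb") as f:
            out[name] = f.read()
    return out
```

A crash before the pointer swap leaves `latest` pointing at the previous checkpoint; a stale staging directory without a committed pointer is simply garbage-collected.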
Distributed Snapshot Consistency
Inconsistent distributed snapshots are another subtle failure mode. If ranks do not quiesce at a global barrier before snapshotting, some may capture state before an all-reduce completes while others capture post-all-reduce values. On restore, gradients will be inconsistent across ranks, causing training to diverge within a few steps. Per-rank RNG state must also be captured, or data shuffling will diverge after resume.
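The quiesce-then-snapshot ordering can be illustrated with a single-process simulation. This is a sketch only: the "ranks" here are threads and the global barrier is a `threading.Barrier`, standing in for a collective barrier over the training communicator (e.g. `torch.distributed.barrier` in PyTorch); `snapshot_all_ranks` and its arguments are hypothetical names.

```python
import random
import threading

def snapshot_all_ranks(num_ranks, rank_states):
    """Every simulated rank reaches the barrier (i.e., all in-flight
    collectives have completed) before ANY rank serializes its state,
    so no snapshot mixes pre- and post-collective values."""
    barrier = threading.Barrier(num_ranks)
    snapshots = [None] * num_ranks

    def worker(rank):
        rng = random.Random(1234 + rank)   # per-rank data-shuffling RNG
        state = rank_states[rank]
        # ... pending all-reduce / optimizer step finishes here ...
        barrier.wait()                     # global quiesce point
        # Capture model state AND RNG state together, post-barrier.
        snapshots[rank] = {"state": dict(state),
                           "rng": rng.getstate()}

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return snapshots
```

The essential property is that the barrier and the RNG capture live inside the same snapshot routine, so neither can be skipped independently.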
Dataloader State
Dataloader and input pipeline state is frequently overlooked. Without saving the exact iterator position or sample index, resuming training will either re-consume already-seen data or skip data entirely, shifting metrics by several percentage points. Production systems save per-rank sample offsets and shuffle RNG state as part of the checkpoint.