
Checkpoint Failure Modes and Atomic Commit Guarantees

Checkpointing at scale introduces numerous failure modes that can corrupt training state or cause silent data loss. The most common is the partial checkpoint write: a job gets preempted or a node crashes mid-checkpoint, leaving some shards written and others missing. Without careful design, the next recovery attempt may load an inconsistent mix of old and new state, causing immediate divergence or a subtle accuracy degradation that surfaces hours later. Atomic commit protocols prevent this by treating each checkpoint as a transaction: either the entire checkpoint is valid and usable, or it is ignored. The standard pattern writes all shards to a temporary staging location (e.g., a directory named by checkpoint step plus a random suffix), then writes a manifest file listing every shard with its checksum, size, and metadata. Only after verifying that all shards exist and their checksums match does the system "commit" the checkpoint by writing or updating a small "latest" pointer (a file, symlink, or versioned object). Recovery logic always reads the latest pointer first and loads only checkpoints with a valid manifest. If a checkpoint directory lacks a manifest, it is incomplete and gets cleaned up by a garbage-collection pass. This two-phase commit ensures that a failure at any point during persistence leaves the previous checkpoint intact and discoverable.

Inconsistent distributed snapshots are another subtle failure mode. If ranks do not quiesce at a global barrier before snapshotting, some may capture state before an all-reduce completes while others capture post-all-reduce values. On restore, gradients will be inconsistent across ranks, causing training to diverge or produce NaNs within a few steps. Mitigation requires a strict protocol: all ranks enter a barrier after optimizer.step() completes, then snapshot at the exact same global step counter. Random number generator (RNG) state must also be captured to avoid data-shuffling divergence; if rank 0 resumes with a different shuffle seed than it had at checkpoint time, it will see different data, breaking reproducibility.

Dataloader and input-pipeline state is frequently overlooked. Without saving the exact iterator position or sample index, resumed training will either re-consume already-seen data (biasing the model toward those examples) or skip data entirely (reducing the effective dataset size). For large-corpus training, this can shift metrics by several percentage points. Production systems save per-rank sample offsets, shard boundaries, and shuffle RNG state as part of the checkpoint. Streaming datasets from object storage require transactional consumption: each checkpoint records the last committed byte offset or sequence number per shard, so a resume can pick up exactly where it left off.

Async persistence introduces resource-contention edge cases. If the background checkpoint write from step N is still in progress when step N+10 triggers the next checkpoint, you risk exhausting host RAM with two concurrent snapshot buffers, or saturating network bandwidth and slowing training throughput. Mitigation strategies include serializing checkpoints (start the next only after the previous commit finishes), throttling bandwidth for checkpoint writes, or dropping a checkpoint cycle if the previous one is still running. OpenAI and NVIDIA clusters monitor checkpoint write latency and alert if persist duration exceeds twice the expected time, indicating storage degradation or network issues that could compromise recovery guarantees.
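To make the two-phase commit concrete, here is a minimal sketch of the staging, manifest, and pointer steps on a POSIX filesystem. It assumes local directory renames and os.replace are atomic (an object store would use versioned objects or conditional writes instead); the commit_checkpoint and load_latest helpers and the manifest layout are illustrative, not any particular framework's API.

```python
import hashlib
import json
import os
import uuid


def _fsync_write(path: str, data: bytes) -> None:
    """Write a file and fsync it so the contents are durable before we commit."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())


def commit_checkpoint(root: str, step: int, shards: dict[str, bytes]) -> str:
    """Two-phase commit: stage all shards, write the manifest last, flip 'latest'."""
    # Phase 1: stage everything under a uniquely named directory
    # (checkpoint step plus a random suffix, as described above).
    staging = os.path.join(root, f"step_{step}.{uuid.uuid4().hex[:8]}.staging")
    os.makedirs(staging, exist_ok=True)

    manifest = {"step": step, "shards": {}}
    for name, payload in shards.items():
        _fsync_write(os.path.join(staging, name), payload)
        manifest["shards"][name] = {
            "sha256": hashlib.sha256(payload).hexdigest(),
            "bytes": len(payload),
        }

    # The manifest is written last; its presence marks the checkpoint complete.
    _fsync_write(os.path.join(staging, "manifest.json"),
                 json.dumps(manifest).encode())

    # Phase 2: commit. Rename the staging dir, then atomically update 'latest'.
    final_dir = os.path.join(root, f"step_{step}")
    os.replace(staging, final_dir)
    pointer_tmp = os.path.join(root, "latest.tmp")
    _fsync_write(pointer_tmp, final_dir.encode())
    os.replace(pointer_tmp, os.path.join(root, "latest"))  # atomic rename on POSIX
    return final_dir


def load_latest(root: str) -> dict | None:
    """Recovery: follow the 'latest' pointer, refuse anything without a valid manifest."""
    try:
        with open(os.path.join(root, "latest")) as f:
            ckpt_dir = f.read().strip()
        with open(os.path.join(ckpt_dir, "manifest.json")) as f:
            manifest = json.load(f)
    except FileNotFoundError:
        return None  # no committed checkpoint yet
    for name, meta in manifest["shards"].items():
        with open(os.path.join(ckpt_dir, name), "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != meta["sha256"]:
                raise RuntimeError(f"checksum mismatch in {name}; refusing to load")
    return manifest
```

A directory that was only partially staged never gains a manifest and never becomes the target of the latest pointer, so recovery simply ignores it until garbage collection removes it.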
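The barrier-then-snapshot protocol can be sketched as follows, assuming PyTorch with torch.distributed already initialized. The dataloader_state argument is a placeholder for whatever per-rank iterator bookkeeping the pipeline maintains (a sketch of that follows below).

```python
import random

import numpy as np
import torch
import torch.distributed as dist


def snapshot_training_state(model, optimizer, global_step, dataloader_state):
    """Capture a rank-consistent snapshot after optimizer.step() has completed."""
    # Quiesce: every rank must reach this point at the same global step before
    # reading parameters, so no rank captures mid-all-reduce values.
    dist.barrier()

    return {
        "step": global_step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        # Per-rank RNG state, so data shuffling and dropout replay identically.
        "rng": {
            "python": random.getstate(),
            "numpy": np.random.get_state(),
            "torch_cpu": torch.get_rng_state(),
            "torch_cuda": torch.cuda.get_rng_state_all(),
        },
        # Hypothetical per-rank dataloader bookkeeping: sample offset,
        # shard boundaries, and the shuffle seed in use at this step.
        "dataloader": dataloader_state,
    }


def restore_training_state(model, optimizer, state):
    """Restore parameters, optimizer moments, and RNG streams on every rank."""
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    random.setstate(state["rng"]["python"])
    np.random.set_state(state["rng"]["numpy"])
    torch.set_rng_state(state["rng"]["torch_cpu"])
    torch.cuda.set_rng_state_all(state["rng"]["torch_cuda"])
    return state["step"]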
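For the dataloader state itself, a hypothetical resumable iterator illustrates the per-rank bookkeeping: which shard is being read, how many samples have been consumed from it, and the shuffle seed. The class name and the _read_shard hook are invented for this sketch; production frameworks ship their own resumable dataloaders with equivalent state_dict/load_state_dict hooks.

```python
import torch


class ResumableShardIterator:
    """Hypothetical per-rank iterator that records exactly how far it has read,
    so a resume neither re-consumes nor skips data (names are illustrative)."""

    def __init__(self, shard_paths, rank, world_size, shuffle_seed):
        self.shards = shard_paths[rank::world_size]  # static shard-to-rank assignment
        self.shuffle_seed = shuffle_seed
        self.shard_idx = 0      # which of this rank's shards is being read
        self.sample_offset = 0  # samples already consumed within that shard

    def state_dict(self):
        # Everything needed to reproduce the iterator position exactly;
        # saved alongside model and optimizer state in the checkpoint.
        return {"shard_idx": self.shard_idx,
                "sample_offset": self.sample_offset,
                "shuffle_seed": self.shuffle_seed}

    def load_state_dict(self, state):
        self.shard_idx = state["shard_idx"]
        self.sample_offset = state["sample_offset"]
        self.shuffle_seed = state["shuffle_seed"]

    def __iter__(self):
        while self.shard_idx < len(self.shards):
            samples = self._read_shard(self.shards[self.shard_idx])
            # Per-shard seeding keeps the permutation identical regardless of
            # where the iterator was resumed.
            g = torch.Generator().manual_seed(self.shuffle_seed + self.shard_idx)
            order = torch.randperm(len(samples), generator=g).tolist()
            for i in order[self.sample_offset:]:
                self.sample_offset += 1
                yield samples[i]
            self.shard_idx += 1
            self.sample_offset = 0

    def _read_shard(self, path):
        # Placeholder: read one shard (e.g., a Parquet or WebDataset file).
        raise NotImplementedError
```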
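Finally, a minimal guard for the async-persistence contention problem, assuming a single background writer thread: the next checkpoint is dropped (it could instead block) if the previous persist has not committed yet. persist_fn stands in for whatever durable-write routine is in use, e.g. the commit_checkpoint sketch above.

```python
from concurrent.futures import ThreadPoolExecutor


class SerializedCheckpointer:
    """Allow at most one in-flight async persist; drop a cycle rather than
    stacking two host-RAM snapshot buffers or saturating the storage network."""

    def __init__(self, persist_fn):
        self._persist_fn = persist_fn                     # e.g. commit_checkpoint
        self._writer = ThreadPoolExecutor(max_workers=1)  # single background writer
        self._pending = None                              # future of the in-flight persist

    def maybe_checkpoint(self, snapshot, step):
        if self._pending is not None and not self._pending.done():
            print(f"step {step}: previous checkpoint still persisting, skipping this cycle")
            return False
        if self._pending is not None:
            self._pending.result()  # surface any exception from the previous persist
        self._pending = self._writer.submit(self._persist_fn, snapshot, step)
        return True

    def finalize(self):
        # Drain the last write before the job exits.
        if self._pending is not None:
            self._pending.result()
        self._writer.shutdown(wait=True)
```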
💡 Key Takeaways
Atomic commit protocol: write all shards to a temp location with checksums, write manifest.json last, commit by renaming the directory and updating the latest pointer; incomplete checkpoints without a manifest are ignored
Inconsistent snapshots from a missing global barrier cause gradient divergence and NaNs within a few steps; mitigation requires all ranks to snapshot at the same global step after optimizer.step() completes, including RNG state
Dataloader state loss causes data re-consumption or skipping, shifting metrics by several percent; save per-rank sample offsets, shard boundaries, and the shuffle seed to enable exact resume
Async persistence contention: overlapping checkpoint writes can exhaust host RAM (e.g., two 600 GB snapshot buffers) or saturate a 100 Gbps network, slowing training; serialize checkpoints or throttle bandwidth
Checksum validation on restore detects silent corruption from bitflips or storage bugs; Meta FSDP and NVIDIA NeMo compute a SHA-256 checksum per shard and verify it before loading any state
📌 Examples
Google TPU training: checkpoint write interrupted at shard 412 of 512; no manifest was written, so recovery skips the incomplete directory and loads the previous valid checkpoint from 30 minutes earlier
Meta OPT training divergence: ranks did not barrier before the snapshot, so some captured pre-all-reduce gradients; after resume, loss spiked from 2.4 to 15.2 within 50 steps, caught by an automated loss anomaly detector
OpenAI training on spot instances: the dataloader did not save iterator state; after a preemption, the resumed job re-consumed the first 5% of the dataset, causing a 0.3-point perplexity regression discovered in post-training analysis