
Checkpoint Failure Modes and Atomic Commit Guarantees

Checkpointing at scale introduces numerous failure modes that can corrupt training state or cause silent data loss. The most common is the partial checkpoint write: a job gets preempted or a node crashes mid-checkpoint, leaving some shards written and others missing. Without careful design, the next recovery attempt may load an inconsistent mix of old and new state, causing immediate divergence or a subtle accuracy degradation that surfaces hours later. Atomic commit protocols prevent this by treating each checkpoint as a transaction: either the entire checkpoint is valid and usable, or it is ignored. The standard pattern writes all shards to a temporary staging location (e.g., a directory named by checkpoint step plus a random suffix), then writes a manifest file listing every shard with its checksum, size, and metadata. Only after verifying that all shards exist and their checksums match does the system "commit" the checkpoint by writing or updating a small "latest" pointer (a file, symlink, or versioned object). Recovery logic always reads the latest pointer first and loads only checkpoints with a valid manifest. If a checkpoint directory lacks a manifest, it is incomplete and gets cleaned up by a garbage-collection pass. This two-phase commit ensures that a failure at any point during persistence leaves the previous checkpoint intact and discoverable.

Inconsistent distributed snapshots are another subtle failure mode. If ranks do not quiesce at a global barrier before snapshotting, some may capture state before an all-reduce completes while others capture post-all-reduce values. On restore, gradients will be inconsistent across ranks, causing training to diverge or produce NaNs within a few steps. Mitigation requires a strict protocol: all ranks enter a barrier after optimizer.step() completes, then snapshot at the exact same global step counter. Random number generator (RNG) state must also be captured to avoid data-shuffling divergence; if rank 0 resumes with a different shuffle seed than it had at checkpoint time, it will see different data, breaking reproducibility.

Dataloader and input-pipeline state is frequently overlooked. Without saving the exact iterator position or sample index, resumed training will either re-consume already-seen data (biasing the model toward those examples) or skip data entirely (reducing the effective dataset size). For large-corpus training, this can shift metrics by several percentage points. Production systems save per-rank sample offsets, shard boundaries, and shuffle RNG state as part of the checkpoint. Streaming datasets from object storage require transactional consumption: each checkpoint records the last committed byte offset or sequence number per shard, so a resume can pick up exactly where it left off.

Async persistence introduces resource-contention edge cases. If the background checkpoint write from step N is still in progress when step N+10 triggers the next checkpoint, you risk exhausting host RAM with two concurrent snapshot buffers, or saturating network bandwidth and slowing training throughput. Mitigation strategies include serializing checkpoints (start the next only after the previous commit finishes), throttling bandwidth for checkpoint writes, or dropping a checkpoint cycle if the previous one is still running. OpenAI and NVIDIA clusters monitor checkpoint write latency and alert if persist duration exceeds twice the expected time, indicating storage degradation or network issues that could compromise recovery guarantees.
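To make the two-phase commit concrete, here is a minimal sketch of the staging, manifest, and pointer steps on a POSIX filesystem. It assumes local directory renames and os.replace are atomic (an object store would use versioned objects or conditional writes instead); the commit_checkpoint and load_latest helpers and the manifest layout are illustrative, not any particular framework's API.

```python
import hashlib
import json
import os
import uuid


def _fsync_write(path: str, data: bytes) -> None:
    """Write a file and fsync it so the contents are durable before we commit."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())


def commit_checkpoint(root: str, step: int, shards: dict[str, bytes]) -> str:
    """Two-phase commit: stage all shards, write the manifest last, flip 'latest'."""
    # Phase 1: stage everything under a uniquely named directory
    # (checkpoint step plus a random suffix, as described above).
    staging = os.path.join(root, f"step_{step}.{uuid.uuid4().hex[:8]}.staging")
    os.makedirs(staging, exist_ok=True)

    manifest = {"step": step, "shards": {}}
    for name, payload in shards.items():
        _fsync_write(os.path.join(staging, name), payload)
        manifest["shards"][name] = {
            "sha256": hashlib.sha256(payload).hexdigest(),
            "bytes": len(payload),
        }

    # The manifest is written last; its presence marks the checkpoint complete.
    _fsync_write(os.path.join(staging, "manifest.json"),
                 json.dumps(manifest).encode())

    # Phase 2: commit. Rename the staging dir, then atomically update 'latest'.
    final_dir = os.path.join(root, f"step_{step}")
    os.replace(staging, final_dir)
    pointer_tmp = os.path.join(root, "latest.tmp")
    _fsync_write(pointer_tmp, final_dir.encode())
    os.replace(pointer_tmp, os.path.join(root, "latest"))  # atomic rename on POSIX
    return final_dir


def load_latest(root: str) -> dict | None:
    """Recovery: follow the 'latest' pointer, refuse anything without a valid manifest."""
    try:
        with open(os.path.join(root, "latest")) as f:
            ckpt_dir = f.read().strip()
        with open(os.path.join(ckpt_dir, "manifest.json")) as f:
            manifest = json.load(f)
    except FileNotFoundError:
        return None  # no committed checkpoint yet
    for name, meta in manifest["shards"].items():
        with open(os.path.join(ckpt_dir, name), "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != meta["sha256"]:
                raise RuntimeError(f"checksum mismatch in {name}; refusing to load")
    return manifest
```

A directory that was only partially staged never gains a manifest and never becomes the target of the latest pointer, so recovery simply ignores it until garbage collection removes it.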
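The barrier-then-snapshot protocol can be sketched as follows, assuming PyTorch with torch.distributed already initialized. The dataloader_state argument is a placeholder for whatever per-rank iterator bookkeeping the pipeline maintains (a sketch of that follows below).

```python
import random

import numpy as np
import torch
import torch.distributed as dist


def snapshot_training_state(model, optimizer, global_step, dataloader_state):
    """Capture a rank-consistent snapshot after optimizer.step() has completed."""
    # Quiesce: every rank must reach this point at the same global step before
    # reading parameters, so no rank captures mid-all-reduce values.
    dist.barrier()

    return {
        "step": global_step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        # Per-rank RNG state, so data shuffling and dropout replay identically.
        "rng": {
            "python": random.getstate(),
            "numpy": np.random.get_state(),
            "torch_cpu": torch.get_rng_state(),
            "torch_cuda": torch.cuda.get_rng_state_all(),
        },
        # Hypothetical per-rank dataloader bookkeeping: sample offset,
        # shard boundaries, and the shuffle seed in use at this step.
        "dataloader": dataloader_state,
    }


def restore_training_state(model, optimizer, state):
    """Restore parameters, optimizer moments, and RNG streams on every rank."""
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    random.setstate(state["rng"]["python"])
    np.random.set_state(state["rng"]["numpy"])
    torch.set_rng_state(state["rng"]["torch_cpu"])
    torch.cuda.set_rng_state_all(state["rng"]["torch_cuda"])
    return state["step"]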
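For the dataloader state itself, a hypothetical resumable iterator illustrates the per-rank bookkeeping: which shard is being read, how many samples have been consumed from it, and the shuffle seed. The class name and the _read_shard hook are invented for this sketch; production frameworks ship their own resumable dataloaders with equivalent state_dict/load_state_dict hooks.

```python
import torch


class ResumableShardIterator:
    """Hypothetical per-rank iterator that records exactly how far it has read,
    so a resume neither re-consumes nor skips data (names are illustrative)."""

    def __init__(self, shard_paths, rank, world_size, shuffle_seed):
        self.shards = shard_paths[rank::world_size]  # static shard-to-rank assignment
        self.shuffle_seed = shuffle_seed
        self.shard_idx = 0      # which of this rank's shards is being read
        self.sample_offset = 0  # samples already consumed within that shard

    def state_dict(self):
        # Everything needed to reproduce the iterator position exactly;
        # saved alongside model and optimizer state in the checkpoint.
        return {"shard_idx": self.shard_idx,
                "sample_offset": self.sample_offset,
                "shuffle_seed": self.shuffle_seed}

    def load_state_dict(self, state):
        self.shard_idx = state["shard_idx"]
        self.sample_offset = state["sample_offset"]
        self.shuffle_seed = state["shuffle_seed"]

    def __iter__(self):
        while self.shard_idx < len(self.shards):
            samples = self._read_shard(self.shards[self.shard_idx])
            # Per-shard seeding keeps the permutation identical regardless of
            # where the iterator was resumed.
            g = torch.Generator().manual_seed(self.shuffle_seed + self.shard_idx)
            order = torch.randperm(len(samples), generator=g).tolist()
            for i in order[self.sample_offset:]:
                self.sample_offset += 1
                yield samples[i]
            self.shard_idx += 1
            self.sample_offset = 0

    def _read_shard(self, path):
        # Placeholder: read one shard (e.g., a Parquet or WebDataset file).
        raise NotImplementedError
```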
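Finally, a minimal guard for the async-persistence contention problem, assuming a single background writer thread: the next checkpoint is dropped (it could instead block) if the previous persist has not committed yet. persist_fn stands in for whatever durable-write routine is in use, e.g. the commit_checkpoint sketch above.

```python
from concurrent.futures import ThreadPoolExecutor


class SerializedCheckpointer:
    """Allow at most one in-flight async persist; drop a cycle rather than
    stacking two host-RAM snapshot buffers or saturating the storage network."""

    def __init__(self, persist_fn):
        self._persist_fn = persist_fn                     # e.g. commit_checkpoint
        self._writer = ThreadPoolExecutor(max_workers=1)  # single background writer
        self._pending = None                              # future of the in-flight persist

    def maybe_checkpoint(self, snapshot, step):
        if self._pending is not None and not self._pending.done():
            print(f"step {step}: previous checkpoint still persisting, skipping this cycle")
            return False
        if self._pending is not None:
            self._pending.result()  # surface any exception from the previous persist
        self._pending = self._writer.submit(self._persist_fn, snapshot, step)
        return True

    def finalize(self):
        # Drain the last write before the job exits.
        if self._pending is not None:
            self._pending.result()
        self._writer.shutdown(wait=True)
```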
💡 Key Takeaways
Atomic commit protocol: write all shards to a temp location with checksums, write manifest.json last, commit by renaming the directory and updating the latest pointer; incomplete checkpoints without a manifest are ignored
Inconsistent snapshots from a missing global barrier cause gradient divergence and NaNs within a few steps; mitigation requires all ranks to snapshot at the same global step after optimizer.step() completes, including RNG state
Dataloader state loss causes data re-consumption or skipping, shifting metrics by several percent; save per-rank sample offsets, shard boundaries, and the shuffle seed to enable exact resume
Async persistence contention: overlapping checkpoint writes can exhaust host RAM (e.g., two 600 GB snapshot buffers) or saturate a 100 Gbps network, slowing training; serialize checkpoints or throttle bandwidth
Checksum validation on restore detects silent corruption from bitflips or storage bugs; Meta FSDP and NVIDIA NeMo compute a SHA-256 checksum per shard and verify it before loading any state
📌 Examples
Google TPU training: checkpoint write interrupted at shard 412 of 512; no manifest was written, so recovery skips the incomplete directory and loads the previous valid checkpoint from 30 minutes earlier
Meta OPT training divergence: ranks did not barrier before the snapshot, so some captured pre-all-reduce gradients; after resume, loss spiked from 2.4 to 15.2 within 50 steps, caught by an automated loss anomaly detector
OpenAI training on spot instances: the dataloader did not save iterator state; after a preemption, the resumed job re-consumed the first 5% of the dataset, causing a 0.3-point perplexity regression discovered in post-training analysis