
Checkpointing and Fault Tolerance for Interruptible Workloads

Checkpointing: Periodically saving training state (model weights, optimizer state, epoch progress) to durable storage so training can resume after interruption. Without checkpointing, a spot instance termination means losing all training progress, potentially hours of GPU time.

Checkpoint Frequency Trade-offs

Frequent checkpoints (every 5 minutes) minimize lost work but add overhead: saving a 10GB model to cloud storage takes time and network bandwidth. Infrequent checkpoints (every hour) minimize overhead but risk losing significant progress on interruption. The optimal frequency depends on: interruption probability (higher probability favors more frequent saves), checkpoint size (larger models take longer to save), and training cost (expensive GPUs raise the cost of lost work). A common starting point: checkpoint every 15-30 minutes for training jobs.
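One way to turn this trade-off into a number is the Young-Daly approximation from the fault-tolerance literature, which balances checkpoint overhead against expected lost work. The sketch below assumes you can estimate the checkpoint save time and the mean time between interruptions (MTBF); the function name is illustrative, not from any particular library.

```python
import math

def optimal_checkpoint_interval(save_time_s: float, mtbf_s: float) -> float:
    """Young-Daly approximation: interval ~ sqrt(2 * save_time * MTBF)."""
    return math.sqrt(2 * save_time_s * mtbf_s)

# Example: checkpoint takes 60 s to write; spot interruptions average one per 6 h.
interval_s = optimal_checkpoint_interval(60, 6 * 3600)
print(f"checkpoint every ~{interval_s / 60:.0f} minutes")
```

With these example numbers the formula lands at roughly 27 minutes, consistent with the 15-30 minute rule of thumb above.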

What to Checkpoint

Model weights: The trained parameters. Required to resume training.
Optimizer state: Momentum and adaptive learning-rate accumulators. Without this, the optimizer restarts cold and training may diverge.
Training state: Current epoch, batch index, and learning-rate schedule position. Enables resuming exactly where you stopped.
Random state: RNG states for reproducibility. Ensures resumed training matches what would have happened without interruption.
Missing any component degrades resume quality.

Interruption Handling

Cloud providers give a short warning before spot termination (two minutes on AWS; roughly 30 seconds on GCP and Azure). Use this time to: save a final checkpoint (even if not scheduled), gracefully stop training (finish the current batch), and clean up resources. Implement termination notice handlers that trigger an emergency checkpoint on the warning signal. For distributed training, coordinate the checkpoint across all workers—a partial checkpoint is useless if some workers saved and others did not.
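In containerized setups the warning often arrives as a SIGTERM; on bare EC2 you would instead poll the instance metadata endpoint for a spot interruption notice. A minimal sketch of the SIGTERM path, with a flag the training loop checks at batch boundaries (function and variable names are illustrative):

```python
import signal
import threading

# Set by the signal handler; checked by the training loop at batch boundaries.
terminate_requested = threading.Event()

def handle_termination(signum, frame):
    # Only set a flag here: signal handlers should do minimal work, and
    # checkpoint I/O belongs in the training loop, not the handler.
    terminate_requested.set()

signal.signal(signal.SIGTERM, handle_termination)

def training_loop(num_batches, save_checkpoint):
    for batch_idx in range(num_batches):
        # ... forward/backward/optimizer step for this batch ...
        if terminate_requested.is_set():
            save_checkpoint(batch_idx)  # emergency checkpoint
            return batch_idx            # stop gracefully after this batch
    return num_batches
```

Checking the flag between batches (rather than aborting mid-batch) keeps the saved state consistent: the checkpoint always reflects a completed step.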

Resume Verification: After implementing checkpointing, verify that it works: intentionally kill a training job, resume from the checkpoint, and confirm the loss curve continues smoothly without regression.
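The same check can be demonstrated with a toy stand-in for training, where the "loss" is a deterministic function of the step and the RNG stream. Here an uninterrupted run and a kill-and-resume run (restoring the saved RNG state) should produce identical loss curves; everything below is an illustrative simulation, not real training code.

```python
import random

def train(start_step, num_steps, rng_state=None):
    """Toy training loop: 'loss' decays with step plus seeded noise."""
    if rng_state is not None:
        random.setstate(rng_state)  # resume the RNG stream from the checkpoint
    losses = []
    for step in range(start_step, start_step + num_steps):
        losses.append(1.0 / (step + 1) + random.random() * 0.01)
    return losses, random.getstate()

# Uninterrupted reference run.
random.seed(42)
full_losses, _ = train(0, 10)

# Interrupted run: "kill" after 5 steps, checkpoint the RNG state, resume.
random.seed(42)
first_half, ckpt_rng = train(0, 5)
second_half, _ = train(5, 5, rng_state=ckpt_rng)

# The resumed curve continues the original one exactly.
assert first_half + second_half == full_losses
```

In a real job the comparison is looser (data-loader ordering and GPU nondeterminism add noise), but the loss curve should still continue from where it stopped rather than spiking or resetting.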

💡 Key Takeaways
Checkpoint model weights, optimizer state, training state, and random seeds
Optimal frequency: checkpoint every 15-30 minutes for typical training jobs
Use 2-minute termination warning to save emergency checkpoint
📌 Interview Tips
A 10GB model checkpoint to cloud storage takes significant time and bandwidth
Test checkpointing: kill a job intentionally, resume, verify the loss curve continues