What Is Model Checkpointing and Why It Matters at Scale
Model checkpointing is the periodic capture of complete training state so you can resume a job with minimal lost progress. Think of it as hitting save in a video game. For a large language model, this means preserving model parameters, optimizer state (momentum and variance buffers for Adam), learning rate scheduler position, global step counters, random number generator states, mixed precision scaler state, and even the exact position in your data pipeline. Without checkpoints, a hardware failure on day 3 of a 7-day training run means starting over from scratch.
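The sketch below shows what "complete training state" typically looks like in PyTorch. It is a minimal single-process illustration, assuming hypothetical `model`, `optimizer`, `scheduler`, `scaler`, and `dataloader_state` objects from your training loop; production systems write per-rank shards instead of one file.

```python
import os
import torch

def save_checkpoint(path, model, optimizer, scheduler, scaler, global_step, dataloader_state):
    """Capture everything needed to resume training exactly where it stopped."""
    state = {
        "model": model.state_dict(),                 # parameters
        "optimizer": optimizer.state_dict(),         # Adam momentum/variance buffers
        "scheduler": scheduler.state_dict(),         # learning-rate schedule position
        "scaler": scaler.state_dict(),               # mixed-precision loss scaler
        "global_step": global_step,                  # step counter
        "rng_cpu": torch.get_rng_state(),            # CPU RNG state
        "rng_cuda": torch.cuda.get_rng_state_all(),  # per-GPU RNG states
        "dataloader": dataloader_state,              # e.g. shard index and sample offset
    }
    tmp_path = path + ".tmp"
    torch.save(state, tmp_path)
    os.replace(tmp_path, path)  # atomic rename: never leave a half-written checkpoint
```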
At production scale, the stakes are enormous. A 70 billion parameter model with Adam optimizer generates checkpoints around 600 GB to 1.1 TB. A trillion parameter model checkpoint hits roughly 18 TB (2 bytes per parameter for weights in bfloat16, plus roughly 16 bytes per parameter of mixed-precision optimizer state: fp32 master weights along with Adam's first and second moment estimates and related buffers). Training these models costs hundreds of thousands of dollars in GPU time, so losing even an hour of progress is financially significant. Google, Meta, and OpenAI all design their training infrastructure around the assumption that failures will occur multiple times per week on large clusters.
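As a back-of-the-envelope check on those figures, the snippet below just multiplies parameter count by bytes per parameter; treating the optimizer as 8 versus 16 bytes per parameter is an assumption about whether fp32 master weights are stored alongside Adam's moments.

```python
def checkpoint_bytes(n_params, weight_bytes=2, optim_bytes=16):
    """Rough checkpoint size from bytes-per-parameter accounting."""
    return n_params * (weight_bytes + optim_bytes)

# 70B model, bf16 weights + fp32 Adam moments only (2 + 8 bytes/param) -> ~0.7 TB
print(checkpoint_bytes(70e9, optim_bytes=8) / 1e12)
# 1T model, bf16 weights + fp32 master weights and moments (2 + 16 bytes/param) -> ~18 TB
print(checkpoint_bytes(1e12, optim_bytes=16) / 1e12)
```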
The fundamental trade-off is Recovery Point Objective (RPO) versus overhead. RPO is the maximum compute you are willing to lose, measured as the time since your last checkpoint. If you checkpoint every 30 minutes and fail, you lose at most 30 minutes of work. But checkpointing too frequently creates overhead: writing an 18 TB checkpoint at 100 GB/s aggregate throughput takes 180 seconds, during which your expensive GPUs may be underutilized. The optimal checkpoint interval approximately follows the Young/Daly rule sqrt(2 × MTBF × checkpoint write time), where Mean Time Between Failures (MTBF) is how often your cluster experiences interruptions.
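A quick way to turn that rule of thumb into a number is shown below; the 6-hour MTBF and 120-second write time are illustrative values, not measurements.

```python
import math

def optimal_checkpoint_interval_s(mtbf_s, write_s):
    """Young/Daly approximation: checkpoint every sqrt(2 * MTBF * write cost) seconds."""
    return math.sqrt(2 * mtbf_s * write_s)

# 6-hour MTBF, 120 s checkpoint write -> roughly 38 minutes between checkpoints
print(optimal_checkpoint_interval_s(6 * 3600, 120) / 60)
```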
Recovery Time Objective (RTO) measures how quickly you can resume after failure: reading checkpoint shards from storage, restoring state across hundreds of GPUs, reinitializing communication collectives, and warming caches. Production systems aim for RTO under 10 minutes for models up to 100B parameters. Modern approaches use sharded checkpoints where each GPU writes only its partition, cutting per-node I/O from terabytes to gigabytes and enabling parallel writes that saturate high-throughput storage systems.
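Below is a minimal sketch of a sharded save with a manifest committed last. It assumes each rank's `model` and `optimizer` already return rank-local (sharded) state dicts, e.g. via FSDP's sharded state-dict mode; the directory layout and file names are hypothetical.

```python
import os
import torch
import torch.distributed as dist

def save_sharded_checkpoint(ckpt_dir, step, model, optimizer):
    """Each rank persists only its own shard; rank 0 commits a manifest last."""
    rank, world = dist.get_rank(), dist.get_world_size()
    step_dir = os.path.join(ckpt_dir, f"step_{step}")
    os.makedirs(step_dir, exist_ok=True)

    # Every rank writes its own partition in parallel (gigabytes instead of terabytes).
    shard_path = os.path.join(step_dir, f"shard_{rank:05d}.pt")
    torch.save({"model": model.state_dict(), "optimizer": optimizer.state_dict()},
               shard_path + ".tmp")
    os.replace(shard_path + ".tmp", shard_path)

    # Only after every shard is durable does rank 0 commit the manifest atomically,
    # so a crash mid-write leaves a directory that recovery will ignore.
    dist.barrier()
    if rank == 0:
        manifest_path = os.path.join(step_dir, "manifest.pt")
        torch.save({"step": step, "world_size": world,
                    "shards": [f"shard_{r:05d}.pt" for r in range(world)]},
                   manifest_path + ".tmp")
        os.replace(manifest_path + ".tmp", manifest_path)
```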
💡 Key Takeaways
•Complete training state includes model parameters (2 TB for 1T params in bf16), optimizer state (roughly 16 bytes per param for mixed-precision Adam, totaling 16 TB), schedulers, RNG states, and data pipeline position
•Recovery Point Objective (RPO) of 30 minutes means losing at most 30 minutes of compute on failure; Recovery Time Objective (RTO) under 10 minutes enables fast resume on production clusters
•Optimal checkpoint frequency follows the Young/Daly rule sqrt(2 × MTBF × write time): with a 6 hour MTBF and 120 second writes, checkpoint roughly every 38 minutes to balance overhead and lost work
•Sharded checkpoints split an 18 TB checkpoint across hundreds of GPUs, each writing only tens of GB; at 100 GB/s aggregate throughput, the checkpoint persists in 180 seconds instead of hours through a single writer
•Meta's Fully Sharded Data Parallel (FSDP) and NVIDIA's Megatron save world-size-agnostic checkpoints, allowing restore on different GPU counts (e.g., trained on 256 GPUs, resume on 128)
📌 Examples
70B parameter model: 140 GB weights (bf16) + 560 GB optimizer (Adam fp32) + 20 GB scheduler/misc = 720 GB total checkpoint, written every 30 minutes on preemptible cloud GPUs
OpenAI GPT-scale training: a trillion parameter model produces 18 TB checkpoints saved to object storage in parallel from 1024 GPUs, each writing an ~18 GB shard in about 3 minutes at 100 MB/s per node
Google TPU pods checkpoint every 15 minutes to handle preemption rates of multiple interruptions per day; atomic manifest written last ensures partial writes are ignored on recovery
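As a sketch of how a manifest-last protocol protects recovery, the helper below (reusing the hypothetical `step_<n>/manifest.pt` layout from the earlier sharded-save sketch) resumes only from the newest checkpoint whose manifest was committed, so partially written directories are skipped.

```python
import os

def latest_complete_checkpoint(ckpt_dir):
    """Return the newest checkpoint directory whose manifest exists; ignore partial writes."""
    complete = []
    for name in os.listdir(ckpt_dir):
        step_dir = os.path.join(ckpt_dir, name)
        if name.startswith("step_") and os.path.exists(os.path.join(step_dir, "manifest.pt")):
            complete.append((int(name.split("_")[1]), step_dir))
    return max(complete)[1] if complete else None
```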