What Is Model Checkpointing and Why It Matters at Scale
Model checkpointing is the periodic capture of complete training state so you can resume a job with minimal lost progress. Think of it as hitting save in a video game. For a large language model, this means preserving model parameters, optimizer state (momentum and variance buffers for Adam), learning rate scheduler position, global step counters, random number generator states, mixed precision scaler state, and even the exact position in your data pipeline. Without checkpoints, a hardware failure on day 3 of a 7-day training run means starting over from scratch.
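The sketch below shows what "complete training state" typically looks like in PyTorch. It is a minimal single-process illustration, assuming hypothetical `model`, `optimizer`, `scheduler`, `scaler`, and `dataloader_state` objects from your training loop; production systems write per-rank shards instead of one file.

```python
import os
import torch

def save_checkpoint(path, model, optimizer, scheduler, scaler, global_step, dataloader_state):
    """Capture everything needed to resume training exactly where it stopped."""
    state = {
        "model": model.state_dict(),                 # parameters
        "optimizer": optimizer.state_dict(),         # Adam momentum/variance buffers
        "scheduler": scheduler.state_dict(),         # learning-rate schedule position
        "scaler": scaler.state_dict(),               # mixed-precision loss scaler
        "global_step": global_step,                  # step counter
        "rng_cpu": torch.get_rng_state(),            # CPU RNG state
        "rng_cuda": torch.cuda.get_rng_state_all(),  # per-GPU RNG states
        "dataloader": dataloader_state,              # e.g. shard index and sample offset
    }
    tmp_path = path + ".tmp"
    torch.save(state, tmp_path)
    os.replace(tmp_path, path)  # atomic rename: never leave a half-written checkpoint
```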
At production scale, the stakes are enormous. A 70 billion parameter model with Adam optimizer generates checkpoints around 600 GB to 1.1 TB. A trillion parameter model checkpoint hits roughly 18 TB (2 bytes per parameter for weights in bfloat16, plus roughly 16 bytes per parameter of mixed-precision optimizer state: fp32 master weights along with Adam's first and second moment estimates and related buffers). Training these models costs hundreds of thousands of dollars in GPU time, so losing even an hour of progress is financially significant. Google, Meta, and OpenAI all design their training infrastructure around the assumption that failures will occur multiple times per week on large clusters.
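As a back-of-the-envelope check on those figures, the snippet below just multiplies parameter count by bytes per parameter; treating the optimizer as 8 versus 16 bytes per parameter is an assumption about whether fp32 master weights are stored alongside Adam's moments.

```python
def checkpoint_bytes(n_params, weight_bytes=2, optim_bytes=16):
    """Rough checkpoint size from bytes-per-parameter accounting."""
    return n_params * (weight_bytes + optim_bytes)

# 70B model, bf16 weights + fp32 Adam moments only (2 + 8 bytes/param) -> ~0.7 TB
print(checkpoint_bytes(70e9, optim_bytes=8) / 1e12)
# 1T model, bf16 weights + fp32 master weights and moments (2 + 16 bytes/param) -> ~18 TB
print(checkpoint_bytes(1e12, optim_bytes=16) / 1e12)
```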
The fundamental trade-off is Recovery Point Objective (RPO) versus overhead. RPO is the maximum compute you are willing to lose, measured as the time since your last checkpoint. If you checkpoint every 30 minutes and fail, you lose at most 30 minutes of work. But checkpointing too frequently creates overhead: writing an 18 TB checkpoint at 100 GB/s aggregate throughput takes 180 seconds, during which your expensive GPUs may be underutilized. The optimal checkpoint interval approximately follows the Young/Daly rule sqrt(2 × MTBF × checkpoint write time), where Mean Time Between Failures (MTBF) is how often your cluster experiences interruptions.
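A quick way to turn that rule of thumb into a number is shown below; the 6-hour MTBF and 120-second write time are illustrative values, not measurements.

```python
import math

def optimal_checkpoint_interval_s(mtbf_s, write_s):
    """Young/Daly approximation: checkpoint every sqrt(2 * MTBF * write cost) seconds."""
    return math.sqrt(2 * mtbf_s * write_s)

# 6-hour MTBF, 120 s checkpoint write -> roughly 38 minutes between checkpoints
print(optimal_checkpoint_interval_s(6 * 3600, 120) / 60)
```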
Recovery Time Objective (RTO) measures how quickly you can resume after failure: reading checkpoint shards from storage, restoring state across hundreds of GPUs, reinitializing communication collectives, and warming caches. Production systems aim for RTO under 10 minutes for models up to 100B parameters. Modern approaches use sharded checkpoints where each GPU writes only its partition, cutting per-node I/O from terabytes to gigabytes and enabling parallel writes that saturate high-throughput storage systems.
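Below is a minimal sketch of a sharded save with a manifest committed last. It assumes each rank's `model` and `optimizer` already return rank-local (sharded) state dicts, e.g. via FSDP's sharded state-dict mode; the directory layout and file names are hypothetical.

```python
import os
import torch
import torch.distributed as dist

def save_sharded_checkpoint(ckpt_dir, step, model, optimizer):
    """Each rank persists only its own shard; rank 0 commits a manifest last."""
    rank, world = dist.get_rank(), dist.get_world_size()
    step_dir = os.path.join(ckpt_dir, f"step_{step}")
    os.makedirs(step_dir, exist_ok=True)

    # Every rank writes its own partition in parallel (gigabytes instead of terabytes).
    shard_path = os.path.join(step_dir, f"shard_{rank:05d}.pt")
    torch.save({"model": model.state_dict(), "optimizer": optimizer.state_dict()},
               shard_path + ".tmp")
    os.replace(shard_path + ".tmp", shard_path)

    # Only after every shard is durable does rank 0 commit the manifest atomically,
    # so a crash mid-write leaves a directory that recovery will ignore.
    dist.barrier()
    if rank == 0:
        manifest_path = os.path.join(step_dir, "manifest.pt")
        torch.save({"step": step, "world_size": world,
                    "shards": [f"shard_{r:05d}.pt" for r in range(world)]},
                   manifest_path + ".tmp")
        os.replace(manifest_path + ".tmp", manifest_path)
```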
💡 Key Takeaways
•Complete training state includes model parameters (2 TB for 1T params in bf16), optimizer state (roughly 16 bytes per param for mixed-precision Adam, totaling 16 TB), schedulers, RNG states, and data pipeline position
•Recovery Point Objective (RPO) of 30 minutes means losing at most 30 minutes of compute on failure; Recovery Time Objective (RTO) under 10 minutes enables fast resume on production clusters
•Optimal checkpoint frequency follows the Young/Daly rule sqrt(2 × MTBF × write time): with a 6 hour MTBF and 120 second writes, checkpoint roughly every 38 minutes to balance overhead and lost work
•Sharded checkpoints split an 18 TB checkpoint across hundreds of GPUs, each writing only tens of GB; at 100 GB/s aggregate throughput, the checkpoint persists in 180 seconds instead of hours through a single writer
•Meta's Fully Sharded Data Parallel (FSDP) and NVIDIA's Megatron save world-size-agnostic checkpoints, allowing restore on different GPU counts (e.g., trained on 256 GPUs, resume on 128)
📌 Examples
70B parameter model: 140 GB weights (bf16) + 560 GB optimizer (Adam fp32) + 20 GB scheduler/misc = 720 GB total checkpoint, written every 30 minutes on preemptible cloud GPUs
OpenAI GPT-scale training: a trillion parameter model produces 18 TB checkpoints saved to object storage in parallel from 1024 GPUs, each writing an ~18 GB shard in about 3 minutes at 100 MB/s per node
Google TPU pods checkpoint every 15 minutes to handle preemption rates of multiple interruptions per day; atomic manifest written last ensures partial writes are ignored on recovery
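As a sketch of how a manifest-last protocol protects recovery, the helper below (reusing the hypothetical `step_<n>/manifest.pt` layout from the earlier sharded-save sketch) resumes only from the newest checkpoint whose manifest was committed, so partially written directories are skipped.

```python
import os

def latest_complete_checkpoint(ckpt_dir):
    """Return the newest checkpoint directory whose manifest exists; ignore partial writes."""
    complete = []
    for name in os.listdir(ckpt_dir):
        step_dir = os.path.join(ckpt_dir, name)
        if name.startswith("step_") and os.path.exists(os.path.join(step_dir, "manifest.pt")):
            complete.append((int(name.split("_")[1]), step_dir))
    return max(complete)[1] if complete else None
```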