What Is Model Checkpointing and Why It Matters at Scale
The Scale of the Problem
At production scale, the stakes are enormous. A 70 billion parameter model trained with the Adam optimizer produces checkpoints of roughly 600 GB to 1.1 TB. A trillion-parameter checkpoint reaches roughly 18 TB: 2 bytes per parameter for the bfloat16 weights, plus about 16 bytes per parameter of optimizer state (an fp32 master copy of the weights and Adam's two fp32 moment estimates). Training these models costs hundreds of thousands of dollars in GPU time, so losing even an hour of progress is financially significant.
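The sizing arithmetic above can be sketched as a back-of-the-envelope calculation. The per-parameter byte counts here are the rounded assumptions from the text, not measurements; smaller optimizer-state layouts explain the lower end of the 70B range.

```python
# Back-of-the-envelope checkpoint sizing. Byte counts per parameter are
# the rounded assumptions from the text, not measurements.
BYTES_WEIGHTS_BF16 = 2    # bfloat16 weights
BYTES_OPT_STATE = 16      # assumed: fp32 master weights + Adam moment estimates

def checkpoint_size_tb(n_params: float) -> float:
    """Estimated checkpoint size in terabytes (1 TB = 1e12 bytes)."""
    return n_params * (BYTES_WEIGHTS_BF16 + BYTES_OPT_STATE) / 1e12

print(checkpoint_size_tb(1e12))   # 1T-parameter model -> 18.0 TB
print(checkpoint_size_tb(70e9))   # 70B model at the same layout -> ~1.26 TB
```

With leaner layouts (e.g. no fp32 master copy in the checkpoint), the 70B figure drops toward the bottom of the quoted 600 GB to 1.1 TB range.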
The Fundamental Trade-off
The fundamental trade-off is Recovery Point Objective (RPO) versus overhead. RPO is the maximum compute you are willing to lose, measured as the time since your last checkpoint. If you checkpoint every 30 minutes and fail, you lose at most 30 minutes of work. But checkpointing too frequently creates overhead: writing an 18 TB checkpoint at 100 GB/s of aggregate throughput takes 180 seconds, during which your expensive GPUs may sit underutilized.
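The trade-off can be made concrete with two quantities: the fraction of wall-clock time spent writing checkpoints, and the worst-case compute lost on failure. A minimal sketch, using the 18 TB / 100 GB/s figures from the text (this assumes synchronous writes that stall training; asynchronous checkpointing would shrink the overhead term):

```python
def overhead_fraction(write_s: float, interval_s: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints."""
    return write_s / interval_s

def worst_case_loss_s(interval_s: float) -> float:
    """Maximum compute lost on failure: everything since the last checkpoint."""
    return interval_s

# 18 TB at 100 GB/s aggregate throughput -> 180 s per checkpoint write
write_s = 18e12 / 100e9

for interval_min in (10, 30, 60):
    interval_s = interval_min * 60
    print(f"every {interval_min:2d} min: "
          f"{overhead_fraction(write_s, interval_s):.1%} overhead, "
          f"worst-case loss {worst_case_loss_s(interval_s) / 60:.0f} min")
```

Checkpointing every 10 minutes costs 30% of wall-clock time at these numbers, while hourly checkpoints cost 5% but risk losing an hour of GPU time.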
Recovery Time Objective
Recovery Time Objective (RTO) measures how quickly you can resume after failure: reading checkpoint shards from storage, restoring state across hundreds of GPUs, reinitializing communication collectives, and warming caches. Production systems aim for an RTO under 10 minutes for models up to 100B parameters.
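The recovery stages listed above can be budgeted as a simple sum. This is a hypothetical budget: the fixed stage durations and the 50 GB/s read bandwidth are illustrative assumptions, not measurements; only the list of stages comes from the text.

```python
def rto_estimate_s(ckpt_bytes: float, read_bw_bps: float,
                   restore_s: float, collectives_s: float, warmup_s: float) -> float:
    """Sum the recovery stages: read checkpoint shards from storage,
    restore state across GPUs, reinitialize collectives, warm caches."""
    read_s = ckpt_bytes / read_bw_bps
    return read_s + restore_s + collectives_s + warmup_s

# Hypothetical example: a ~1 TB checkpoint read at 50 GB/s aggregate,
# with assumed fixed costs for the remaining stages.
total_s = rto_estimate_s(1e12, 50e9,
                         restore_s=120, collectives_s=60, warmup_s=60)
print(f"estimated RTO: {total_s / 60:.1f} min")
```

Under these assumptions the shard read is a minor term (20 s) and the fixed restore and reinitialization costs dominate, which is why the 10-minute RTO target hinges as much on orchestration as on storage bandwidth.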