Checkpoint Frequency: Balancing Cost, Overhead, and Reliability
The Core Trade-off
Choosing a checkpoint frequency is a classic trade-off between work lost on failure and the overhead of checkpointing itself. Frequent checkpoints (every 10 to 30 minutes) minimize the recovery point objective (RPO): you lose at most 30 minutes of GPU compute if something crashes. But each checkpoint imposes a cost: the snapshot stall (10 to 30 seconds of idle GPUs), bandwidth consumption, storage I/O load, and the accumulated storage footprint.
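This trade-off can be captured in a simple first-order model: the fraction of compute wasted is the write stall amortized over each interval, plus the expected rework after a failure (on average half an interval, once per MTBF). The function name below is illustrative, not from any library.

```python
def wasted_fraction(interval_s, ckpt_write_s, mtbf_s):
    """First-order model of the fraction of compute wasted.

    Two terms: the checkpoint write stall amortized over each
    interval (ckpt_write_s / interval_s), plus the expected rework
    after a failure -- on average half an interval is lost, once
    per MTBF (interval_s / (2 * mtbf_s)).
    """
    return ckpt_write_s / interval_s + interval_s / (2 * mtbf_s)

# Scan 5- to 120-minute intervals for a 120 s write and a 6 h MTBF.
cs, mtbf = 120, 6 * 3600
best = min(range(5, 121), key=lambda m: wasted_fraction(m * 60, cs, mtbf))
print(best)  # minute-granularity minimum of the wasted-compute curve
```

Too-frequent checkpoints are dominated by the first term, too-rare ones by the second; the minimum sits between them.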
Optimal Interval Formula
A practical result from checkpointing theory (Young's approximation) gives the optimal interval as approximately sqrt(2 * Cs * M), where Cs is the checkpoint write time and M is the mean time between failures (MTBF). For example, if a checkpoint takes 120 seconds to write and your cluster MTBF is 6 hours, the formula suggests an interval of about 38 minutes. This balances the expected recompute cost after a failure against the overhead of checkpointing.
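The worked example above is a one-liner to reproduce:

```python
import math

def young_interval(ckpt_write_s, mtbf_s):
    # Young's approximation: T_opt ~= sqrt(2 * Cs * M)
    return math.sqrt(2 * ckpt_write_s * mtbf_s)

t = young_interval(120, 6 * 3600)
print(f"{t:.0f} s ~= {t / 60:.0f} min")  # ~2277 s, i.e. ~38 min
```

Plugging in different Cs and M values makes it easy to see how the interval shifts as write time or cluster stability changes.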
Real World Adjustments
On preemptible cloud instances (spot VMs or preemptible TPU nodes), MTBF can be measured in hours rather than days, pushing teams toward 15-to-30-minute checkpoint intervals despite the higher overhead. Google's TPU training on preemptible pods checkpoints every 15 minutes because preemptions occur multiple times per day. Conversely, on stable on-premises clusters with MTBF measured in days, checkpointing once per hour is common for smaller models.
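In practice these intervals are usually enforced with a wall-clock trigger checked once per training step. The sketch below is a minimal, hypothetical helper (class and parameter names are assumptions, not any framework's API); `save_fn` stands in for whatever persists model state.

```python
import time

class CheckpointTimer:
    """Wall-clock checkpoint trigger.

    interval_s would be ~900 s on preemptible nodes and
    ~3600 s on a stable on-premises cluster.
    """

    def __init__(self, interval_s, clock=time.monotonic):
        self.interval_s = interval_s
        self.clock = clock
        self.last = clock()

    def maybe_checkpoint(self, save_fn):
        now = self.clock()
        if now - self.last >= self.interval_s:
            save_fn()           # persist model/optimizer state
            self.last = now     # restart the window after the save
            return True
        return False
```

Calling `maybe_checkpoint(save)` at the top of every step keeps checkpointing decoupled from step count, which matters when step times vary.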
Retention Policies
Keeping the last K checkpoints (typically K = 3 to 5) provides rollback options if training becomes unstable. Additionally, teams keep a "best" checkpoint, selected by validation metric, for deployment. For an 18 TB checkpoint with K = 5, you need 90 TB of rolling storage, making tiered storage strategies essential.
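A retention policy like this reduces to a small set computation: keep the last K checkpoints plus the best one by validation metric. This is a sketch with an assumed (step, val_loss) representation, not a real checkpoint-manager API.

```python
def select_retained(checkpoints, keep_last=5):
    """Return the steps to keep: the last K plus the best-by-validation.

    `checkpoints` is a list of (step, val_loss) in training order;
    lower val_loss is assumed better.
    """
    recent = {step for step, _ in checkpoints[-keep_last:]}
    best_step = min(checkpoints, key=lambda c: c[1])[0]  # lowest val loss
    return sorted(recent | {best_step})

history = [(1000, 2.1), (2000, 1.8), (3000, 1.9), (4000, 1.7),
           (5000, 1.75), (6000, 1.76), (7000, 1.74), (8000, 1.9)]
print(select_retained(history, keep_last=3))
```

Note that the best checkpoint can fall outside the recent window (as in the example, where the lowest loss is at an earlier step), which is exactly why it is tracked separately rather than relying on the rolling window alone.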