
Checkpoint Frequency: Balancing Cost, Overhead, and Reliability

Choosing checkpoint frequency is a classic trade-off between lost work on failure and the overhead of checkpointing itself. Frequent checkpoints (every 10 to 30 minutes) minimize the Recovery Point Objective (RPO), meaning you lose at most 30 minutes of GPU compute if something crashes. But each checkpoint imposes a cost: the snapshot stall (10 to 30 seconds of idle GPUs), bandwidth consumption, storage I/O load, and the accumulated storage footprint of keeping multiple versions. Infrequent checkpoints (every few hours or once per epoch) reduce overhead and storage churn but risk losing hours of progress on multi-thousand-dollar GPU clusters.

A practical formula from checkpointing theory gives the optimal interval as approximately sqrt(2 × Cs × M), where Cs is the checkpoint write time and M is the Mean Time Between Failures (MTBF). For example, if a checkpoint takes 120 seconds to write and your cluster MTBF is 6 hours (21,600 seconds), the formula suggests an interval of sqrt(2 × 120 × 21,600) = sqrt(5,184,000) ≈ 2,277 seconds, or about 38 minutes. This balances the expected recompute cost (time lost between the last checkpoint and the failure) against checkpointing overhead. In practice, teams round this to 30 or 60 minute intervals and adjust based on observed failure rates and storage costs.

Real-world considerations often override the formula. On preemptible cloud instances (spot VMs or preemptible TPU nodes), MTBF can be measured in hours rather than days, pushing teams toward 15 to 30 minute checkpoint intervals despite the higher overhead. Google's TPU training on preemptible pods checkpoints every 15 minutes because preemptions occur multiple times per day. Conversely, on stable on-premises clusters with MTBF measured in days, checkpointing once per hour or even once per epoch is common for smaller models whose checkpoint writes complete in under a minute.

Storage retention policies also factor in. Keeping the last K checkpoints (typically K = 3 to 5) provides rollback options if a bad update corrupts the model or causes training instability. Additionally, teams keep a "best" checkpoint selected by validation metric (lowest perplexity, highest F1) for deployment. For an 18 TB checkpoint with K = 5, you need 90 TB of storage, which on premium NVMe-backed filesystems can cost tens of thousands of dollars per month. Object storage reduces this to hundreds or low thousands of dollars, but you must design for eventual consistency and higher read latency on recovery.
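As a quick sanity check on that arithmetic, here is a minimal Python sketch of the interval formula; the function name is illustrative, not from any particular framework:

```python
import math

def optimal_checkpoint_interval(write_time_s: float, mtbf_s: float) -> float:
    """Young/Daly-style approximation: interval ≈ sqrt(2 × write time × MTBF)."""
    return math.sqrt(2 * write_time_s * mtbf_s)

# Worked example from the text: 120 s checkpoint write, 6 hour MTBF.
interval_s = optimal_checkpoint_interval(write_time_s=120, mtbf_s=6 * 3600)
print(f"suggested interval: {interval_s:.0f} s (~{interval_s / 60:.0f} min)")
# prints: suggested interval: 2277 s (~38 min)
```

In practice you would round the result to your scheduler's granularity, for example the 30 or 60 minute intervals mentioned above.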
💡 Key Takeaways
Optimal checkpoint interval formula: sqrt(2 × write time × MTBF); for 120s write and 6 hour MTBF, checkpoint every 38 minutes to minimize combined cost of overhead and lost work
Preemptible infrastructure (spot instances, preemptible TPUs) with daily interruptions requires 15 to 30 minute intervals despite the higher overhead; stable clusters with multi-day MTBF can checkpoint hourly
Storage retention: keep the last 3 to 5 checkpoints for rollback plus one best checkpoint by validation metric (see the sketch after this list); 18 TB per checkpoint × 5 versions = 90 TB storage, costing $2k to $20k/month depending on tier
Checkpoint overhead includes snapshot stall (10 to 30s per checkpoint), bandwidth (100 GB/s sustained writes), and the risk of overlapping async writes exhausting memory or I/O capacity
For unstable experimental runs, increase frequency temporarily (every 10 to 15 minutes) to enable fast bisection of regressions, then prune extra checkpoints once training stabilizes
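The retention rule above (last K checkpoints plus a pinned best-by-metric checkpoint) can be sketched as follows; the Checkpoint record and apply_retention helper are hypothetical names for illustration, not part of any specific trainer:

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    step: int
    path: str
    val_metric: float  # e.g. validation loss; lower is better in this sketch

def apply_retention(checkpoints: list[Checkpoint], keep_last: int = 5) -> list[Checkpoint]:
    """Keep the most recent `keep_last` checkpoints plus the best one by metric."""
    by_step = sorted(checkpoints, key=lambda c: c.step)
    recent = by_step[-keep_last:]                    # rolling window for rollback
    best = min(by_step, key=lambda c: c.val_metric)  # pinned checkpoint for deployment
    kept = {c.path: c for c in recent + [best]}      # dedupe if the best is also recent
    return list(kept.values())
```

Anything not returned would be deleted from storage, bounding the footprint to roughly (keep_last + 1) × checkpoint size, i.e. the 90 to 108 TB figures quoted here for 18 TB checkpoints.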
📌 Examples
OpenAI GPT-3 scale training: 30 minute checkpoint interval on 1024 A100 GPUs with 18 TB checkpoints; keeps last 5 plus best, totaling 108 TB in S3 at ~$2500/month storage cost
Meta 175B OPT model: 60 minute checkpoints on stable cluster (3 day MTBF), 1.4 TB per checkpoint written in 90 seconds; retention of last 4 plus best = 7 TB total on parallel filesystem
Google preemptible TPU pod training: 15 minute intervals, each checkpoint taking 180 seconds to write to GCS; with 2 to 4 preemptions per day and RPO bounded to 15 minutes, roughly 1% of daily compute is lost to recomputation (rough estimate sketched below)
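A back-of-envelope check of that last example, under the assumption that a preemption lands on average halfway through a checkpoint interval (the numbers are the ones quoted above):

```python
# Expected daily recompute loss on preemptible TPUs, per the example above.
interval_min = 15                      # checkpoint every 15 minutes
for preemptions_per_day in (2, 4):     # text reports 2 to 4 preemptions per day
    # A preemption lands on average mid-interval, so ~interval/2 of work is redone.
    daily_loss_min = preemptions_per_day * interval_min / 2
    share = daily_loss_min / (24 * 60)
    print(f"{preemptions_per_day} preemptions/day -> {daily_loss_min:.0f} min redone (~{share:.1%})")
# prints: 2 preemptions/day -> 15 min redone (~1.0%)
#         4 preemptions/day -> 30 min redone (~2.1%)
```

This is consistent with the roughly 1% figure at the low end of the preemption range, before counting the checkpoint write time itself.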