
Checkpoint Frequency: Balancing Cost, Overhead, and Reliability

The Core Trade-off

Choosing checkpoint frequency is a classic trade-off between lost work on failure and the overhead of checkpointing itself. Frequent checkpoints (every 10 to 30 minutes) minimize the recovery point objective (RPO), meaning you lose at most 30 minutes of GPU compute if something crashes. But each checkpoint imposes a cost: the snapshot stall (10 to 30 seconds of idle GPUs), bandwidth consumption, storage I/O load, and the accumulated storage footprint.

Optimal Interval Formula

A practical result from checkpointing theory (the Young-Daly formula) gives the optimal interval as approximately sqrt(2 × Cs × M), where Cs is the checkpoint write time and M is the mean time between failures (MTBF). For example, if a checkpoint takes 120 seconds to write and your cluster MTBF is 6 hours, the formula suggests an interval of about 38 minutes. This balances the expected recompute cost against checkpointing overhead.
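The formula is a one-liner; a minimal sketch (function name is mine) reproduces the worked example from the text:

```python
import math

def optimal_checkpoint_interval(write_time_s, mtbf_s):
    """Young-Daly approximation: the interval that minimizes the combined
    cost of checkpoint overhead and expected lost work after a failure."""
    return math.sqrt(2 * write_time_s * mtbf_s)

# Example from the text: 120 s checkpoint writes, 6 hour MTBF.
interval = optimal_checkpoint_interval(120, 6 * 3600)
print(round(interval / 60))  # ~38 minutes
```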

Real World Adjustments

On preemptible cloud instances (spot VMs or preemptible TPU nodes), MTBF can be measured in hours rather than days, pushing teams toward 15 to 30 minute checkpoint intervals despite the higher overhead. Google's TPU training on preemptible pods checkpoints every 15 minutes because preemptions occur multiple times per day. Conversely, on stable on-premises clusters with MTBF measured in days, checkpointing once per hour is common for smaller models.
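Plugging illustrative numbers into the sqrt(2 × Cs × M) formula shows why the two regimes land where they do (the MTBF and write-time values here are hypothetical, chosen to match the ranges above):

```python
import math

def young_interval_minutes(write_time_s, mtbf_hours):
    """Optimal checkpoint interval in minutes for a given write time and MTBF."""
    return math.sqrt(2 * write_time_s * mtbf_hours * 3600) / 60

# Preemptible spot fleet: short MTBF pushes the interval down.
spot = young_interval_minutes(write_time_s=180, mtbf_hours=2)     # ~27 min
# Stable cluster: day-scale MTBF permits roughly hourly checkpoints.
onprem = young_interval_minutes(write_time_s=120, mtbf_hours=24)  # ~76 min
```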

Retention Policies

Keeping the last K checkpoints (typically K = 3 to 5) provides rollback options if training becomes unstable. In addition, teams keep a "best" checkpoint, selected by validation metric, for deployment. For an 18 TB checkpoint with K = 5, you need 90 TB of storage, making tiered storage strategies essential.
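A last-K-plus-best retention policy can be sketched as follows (function and directory names are illustrative; checkpoint directories are assumed sorted oldest to newest):

```python
def prune_checkpoints(ckpt_dirs, best, keep_last=5):
    """Keep the newest `keep_last` checkpoints plus the 'best' one
    (selected by validation metric); return the rest for deletion."""
    keep = set(ckpt_dirs[-keep_last:])
    keep.add(best)
    return [d for d in ckpt_dirs if d not in keep]

# Eight checkpoints, best validation metric at step 2000.
ckpts = [f"step_{i:06d}" for i in range(0, 8000, 1000)]
stale = prune_checkpoints(ckpts, best="step_002000", keep_last=5)
print(stale)  # ['step_000000', 'step_001000']
```

Note that "step_002000" survives pruning even though it falls outside the last five, which is exactly the rollback-plus-deployment behavior described above.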

💡 Key Takeaways
- Optimal checkpoint interval formula: sqrt(2 × write time × MTBF); for a 120 s write and 6 hour MTBF, checkpoint every ~38 minutes to minimize the combined cost of overhead and lost work
- Preemptible infrastructure (spot instances, preemptible TPUs) with daily interruptions requires 15 to 30 minute intervals despite higher overhead; stable clusters with multi-day MTBF can checkpoint hourly
- Storage retention: keep the last 3 to 5 checkpoints for rollback plus one best checkpoint by validation metric; 18 TB per checkpoint × 5 versions = 90 TB of storage, costing $2k to $20k/month depending on tier
- Checkpoint overhead includes the snapshot stall (10 to 30 s per checkpoint), bandwidth (100 GB/s sustained writes), and the risk of overlapping async writes exhausting memory or I/O capacity
- For unstable experimental runs, increase frequency temporarily (every 10 to 15 minutes) to enable fast bisection of regressions, then prune the extra checkpoints once training stabilizes
📌 Interview Tips
1. OpenAI GPT-3 scale training: 30 minute checkpoint interval on 1024 A100 GPUs with 18 TB checkpoints; keeps the last 5 plus best, totaling 108 TB in S3 at ~$2,500/month storage cost
2. Meta 175B OPT model: 60 minute checkpoints on a stable cluster (3 day MTBF), 1.4 TB per checkpoint written in 90 seconds; retaining the last 4 plus best = 7 TB total on a parallel filesystem
3. Google preemptible TPU pod training: 15 minute intervals, with each checkpoint taking 180 seconds to write to GCS; experiences 2 to 4 preemptions per day, and with RPO bounded to 15 minutes loses ~1% of daily compute to recompute