Checkpoint Storage Strategy: Retention, Tiering, and Cost Optimization
Managing checkpoint storage at scale requires a deliberate retention and tiering strategy to balance cost, recovery speed, and rollback flexibility. A naive approach that keeps every checkpoint indefinitely quickly becomes prohibitively expensive: a single 18 TB checkpoint written every 30 minutes over a 7-day training run produces 336 checkpoints totaling over 6 petabytes. Even on cheap object storage at $0.023 per GB per month, that is roughly $140,000 per month for one run. Production systems instead use retention policies that keep only the last K checkpoints (typically 3 to 5) plus a designated "best" checkpoint selected by validation metric, capping storage at roughly 5 to 6× the checkpoint size.
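As a concrete illustration, the retention decision can be reduced to a pure function over the list of checkpoint steps. The sketch below is a minimal example under assumed names and layout (the function name, the K value, and tracking the best step as a separate pointer are illustrative, not a specific production implementation).

```python
def checkpoints_to_keep(all_steps: list[int], best_step: int, k: int = 5) -> set[int]:
    """Return the set of checkpoint steps to retain: the last k plus the best.

    all_steps: steps at which checkpoints currently exist, in any order.
    best_step: step of the checkpoint with the best validation metric so far.
    """
    latest_k = sorted(all_steps)[-k:]      # sliding window of rollback points
    return set(latest_k) | {best_step}     # the "best" checkpoint is always retained

# Example: 336 checkpoints from a week-long run, best seen at (hypothetical) step 9000
steps = list(range(0, 336 * 30, 30))
keep = checkpoints_to_keep(steps, best_step=9000, k=5)
# Everything outside `keep` is eligible for tiering or garbage collection.
```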
The last-K policy provides a sliding window of rollback points. If training becomes unstable or a code bug corrupts the model at step 1000, you can revert to the checkpoint from step 970 or 940. K of 3 to 5 is common because it covers the last few hours of training (with 30-minute checkpoints, a 5-checkpoint window spans 2.5 hours), enough to recover from transient issues like a bad batch or a hyperparameter spike without losing too much progress. The "best" checkpoint is retained indefinitely and updated whenever the validation metric (loss, accuracy, F1, perplexity, etc.) improves, providing a deployment-ready artifact even if later training diverges.
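A minimal sketch of the best-checkpoint bookkeeping, assuming a loss-like metric where lower is better and a small JSON pointer file persisted next to the checkpoints; the class name and file layout are assumptions for illustration.

```python
import json
from pathlib import Path

class BestCheckpointTracker:
    """Tracks the best checkpoint by validation metric (lower is better)."""

    def __init__(self, pointer_path: Path):
        self.pointer_path = pointer_path
        if pointer_path.exists():
            self.state = json.loads(pointer_path.read_text())
        else:
            self.state = {"step": None, "metric": float("inf")}

    def maybe_update(self, step: int, val_metric: float) -> bool:
        """Record `step` as best if `val_metric` improves; returns True if updated."""
        if val_metric < self.state["metric"]:
            self.state = {"step": step, "metric": val_metric}
            self.pointer_path.write_text(json.dumps(self.state))
            return True
        return False

# After each evaluation pass, e.g.:
# tracker = BestCheckpointTracker(Path("/ckpts/run42/best.json"))
# tracker.maybe_update(step=1200, val_metric=2.31)
```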
Storage tiering further reduces cost by moving older checkpoints to cheaper, slower storage classes. High-frequency checkpoints (every 15 to 30 minutes) are written to premium storage for a fast Recovery Time Objective (RTO) of under 10 minutes. After 24 to 48 hours, checkpoints outside the last-K window are demoted to a mid tier (e.g., S3 Infrequent Access or GCS Nearline) with slightly higher latency and lower cost. After a week, they move to cold storage (S3 Glacier, GCS Archive) or are deleted entirely. This tiered approach is common at Meta and Google, where most recoveries use the most recent checkpoint but occasionally an engineer needs to inspect a week-old checkpoint for debugging, justifying a small cold-storage footprint.
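One way to express the demotion schedule is as an object-lifecycle rule on the checkpoint bucket. The boto3 sketch below is illustrative (the bucket name and prefix are assumptions); note that S3 lifecycle rules cannot transition objects to Standard-IA before they are 30 days old, so demotion within 24 to 48 hours is typically done by the checkpoint manager copying objects itself rather than by a lifecycle rule.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical layout: s3://my-training-ckpts/run42/step_XXXX/...
s3.put_bucket_lifecycle_configuration(
    Bucket="my-training-ckpts",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "demote-old-checkpoints",
                "Filter": {"Prefix": "run42/"},
                "Status": "Enabled",
                # S3 requires >= 30 days before STANDARD_IA transitions, so this
                # rule only covers the cold tier; earlier demotion is handled
                # in application code.
                "Transitions": [{"Days": 7, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 30},  # delete cold copies after a month
            }
        ]
    },
)
```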
Cross-region replication of the latest and best checkpoints provides disaster recovery if an entire data center or availability zone fails. Replicating every checkpoint is expensive due to egress bandwidth costs (inter-region transfer fees run $0.02 to $0.09 per GB), so teams replicate only critical checkpoints asynchronously. For a 600 GB checkpoint, cross-region replication costs $12 to $54 per checkpoint; doing this every 30 minutes is prohibitive, but replicating the daily best checkpoint is reasonable. Some teams further optimize by replicating only the model weights (a few hundred GB) rather than the full state (which includes multi-TB optimizer buffers), accepting that cross-region recovery will require restarting optimizer state but preserving model progress.
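A sketch of weights-only asynchronous replication, assuming checkpoints live in S3 with weight shards under a `weights/` prefix and a destination bucket already exists in the second region; bucket names, prefix layout, and the $0.02/GB figure used in the estimate are assumptions for illustration.

```python
import boto3

SRC_BUCKET = "ckpts-us-east-1"          # hypothetical primary bucket
DST_BUCKET = "ckpts-us-west-2-replica"  # hypothetical DR bucket in another region

src = boto3.client("s3", region_name="us-east-1")
dst = boto3.client("s3", region_name="us-west-2")

def replicate_weights_only(step: int) -> float:
    """Copy only weight shards (not optimizer state); return estimated egress cost in USD."""
    prefix = f"run42/step_{step}/weights/"
    total_bytes = 0
    paginator = src.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            dst.copy({"Bucket": SRC_BUCKET, "Key": obj["Key"]}, DST_BUCKET, obj["Key"])
            total_bytes += obj["Size"]
    return total_bytes / 1e9 * 0.02  # rough lower-bound transfer cost

# e.g. triggered once per day for the current "best" step:
# cost = replicate_weights_only(step=9000)
```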
💡 Key Takeaways
•Retention policy: keep the last 3 to 5 checkpoints plus one best by validation metric; for 18 TB checkpoints this caps storage at roughly 90 to 108 TB, versus over 6 PB if keeping every checkpoint over a week-long run
•Storage tiering: recent checkpoints (last 48 hours) on premium fast storage for an RTO under 10 minutes; older checkpoints demoted to an infrequent-access tier (roughly 40 to 50% cheaper per GB, RTO 30 to 60 minutes) or to cold archive after one week
•Cross-region replication cost: $0.02 to $0.09 per GB egress; replicating a 600 GB checkpoint every 30 minutes costs roughly $575 to $2,600 per day; optimize by replicating only the daily best or weights-only snapshots
•Best checkpoint selection: continuously updated based on validation perplexity, F1, or accuracy; serves as deployment artifact and safeguard if training diverges in later stages
•Garbage collection: async cleanup deletes checkpoints outside the retention window; must be fault-tolerant (never delete the latest or best) and coordinated across distributed writers to avoid races (see the sketch after this list)
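A minimal sketch of such a garbage-collection pass over a local or mounted checkpoint directory, mirroring the retention logic sketched earlier; the `step_XXXX` directory naming and the choice to run it from a single janitor process are assumptions for illustration.

```python
import re
import shutil
from pathlib import Path

def gc_checkpoints(ckpt_dir: Path, best_step: int, k: int = 5) -> None:
    """Delete checkpoint directories outside the last-k window, never the latest or best."""
    step_dirs = {int(m.group(1)): d
                 for d in ckpt_dir.iterdir()
                 if (m := re.fullmatch(r"step_(\d+)", d.name))}
    if not step_dirs:
        return
    latest = max(step_dirs)
    keep = set(sorted(step_dirs)[-k:]) | {best_step, latest}  # safety: always keep latest + best
    for step, d in step_dirs.items():
        if step not in keep:
            shutil.rmtree(d, ignore_errors=True)  # idempotent; safe to re-run after a crash

# e.g. run hourly from a single designated rank or a separate janitor process:
# gc_checkpoints(Path("/ckpts/run42"), best_step=9000, k=5)
```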
📌 Examples
OpenAI GPT training: keeps last 5 checkpoints (90 TB for 18 TB each) on S3 Standard, best checkpoint replicated to second region; garbage collection runs hourly, storage cost ~$2000/month for active run
Meta OPT 175B: 1.4 TB checkpoints, last 4 plus best on Lustre parallel filesystem (7 TB total); after training completes, best checkpoint moved to S3 and Lustre snapshots deleted, reclaiming $15k/month capacity
Google T5 on TPU: checkpoints every 15 minutes to GCS Standard, auto-tiered to Nearline after 48 hours; best checkpoint kept in Standard class, others archived to Coldline after 7 days, reducing storage cost by ~60%