
Checkpoint Storage Strategy: Retention, Tiering, and Cost Optimization

The Storage Scale Problem

Managing checkpoint storage at scale requires a deliberate retention and tiering strategy. A naive approach that keeps every checkpoint indefinitely quickly becomes prohibitively expensive: a single 18 TB checkpoint written every 30 minutes over a 7 day training run generates 336 checkpoints totaling over 6 petabytes. Even on cheap object storage (around $0.023 per GB-month), that is roughly $140,000 per month for one run.
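The arithmetic above can be checked in a few lines; the $0.023/GB-month rate is an assumed object storage price consistent with the article's figure:

```python
# Back-of-envelope for the storage scale problem above.
# The per-GB-month rate is an assumption (typical object storage pricing).
checkpoint_tb = 18
interval_min = 30
days = 7

n_checkpoints = days * 24 * 60 // interval_min   # checkpoints written over the run
total_tb = n_checkpoints * checkpoint_tb         # total stored if nothing is deleted
monthly_cost = total_tb * 1000 * 0.023           # GB-months at ~$0.023/GB-month

print(n_checkpoints, total_tb, round(monthly_cost))  # 336 checkpoints, 6048 TB, ~$139k
```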

Last K Retention Policy

Production systems use retention policies that keep only the last K checkpoints (typically 3 to 5) plus a designated "best" checkpoint selected by validation metric. K of 3 to 5 covers the last few hours of training (with 30 minute checkpoints, a window of K = 5 spans 2.5 hours), enough to recover from transient issues without losing too much progress. The "best" checkpoint is retained indefinitely and updated whenever the validation metric improves.
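The last K plus best policy can be sketched as a pure selection function; the `Checkpoint` type and field names here are illustrative, not a real API:

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    step: int
    path: str
    val_metric: float  # e.g. validation loss; lower is better

def select_retained(checkpoints, k=5):
    """Return the paths to keep: the last k checkpoints plus the best by metric.

    `checkpoints` is assumed sorted by step; everything else is eligible for GC.
    """
    if not checkpoints:
        return set()
    last_k = checkpoints[-k:]
    best = min(checkpoints, key=lambda c: c.val_metric)
    return {c.path for c in last_k} | {best.path}
```

Note that the best checkpoint may already fall inside the last K window, in which case the retained set is exactly K paths rather than K + 1.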

Storage Tiering

Storage tiering further reduces costs by moving older checkpoints to cheaper, slower storage classes. High frequency checkpoints are written to premium storage to keep the recovery time objective (RTO) under 10 minutes. After 24 to 48 hours, checkpoints outside the last K window are demoted to mid tier storage. After a week, they move to cold storage or are deleted entirely.
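A minimal sketch of that age based tiering decision, using the thresholds from the text (tier names are illustrative placeholders for provider specific storage classes):

```python
def storage_tier(age_hours, in_last_k):
    """Map a checkpoint's age to a storage class.

    Thresholds follow the text: premium for the first 48 hours or anything
    in the last K window, mid tier until one week, then cold/archive.
    """
    if in_last_k or age_hours < 48:
        return "premium"     # fast restore, RTO under 10 minutes
    if age_hours < 7 * 24:
        return "mid"         # cheaper, slower restore
    return "cold"            # archive class, or delete entirely
```

In practice this logic is usually expressed as an object storage lifecycle rule rather than application code, but the decision table is the same.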

Cross Region Replication

Cross region replication of the latest and best checkpoints provides disaster recovery. Replicating every checkpoint is expensive due to egress bandwidth costs, so teams replicate only critical checkpoints asynchronously. Some teams optimize by replicating only model weights (a few hundred GB) rather than full state (which includes multi TB optimizer buffers), accepting that cross region recovery will require restarting optimizer state but preserving model progress.
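The egress cost tradeoff is easy to quantify. A sketch, using the $0.02 to $0.09 per GB rates cited in the takeaways below (actual rates vary by provider and region pair):

```python
def daily_egress_cost(gb_per_checkpoint, interval_min, price_per_gb):
    """Daily cross region egress cost if every checkpoint is replicated."""
    checkpoints_per_day = 24 * 60 / interval_min
    return checkpoints_per_day * gb_per_checkpoint * price_per_gb

# Replicating a 600 GB weights-only snapshot every 30 minutes:
low = daily_egress_cost(600, 30, 0.02)   # ~$576/day at $0.02/GB
high = daily_egress_cost(600, 30, 0.09)  # ~$2,592/day at $0.09/GB
```

The same function shows why full state replication is avoided: at 18 TB per checkpoint the daily cost is thirty times higher, which is what pushes teams toward weights only or daily best replication.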

💡 Key Takeaways
Retention policy: keep last 3 to 5 checkpoints plus one best by validation metric; for 18 TB checkpoints, this caps storage at 90 TB vs 6 PB if keeping all checkpoints over a week long run
Storage tiering: recent checkpoints (last 48 hours) on premium fast storage for RTO under 10 minutes; older checkpoints demoted to infrequent access tier (30% cheaper, RTO 30 to 60 minutes) or cold archive after one week
Cross region replication cost: $0.02 to $0.09 per GB egress; replicating a 600 GB checkpoint every 30 minutes moves about 28.8 TB per day, costing roughly $575 to $2,600 per day; optimize by replicating only the daily best or weights only snapshots
Best checkpoint selection: continuously updated based on validation perplexity, F1, or accuracy; serves as deployment artifact and safeguard if training diverges in later stages
Garbage collection: async cleanup deletes checkpoints outside retention window; must be fault tolerant (never delete latest or best) and coordinated across distributed writers to avoid races
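The garbage collection invariant in the last takeaway, never delete the latest or best checkpoint, can be sketched as a pure candidate selection step; names here are illustrative:

```python
def gc_candidates(all_paths, retained, latest, best):
    """Return paths safe to delete: everything outside the retention window.

    The latest and best checkpoints are unconditionally protected, even if a
    stale retention list omits them, so a buggy or racing writer can never
    cause the recovery critical checkpoints to be deleted.
    """
    protected = set(retained) | {latest, best}
    return sorted(p for p in all_paths if p not in protected)
```

Keeping the selection pure and deleting in a separate step makes the cleanup idempotent, which is what lets it run asynchronously and tolerate restarts mid sweep.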
📌 Interview Tips
1. OpenAI GPT training: keeps last 5 checkpoints (90 TB at 18 TB each) on S3 Standard, best checkpoint replicated to a second region; garbage collection runs hourly, storage cost ~$2000/month for an active run
2. Meta OPT 175B: 1.4 TB checkpoints, last 4 plus best on Lustre parallel filesystem (7 TB total); after training completes, best checkpoint moved to S3 and Lustre snapshots deleted, reclaiming $15k/month of capacity
3. Google T5 on TPU: checkpoints every 15 minutes to GCS Standard for 48 hours, then auto tiered to Nearline after 2 days; best checkpoint kept in Standard class, others archived to Coldline after 7 days, reducing cost by 60%