Checkpoint Storage Strategy: Retention, Tiering, and Cost Optimization
Managing checkpoint storage at scale requires a deliberate retention and tiering strategy to balance cost, recovery speed, and rollback flexibility. A naive approach that keeps every checkpoint indefinitely quickly becomes prohibitively expensive: a single 18 TB checkpoint written every 30 minutes over a 7-day training run produces 336 checkpoints totaling over 6 petabytes. Even on cheap object storage at $0.023 per GB per month, that is roughly $140,000 per month for one run. Production systems instead use retention policies that keep only the last K checkpoints (typically 3 to 5) plus a designated "best" checkpoint selected by validation metric, capping storage at roughly 5 to 6× the checkpoint size.
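As a concrete illustration, the retention decision can be reduced to a pure function over the list of checkpoint steps. The sketch below is a minimal example under assumed names and layout (the function name, the K value, and tracking the best step as a separate pointer are illustrative, not a specific production implementation).

```python
def checkpoints_to_keep(all_steps: list[int], best_step: int, k: int = 5) -> set[int]:
    """Return the set of checkpoint steps to retain: the last k plus the best.

    all_steps: steps at which checkpoints currently exist, in any order.
    best_step: step of the checkpoint with the best validation metric so far.
    """
    latest_k = sorted(all_steps)[-k:]      # sliding window of rollback points
    return set(latest_k) | {best_step}     # the "best" checkpoint is always retained

# Example: 336 checkpoints from a week-long run, best seen at (hypothetical) step 9000
steps = list(range(0, 336 * 30, 30))
keep = checkpoints_to_keep(steps, best_step=9000, k=5)
# Everything outside `keep` is eligible for tiering or garbage collection.
```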
The last-K policy provides a sliding window of rollback points. If training becomes unstable or a code bug corrupts the model at step 1000, you can revert to the checkpoint from step 970 or 940. K of 3 to 5 is common because it covers the last few hours of training (with 30-minute checkpoints, a 5-checkpoint window spans 2.5 hours), enough to recover from transient issues like a bad batch or a hyperparameter spike without losing too much progress. The "best" checkpoint is retained indefinitely and updated whenever the validation metric (loss, accuracy, F1, perplexity, etc.) improves, providing a deployment-ready artifact even if later training diverges.
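A minimal sketch of the best-checkpoint bookkeeping, assuming a loss-like metric where lower is better and a small JSON pointer file persisted next to the checkpoints; the class name and file layout are assumptions for illustration.

```python
import json
from pathlib import Path

class BestCheckpointTracker:
    """Tracks the best checkpoint by validation metric (lower is better)."""

    def __init__(self, pointer_path: Path):
        self.pointer_path = pointer_path
        if pointer_path.exists():
            self.state = json.loads(pointer_path.read_text())
        else:
            self.state = {"step": None, "metric": float("inf")}

    def maybe_update(self, step: int, val_metric: float) -> bool:
        """Record `step` as best if `val_metric` improves; returns True if updated."""
        if val_metric < self.state["metric"]:
            self.state = {"step": step, "metric": val_metric}
            self.pointer_path.write_text(json.dumps(self.state))
            return True
        return False

# After each evaluation pass, e.g.:
# tracker = BestCheckpointTracker(Path("/ckpts/run42/best.json"))
# tracker.maybe_update(step=1200, val_metric=2.31)
```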
Storage tiering further reduces cost by moving older checkpoints to cheaper, slower storage classes. High-frequency checkpoints (every 15 to 30 minutes) are written to premium storage for a fast Recovery Time Objective (RTO) of under 10 minutes. After 24 to 48 hours, checkpoints outside the last-K window are demoted to a mid tier (e.g., S3 Infrequent Access or GCS Nearline) with slightly higher latency and lower cost. After a week, they move to cold storage (S3 Glacier, GCS Archive) or are deleted entirely. This tiered approach is common at Meta and Google, where most recoveries use the most recent checkpoint but occasionally an engineer needs to inspect a week-old checkpoint for debugging, justifying a small cold-storage footprint.
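One way to express the demotion schedule is as an object-lifecycle rule on the checkpoint bucket. The boto3 sketch below is illustrative (the bucket name and prefix are assumptions); note that S3 lifecycle rules cannot transition objects to Standard-IA before they are 30 days old, so demotion within 24 to 48 hours is typically done by the checkpoint manager copying objects itself rather than by a lifecycle rule.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical layout: s3://my-training-ckpts/run42/step_XXXX/...
s3.put_bucket_lifecycle_configuration(
    Bucket="my-training-ckpts",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "demote-old-checkpoints",
                "Filter": {"Prefix": "run42/"},
                "Status": "Enabled",
                # S3 requires >= 30 days before STANDARD_IA transitions, so this
                # rule only covers the cold tier; earlier demotion is handled
                # in application code.
                "Transitions": [{"Days": 7, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 30},  # delete cold copies after a month
            }
        ]
    },
)
```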
Cross-region replication of the latest and best checkpoints provides disaster recovery if an entire data center or availability zone fails. Replicating every checkpoint is expensive due to egress bandwidth costs (inter-region transfer fees run $0.02 to $0.09 per GB), so teams replicate only critical checkpoints asynchronously. For a 600 GB checkpoint, cross-region replication costs $12 to $54 per checkpoint; doing this every 30 minutes is prohibitive, but replicating the daily best checkpoint is reasonable. Some teams further optimize by replicating only the model weights (a few hundred GB) rather than the full state (which includes multi-TB optimizer buffers), accepting that cross-region recovery will require restarting optimizer state but preserving model progress.
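A sketch of weights-only asynchronous replication, assuming checkpoints live in S3 with weight shards under a `weights/` prefix and a destination bucket already exists in the second region; bucket names, prefix layout, and the $0.02/GB figure used in the estimate are assumptions for illustration.

```python
import boto3

SRC_BUCKET = "ckpts-us-east-1"          # hypothetical primary bucket
DST_BUCKET = "ckpts-us-west-2-replica"  # hypothetical DR bucket in another region

src = boto3.client("s3", region_name="us-east-1")
dst = boto3.client("s3", region_name="us-west-2")

def replicate_weights_only(step: int) -> float:
    """Copy only weight shards (not optimizer state); return estimated egress cost in USD."""
    prefix = f"run42/step_{step}/weights/"
    total_bytes = 0
    paginator = src.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            dst.copy({"Bucket": SRC_BUCKET, "Key": obj["Key"]}, DST_BUCKET, obj["Key"])
            total_bytes += obj["Size"]
    return total_bytes / 1e9 * 0.02  # rough lower-bound transfer cost

# e.g. triggered once per day for the current "best" step:
# cost = replicate_weights_only(step=9000)
```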
💡 Key Takeaways
•Retention policy: keep the last 3 to 5 checkpoints plus one best by validation metric; for 18 TB checkpoints this caps storage at roughly 90 to 108 TB, versus over 6 PB if keeping every checkpoint over a week-long run
•Storage tiering: recent checkpoints (last 48 hours) on premium fast storage for an RTO under 10 minutes; older checkpoints demoted to an infrequent-access tier (roughly 40 to 50% cheaper per GB, RTO 30 to 60 minutes) or to cold archive after one week
•Cross-region replication cost: $0.02 to $0.09 per GB egress; replicating a 600 GB checkpoint every 30 minutes costs roughly $575 to $2,600 per day; optimize by replicating only the daily best or weights-only snapshots
•Best checkpoint selection: continuously updated based on validation perplexity, F1, or accuracy; serves as deployment artifact and safeguard if training diverges in later stages
•Garbage collection: async cleanup deletes checkpoints outside the retention window; must be fault-tolerant (never delete the latest or best) and coordinated across distributed writers to avoid races (see the sketch after this list)
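A minimal sketch of such a garbage-collection pass over a local or mounted checkpoint directory, mirroring the retention logic sketched earlier; the `step_XXXX` directory naming and the choice to run it from a single janitor process are assumptions for illustration.

```python
import re
import shutil
from pathlib import Path

def gc_checkpoints(ckpt_dir: Path, best_step: int, k: int = 5) -> None:
    """Delete checkpoint directories outside the last-k window, never the latest or best."""
    step_dirs = {int(m.group(1)): d
                 for d in ckpt_dir.iterdir()
                 if (m := re.fullmatch(r"step_(\d+)", d.name))}
    if not step_dirs:
        return
    latest = max(step_dirs)
    keep = set(sorted(step_dirs)[-k:]) | {best_step, latest}  # safety: always keep latest + best
    for step, d in step_dirs.items():
        if step not in keep:
            shutil.rmtree(d, ignore_errors=True)  # idempotent; safe to re-run after a crash

# e.g. run hourly from a single designated rank or a separate janitor process:
# gc_checkpoints(Path("/ckpts/run42"), best_step=9000, k=5)
```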
📌 Examples
OpenAI GPT training: keeps last 5 checkpoints (90 TB for 18 TB each) on S3 Standard, best checkpoint replicated to second region; garbage collection runs hourly, storage cost ~$2000/month for active run
Meta OPT 175B: 1.4 TB checkpoints, last 4 plus best on Lustre parallel filesystem (7 TB total); after training completes, best checkpoint moved to S3 and Lustre snapshots deleted, reclaiming $15k/month capacity
Google T5 on TPU: checkpoints every 15 minutes to GCS Standard, auto-tiered to Nearline after 48 hours; best checkpoint kept in Standard class, others archived to Coldline after 7 days, reducing storage cost by ~60%