Checkpoint Storage Strategy: Retention, Tiering, and Cost Optimization
The Storage Scale Problem
Managing checkpoint storage at scale requires a deliberate retention and tiering strategy. A naive approach that keeps every checkpoint indefinitely quickly becomes prohibitively expensive: a single 18 TB checkpoint written every 30 minutes over a 7-day training run generates 336 checkpoints totaling over 6 petabytes. Even on cheap object storage, that is roughly $140,000 per month in storage costs for one run (at a typical standard-tier rate of about $0.023 per GB-month).
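The back-of-envelope arithmetic above can be checked directly. The checkpoint size, interval, and run length come from the text; the $0.023 per GB-month rate is an assumed standard-tier object storage price, used only to illustrate the scale.

```python
# Cost of a naive "keep every checkpoint" policy.
TB = 1_000  # GB per TB (decimal units, as storage vendors bill)

checkpoint_size_gb = 18 * TB        # one 18 TB checkpoint
num_checkpoints = 7 * 24 * 2        # every 30 min for 7 days = 336
total_gb = checkpoint_size_gb * num_checkpoints

price_per_gb_month = 0.023          # assumed standard-tier rate
monthly_cost = total_gb * price_per_gb_month

print(num_checkpoints)              # 336
print(total_gb / 1_000_000)         # ~6.05 PB
print(round(monthly_cost))          # ~139,000 per month
```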
Last-K Retention Policy
Production systems use retention policies that keep only the last K checkpoints (typically 3 to 5) plus a designated "best" checkpoint selected by validation metric. With K between 3 and 5, the retention window covers the last few hours of training (at 30-minute intervals, K = 5 spans 2.5 hours), enough to recover from transient issues without losing too much progress. The "best" checkpoint is retained indefinitely and updated whenever validation metrics improve.
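A last-K-plus-best policy reduces to a small set computation at cleanup time. The sketch below is a minimal illustration; the function name, the representation of checkpoints as step numbers, and the default K are assumptions, not an API from the text.

```python
def checkpoints_to_delete(steps, best_step, k=5):
    """Return checkpoint steps eligible for deletion under a
    last-K-plus-best policy: keep the K most recent checkpoints
    plus the designated best (by validation metric); everything
    else can be removed. Illustrative sketch only."""
    keep = set(sorted(steps)[-k:])   # the K most recent checkpoints
    keep.add(best_step)              # the "best" checkpoint is kept forever
    return sorted(s for s in steps if s not in keep)

# With K = 3, steps 400/500/600 survive as the last-K window and
# step 200 survives as "best"; 100 and 300 are deleted.
print(checkpoints_to_delete([100, 200, 300, 400, 500, 600],
                            best_step=200, k=3))   # [100, 300]
```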
Storage Tiering
Storage tiering further reduces costs by moving older checkpoints to cheaper, slower storage classes. High-frequency checkpoints are written to premium storage to keep the recovery time objective (RTO) under 10 minutes. After 24 to 48 hours, checkpoints outside the last-K window are demoted to mid-tier storage. After a week, they move to cold storage or are deleted entirely.
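The tiering rules above amount to a small age-based decision function. This is a sketch under the text's thresholds (48 hours to mid-tier, one week to cold); the tier names and function signature are illustrative assumptions.

```python
from datetime import timedelta

def target_tier(age, in_last_k):
    """Pick a storage class for a checkpoint given its age and
    whether it falls inside the last-K retention window.
    Tier names are illustrative, not a real storage API."""
    if in_last_k:
        return "premium"              # fast restore: RTO under 10 min
    if age < timedelta(hours=48):
        return "premium"              # still recent enough to keep hot
    if age < timedelta(days=7):
        return "mid"                  # demoted out of the last-K window
    return "cold"                     # archive tier, or delete entirely

print(target_tier(timedelta(hours=1), in_last_k=True))    # premium
print(target_tier(timedelta(hours=72), in_last_k=False))  # mid
print(target_tier(timedelta(days=10), in_last_k=False))   # cold
```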
Cross Region Replication
Cross-region replication of the latest and best checkpoints provides disaster recovery. Replicating every checkpoint is expensive due to egress bandwidth costs, so teams replicate only critical checkpoints, asynchronously. Some teams optimize further by replicating only model weights (a few hundred GB) rather than full state (which includes multi-terabyte optimizer buffers), accepting that cross-region recovery will require reinitializing optimizer state but preserving model progress.
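The egress savings from weights-only replication are easy to quantify. In the sketch below, the $0.02/GB egress rate is an assumed figure, and the 300 GB weights size is one concrete reading of the text's "a few hundred GB"; only the 18 TB full-state size comes from the earlier example.

```python
def replication_cost(size_gb, egress_rate=0.02):
    """Cross-region transfer cost in dollars for one replication.
    The default $0.02/GB egress rate is an illustrative assumption."""
    return size_gb * egress_rate

weights_gb = 300        # assumed weights-only size ("a few hundred GB")
full_state_gb = 18_000  # full checkpoint including optimizer buffers

print(replication_cost(weights_gb))     # 6.0  dollars per replication
print(replication_cost(full_state_gb))  # 360.0 dollars per replication
```

At 48 replications per day (one per 30-minute checkpoint), the gap compounds quickly, which is why replicating only the latest and best checkpoints, and only their weights, is the common trade-off.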