
Checkpointing and Fault Tolerance for Interruptible Workloads

Spot Instances can disappear with only 2 minutes of notice. Without checkpointing, a 12 hour training job interrupted at hour 11 loses everything. Checkpointing periodically saves progress to durable storage so work resumes from the last checkpoint instead of restarting from scratch. The checkpoint interval balances wasted compute against storage and latency overhead.

For ML training, checkpoint every 5 to 10 minutes. Each checkpoint writes model weights, optimizer state, and the current epoch to object storage like Amazon S3 or Google Cloud Storage. If an interruption hits, the orchestrator reschedules the job on any available worker, and the new worker loads the latest checkpoint and continues. With 5 minute checkpoints, each interruption wastes on average half the checkpoint interval, about 2.5 minutes of recomputed work. At a 3 percent hourly interruption rate, a 24 hour job expects roughly 0.7 interruptions, so the expected loss is on the order of 2 minutes, negligible compared to hours saved by using Spot.

Batch data pipelines use similar patterns but checkpoint at task granularity. A pipeline with 10,000 tasks writes completion markers to a coordination service like Apache ZooKeeper or a database. When a worker disappears mid task, the orchestrator marks that task as failed and reschedules it. Idempotent task design ensures that rerunning a partially completed task produces correct output, even if some side effects like intermediate file writes happened twice. Netflix and Uber run large Extract, Transform, Load (ETL) and encoding pipelines this way, with orchestrators automatically handling thousands of task reschedules per day without manual intervention.
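As a rough illustration, a periodic checkpoint loop in PyTorch with boto3 might look like the sketch below. The bucket name, object key, staging path, and 5 minute cadence are illustrative assumptions, not a prescribed setup.

```python
# A minimal sketch of interval-based checkpointing for a PyTorch training loop.
# Bucket, key, staging path, and interval are hypothetical placeholders.
import time

import boto3
import torch
from botocore.exceptions import ClientError

CHECKPOINT_EVERY_SECS = 300                               # ~5 minutes
LOCAL_CKPT = "/tmp/ckpt.pt"                               # local staging file
BUCKET, KEY = "my-training-bucket", "runs/exp42/ckpt.pt"  # hypothetical S3 location

s3 = boto3.client("s3")


def save_checkpoint(model, optimizer, epoch):
    # Write weights, optimizer state, and progress locally, then copy to S3
    # so the checkpoint survives the instance.
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "epoch": epoch},
        LOCAL_CKPT,
    )
    s3.upload_file(LOCAL_CKPT, BUCKET, KEY)


def load_checkpoint(model, optimizer):
    # Resume from the latest checkpoint if one exists; otherwise start fresh.
    try:
        s3.download_file(BUCKET, KEY, LOCAL_CKPT)
    except ClientError:
        return 0
    state = torch.load(LOCAL_CKPT)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"]


def train(model, optimizer, data_loader, num_epochs):
    start_epoch = load_checkpoint(model, optimizer)       # no-op on a fresh run
    last_save = time.time()
    for epoch in range(start_epoch, num_epochs):
        for batch in data_loader:
            loss = model(batch).mean()                    # placeholder forward pass / loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if time.time() - last_save >= CHECKPOINT_EVERY_SECS:
                save_checkpoint(model, optimizer, epoch)
                last_save = time.time()
        # Resuming restarts the interrupted epoch; mid-epoch resume would also
        # need data loader / sampler state, omitted here for brevity.
```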
💡 Key Takeaways
Checkpoint every 5 to 10 minutes for ML training, saving model state to durable object storage like S3, so interruptions waste only minutes instead of hours
With 5 minute checkpoints, each interruption loses about 2.5 minutes of work on average; at a 3 percent hourly interruption rate, that comes to roughly 2 minutes of expected loss over a 24 hour training run
Batch pipelines checkpoint at task granularity, writing completion markers so orchestrators can reschedule individual failed tasks without rerunning entire stages
Idempotent task design ensures reruns produce correct output even if partial work happened, critical for exactly once semantics in data processing (see the completion marker sketch after this list)
Netflix and Uber handle thousands of automatic task reschedules daily in ETL pipelines, with orchestrators transparently recovering from Spot interruptions
Trade off checkpoint frequency against storage cost and I/O latency, with more frequent checkpoints reducing wasted compute but increasing storage operations
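One way the completion-marker pattern above could be sketched is shown below; the DynamoDB table, S3 bucket, and key layout are hypothetical, and a real pipeline would let the orchestrator drive retries.

```python
# Minimal sketch of task-granularity checkpointing with completion markers.
# Table and bucket names are placeholders. Pattern: skip tasks already marked
# done, write output to a deterministic key, then record the marker.
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
s3 = boto3.client("s3")

MARKER_TABLE = "pipeline-task-markers"   # hypothetical DynamoDB table
OUTPUT_BUCKET = "pipeline-output"        # hypothetical S3 bucket


def run_task(task_id: str, process) -> None:
    # Skip work that a previous (possibly interrupted) worker already finished.
    marker = dynamodb.get_item(
        TableName=MARKER_TABLE, Key={"task_id": {"S": task_id}}
    )
    if "Item" in marker:
        return

    # Idempotent output: a deterministic key means a rerun overwrites the same
    # object instead of producing duplicates. `process` is a placeholder that
    # returns the output bytes for this task.
    output_bytes = process(task_id)
    s3.put_object(Bucket=OUTPUT_BUCKET, Key=f"results/{task_id}", Body=output_bytes)

    # Record completion last, so a crash between the output write and the
    # marker write simply causes a harmless rerun.
    try:
        dynamodb.put_item(
            TableName=MARKER_TABLE,
            Item={"task_id": {"S": task_id}},
            ConditionExpression="attribute_not_exists(task_id)",
        )
    except ClientError as err:
        # A conditional check failure means another worker already recorded
        # the marker, which is fine; anything else is a real error.
        if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise
```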
📌 Examples
ML training: Save model checkpoint every 5 minutes to S3. Interruption at minute 47 resumes from minute 45 checkpoint, losing only 2 minutes of progress
Data pipeline: 10,000 map tasks, each processing a 1 gigabyte (GB) file. Mark each task complete in DynamoDB after its output is written. Interrupted tasks are automatically rescheduled, with idempotent processing ensuring correct results
Feature computation: Workers checkpoint progress every 1,000 records to Redis. On failure, a new worker reads the checkpoint offset and continues from the last completed batch (see the sketch below)
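A rough sketch of the Redis offset checkpoint from the feature computation example, assuming in-memory records and the redis-py client; the endpoint, key name, and batch size are placeholders.

```python
# Minimal sketch of record-offset checkpointing with Redis.
import redis

r = redis.Redis(host="localhost", port=6379)  # hypothetical Redis endpoint
CHECKPOINT_KEY = "feature-job:42:offset"      # hypothetical job identifier
BATCH_SIZE = 1000


def compute_features(records, process_batch):
    # Resume from the last committed offset, or start at zero on a fresh run.
    start = int(r.get(CHECKPOINT_KEY) or 0)
    for offset in range(start, len(records), BATCH_SIZE):
        process_batch(records[offset:offset + BATCH_SIZE])
        # Commit the offset only after the batch is fully processed, so an
        # interrupted batch is redone rather than skipped.
        r.set(CHECKPOINT_KEY, offset + BATCH_SIZE)
```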