Cost Control: On Demand vs Spot, Scale to Zero, and Fractional Allocation
GPU compute cost dominates machine learning infrastructure budgets, making cost control strategies essential. A single on-demand NVIDIA A100 instance costs approximately $3 to $4 per hour on major cloud providers, which adds up to roughly $2,200 to $3,000 per month for always-on capacity. At scale, with dozens or hundreds of GPUs, monthly bills reach hundreds of thousands of dollars. Effective cost optimization requires combining multiple strategies that trade off reliability, availability, and operational complexity.
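The arithmetic behind these figures is simple enough to sanity-check; a minimal sketch in Python (the rates and the 730-hour average month are illustrative assumptions):

```python
# Back-of-the-envelope GPU cost model (rates are illustrative assumptions).
HOURS_PER_MONTH = 730  # average hours in a month

def monthly_cost(hourly_rate: float, gpu_count: int = 1) -> float:
    """Cost of running `gpu_count` GPUs continuously for one month."""
    return hourly_rate * HOURS_PER_MONTH * gpu_count

# A single on-demand A100 at $3-$4/hour:
print(f"1 A100, low end:  ${monthly_cost(3.0):,.0f}/month")   # ~$2,190
print(f"1 A100, high end: ${monthly_cost(4.0):,.0f}/month")   # ~$2,920

# A fleet of 100 GPUs quickly reaches hundreds of thousands per month:
print(f"100 A100s: ${monthly_cost(3.5, gpu_count=100):,.0f}/month")  # ~$255,500
```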
The foundational decision is capacity type selection. On-demand instances provide guaranteed availability and stable pricing, suitable for latency-critical inference serving production traffic with strict Service Level Objectives (SLOs). Spot or preemptible instances offer 60% to 80% discounts (reducing A100 cost from $3/hour to $0.60 to $1.20/hour) but can be interrupted with 30 to 120 seconds' notice when cloud providers reclaim capacity. This makes spot ideal for batch training, fine-tuning jobs, and opportunistic inference overflow traffic that tolerates interruptions through checkpointing and retry logic.
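That decision rule is mechanical enough to write down; a minimal sketch with a hypothetical Workload descriptor (the fields and routing logic are assumptions, not any provider's API):

```python
from dataclasses import dataclass

@dataclass
class Workload:
    # Hypothetical workload descriptor for illustration.
    name: str
    latency_critical: bool       # serves traffic under a strict SLO?
    interruption_tolerant: bool  # has checkpointing/retry logic?

def choose_capacity(w: Workload) -> str:
    """Pick a capacity type using the trade-offs described above."""
    if w.latency_critical:
        return "on-demand"   # guaranteed availability, stable pricing
    if w.interruption_tolerant:
        return "spot"        # 60-80% cheaper, 30-120s reclaim notice
    return "on-demand"       # default to safety when in doubt

print(choose_capacity(Workload("prod-inference", True, False)))   # on-demand
print(choose_capacity(Workload("batch-finetune", False, True)))   # spot
```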
Scale to zero eliminates idle costs by shutting down GPU node groups when no workloads are running. A development cluster with sporadic usage can reduce monthly costs from $15,000 (always on) to $2,000 (actual usage hours) through aggressive scale-to-zero policies. The trade-off is cold start latency: the first request after scale to zero waits 240+ seconds for node provisioning and model loading. Production systems therefore use a hybrid approach: scale to zero for batch and development workloads, while maintaining small warm pools (one to two replicas) for latency-critical inference.
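The $15,000-to-$2,000 figure falls out of the gap between provisioned and active hours; a rough sketch (the cluster size, rate, and ~95 active hours are assumptions chosen to match the numbers above):

```python
# Always-on vs. pay-for-usage for a sporadically used dev cluster.
HOURS_PER_MONTH = 730

def scaled_to_zero_cost(hourly_rate: float, gpus: int, active_hours: float) -> float:
    """Monthly cost when nodes only run during active hours."""
    return hourly_rate * gpus * active_hours

hourly_rate, gpus = 3.0, 7  # illustrative dev cluster (~$15k/month always-on)
always_on = hourly_rate * gpus * HOURS_PER_MONTH
actual = scaled_to_zero_cost(hourly_rate, gpus, active_hours=95)

print(f"always-on:     ${always_on:,.0f}/month")  # ~$15,330
print(f"scale-to-zero: ${actual:,.0f}/month")     # ~$2,000 at ~95 active hours
```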
Fractional GPU allocation through Multi-Instance GPU (MIG) or vGPU improves utilization by bin-packing multiple small workloads onto shared devices. Instead of seven separate V100 instances at $2.50/hour each ($17.50/hour total), seven small models can share one A100 with MIG at $3/hour, saving 83%. The technique works best for workloads needing under 4GB of memory that can tolerate multi-tenancy. Combining these strategies produces significant savings: a production system might use on-demand full GPUs for critical inference (30% of capacity), spot fractional GPUs for batch jobs (50% of capacity), and scale to zero for development (20% of capacity), reducing total costs by 40% to 60% compared to naive always-on, on-demand allocation.
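The 83% figure is just slice counting and rate arithmetic; a minimal sketch (the MIG slice profile and hourly rates are illustrative assumptions):

```python
import math

# MIG turns one A100 into up to seven isolated slices, so seven small
# models that each fit in a slice share one card instead of seven.
A100_MIG_SLICES = 7
SLICE_MEMORY_GB = 5.0             # 1g.5gb profile on a 40GB A100 (illustrative)
A100_RATE, V100_RATE = 3.0, 2.50  # $/hour, illustrative rates

models_gb = [3.5, 2.0, 3.8, 1.5, 3.0, 2.5, 3.9]  # all under ~4GB
assert all(m <= SLICE_MEMORY_GB for m in models_gb), "model too big for a slice"

a100s_needed = math.ceil(len(models_gb) / A100_MIG_SLICES)   # 1 shared A100
mig_cost = a100s_needed * A100_RATE                          # $3.00/hour
dedicated_cost = len(models_gb) * V100_RATE                  # $17.50/hour
print(f"MIG: ${mig_cost:.2f}/h vs dedicated: ${dedicated_cost:.2f}/h "
      f"-> {1 - mig_cost / dedicated_cost:.0%} saved")       # 83% saved
```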
💡 Key Takeaways
• On-demand A100 costs $3 to $4 per hour ($2,200 to $3,000 monthly always on) versus spot at $0.60 to $1.20 per hour with a 60% to 80% discount but only 30 to 120 seconds' interruption notice
• Scale to zero reduces idle costs by 80% to 90% for sporadic workloads (a development cluster drops from $15,000 to $2,000 monthly) but adds 240+ seconds of cold start latency on the first request
• Fractional MIG allocation allows seven small models on one A100 at $3/hour instead of seven V100s at $17.50/hour total, saving 83% through improved bin packing and utilization
• Spot instances require checkpointing every 5 to 10 minutes for training jobs and health-aware draining for inference to handle interruptions without losing progress or failing requests (see the sketch after this list)
• Hybrid capacity strategy allocates 30% on-demand full GPUs for Service Level Objective (SLO) critical inference, 50% spot fractional GPUs for batch-tolerant workloads, and 20% scale to zero for development, reducing total cost by 40% to 60%
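A spot-hardened training loop combines a fixed checkpoint cadence with a handler for the reclaim notice; a minimal, self-contained sketch (the SIGTERM delivery and the train/checkpoint stubs are assumptions standing in for real provider signals and application code):

```python
import signal
import time

# Hypothetical spot-hardened training loop: checkpoint on a fixed cadence
# and flush a final checkpoint when the reclaim notice (surfaced here as
# SIGTERM) arrives. The train/checkpoint stubs stand in for real work.
CHECKPOINT_INTERVAL_S = 8 * 60   # every 8 min, within the 5-10 min guidance
TOTAL_STEPS = 10_000

def train_one_step(step: int) -> None:    # stub: one optimizer step
    time.sleep(0.001)

def save_checkpoint(step: int) -> None:   # stub: persist model + step counter
    print(f"checkpointed at step {step}")

interrupted = False

def on_preemption(signum, frame):
    """Reclaim notice gives 30-120s: finish the current step, then save."""
    global interrupted
    interrupted = True

signal.signal(signal.SIGTERM, on_preemption)

step, last_save = 0, time.monotonic()
while step < TOTAL_STEPS and not interrupted:
    train_one_step(step)
    step += 1
    if time.monotonic() - last_save >= CHECKPOINT_INTERVAL_S:
        save_checkpoint(step)
        last_save = time.monotonic()

save_checkpoint(step)  # final save on normal exit or preemption drain
```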
📌 Examples
Production deployment: the standard GPU pool uses on-demand V100s (max 5 nodes) for reliable inference with a p95 latency SLO of 200ms; the large GPU pool uses spot A100s (max 3 nodes) for batch fine-tuning jobs with checkpointing every 8 minutes
Development environment scales to zero outside business hours (6pm to 8am plus weekends), reducing monthly GPU costs from $12,000 to $3,500 for a team of 15 engineers doing sporadic experimentation
Cost analysis: replacing always-on dedicated GPU inference ($8,000/month for 3 on-demand V100s) with scale to zero plus a warm pool of 1 replica and spot overflow ($2,800/month) saved 65% while keeping p99 latency under 250ms
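The last example's numbers can be reproduced directly; a sketch with back-solved assumptions (the ~$3.65/hour V100 rate and ~120 spot burst hours are chosen to match the stated totals, not quoted prices):

```python
HOURS = 730  # average hours per month

# Baseline: three always-on on-demand V100s (~$8,000/month implies ~$3.65/h each).
baseline = 3 * 3.65 * HOURS            # ~$7,990

# Replacement (illustrative assumptions): one warm on-demand replica,
# plus spot overflow billed only for the hours it actually runs.
warm_pool = 1 * 3.65 * HOURS           # ~$2,660 for the always-on replica
spot_overflow = 1.10 * 120             # ~$130: spot V100 for ~120 burst hours
replacement = warm_pool + spot_overflow

print(f"baseline: ${baseline:,.0f}, replacement: ${replacement:,.0f}, "
      f"saved: {1 - replacement / baseline:.0%}")  # ~$2,800, ~65% saved
```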