
Production ML Inference on Kubernetes with Autoscaling and Model Locality

Production inference platforms must balance latency Service Level Objectives (SLOs), cost efficiency, and operational simplicity. A typical architecture runs Triton Inference Server or a similar model server on Kubernetes with GPU-aware scheduling, autoscaling driven by real signals, and aggressive model locality strategies. Consider a vision-and-language model serving platform with 100 GPU nodes and 8 A100s per node, totaling 800 devices. With Multi-Instance GPU (MIG) enabled in the 1g.5gb profile, logical capacity reaches 5,600 isolated instances. Each instance serves a small vision model at 100 to 300 images per second with p99 latency of 50 to 80 milliseconds at batch size 8.

Autoscaling watches GPU duty cycle (not utilization, which can be misleading with MIG) and request queue depth. When p95 latency exceeds 75 percent of the SLO budget for three consecutive minutes, or queue depth indicates more than 100 milliseconds of waiting time, the system adds replicas. New pods spin up in under 20 seconds because model weights are cached on node-local NVMe.

Model locality is critical for cold-start performance. Loading a 20 GB model from remote object storage at 1 Gbps takes approximately 160 seconds, violating SLOs during scale-up. Solutions include maintaining warm pools of pods with pre-loaded weights for the top 5 to 10 models, caching 5 to 20 GB of model data per node on NVMe, and baking model weights into image layers for faster pulls. Netflix uses similar patterns for GPU-accelerated media processing, keeping frequently used models warm to avoid cold starts during traffic spikes.

Cost control requires preventing non-GPU workloads from landing on expensive GPU nodes. Node taints, tolerations, and separate node pools enforce this boundary. Spot instances for batch inference can reduce costs by 60 to 80 percent compared to on-demand capacity, but they require checkpointing and graceful degradation when instances are preempted.
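The scaling policy above reduces to a small decision rule. The following is a minimal sketch in Python, assuming hypothetical inputs (rolling p95 latency samples, current queue depth, per-replica throughput); it is an illustration of the trigger conditions described in the text, not a real controller, and every name in it is a placeholder.

```python
"""Illustrative scale-up decision, assuming hypothetical metric inputs."""

from dataclasses import dataclass


@dataclass
class ScalingConfig:
    slo_budget_ms: float = 80.0        # latency budget from the SLO
    latency_fraction: float = 0.75     # scale when p95 exceeds 75% of budget
    sustained_windows: int = 3         # consecutive one-minute windows
    max_queue_wait_ms: float = 100.0   # tolerated queueing delay


def queue_wait_ms(queue_depth: int, per_replica_rps: float, replicas: int) -> float:
    """Estimated wait = items in queue / aggregate service rate."""
    aggregate_rps = max(per_replica_rps * replicas, 1e-9)
    return 1000.0 * queue_depth / aggregate_rps


def should_scale_up(p95_history_ms: list[float],
                    queue_depth: int,
                    per_replica_rps: float,
                    replicas: int,
                    cfg: ScalingConfig = ScalingConfig()) -> bool:
    """True if either trigger from the text holds: sustained latency breach or queue backlog."""
    threshold = cfg.latency_fraction * cfg.slo_budget_ms
    recent = p95_history_ms[-cfg.sustained_windows:]
    latency_breach = (len(recent) == cfg.sustained_windows
                      and all(sample > threshold for sample in recent))
    queue_breach = queue_wait_ms(queue_depth, per_replica_rps, replicas) > cfg.max_queue_wait_ms
    return latency_breach or queue_breach


if __name__ == "__main__":
    # Three consecutive minutes of p95 above 60 ms (75% of an 80 ms budget) -> scale up.
    print(should_scale_up([62.0, 65.0, 61.0], queue_depth=40,
                          per_replica_rps=150.0, replicas=20))  # True
```

In practice the same rule would be fed by scraped metrics and wired into a horizontal autoscaler rather than evaluated inside application code.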
💡 Key Takeaways
Autoscale on request queue depth and SLO compliance, not raw GPU utilization. With MIG, utilization can appear low even when instances are saturated. Scale when p95 latency exceeds 75 percent of the budget or queue depth indicates more than 100 milliseconds of wait time.
Model locality reduces cold start from 160 seconds to under 20 seconds for a 20 GB model. Cache hot models on node-local NVMe (5 to 20 GB per node) and maintain warm pools of pre-loaded pods for the top 5 to 10 models by traffic volume; a cache sketch follows this list.
MIG slices provide isolation for multi-tenant serving. A single A100 in the 1g.5gb profile yields seven instances, each serving 100 to 300 images per second at p99 latency of 50 to 80 milliseconds. Total per-GPU throughput reaches 700 to 2,100 images per second.
Cost control requires strict node pool separation. Use taints and tolerations to prevent non-GPU pods from landing on expensive GPU nodes. Spot instances for batch workloads reduce costs by 60 to 80 percent but require graceful handling of preemption.
Monitoring must be instance-aware with MIG. Aggregate GPU metrics mislead autoscaling. Track per-slice duty cycle, memory usage, kernel execution time, and queue depth, and alert when any slice exceeds 85 percent duty cycle for a sustained period; a monitoring sketch follows this list.
Replica scale-up time under 1 minute is achievable with warm pools and local caching. Without them, cold starts from remote storage can take 2 to 5 minutes, causing cascading SLO violations during traffic spikes.
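The model-locality takeaway boils down to a check-then-fetch pattern against node-local NVMe. Below is a minimal sketch, assuming a hypothetical cache mount at /mnt/nvme/model-cache and a placeholder download helper; the bucket name and paths are illustrative, not part of any real deployment.

```python
"""Sketch of a node-local NVMe model cache; paths and helper are hypothetical."""

import shutil
from pathlib import Path

# Hypothetical node-local NVMe mount and remote object-store location.
NVME_CACHE = Path("/mnt/nvme/model-cache")
REMOTE_BUCKET = "s3://example-models"  # placeholder, not a real bucket


def download_from_object_storage(model_name: str, dest: Path) -> None:
    """Placeholder for a real object-store client (boto3, gcsfs, etc.)."""
    raise NotImplementedError(f"fetch {REMOTE_BUCKET}/{model_name} into {dest}")


def model_path(model_name: str) -> Path:
    """Return a local path to the model, downloading only on a cache miss."""
    local = NVME_CACHE / model_name
    if local.exists():
        return local  # warm path: read from NVMe in seconds
    # Cold path: 20 GB over a 1 Gbps link is roughly 20 * 8 / 1 = 160 seconds.
    tmp = NVME_CACHE / (model_name + ".partial")
    tmp.mkdir(parents=True, exist_ok=True)
    download_from_object_storage(model_name, tmp)
    shutil.move(str(tmp), str(local))  # publish only after the download completes
    return local
```

An init container or node-level daemon typically performs this check before the model server starts, so the server itself only ever reads from the local path.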
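For the instance-aware monitoring takeaway, the sketch below shows the alert rule in isolation (per-slice duty cycle above 85 percent for a sustained window), assuming per-slice samples have already been collected; the field names are illustrative and do not correspond to any specific exporter's schema.

```python
"""Sketch of per-slice (MIG-aware) saturation alerting; field names are hypothetical."""

from dataclasses import dataclass


@dataclass
class SliceSample:
    gpu: str            # physical GPU, e.g. "node-17/gpu-3"
    slice_id: str       # MIG instance identifier
    duty_cycle: float   # fraction of the sampling window the slice was busy (0..1)
    mem_used_gb: float
    queue_depth: int


def saturated_slices(samples_by_slice: dict[str, list[SliceSample]],
                     duty_threshold: float = 0.85,
                     sustained_samples: int = 5) -> list[str]:
    """Return slice IDs whose duty cycle exceeded the threshold for the last N samples."""
    alerts = []
    for slice_id, history in samples_by_slice.items():
        recent = history[-sustained_samples:]
        if len(recent) == sustained_samples and all(s.duty_cycle > duty_threshold for s in recent):
            alerts.append(slice_id)
    return alerts
```

Evaluating the rule per slice rather than per GPU is the point: an aggregate average across seven slices can look healthy while one slice is fully saturated.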
📌 Examples
A computer vision API serves 20 models on 100 nodes with 5,600 MIG slices. The top 3 models by volume account for 70 percent of traffic and are pre-loaded in warm pools. During a traffic spike, 50 new replicas spin up in 18 seconds on average, compared to 140 seconds without caching.
An NLP inference service autoscales from 200 to 600 pods in 4 minutes when traffic triples. Queue-depth-based scaling triggers faster than latency-based scaling, adding capacity before SLOs are violated; p99 latency stays below 90 milliseconds throughout the event.
Netflix runs GPU-accelerated media processing on Kubernetes with node-local caching for frequently used models. This reduces cold-start latency by 85 percent compared to pulling from S3, enabling aggressive autoscaling during peak viewing hours without SLO violations.