
Production ML Inference on Kubernetes with Autoscaling and Model Locality

Production inference platforms must balance latency Service Level Objectives (SLOs), cost efficiency, and operational simplicity. A typical architecture runs Triton Inference Server or a similar model server on Kubernetes with GPU-aware scheduling, autoscaling driven by real signals, and aggressive model locality strategies. Consider a vision-and-language model serving platform with 100 GPU nodes and 8 A100s per node, totaling 800 devices. With Multi-Instance GPU (MIG) enabled in the 1g.5gb profile, logical capacity reaches 5,600 isolated instances. Each instance serves a small vision model at 100 to 300 images per second with p99 latency of 50 to 80 milliseconds at batch size 8.

Autoscaling watches GPU duty cycle (not utilization, which can be misleading with MIG) and request queue depth. When p95 latency exceeds 75 percent of the SLO budget for three consecutive minutes, or queue depth indicates more than 100 milliseconds of waiting time, the system adds replicas. New pods spin up in under 20 seconds because model weights are cached on node-local NVMe.

Model locality is critical for cold-start performance. Loading a 20 GB model from remote object storage at 1 Gbps takes approximately 160 seconds, violating SLOs during scale-up. Solutions include maintaining warm pools of pods with pre-loaded weights for the top 5 to 10 models, caching 5 to 20 GB of model data per node on NVMe, and baking model weights into image layers for faster pulls. Netflix uses similar patterns for GPU-accelerated media processing, keeping frequently used models warm to avoid cold starts during traffic spikes.

Cost control requires preventing non-GPU workloads from landing on expensive GPU nodes. Node taints, tolerations, and separate node pools enforce this boundary. Spot instances for batch inference can reduce costs by 60 to 80 percent compared to on-demand capacity, but they require checkpointing and graceful degradation when instances are preempted.
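The scaling policy above reduces to a small decision rule. The following is a minimal sketch in Python, assuming hypothetical inputs (rolling p95 latency samples, current queue depth, per-replica throughput); it is an illustration of the trigger conditions described in the text, not a real controller, and every name in it is a placeholder.

```python
"""Illustrative scale-up decision, assuming hypothetical metric inputs."""

from dataclasses import dataclass


@dataclass
class ScalingConfig:
    slo_budget_ms: float = 80.0        # latency budget from the SLO
    latency_fraction: float = 0.75     # scale when p95 exceeds 75% of budget
    sustained_windows: int = 3         # consecutive one-minute windows
    max_queue_wait_ms: float = 100.0   # tolerated queueing delay


def queue_wait_ms(queue_depth: int, per_replica_rps: float, replicas: int) -> float:
    """Estimated wait = items in queue / aggregate service rate."""
    aggregate_rps = max(per_replica_rps * replicas, 1e-9)
    return 1000.0 * queue_depth / aggregate_rps


def should_scale_up(p95_history_ms: list[float],
                    queue_depth: int,
                    per_replica_rps: float,
                    replicas: int,
                    cfg: ScalingConfig = ScalingConfig()) -> bool:
    """True if either trigger from the text holds: sustained latency breach or queue backlog."""
    threshold = cfg.latency_fraction * cfg.slo_budget_ms
    recent = p95_history_ms[-cfg.sustained_windows:]
    latency_breach = (len(recent) == cfg.sustained_windows
                      and all(sample > threshold for sample in recent))
    queue_breach = queue_wait_ms(queue_depth, per_replica_rps, replicas) > cfg.max_queue_wait_ms
    return latency_breach or queue_breach


if __name__ == "__main__":
    # Three consecutive minutes of p95 above 60 ms (75% of an 80 ms budget) -> scale up.
    print(should_scale_up([62.0, 65.0, 61.0], queue_depth=40,
                          per_replica_rps=150.0, replicas=20))  # True
```

In practice the same rule would be fed by scraped metrics and wired into a horizontal autoscaler rather than evaluated inside application code.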
💡 Key Takeaways
Autoscale on request queue depth and SLO compliance, not raw GPU utilization. With MIG, utilization can appear low even when instances are saturated. Scale when p95 latency exceeds 75 percent of the budget or queue depth indicates more than 100 milliseconds of wait time.
Model locality reduces cold start from 160 seconds to under 20 seconds for a 20 GB model. Cache hot models on node-local NVMe (5 to 20 GB per node) and maintain warm pools of pre-loaded pods for the top 5 to 10 models by traffic volume; a cache sketch follows this list.
MIG slices provide isolation for multi-tenant serving. A single A100 in the 1g.5gb profile yields seven instances, each serving 100 to 300 images per second at p99 latency of 50 to 80 milliseconds. Total per-GPU throughput reaches 700 to 2,100 images per second.
Cost control requires strict node pool separation. Use taints and tolerations to prevent non-GPU pods from landing on expensive GPU nodes. Spot instances for batch workloads reduce costs by 60 to 80 percent but require graceful handling of preemption.
Monitoring must be instance-aware with MIG. Aggregate GPU metrics mislead autoscaling. Track per-slice duty cycle, memory usage, kernel execution time, and queue depth, and alert when any slice exceeds 85 percent duty cycle for a sustained period; a monitoring sketch follows this list.
Replica scale-up time under 1 minute is achievable with warm pools and local caching. Without them, cold starts from remote storage can take 2 to 5 minutes, causing cascading SLO violations during traffic spikes.
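The model-locality takeaway boils down to a check-then-fetch pattern against node-local NVMe. Below is a minimal sketch, assuming a hypothetical cache mount at /mnt/nvme/model-cache and a placeholder download helper; the bucket name and paths are illustrative, not part of any real deployment.

```python
"""Sketch of a node-local NVMe model cache; paths and helper are hypothetical."""

import shutil
from pathlib import Path

# Hypothetical node-local NVMe mount and remote object-store location.
NVME_CACHE = Path("/mnt/nvme/model-cache")
REMOTE_BUCKET = "s3://example-models"  # placeholder, not a real bucket


def download_from_object_storage(model_name: str, dest: Path) -> None:
    """Placeholder for a real object-store client (boto3, gcsfs, etc.)."""
    raise NotImplementedError(f"fetch {REMOTE_BUCKET}/{model_name} into {dest}")


def model_path(model_name: str) -> Path:
    """Return a local path to the model, downloading only on a cache miss."""
    local = NVME_CACHE / model_name
    if local.exists():
        return local  # warm path: read from NVMe in seconds
    # Cold path: 20 GB over a 1 Gbps link is roughly 20 * 8 / 1 = 160 seconds.
    tmp = NVME_CACHE / (model_name + ".partial")
    tmp.mkdir(parents=True, exist_ok=True)
    download_from_object_storage(model_name, tmp)
    shutil.move(str(tmp), str(local))  # publish only after the download completes
    return local
```

An init container or node-level daemon typically performs this check before the model server starts, so the server itself only ever reads from the local path.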
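For the instance-aware monitoring takeaway, the sketch below shows the alert rule in isolation (per-slice duty cycle above 85 percent for a sustained window), assuming per-slice samples have already been collected; the field names are illustrative and do not correspond to any specific exporter's schema.

```python
"""Sketch of per-slice (MIG-aware) saturation alerting; field names are hypothetical."""

from dataclasses import dataclass


@dataclass
class SliceSample:
    gpu: str            # physical GPU, e.g. "node-17/gpu-3"
    slice_id: str       # MIG instance identifier
    duty_cycle: float   # fraction of the sampling window the slice was busy (0..1)
    mem_used_gb: float
    queue_depth: int


def saturated_slices(samples_by_slice: dict[str, list[SliceSample]],
                     duty_threshold: float = 0.85,
                     sustained_samples: int = 5) -> list[str]:
    """Return slice IDs whose duty cycle exceeded the threshold for the last N samples."""
    alerts = []
    for slice_id, history in samples_by_slice.items():
        recent = history[-sustained_samples:]
        if len(recent) == sustained_samples and all(s.duty_cycle > duty_threshold for s in recent):
            alerts.append(slice_id)
    return alerts
```

Evaluating the rule per slice rather than per GPU is the point: an aggregate average across seven slices can look healthy while one slice is fully saturated.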
📌 Examples
A computer vision API serves 20 models on 100 nodes with 5,600 MIG slices. The top 3 models by volume account for 70 percent of traffic and are pre-loaded in warm pools. During a traffic spike, 50 new replicas spin up in 18 seconds on average, compared to 140 seconds without caching.
An NLP inference service autoscales from 200 to 600 pods in 4 minutes when traffic triples. Queue-depth-based scaling triggers faster than latency-based scaling, adding capacity before SLOs are violated; p99 latency stays below 90 milliseconds throughout the event.
Netflix runs GPU-accelerated media processing on Kubernetes with node-local caching for frequently used models. This reduces cold-start latency by 85 percent compared to pulling from S3, enabling aggressive autoscaling during peak viewing hours without SLO violations.