Docker Containers for Model Serving: Building Lean Inference Images
Docker containers provide immutable, reproducible execution environments for model serving, but naive packaging creates bloated artifacts that cripple deployment speed. A typical mistake is copying the training environment into the image wholesale: Python development tools, Jupyter notebooks, training frameworks, CUDA development kits, and debugging utilities. The resulting containers balloon to 2 to 4 GB and take 45 to 90 seconds to pull and start, which kills autoscaling responsiveness during traffic spikes.
Lean inference containers strip everything except the runtime and the model. Start from a minimal base image such as python:3.10-slim or an nvidia/cuda:11.8 runtime image (not devel). Use multi-stage builds in which the first stage handles compilation and the second stage copies only the resulting binaries. For an ONNX Runtime serving container, you need the runtime library (about 10 MB), the model artifact (50 to 500 MB), preprocessing code, and a thin API wrapper. The result is 100 to 300 MB versus 1.5 to 2 GB for a training image. At Netflix, this optimization reduced cold start from 38 seconds to 4 seconds, allowing horizontal pod autoscaling to react to load within a single minute instead of several minutes.
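A minimal multi-stage Dockerfile for this kind of ONNX Runtime service might look like the sketch below. The file names (requirements.txt, model.onnx, serve.py) and the uvicorn entrypoint are illustrative assumptions, not a prescribed layout; pin base tags and dependency versions to match your environment.

```dockerfile
# --- Stage 1: build wheels with the full toolchain (compilers, headers) ---
FROM python:3.10 AS builder
WORKDIR /build
COPY requirements.txt .
# Pre-build wheels for onnxruntime, fastapi, uvicorn, numpy, etc.
RUN pip wheel --no-cache-dir --wheel-dir /build/wheels -r requirements.txt

# --- Stage 2: slim runtime image with only wheels, model, and API wrapper ---
FROM python:3.10-slim AS inference
WORKDIR /app
COPY --from=builder /build/wheels /wheels
RUN pip install --no-cache-dir --no-index --find-links=/wheels /wheels/* \
    && rm -rf /wheels
COPY model.onnx serve.py ./
EXPOSE 8000
CMD ["uvicorn", "serve:app", "--host", "0.0.0.0", "--port", "8000"]
```

The key property is that compilers, headers, and pip's build caches never reach the final stage; only the installed wheels, the model artifact, and the serving code do.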
The container should include a warmup routine that executes representative inputs at startup. This forces just-in-time compilation, kernel selection, and cache population before the first real request arrives. Without warmup, first-request latency can spike to between 500 milliseconds and 2 seconds, while subsequent requests hit 20 to 50 milliseconds. Separate preprocessing libraries from the model container when possible, using sidecars or upstream services for heavy operations like image decoding or text tokenization. This prevents CPU-bound preprocessing from masking GPU underutilization and lets you scale preprocessing and inference independently.
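A warmup routine along these lines can run before the pod reports ready. This sketch assumes an ONNX model at /app/model.onnx with a single float32 input; adapt shapes and dtypes to your model's actual signature.

```python
# Hedged sketch of a startup warmup for an ONNX Runtime service.
# Assumes a single float32 input; dynamic dimensions are filled with 1.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("/app/model.onnx")

def warmup(sess: ort.InferenceSession, n_runs: int = 8) -> None:
    """Execute a few representative inputs so kernel selection, memory
    arenas, and any lazy initialization happen before real traffic."""
    meta = sess.get_inputs()[0]
    # Replace symbolic/dynamic dims (e.g. batch size) with a representative value.
    shape = [d if isinstance(d, int) else 1 for d in meta.shape]
    dummy = np.random.rand(*shape).astype(np.float32)
    for _ in range(n_runs):
        sess.run(None, {meta.name: dummy})

warmup(session)  # call once at container startup, before serving traffic
```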
💡 Key Takeaways
•Multi-stage Docker builds separate compilation from runtime: the build stage includes compilers and development headers (1 to 2 GB), while the final stage copies only binaries and runtime libraries, reducing the image to 100 to 300 MB total
•Base image choice matters: the nvidia/cuda:11.8 runtime image is 1.2 GB versus the devel image at 3.8 GB, and python:3.10-slim is 120 MB versus python:3.10 at 880 MB with unnecessary tooling
•Warmup routines execute 5 to 10 representative inputs during container startup, forcing kernel compilation and cache population to cut first-request latency from 500 to 2,000 milliseconds down to the steady-state 20 to 50 milliseconds
•Container size directly impacts autoscaling speed: 100 MB images pull and start in 3 to 5 seconds, enabling pod scale-out within 30 to 60 seconds, while 2 GB images take 45 to 90 seconds, delaying response to traffic spikes by several minutes
•Uber's model serving platform reduced inference container sizes from 1.8 GB to 180 MB using slim bases and multi-stage builds, cutting cold start time from 42 seconds to 5 seconds and improving autoscaling responsiveness by 8x
📌 Examples
Netflix recommendation serving: uses a python:3.10-slim base with ONNX Runtime, the model artifact, and a FastAPI wrapper totaling 220 MB, achieving a 4 second cold start and supporting scale from 100 to 2000 pods in under 2 minutes during evening traffic peaks
Multi-stage Dockerfile pattern: a builder stage FROM an nvidia/cuda:11.8 devel image (compile custom ops), then an inference stage FROM the matching runtime image (copy binaries only); final image 340 MB versus 2.1 GB for a single-stage build (see the sketch below)
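As a rough illustration of that devel-to-runtime split, with assumed image tags and a hypothetical custom-op build step, the pattern looks like this:

```dockerfile
# Stage 1: full CUDA toolkit to compile a hypothetical custom op library.
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04 AS builder
WORKDIR /build
COPY custom_ops/ .
RUN make            # assumed to produce libcustom_ops.so

# Stage 2: runtime-only CUDA image; copy just the compiled artifact and model.
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04 AS inference
COPY --from=builder /build/libcustom_ops.so /opt/ops/libcustom_ops.so
COPY model.onnx serve.py /app/
# Install the GPU runtime and start the server as in the slim-image example.
```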