Building Lean Inference Containers: Multi-Stage Builds and Optimization Patterns
Building production-ready inference containers requires deliberate optimization to minimize image size and startup time. The multi-stage build pattern separates compilation from runtime: the builder stage includes development tools such as compilers, headers, and build systems (1 to 2 GB), while the final runtime stage copies only the compiled binaries, shared libraries, and the model artifact. For an Open Neural Network Exchange (ONNX) Runtime serving container, the builder installs cmake, g++, and development headers to compile custom operators; the runtime stage then starts from a minimal base such as ubuntu:22.04 (77 MB) or python:3.10-slim (120 MB) and copies just the ONNX Runtime library (approximately 10 MB) and the model.
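A minimal sketch of this pattern follows. The custom_ops/ source tree, the resulting libcustom_ops.so, serve.py, and model.onnx are illustrative placeholders, not names prescribed by the text; exact versions and paths will differ per project.

```dockerfile
# --- builder stage: full toolchain for compiling custom ONNX Runtime operators ---
FROM python:3.10 AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential cmake \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /build
COPY custom_ops/ ./custom_ops/               # hypothetical custom-operator sources
RUN cmake -S custom_ops -B custom_ops/build \
    && cmake --build custom_ops/build        # produces libcustom_ops.so

# --- runtime stage: minimal base plus only what inference needs ---
FROM python:3.10-slim
RUN pip install --no-cache-dir onnxruntime   # runtime library only, no build tools
COPY --from=builder /build/custom_ops/build/libcustom_ops.so /opt/ops/
COPY model.onnx /models/model.onnx
COPY serve.py /app/serve.py
CMD ["python", "/app/serve.py"]
```

The compilers and headers never appear in the final image; only the compiled operator library, the runtime dependency, the model, and the serving script are copied forward.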
Base image selection dramatically affects size. The runtime variant of the nvidia/cuda:11.8 image is 1.2 GB and includes only the CUDA runtime libraries, while the devel variant is 3.8 GB and adds compilation tools you do not need for inference. Similarly, python:3.10 at 880 MB includes package managers and development tools, while python:3.10-slim at 120 MB strips these. Google's distroless images go further, removing even the shell and package manager for a 40 to 60 MB Python base, though this complicates debugging.
Dependency management matters. Use pip install with the --no-cache-dir flag to avoid caching 200 to 400 MB of wheel files, and install only runtime dependencies: for ONNX Runtime inference you need onnxruntime-gpu (approximately 25 MB) but not onnx (15 MB) or the training framework. Move preprocessing libraries into sidecars when they pull in heavy dependencies: a single container that needs both Pillow for images and spaCy for text balloons to 600 MB, versus two specialized containers at 180 MB and 220 MB that can scale independently. At Netflix, splitting preprocessing into a separate service reduced the inference container from 520 MB to 180 MB and allowed image decoding (bound by the Central Processing Unit, or CPU) to scale separately from model inference (bound by the Graphics Processing Unit, or GPU), cutting serving costs by 30%.
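A sketch of that split: two independent images, each installing only its own runtime dependencies. The service scripts, ports, and the FastAPI/uvicorn framework choice are assumptions made for illustration, not details from the text.

```dockerfile
# preprocess.Dockerfile -- CPU-bound image decoding, no ML runtime
FROM python:3.10-slim
RUN pip install --no-cache-dir pillow fastapi uvicorn
COPY preprocess_service.py /app/preprocess_service.py
WORKDIR /app
CMD ["uvicorn", "preprocess_service:app", "--host", "0.0.0.0", "--port", "8000"]

# inference.Dockerfile -- GPU-bound model execution, no Pillow or spaCy
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*
RUN pip3 install --no-cache-dir onnxruntime-gpu
COPY model.onnx /models/model.onnx
COPY infer_service.py /app/infer_service.py
CMD ["python3", "/app/infer_service.py"]
```

Because each image runs as its own service, the CPU-bound decoder and the GPU-bound inference server can be given separate autoscaling policies driven by their own bottlenecks.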
💡 Key Takeaways
•Multi-stage Docker builds compile in a devel image (2 to 4 GB with gcc, cmake, and headers), then copy binaries into a runtime image (1 to 1.5 GB base), shrinking the final artifact by 50 to 70% and removing build tools from the attack surface
•Base image choice creates a 3x to 7x size difference: python:3.10-slim (120 MB) versus python:3.10 (880 MB), the nvidia/cuda:11.8 runtime image (1.2 GB) versus the devel image (3.8 GB), or distroless Python (40 to 60 MB) for maximum minimalism
•Dependency hygiene with pip install --no-cache-dir and installing only runtime packages: onnxruntime-gpu (25 MB) without onnx (15 MB) or training frameworks, avoiding 200 to 400 MB of cached wheels
•Preprocessing separation into sidecars or upstream services: splitting Pillow (image decode) and model inference into separate containers turns a 600 MB monolith into 180 MB and 220 MB services that scale independently based on CPU versus GPU bottlenecks
•Layer caching optimization: order Dockerfile commands with the least frequently changing first (base image, system packages) and the most frequently changing last (model artifact, application code) to reuse cached layers and cut continuous integration builds from 8 minutes to 2 minutes (see the sketch after this list)
•Netflix inference platform: reduced container sizes from an average of 520 MB to 180 MB using multi-stage builds and preprocessing separation, cutting image pull time from 18 seconds to 5 seconds and enabling 3x faster autoscaling response
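A small sketch of the layer-ordering idea from the caching takeaway above; the specific system package, file names, and paths are placeholders.

```dockerfile
FROM python:3.10-slim                                       # base image: changes rarely
RUN apt-get update && apt-get install -y --no-install-recommends libgomp1 \
    && rm -rf /var/lib/apt/lists/*                          # system packages: change rarely
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt     # dependencies: change occasionally
COPY model.onnx /models/model.onnx                          # model artifact: changes per release
COPY app/ /app/                                             # application code: changes most often
CMD ["python", "/app/serve.py"]
```

With this ordering, editing application code invalidates only the final layers, so continuous integration rebuilds reuse the cached base, system-package, and dependency layers.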
📌 Examples
Uber model serving Dockerfile: a builder stage based on a pytorch/pytorch 2.0 devel image compiles TorchScript extensions; the runtime stage with FROM python:3.10-slim copies only the torch libraries and the model; final image is 240 MB versus 1.8 GB
Google distroless pattern: FROM gcr.io/distroless/python3 (about 45 MB) for maximum security and minimal size; it removes the shell and package managers, the container runs as non-root, and the total with ONNX Runtime and the model is 180 MB
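A sketch of the distroless pattern, assuming the Python minor version in the builder matches the one in the distroless runtime and that the :nonroot tag is available; requirements.txt, serve.py, and model.onnx are illustrative names.

```dockerfile
# builder stage installs dependencies into a self-contained directory
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --target=/app/deps -r requirements.txt
COPY serve.py model.onnx ./

# distroless runtime: no shell, no package manager, runs as a non-root user
FROM gcr.io/distroless/python3:nonroot
WORKDIR /app
COPY --from=builder /app /app
ENV PYTHONPATH=/app/deps
# the distroless python3 image uses the Python interpreter as its entrypoint,
# so CMD is just the script to run
CMD ["serve.py"]
```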