ML Infrastructure & MLOps · Model Packaging (Docker, ONNX, SavedModel) · Hard · ~3 min

Building Lean Inference Containers: Multi-Stage Builds and Optimization Patterns

Multi-Stage Builds: A Docker pattern in which the Dockerfile contains multiple FROM statements, each starting a new build stage. Build-time dependencies stay in earlier stages; only runtime artifacts are copied into the final image. This separates the build environment from the runtime environment.

Why Multi-Stage Matters

Building a model package requires compilers, development headers, and build tools. Running inference requires only the runtime and the model. Without multi-stage builds, all build dependencies end up in the final image. Example: compiling a Python package with C extensions needs gcc, make, and python-dev, which add roughly 500MB, while the compiled .so file is under 1MB. With a multi-stage build, stage 1 installs the build tools and compiles the package; stage 2 copies only the compiled artifact. The final image is roughly 500MB smaller.
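A minimal sketch of this two-stage split (base image, package list, and the use of pip wheels are assumptions for illustration):

```dockerfile
# Stage 1: builder has compilers and headers (the ~500MB of build tooling)
FROM python:3.11-slim AS builder
RUN apt-get update && apt-get install -y --no-install-recommends gcc make python3-dev
COPY requirements.txt .
# Build wheels so only compiled artifacts move to the next stage
RUN pip wheel --no-cache-dir -r requirements.txt -w /wheels

# Stage 2: runtime keeps none of the build tools
FROM python:3.11-slim
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/* && rm -rf /wheels
```

Only the last stage becomes the shipped image; everything installed in the builder stage is discarded.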

Optimization Patterns

Minimize layers: Combine related commands into single RUN statements. Each layer adds metadata overhead, and files created in one layer then deleted in another still occupy space in the image. Order by change frequency: Put rarely-changing layers first (base image, system packages) and frequently-changing layers last (model files). This maximizes cache hits during rebuilds. Clean up in the same layer: If you install packages and then delete cache files, do both in the same RUN command. Docker layers are additive—deleting in a later layer does not reduce image size.
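A sketch showing all three patterns in one Dockerfile (base image and package names are illustrative assumptions):

```dockerfile
# Rarely changes: base image and system packages go first for cache reuse
FROM python:3.11-slim
# Install and clean up in the SAME layer; an rm in a later layer would not shrink the image
RUN apt-get update && \
    apt-get install -y --no-install-recommends libgomp1 && \
    rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Changes most often: model files copied last so earlier layers stay cached
COPY model/ /app/model/
```

With this ordering, updating the model only invalidates the final COPY layer; the apt and pip layers are reused from cache.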

Practical Size Targets

CPU inference images: target under 500MB. GPU inference images: target under 2GB (CUDA libraries add significant size). If your image exceeds these targets, audit dependencies. Common bloat sources: full ML framework instead of inference-only build, test dependencies left in production image, unnecessary Python packages from requirements.txt copy-paste, cached package downloads not cleaned up. Use tools like dive to inspect layer contents and identify waste.

Dockerfile Example Pattern: Stage 1 (builder): install build tools, compile dependencies. Stage 2 (runtime): start from slim base, copy compiled artifacts from builder, copy model, set entrypoint.
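One hedged sketch of that pattern end to end (paths, base images, and the serve.py entrypoint are assumptions, not a prescribed layout):

```dockerfile
# Stage 1 (builder): build tools, compile dependencies into wheels
FROM python:3.11-slim AS builder
RUN apt-get update && \
    apt-get install -y --no-install-recommends gcc make python3-dev && \
    rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip wheel --no-cache-dir -r requirements.txt -w /wheels

# Stage 2 (runtime): slim base, compiled artifacts, model, entrypoint
FROM python:3.11-slim
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/* && rm -rf /wheels
COPY model/ /app/model/
COPY serve.py /app/
WORKDIR /app
ENTRYPOINT ["python", "serve.py"]
```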

💡 Key Takeaways
Multi-stage builds separate build dependencies from runtime, reducing image size
Order Dockerfile layers by change frequency for maximum cache efficiency
Target under 500MB for CPU images, under 2GB for GPU images
📌 Interview Tips
1. Build tools (gcc, make) add ~500MB, but the compiled artifacts are under 1MB
2. Use the dive tool to inspect layer contents and identify bloat sources