ML Infrastructure & MLOpsModel Packaging (Docker, ONNX, SavedModel)Hard⏱️ ~3 min

Model Packaging Failure Modes: Conversion Pitfalls and Production Gotchas

Model packaging fails in subtle ways that escape testing and surface in production. Conversion pitfalls are the most common. Exporting PyTorch to Open Neural Network Exchange (ONNX) can fail on unsupported operators, custom layers, or dynamic control flow. Even when export succeeds, numerical differences between framework kernels cause accuracy drift: a ranking model at Meta saw a 1.2% accuracy drop after ONNX conversion due to different layer normalization implementations, enough to reduce click-through rate by 0.3% and cost millions in revenue. Dynamic shapes are another trap: if you export without marking the batch dimension as dynamic, the ONNX graph hardcodes batch size 1 and silently fails or crashes when the serving layer attempts dynamic batching with size 16.

Environment mismatches cause hard runtime failures. A model compiled with CUDA 11.8 and TensorRT 8.5 may fail to load if the container runs CUDA 11.7 or mismatched cuDNN versions, producing cryptic errors like "symbol not found" or silent segmentation faults. Central Processing Unit (CPU) optimized builds using AVX-512 instructions crash with illegal-instruction errors on older instance types that lack the instruction set. These failures are hard to catch in staging if your testing cluster runs newer hardware than production. At Uber, a model deployment failed at 3 AM when autoscaling launched pods on older c5 instances without AVX512 support, while testing had only used c6i instances.

Throughput tuning creates its own edge cases. Aggressive dynamic batching with 50 millisecond collection windows improves throughput but causes head-of-line blocking: a single slow request delays an entire batch, spiking p99 latency from 30 milliseconds to 200 milliseconds. Instance groups that over-allocate Graphics Processing Unit (GPU) memory (for example, loading 3 large models on a 16 GB GPU) trigger out-of-memory errors under peak load, causing restart loops that cascade across the fleet. If preprocessing lives inside the inference container, CPU-bound operations like JPEG decoding at 20 to 40 milliseconds can dominate a roughly 60 millisecond total latency, hiding the fact that the GPU is idle 60% of the time and leading to massive over-provisioning of expensive accelerators.
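Both the dynamic-shape trap and conversion drift can be caught before deployment with an export-and-compare step. The sketch below is a minimal illustration, not a production pipeline: the tiny model, the 256-wide input, the opset version, and the 1e-4 tolerance are all assumptions chosen for the example.

```python
# Sketch: export a PyTorch model to ONNX with a dynamic batch axis, then
# compare outputs against the original model to catch numerical drift.
# Model, shapes, opset, and tolerance are illustrative assumptions.
import numpy as np
import torch
import onnxruntime as ort

model = torch.nn.Sequential(          # stand-in for a real ranking model
    torch.nn.Linear(256, 128),
    torch.nn.LayerNorm(128),          # layer norm is a common drift source
    torch.nn.ReLU(),
    torch.nn.Linear(128, 1),
).eval()
dummy = torch.randn(1, 256)           # assumed feature width of 256

torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=["features"],
    output_names=["scores"],
    # Without this, the graph hardcodes batch size 1 and dynamic batching
    # at serving time fails or silently degrades.
    dynamic_axes={"features": {0: "batch"}, "scores": {0: "batch"}},
    opset_version=17,
)

# Parity check at a batch size larger than the export dummy.
batch = torch.randn(16, 256)
with torch.no_grad():
    ref = model(batch).numpy()

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
out = sess.run(None, {"features": batch.numpy()})[0]

max_diff = float(np.max(np.abs(ref - out)))
print(f"max abs diff: {max_diff:.6f}")
assert max_diff < 1e-4, "conversion drift exceeds tolerance"  # assumed threshold
```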
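Environment skew can be turned into a fast, explicit startup failure instead of a 3 AM segfault. A minimal sketch follows, assuming a Linux host, a model built with AVX-512 kernels, and a CUDA 11.8 build (the required flags and versions are illustrative assumptions for your own image):

```python
# Sketch: fail fast at container startup if the host lacks the instruction
# set or CUDA runtime the packaged model was built against.
# Required flags and versions below are illustrative assumptions.
import sys
import torch

REQUIRED_CPU_FLAGS = {"avx512f"}   # assumed: CPU build uses AVX-512 kernels
REQUIRED_CUDA = "11.8"             # assumed: compiled against CUDA 11.8

def cpu_flags() -> set:
    # Linux-only: parse the flags line from /proc/cpuinfo.
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

missing = REQUIRED_CPU_FLAGS - cpu_flags()
if missing:
    sys.exit(f"host CPU missing required instruction sets: {missing}")

if not torch.cuda.is_available():
    sys.exit("CUDA runtime not available in this container/host combination")

if torch.version.cuda != REQUIRED_CUDA:
    sys.exit(f"CUDA version skew: built for {REQUIRED_CUDA}, found {torch.version.cuda}")

print("environment checks passed; safe to load the model")
```

Running this before loading any weights converts the "symbol not found" or illegal-instruction crash into a readable error that autoscaling and alerting can act on.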
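A toy simulation makes the head-of-line-blocking tradeoff concrete. The window length, batch size, slow-request fraction, and timings below are assumptions chosen to mirror the numbers above, not measurements from a real serving stack:

```python
# Sketch: toy model of head-of-line blocking under dynamic batching.
# All constants are illustrative assumptions.
import random

random.seed(0)

WINDOW_MS = 50          # batch collection window
INFER_MS = 10           # GPU inference time per batch (assumed constant)
FAST_PREPROC_MS = 5     # typical request preprocessing
SLOW_PREPROC_MS = 200   # occasional slow request (e.g. oversized image)
SLOW_FRACTION = 0.02    # assumed 2% of requests are slow
BATCH_SIZE = 16

latencies = []
for _ in range(10_000 // BATCH_SIZE):
    batch = [
        SLOW_PREPROC_MS if random.random() < SLOW_FRACTION else FAST_PREPROC_MS
        for _ in range(BATCH_SIZE)
    ]
    # Every request in the batch waits for the window, the slowest member,
    # and the shared inference call.
    batch_latency = WINDOW_MS + max(batch) + INFER_MS
    latencies.extend([batch_latency] * BATCH_SIZE)

latencies.sort()
p50 = latencies[len(latencies) // 2]
p99 = latencies[int(len(latencies) * 0.99)]
print(f"p50={p50} ms  p99={p99} ms")  # a few slow requests inflate the whole tail
```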
💡 Key Takeaways
Conversion accuracy drift occurs when framework-specific kernel implementations differ: layer normalization, batch normalization, and activation functions can vary by 0.001 to 0.01 in output values, accumulating to 0.5 to 2% accuracy loss in deep models with 50+ layers
Dynamic shape export requires explicit axis marking during ONNX export or in TensorFlow SavedModel signatures: forgetting to mark the batch dimension as dynamic hardcodes batch size 1 and breaks dynamic batching at serving time, silently degrading throughput by 5 to 10x
Environment version mismatches between container and runtime cause failures: CUDA version skew (11.7 versus 11.8), cuDNN library mismatches (8.2 versus 8.6), or Application Binary Interface (ABI) incompatibilities produce symbol lookup errors or segfaults that are flaky and hard to reproduce
Dynamic batching head-of-line blocking: a 50 millisecond batch window improves throughput by 8x, but one slow request (200 milliseconds of preprocessing) delays the entire batch, spiking p99 latency from 30 milliseconds to 250 milliseconds and violating Service Level Objectives (SLOs)
CPU-bound preprocessing hidden inside inference containers causes GPU underutilization: 30 milliseconds of image decode plus 10 milliseconds of model inference is 40 milliseconds total, but the GPU is idle 75% of the time, leading to 4x over-provisioning of GPU instances costing $20K per month instead of $5K (see the back-of-envelope sketch after this list)
Large model artifacts serialized with protocol buffers hit 2 GB message size limits in some systems, causing silent truncation or upload failures to model registries and requiring chunking or split-storage strategies
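The GPU-underutilization point above reduces to simple arithmetic. The sketch below uses the 30 ms / 10 ms split from the takeaway and an assumed 2,000 QPS of traffic, with one request in flight per GPU worker (a deliberately simplified Little's-law estimate):

```python
# Sketch: back-of-envelope for how in-container preprocessing inflates GPU
# counts. Traffic level and the one-request-per-worker model are assumptions.
DECODE_MS = 30        # CPU-bound JPEG decode
INFER_MS = 10         # GPU inference
TARGET_QPS = 2_000    # assumed traffic

# Preprocessing inside the inference container: each GPU worker is occupied
# for the full 40 ms per request even though the GPU only computes for 10 ms.
per_request_ms = DECODE_MS + INFER_MS
workers_coupled = TARGET_QPS * per_request_ms / 1000
gpu_idle_fraction = 1 - INFER_MS / per_request_ms   # 0.75

# Preprocessing moved to cheap CPU workers: GPUs only pay the 10 ms.
workers_decoupled = TARGET_QPS * INFER_MS / 1000

print(f"GPU idle fraction with coupled preprocessing: {gpu_idle_fraction:.0%}")
print(f"GPU workers needed: {workers_coupled:.0f} coupled vs "
      f"{workers_decoupled:.0f} decoupled (4x)")
```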
📌 Examples
Meta ad ranking model: ONNX conversion caused a 1.2% precision drop due to layer norm kernel differences, detected only after an A/B test showed a 0.3% click-through rate decrease; the conversion was rolled back and serving stayed on PyTorch with TorchScript instead
Netflix model deployment: staging tests passed on g4dn instances with T4 GPUs and CUDA 11.8, but production failed at scale when autoscaling launched g3 instances with older Tesla M60 GPUs lacking TensorRT 8 support; fixed by constraining instance types with Kubernetes node affinity