ONNX vs SavedModel: Choosing Your Serialization Format
Open Neural Network Exchange (ONNX) and SavedModel represent fundamentally different approaches to model serialization. ONNX prioritizes cross-framework portability by defining an open graph representation with standardized operators: you train in PyTorch, export to ONNX, and serve with ONNX Runtime on optimized hardware backends. SavedModel is TensorFlow's native format, deeply integrated with the TensorFlow ecosystem and optimized for TensorFlow Serving deployments.
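A minimal sketch of that export-and-serve path, assuming a stock torchvision ResNet50 and illustrative file and tensor names rather than a prescribed pipeline:

```python
# Export a PyTorch model to ONNX and run it with ONNX Runtime on CPU.
# Model choice, file name, and input/output names are illustrative assumptions.
import torch
import torchvision.models as models
import onnxruntime as ort

model = models.resnet50(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)

# Export with a dynamic batch dimension so one artifact serves any batch size.
torch.onnx.export(
    model,
    dummy_input,
    "resnet50.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
    opset_version=17,
)

# The providers list selects the hardware backend: CPU here, with
# CUDAExecutionProvider or TensorrtExecutionProvider on GPU hosts.
session = ort.InferenceSession("resnet50.onnx", providers=["CPUExecutionProvider"])
logits = session.run(None, {"input": dummy_input.numpy()})[0]
print(logits.shape)  # (1, 1000)
```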
The performance characteristics differ significantly. A ResNet50 image classifier packaged as ONNX and served on a Central Processing Unit (CPU) with ONNX Runtime achieves 8 to 20 millisecond single-image latency on AVX2 hosts, handling 200 to 600 requests per second per node. The same model compiled to TensorRT (an NVIDIA inference optimizer) and served on a T4 Graphics Processing Unit (GPU) processes thousands of images per second with batch sizes of 16 or 32, and reaches single-digit millisecond latency at batch size 1. SavedModel with TensorFlow Serving on the same T4 GPU delivers comparable throughput when combined with Accelerated Linear Algebra (XLA) compilation, but container sizes are typically 400 to 800 MB larger due to the full TensorFlow runtime.
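How such per-node latency and throughput numbers are typically measured can be sketched with a small harness; this assumes the resnet50.onnx artifact and input name from the snippet above, and the results will vary with the host:

```python
# Rough p50/p99 latency harness for the ONNX Runtime CPU path described above.
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("resnet50.onnx", providers=["CPUExecutionProvider"])
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Warm-up runs so graph optimization and allocator effects do not skew timings.
for _ in range(10):
    session.run(None, {"input": batch})

latencies_ms = []
for _ in range(200):
    start = time.perf_counter()
    session.run(None, {"input": batch})
    latencies_ms.append((time.perf_counter() - start) * 1000)

latencies_ms.sort()
p50 = latencies_ms[len(latencies_ms) // 2]
p99 = latencies_ms[int(len(latencies_ms) * 0.99)]
print(f"p50={p50:.1f} ms  p99={p99:.1f} ms  ~{1000 / p50:.0f} req/s per worker")
```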
The trade-off comes down to flexibility versus integration depth. ONNX shines in polyglot environments where teams want training framework independence and lean inference containers; Meta, for example, uses ONNX for edge deployments where 100 MB containers matter. However, you pay a conversion tax: custom operators may not translate cleanly, and numerical differences between framework kernels can cause 0.1 to 2% accuracy drift in sensitive applications like ad ranking. SavedModel eliminates conversion risk for TensorFlow workflows and leverages TensorFlow-specific optimizations, but it locks you into the TensorFlow stack and heavier deployment artifacts.
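A pre-promotion drift check along those lines can be sketched as below; the file name, the random weights and inputs, and the 0.1% top-1 disagreement tolerance are illustrative assumptions, not a fixed standard:

```python
# Compare a source PyTorch model against its ONNX export on a fixed batch
# before promoting the converted artifact to production.
import numpy as np
import torch
import torchvision.models as models
import onnxruntime as ort

model = models.resnet50(weights=None).eval()
fixed_batch = torch.randn(32, 3, 224, 224)  # in practice, a frozen sample of real inputs

# Export this exact model instance so reference and converted outputs are comparable.
torch.onnx.export(
    model, fixed_batch[:1], "resnet50_check.onnx",
    input_names=["input"], output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},
)

session = ort.InferenceSession("resnet50_check.onnx", providers=["CPUExecutionProvider"])
with torch.no_grad():
    reference = model(fixed_batch).numpy()
converted = session.run(None, {"input": fixed_batch.numpy()})[0]

max_abs_diff = float(np.max(np.abs(reference - converted)))
top1_agreement = float(np.mean(reference.argmax(axis=1) == converted.argmax(axis=1)))
print(f"max abs diff={max_abs_diff:.6f}  top-1 agreement={top1_agreement:.3%}")
assert top1_agreement >= 0.999, "conversion drift exceeds promotion tolerance"
```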
💡 Key Takeaways
• ONNX uses protocol buffers to serialize models as directed acyclic graphs with 150+ standardized operators, enabling cross-framework deployment where a model trained in PyTorch can be served via ONNX Runtime with CPU, CUDA, or TensorRT backends
• Container size impact is dramatic: ONNX Runtime inference images run 100 to 300 MB, enabling 3 to 5 second cold starts, while full TensorFlow SavedModel containers at 800 to 1500 MB take 20 to 45 seconds to start under Kubernetes
• Conversion accuracy risk exists with ONNX, where custom layers or dynamic control flow may fail to export or introduce numerical drift of 0.1 to 2%, requiring validation on fixed test sets before production promotion
• SavedModel avoids conversion entirely for TensorFlow workflows and leverages XLA compilation for up to 30% speedup on Tensor Processing Units (TPUs) and modern GPUs, but ties infrastructure to TensorFlow versions (see the export sketch after this list)
• Production choice at scale: Meta uses ONNX for mobile and edge inference where container size matters, while Google uses SavedModel for server-side ranking where TensorFlow Serving integration and XLA provide performance wins
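For the SavedModel side referenced in the takeaways, the export-and-signature workflow looks roughly like this; the toy scoring module, directory layout, and the jit_compile flag (which requests XLA compilation) are illustrative assumptions rather than a production setup:

```python
# Export a tf.Module with an explicit serving signature to the SavedModel format.
import tensorflow as tf

class Scorer(tf.Module):
    def __init__(self):
        super().__init__()
        self.w = tf.Variable(tf.random.normal([2048, 1000]), name="w")

    # input_signature pins the serving interface; jit_compile=True requests XLA.
    @tf.function(
        input_signature=[tf.TensorSpec([None, 2048], tf.float32, name="features")],
        jit_compile=True,
    )
    def score(self, features):
        return {"logits": tf.matmul(features, self.w)}

scorer = Scorer()

# TensorFlow Serving expects the model_name/version/ directory layout.
tf.saved_model.save(scorer, "models/ranker/1",
                    signatures={"serving_default": scorer.score})

# Reload and invoke the signature the same way TensorFlow Serving would.
loaded = tf.saved_model.load("models/ranker/1")
infer = loaded.signatures["serving_default"]
print(infer(features=tf.random.normal([4, 2048]))["logits"].shape)  # (4, 1000)
```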
📌 Examples
Instagram feed ranking: models are trained in PyTorch, exported to ONNX, and served at 50 millisecond p99 latency on CPU with ONNX Runtime across 10,000+ pods, achieving a 95% container size reduction versus PyTorch containers
Google Search ranking: SavedModel artifacts served via TensorFlow Serving with XLA compilation on TPU v4 pods, processing 100,000 queries per second with 20 millisecond p50 latency per ranking request
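For context on how a served SavedModel is queried in practice, TensorFlow Serving exposes a REST predict endpoint; the host, model name ("ranker", matching the hypothetical export above), and zero-valued feature row are illustrative assumptions:

```python
# Sketch of a client call against TensorFlow Serving's REST API.
# Assumes a serving container is running against the directory exported above, e.g.:
#   docker run -p 8501:8501 -v "$PWD/models/ranker:/models/ranker" \
#     -e MODEL_NAME=ranker tensorflow/serving
import requests

payload = {"instances": [[0.0] * 2048]}  # one request carrying a single feature row
resp = requests.post(
    "http://localhost:8501/v1/models/ranker:predict",
    json=payload,
    timeout=1.0,
)
resp.raise_for_status()
print(resp.json()["predictions"][0][:5])  # first few logits for the single instance
```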