Production Model Serving Pipeline: From Training to Inference at Scale
A production model serving pipeline orchestrates the full lifecycle from training completion to serving predictions at scale. The flow starts when training completes: teams export the model to Open Neural Network Exchange (ONNX) or SavedModel format and register the artifact in a model registry under a semantic version (for example, v2.3.1), along with metadata such as input/output signatures and performance baselines. A continuous integration system builds a lean inference container, runs conformance tests validating accuracy within 0.5% of training metrics, and measures latency on representative hardware. Only passing builds are promoted to a container registry.
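To make the export-and-register step concrete, here is a minimal sketch assuming a PyTorch image classifier, ONNX Runtime for the conformance check, and MLflow 2.x as the model registry; the registry name, signature tag, and tolerance are illustrative, not prescribed by any particular stack.

```python
import mlflow
import numpy as np
import onnx
import onnxruntime as ort
import torch

def export_and_register(model: torch.nn.Module, sample_input: torch.Tensor,
                        baseline_accuracy: float,
                        registry_name: str = "vision-classifier"):
    """Export a trained model to ONNX, check conformance, and register the artifact."""
    model.eval()
    onnx_path = "model.onnx"
    torch.onnx.export(
        model, sample_input, onnx_path,
        input_names=["input"], output_names=["logits"],
        dynamic_axes={"input": {0: "batch"}},  # let the server choose the batch size
    )

    # Conformance check: the exported graph must reproduce the framework's outputs.
    sess = ort.InferenceSession(onnx_path)
    with torch.no_grad():
        reference = model(sample_input).numpy()
    exported = sess.run(None, {"input": sample_input.numpy()})[0]
    assert np.allclose(reference, exported, atol=1e-4), "ONNX output diverged from training graph"

    # Register under a semantic version with the metadata serving will depend on.
    with mlflow.start_run():
        mlflow.log_metric("baseline_accuracy", baseline_accuracy)
        mlflow.set_tag("input_signature", "float32[batch, 3, 224, 224]")  # hypothetical signature
        mlflow.onnx.log_model(onnx.load(onnx_path), artifact_path="model",
                              registered_model_name=registry_name)
```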
Deployment uses progressive rollout. An orchestrator like Kubernetes deploys the new version alongside the current production version. Traffic routing starts with a 1 to 5% canary allocation, monitoring per-version metrics: p50/p99 latency, error rate, prediction quality on shadow traffic, and business metrics like click-through rate. If the canary is healthy for 1 to 4 hours, traffic gradually shifts to 10%, 50%, then 100% over 6 to 24 hours. At serving time, an Application Programming Interface (API) gateway fans requests out to a pool of inference pods. Each pod loads the model into memory on startup (1 to 5 seconds for 100 to 500 MB models), runs warmup, and exposes a prediction endpoint.
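The promotion logic behind the canary phase can be expressed as a small decision function. The sketch below is illustrative and not tied to any orchestrator; the rollback thresholds mirror the ones cited later in this section (error rate above 0.1%, p99 latency degraded by more than 20%), and the business-metric guardrail is an assumption.

```python
from dataclasses import dataclass

@dataclass
class VersionMetrics:
    p99_latency_ms: float
    error_rate: float   # fraction of failed requests
    ctr: float          # business metric, e.g. click-through rate

def canary_decision(canary: VersionMetrics, baseline: VersionMetrics) -> str:
    """Decide whether to advance the canary, hold traffic, or roll back."""
    if canary.error_rate > 0.001:                       # error rate above 0.1%
        return "rollback"
    if canary.p99_latency_ms > 1.20 * baseline.p99_latency_ms:  # >20% latency regression
        return "rollback"
    if canary.ctr < 0.98 * baseline.ctr:                # guardrail on the business metric
        return "hold"                                   # keep traffic at the current step
    return "advance"                                    # move to the next traffic step

# Progressive traffic schedule: 1% -> 5% -> 10% -> 50% -> 100%
TRAFFIC_STEPS = [0.01, 0.05, 0.10, 0.50, 1.00]
```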
Multi-model hosting is common at scale. Systems like NVIDIA Triton Inference Server or TensorFlow Serving can host 10 to 50 model versions per process, reading from a versioned model repository and hot-reloading newer versions without downtime. Dynamic batching merges incoming requests into micro-batches (typically 8 to 32 samples) to improve Graphics Processing Unit (GPU) throughput by 3 to 10x while constraining latency. For example, a single T4 GPU serving a vision model might process 50 requests per second individually at 20 millisecond latency, or batch requests with a 10 millisecond collection window to process 400 requests per second at 30 millisecond p99 latency, trading 10 milliseconds of latency for 8x throughput.
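The core of dynamic batching is a queue plus a collection window. Below is a minimal asyncio sketch, not Triton's or TensorFlow Serving's actual implementation; `infer_fn`, the 32-sample cap, and the 10 millisecond window are placeholders matching the numbers above.

```python
import asyncio
import time
from typing import Any

MAX_BATCH = 32   # upper bound on micro-batch size
WINDOW_MS = 10   # collection window before flushing a partial batch

class DynamicBatcher:
    """Merge concurrent requests into micro-batches for a single batched forward pass."""

    def __init__(self, infer_fn):
        self.infer_fn = infer_fn              # callable: list of inputs -> list of outputs
        self.queue: asyncio.Queue = asyncio.Queue()

    async def predict(self, x: Any) -> Any:
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def run(self):
        while True:
            inputs, futures = [], []
            x, fut = await self.queue.get()   # block until the first request arrives
            inputs.append(x); futures.append(fut)
            deadline = time.monotonic() + WINDOW_MS / 1000
            while len(inputs) < MAX_BATCH:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    x, fut = await asyncio.wait_for(self.queue.get(), timeout)
                    inputs.append(x); futures.append(fut)
                except asyncio.TimeoutError:
                    break
            outputs = self.infer_fn(inputs)   # one batched forward pass (synchronous for simplicity)
            for fut, out in zip(futures, outputs):
                fut.set_result(out)
```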
Instrumentation is critical. Every request emits spans with model name, version, decode time, preprocess time, inference time, and postprocess time. Metrics feed autoscaling policies: scale on queue depth for latency-sensitive services, and on GPU utilization for throughput-optimized batch workloads. Meta's recommendation serving emits over 100 metrics per model, including feature distribution drift, which triggers retraining when input distributions shift beyond thresholds.
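Per-stage timing can be added with a few lines of wrapper code. The sketch below is a simplified illustration: the stage callables and the `emit` function stand in for the real decode/preprocess/inference/postprocess steps and whatever tracing or metrics exporter the service uses.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(span: dict, stage: str):
    """Record the wall-clock duration of one pipeline stage in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        span[f"{stage}_ms"] = (time.perf_counter() - start) * 1000

def handle_request(raw_bytes, stages, emit,
                   model_name="vision-classifier", version="v2.3.1"):
    """`stages` maps stage name -> callable; `emit` ships the span to the metrics backend."""
    span = {"model": model_name, "version": version}
    data = raw_bytes
    for stage in ("decode", "preprocess", "inference", "postprocess"):
        with timed(span, stage):
            data = stages[stage](data)   # each stage transforms the previous stage's output
    emit(span)                           # one span per request, tagged with model and version
    return data
```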
💡 Key Takeaways
•Model registry acts as source of truth with semantic versioning, storing artifacts with metadata including input shapes (for example, [batch, 224, 224, 3] float32), performance baselines (for example, 15 millisecond p99 latency), and checksums for reproducibility
•Progressive rollout with 1 to 5% canary traffic runs for 1 to 4 hours monitoring p50/p99 latency, error rate, and business metrics before gradual shift to 100%, with automatic rollback if error rate exceeds 0.1% or latency degrades beyond 20%
•Dynamic batching on Graphics Processing Units (GPUs) collects requests for 5 to 20 milliseconds to form micro-batches of 8 to 32 samples, improving throughput by 3 to 10x at the cost of a bounded latency increase, critical for models where a single inference takes 5 to 15 milliseconds
•Multi-model hosting with NVIDIA Triton or TensorFlow Serving allows one process to serve 10 to 50 model versions, hot-reloading from versioned storage and routing traffic for A/B tests without restarting pods or reloading unchanged models
•Instrumentation emits per-request metrics including decode (1 to 5 milliseconds), preprocess (2 to 10 milliseconds), inference (10 to 100 milliseconds), and postprocess (1 to 5 milliseconds) timings to identify bottlenecks and guide optimization efforts
•Google's production serving uses feature distribution monitoring to detect input drift, automatically triggering retraining when key feature distributions shift beyond 2 standard deviations from the training data, preventing silent accuracy degradation (a minimal drift-check sketch follows this list)
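As referenced in the last takeaway, here is a minimal sketch of the drift check, reduced to a per-feature mean-shift test against training statistics; the feature names and statistics are hypothetical, and production systems typically compare full distributions rather than means alone.

```python
import numpy as np

def feature_drift(train_stats, live_batch, threshold_sigmas=2.0):
    """Flag features whose live mean has shifted beyond N training standard deviations.

    train_stats: dict of feature name -> (train_mean, train_std)
    live_batch:  dict of feature name -> 1-D array of recent serving values
    Returns the set of drifted feature names; a non-empty set would trigger retraining.
    """
    drifted = set()
    for name, (mean, std) in train_stats.items():
        live_mean = float(np.mean(live_batch[name]))
        if abs(live_mean - mean) > threshold_sigmas * std:
            drifted.add(name)
    return drifted

# Hypothetical example: "age" has drifted well outside 2 sigma of its training mean.
stats = {"age": (35.0, 5.0), "session_length": (120.0, 40.0)}
live = {"age": np.array([52.0, 49.0, 55.0]), "session_length": np.array([118.0, 131.0])}
assert feature_drift(stats, live) == {"age"}
```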
📌 Examples
Netflix homepage recommendation: serves 50+ model versions across 5000 pods, uses TensorFlow Serving with dynamic batching (batch size 16, 10 millisecond window) on T4 GPUs achieving 150 millisecond p99 latency at 200,000 requests per second, canary rollouts over 12 hours
Uber ETA prediction: registers SavedModel artifacts in MLflow, builds inference containers via continuous integration with latency tests on m5.xlarge instances, deploys with 5% canary for 2 hours monitoring mean absolute error on live trips before full rollout