Online Serving Architecture: Dynamic Batching and Caching
Serving a trained model at scale requires careful system design. The model itself is only part of the solution. Infrastructure for batching, caching, and load balancing determines whether you achieve target latency and throughput.
Dynamic Batching
GPUs excel at parallel computation. Processing one image takes 5ms; processing 32 images takes 8ms, i.e. 0.25ms per image amortized, a roughly 20x throughput gain. Without batching, you waste 90%+ of GPU capacity on memory transfers and kernel launches.
How it works: Requests queue until either a batch fills (e.g., 32 images) or a timeout expires (e.g., 10ms). Larger batches raise throughput but add queueing latency: under a 50ms batching window, a lone request waits the full 50ms before inference even starts, so the timeout sets a latency floor whenever traffic is too light to fill batches.
Adaptive batching: During traffic spikes, batches fill quickly and latency stays low. During low traffic, timeout triggers before batches fill, preventing requests from waiting forever.
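A minimal sketch of this queue-and-timeout loop using Python's asyncio; the MAX_BATCH and MAX_WAIT_S values and the run_inference callable are illustrative placeholders, not part of any specific serving framework:

```python
import asyncio
import time

MAX_BATCH = 32      # flush when this many requests are queued (assumed value)
MAX_WAIT_S = 0.010  # ...or when the oldest request has waited 10ms (assumed)

queue: asyncio.Queue = asyncio.Queue()

async def batching_loop(run_inference):
    """Collect requests until the batch fills or the timeout expires."""
    while True:
        item = await queue.get()            # block until the first request
        batch = [item]
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break                       # timeout: flush a partial batch
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        inputs = [inp for inp, _ in batch]
        results = run_inference(inputs)     # one GPU call for the whole batch
        for (_, fut), res in zip(batch, results):
            fut.set_result(res)             # wake each waiting caller

async def predict(x):
    """Called per request: enqueue the input and await its result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut
```

Each caller parks on a future, so the loop can serve all queued requests with a single GPU call and wake them together; under load the inner loop exits on batch size rather than timeout, which is the adaptive behavior described above.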
Caching Strategies
Result caching: Store (image_hash, class_prediction) pairs. When the same image reappears, return the cached result without inference. Hit rates of 20-40% are typical for user upload scenarios.
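A sketch of the result-cache lookup, assuming raw image bytes as input; the SHA-256 keying, the LRU eviction policy, and CACHE_SIZE are illustrative choices rather than requirements:

```python
import hashlib
from collections import OrderedDict

CACHE_SIZE = 100_000  # assumed capacity

class ResultCache:
    """Bounded LRU map from image-content hash to predicted class."""
    def __init__(self, capacity=CACHE_SIZE):
        self.capacity = capacity
        self._store: OrderedDict[str, str] = OrderedDict()

    @staticmethod
    def key(image_bytes: bytes) -> str:
        # Hash the raw bytes so identical uploads share one entry.
        return hashlib.sha256(image_bytes).hexdigest()

    def get(self, image_bytes: bytes):
        k = self.key(image_bytes)
        if k in self._store:
            self._store.move_to_end(k)       # refresh LRU position
            return self._store[k]
        return None                          # miss: run inference instead

    def put(self, image_bytes: bytes, prediction: str):
        self._store[self.key(image_bytes)] = prediction
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least-recently-used
```

A serving handler would call get() first and fall through to the model only on a miss, then put() the fresh prediction.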
Embedding caching: Store intermediate embeddings from the model backbone. For similar images, retrieve nearby embeddings and compare distances. Useful for near-duplicate detection.
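A brute-force sketch of embedding lookup using a cosine-similarity threshold; SIM_THRESHOLD is an assumed cutoff, and a production system would typically replace the linear scan with an approximate-nearest-neighbor index:

```python
import numpy as np

SIM_THRESHOLD = 0.95  # assumed near-duplicate cutoff

class EmbeddingCache:
    """Stores unit-normalized backbone embeddings with their predictions."""
    def __init__(self, dim: int):
        self._vecs = np.empty((0, dim), dtype=np.float32)
        self._preds: list[str] = []

    def add(self, emb: np.ndarray, prediction: str):
        emb = emb / np.linalg.norm(emb)      # store unit vectors
        self._vecs = np.vstack([self._vecs, emb])
        self._preds.append(prediction)

    def lookup(self, emb: np.ndarray):
        """Return a cached prediction if a near-duplicate embedding exists."""
        if not self._preds:
            return None
        emb = emb / np.linalg.norm(emb)
        sims = self._vecs @ emb              # cosine similarity via dot product
        best = int(np.argmax(sims))
        return self._preds[best] if sims[best] >= SIM_THRESHOLD else None
```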
Load Balancing
Model servers sit behind a load balancer that distributes requests. Round-robin works for homogeneous servers. For mixed GPU types, use weighted distribution based on server throughput capacity. Health checks remove failing servers from rotation within seconds.
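A sketch of weighted selection over healthy backends; the server names, weights, and health-flag mechanism are hypothetical stand-ins for whatever your load balancer exposes:

```python
import random

# Weight ≈ relative throughput capacity of each server (assumed values).
SERVERS = {
    "gpu-a100-1": 4.0,
    "gpu-a100-2": 4.0,
    "gpu-t4-1":   1.0,
}

healthy = dict.fromkeys(SERVERS, True)

def mark_health(name: str, ok: bool):
    """Called by the health checker; failing servers leave rotation."""
    healthy[name] = ok

def pick_server() -> str:
    """Weighted random choice over healthy servers, which converges to
    the same traffic split as weighted round-robin in expectation."""
    candidates = [s for s in SERVERS if healthy[s]]
    if not candidates:
        raise RuntimeError("no healthy backends")
    weights = [SERVERS[s] for s in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]
```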