Production Data Pipeline Design and Throughput
Production Pipeline Requirements
Training pipelines can tolerate seconds or minutes per batch. Production pipelines must process images in milliseconds. The same preprocessing logic runs in both contexts, but the implementation differs dramatically to meet throughput and latency targets.
Image Decoding Bottleneck
JPEG decoding is CPU-intensive. A single core decodes 50-200 images per second depending on resolution. For high-throughput pipelines, this becomes the bottleneck before GPU inference even starts.
Solutions: Use multiple CPU workers for parallel decoding. Use a GPU-accelerated decoder such as nvJPEG, which can process 1000+ images per second. Or pre-decode once and store raw tensors for repeated access, at the cost of 10-20x more storage.
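A minimal sketch of the parallel-decoding pattern, assuming a worker-pool setup. Real JPEG decoders (e.g. PIL or libjpeg-turbo bindings) release the GIL during decode, so a thread pool gives true parallelism; the `decode_image` body here is a stand-in that just unpacks bytes so the sketch stays self-contained and runnable.

```python
from concurrent.futures import ThreadPoolExecutor

def decode_image(raw: bytes) -> list:
    # Stand-in for a real decoder call such as PIL.Image.open(...).
    # Real decoders release the GIL, so threads decode in parallel;
    # here we just unpack the bytes to keep the sketch runnable.
    return list(raw)

def decode_batch(raw_images, workers: int = 4):
    # Fan decoding out across worker threads; throughput scales
    # roughly with core count until memory bandwidth saturates.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(decode_image, raw_images))

batch = [bytes([i]) * 8 for i in range(16)]
decoded = decode_batch(batch)
print(len(decoded))  # 16
```

For a pure-Python decode body the GIL would serialize the threads; in that case a ProcessPoolExecutor is the usual substitute.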
Memory and Batching
A 224x224 RGB image consumes about 150KB as a uint8 tensor (224 x 224 x 3 bytes); converting to float32 quadruples that. A batch of 32 uint8 images uses about 5MB. A batch of 256 uses about 38MB. Pipeline memory must accommodate multiple batches in flight simultaneously.
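The arithmetic behind these figures, as a small helper (the function name is illustrative, not from any library):

```python
def batch_memory_bytes(batch_size, height=224, width=224,
                       channels=3, bytes_per_value=1):
    # bytes_per_value=1 for uint8; use 4 for float32.
    return batch_size * height * width * channels * bytes_per_value

print(batch_memory_bytes(1))                    # 150528 bytes, ~150 KB
print(round(batch_memory_bytes(32) / 1e6, 1))   # 4.8 (MB)
print(round(batch_memory_bytes(256) / 1e6, 1))  # 38.5 (MB)
# The same batch of 32 as float32 is 4x larger:
print(round(batch_memory_bytes(32, bytes_per_value=4) / 1e6, 1))  # 19.3 (MB)
```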
Prefetching: While the GPU processes batch N, the CPU prepares batch N+1. This hides preprocessing latency but costs one extra batch of memory per prefetch slot. Balance prefetch depth against available memory.
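The prefetch pattern above can be sketched as a producer thread feeding a bounded queue; the queue's `maxsize` is the prefetch depth, which caps how many prepared batches sit in memory at once. The `make_batch` callable is a hypothetical stand-in for the real load-and-preprocess step.

```python
import queue
import threading

def prefetching_loader(make_batch, num_batches, depth=2):
    # Producer thread prepares batches ahead of the consumer; the
    # bounded queue blocks the producer once `depth` batches are
    # queued, capping prefetch memory.
    q = queue.Queue(maxsize=depth)
    sentinel = object()

    def producer():
        for i in range(num_batches):
            q.put(make_batch(i))  # blocks when the queue is full
        q.put(sentinel)           # signal end of data

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            break
        yield item

# Consumer (the GPU step) pulls batches that are already prepared.
batches = list(prefetching_loader(lambda i: [i] * 4, num_batches=3, depth=2))
print(batches)  # [[0, 0, 0, 0], [1, 1, 1, 1], [2, 2, 2, 2]]
```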
Data Loading Patterns
Synchronous loading: Load, preprocess, and run inference sequentially. Simple, but the GPU sits idle during loading; utilization drops to 30-50%.
Asynchronous loading: Separate threads handle loading and preprocessing. GPU stays busy while CPU prepares next batch. Achieves 80-95% GPU utilization.
Memory-mapped files: Map dataset files directly into the process address space. The OS pages data in on demand and handles caching, even when the dataset exceeds RAM, and explicit loading code disappears.
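A minimal memory-mapping sketch using the standard-library mmap module; the file here is a tiny stand-in for a pre-decoded tensor store, and the path is created just for the example.

```python
import mmap
import os
import tempfile

# Write a small binary file standing in for a pre-decoded tensor store.
path = os.path.join(tempfile.mkdtemp(), "images.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)))

# Map it read-only: no explicit read loop. The OS pages data in on
# demand and keeps hot pages cached, shared across processes.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        record = mm[100:104]  # random access by offset, no seek/read calls
print(list(record))  # [100, 101, 102, 103]
```

In practice a framework wrapper such as numpy's memmap support provides the same behavior with array indexing on top.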