
Production Data Pipeline Design and Throughput

Production Pipeline Requirements

Training pipelines can tolerate minutes per batch. Production pipelines must process images in milliseconds. The same preprocessing logic runs in both contexts, but the implementation differs dramatically to meet throughput and latency targets.

Image Decoding Bottleneck

JPEG decoding is CPU-intensive. A single core decodes 50-200 images per second depending on resolution. For high-throughput pipelines, this becomes the bottleneck before GPU inference even starts.

Solutions: use multiple CPU workers for parallel decoding; use GPU-accelerated decoders like nvJPEG, which process 1000+ images per second; or pre-decode and store images as raw tensors for repeated access, at the cost of 10-20x storage.
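A minimal sketch of the multi-worker decoding pattern using Python's standard library. The `decode_jpeg` function here is a placeholder (in practice you would call e.g. PIL or torchvision decoding; libjpeg-based decoders release the GIL, so a thread pool gives real parallelism):

```python
from concurrent.futures import ThreadPoolExecutor

def decode_jpeg(raw_bytes):
    # Placeholder decode: stands in for a real JPEG decoder such as
    # PIL.Image.open or torchvision.io.decode_jpeg. Real decoders
    # release the GIL, so threads parallelize the CPU-bound work.
    return len(raw_bytes)  # stand-in "decoded" result

def decode_batch(jpeg_blobs, num_workers=8):
    # Fan decoding out across workers; map() preserves input order.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(decode_jpeg, jpeg_blobs))

# Fake JPEG byte blobs of varying sizes for illustration.
blobs = [b"\xff\xd8" + bytes(100 + i) for i in range(32)]
decoded = decode_batch(blobs)
```

With 8 workers and a decoder running at 50-200 images/sec per core, this pattern scales decode throughput roughly linearly until cores are saturated.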

Memory and Batching

A 224x224 RGB image consumes about 150KB as a uint8 tensor (224 x 224 x 3 bytes); converted to float32, it grows to roughly 588KB. A batch of 32 uint8 images uses about 5MB; a batch of 256 uses about 38MB. Pipeline memory must accommodate multiple batches in flight simultaneously.
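As a sanity check on these sizes (note that 150KB per image corresponds to uint8 storage; float32 is 4x larger):

```python
H, W, C = 224, 224, 3
uint8_bytes = H * W * C            # one image, 1 byte per channel value
float32_bytes = uint8_bytes * 4    # 4 bytes per value after conversion

batch32 = 32 * uint8_bytes
batch256 = 256 * uint8_bytes

print(uint8_bytes)     # 150528 bytes, ~150KB
print(batch32 / 1e6)   # ~4.8 MB
print(batch256 / 1e6)  # ~38.5 MB
```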

Prefetching: While GPU processes batch N, CPU prepares batch N+1. This hides preprocessing latency but doubles memory requirements. Balance prefetch depth against available memory.
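A minimal prefetching sketch using a bounded queue: a producer thread prepares batches ahead of the consumer, and the queue depth caps how many extra batches sit in memory (the batch values here are toy stand-ins for real tensors):

```python
import queue
import threading

def prefetch(batch_iter, depth=2):
    # Producer thread fills a bounded queue while the consumer
    # (e.g. GPU inference) drains it. `depth` bounds extra memory:
    # at most `depth` prepared batches are in flight at once.
    q = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking end of the stream

    def producer():
        for batch in batch_iter:
            q.put(batch)  # blocks when `depth` batches are queued
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is done:
            return
        yield batch

# Toy usage: integers stand in for preprocessed batches.
batches = list(prefetch(iter(range(5)), depth=2))
```

Increasing `depth` hides more preprocessing jitter but holds more batches in memory, which is exactly the trade-off described above.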

Data Loading Patterns

Synchronous loading: load, preprocess, and run inference sequentially. Simple, but the GPU sits idle during loading; utilization drops to 30-50%.

Asynchronous loading: Separate threads handle loading and preprocessing. GPU stays busy while CPU prepares next batch. Achieves 80-95% GPU utilization.
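A back-of-envelope model of where those utilization numbers come from (the per-batch costs are illustrative assumptions, not measurements):

```python
load_ms, infer_ms = 10.5, 10.0  # assumed per-batch stage costs

# Synchronous: stages run back to back, so the GPU is busy only
# for the inference fraction of each cycle.
sync_util = infer_ms / (load_ms + infer_ms)    # ~49%

# Asynchronous: loading overlaps inference, so the cycle time is
# set by the slower stage; the GPU idles only for the difference.
async_util = infer_ms / max(load_ms, infer_ms)  # ~95%
```

When loading is faster than inference, the asynchronous model reaches 100% GPU utilization; the 80-95% figures reflect loading costs close to, or slightly above, inference time.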

Memory-mapped files: map dataset files directly into the process address space. The OS page cache handles caching and eviction, eliminating explicit loading code; this works best when the working set fits in RAM.
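A minimal sketch of memory-mapped record access using Python's stdlib `mmap`. The fixed-size record layout (tiny 8-byte "images") is an invented example standing in for pre-decoded HxWxC tensors:

```python
import mmap
import os
import tempfile

# Write a small "pre-decoded tensor" file: 4 records of 8 bytes each.
record = 8
path = os.path.join(tempfile.mkdtemp(), "tensors.bin")
with open(path, "wb") as f:
    for i in range(4):
        f.write(bytes([i]) * record)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Index record i as a byte slice; the OS pages data in on demand
    # and keeps hot records in the page cache across accesses.
    img2 = mm[2 * record:3 * record]
    mm.close()
```

With real data, the slice would be wrapped into a tensor (e.g. via `numpy.frombuffer`) without ever copying the whole file into memory.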

💡 Key Takeaways
JPEG decoding at 50-200 images/sec per CPU core often bottlenecks before GPU inference
GPU-accelerated decoders (nvJPEG) process 1000+ images/sec, eliminating CPU bottleneck
Prefetching hides preprocessing latency but doubles memory requirements
Asynchronous loading achieves 80-95% GPU utilization vs 30-50% with synchronous loading
📌 Interview Tips
1. Interview Tip: Identify image decoding as a commonly overlooked bottleneck: many teams focus on the GPU while CPU decoding limits throughput.
2. Interview Tip: Mention the prefetching trade-off: faster throughput at the cost of higher memory; size the prefetch queue based on available RAM.