Computer Vision Systems • Image Preprocessing (Augmentation, Normalization)
Production Data Pipeline Design and Throughput
In production training systems, preprocessing forms a data plane with strict throughput and latency targets. The goal is to keep accelerators above 90% utilization; if GPUs or Tensor Processing Units (TPUs) stall waiting for data, expensive compute cycles are wasted. The typical flow is: read compressed images from object storage, decode them to tensors, apply augmentations, normalize, batch, and transfer to the accelerators. Each stage must be carefully budgeted and parallelized.
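As a concrete reference for this flow, a minimal sketch using a PyTorch DataLoader with torchvision transforms is shown below. The dataset path, worker count, and prefetch depth are illustrative assumptions, not values from the text; the per GPU batch size of 256 matches the worked example that follows.

```python
# Minimal sketch of the read -> decode -> augment -> normalize -> batch flow,
# assuming a PyTorch + torchvision stack; paths and worker counts are placeholders.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),            # augment: random crop + resize
    transforms.RandomHorizontalFlip(),            # augment: horizontal flip
    transforms.ToTensor(),                        # decoded image -> float tensor in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # normalize with ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])

dataset = datasets.ImageFolder("/data/train", transform=train_transform)  # hypothetical path
loader = DataLoader(
    dataset,
    batch_size=256,          # per GPU batch, as in the worked example below
    num_workers=16,          # parallel CPU decode/augment workers
    pin_memory=True,         # page-locked buffers for faster host-to-device copies
    prefetch_factor=4,       # keep batches queued ahead of the training step
    persistent_workers=True,
)

for images, labels in loader:
    images = images.cuda(non_blocking=True)   # overlap transfer with compute
    labels = labels.cuda(non_blocking=True)
    # training step runs here
    break
```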
Consider a concrete example: 8 GPUs, global batch size 2048, image size 224x224, target step time 0.25 seconds. Each GPU processes 256 images per iteration, so the pipeline must deliver 8 × 256 / 0.25 = 8192 images per second. If the average JPEG is 180 KB, that requires approximately 1.4 GB per second of raw read bandwidth before caching. A 25 gigabits per second (Gbps) network link provides roughly 3.1 GB per second at line rate, so the network is not the constraint. CPU decode and augmentation, however, usually become the bottleneck: a CPU only pipeline typically tops out around 1000 to 2000 images per second per server, depending on cores and vectorization. To hit 8192 images per second, teams shift decode and augmentation to the accelerator or allocate dedicated CPU pools.
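The budget arithmetic can be written out directly; the snippet below only restates the numbers from the paragraph above, with variable names of my choosing.

```python
# Back-of-envelope throughput budget for the worked example above.
num_gpus = 8
global_batch = 2048
step_time_s = 0.25
avg_jpeg_kb = 180

required_ips = global_batch / step_time_s                    # 2048 / 0.25 = 8192 images/s
read_bw_gbps = required_ips * avg_jpeg_kb / (1024 * 1024)    # GB/s, binary units
network_gbps = 25 / 8                                        # 25 Gbps link at line rate

print(f"required throughput: {required_ips:.0f} images/s")   # 8192
print(f"raw read bandwidth:  {read_bw_gbps:.2f} GB/s")       # ~1.41
print(f"25 Gbps line rate:   {network_gbps:.2f} GB/s")       # ~3.13
```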
NVIDIA benchmarks show that moving JPEG decode and basic augmentations to GPUs sustains 3000 to 6000 images per second per server, enabling 2 servers with prefetching to reach the target with margin. Storage patterns are equally critical. Reading millions of small files kills throughput due to open, stat, and seek overhead. Large scale systems shard datasets into 1 GB to 4 GB sequential record files, which can double effective throughput over random access. Meta and Google use per worker sharding and local prefetch queues to avoid cross worker contention and guarantee each sample is seen once per epoch.
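The text does not prescribe a specific library for GPU-side decode, but a common way to structure it is sketched below, assuming NVIDIA DALI (which uses nvJPEG internally). The file path, batch size, and thread count are placeholders, not values from the benchmarks cited above.

```python
# Sketch of GPU-side JPEG decode + augmentation, assuming NVIDIA DALI.
from nvidia.dali import pipeline_def, fn, types

@pipeline_def(batch_size=256, num_threads=8, device_id=0)
def gpu_decode_pipeline(file_root):
    jpegs, labels = fn.readers.file(file_root=file_root, random_shuffle=True)
    # "mixed" decodes JPEG headers on CPU and pixel data on GPU via nvJPEG
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    images = fn.random_resized_crop(images, size=[224, 224])
    # Fused crop/flip/normalize kernel; mean/std are ImageNet stats on a 0-255 scale
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
        mirror=fn.random.coin_flip(),
    )
    return images, labels

pipe = gpu_decode_pipeline(file_root="/data/train")   # hypothetical path
pipe.build()
images, labels = pipe.run()
```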
At inference time, the picture simplifies. Only normalization, resize, and possibly center crop are applied. For a real time camera application at 30 frames per second, preprocessing must complete in under 2 milliseconds per frame to keep end to end latency below 30 milliseconds. Edge devices use fixed integer scales and hardware assist to meet power budgets.
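A CPU reference version of this inference path might look like the sketch below, using OpenCV and NumPy. The short-side-256 resize and 224x224 center crop policy is an assumption for illustration; on an edge device the mean and std division would typically be folded into fixed integer scale factors rather than computed in floating point.

```python
# Minimal sketch of inference-time preprocessing: resize, center crop, normalize.
import time
import cv2
import numpy as np

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(frame_bgr: np.ndarray) -> np.ndarray:
    # Resize the short side to 256, then take a 224x224 center crop.
    h, w = frame_bgr.shape[:2]
    scale = 256 / min(h, w)
    resized = cv2.resize(frame_bgr, (round(w * scale), round(h * scale)))
    rh, rw = resized.shape[:2]
    top, left = (rh - 224) // 2, (rw - 224) // 2
    crop = resized[top:top + 224, left:left + 224]
    # BGR -> RGB, scale to [0, 1], normalize, and reorder to CHW.
    rgb = cv2.cvtColor(crop, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    return ((rgb - MEAN) / STD).transpose(2, 0, 1)

frame = np.zeros((720, 1280, 3), dtype=np.uint8)   # stand-in camera frame
start = time.perf_counter()
tensor = preprocess(frame)
print(f"preprocess took {(time.perf_counter() - start) * 1e3:.2f} ms", tensor.shape)
```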
💡 Key Takeaways
• 8 GPU training at a 0.25 second step time and batch size 2048 requires 8192 images per second of throughput; CPU only decode typically caps out at 1000 to 2000 images per second per server
• Sharding datasets into 1 GB to 4 GB files can double throughput over millions of small files by reducing metadata overhead and enabling sequential reads
• NVIDIA benchmarks report 3000 to 6000 images per second per server when decode and augmentation run on GPUs, enabling 2 servers to sustain 8192 images per second
• Real time inference at 30 frames per second needs preprocessing under 2 milliseconds per frame; larger 640x640 detection inputs may require 4 to 6 milliseconds
• Meta and Google use per worker dataset sharding to avoid cross worker contention and guarantee each sample is seen exactly once per epoch
📌 Examples
NVIDIA DGX system: moves JPEG decode to GPU using nvJPEG library, sustaining 5000 images per second per node for 224x224 images with basic augmentations
Google TPU pods: shard ImageNet into 1024 files of 1.2 GB each, achieving 50 GB/s aggregate read throughput across 128 workers (see the sharding sketch after these examples)
Tesla inference pipeline: processes 1280x960 camera frames in under 5 milliseconds per frame on custom accelerator, including resize and normalization
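The Google TPU pods example relies on exactly the kind of large sequential record file described earlier. Below is a simplified sketch of the packing step; the length-prefixed record layout, the 2 GB target shard size, and the paths are assumptions rather than any particular production format such as TFRecord or WebDataset.

```python
# Simplified sketch: pack many small JPEGs into large sequential record shards.
import glob
import os
import struct

TARGET_SHARD_BYTES = 2 * 1024**3   # aim for shards in the 1 GB to 4 GB range

def write_shards(jpeg_dir: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    shard_idx, shard_bytes, shard = 0, 0, None
    for path in sorted(glob.glob(os.path.join(jpeg_dir, "*.jpg"))):
        if shard is None or shard_bytes >= TARGET_SHARD_BYTES:
            if shard:
                shard.close()
            shard = open(os.path.join(out_dir, f"shard-{shard_idx:05d}.bin"), "wb")
            shard_idx, shard_bytes = shard_idx + 1, 0
        payload = open(path, "rb").read()
        # Each record: 8-byte little-endian length prefix, then the raw JPEG bytes.
        shard.write(struct.pack("<Q", len(payload)))
        shard.write(payload)
        shard_bytes += 8 + len(payload)
    if shard:
        shard.close()

write_shards("/data/jpegs", "/data/shards")   # hypothetical paths
```

Readers then stream each shard sequentially and assign disjoint shard lists to workers, which is what makes per worker sharding and the once-per-epoch guarantee straightforward to enforce.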