
Training Pipeline: From Pretraining to Production

Production training pipelines combine large-scale pretraining with task-specific fine-tuning to achieve strong performance across diverse real-world data. Companies start with pretraining on hundreds of millions of images, using supervised datasets such as ImageNet or self-supervised methods on unlabeled data. This yields a strong backbone that captures general visual features. Fine-tuning on tens of millions of task-specific images then closes the domain gap between generic pretraining and production use cases.

Scale determines infrastructure requirements. Large companies report training runs with 64 to 512 GPUs, global batch sizes from 2,048 to 8,192, and wall-clock times of 1 to 10 days for modern backbones. Concrete example: training on 100 million images at 224×224 resolution for 50 epochs, with a global batch size of 4,096 on 256 GPUs at 0.2 seconds per step, yields approximately 1.22 million steps and completes in about 2.8 days. Input/output typically dominates training time, so teams use sharded binary record files, aggressive caching, and JPEG acceleration libraries to keep GPUs saturated.

Class imbalance presents severe challenges in production data. Long-tail categories might have 100 examples while popular classes have millions. Without correction, rare-class F1 scores approach zero. Teams apply class-aware sampling, loss reweighting with inverse frequency, and hard example mining to surface difficult instances. Semi-supervised learning with pseudo-labels helps by expanding training data for rare classes using confident predictions on unlabeled images.

Data quality and augmentation directly impact generalization. Strong augmentations, including color jitter, random resized crops, and mixup-based techniques, improve robustness. Mixed-precision training and gradient checkpointing enable larger batches within memory constraints. Evaluation must happen on held-out, production-like sets that include challenging and rare classes, not just curated benchmark splits. Checkpoints bundle model weights, preprocessing code, taxonomy version, and validation metrics to ensure reproducibility.
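The step count and wall-clock figure above follow directly from the quoted run parameters; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope training time using the figures quoted above:
# 100M images, 50 epochs, global batch 4,096, 0.2 s per optimizer step.
num_images   = 100_000_000
epochs       = 50
global_batch = 4_096
sec_per_step = 0.2

steps_per_epoch = num_images // global_batch            # ~24,414 steps
total_steps     = steps_per_epoch * epochs              # ~1.22M steps
wall_clock_days = total_steps * sec_per_step / 86_400   # ~2.8 days

print(f"{total_steps:,} steps, {wall_clock_days:.1f} days")
```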
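For the class-imbalance corrections, here is a minimal PyTorch sketch of inverse-frequency loss weights combined with class-aware sampling. The synthetic long-tailed dataset stands in for real training data read from sharded records:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Synthetic long-tailed data: four classes with 10, 50, 250, and 1,250 examples.
counts = torch.tensor([10, 50, 250, 1250])
labels = torch.repeat_interleave(torch.arange(len(counts)), counts)
images = torch.randn(len(labels), 3, 32, 32)   # stand-in for decoded images

# Inverse-frequency class weights, normalized so a balanced class gets weight ~1.
class_counts  = torch.bincount(labels).float()
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

# Loss reweighting: rare classes contribute more gradient per example.
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)

# Class-aware sampling: rare-class images are drawn more often per epoch.
sampler = WeightedRandomSampler(class_weights[labels], num_samples=len(labels))
loader  = DataLoader(TensorDataset(images, labels), batch_size=256, sampler=sampler)
```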
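A sketch of the pseudo-labeling step, assuming a trained PyTorch classifier and a loader that yields batches of unlabeled images; the confidence threshold is a tuning choice:

```python
import torch

@torch.no_grad()
def pseudo_label(model, unlabeled_loader, threshold=0.95):
    """Return (images, labels) kept from confident predictions on unlabeled data."""
    model.eval()
    kept_images, kept_labels = [], []
    for images in unlabeled_loader:
        probs = torch.softmax(model(images), dim=1)
        confidence, preds = probs.max(dim=1)
        mask = confidence >= threshold          # keep only confident predictions
        kept_images.append(images[mask])
        kept_labels.append(preds[mask])
    # The concatenated tensors are then mixed into the labeled training set,
    # expanding coverage of rare classes.
    return torch.cat(kept_images), torch.cat(kept_labels)
```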
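The augmentation stack can be expressed with standard torchvision transforms plus a small mixup helper; the parameter values here are illustrative, not a tuned recipe:

```python
import torch
from torchvision import transforms

# Strong augmentation stack: random resized crop, flip, and color jitter.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])

def mixup(images, labels_onehot, alpha=0.2):
    """Blend random pairs of examples and their label distributions (mixup)."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_labels = lam * labels_onehot + (1 - lam) * labels_onehot[perm]
    return mixed_images, mixed_labels
```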
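A mixed-precision training step in PyTorch looks roughly like the following sketch; gradient checkpointing would be enabled separately inside the backbone (e.g. via torch.utils.checkpoint), and the model, loader, loss, and optimizer are assumed to be defined elsewhere:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

def train_one_epoch(model, loader, criterion, optimizer, device="cuda"):
    """One epoch of mixed-precision training (fp16 forward, scaled backward)."""
    scaler = GradScaler()
    model.train()
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad(set_to_none=True)
        with autocast():                          # half-precision forward and loss
            loss = criterion(model(images), targets)
        scaler.scale(loss).backward()             # loss scaling avoids fp16 underflow
        scaler.step(optimizer)
        scaler.update()
```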
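Finally, a bundled checkpoint can be as simple as a single dictionary saved alongside the weights; the field names here are illustrative rather than a standard format:

```python
import torch

def save_bundle(model, path, preprocess_cfg, taxonomy_version, metrics):
    """Save weights together with everything needed to reproduce inference."""
    torch.save(
        {
            "model_state": model.state_dict(),
            "preprocess": preprocess_cfg,          # resize, crop, normalization stats
            "taxonomy_version": taxonomy_version,  # which label set the outputs map to
            "metrics": metrics,                    # validation numbers at save time
        },
        path,
    )
```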
💡 Key Takeaways
Training at scale uses 64 to 512 GPUs with global batch sizes of 2,048 to 8,192, completing in 1 to 10 days for 100-million-image runs at 50 epochs
Input/output operations dominate training time, requiring sharded binary records and JPEG acceleration to saturate GPU utilization and avoid bottlenecks
Class imbalance causes rare-category F1 scores to approach zero without intervention, requiring reweighting, hard example mining, and semi-supervised pseudo-labeling
Mixed-precision training and gradient checkpointing enable fitting larger batch sizes within GPU memory constraints, improving convergence speed
Checkpoints must bundle model weights, preprocessing code, taxonomy version, and metrics to ensure reproducibility across training and serving environments
Strong augmentations including color jitter, random resized crop, and mixup improve generalization on real-world production data beyond curated benchmarks
📌 Examples
Amazon product classification: Pretrain on 200M generic images, fine-tune on 30M product catalog images with 10,000 categories, apply inverse-frequency weighting to handle 100:1 class imbalance, achieve 92% top-5 accuracy
Training run calculation: 100M images, 224x224 resolution, 50 epochs, batch 4,096 on 256 GPUs, 0.2 s per step gives 1.22M steps, completing in 2.8 days of wall-clock time
Google Photos model: Semi-supervised learning generates pseudo-labels on 500M unlabeled images with a confidence threshold of 0.95, expands rare-class training data by 10x, improves rare-class recall from 45% to 78%