
Image Classification at Scale: Architecture and Data Flow

Image classification at scale involves predicting labels for millions to billions of images within strict latency and cost constraints. Production systems handle two distinct paths: online serving for user-facing features requiring 50 to 150 ms p99 latency, and offline batch processing for bulk indexing at 5,000 to 50,000 images per second per GPU cluster.

The complete flow starts with ingestion writing raw uploads to object storage, typically petabytes of JPEG or WebP. At 200 KB average per image, 1 billion images consume roughly 200 TB of storage. Content hashing detects exact duplicates, while perceptual hashing or learned embeddings catch near duplicates. A metadata bus publishes new items to downstream consumers.

Offline pipelines precompute embeddings and labels in bulk. Embeddings at 512 float32 dimensions take 2 KB per image, totaling approximately 2 TB for 1 billion images, plus replication overhead.

Online serving uses edge caches or CDN nodes storing predictions keyed by content hash, covering 80 to 95 percent of repeat requests. Cache misses route to GPU clusters with dynamic batching. A single A100 GPU delivers roughly 1,000 to 3,000 ResNet-50-sized inferences per second at batch sizes between 8 and 32, with compute time around 3 to 10 ms per batch.

Google Photos and Pinterest use asynchronous processing for bulk uploads, accepting images immediately and completing classification within minutes. Content moderation at Meta and Amazon requires faster decisions, mixing a lightweight model online with deeper models in async review pipelines.
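The online serving path described above can be sketched as a content-hash cache lookup with batched fallback to the model. This is a minimal illustration, not a production implementation: `PredictionCache` and `classify_batch` are hypothetical names standing in for the CDN-backed cache and the real GPU model call.

```python
import hashlib

class PredictionCache:
    """In-memory stand-in for the edge/CDN prediction cache."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def put(self, key, labels):
        self._store[key] = labels

def content_hash(image_bytes: bytes) -> str:
    # Exact-duplicate key; near-duplicates would instead use a
    # perceptual hash or a learned embedding.
    return hashlib.sha256(image_bytes).hexdigest()

def classify_batch(batch: list[bytes]) -> list[str]:
    # Placeholder for the real model call (e.g., a ResNet-50 forward pass).
    return ["label"] * len(batch)

def serve(images: list[bytes], cache: PredictionCache, batch_size: int = 32):
    """Return labels keyed by content hash, batching only the cache misses."""
    results = {}
    misses = []
    for img in images:
        key = content_hash(img)
        cached = cache.get(key)
        if cached is not None:
            results[key] = cached      # 80-95% of repeat requests end here
        elif key not in results:
            misses.append((key, img))
    # Dynamic batching: group misses into GPU-sized batches (8-32 in the text).
    for i in range(0, len(misses), batch_size):
        batch = misses[i:i + batch_size]
        labels = classify_batch([img for _, img in batch])
        for (key, _), label in zip(batch, labels):
            cache.put(key, label)      # later requests for this hash are hits
            results[key] = label
    return results
```

In a real system the batcher would also hold requests for a few milliseconds to fill batches across concurrent users, trading a small queueing delay for much higher GPU utilization.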
💡 Key Takeaways
Online serving targets 50 to 150 ms p99 latency with cache hit rates of 80 to 95 percent using content hash lookups
Offline batch processing achieves 5,000 to 50,000 images per second per GPU cluster depending on model size and I/O optimization
A single A100 GPU delivers 1,000 to 3,000 ResNet-50 inferences per second at batch sizes of 8 to 32, with 3 to 10 ms compute time per batch
Storage scales linearly: 200 TB raw images plus 2 TB embeddings per billion items at 512 float32 dimensions with replication overhead
Asynchronous processing for uploads enables immediate acceptance with classification landing within minutes, used by Google Photos and Pinterest
Content moderation pipelines mix fast lightweight models online with deeper async models to balance speed and accuracy for safety-critical decisions
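The fast-online / deep-async split in the last takeaway can be sketched as below. The model stand-ins and the 0.5 threshold are assumptions for illustration only; real pipelines would use tuned thresholds and actual MobileNet/EfficientNet-class models.

```python
from queue import Queue

FLAG_THRESHOLD = 0.5  # assumed cutoff: scores above this route to review

def light_model_score(image: bytes) -> float:
    # Stand-in for a ~15 ms lightweight online model.
    return 0.9 if b"bad" in image else 0.1

def heavy_model_score(image: bytes) -> float:
    # Stand-in for the slower, more accurate async model.
    return 0.95 if b"bad" in image else 0.05

review_queue: Queue = Queue()

def moderate_online(image: bytes) -> str:
    """Fast path: decide instantly, enqueue flagged items for deep review."""
    if light_model_score(image) >= FLAG_THRESHOLD:
        review_queue.put(image)  # deeper model + human review happen async
        return "blocked_pending_review"
    return "allowed"

def drain_review_queue() -> list[str]:
    """Async path: rerun flagged items through the heavier model."""
    decisions = []
    while not review_queue.empty():
        img = review_queue.get()
        score = heavy_model_score(img)
        decisions.append("confirmed" if score >= FLAG_THRESHOLD else "released")
    return decisions
```

The design choice is that the light model only needs high recall on clear violations; anything borderline is blocked provisionally and re-scored offline, keeping the user-visible decision within the online latency budget.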
📌 Examples
Pinterest bulk upload flow: User uploads 100 photos, system returns immediate success, background pipeline processes at 10,000 images/second, embeddings and labels available in 10 seconds for search indexing
Meta content moderation: Lightweight MobileNet runs in 15 ms online for instant blocking, flagged content routes to heavier EfficientNet model completing in 500 ms for human review queue
Google Photos at 1 billion images: 200 TB raw storage, 2 TB embeddings at 512 dimensions (6 TB of embeddings with triple replication), cache serves 90% of requests with sub-10 ms latency
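The storage figures in the examples above follow from simple arithmetic, sketched here using the article's own constants (decimal TB; variable names are illustrative):

```python
NUM_IMAGES = 1_000_000_000
AVG_IMAGE_BYTES = 200_000       # 200 KB average per image
EMBED_DIM = 512
FLOAT32_BYTES = 4
REPLICATION = 3                  # triple replication from the example

raw_tb = NUM_IMAGES * AVG_IMAGE_BYTES / 1e12            # 200.0 TB raw images
embed_bytes_per_image = EMBED_DIM * FLOAT32_BYTES       # 2,048 B (~2 KB)
embed_tb = NUM_IMAGES * embed_bytes_per_image / 1e12    # ~2 TB of embeddings
replicated_embed_tb = embed_tb * REPLICATION            # ~6 TB replicated
```

Embeddings are thus only about 1% of raw image storage, which is why precomputing and replicating them for a billion items is cheap relative to the images themselves.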