Production VLM Systems: Routing and Scale
The Routing Problem
In production, you never run just a single VLM. User requests vary wildly: a simple text question about a single screenshot needs a sub-second response, while analyzing a 100-page legal contract can take minutes. Running every request through your largest, most capable model wastes compute and money. Production systems at companies like OpenAI, Google, and Meta use intelligent routing layers that classify incoming requests by modality mix, complexity, user tier, and latency budget, then dynamically choose between small, medium, and large VLM variants.
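A minimal sketch of such a routing layer is below. The request fields, thresholds, and tier names are illustrative assumptions, not any particular vendor's actual policy.

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    FAST = "vlm-4b"        # interactive, sub-second targets
    BALANCED = "vlm-70b"   # document Q&A and similar workloads
    HEAVY = "vlm-235b"     # offline, quality-first jobs

@dataclass
class Request:
    num_images: int
    num_pages: int          # 0 for screenshot-style requests
    latency_budget_ms: int  # from the caller's SLA
    user_tier: str          # "free" | "pro" | "enterprise"

def route(req: Request) -> Tier:
    """Pick a model tier from request shape, latency budget, and user tier.

    The thresholds here are placeholders; real routers learn or tune them
    against traffic and cost data.
    """
    # Hard latency budgets win: an interactive chat turn cannot wait for a
    # large model, regardless of content.
    if req.latency_budget_ms <= 2_000:
        return Tier.FAST
    # Long, multi-page documents justify the most capable model when the
    # caller can tolerate offline-style latency.
    if req.num_pages > 20 or (req.user_tier == "enterprise" and req.num_images > 10):
        return Tier.HEAVY
    return Tier.BALANCED
```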
The Three-Tier Strategy
First, latency-sensitive flows like customer chat target p50 under 800ms and p99 under 2 seconds. These route to 4B to 9B parameter models like Gemma 3 4B or GLM 4.6V Flash running on consumer-grade GPUs, trading some quality for speed. Second, balanced workloads like document Q&A might use 30B to 70B models with p50 latency of 2 to 5 seconds; these run on A100 or H100 GPUs with mixed precision and batching to maximize throughput. Third, high-value offline tasks such as invoice auditing, legal document review, or medical image analysis route to 70B to 235B models like Qwen3 VL 235B. The service-level agreement (SLA) here is measured in minutes rather than milliseconds, but quality is paramount; these jobs may use multi-GPU inference or even CPU offloading for rare, expensive queries.
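The tiers can be captured as a declarative configuration that the router and autoscaler both consume. The sketch below mirrors the numbers above where the text gives them; the balanced and offline p99 budgets and the placeholder model names are assumptions.

```python
# Illustrative tier configuration; in a real deployment this would live in a
# config store, not in code.
TIERS = {
    "interactive": {
        "models": ["gemma-3-4b", "glm-4.6v-flash"],  # 4B-9B class
        "gpu": "consumer-grade",
        "p50_ms": 800,
        "p99_ms": 2_000,
    },
    "balanced": {
        "models": ["vlm-30b", "vlm-70b"],            # placeholder names
        "gpu": "A100/H100",
        "p50_ms": 5_000,
        "p99_ms": 15_000,                            # assumed budget
    },
    "offline": {
        "models": ["qwen3-vl-235b"],
        "gpu": "multi-GPU",
        "p50_ms": 120_000,                           # minutes-scale SLA
        "p99_ms": 600_000,                           # assumed budget
    },
}

def violates_sla(tier: str, observed_p99_ms: float) -> bool:
    """Flag a tier whose observed p99 latency exceeds its configured budget."""
    return observed_p99_ms > TIERS[tier]["p99_ms"]
```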
Specialized Preprocessing
Some systems add a preprocessing tier. If the router detects a text-heavy document (invoice, receipt, form), it first routes to a specialized OCR model like DeepSeek OCR, which compresses the visual content 10x to 20x, then feeds the compact representation to a reasoning VLM like QVQ 72B. This two-stage approach reduces end-to-end cost by 60% to 80% for document-heavy workloads while maintaining quality.
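A sketch of the two-stage path follows. The `ocr_extract` and `vlm_generate` callables stand in for hypothetical model-server clients; they are not real library APIs.

```python
from typing import Callable

def process_document(
    images: list[bytes],
    ocr_extract: Callable[[bytes], str],   # small OCR-specialized model client
    vlm_generate: Callable[[str], str],    # large reasoning VLM client
) -> str:
    """Two-stage document path: compress pages to text, then reason over the text."""
    # Stage 1: OCR each page, replacing a large image-token footprint per page
    # with a compact text representation (the 10x-20x compression above).
    page_texts = [ocr_extract(img) for img in images]

    # Stage 2: the reasoning VLM only ever sees text, so its expensive context
    # window and attention are spent on content rather than raw pixels.
    prompt = "Review the following document pages:\n\n" + "\n\n".join(page_texts)
    return vlm_generate(prompt)
```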
Capacity Planning
At scale, you need separate inference pools. A pool of 50 A100 GPUs running 4B models at 20 QPS each handles 1,000 QPS in aggregate for interactive queries. A smaller pool of 10 H100 GPUs running 70B models handles 50 to 100 QPS for higher-quality requests. A batch processing cluster runs 235B models overnight for audit workloads.
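The pool sizing is back-of-envelope arithmetic. The sketch below reproduces the numbers above; the per-GPU throughput for the 70B pool and the 70% utilization headroom are assumed planning figures, not from the text.

```python
import math

def gpus_needed(target_qps: float, qps_per_gpu: float, headroom: float = 1.0) -> int:
    """GPUs required for a pool to sustain `target_qps`.

    `headroom` < 1.0 reserves capacity for bursts; 1.0 reproduces the
    back-of-envelope numbers above.
    """
    return math.ceil(target_qps / (qps_per_gpu * headroom))

# Interactive pool: 4B models at ~20 QPS per GPU -> 50 GPUs for 1,000 QPS.
print(gpus_needed(1_000, 20))                 # 50
# Quality pool: 70B models at ~10 QPS per H100 (assumed) -> 10 GPUs for 100 QPS.
print(gpus_needed(100, 10))                   # 10
# With a 70% utilization cap as burst headroom (assumed planning margin):
print(gpus_needed(1_000, 20, headroom=0.7))   # 72
```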
Observability
Production systems track latency separately for each pipeline stage. A spike in vision-encoding latency (say, from 150ms to 500ms) points to GPU memory pressure or inefficient batching, while a spike in decoding latency suggests context-length growth or attention bottlenecks. Monitoring p99 latency by model tier reveals which capacity pools need scaling.
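A minimal sketch of per-stage, per-tier tracking is below. In production this would be a metrics backend such as Prometheus histograms; the in-memory version and the stage names are illustrative assumptions.

```python
from collections import defaultdict

class StageLatencyTracker:
    """Track per-tier, per-stage latencies and flag p99 budget violations."""

    def __init__(self) -> None:
        self.samples: dict[tuple[str, str], list[float]] = defaultdict(list)

    def record(self, tier: str, stage: str, latency_ms: float) -> None:
        # Example stages: "vision_encode", "prefill", "decode", "postprocess".
        self.samples[(tier, stage)].append(latency_ms)

    def p99(self, tier: str, stage: str) -> float:
        data = sorted(self.samples[(tier, stage)])
        return data[int(0.99 * (len(data) - 1))] if data else 0.0

    def alerts(self, budgets_ms: dict[tuple[str, str], float]) -> list[str]:
        """Return the (tier, stage) pairs whose p99 exceeds the given budget."""
        return [
            f"{tier}/{stage}: p99 {self.p99(tier, stage):.0f}ms > {budget:.0f}ms"
            for (tier, stage), budget in budgets_ms.items()
            if self.p99(tier, stage) > budget
        ]
```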