Production VLM Systems: Routing and Scale
The Routing Problem
In production, you never run just a single VLM. User requests vary wildly: a simple text question about a single screenshot needs a sub-second response, while analyzing a 100-page legal contract can take minutes. Running every request through your largest, most capable model wastes compute and money. Production systems at companies like OpenAI, Google, and Meta use intelligent routing layers that classify incoming requests by modality mix, complexity, user tier, and latency budget, then dynamically choose between small, medium, and large VLM variants.
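A minimal sketch of such a routing layer is below. The request fields, thresholds, and tier names are illustrative assumptions, not any particular vendor's actual policy.

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    FAST = "vlm-4b"        # interactive, sub-second targets
    BALANCED = "vlm-70b"   # document Q&A and similar workloads
    HEAVY = "vlm-235b"     # offline, quality-first jobs

@dataclass
class Request:
    num_images: int
    num_pages: int          # 0 for screenshot-style requests
    latency_budget_ms: int  # from the caller's SLA
    user_tier: str          # "free" | "pro" | "enterprise"

def route(req: Request) -> Tier:
    """Pick a model tier from request shape, latency budget, and user tier.

    The thresholds here are placeholders; real routers learn or tune them
    against traffic and cost data.
    """
    # Hard latency budgets win: an interactive chat turn cannot wait for a
    # large model, regardless of content.
    if req.latency_budget_ms <= 2_000:
        return Tier.FAST
    # Long, multi-page documents justify the most capable model when the
    # caller can tolerate offline-style latency.
    if req.num_pages > 20 or (req.user_tier == "enterprise" and req.num_images > 10):
        return Tier.HEAVY
    return Tier.BALANCED
```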
The Three-Tier Strategy
First, latency-sensitive flows like customer chat target p50 under 800ms and p99 under 2 seconds. These route to 4B to 9B parameter models like Gemma 3 4B or GLM 4.6V Flash running on consumer-grade GPUs, trading some quality for speed. Second, balanced workloads like document Q&A might use 30B to 70B models with p50 latency of 2 to 5 seconds; these run on A100 or H100 GPUs with mixed precision and batching to maximize throughput. Third, high-value offline tasks such as invoice auditing, legal document review, or medical image analysis route to 70B to 235B models like Qwen3 VL 235B. The service-level agreement (SLA) here is measured in minutes rather than milliseconds, but quality is paramount; these jobs may use multi-GPU inference or even CPU offloading for rare, expensive queries.
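The tiers can be captured as a declarative configuration that the router and autoscaler both consume. The sketch below mirrors the numbers above where the text gives them; the balanced and offline p99 budgets and the placeholder model names are assumptions.

```python
# Illustrative tier configuration; in a real deployment this would live in a
# config store, not in code.
TIERS = {
    "interactive": {
        "models": ["gemma-3-4b", "glm-4.6v-flash"],  # 4B-9B class
        "gpu": "consumer-grade",
        "p50_ms": 800,
        "p99_ms": 2_000,
    },
    "balanced": {
        "models": ["vlm-30b", "vlm-70b"],            # placeholder names
        "gpu": "A100/H100",
        "p50_ms": 5_000,
        "p99_ms": 15_000,                            # assumed budget
    },
    "offline": {
        "models": ["qwen3-vl-235b"],
        "gpu": "multi-GPU",
        "p50_ms": 120_000,                           # minutes-scale SLA
        "p99_ms": 600_000,                           # assumed budget
    },
}

def violates_sla(tier: str, observed_p99_ms: float) -> bool:
    """Flag a tier whose observed p99 latency exceeds its configured budget."""
    return observed_p99_ms > TIERS[tier]["p99_ms"]
```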
Specialized Preprocessing
Some systems add a preprocessing tier. If the router detects a text-heavy document (invoice, receipt, form), it first routes to a specialized OCR model like DeepSeek OCR, which compresses the visual content 10x to 20x, then feeds the compact representation to a reasoning VLM like QVQ 72B. This two-stage approach reduces end-to-end cost by 60% to 80% for document-heavy workloads while maintaining quality.
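A sketch of the two-stage path follows. The `ocr_extract` and `vlm_generate` callables stand in for hypothetical model-server clients; they are not real library APIs.

```python
from typing import Callable

def process_document(
    images: list[bytes],
    ocr_extract: Callable[[bytes], str],   # small OCR-specialized model client
    vlm_generate: Callable[[str], str],    # large reasoning VLM client
) -> str:
    """Two-stage document path: compress pages to text, then reason over the text."""
    # Stage 1: OCR each page, replacing a large image-token footprint per page
    # with a compact text representation (the 10x-20x compression above).
    page_texts = [ocr_extract(img) for img in images]

    # Stage 2: the reasoning VLM only ever sees text, so its expensive context
    # window and attention are spent on content rather than raw pixels.
    prompt = "Review the following document pages:\n\n" + "\n\n".join(page_texts)
    return vlm_generate(prompt)
```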
Capacity Planning
At scale, you need separate inference pools. A pool of 50 A100 GPUs running 4B models at 20 QPS each handles 1,000 QPS in aggregate for interactive queries. A smaller pool of 10 H100 GPUs running 70B models handles 50 to 100 QPS for higher-quality requests. A batch processing cluster runs 235B models overnight for audit workloads.
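The pool sizing is back-of-envelope arithmetic. The sketch below reproduces the numbers above; the per-GPU throughput for the 70B pool and the 70% utilization headroom are assumed planning figures, not from the text.

```python
import math

def gpus_needed(target_qps: float, qps_per_gpu: float, headroom: float = 1.0) -> int:
    """GPUs required for a pool to sustain `target_qps`.

    `headroom` < 1.0 reserves capacity for bursts; 1.0 reproduces the
    back-of-envelope numbers above.
    """
    return math.ceil(target_qps / (qps_per_gpu * headroom))

# Interactive pool: 4B models at ~20 QPS per GPU -> 50 GPUs for 1,000 QPS.
print(gpus_needed(1_000, 20))                 # 50
# Quality pool: 70B models at ~10 QPS per H100 (assumed) -> 10 GPUs for 100 QPS.
print(gpus_needed(100, 10))                   # 10
# With a 70% utilization cap as burst headroom (assumed planning margin):
print(gpus_needed(1_000, 20, headroom=0.7))   # 72
```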
Observability
Production systems track latency separately for each pipeline stage. A spike in vision-encoding latency (say, from 150ms to 500ms) points to GPU memory pressure or inefficient batching, while a spike in decoding latency suggests context-length growth or attention bottlenecks. Monitoring p99 latency by model tier reveals which capacity pools need scaling.
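A minimal sketch of per-stage, per-tier tracking is below. In production this would be a metrics backend such as Prometheus histograms; the in-memory version and the stage names are illustrative assumptions.

```python
from collections import defaultdict

class StageLatencyTracker:
    """Track per-tier, per-stage latencies and flag p99 budget violations."""

    def __init__(self) -> None:
        self.samples: dict[tuple[str, str], list[float]] = defaultdict(list)

    def record(self, tier: str, stage: str, latency_ms: float) -> None:
        # Example stages: "vision_encode", "prefill", "decode", "postprocess".
        self.samples[(tier, stage)].append(latency_ms)

    def p99(self, tier: str, stage: str) -> float:
        data = sorted(self.samples[(tier, stage)])
        return data[int(0.99 * (len(data) - 1))] if data else 0.0

    def alerts(self, budgets_ms: dict[tuple[str, str], float]) -> list[str]:
        """Return the (tier, stage) pairs whose p99 exceeds the given budget."""
        return [
            f"{tier}/{stage}: p99 {self.p99(tier, stage):.0f}ms > {budget:.0f}ms"
            for (tier, stage), budget in budgets_ms.items()
            if self.p99(tier, stage) > budget
        ]
```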