
VLM Failure Modes and Edge Cases at Scale

Input Corruption and Silent Degradation: Multimodal systems have more failure modes than text-only LLMs because every modality introduces new edge cases. Images arrive corrupted (truncated downloads, encoding errors), at extremely high resolution (24-megapixel phone photos when the model expects 896×896 input), or contain multiple documents per page (scanned multi-page receipts as a single image). The danger is silent degradation. If your preprocessing blindly resizes a 6000×4000-pixel receipt to 896×896, small text becomes unreadable. Your VLM returns plausible but wrong invoice amounts because it literally cannot see the digits. This fails silently: no error, just bad output. The fix is resolution-aware tiling. For high-resolution inputs, split the image into overlapping tiles (1024×1024 with 128-pixel overlap), process each tile separately, then merge the results; a minimal sketch of this appears after the video discussion below. Tiling increases token count by 4x to 9x but preserves fine details. Alternatively, use native-resolution encoders like those in Pixtral, which accept arbitrary image sizes but require more sophisticated attention mechanisms.

Video Processing Edge Cases: Video introduces temporal failures. Users upload 10-minute screen recordings when your system expects 30-second clips. At 4 FPS, that is 2,400 frames × 4,096 tokens ≈ 9.8 million tokens, far exceeding any context window. Frame deduplication helps but creates new issues. If you use simple frame differencing, slow animations or gradual transitions get over-compressed (a 60-second fade treated as one frame). If you deduplicate with DINOv2 embedding similarity, dropping a frame when its similarity to the last kept frame exceeds a threshold, that threshold needs tuning: too aggressive (threshold 0.7) and you miss subtle changes; too conservative (threshold 0.95) and you keep too many redundant frames.
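Below is a minimal sketch of what resolution-aware tiling could look like, assuming the 1024×1024 tiles with 128-pixel overlap described above; the model call (`run_vlm_on_tile`) and the direct-resize cutoff (`MAX_DIRECT_EDGE`) are illustrative placeholders, not a specific library's API.

```python
# Sketch of resolution-aware tiling: split a high-resolution image into
# overlapping tiles so small text survives instead of being destroyed by a
# blind resize. Tile size and overlap follow the numbers above.
from PIL import Image

TILE_SIZE = 1024        # pixels per tile edge
OVERLAP = 128           # pixels shared between neighbouring tiles
MAX_DIRECT_EDGE = 1536  # assumption: below this, a single resize is fine

def make_tiles(image: Image.Image):
    """Yield (x, y, tile) crops covering the image with overlap."""
    w, h = image.size
    if max(w, h) <= MAX_DIRECT_EDGE:
        yield (0, 0, image)          # small enough: no tiling needed
        return
    step = TILE_SIZE - OVERLAP
    for top in range(0, max(h - OVERLAP, 1), step):
        for left in range(0, max(w - OVERLAP, 1), step):
            box = (left, top, min(left + TILE_SIZE, w), min(top + TILE_SIZE, h))
            yield (left, top, image.crop(box))

def process_receipt(path: str, run_vlm_on_tile) -> list:
    """Run the VLM on each tile and collect per-tile results for merging.

    `run_vlm_on_tile` is a hypothetical callable wrapping your model.
    """
    image = Image.open(path)
    results = []
    for x, y, tile in make_tiles(image):
        fields = run_vlm_on_tile(tile)                 # hypothetical model call
        results.append({"offset": (x, y), "fields": fields})
    return results  # merge/deduplicate overlapping fields downstream
```

Merging per-tile results is domain-specific; for invoices you might deduplicate extracted fields that appear in overlapping regions and keep the highest-confidence copy.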
Video Token Explosion: a 10-minute video is roughly 9.8M tokens raw, about 800K tokens after frame deduplication, and about 40K tokens after additional compression.
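As a rough illustration of the deduplication stage in those numbers, here is a sketch of embedding-similarity frame filtering; `embed_frame` is a stand-in for a DINOv2-style image encoder returning a feature vector, and the 0.85 default threshold is simply the middle-ground value used in the examples later in this section.

```python
# Sketch of embedding-similarity frame deduplication: keep a frame only when it
# differs enough from the last frame we kept. `embed_frame` is an assumed
# encoder (e.g. a DINOv2-style model) returning a 1-D feature vector.
import numpy as np

def deduplicate_frames(frames, embed_frame, similarity_threshold: float = 0.85):
    """Return indices of frames to keep, dropping near-duplicates.

    A frame is dropped when its cosine similarity to the last kept frame is
    at or above the threshold: lower thresholds deduplicate more aggressively,
    higher thresholds keep more redundant frames.
    """
    kept_indices = []
    last_kept = None
    for i, frame in enumerate(frames):
        emb = embed_frame(frame)
        emb = emb / np.linalg.norm(emb)            # normalise for cosine similarity
        if last_kept is None:
            kept_indices.append(i)
            last_kept = emb
            continue
        similarity = float(np.dot(emb, last_kept))
        if similarity < similarity_threshold:      # frame differs enough: keep it
            kept_indices.append(i)
            last_kept = emb
    return kept_indices
```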
Alignment Failures and Domain Shift: The projector that maps visual embeddings into language space is trained on specific data distributions. A model trained on natural images (ImageNet, COCO) may hallucinate heavily on UI screenshots, scientific plots, or medical images because the visual patterns are out of distribution. This is subtle: the model does not crash or return errors. It confidently describes UI elements that don't exist or misreads chart axes by 10x. Systems like Qwen 2.5 VL and Molmo mitigate this by training on diverse synthetic data, including UI screenshots, flowcharts, and technical diagrams, and by exposing structured outputs like bounding boxes for grounding. Production systems need domain-specific validation: for medical imaging, check that detected anatomical structures are anatomically plausible; for UI analysis, verify that described UI elements have corresponding pixel regions with high attention scores.

Multimodal RAG Failures: Retrieval-augmented generation (RAG) for multimodal content has unique failure modes. Document screenshot embedding (one vector per page or passage) is fast but misses fine-grained details. A 5-page financial report with 12 charts and 8 tables embedded as 5 vectors loses the structure of which table corresponds to which text section. ColBERT-style token-level embeddings are more accurate (hundreds of vectors per page capture table cells, chart elements, and text spans), but a corpus of 1 million pages balloons into several hundred million vectors. Retrieval latency jumps from 50ms to 500ms, and memory requirements grow from 4GB to 400GB. The practical trade-off is two-stage retrieval: coarse page-level retrieval for initial filtering (top 100 pages), then fine-grained token-level reranking on those 100 candidates (a minimal sketch appears below). This keeps p99 latency under 200ms while preserving detail.

Safety and Moderation at Scale: Text safety filters miss visually unsafe content. An image of a harmful activity with a benign caption passes text-only filters. Open VLMs like Pixtral have no built-in moderation, so production deployments must add multimodal safety models. The challenge is false positives versus false negatives. Make the filter too strict and you block 5% of legitimate medical images or art references; make it too lenient and 0.1% of harmful content gets through. At 100 million requests per day, 0.1% is 100,000 unsafe responses. Production systems run dual safety checks: a fast, lightweight filter (99% recall, 2% false positive rate, 20ms latency) before generation, and a more accurate heavy filter (99.9% recall, 0.5% false positive rate, 100ms latency) after generation, with human review for borderline cases (sketched after the retrieval example below).
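A minimal sketch of the coarse-then-fine retrieval pattern described above, assuming a page-level vector index and precomputed ColBERT-style token embeddings; `page_index.search` and `load_token_embeddings` are hypothetical stand-ins for your own vector store and storage layer, and embeddings are assumed to be NumPy arrays.

```python
# Sketch of two-stage multimodal retrieval: cheap page-level vectors filter the
# corpus, then token-level late-interaction (MaxSim) scoring reranks the survivors.
def two_stage_retrieve(query_page_emb, query_token_embs, page_index,
                       load_token_embeddings, coarse_k: int = 100, final_k: int = 10):
    # Stage 1: coarse page-level filtering (one vector per page).
    candidate_page_ids = page_index.search(query_page_emb, top_k=coarse_k)

    # Stage 2: fine-grained late-interaction reranking on the candidates only.
    scored = []
    for page_id in candidate_page_ids:
        doc_token_embs = load_token_embeddings(page_id)   # shape: (num_doc_tokens, dim)
        # MaxSim: each query token matches its best document token; sum over query tokens.
        sim = query_token_embs @ doc_token_embs.T          # (num_query_tokens, num_doc_tokens)
        score = float(sim.max(axis=1).sum())
        scored.append((score, page_id))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [page_id for _, page_id in scored[:final_k]]
```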
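And a sketch of the dual safety-check flow: a fast pre-generation filter gates the request, a heavier post-generation filter screens the output, and borderline scores go to human review. The filter objects, method names, and thresholds are illustrative assumptions, not a particular moderation API.

```python
# Sketch of dual safety checks around generation. `fast_filter`, `heavy_filter`,
# `generate`, and `review_queue` are hypothetical components supplied by the caller.
def handle_request(request, fast_filter, heavy_filter, generate, review_queue,
                   fast_block_threshold: float = 0.5,
                   heavy_block_threshold: float = 0.8,
                   borderline_threshold: float = 0.5):
    # Pre-generation: ~20ms lightweight multimodal check on the request itself.
    if fast_filter.unsafe_score(request.image, request.text) >= fast_block_threshold:
        return {"status": "blocked", "stage": "pre_generation"}

    response = generate(request)

    # Post-generation: ~100ms heavier check on the full (input, output) pair.
    score = heavy_filter.unsafe_score(request.image, request.text, response)
    if score >= heavy_block_threshold:
        return {"status": "blocked", "stage": "post_generation"}
    if score >= borderline_threshold:
        review_queue.enqueue(request, response, score)  # human review for borderline cases

    return {"status": "ok", "response": response}
```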
❗ Remember: The failure mode that matters most is the one that scales with your traffic. A 0.1% alignment failure on UI screenshots is ignorable at 1,000 queries per day but causes 1,000 bad outputs per day at 1 million queries per day. Always calculate failure impact at peak scale, not average load.
💡 Key Takeaways
Silent degradation from image resizing: a 6000×4000 receipt resized to 896×896 makes small text unreadable, causing wrong outputs with no error. Fix with resolution-aware tiling or native-resolution encoders.
Video token explosion: a 10-minute video generates 9.8M tokens. Frame deduplication (with a tuned similarity threshold) plus compression reduces this to 40K tokens, fitting in context windows.
Alignment failures from domain shift: models trained on natural images hallucinate on UI screenshots or medical images. They need domain-specific validation and grounding with bounding boxes.
Safety at scale: a 0.1% false negative rate means 100K unsafe outputs daily at 100M requests/day. A dual-filter approach (fast 20ms pre-check, heavy 100ms post-check) balances latency and safety.
📌 Examples
1. Invoice processing: blind resizing to 896×896 causes 15% field-extraction errors on high-resolution receipts. Switching to a 4-tile approach (2×2 grid) recovers accuracy to 98% at 3x token cost.
2. Video analysis: a 5-minute tutorial generates 1.2M tokens unprocessed. DINOv2 deduplication at a 0.85 similarity threshold keeps 200 diverse frames (163K tokens), which are then compressed to 8K tokens.
3. Medical imaging: a VLM trained on natural images describes nonexistent lesions in X-rays (8% hallucination rate). Fine-tuning on a radiology dataset plus anatomical plausibility checks reduces this to 0.5%.