VLM Failure Modes and Edge Cases at Scale
Input Corruption and Silent Degradation
Multimodal systems have more failure modes than text-only LLMs because every modality introduces new edge cases. Images arrive corrupted (truncated downloads, encoding errors), at extremely high resolution (24-megapixel phone photos when the model expects 896 by 896), or with multiple documents per page (scanned multi-page receipts sent as a single image). The danger is silent degradation. If your preprocessing blindly resizes a 6000 by 4000 pixel receipt to 896 by 896, small text becomes unreadable, and the VLM returns plausible but wrong invoice amounts because it literally cannot see the digits. There is no error, just bad output. The fix is resolution-aware tiling: split high-resolution inputs into overlapping tiles (for example, 1024 by 1024 with 128 pixels of overlap), process each tile separately, then merge the results. This increases token count by 4x to 9x but preserves fine detail. Alternatively, use native-resolution encoders like those in Pixtral, which accept arbitrary image sizes but require more sophisticated attention mechanisms.
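The tiling step is pure geometry and can be sketched independently of any model. The function below is a minimal illustration; the tile size and overlap are the example values from above, not parameters required by any particular VLM:

```python
def tile_boxes(width, height, tile=1024, overlap=128):
    """Compute overlapping tile boxes (left, top, right, bottom) covering a
    width x height image. Each crop is sent to the VLM separately and the
    per-tile results are merged downstream."""
    stride = tile - overlap
    boxes = []
    for top in range(0, max(height - overlap, 1), stride):
        for left in range(0, max(width - overlap, 1), stride):
            boxes.append((left, top,
                          min(left + tile, width),
                          min(top + tile, height)))
    return boxes

# A 6000 by 4000 receipt becomes a 7 x 5 grid of 35 tiles instead of one
# unreadable 896 by 896 downscale.
```

Edge tiles are clipped to the image boundary rather than padded; the 128-pixel overlap ensures text spanning a tile border appears whole in at least one tile.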
Video Processing Edge Cases
Video introduces temporal failures. Users upload 10-minute screen recordings when your system expects 30-second clips. At 4 FPS, that is 2,400 frames times 4,096 tokens per frame, roughly 9.8 million tokens, far exceeding any context window. Frame deduplication helps but creates new issues. Simple frame differencing over-compresses slow animations and gradual transitions (a 60-second fade collapses to one frame). Embedding similarity with DINOv2 works better, but the similarity threshold needs tuning: set it too low (say 0.7) and merging is too aggressive, so you miss subtle changes; set it too high (say 0.95) and you keep too many redundant frames.
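A minimal dedup sketch, assuming you already have per-frame embeddings (from DINOv2 or any other encoder). Comparing each frame against the last kept frame, rather than its immediate predecessor, is what keeps a slow fade from slipping through one tiny step at a time:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def dedupe_frames(embeddings, threshold=0.9):
    """Return indices of frames to keep. A frame survives only if it is
    sufficiently dissimilar to the most recently KEPT frame, so gradual
    drift eventually accumulates past the threshold and forces a keep."""
    kept = []
    for i, emb in enumerate(embeddings):
        if not kept or cosine(embeddings[kept[-1]], emb) < threshold:
            kept.append(i)
    return kept
```

The 0.9 threshold is illustrative; in practice you would tune it on held-out videos against a token budget.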
Alignment Failures and Domain Shift
The projector that maps visual embeddings into language space is trained on specific data distributions. A model trained on natural images (ImageNet, COCO) may hallucinate heavily on UI screenshots, scientific plots, or medical images because those visual patterns are out of distribution. This failure is subtle: the model does not crash or return errors. It confidently describes UI elements that don't exist or misreads chart axes by 10x. Systems like Qwen 2.5 VL and Molmo mitigate this by training on diverse synthetic data, including UI screenshots, flowcharts, and technical diagrams, and by exposing structured outputs like bounding boxes for grounding. Production systems need domain-specific validation: for medical imaging, check that detected anatomical structures are anatomically plausible; for UI analysis, verify that described UI elements have corresponding pixel regions with high attention scores.
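One cheap, model-agnostic slice of that validation is a geometric sanity gate on predicted bounding boxes: a described element whose box lies outside the image, is inverted, or covers a near-zero area is a strong hallucination signal. The function below is a hypothetical sketch of such a gate, not part of any model's API, and `min_area_frac` is an assumed tunable:

```python
def plausible_boxes(boxes, width, height, min_area_frac=1e-4):
    """Flag which predicted (x0, y0, x1, y1) pixel boxes are geometrically
    plausible: inside the image, correctly ordered, and not vanishingly
    small relative to the image area."""
    verdicts = []
    for x0, y0, x1, y1 in boxes:
        inside = 0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height
        big_enough = (x1 - x0) * (y1 - y0) >= min_area_frac * width * height
        verdicts.append(inside and big_enough)
    return verdicts
```

A box failing this gate does not prove a hallucination, but it is a cheap first filter before heavier checks such as cross-referencing attention maps.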
Multimodal RAG Failures
Retrieval-Augmented Generation (RAG) for multimodal content has unique failure modes. Document screenshot embedding (one vector per page or passage) is fast but misses fine-grained details. A 5-page financial report with 12 charts and 8 tables embedded as 5 vectors loses the structure of which table corresponds to which text section. ColBERT-style token-level embeddings are more accurate (hundreds of vectors per page capture table cells, chart elements, and text spans), but a corpus of 1 million pages balloons to hundreds of millions of vectors. Retrieval latency jumps from 50ms to 500ms, and memory requirements go from 4GB to 400GB. The practical trade-off: use coarse page-level retrieval for initial filtering (top 100 pages), then fine-grained token-level reranking on those 100 candidates. This keeps p99 latency under 200ms while preserving detail.
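The two-stage scheme can be sketched with plain dot products. `maxsim` below is the ColBERT-style late-interaction score; the page and token vectors are assumed to come from whatever embedder you already use, and in production the coarse stage would run against an ANN index rather than a Python sort:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_toks, doc_toks):
    """ColBERT-style late interaction: each query token is matched to its
    best-scoring document token, and those maxima are summed."""
    return sum(max(dot(q, d) for d in doc_toks) for q in query_toks)

def two_stage_retrieve(query_vec, query_toks, pages, top_pages=100, top_final=10):
    """pages: list of (page_vector, token_vectors). Coarse page-level
    filtering first, then fine-grained token-level reranking on survivors."""
    coarse = sorted(range(len(pages)),
                    key=lambda i: -dot(query_vec, pages[i][0]))[:top_pages]
    return sorted(coarse, key=lambda i: -maxsim(query_toks, pages[i][1]))[:top_final]
```

The expensive `maxsim` runs only on the 100 coarse candidates, which is why the scheme keeps latency bounded while recovering table-cell-level detail.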
Safety and Moderation at Scale
Text safety filters miss unsafe visual content: an image of a harmful activity with a benign caption passes text-only filters. Open VLMs like Pixtral have no built-in moderation, so production deployments must add multimodal safety models. The challenge is false positives versus false negatives. Tune the filter too aggressively and you block 5% of legitimate medical images or art references; tune it too leniently and 0.1% of harmful content gets through. At 100 million requests per day, 0.1% is 100,000 unsafe responses. Production systems run dual safety checks: a fast, lightweight filter (99% recall, 2% false positive rate, 20ms latency) before generation, and a more accurate heavy filter (99.9% recall, 0.5% false positive rate, 100ms latency) after generation, with human review for borderline cases.
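The dual-check flow reduces to a small pipeline. In this sketch the filters and generator are stand-in callables and the verdict strings are assumptions, not a real moderation API:

```python
def moderate(request, fast_filter, generate, heavy_filter):
    """Two-stage moderation: a cheap pre-generation gate catches the bulk of
    unsafe inputs, a slower post-generation check catches the rest, and
    borderline outputs are held for human review instead of auto-served."""
    if fast_filter(request) == "unsafe":
        return {"status": "blocked_pre"}  # never reaches the VLM
    response = generate(request)
    verdict = heavy_filter(request, response)
    if verdict == "unsafe":
        return {"status": "blocked_post"}
    if verdict == "borderline":
        return {"status": "held_for_review", "response": response}
    return {"status": "ok", "response": response}
```

Ordering matters: the fast filter shields the expensive generation step from obvious abuse, while the heavy filter judges the actual output, which the pre-filter never saw.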