
Long Context VLMs: Handling Documents and Extended Video

The Context Length Problem: A typical conversation with a text-only LLM uses 2,000 to 5,000 tokens. Add one image (4,096 tokens) and you double or triple the context length. Add a 20-page document (80,000 tokens of screenshots plus text) or a 5-minute video (200,000+ tokens), and suddenly you need context windows of 256k to 1 million tokens. This is not just a matter of the model architecture supporting long contexts: computational cost scales quadratically with sequence length in standard attention mechanisms. Processing 256k tokens requires 65 billion attention computations (256k²) compared to 25 million for 5k tokens, a 2,600x increase.

Architectural Solutions: Modern long-context VLMs like Qwen3 VL (256k native, 1M extended) use sparse attention patterns. Instead of every token attending to every other token (O(n²) complexity), they use sliding window attention (each token attends to its nearest 4,096 neighbors) plus global attention on key frames or document sections. This reduces complexity to O(n × window size), making 256k context computationally feasible. The tradeoff: you lose some long-range dependencies. A reference on page 5 of a document to a chart on page 75 might be missed if the two fall outside each other's attention windows. FlashAttention and other kernel optimizations reduce memory bandwidth bottlenecks, achieving 2x to 3x speedups, but the fundamental quadratic scaling remains for full attention.
[Figure: Attention Computation Scaling — approximate pairwise attention computations by context length: 5k tokens ≈ 25M, 64k tokens ≈ 4B, 256k tokens ≈ 65B]
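To make the scaling concrete, here is a back-of-the-envelope cost model comparing full attention against a sliding-window-plus-global pattern. The 4,096-token window matches the figure quoted above; the number of global tokens is an illustrative assumption, not a property of any particular model.

```python
def full_attention_pairs(n: int) -> int:
    # Full self-attention: every token attends to every other token -> n^2 pairs.
    return n * n

def sparse_attention_pairs(n: int, window: int = 4096, num_global: int = 256) -> int:
    # Sliding window: each token attends to its `window` nearest neighbors,
    # plus a small set of global tokens (key frames, section headers) that
    # attend to everything and are attended to by everything.
    return n * window + 2 * n * num_global

for n in (5_000, 64_000, 256_000):
    full, sparse = full_attention_pairs(n), sparse_attention_pairs(n)
    print(f"{n:>7} tokens: full {full:.1e} vs sparse {sparse:.1e} "
          f"({full / sparse:.0f}x fewer pairs)")
```

At 5k tokens the two are nearly identical, but at 256k tokens the sparse pattern does dozens of times fewer pairwise computations, which is exactly where the quadratic curve would otherwise dominate.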
Compression Strategies: Aggressive visual compression is mandatory for long context. DeepSeek OCR's 20x compression turns a 50-page document from 200k visual tokens into 10k compressed tokens. LongVU's frame deduplication keeps video tokens manageable by eliminating 60% to 80% of redundant frames using self-supervised embeddings from DINOv2 (a similarity-filter sketch follows below). The decision matrix: for safety-critical applications (legal contracts, medical records), use conservative compression (5x to 10x) to preserve all details. For cost-sensitive, high-volume workloads (customer support ticket screenshots), use aggressive compression (15x to 20x) and accept 2% to 3% detail loss.

Hierarchical Processing: Some systems use two-pass approaches (sketched below). First pass: process the entire document with a small, fast model that generates a summary and identifies key sections. Second pass: process only the identified key sections with a large, high-quality model. This cuts compute cost by 70% to 90% for long documents where most content is not relevant to the query. For example, consider analyzing a 100-page financial report to answer "What was Q3 revenue?" The first pass with a 4B model identifies that pages 23 to 27 contain Q3 data. The second pass processes only those 5 pages with a 70B model, using 95% less compute than processing all 100 pages.

Memory and Serving Constraints: A 70B model processing 256k tokens requires approximately 180GB of GPU memory for the key-value (KV) cache, which grows linearly with token count, layer count, and per-layer key/value dimension at 2 bytes per float16 value (the exact figure depends heavily on the model's attention configuration). This exceeds single-GPU capacity, requiring tensor parallelism across 4 to 8 GPUs. Batch size is severely constrained: a single H100 80GB can batch maybe 2 to 4 long-context requests simultaneously, compared to 32 to 64 short-context requests. This limits throughput and increases per-query cost by 10x to 15x.
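LongVU-style frame deduplication can be sketched as a similarity filter over per-frame embeddings. This is a minimal sketch under assumptions: the 0.9 cosine-similarity threshold and the compare-to-last-kept-frame rule are illustrative, not LongVU's exact procedure.

```python
import numpy as np

def deduplicate_frames(frame_embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Keep a frame only if it differs enough from the last kept frame.

    frame_embeddings: (num_frames, dim) array of per-frame features
    (e.g. DINOv2 embeddings), assumed L2-normalized so that the dot
    product is cosine similarity.
    """
    kept = [0]  # always keep the first frame
    for i in range(1, len(frame_embeddings)):
        similarity = float(frame_embeddings[i] @ frame_embeddings[kept[-1]])
        if similarity < threshold:  # sufficiently novel -> keep it
            kept.append(i)
    return kept
```

On largely static footage, a filter like this routinely drops the majority of frames, which is where the 60% to 80% reduction figure comes from.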
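The two-pass pattern itself is mostly orchestration. In the sketch below, `small_model.locate_sections` and `large_model.answer` are hypothetical placeholders for whatever models and prompts you actually deploy; the point is the control flow, not a specific API.

```python
def answer_over_long_document(query: str, pages: list[str],
                              small_model, large_model) -> str:
    # Pass 1: a small, cheap model scans every page and returns the indices
    # of pages likely to contain the answer (hypothetical placeholder method).
    relevant = small_model.locate_sections(query, pages)  # e.g. [23, 24, 25, 26, 27]

    # Pass 2: a large, high-quality model reads only those pages.
    context = "\n\n".join(pages[i] for i in relevant)
    return large_model.answer(query, context)             # hypothetical placeholder
```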
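A rough KV-cache estimator makes the serving math explicit. The 80-layer, 128-head-dimension configuration below is an illustrative assumption for a 70B-class model; whether the model caches keys and values for every attention head or uses grouped-query attention changes the result by almost an order of magnitude, which is why published per-request figures vary so widely.

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    # Each layer stores one key and one value vector per KV head per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9

# Hypothetical 70B-class model: 80 layers, head_dim 128, float16 cache.
for kv_heads, label in [(64, "full multi-head attention"),
                        (8, "grouped-query attention")]:
    gb = kv_cache_gb(256_000, layers=80, kv_heads=kv_heads, head_dim=128)
    print(f"256k-token KV cache with {label}: ~{gb:.0f} GB per request")
```

Either way, a handful of long-context requests exhausts an 80GB GPU, which is why long-context batch sizes collapse relative to short-context serving.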
"Long context is not a feature you add for free. Every 10x increase in context length increases serving cost by 5x to 10x. Build your system to avoid needing long context whenever possible."
When to Actually Use Long Context: Do NOT use 256k context for everything. Most queries need under 8k tokens. Reserve long context for genuine use cases: comprehensive document analysis where the answer requires synthesizing information across many pages, extended video analysis of multi-step processes, or conversational agents maintaining multi-hour interaction history. For point queries ("What is the total on this invoice?"), extract and process only the relevant section. For aggregate queries ("Summarize all Q3 financial metrics"), use long context. The architectural pattern: route based on query type, not document size.
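A minimal routing sketch along these lines, assuming an upstream classifier has already labeled the query as point, aggregate, or conversational (the token budgets are illustrative, not recommendations):

```python
def route(query_type: str, document_tokens: int) -> dict:
    """Choose a processing strategy from the query type, not the document size.

    `query_type` is assumed to come from an upstream classifier:
    "point", "aggregate", or "conversational".
    """
    if query_type == "point":
        # "What is the total on this invoice?" -> find and process only the
        # relevant section; stay within a short context.
        return {"strategy": "extract_section", "max_context": 8_000}
    if query_type == "aggregate":
        # "Summarize all Q3 financial metrics" -> genuinely needs long
        # context over the (compressed) full document.
        return {"strategy": "long_context",
                "max_context": min(document_tokens, 256_000)}
    # Conversational: sliding window over recent turns plus a compressed
    # summary of older history.
    return {"strategy": "windowed_history", "max_context": 32_000}
```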
💡 Key Takeaways
Context length scaling is quadratic: 256k tokens require 65 billion attention computations vs 25 million for 5k tokens, a 2,600x increase
Sparse attention (sliding window + global attention) reduces complexity from O(n²) to O(n × window) but may miss long-range dependencies across distant pages
A 70B model with 256k context needs roughly 180GB of GPU memory for the KV cache, exceeding a single GPU and constraining batch size to 2 to 4 requests (vs 32 to 64 for short context)
Hierarchical two-pass processing (a 4B model finds relevant sections, a 70B model analyzes them) cuts compute cost by 70% to 90% for long documents with localized queries
📌 Examples
1. Legal contract review: a 200-page agreement generates 800k tokens uncompressed. DeepSeek OCR compresses this to 40k tokens. Sparse attention processes it in 8 seconds vs 45 seconds for full attention.
2. Financial report Q&A: 100-page report, query is 'Q3 revenue'. First pass (4B model, 2s) identifies pages 23 to 27. Second pass (70B model, 3s) processes 5 pages. Total 5s vs 50s for the full document.
3. Multi-hour support chat: a 3-hour conversation with 50 screenshots generates 220k tokens. Sliding window attention (4k window) keeps recent context sharp; older context is compressed to key points.