
VLM Processing Pipeline: From Pixels to Tokens

The Token Budget Challenge: When you feed an image into a VLM, it doesn't stay as pixels. A single 896 by 896 image gets divided into patches (typically 14 by 14 pixels each), creating a 64 by 64 grid of 4,096 patches. Each patch becomes a visual token. For a 30 second video sampled at 4 frames per second (FPS), that is 120 frames at 4,096 tokens each, or 491,520 visual tokens before you even add the text prompt. This is why token compression matters. Systems like DeepSeek OCR achieve 10x to 20x compression, reducing 4,096 tokens per image down to 200 to 400 tokens while preserving critical details like small text in receipts or tables in documents.
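The arithmetic is simple enough to keep as a helper. Here is a minimal back-of-the-envelope sketch using the resolution, patch size, and frame rate quoted above; the function names are illustrative, not part of any library.

```python
# Back-of-the-envelope visual token budget (numbers from the text above).
def image_tokens(image_size: int = 896, patch_size: int = 14) -> int:
    """Tokens for one square image: one token per patch."""
    patches_per_side = image_size // patch_size   # 896 / 14 = 64
    return patches_per_side ** 2                  # 64 * 64 = 4,096

def video_tokens(duration_s: float, fps: float = 4.0,
                 tokens_per_frame: int = image_tokens()) -> int:
    """Tokens for a video sampled at a fixed frame rate, before any compression."""
    frames = int(duration_s * fps)                # 30 s * 4 FPS = 120 frames
    return frames * tokens_per_frame              # 120 * 4,096 = 491,520

print(image_tokens())     # 4096
print(video_tokens(30))   # 491520
```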
1. Preprocessing: Images are normalized and resized to the model's expected resolution (896x896 for some models, native resolution for others like Pixtral). Video is sampled at 1 to 4 FPS. PDFs are converted to page screenshots plus extracted text.
2. Vision Encoding: Encoders like CLIP, SigLIP, or DINOv2 convert patches into embeddings. This takes 100 to 300ms at p50 on modern GPUs for moderate sized images.
3. Projection: Visual embeddings are mapped into the language model's token space through learned projection layers. This alignment is critical; poor projection causes hallucinations on new domains.
4. Decoding: The language model processes the fused sequence of text tokens and visual embeddings. Mixture of Experts (MoE) decoders activate only a fraction of parameters per token for better efficiency. This takes 300 to 800ms at p50. (A minimal code sketch of these four stages follows this list.)
Real Numbers That Matter: A single NVIDIA A100 40GB GPU can handle 10 to 30 queries per second (QPS) for medium sized VLM requests, assuming 1,000 to 2,000 total tokens per request and 1 to 4 images with batching. DeepSeek OCR specifically achieves approximately 2,500 tokens per second throughput on a single A100, making it useful as a preprocessing step to compress visual content before the main VLM.
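For capacity planning, those quoted numbers can be turned into a quick estimate. The midpoints below, and the interpretation of the OCR throughput figure as emitted (compressed) tokens, are assumptions rather than measurements.

```python
# Rough capacity math using the ballpark figures quoted above.
TOKENS_PER_REQUEST = 1_500         # midpoint of the 1,000 to 2,000 token range
QPS_RANGE = (10, 30)               # medium sized VLM requests on one A100 with batching
OCR_TOKENS_PER_SECOND = 2_500      # DeepSeek OCR throughput on one A100
TOKENS_PER_COMPRESSED_IMAGE = 300  # midpoint of the 200 to 400 range

# Sustained token load the VLM must absorb at the quoted QPS range.
for qps in QPS_RANGE:
    print(f"{qps} QPS -> ~{qps * TOKENS_PER_REQUEST:,} tokens/s through the VLM")

# Assuming the OCR figure counts emitted (compressed) tokens, one A100
# preprocessor keeps up with roughly this many images per second:
print(f"~{OCR_TOKENS_PER_SECOND / TOKENS_PER_COMPRESSED_IMAGE:.0f} images/s compressed")
```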
Single image token count: 4,096 uncompressed vs. 200-400 compressed.
⚠️ Common Pitfall: Blindly resizing all images to a fixed resolution like 896x896 can make small text or fine details unreadable, silently degrading OCR and reasoning accuracy. For documents with dense text, consider tiling or using native resolution encoders.
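One way to avoid that degradation is to tile the page instead of downscaling it. Below is a minimal tiling sketch assuming the encoder accepts 896x896 inputs; the overlap value and the lack of any tile-selection heuristic are simplifications.

```python
from PIL import Image

def tile_document(path: str, tile: int = 896, overlap: int = 64) -> list[Image.Image]:
    """Split a high-resolution page into encoder-sized tiles instead of resizing it,
    so small text stays legible. Tile size and overlap are illustrative defaults."""
    page = Image.open(path).convert("RGB")
    w, h = page.size
    stride = tile - overlap
    tiles = []
    for top in range(0, max(h - overlap, 1), stride):
        for left in range(0, max(w - overlap, 1), stride):
            box = (left, top, min(left + tile, w), min(top + tile, h))
            tiles.append(page.crop(box))
    return tiles

# Each full tile is encoded separately (4,096 tokens), so budget accordingly:
# a page that splits into 6 tiles costs roughly 6x a single resized image.
```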
💡 Key Takeaways
A single 896x896 image generates 4,096 visual tokens (64x64 patches). A 30 second video at 4 FPS creates 491,520 tokens before text prompts.
Compression techniques like DeepSeek OCR reduce tokens by 10x to 20x (from 4,096 down to 200 to 400 per image), which is critical for long context windows and cost management.
End to end latency budget: 50 to 150ms preprocessing, 100 to 300ms encoding, 300 to 800ms decoding at p50 on modern GPUs
Single A100 40GB handles 10 to 30 QPS for medium VLM requests with batching; specialized OCR models hit 2,500 tokens/second for preprocessing
📌 Examples
1. Document processing: An 80 page PDF becomes 80 screenshots. Uncompressed, that is 327,680 tokens (80 × 4,096); compressed to 16,000 to 32,000 tokens, it fits in a context window.
2. Video analysis: A 30 second clip at 4 FPS with frame deduplication keeps 40 diverse frames (120 reduced to 40), cutting tokens from 491,520 to 163,840 (see the deduplication sketch after this list).
3. Invoice extraction: A DeepSeek OCR preprocessor compresses an invoice image to 300 tokens, then feeds it to Qwen3 VL for field extraction, reducing total latency from 1.2s to 600ms.
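The frame deduplication in example 2 can be as simple as dropping frames that barely differ from the last kept one. The sketch below assumes frames arrive as same-sized RGB arrays and uses an illustrative mean-absolute-difference threshold; production systems often use perceptual hashes or embedding similarity instead.

```python
import numpy as np

def dedup_frames(frames: list[np.ndarray], threshold: float = 0.05) -> list[np.ndarray]:
    """Keep only frames that differ enough from the last kept frame.
    `frames` are HxWx3 uint8 arrays sampled at 4 FPS; the threshold is illustrative."""
    kept: list[np.ndarray] = []
    last = None
    for frame in frames:
        gray = frame.astype(np.float32).mean(axis=2) / 255.0  # grayscale in [0, 1]
        if last is None or np.abs(gray - last).mean() > threshold:
            kept.append(frame)
            last = gray
        # else: near-duplicate of the previous kept frame, drop it
    return kept

# 120 sampled frames reduced to ~40 diverse ones cuts visual tokens
# from 491,520 to 163,840, matching the example above.
```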