LLM Multi-Model Serving: Gateway Pattern and VRAM Constraints
Why LLMs Are Different
LLMs require a fundamentally different multi-model approach due to massive memory footprints and KV cache growth during generation. A 7B-parameter model needs roughly 14GB of VRAM for weights alone in FP16 (2 bytes per parameter), plus additional gigabytes for the KV cache, which grows with sequence length and batch size. This makes on-demand loading impractical: swapping a multi-gigabyte model in and out of VRAM takes 5 to 30 seconds and destroys throughput.
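The arithmetic can be sketched as a back-of-envelope estimator. The model dimensions below (32 layers, 32 KV heads, head dim 128) are assumed Llama-7B-like values, not figures from the text; real models vary, and grouped-query attention shrinks the KV term considerably.

```python
def vram_estimate_gb(params_b, n_layers, n_kv_heads, head_dim,
                     seq_len, batch_size, bytes_per_elem=2):
    """Rough FP16 VRAM estimate: weights plus KV cache.

    Weights cost 2 bytes per parameter in FP16. The KV cache stores
    two tensors (K and V) per layer for every token of every sequence.
    """
    weights = params_b * 1e9 * bytes_per_elem
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    kv_cache = kv_per_token * seq_len * batch_size
    return (weights + kv_cache) / 1e9

# 7B weights alone in FP16: ~14 GB
print(round(vram_estimate_gb(7, 32, 32, 128, seq_len=0, batch_size=0), 1))
# Add a batch of 8 sequences at 2048 tokens and the KV cache alone
# contributes several more GB on top of the weights.
print(round(vram_estimate_gb(7, 32, 32, 128, seq_len=2048, batch_size=8), 1))
```

With these assumed dimensions, each token costs about 0.5MB of KV cache, which is why long sequences dominate the budget so quickly.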
The Gateway Pattern
The dominant production pattern is gateway-level aggregation: each large model runs on dedicated GPU resources (one model per GPU or node), and a lightweight reverse proxy exposes a single external endpoint that routes to per-model backends. A team serving 10 different 7B-to-13B models deploys 10 separate GPU instances (each running one model with vLLM or TensorRT-LLM), fronted by an nginx or Envoy gateway that routes based on model ID. The gateway adds negligible overhead (under 1ms) while providing centralized authentication, rate limiting, and failover.
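The routing core of such a gateway is a static map from model ID to backend. A minimal sketch, with illustrative model names and backend URLs (all assumptions, not from the text):

```python
# Each model ID maps to the one backend instance dedicated to it.
BACKENDS = {
    "llama-7b":   "http://gpu-node-0:8000",
    "mistral-7b": "http://gpu-node-1:8000",
    "llama-13b":  "http://gpu-node-2:8000",
}

def route(request: dict) -> str:
    """Return the backend URL for a request's model field, the way an
    OpenAI-style gateway resolves the target before proxying the call."""
    model = request.get("model")
    if model not in BACKENDS:
        raise ValueError(f"unknown model: {model}")
    return BACKENDS[model]

print(route({"model": "mistral-7b", "prompt": "hi"}))
# prints "http://gpu-node-1:8000"
```

In production this lookup lives in nginx or Envoy configuration rather than application code, with auth and rate limiting applied before the proxy step, but the decision itself is just this table lookup.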
Throughput and Latency
Per-GPU throughput for LLMs is measured in tokens per second. A 7B model on a single 40GB A100 typically sustains 100 to 300 tokens/s aggregate throughput across concurrent requests, depending on batch size, sequence length, and KV cache optimization (techniques like paged attention). Per-request latency is dominated by output length: generating 100 tokens at 50 tokens/s takes 2 seconds, plus initial prompt processing (typically 50 to 200ms for 1000-token prompts). Trying to fit two 7B models on one 40GB GPU usually violates SLOs because VRAM pressure limits effective batch size.
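The latency arithmetic above is simple enough to write down directly: prefill time plus one decode step per output token.

```python
def request_latency_s(prompt_ms: float, output_tokens: int,
                      decode_tok_s: float) -> float:
    """End-to-end request latency: prompt (prefill) processing plus
    decode time. Decode dominates, since each output token requires
    one sequential forward pass."""
    return prompt_ms / 1000 + output_tokens / decode_tok_s

# The example from the text: 100 tokens at 50 tok/s, with 100 ms prefill
print(request_latency_s(100, 100, 50))  # 2.1 seconds
```

The practical consequence: output length, not prompt length, is the lever that controls tail latency, which is why max-token limits matter so much below.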
Sequence Length Spikes
This is the critical failure mode: if a user sends a request with a 4000-token output limit, the KV cache for that sequence can consume 2 to 4GB, cutting concurrency from 16 requests to 4 and causing OOMs or latency cliffs for other requests. Production systems mitigate this with strict max-token limits (512 or 1024 output tokens), budget-aware admission control that tracks allocated KV memory, and paged KV caching (used by vLLM) to reduce fragmentation.
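Budget-aware admission control can be sketched in a few lines: reserve worst-case KV memory up front and reject (or queue) anything that does not fit. The 0.5MB-per-token cost below is an assumed Llama-7B-like figure, not from the text.

```python
class KVBudgetAdmission:
    """Sketch of budget-aware admission control: a request is admitted
    only if its worst-case KV cache fits in the remaining budget."""

    def __init__(self, budget_bytes: int, kv_bytes_per_token: int = 524_288):
        self.budget = budget_bytes
        self.allocated = 0
        self.per_token = kv_bytes_per_token

    def try_admit(self, prompt_tokens: int, max_output_tokens: int) -> bool:
        # Reserve for the worst case: the full prompt plus max output.
        need = (prompt_tokens + max_output_tokens) * self.per_token
        if self.allocated + need > self.budget:
            return False  # reject or queue instead of risking OOM
        self.allocated += need
        return True

    def release(self, prompt_tokens: int, output_tokens: int) -> None:
        self.allocated -= (prompt_tokens + output_tokens) * self.per_token

# With an 8 GB KV budget, a 4000-token output limit (~2.6 GB worst case
# per request) lets only a handful of requests run concurrently.
ctl = KVBudgetAdmission(8 * 1024**3)
admitted = sum(ctl.try_admit(1000, 4000) for _ in range(16))
print(admitted)  # 3
```

Systems like vLLM avoid the pessimism of worst-case reservation by paging the KV cache and allocating blocks lazily, but some admission gate is still needed to decide when the queue, rather than the GPU, absorbs a spike.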