
LLM Multi-Model Serving: Gateway Pattern and VRAM Constraints

Large language models (LLMs) require a fundamentally different multi-model approach than traditional models due to their massive memory footprints and the growth of the Key-Value (KV) cache during generation. A 7 billion parameter model needs 6 to 8 gigabytes of VRAM just for weights (in 16-bit precision), plus additional gigabytes for the KV cache, which grows with sequence length and batch size. This makes on-demand loading impractical: swapping a multi-gigabyte model in and out of VRAM takes 5 to 30 seconds and destroys throughput.

The dominant production pattern is gateway-level aggregation: each large model runs on dedicated GPU resources (one model per GPU or node), and a lightweight reverse proxy or service mesh exposes a single external endpoint that routes to per-model backends. For example, a team serving 10 different 7B to 13B models deploys 10 separate GPU instances (each running one model with vLLM or TensorRT-LLM), fronted by an nginx or Envoy gateway that routes based on the model ID in the request. The gateway adds negligible overhead (under 1 millisecond) while providing centralized authentication, rate limiting, and failover.

Per-GPU throughput for LLMs is measured in tokens per second. A 7B model on a single 40GB A100 typically sustains 100 to 300 tokens/s of aggregate throughput across concurrent requests, depending on batch size, sequence length, and KV cache optimization (techniques like paged attention). Per-request latency is dominated by output length: generating 100 tokens at 50 tokens/s takes 2 seconds, plus initial prompt processing (typically 50 to 200ms for 1000-token prompts). Trying to fit two 7B models on one 40GB GPU usually violates Service Level Objectives (SLOs) because VRAM pressure limits the effective batch size, the KV cache thrashes, and throughput drops below 150 tokens/s total.

The critical failure mode is sequence length spikes. If a user sends a request with a 4000-token output limit, the KV cache for that sequence can consume 2 to 4 gigabytes, reducing the number of concurrent requests the GPU can handle from 16 to 4 and causing out-of-memory errors or latency cliffs for other requests. Production systems mitigate this with strict max token limits (for example, 512 or 1024 output tokens), budget-aware admission control that tracks allocated KV memory before accepting new requests, and paged KV caching (used by vLLM), which allows non-contiguous memory allocation and reduces fragmentation.
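To make the routing concrete, here is a minimal sketch of a gateway in Python (FastAPI plus httpx) that dispatches requests to per-model backends by the model ID in the request body. The backend hostnames, the /v1/completions path, and the request schema are illustrative assumptions, not any specific vendor's API; in production the gateway would more likely be nginx or Envoy as described above.

```python
# Minimal sketch of gateway-level aggregation: one external endpoint routing
# to per-model backends by model ID. Hostnames, path, and schema are
# assumptions for illustration only.
from fastapi import FastAPI, HTTPException, Request
import httpx

app = FastAPI()

# One dedicated backend (a GPU instance running vLLM or TensorRT-LLM) per model.
MODEL_BACKENDS = {
    "llama-7b-chat":   "http://llm-backend-0:8000",   # hypothetical hosts
    "mistral-7b-inst": "http://llm-backend-1:8000",
    "codellama-13b":   "http://llm-backend-2:8000",
}

@app.post("/v1/completions")
async def route(request: Request):
    body = await request.json()
    backend = MODEL_BACKENDS.get(body.get("model"))
    if backend is None:
        raise HTTPException(status_code=404, detail="unknown model")
    # Centralized auth, rate limiting, and failover would sit here.
    async with httpx.AsyncClient(timeout=120.0) as client:
        resp = await client.post(f"{backend}/v1/completions", json=body)
    return resp.json()
```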
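The memory and latency figures above follow from simple arithmetic. The sketch below estimates per-sequence KV cache size for a hypothetical 7B-class model (32 layers, 32 KV heads, head dimension 128, fp16 values); real architectures vary, and grouped-query attention shrinks these numbers considerably.

```python
# Back-of-the-envelope KV cache sizing; the layer/head/dim values are an
# assumption for a generic 7B-class model, not a specific checkpoint.
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_val=2):
    # factor of 2 = keys + values, stored per layer, per head, per token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_val

print(kv_cache_bytes(4000) / 1e9)  # ~2.1 GB for one 4000-token sequence
print(kv_cache_bytes(512) / 1e9)   # ~0.27 GB with a 512-token output cap

# Per-request latency is dominated by output length:
out_tokens, tok_per_s = 100, 50
print(out_tokens / tok_per_s)      # 2.0 s of generation, plus ~50-200 ms of prefill
```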
💡 Key Takeaways
7 billion parameter LLMs require 6 to 8 gigabytes of VRAM for weights alone; on-demand loading takes 5 to 30 seconds and destroys throughput, making per-GPU isolation necessary
The gateway pattern deploys each model on a dedicated GPU behind a reverse proxy; the gateway adds under 1ms of overhead while providing centralized auth and rate limiting
A single 40GB A100 running a 7B model sustains 100 to 300 tokens per second of aggregate throughput; per-request latency is output length divided by generation speed (100 tokens at 50 tok/s equals 2 seconds)
Fitting two 7B models on one 40GB GPU usually fails: VRAM pressure limits batch size, the KV cache thrashes, and total throughput drops below 150 tokens/s, violating SLOs
Sequence length spikes are the critical failure mode: a 4000-token output can consume 2 to 4GB of KV cache, reducing concurrency from 16 to 4 requests and causing OOM; mitigate with strict max token limits, budget-aware admission control, and paged KV caching
📌 Examples
OpenAI-style deployment serving 5 LLM variants: each model on a separate GPU cluster (GPT-3.5-turbo on 50 A100s, GPT-4 on 200 A100s); a single api.openai.com endpoint routes by the model parameter
Anthropic Claude serving multiple model sizes: claude-instant (7B class) at 250 tokens/s per GPU, claude-2 (70B+ class) at 30 tokens/s per GPU; the gateway splits traffic 80/20 by cost and latency needs
Internal enterprise LLM platform: 8 fine-tuned 13B models for different business units, each on a dedicated 40GB A100, behind an nginx gateway with JWT auth and per-user rate limits (10 requests per minute); the platform tracks allocated KV memory and rejects requests once a 35GB threshold is reached (a minimal sketch of this admission control follows below)
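As a rough sketch of that budget-aware admission control, the class below reserves the worst-case KV memory a request could need before accepting it, using the same per-token cost as the 7B-class estimate earlier and the illustrative 35GB threshold from the enterprise example; the class and method names are hypothetical, not part of any serving framework.

```python
# Sketch of budget-aware admission control: reserve worst-case KV memory per
# request and reject (or queue) when the reservation would exceed the budget.
# The 35 GB threshold and per-token cost are illustrative assumptions.
import threading

class KVBudget:
    # fp16, 32 layers, 32 KV heads, head_dim 128 -> ~0.5 MB of KV cache per token
    BYTES_PER_TOKEN = 2 * 32 * 32 * 128 * 2

    def __init__(self, budget_bytes=35 * 1024**3):
        self.budget = budget_bytes
        self.allocated = 0
        self.lock = threading.Lock()

    def try_admit(self, prompt_tokens, max_new_tokens):
        """Reserve worst-case KV memory; return 0 if the request must be rejected."""
        need = (prompt_tokens + max_new_tokens) * self.BYTES_PER_TOKEN
        with self.lock:
            if self.allocated + need > self.budget:
                return 0          # caller returns HTTP 429 or queues the request
            self.allocated += need
            return need           # reservation size, passed back to release()

    def release(self, reserved_bytes):
        """Free the reservation once the request finishes or is cancelled."""
        with self.lock:
            self.allocated -= reserved_bytes
```

In practice a paged KV cache (as in vLLM) makes the accounting block-granular rather than per-request, but the admission decision is the same: never accept a request whose worst-case KV footprint does not fit in the remaining budget.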