
LLM Multi-Model Serving: Gateway Pattern and VRAM Constraints

Why LLMs Are Different

LLMs require a fundamentally different multi-model approach due to massive memory footprints and KV cache growth during generation. A 7B parameter model needs roughly 14GB of VRAM just for weights in FP16 (2 bytes per parameter), or 7 to 8GB quantized to INT8, plus additional gigabytes for the KV cache that grows with sequence length and batch size. This makes on-demand loading impractical: swapping a multi-gigabyte model in and out of VRAM takes 5 to 30 seconds and destroys throughput.
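The arithmetic behind these figures can be sketched as follows. The layer count, head count, and head dimension below are illustrative assumptions for a Llama-7B-class model; actual values vary by architecture.

```python
# Back-of-envelope VRAM estimate for serving a 7B model in FP16.
# Model shape is an assumption (Llama-7B-like): 32 layers, 32 KV heads,
# head dimension 128. Adjust for your actual architecture.
PARAMS = 7e9
BYTES_FP16 = 2

N_LAYERS = 32
N_KV_HEADS = 32
HEAD_DIM = 128

def weight_memory_gb(params=PARAMS, bytes_per_param=BYTES_FP16):
    """Static weight footprint: parameter count times bytes per parameter."""
    return params * bytes_per_param / 1e9

def kv_cache_gb(seq_len, batch_size, layers=N_LAYERS, kv_heads=N_KV_HEADS,
                head_dim=HEAD_DIM, bytes_per_elem=BYTES_FP16):
    """KV cache grows linearly with total tokens in flight."""
    # 2x for the separate K and V tensors stored per layer.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return seq_len * batch_size * per_token_bytes / 1e9

print(f"weights: {weight_memory_gb():.1f} GB")                 # weights: 14.0 GB
print(f"KV, 16 seqs x 2048 tokens: {kv_cache_gb(2048, 16):.1f} GB")
```

Under these assumptions each token in flight costs about 0.5MB of KV cache, which is why a single long sequence (see the sequence-length-spike discussion below) can consume gigabytes on its own.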

The Gateway Pattern

The dominant production pattern is gateway-level aggregation: each large model runs on dedicated GPU resources (one model per GPU or node), and a lightweight reverse proxy exposes a single external endpoint that routes to per-model backends. A team serving 10 different 7B to 13B models deploys 10 separate GPU instances (each running one model with vLLM or TensorRT-LLM), fronted by an nginx or Envoy gateway that routes based on model ID. The gateway adds negligible overhead (under 1ms) while providing centralized authentication, rate limiting, and failover.
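The routing core of such a gateway is a simple lookup from model ID to backend pool. A minimal sketch, with a hypothetical backend map and round-robin replica selection (a real nginx or Envoy deployment layers auth, rate limiting, and health-checked failover on top of the same lookup):

```python
# Hypothetical model-to-backend map; hostnames are illustrative.
MODEL_BACKENDS = {
    "llama-7b-chat": ["http://gpu-node-1:8000", "http://gpu-node-2:8000"],
    "mistral-7b":    ["http://gpu-node-3:8000"],
    "codellama-13b": ["http://gpu-node-4:8000"],
}

_rr_counters = {}  # per-model round-robin position

def route(model_id: str) -> str:
    """Pick a backend for the requested model, round-robin across replicas."""
    backends = MODEL_BACKENDS.get(model_id)
    if backends is None:
        raise KeyError(f"unknown model: {model_id}")
    i = _rr_counters.get(model_id, 0)
    _rr_counters[model_id] = i + 1
    return backends[i % len(backends)]

print(route("llama-7b-chat"))  # http://gpu-node-1:8000
print(route("llama-7b-chat"))  # http://gpu-node-2:8000
```

Because the lookup is a dictionary access, the routing decision itself is microseconds; the sub-millisecond gateway overhead cited above is dominated by proxying the HTTP connection, not the route choice.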

Throughput and Latency

Per-GPU throughput for LLMs is measured in tokens per second. A 7B model on a single 40GB A100 typically sustains 100 to 300 tokens/s of aggregate throughput across concurrent requests, depending on batch size, sequence length, and KV cache optimization (techniques like PagedAttention). Per-request latency is dominated by output length: generating 100 tokens at 50 tokens/s takes 2 seconds, plus initial prompt processing (typically 50 to 200ms for 1000-token prompts). Trying to fit two 7B models on one 40GB GPU usually violates SLOs because VRAM pressure limits effective batch size.
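The latency model above reduces to a first-order formula, sketched here (the default 100ms prefill time is an assumed midpoint of the 50 to 200ms range):

```python
def request_latency_s(output_tokens: int, decode_tok_per_s: float,
                      prefill_ms: float = 100.0) -> float:
    """First-order latency estimate: prompt prefill + token-by-token decode.

    Ignores queueing delay and network overhead; decode speed is the
    per-request generation rate, not the GPU's aggregate throughput.
    """
    return prefill_ms / 1000 + output_tokens / decode_tok_per_s

# 100 output tokens at 50 tok/s with ~100ms prefill:
print(f"{request_latency_s(100, 50):.1f} s")  # 2.1 s
```

This linear relationship is why output-token limits are the primary lever for bounding tail latency: doubling the allowed output length roughly doubles worst-case request time.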

Sequence Length Spikes

Sequence length spikes are the critical failure mode. If a user sends a request with a 4000-token output limit, the KV cache for that sequence alone can consume 2 to 4GB, cutting the number of concurrent requests from 16 to 4 and causing OOM or latency cliffs for other requests. Production systems mitigate this with strict max-token limits (512 or 1024 output tokens), budget-aware admission control that tracks allocated KV memory, and paged KV caching (vLLM's PagedAttention), which reduces fragmentation.
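Budget-aware admission control can be sketched as a reservation counter against a fixed KV budget. The class name, the 0.5MB-per-token figure (7B-class model in FP16), and the 20GB budget are illustrative assumptions, not a specific serving framework's API:

```python
# Sketch of budget-aware admission control: reject a request whose
# worst-case KV allocation would push the server past its VRAM budget.
KV_BYTES_PER_TOKEN = 0.5 * 1024**2   # ~0.5 MB/token, 7B model in FP16 (assumed)
KV_BUDGET_BYTES = 20 * 1024**3       # VRAM reserved for KV cache (assumed)

class AdmissionController:
    def __init__(self, budget: float = KV_BUDGET_BYTES):
        self.budget = budget
        self.allocated = 0.0

    def _reservation(self, prompt_tokens: int, max_output_tokens: int) -> float:
        # Reserve worst case: full prompt plus the request's output limit.
        return (prompt_tokens + max_output_tokens) * KV_BYTES_PER_TOKEN

    def try_admit(self, prompt_tokens: int, max_output_tokens: int) -> bool:
        need = self._reservation(prompt_tokens, max_output_tokens)
        if self.allocated + need > self.budget:
            return False  # shed load now instead of risking OOM mid-generation
        self.allocated += need
        return True

    def release(self, prompt_tokens: int, max_output_tokens: int) -> None:
        """Called when a request finishes; frees its reservation."""
        self.allocated -= self._reservation(prompt_tokens, max_output_tokens)
```

Under these assumptions, each request with a 4000-token output limit reserves about 2GB, so the 20GB budget admits only nine such requests concurrently: exactly the concurrency collapse described above, but handled with a clean rejection instead of an OOM.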

💡 Key Takeaways
- 7B-parameter LLMs require roughly 14GB of VRAM for weights alone in FP16 (7 to 8GB quantized to INT8); on-demand loading takes 5 to 30 seconds and destroys throughput, making per-GPU isolation necessary
- The gateway pattern deploys each model on a dedicated GPU with reverse-proxy routing; the gateway adds under 1ms of overhead while providing centralized auth and rate limiting
- A single 40GB A100 running a 7B model sustains 100 to 300 tokens per second of aggregate throughput; per-request latency is output length divided by generation speed (100 tokens at 50 tok/s equals 2 seconds)
- Fitting two 7B models on one 40GB GPU usually fails: VRAM pressure limits batch size, the KV cache thrashes, and total throughput drops below 150 tokens/s, violating SLOs
- Sequence length spikes are the critical failure mode: a 4000-token output can consume 2 to 4GB of KV cache, reducing concurrency from 16 to 4 requests and causing OOM; mitigate with strict max-token limits and paged KV caching
📌 Interview Tips
1. OpenAI-style deployment serving 5 LLM variants: each model on a separate GPU cluster (GPT-3.5-turbo on 50 A100s, GPT-4 on 200 A100s), with a single api.openai.com endpoint routing by the model parameter
2. Anthropic Claude serving multiple model sizes: claude-instant (7B class) at 250 tokens/s per GPU, claude-2 (70B+ class) at 30 tokens/s per GPU, with the gateway splitting traffic 80/20 by cost and latency needs
3. Internal enterprise LLM platform: 8 fine-tuned 13B models for different business units, each on a dedicated 40GB A100, behind an nginx gateway with JWT auth and per-user rate limits (10 requests per minute); the gateway tracks KV memory and rejects requests when a 35GB threshold is reached