Definition
Latency in ML serving is not a single number; it decomposes across pipeline stages. For LLMs, the critical split is between Time to First Token (TTFT), the delay before the user sees the first response token (typically targeted under 300 to 500 ms), and total end-to-end latency, the full generation cycle, which often runs 2 to 3 seconds at p95.
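As a rough sketch of how the two numbers separate in practice, the following times TTFT and total latency around a streaming loop; generate_stream is a hypothetical stub standing in for a real streaming client, and the sleep values are illustrative only.

```python
import time

def generate_stream(prompt):
    # Hypothetical stand-in for a streaming LLM client; yields tokens as they decode.
    time.sleep(0.25)               # prefill phase dominates time-to-first-token
    for token in ["The", " answer", " is", " 42", "."]:
        time.sleep(0.03)           # per-token decode step
        yield token

start = time.perf_counter()
ttft = None
for i, token in enumerate(generate_stream("example prompt")):
    if i == 0:
        ttft = time.perf_counter() - start     # Time to First Token
total = time.perf_counter() - start            # end-to-end generation latency

print(f"TTFT: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms")
```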
Stage Budget Decomposition
Production systems decompose latency into explicit stage budgets. A typical RAG pipeline might allocate: 2 to 10 ms for query vectorization, 10 to 50 ms at p95 for ANN retrieval, 50 to 150 ms for reranking the top-k = 50 documents with a cross-encoder, 5 to 30 ms for context assembly, and 20 to 60 tokens per second during LLM decode (roughly 2 to 7 seconds for a 150-token answer without streaming). Microsoft reported a 105x speedup in one case by optimizing across the full pipeline.
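A minimal way to make such budgets explicit in code, assuming illustrative budget values and no particular serving framework, is a small timing wrapper per stage:

```python
import time
from contextlib import contextmanager

# Illustrative per-stage p95 budgets (ms) mirroring the decomposition above.
STAGE_BUDGET_MS = {
    "vectorize": 10,
    "ann_retrieval": 50,
    "rerank": 150,
    "context_assembly": 30,
}

stage_timings_ms = {}

@contextmanager
def stage(name):
    # Time one pipeline stage and flag it if it blows its budget.
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        stage_timings_ms[name] = elapsed_ms
        if elapsed_ms > STAGE_BUDGET_MS[name]:
            print(f"budget exceeded: {name} took {elapsed_ms:.1f} ms "
                  f"(budget {STAGE_BUDGET_MS[name]} ms)")

# Usage: wrap each hop of the pipeline.
with stage("vectorize"):
    time.sleep(0.005)    # stand-in for embedding the query
with stage("ann_retrieval"):
    time.sleep(0.02)     # stand-in for the ANN index lookup
```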
Alerting on Tail Percentiles
Tail percentiles matter more than averages. A practical alerting pattern is to sample p95 inference latency every 15 seconds and fire an alert when it exceeds 2 seconds over a 5-minute rolling window and stays elevated for 2 minutes, so brief spikes do not cause flapping. Feed ranking and ad serving systems enforce even tighter SLOs, commonly requiring sub-100 to 200 ms p95 end-to-end, with individual model calls budgeted to tens of milliseconds.
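A sketch of that evaluation logic in plain Python follows; the thresholds mirror the numbers above, and a real deployment would express the same shape as a monitoring rule (for example, a Prometheus-style alert with a 2-minute "for" clause).

```python
import time
from collections import deque

WINDOW_S = 5 * 60      # 5-minute rolling window for the p95
FOR_S = 2 * 60         # condition must hold for 2 minutes before firing
THRESHOLD_S = 2.0      # p95 inference latency threshold

samples = deque()      # (timestamp, latency_s) of individual inferences
breach_since = None    # when the p95 first crossed the threshold

def record(latency_s, now=None):
    """Ingest one inference latency; return True when the alert should fire.
    In practice this evaluation runs on a 15-second collection interval."""
    global breach_since
    now = now if now is not None else time.time()
    samples.append((now, latency_s))
    while samples and samples[0][0] < now - WINDOW_S:
        samples.popleft()                      # drop points outside the window
    latencies = sorted(l for _, l in samples)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    if p95 > THRESHOLD_S:
        breach_since = breach_since or now
        return now - breach_since >= FOR_S     # the "for 2 minutes" clause avoids flapping
    breach_since = None
    return False
```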
Tail Amplification Failure
One slow retrieval hop that hits p99 can blow the entire request budget. Cold starts or GPU memory thrash cause spikes that propagate downstream. Use distributed tracing to attribute latency to stages, tag traces with model version and context length, and set per-hop budgets with admission control that sheds load when queues exceed 50 to 100 ms.
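One possible shape for that per-hop admission control, with invented class and parameter names and thresholds taken from the 50 to 100 ms guidance above:

```python
PER_HOP_BUDGET_MS = {"retrieval": 50, "rerank": 150, "decode": 3000}  # illustrative
MAX_QUEUE_WAIT_MS = 75    # shed load once queueing delay sits in the 50-100 ms range

class Hop:
    """One pipeline hop with a latency budget and a simple admission check."""
    def __init__(self, name, budget_ms):
        self.name = name
        self.budget_ms = budget_ms
        self.queue_wait_ms = 0.0   # fed by the serving framework's queue metrics

    def admit(self, deadline_ms_remaining):
        # Shed the request if queueing alone exceeds the per-hop limit,
        # or if queue wait plus the hop's budget would burn the remaining deadline.
        if self.queue_wait_ms > MAX_QUEUE_WAIT_MS:
            return False
        if self.queue_wait_ms + self.budget_ms > deadline_ms_remaining:
            return False
        return True

retrieval = Hop("retrieval", PER_HOP_BUDGET_MS["retrieval"])
retrieval.queue_wait_ms = 120      # e.g. GPU memory thrash has backed up the queue
if not retrieval.admit(deadline_ms_remaining=400):
    print("shed: retrieval queue exceeds admission threshold")
```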
✓ Time to First Token (TTFT) targets under 300 to 500 milliseconds for interactive systems, while total p95 latency aims for 2 to 3 seconds for short answers
✓ Streaming reduces perceived latency by hundreds of milliseconds even when total generation remains 5 to 10 seconds, significantly improving user engagement
✓ Stage-level budgets are critical: query vectorization 2 to 10 milliseconds, ANN retrieval 10 to 50 milliseconds p95, reranking 50 to 150 milliseconds p95, LLM decode at 20 to 60 tokens per second
✓ Alert when p95 inference latency exceeds 2 seconds over a 5-minute rolling window for 2 minutes to avoid flapping; use 15-second collection intervals
✓ Tail amplification is the primary failure mode: one slow hop at p99 blows the entire request budget, requiring per-hop admission control and load shedding at 50 to 100 milliseconds queue depth
1. Microsoft production case: 105× speedup, from 315 seconds to 3 seconds, via pipeline optimization across all stages
2. Netflix and Uber feed ranking services enforce sub-100 to 200 milliseconds p95 end-to-end, with model calls budgeted to tens of milliseconds
3. RAG system with semantic caching: 17× speedup when prompts repeat semantically, reducing retrieval and generation costs dramatically
4. Distributed tracing pattern: tag every trace with model version, prompt template hash, context length, and user cohort to attribute latency spikes to specific stages and configurations (a minimal tagging sketch follows)
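A minimal tagging sketch, assuming the OpenTelemetry Python API (opentelemetry-api package); the span and attribute names are illustrative, not a prescribed schema:

```python
from opentelemetry import trace

tracer = trace.get_tracer("rag-pipeline")

def traced_request(query: str, model_version: str, template_hash: str,
                   context_length: int, user_cohort: str) -> None:
    # Attach the attributes named above to the request span so latency spikes
    # can be sliced by model version, template, context length, and cohort.
    with tracer.start_as_current_span("rag.request") as span:
        span.set_attribute("model.version", model_version)
        span.set_attribute("prompt.template_hash", template_hash)
        span.set_attribute("context.length_tokens", context_length)
        span.set_attribute("user.cohort", user_cohort)

        # Each hop gets its own child span so per-stage latency shows up in traces.
        with tracer.start_as_current_span("rag.retrieval"):
            pass   # ANN lookup would run here
        with tracer.start_as_current_span("rag.generate"):
            pass   # LLM call would run here

traced_request("example query", model_version="v3", template_hash="a1b2c3",
               context_length=1024, user_cohort="beta")
```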