
Monitoring Inference Latency: Time to First Token vs End-to-End

Latency in ML serving is not a single number but a decomposition across pipeline stages. For Large Language Models (LLMs), the critical split is between Time to First Token (TTFT) and total end-to-end latency. TTFT measures how long the user waits for the first response token, typically targeting under 300 to 500 milliseconds for interactive systems. Total latency captures the full generation cycle, often 2 to 3 seconds at p95 for short answers. Streaming drastically improves perceived latency by rendering tokens as they arrive, even when total generation takes 5 to 10 seconds.

Production systems decompose latency into explicit stage budgets. A typical Retrieval Augmented Generation (RAG) pipeline might allocate 2 to 10 milliseconds for query vectorization, 10 to 50 milliseconds at p95 for Approximate Nearest Neighbor (ANN) retrieval, 50 to 150 milliseconds for reranking the top 50 retrieved documents with a cross-encoder, 5 to 30 milliseconds for context assembly, and 20 to 60 tokens per second during LLM decode (2 to 7 seconds for a 150-token answer without streaming). Microsoft reported a 105× speedup in one production case by optimizing across the full pipeline, reducing end-to-end time from 315 seconds to 3 seconds.

Tail percentiles matter more than averages. A practical alerting pattern is sampling p95 inference latency every 15 seconds and triggering an alert only if it exceeds 2 seconds over a 5 minute rolling window and stays elevated for 2 minutes, which avoids flapping. Feed ranking and ad serving systems enforce even tighter Service Level Objectives (SLOs), commonly requiring p95 end-to-end latency under 100 to 200 milliseconds, with individual model calls budgeted to tens of milliseconds.

The failure mode is tail amplification in cascading services: one slow retrieval hop that hits p99 can blow the entire request budget, and cold starts or Graphics Processing Unit (GPU) memory thrash cause spikes that propagate downstream. Use distributed tracing to attribute latency to stages, tag traces with model version and context length, and set per-hop budgets with admission control to shed load when queues exceed 50 to 100 milliseconds.
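To make stage budgets enforceable, each stage needs its own timer and its own threshold. The following is a minimal Python sketch under the budget figures quoted above; the stage names, budget values, and the commented-out stage functions are illustrative placeholders, not a specific library's API.

```python
import time
from contextlib import contextmanager

# Hypothetical per-stage budgets in seconds, mirroring the ranges in the text.
STAGE_BUDGETS_S = {
    "vectorize": 0.010,
    "ann_retrieval": 0.050,
    "rerank": 0.150,
    "context_assembly": 0.030,
    "llm_decode": 7.0,
}

class StageTimer:
    """Collects wall-clock durations per pipeline stage for one request."""

    def __init__(self):
        self.durations = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.durations[name] = time.perf_counter() - start

    def over_budget(self):
        """Return the stages whose measured duration exceeded their budget."""
        return {
            name: duration
            for name, duration in self.durations.items()
            if duration > STAGE_BUDGETS_S.get(name, float("inf"))
        }

# Sketch of use inside a RAG request handler (embed, index, etc. are placeholders):
# timer = StageTimer()
# with timer.stage("vectorize"):
#     query_vec = embed(query)
# with timer.stage("ann_retrieval"):
#     docs = index.search(query_vec, k=50)
# ...
# log_metrics(timer.durations)
# report(timer.over_budget())
```

Logging the per-stage durations with each request lets a dashboard attribute a slow p95 to a specific stage rather than to the pipeline as a whole.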
💡 Key Takeaways
Time to First Token (TTFT) targets under 300 to 500 milliseconds for interactive systems, while total p95 latency aims for 2 to 3 seconds for short answers
Streaming cuts perceived latency to hundreds of milliseconds even when total generation still takes 5 to 10 seconds, significantly improving user engagement
Stage level budgets are critical: query vectorization 2 to 10 milliseconds, ANN retrieval 10 to 50 milliseconds p95, reranking 50 to 150 milliseconds p95, LLM decode at 20 to 60 tokens per second
Alert when p95 inference latency exceeds 2 seconds over a 5 minute rolling window and stays elevated for 2 minutes to avoid flapping, using 15 second collection intervals (see the sketch after this list)
Tail amplification is the primary failure mode: one slow hop at p99 blows the entire request budget, requiring per-hop admission control and load shedding when queue wait exceeds 50 to 100 milliseconds
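A minimal sketch of that alerting rule, assuming 15 second samples, a 2 second p95 threshold, a 5 minute rolling window, and a 2 minute sustain period before firing; the class and parameter names are illustrative, not a specific monitoring product's API.

```python
import time
from collections import deque

class P95LatencyAlert:
    """Fires only when p95 latency stays above a threshold long enough."""

    def __init__(self, threshold_s=2.0, window_s=300, sustain_s=120):
        self.threshold_s = threshold_s
        self.window_s = window_s
        self.sustain_s = sustain_s
        self.samples = deque()          # (timestamp, latency_s) pairs
        self.breach_started_at = None   # when p95 first crossed the threshold

    def record(self, latency_s, now=None):
        """Add one latency sample (collected every ~15 s) and trim the window."""
        now = now if now is not None else time.time()
        self.samples.append((now, latency_s))
        while self.samples and self.samples[0][0] < now - self.window_s:
            self.samples.popleft()

    def _p95(self):
        latencies = sorted(latency for _, latency in self.samples)
        if not latencies:
            return 0.0
        idx = min(len(latencies) - 1, int(0.95 * len(latencies)))
        return latencies[idx]

    def should_alert(self, now=None):
        """True only if the p95 breach has persisted for the sustain period."""
        now = now if now is not None else time.time()
        if self._p95() > self.threshold_s:
            if self.breach_started_at is None:
                self.breach_started_at = now
            return now - self.breach_started_at >= self.sustain_s
        self.breach_started_at = None   # breach cleared: reset to avoid flapping
        return False
```

The sustain period is what prevents a single slow scrape from paging anyone; the alert only fires once the rolling p95 has stayed above the threshold for the full 2 minutes.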
📌 Examples
Microsoft production case: 105× speedup from 315 seconds to 3 seconds via pipeline optimization across all stages
Netflix and Uber feed ranking services enforce p95 end-to-end latency under 100 to 200 milliseconds, with model calls budgeted to tens of milliseconds
RAG system with semantic caching: 17× speedup when prompts repeat semantically, reducing retrieval and generation costs dramatically
Distributed tracing pattern: tag every trace with model version, prompt template hash, context length, and user cohort to attribute latency spikes to specific stages and configurations
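A minimal sketch of that tagging pattern using the OpenTelemetry Python API; the request fields, model version string, and run_llm call are placeholders for your own serving code, and the attribute names are illustrative rather than a required schema.

```python
from hashlib import sha256

from opentelemetry import trace

tracer = trace.get_tracer("inference.pipeline")

def traced_generation(request, model_version, prompt_template):
    """Wrap one generation call in a span tagged for latency attribution."""
    with tracer.start_as_current_span("llm_generate") as span:
        span.set_attribute("model.version", model_version)
        span.set_attribute("prompt.template_hash",
                           sha256(prompt_template.encode()).hexdigest()[:12])
        span.set_attribute("context.length_tokens", request["context_tokens"])
        span.set_attribute("user.cohort", request.get("cohort", "default"))
        return run_llm(request)  # placeholder for the actual model call
```

With these attributes on every span, a latency regression can be sliced by model version, prompt template, context length, or cohort instead of showing up only as an undifferentiated p95 increase.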