Production Implementation and LLMOps
Offline Simulators and Evaluation
Before deploying a new prompt, tool, or orchestration change, teams build simulators that replay real production traffic against the modified system. They collect a dataset of 10,000 to 50,000 representative user requests with ground truth labels or human judgments.
The simulator runs each request through both the current system and the candidate system, comparing outputs. Metrics include: task success rate (did it solve the user problem?), tool call efficiency (did it use the minimum necessary tools?), latency (p50, p95, p99), and cost (average LLM calls and tokens per request).
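A minimal sketch of such a replay harness, assuming each system under test is a callable that returns a per-request result dict with hypothetical `success`, `tool_calls`, `latency_ms`, and `tokens` fields (the real systems, datasets, and metric names will differ):

```python
import statistics

def percentile(values, p):
    """Nearest-rank percentile over a sorted copy of the samples."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

def evaluate(system, requests):
    """Replay labeled requests through one system and aggregate metrics."""
    results = [system(r) for r in requests]
    return {
        "success_rate": sum(r["success"] for r in results) / len(results),
        "avg_tool_calls": statistics.mean(r["tool_calls"] for r in results),
        "latency_p95_ms": percentile([r["latency_ms"] for r in results], 95),
        "avg_tokens": statistics.mean(r["tokens"] for r in results),
    }

def compare(current, candidate, requests):
    """Run both systems over the same replayed traffic, side by side."""
    return {"current": evaluate(current, requests),
            "candidate": evaluate(candidate, requests)}
```

The side-by-side report is what feeds the ship/no-ship decision: the same requests go through both systems, so metric deltas are attributable to the change rather than to traffic drift.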
Example: you want to add a new search_company_wiki tool. Offline evaluation on 20,000 support requests shows: success rate improves from 78 percent to 84 percent, average tool calls increase from 2.1 to 2.8, latency p95 increases from 1.2 seconds to 1.6 seconds, and cost per request increases by 22 percent. You decide the 6-point success improvement justifies the cost, but you implement parallel wiki search to reduce the latency hit to 200 milliseconds.
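The parallel-search idea can be sketched as a thread-pool fan-out, so total latency is roughly the slowest tool rather than the sum of all tools. The `tools` mapping of name to callable is an assumed interface for illustration:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def parallel_search(query, tools):
    """Fan one query out to several search tools concurrently.

    With sequential calls, latency is the sum of all tool latencies;
    with this fan-out it is approximately the maximum of them.
    """
    with ThreadPoolExecutor(max_workers=len(tools)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in tools.items()}
        return {name: f.result() for name, f in futures.items()}
```

For example, running search_docs and search_company_wiki in parallel turns two 400-millisecond calls into one roughly 400-millisecond step instead of 800 milliseconds.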
Metrics That Matter
Production monitoring tracks multiple dimensions. System metrics include requests per second, latency percentiles, error rates, and tool success rates. ML metrics include average LLM calls per request, average tokens per request, and tool invocation distribution (which tools are called most?). Business metrics include task completion rate, user satisfaction scores, and escalation-to-human rate. You also track incremental cost per successful task. If adding a feature improves success rate from 80 percent to 85 percent but doubles cost per request, that is an extra $0.04 per request to gain 5 points of success. For a support product with 1 million requests per month, that is $40,000 per month for 50,000 additional successful resolutions, or $0.80 per extra resolution. The business decides whether this Return on Investment (ROI) makes sense.
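The ROI arithmetic above can be captured in a small helper, a sketch assuming success rates as fractions and a flat added cost per request (the function name is illustrative):

```python
def incremental_roi(base_success, new_success, added_cost_per_request,
                    monthly_requests):
    """Translate a quality/cost tradeoff into monthly business terms."""
    extra_successes = (new_success - base_success) * monthly_requests
    extra_cost = added_cost_per_request * monthly_requests
    return {
        "extra_successes": extra_successes,
        "extra_monthly_cost": extra_cost,
        "cost_per_extra_success": extra_cost / extra_successes,
    }
```

Plugging in the numbers from the text (80 to 85 percent, $0.04 added per request, 1 million requests) yields 50,000 extra resolutions at $40,000 per month, i.e. $0.80 per additional successful task.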
A/B Testing Agents
You cannot simply deploy a new agent configuration to 100 percent of traffic. Teams run controlled experiments: 5 percent of traffic goes to the new system while 95 percent stays on the current one, running for 1 to 2 weeks and collecting thousands of requests per variant. The key challenge: agent outputs are not easily compared. For a search ranking change, you measure clicks; for an agent, success is subjective. Many teams therefore combine automated metrics (did it call the right tools? was latency acceptable?) with human evaluation (a random sample of 200 to 500 responses rated by human judges). If automated metrics look good but human evaluation is neutral, you do not ship. If latency degrades beyond SLA thresholds, you optimize first even when quality improves.
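Traffic splitting is typically done with deterministic hash-based bucketing so a given user always sees the same variant for the life of the experiment. A minimal sketch, where the salt string is a hypothetical per-experiment identifier:

```python
import hashlib

def assign_variant(user_id, treatment_pct=5, salt="agent-exp-01"):
    """Deterministically bucket a user into treatment or control.

    Hashing user_id with a per-experiment salt gives a stable,
    uniformly distributed bucket in [0, 100); changing the salt
    reshuffles users for the next experiment.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < treatment_pct else "control"
```

Stickiness matters for agents in particular: a user bouncing between two differently-behaving systems mid-conversation would contaminate both the metrics and the user experience.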
The Simplicity Escape Hatch
The most important production lesson: agent systems are not always the answer. For many use cases, simple Retrieval-Augmented Generation (RAG) with a single search plus one LLM call outperforms multi-step agents on latency, cost, and reliability, with only a marginal quality difference.
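For contrast with the multi-step pipelines above, the single-search-plus-one-call baseline fits in a few lines. This is a sketch with `search` and `llm` as hypothetical callables injected by the caller, not any specific framework's API:

```python
def simple_rag(query, search, llm, k=5):
    """One retrieval, one generation: no planning loop, no tool routing.

    `search(query)` returns a ranked list of text passages;
    `llm(prompt)` returns a completion string.
    """
    passages = search(query)[:k]
    context = "\n\n".join(passages)
    prompt = (f"Answer using only the context below.\n\n"
              f"Context:\n{context}\n\n"
              f"Question: {query}\nAnswer:")
    return llm(prompt)
```

One retrieval and one generation means exactly two points of failure and a latency budget you can reason about, which is why this baseline should be beaten, not assumed beaten, before an agent ships.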
Observability Is Non-Negotiable
Every production agent system logs every interaction with correlation identifiers. When a user reports a problem, engineers trace the exact LLM prompts, tool calls, parameters, results, and policy decisions that led to the output.
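The trace structure can be as simple as one correlation ID shared by every event a request produces. A sketch (field names are illustrative; production systems typically emit these to a log pipeline rather than an in-memory dict):

```python
import json
import time
import uuid

def new_trace(user_request):
    """Start a trace for one user request with a fresh correlation ID."""
    return {"correlation_id": str(uuid.uuid4()),
            "request": user_request,
            "events": []}

def log_event(trace, kind, **fields):
    """Append one structured event (LLM call, tool call, policy check).

    Every event rides under the same correlation_id, so an engineer
    can later replay the exact sequence that produced an output.
    """
    trace["events"].append({"ts": time.time(), "kind": kind, **fields})
    return trace

def serialize(trace):
    """Emit the trace as one JSON line for the logging backend."""
    return json.dumps(trace)
```

When a user files a report, the correlation ID on their response is the single key that pulls back every prompt, tool call, and decision in order.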
Logs feed into dashboards showing: tool success rates over time (is a tool degrading?), latency distribution per tool (which tool is the bottleneck?), cost trends (is average token usage creeping up?), and error patterns (what are the top 10 failure modes?).
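Two of those dashboard rollups, per-tool success rate and top failure modes, can be sketched as simple aggregations over per-call log records (the record fields `tool`, `ok`, and `error` are assumed names, not a standard schema):

```python
from collections import Counter

def tool_success_rates(log_records):
    """Roll per-call records up into a success rate per tool."""
    calls, oks = Counter(), Counter()
    for rec in log_records:
        calls[rec["tool"]] += 1
        oks[rec["tool"]] += rec["ok"]
    return {tool: oks[tool] / n for tool, n in calls.items()}

def top_failure_modes(log_records, n=10):
    """Count error codes on failed calls to surface the top failure modes."""
    errors = Counter(rec["error"] for rec in log_records if not rec["ok"])
    return errors.most_common(n)
```

Run over a time window (say, hourly), the first rollup catches a degrading tool before users do; the second tells you which failure to fix first.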
This data also feeds back into prompt engineering and tool interface refinement. If logs show the LLM frequently calls search_docs with overly broad queries that return too many results, you might refine the tool description in the schema or add a max_results parameter with a default of 5.
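Such a refinement might look like the following tool schema, expressed here in the JSON-Schema-style format many LLM tool-calling APIs accept; the description text and example query are hypothetical:

```python
# Hypothetical refined schema: the description steers the model toward
# focused queries, and max_results defaults to 5 to cap result volume.
SEARCH_DOCS_SCHEMA = {
    "name": "search_docs",
    "description": ("Search product documentation. Use a specific, "
                    "focused query; avoid single broad keywords."),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": ("Specific search query, e.g. "
                                "'reset VPN token on macOS'"),
            },
            "max_results": {
                "type": "integer",
                "default": 5,
                "description": "Maximum number of passages to return.",
            },
        },
        "required": ["query"],
    },
}
```

The point is that the schema itself is a prompt: tightening the description and defaults is often cheaper and more reliable than adding post-hoc filtering logic.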