Agent System Architecture & Execution Flow
The System Components:
In production, agent systems sit between user interfaces and existing microservices. The architecture consists of four key layers that work together to enable safe, scalable tool use.
First is the tool registry, which defines each tool as a typed function with a name, a description, input and output schemas (using JSON Schema or similar), and safety attributes such as required permissions and risk level. Think of this as your service catalog, formatted for LLM consumption.
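As a sketch of what such a registry entry might look like (the exact field names and schema shape here are illustrative assumptions, not a prescribed format):

```python
# Hypothetical registry entry; field names and schemas are
# illustrative, not a prescribed format.
SEARCH_TICKETS_TOOL = {
    "name": "search_tickets",
    "description": "Search support tickets by keyword and status.",
    "input_schema": {  # JSON Schema describing the LLM's arguments
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "status": {"type": "string", "enum": ["open", "closed", "all"]},
        },
        "required": ["query"],
    },
    "output_schema": {
        "type": "object",
        "properties": {"tickets": {"type": "array"}},
    },
    # Safety attributes, checked by the policy layer before execution.
    "required_permissions": ["tickets:read"],
    "risk_level": "low",
}

TOOL_REGISTRY = {SEARCH_TICKETS_TOOL["name"]: SEARCH_TICKETS_TOOL}
```

The LLM sees the name, description, and input schema; the orchestrator and policy layer use the safety attributes.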
Second is the agent orchestrator, the service responsible for the interaction loop. It initializes agent state with the user query, context such as the user profile and permissions, and the goal, then manages the back-and-forth between the LLM and tools.
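A minimal sketch of that loop, where `call_llm` and `execute_tool` are stand-ins for the real LLM client and tool services (their names and return shapes are assumptions for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """State the orchestrator initializes for one request."""
    query: str
    context: dict            # user profile, permissions, etc.
    goal: str
    history: list = field(default_factory=list)

def run_agent(state, call_llm, execute_tool, max_hops=2):
    """Interaction loop: call the LLM, execute any tool it requests,
    feed the result back, and stop at a text answer or the hop limit."""
    state.history.append({"role": "user", "content": state.query})
    for _ in range(max_hops + 1):
        step = call_llm(state.history)       # dict: text or tool call
        if step["type"] == "text":           # direct answer: done
            return step["content"]
        result = execute_tool(step["tool"], step["args"])
        state.history.append(
            {"role": "tool", "name": step["tool"], "content": result})
    raise RuntimeError("hop limit exceeded; escalate to an async workflow")
```

The `max_hops` parameter is how the hop restriction discussed later can be enforced inside the loop itself.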
Third is the state store, which maintains conversation history, intermediate results, and task progress. This might be Redis for short-lived sessions or a database for long-running workflows.
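To illustrate the interface such a store exposes, here is an in-memory stand-in with Redis-style TTL expiry (the class name, key scheme, and default TTL are assumptions):

```python
import time

class SessionStateStore:
    """In-memory stand-in for a Redis session store: state keyed by
    session id, expiring after a TTL like Redis EXPIRE would."""
    def __init__(self, ttl_seconds=1800):
        self.ttl = ttl_seconds
        self._data = {}   # session_id -> (expires_at, state)

    def save(self, session_id, state):
        self._data[session_id] = (time.time() + self.ttl, state)

    def load(self, session_id):
        entry = self._data.get(session_id)
        if entry is None or entry[0] < time.time():
            self._data.pop(session_id, None)   # expired or missing
            return None
        return entry[1]
```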
Fourth is the safety and policy layer, which checks every tool invocation against user identity, scopes, rate limits, and business rules before execution.
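A sketch of what those pre-execution checks might look like; the rule set and thresholds here are illustrative, not a complete policy:

```python
def check_policy(tool, user, rate_counts, rate_limit=10):
    """Validate one tool invocation before execution.
    Returns (allowed, reason); the rules below are illustrative."""
    # Identity and scopes: the user must hold every required permission.
    missing = set(tool["required_permissions"]) - set(user["scopes"])
    if missing:
        return False, f"missing scopes: {sorted(missing)}"
    # Rate limits: cap tool calls per user per window.
    if rate_counts.get(user["id"], 0) >= rate_limit:
        return False, "rate limit exceeded"
    # Business rules: e.g. high-risk tools need an elevated session.
    if tool["risk_level"] == "high" and not user.get("elevated"):
        return False, "high-risk tool requires elevated session"
    return True, "ok"
```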
The Execution Flow:
Let's trace a typical request through a system serving 5,000 concurrent users at 200 queries per second (QPS). A user sends a request to an API gateway. After authentication, the request passes to the agent orchestrator, which initializes state and calls the LLM with the available tools and their schemas.
The LLM outputs either a direct text response or a structured tool invocation. Suppose it calls a search_tickets tool. The orchestrator validates the parameters through the policy layer, executes the tool (which hits an indexed service in 50 to 150 milliseconds), and feeds the results back to the LLM for final response generation.
Scaling Considerations:
At global scale with 10,000 QPS, companies like Google and Microsoft split responsibilities. A thin front agent handles dialog and routing, while specialized sub-agents or microservices handle vertical logic like billing, search, or content creation. Tools are implemented as horizontally scalable, stateless services behind load balancers.
To keep latency under control at high throughput, mature systems run independent tools in parallel when possible, cache frequent queries with short time-to-live (TTL) values, and prefetch likely resources based on conversation state. When tool subsystems become overloaded, the orchestrator uses queues and applies backpressure rather than letting failures cascade.
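The parallel execution and short-TTL caching described above can be sketched with asyncio; the tool coroutines and cache policy here are illustrative:

```python
import asyncio
import time

_cache = {}  # key -> (expires_at, value); short-TTL result cache

async def cached_call(key, coro_fn, ttl=30.0):
    """Serve from cache while fresh; otherwise run the tool and cache it."""
    hit = _cache.get(key)
    if hit and hit[0] > time.monotonic():
        return hit[1]
    value = await coro_fn()
    _cache[key] = (time.monotonic() + ttl, value)
    return value

async def run_independent_tools(calls):
    """Run independent tool calls concurrently, so total latency is
    roughly the slowest call rather than the sum of all of them."""
    return await asyncio.gather(*(cached_call(k, fn) for k, fn in calls))
```

With two independent 300 ms and 250 ms tools, this pattern yields roughly 300 ms total instead of 550 ms sequential.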
Typical Request Latency Breakdown:
- LLM call (p50): 150-400 ms
- Tool execution: 50-150 ms
- Total p50 target: under 800 ms
⚠️ Common Pitfall: The orchestrator must restrict agents to one or two tool hops in the critical path for synchronous requests. Anything requiring more steps should be pushed to an asynchronous workflow with user notifications to avoid timeouts.
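One way to sketch that routing decision (the `planned_steps` representation and notification message are assumptions):

```python
def route_request(planned_steps, max_sync_hops=2):
    """Serve short tool chains synchronously; push longer plans to an
    async workflow and tell the user they'll be notified."""
    if len(planned_steps) <= max_sync_hops:
        return {"mode": "sync", "steps": planned_steps}
    return {"mode": "async",
            "message": "This will take a moment; we'll notify you.",
            "steps": planned_steps}
```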
💡 Key Takeaways
✓ The tool registry defines each tool with typed schemas, descriptions, and safety attributes that LLMs can understand and invoke correctly
✓ The orchestrator manages the interaction loop between LLM and tools, enforcing 1 to 2 tool hops in the critical path to meet p50 latency targets under 800 milliseconds
✓ The policy layer validates every tool call against user permissions, rate limits, and business rules before execution to prevent unauthorized access
✓ At scale beyond 10,000 QPS, systems split into front agents for routing and specialized sub-agents for vertical logic, with tools as stateless microservices
✓ Parallel tool execution, caching with short TTLs, and prefetching based on conversation state are essential to keep p95 latency under 2 seconds
📌 Examples
1. An internal support system at 200 QPS restricts requests to at most 2 tool calls: one search_tickets (100 ms) and one get_details (80 ms), staying under the 800 ms p50 target
2. Microsoft Copilot runs calendar and email tools in parallel when both are needed, reducing a sequential 300 ms + 250 ms to a parallel maximum of 300 ms
3. A Google Workspace agent prefetches recent docs when the user asks document questions, hitting cache in 15 ms instead of a 120 ms storage fetch