
Advanced Techniques: Caching, Multi-Model Routing, and Cost Optimization

At scale, production prompt systems use sophisticated techniques to optimize latency, cost, and quality. Caching is the most impactful optimization for deterministic or high-repetition tasks. A simple cache keyed by prompt hash plus input can deliver 40 to 70 percent hit rates, reducing median latency from 1.8 seconds to under 100 milliseconds for repeated queries. Cache hits bypass the model entirely, cutting serving costs proportionally. The trade-off is that exact-match caching only works for deterministic outputs with temperature set to zero; for creative or varied responses it is not applicable.

Multi-model routing directs requests to different model sizes based on task complexity and latency requirements. Simple classification or entity extraction tasks route to smaller, faster models that respond in 300 to 600 milliseconds at p95, while complex reasoning or generation tasks route to larger models that take 2 to 4 seconds but deliver higher accuracy. A production system might use a small model for 60 percent of traffic, achieving 450 milliseconds p95 latency at $0.0001 per request, while routing the remaining 40 percent to a large model at 2.8 seconds p95 and $0.002 per request. This hybrid approach balances cost and quality by serving the majority of requests quickly; because the slower tier still handles 40 percent of traffic, a single blended p95 is dominated by it, so latency Service Level Objectives (SLOs) are typically defined per tier.

Cost optimization also involves careful token budget management. Prompts should include only the context necessary for the task. Techniques like dynamic few-shot selection choose 2 to 5 examples based on input similarity rather than including a fixed set of 10 examples that inflates every request, and retrieval systems rank and prune documents to include only the top 3 to 5 most relevant sources instead of dumping 20 documents into context. Together these reduce average prompt size by 30 to 50 percent without significant accuracy loss.

Another optimization is prompt compression, where less critical content is summarized or paraphrased to reduce tokens. For example, a 2,000-token retrieved document might be compressed to 400 tokens using an extractive summarization step. This adds 50 to 150 milliseconds of preprocessing latency but saves 1,600 input tokens per request. Providers such as OpenAI and Anthropic charge per token, with input tokens typically priced at a fraction of the output token rate depending on the model, so reducing unnecessary input tokens directly lowers bills.

Finally, streaming responses improve perceived latency even when total generation time is unchanged. Instead of waiting 3 seconds for a complete response, users see the first tokens within roughly 300 milliseconds and the response builds incrementally, which dramatically improves the experience in interactive applications. Combined with caching, multi-model routing, and budget management, these techniques enable production systems to serve millions of requests per day at acceptable cost and latency while maintaining quality thresholds.
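As a minimal sketch of the exact-match caching idea above (not any provider's built-in prompt caching feature), the snippet below keys an in-memory cache on a hash of the prompt template ID, the rendered input, and the sampling parameters. The function names and the plain dict store are illustrative assumptions; a production system would use a shared store such as Redis with a TTL.

```python
import hashlib
import json

# In-memory cache for illustration; production systems would use a shared store with a TTL.
_cache: dict[str, str] = {}

def cache_key(template_id: str, user_input: str, params: dict) -> str:
    """Key on everything that affects the output: template, rendered input, and sampling params."""
    payload = json.dumps(
        {"template": template_id, "input": user_input, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_completion(template_id: str, user_input: str, params: dict, call_model) -> str:
    # Only cache deterministic requests; sampled outputs would be frozen incorrectly.
    if params.get("temperature", 1.0) != 0:
        return call_model(template_id, user_input, params)

    key = cache_key(template_id, user_input, params)
    if key in _cache:
        return _cache[key]  # cache hit: no model call, sub-millisecond lookup

    response = call_model(template_id, user_input, params)
    _cache[key] = response
    return response
```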
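A complexity-based router can be sketched in a few lines. The tier names, costs, latencies, and the simple task-type heuristic below are assumptions that mirror the 60/40 split described above, not a production routing policy.

```python
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    cost_per_request: float  # rough average cost in dollars
    p95_latency_s: float     # typical p95 latency in seconds

# Hypothetical tiers mirroring the split described in the text.
SMALL = ModelTier("small-fast", cost_per_request=0.0001, p95_latency_s=0.45)
LARGE = ModelTier("large-accurate", cost_per_request=0.002, p95_latency_s=2.8)

SIMPLE_TASKS = {"classification", "entity_extraction", "routing"}

def pick_model(task_type: str, max_latency_s: float) -> ModelTier:
    """Route simple or latency-critical tasks to the small model, everything else to the large one."""
    if task_type in SIMPLE_TASKS or max_latency_s < 1.0:
        return SMALL
    return LARGE

# Example: a classification request with a tight latency budget routes to the small tier.
tier = pick_model("classification", max_latency_s=0.8)
print(tier.name, tier.cost_per_request)
```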
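Dynamic few-shot selection and retrieval pruning reduce to the same mechanic: rank candidates by relevance and keep only what fits a token budget. The sketch below assumes relevance scores and token counts are already computed (for example, via embedding similarity and a tokenizer); the function name and sample data are hypothetical.

```python
def select_context(candidates: list[tuple[str, float, int]],
                   max_items: int,
                   token_budget: int) -> list[str]:
    """Keep the highest-relevance examples or documents that fit within a token budget.

    candidates: (text, relevance_score, token_count) tuples.
    """
    selected, used = [], 0
    # Consider candidates from most to least relevant.
    for text, _score, tokens in sorted(candidates, key=lambda c: c[1], reverse=True):
        if len(selected) >= max_items or used + tokens > token_budget:
            continue
        selected.append(text)
        used += tokens
    return selected

# Hypothetical few-shot pool: (example text, similarity to the query, token count).
example_pool = [
    ("Q: What is the refund window? A: 30 days.", 0.91, 40),
    ("Q: How long is shipping? A: 3 to 5 days.", 0.42, 38),
    ("Q: How long is the warranty? A: 1 year.", 0.87, 36),
]
examples = select_context(example_pool, max_items=2, token_budget=100)
print(examples)  # keeps the two most relevant examples that fit the budget
```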
💡 Key Takeaways
Caching delivers 40 to 70 percent hit rates for deterministic tasks, reducing latency from 1.8 seconds to under 100 milliseconds and cutting serving costs proportionally, but only works with temperature zero
Multi-model routing sends 60 percent of traffic to small, fast models (450ms p95, $0.0001 per request) and 40 percent to large, accurate models (2.8s p95, $0.002 per request) to balance latency and cost across the traffic mix
Dynamic few-shot selection and retrieval pruning reduce average prompt size by 30 to 50 percent by including only 2 to 5 relevant examples and the top 3 to 5 documents instead of fixed large sets
Prompt compression summarizes a 2,000-token document down to 400 tokens with 50 to 150 milliseconds of preprocessing, saving 1,600 input tokens per request and directly lowering per-token billing
Streaming responses show first tokens in 300 milliseconds while total generation takes 3 seconds, dramatically improving perceived latency and user experience without changing total compute cost
📌 Examples
Anthropic Claude caching system at a financial services company achieves 68 percent hit rate on compliance classification queries, reducing average latency from 2.1 seconds to 92 milliseconds and cutting monthly costs by $34,000
Google's internal multi-model router for code generation sends 70 percent of autocomplete requests to a distilled 7B-parameter model at 380ms p95 and 30 percent of complex refactoring requests to a 540B-parameter model at 4.2s p95
OpenAI GPT-4 Turbo input tokens cost $0.01 per 1,000 tokens while output tokens cost $0.03 per 1,000, so prompt compression that reduces input from 3,000 to 800 tokens saves $0.022 per request (see the arithmetic sketch below)
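The saving in the last example follows directly from the per-token price; a quick check of the arithmetic, assuming only the input side changes:

```python
input_price_per_1k = 0.01          # GPT-4 Turbo input price, dollars per 1,000 tokens
tokens_before, tokens_after = 3000, 800

# Tokens removed by compression, converted to dollars at the input rate.
saving = (tokens_before - tokens_after) / 1000 * input_price_per_1k
print(f"${saving:.3f} saved per request")  # $0.022
```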