Scaling RAG to Production: Architecture Patterns
Vector Index Deployment
Deploy the vector index as a replicated cluster with sharding. With 200 million embeddings at 12 kilobytes each (including metadata), you have 2.4 terabytes of data. A single node with 512 gigabytes of memory can hold roughly 40 million embeddings in RAM for sub-20-millisecond p95 retrieval, so you need at least 5 to 6 shards, each replicated 3x for availability.

Sharding strategy matters. Random sharding by document ID distributes load evenly but requires querying all shards and merging results: 5 shards means 5 parallel queries and a merge step on every request. Semantic space partitioning using clustering (for example, k-means with k equal to the shard count) routes each query to the 1 to 2 most relevant shards, reducing fan-out but risking hot spots if queries cluster around popular topics.

Load balancing across replicas uses round-robin with health checks. If one replica's p95 latency exceeds 50 milliseconds due to garbage collection or disk I/O, remove it from rotation temporarily.
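To make the semantic routing concrete, here is a minimal sketch (not a production router) that picks the 1 to 2 nearest shard centroids by cosine similarity. The function name `route_query` and the toy sizes are illustrative, and it assumes centroids from an offline k-means run over the corpus embeddings are already available:

```python
import numpy as np

def route_query(query_emb: np.ndarray, centroids: np.ndarray, fan_out: int = 2) -> list:
    """Return indices of the `fan_out` shards whose k-means centroids
    are closest (by cosine similarity) to the query embedding."""
    # Normalize so a dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = c @ q                             # similarity to each shard centroid
    return np.argsort(sims)[::-1][:fan_out].tolist()

# Example: 5 shards, 8-dimensional embeddings (toy sizes).
rng = np.random.default_rng(0)
centroids = rng.normal(size=(5, 8))
query = rng.normal(size=8)
shards = route_query(query, centroids)       # query only these 1-2 shards
```

Random sharding would instead fan the query out to all 5 shards; this router trades that guaranteed coverage for lower fan-out, which is why hot-spot monitoring matters.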
Hot and Cold Data Separation
Recent documents get 80% of queries due to recency bias. Separate hot data (the last 30 days, roughly 10 million embeddings) into a smaller in-memory index optimized for ultra-low latency: 5 to 10 milliseconds p95. Route queries there first. If results are insufficient (fewer than 5 chunks with a relevance score over 0.7), fall back to the cold index (the remaining 190 million embeddings) with 30 to 50 millisecond p95 latency from disk-backed storage. This pattern reduces infrastructure cost by 40% to 60%: you only need expensive high-memory instances for 5% of your data while serving 80% of traffic with excellent latency.
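The fallback logic above can be sketched in a few lines, assuming a simple index interface (`search(query_emb, k)`). `StaticIndex` and `Chunk` are hypothetical stand-ins for a real vector database client; the thresholds mirror the numbers in the text:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    score: float  # relevance score reported by the index

class StaticIndex:
    """Toy stand-in for a vector index client; returns preset results."""
    def __init__(self, chunks):
        self.chunks = chunks
    def search(self, query_emb, k):
        return self.chunks[:k]

def search_with_fallback(query_emb, hot_index, cold_index,
                         k=10, min_hits=5, min_score=0.7):
    """Query the small in-memory hot index first; fall back to the
    disk-backed cold index when results are insufficient."""
    hot_results = hot_index.search(query_emb, k)
    good = [c for c in hot_results if c.score > min_score]
    if len(good) >= min_hits:
        return hot_results                   # 5-10 ms path, ~80% of traffic
    return cold_index.search(query_emb, k)   # 30-50 ms fallback path

hot = StaticIndex([Chunk("h1", 0.9)] * 3)              # too few strong hits
cold = StaticIndex([Chunk(f"c{i}", 0.8) for i in range(10)])
results = search_with_fallback(None, hot, cold)        # falls back to cold
```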
Caching Strategy
Cache at multiple layers. The embedding cache stores query embeddings keyed by query text: if two users ask "How do I reset my password?", compute the embedding once. With 100,000 unique queries per day and a 50% repeat rate, this saves 50,000 embedding calls at $0.0001 each: $5 daily or $1,825 annually. More importantly, it cuts 20 to 50 milliseconds from latency. The retrieval cache stores top-K results for exact query matches. This works well for common questions but has low hit rates (10 to 20%) for long-tail queries. Use a least-recently-used (LRU) cache with a 1-hour time-to-live (TTL) to balance freshness and hit rate. The LLM response cache is trickier because context changes with conversation history. Partial caching of common document chunk combinations can help, but requires careful cache key design.
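The LRU-with-TTL combination can be written as a small in-process structure. This is a minimal sketch for illustration; production systems typically reach for Redis or Memcached with a configured TTL instead:

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """LRU cache with a per-entry time-to-live, suitable for the
    query-embedding and retrieval-result layers described above."""
    def __init__(self, max_size: int = 100_000, ttl_seconds: float = 3600.0):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._store = OrderedDict()          # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:    # stale: evict and report a miss
            del self._store[key]
            return None
        self._store.move_to_end(key)         # mark as most recently used
        return value

    def put(self, key, value):
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = (time.monotonic() + self.ttl, value)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used
```

The TTL bounds staleness (a re-indexed document stops being served from cache within an hour), while the LRU bound keeps memory use predictable under long-tail query traffic.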
Observability and Guardrails
Log which documents were retrieved, which made it into the prompt, and which citations appeared in the answer. This supports offline analysis of failures and online guardrails, such as blocking answers when all retrieved documents score below a confidence threshold (for example, 0.6 similarity). Track per-query metrics: retrieval latency breakdown (embedding, search, re-ranking), context length used, citation count, and user feedback signals. Alert when p95 retrieval latency exceeds 50 milliseconds or when the average relevance score drops below 0.7, indicating index drift or quality degradation.
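The blocking guardrail reduces to a small check over the retrieval scores. A sketch, assuming results arrive as dicts with `id` and `score` fields (the function name and log record shape are illustrative, not a standard API):

```python
import json
import logging
import time

CONFIDENCE_FLOOR = 0.6   # example similarity threshold from the text

def guard_and_log(query, retrieved, logger=logging.getLogger("rag")):
    """Emit a structured per-query log record and return False (block the
    answer) when every retrieved document scores below the floor."""
    best_score = max((d["score"] for d in retrieved), default=0.0)
    record = {
        "ts": time.time(),
        "query": query,
        "doc_ids": [d["id"] for d in retrieved],
        "scores": [d["score"] for d in retrieved],
        "blocked": best_score < CONFIDENCE_FLOOR,
    }
    logger.info(json.dumps(record))          # one JSON line per query
    return not record["blocked"]             # True -> safe to answer
```

Emitting one structured JSON line per query is what makes the offline failure analysis possible: the same records feed both the real-time alerting thresholds and batch queries over historical retrieval quality.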
Cost at Scale
At an average of 5 QPS (about 432,000 requests per day) with 1.5 second average latency, each request uses 5,000 input tokens (retrieved context) and 500 output tokens. Daily cost with GPT-4 pricing ($0.01 per 1K input tokens, $0.03 per 1K output tokens): 432,000 requests * 5K input tokens / 1,000 * $0.01 = $21,600 for input, plus 432,000 * 500 output tokens / 1,000 * $0.03 = $6,480 for output, totaling $28,080 daily or roughly $10.2 million annually just for LLM calls. Infrastructure (vector index, re-ranker, embedding service) adds another $1 to 2 million annually, for a total system cost of $11 to 12 million per year at this scale. This is why aggressive caching, hot/cold separation, and careful context management are essential.
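Replaying the arithmetic in code makes the totals easy to audit. Note that the daily figures of $21,600 and $6,480 correspond to an average of 5 QPS, i.e. 432,000 requests per day:

```python
SECONDS_PER_DAY = 86_400

def daily_llm_cost(qps, in_tokens, out_tokens, in_price_per_1k, out_price_per_1k):
    """Daily input and output LLM spend for a steady request rate."""
    requests = qps * SECONDS_PER_DAY
    input_cost = requests * in_tokens / 1_000 * in_price_per_1k
    output_cost = requests * out_tokens / 1_000 * out_price_per_1k
    return input_cost, output_cost

inp, out = daily_llm_cost(qps=5, in_tokens=5_000, out_tokens=500,
                          in_price_per_1k=0.01, out_price_per_1k=0.03)
daily = inp + out       # 21,600 + 6,480 = 28,080
annual = daily * 365    # about 10.25 million
```

Plugging a cache hit rate into `qps` (for example, halving it with the 50% repeat rate from the caching section) shows directly why the earlier patterns dominate the cost conversation.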