
RAG System Architecture and Data Flow

The Two Critical Paths: RAG systems operate through two distinct pipelines: offline data ingestion that builds the knowledge index, and online query serving that retrieves and generates answers in real time. Understanding both is essential for system design interviews.
[Pipeline diagram] Raw Documents (wikis, PDFs, code, tickets) → Chunk + Embed (500 to 1000 tokens per chunk) → Vector Index (millions of searchable embeddings) → Retrieval + Generation (query time: 1 to 2 sec p95)
Offline Ingestion Pipeline: Documents arrive from multiple sources: document stores, code repositories, ticketing systems, and email archives. The system normalizes formats, extracts text, and splits content into chunks. Chunk size is critical: too small (50 to 100 tokens) and you lose semantic coherence; too large (3000 tokens) and retrieval becomes coarse and expensive. Production systems typically settle on 300 to 1000 tokens per chunk with 50 to 100 tokens of overlap to preserve context at chunk boundaries. Each chunk is embedded once using a model like OpenAI text-embedding-3-large (3072 dimensions) or similar. For 50 million documents split into 500-token chunks, you might generate 200 million embeddings; at 12 kilobytes per embedding with metadata, that is roughly 2.4 terabytes of index data. These vectors are stored in a vector database such as Pinecone, Weaviate, or Milvus, with metadata including access control lists, document IDs, and timestamps.

Online Query Serving: When a user asks "How do I roll out a new microservice?", the system first embeds the query using the same embedding model (typically 20 to 50 milliseconds). It then executes a vector search to find the top 50 similar chunks, usually completing in 10 to 30 milliseconds at p95 with approximate nearest neighbor algorithms. Many systems then apply a re-ranker, a smaller cross-encoder model that scores query-to-document relevance more accurately than pure vector similarity, narrowing the 50 candidates down to the best 5 to 10 chunks in another 10 to 30 milliseconds. The selected passages, plus instructions, are inserted into the LLM prompt.
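A minimal sketch of the offline chunk-and-embed step described above. The token budget, overlap, and use of tiktoken plus the OpenAI embeddings API are illustrative assumptions, not a prescribed production setup:

```python
# Sketch: split text into ~500-token chunks with 50-token overlap, then embed.
# Assumes tiktoken for tokenization and text-embedding-3-large for embeddings.
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by text-embedding-3-large

def chunk_text(text: str, chunk_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Slide a chunk_tokens-wide window over the token stream with overlap."""
    tokens = enc.encode(text)
    step = chunk_tokens - overlap
    return [enc.decode(tokens[i:i + chunk_tokens]) for i in range(0, len(tokens), step)]

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Embed each chunk once; returns 3072-dimensional vectors.
    A real pipeline would batch requests and store vectors + metadata
    (ACLs, document IDs, timestamps) in the vector database."""
    response = client.embeddings.create(model="text-embedding-3-large", input=chunks)
    return [item.embedding for item in response.data]
```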
[Latency diagram] Typical query latency breakdown: vector search ~30 ms, re-ranking ~20 ms, LLM generation ~600 ms
The LLM generates a 1000-token answer, taking 400 to 800 milliseconds at p95 for GPT-4-class models. Total end-to-end latency: 600 milliseconds at p50, 1.5 to 2.0 seconds at p95. Systems like Microsoft 365 Copilot and OpenAI ChatGPT Enterprise follow this pattern, achieving sub-2-second responses for complex queries over millions of documents.
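The online path can be sketched end to end as below. This is a hedged illustration: vector_index.search() is a placeholder for whatever vector-database client you use, the cross-encoder model name is one common public choice, and gpt-4o stands in for any GPT-4-class model:

```python
# Sketch: embed query -> ANN search (top 50) -> cross-encoder re-rank (top 5)
# -> assemble prompt -> generate answer.
from openai import OpenAI
from sentence_transformers import CrossEncoder

client = OpenAI()
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed re-ranker

def answer(query: str, vector_index, top_k: int = 50, keep: int = 5) -> str:
    # 1. Embed the query with the same model used at ingestion time (~20-50 ms).
    q_emb = client.embeddings.create(
        model="text-embedding-3-large", input=[query]
    ).data[0].embedding

    # 2. Approximate nearest-neighbor search for the top-50 chunks (~10-30 ms p95).
    candidates = vector_index.search(q_emb, top_k)  # placeholder index client

    # 3. Cross-encoder re-ranking narrows 50 candidates to the best few (~10-30 ms).
    scores = reranker.predict([(query, c.text) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: -pair[0])
    best = [c for _, c in ranked][:keep]

    # 4. Insert the selected passages plus instructions into the LLM prompt.
    context = "\n\n".join(c.text for c in best)
    completion = client.chat.completions.create(
        model="gpt-4o",  # stand-in for a GPT-4-class model
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return completion.choices[0].message.content
```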
💡 Key Takeaways
Offline ingestion chunks documents into 300 to 1000 token pieces with overlap, embeds each chunk, and stores the vectors in a vector index with metadata
Chunk size trades off semantic coherence (too small loses context) versus retrieval precision (too large is coarse and expensive)
Online serving performs query embedding (20 to 50ms), vector search (10 to 30ms p95), re-ranking (10 to 30ms), then LLM generation (400 to 800ms p95)
For 200 million embeddings at 12KB each with metadata, expect roughly 2.4TB of index storage requiring sharded deployment
Approximate nearest neighbor algorithms are essential: exact search over 100 million vectors cannot meet sub-50ms p95 latency targets (see the index sketch after this list)
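A small sketch of an approximate nearest-neighbor index using FAISS HNSW. The dimensions match text-embedding-3-large; the vector count, graph parameters, and random stand-in data are illustrative only, since a production deployment shards hundreds of millions of vectors across nodes:

```python
# Sketch: build an HNSW index and query it for the top-50 neighbors.
import numpy as np
import faiss

d = 3072                              # embedding dimension (text-embedding-3-large)
index = faiss.IndexHNSWFlat(d, 32)    # HNSW graph with 32 links per node (L2 metric)
index.hnsw.efSearch = 64              # recall/latency knob at query time

vectors = np.random.rand(10_000, d).astype("float32")  # stand-in for real embeddings
faiss.normalize_L2(vectors)           # L2 on unit vectors ranks like cosine similarity
index.add(vectors)

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
distances, ids = index.search(query, 50)  # approximate top-50 in milliseconds
```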
📌 Examples
1. Enterprise assistant handling 20,000 employees querying 50 million documents (30TB text): chunks to 200M embeddings, sharded vector index across 20 nodes, hybrid search combining semantic and keyword signals, achieving 200 QPS at 1.8 second p95 latency
2. Customer support system with daily document updates: incremental indexing adds new chunks in 5 to 10 minutes, hot recent documents in a memory-optimized index, cold historical data in a larger disk-backed index with slightly higher latency
3. Legal research platform: 10 million case documents split into 80 million chunks, cross-encoder re-ranker improves relevance by 25% compared to pure vector search, citations to specific paragraphs required for every generated claim