What is RAG (Retrieval-Augmented Generation)?
The Core Problem
Large Language Models (LLMs) are trained on static snapshots of public data. They cannot access your company's internal documentation, last week's product specifications, or yesterday's incident reports. Worse, when an LLM lacks the knowledge to answer, it hallucinates: it confidently invents plausible-sounding but completely false information. Fine-tuning the model helps with style and behavior, but it does not reliably inject a large, constantly changing knowledge base into the model's weights. Retraining is expensive, slow (taking days or weeks), and still does not guarantee the model will correctly recall specific facts from millions of documents.
How RAG Solves This
RAG separates knowledge storage from language generation. Instead of cramming facts into the model's weights, you store your documents in an external search system, typically a vector database. When a user asks a question, the system first retrieves the most relevant documents, then provides them as context to the LLM in the prompt with instructions like "Answer based only on these sources." Think of it as an open-book exam versus a closed-book exam. Without RAG, the LLM must rely purely on memorized training data (closed book). With RAG, it can look up specific information in the provided documents (open book) before answering.
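The sketch below shows that retrieve-then-generate loop end to end. Everything in it is a stand-in chosen so the script runs on its own: the three sample documents, the toy bag-of-words embed function, and the final print in place of an LLM call. A real system would swap in a trained embedding model, a vector database, and an LLM client, but the shape of the loop is the same: embed the query, retrieve the closest documents, and assemble a prompt that tells the model to answer only from those sources.

```python
# Minimal, self-contained sketch of the retrieve-then-generate flow.
# All components here are stand-ins so the script runs without any
# external service or API key.

import numpy as np

VOCAB = ["refund", "policy", "shipping", "warranty", "return", "day"]

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: counts of a tiny vocabulary (prefix match handles plurals).
    A real system would use a trained embedding model producing dense vectors."""
    words = text.lower().replace(".", "").replace("?", "").split()
    return np.array(
        [sum(1 for w in words if w.startswith(v)) for v in VOCAB], dtype=float
    )

# 1. Index documents: store each text alongside its vector.
documents = [
    "Refunds are issued within 14 days of a return request.",
    "Standard shipping takes 3 to 5 business days.",
    "The warranty covers manufacturing defects for one year.",
]
doc_vectors = np.array([embed(d) for d in documents])

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents whose vectors are closest to the query (cosine similarity)."""
    q = embed(query)
    sims = doc_vectors @ q / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q) + 1e-9
    )
    top = np.argsort(sims)[::-1][:k]
    return [documents[i] for i in top]

# 2. Build the grounded prompt: retrieved passages plus an instruction
#    to answer only from those sources.
query = "How many days do refunds take?"
context = "\n".join(f"- {doc}" for doc in retrieve(query))
prompt = (
    "Answer based only on these sources:\n"
    f"{context}\n\n"
    f"Question: {query}"
)

# 3. Hand the prompt to the generative LLM (stand-in: just print it here).
print(prompt)
```

Running it prints a prompt whose top source is the refund document, which is exactly what the LLM would receive as its "open book" before answering.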
The Three Core Components
First, an embedding model converts text into high-dimensional vectors (typically 768 to 1536 dimensions) so that semantically similar content is mathematically close in vector space. Second, a retrieval system searches millions or billions of these vectors with low latency, usually under 30 milliseconds at the 95th percentile. Third, the generative LLM uses the retrieved passages together with the user query to craft an answer, often with citation requirements. This architecture lets you update knowledge without retraining the LLM and gives you precise control over what information the model can access.
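As a rough illustration of the first two components, the sketch below embeds a few sentences with an open-source model and ranks them against a query by cosine similarity. The sentence-transformers package and the model name all-MiniLM-L6-v2 (384-dimensional vectors) are illustrative assumptions, not requirements; any embedding model would slot in the same way, and at production scale the brute-force scan would be replaced by an approximate-nearest-neighbor index in a vector database to keep latency low.

```python
# Sketch of components one and two: an embedding model plus similarity search.
# Assumes the sentence-transformers package is installed; the model name is
# one illustrative choice, not a requirement.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # downloads the model on first use

corpus = [
    "Our refund policy allows returns within 30 days.",
    "The API rate limit is 100 requests per minute.",
    "Employees accrue 20 vacation days per year.",
]
query = "How long do customers have to return a product?"

# Component 1: the embedding model maps text to dense vectors.
corpus_vecs = model.encode(corpus, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

# Component 2: retrieval. With normalized vectors, the dot product equals
# cosine similarity; a vector database replaces this brute-force scan at scale.
scores = corpus_vecs @ query_vec
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.3f}  {corpus[idx]}")
```

The highest-scoring sentence (the refund policy) is what would be passed, alongside the query, to the generative LLM as the third component.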