How to Build a RAG Agent in 2026: A Step-by-Step Guide
RAG gets talked about like it's either magic or a solved problem. It's neither. Retrieval-augmented generation is a specific pattern for giving an AI agent access to a large body of documents without stuffing everything into the context window at once, and it works well when you implement it carefully and fails in predictable ways when you don't.
This guide walks through how to build a RAG agent in 2026, covering the actual decisions that matter: when RAG is the right choice versus just using a large context window, which vector database fits your situation, how to chunk your documents so retrieval actually works, and which frameworks make the implementation manageable.
First question: do you actually need RAG?
Before writing any code, answer this. Context windows are large now. Claude 4 Opus supports up to 1 million tokens in extended configurations. Gemini 2.5 Pro handles 2 million tokens. GPT-5 gives you 400K. If your entire knowledge base is 50,000 words and you're building an internal tool for a small team, you might be able to load everything into context and skip retrieval entirely.
RAG earns its complexity when:
- Your document corpus is larger than a few hundred pages and grows over time
- You need to keep costs down (loading a large corpus into every request gets expensive fast)
- Your documents update frequently and you want the agent to work with current information without re-indexing the whole corpus
- You need the agent to cite sources accurately, because retrieval surfaces the exact chunks the answer came from
If none of those apply, a simpler approach may serve you better. Load your documents into context, run a capable model, and avoid the retrieval infrastructure entirely. For building a company chatbot over a 500-page documentation site, RAG is worth it. For answering questions about a 30-page product spec, it's probably not.
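A quick back-of-the-envelope check can help with this decision. The sketch below assumes roughly 1.3 tokens per English word, which is a rule of thumb rather than a measured value for any particular tokenizer, and the `reserve` parameter is a hypothetical allowance for the prompt and the answer.

```python
# Assumption: ~1.3 tokens per English word. Real tokenizers vary by model and content.
TOKENS_PER_WORD = 1.3


def estimate_tokens(word_count: int, tokens_per_word: float = TOKENS_PER_WORD) -> int:
    """Estimate the token count of a corpus from its word count."""
    return round(word_count * tokens_per_word)


def fits_in_context(word_count: int, context_window: int, reserve: int = 8_000) -> bool:
    """True if the corpus, plus a reserve for prompt and answer, fits in the window."""
    return estimate_tokens(word_count) + reserve <= context_window


if __name__ == "__main__":
    print(estimate_tokens(50_000))
    print(fits_in_context(50_000, context_window=400_000))
```

If the estimate lands anywhere near the limit, count tokens with the model's actual tokenizer before deciding.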
For a deeper look at when context window size changes the calculus, the context window explainer covers current model sizes in detail.
The RAG pipeline: what you're actually building
A RAG agent has two separate phases, and most implementation problems come from not being clear on which phase has the bug.
Indexing (happens once, or on a schedule): You take your documents, split them into chunks, generate an embedding vector for each chunk, and store those vectors in a vector database alongside the original text. This is how the system learns what's in your corpus.
Retrieval and generation (happens at query time): When a user asks a question, you embed the question using the same embedding model, search the vector database for the chunks most similar to the question, pull those chunks into the context window, and pass everything to the language model to generate an answer.
The quality of your RAG system lives mostly in the indexing phase. If your chunks are too large, retrieval pulls in too much noise. If they're too small, each chunk lacks enough context to be meaningful. If you use a weak embedding model, semantically similar content gets bad similarity scores. Most teams that say "RAG doesn't work" have a chunking or embedding problem, not a retrieval or generation problem.
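To make the two phases concrete, here is a toy sketch that keeps everything in memory. The `embed` function is a bag-of-words stand-in, not a real embedding model, and `build_index` and `retrieve` are hypothetical names chosen for illustration. The point is only the shape: embed chunks once, then embed the question the same way and rank by similarity.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[Counter, str]]:
    """Phase 1 (indexing): store each chunk's vector alongside its text."""
    return [(embed(chunk), chunk) for chunk in chunks]


def retrieve(index: list[tuple[Counter, str]], question: str, top_k: int = 2) -> list[str]:
    """Phase 2 (retrieval): embed the question the same way and rank chunks."""
    query_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:top_k]]


index = build_index([
    "Refunds are issued within 14 days of purchase.",
    "Passwords can be reset from the account settings page.",
    "Our office is closed on public holidays.",
])
context = retrieve(index, "How do I reset my password?", top_k=1)
print(context)
```

In a real system the retrieved chunks would then be placed in the prompt for the language model. When an answer is wrong, checking what `retrieve` returned is the fastest way to tell which phase has the bug.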
Choosing a vector database
You have real options here, and the choice matters more at scale than it does in early development. Here's how the main ones compare:
pgvector: A PostgreSQL extension. If you already run Postgres, adding pgvector is the lowest-friction way to store embeddings. You get vector similarity search inside the same database as your application data, with no separate infrastructure to manage. Performance is solid for corpora up to a few million vectors. For most production applications this is the right choice, and I'd start here before considering anything more complex.
Pinecone: A managed, serverless vector database purpose-built for this use case. You don't manage infrastructure. It scales without configuration. The tradeoff is vendor lock-in and pricing that climbs as your index grows. Pinecone makes sense for teams that want to move fast and aren't running their own infrastructure anyway.
Weaviate: Open source, schema-flexible, and designed for both vector search and traditional filtered search. The hybrid search capabilities (combining keyword and semantic search) are genuinely better than pure vector similarity in many retrieval scenarios. Good choice if you need fine-grained filtering (search by document type, date range, department, etc.) alongside semantic similarity.
Chroma: The easiest way to get started locally. In-memory or local persistent storage, minimal configuration, good Python integration. I'd use Chroma for prototyping and development, then migrate to Weaviate or pgvector for production. Don't try to run Chroma at scale. It's not designed for it.
For most teams shipping a production RAG agent in 2026, the short version is: pgvector if you already use Postgres, Weaviate if you need hybrid search, and Pinecone if you want managed infrastructure.
Embeddings: picking the right model
The embedding model converts text into a vector of numbers. Chunks with similar meaning should produce vectors that are close together in the vector space. The quality of your retrieval depends directly on how well the embedding model captures semantic meaning.
OpenAI's text-embedding-3-large is the most commonly used embedding model in production. It produces 3072-dimensional vectors and handles most retrieval tasks well. It costs $0.13 per million tokens as of May 2026, which is inexpensive enough that embedding costs are rarely the issue.
Cohere's embedding models are worth knowing about. Their embed-v3 series handles multilingual content better than OpenAI's offering and includes a useful feature: you can specify whether you're embedding a query or a document, and the model optimizes accordingly. For multilingual corpora this is a meaningful advantage.
For teams running local models, Nomic's nomic-embed-text is a strong open-weights option. It runs on-device, which eliminates embedding API costs entirely and keeps your documents off third-party infrastructure. Performance is close enough to the hosted options for most use cases.
One thing that catches teams off guard: once you've indexed your documents with one embedding model, you're committed to it. Switching embedding models means re-indexing your entire corpus. Make this choice deliberately.
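One way to guard against accidentally mixing models is to record which embedding model built the index and check it at query time. This is a minimal sketch: `IndexMetadata` and `check_compatible` are hypothetical names, and where you persist the metadata (a database row, a config file) is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexMetadata:
    """Hypothetical record saved next to the index when it is built."""
    embedding_model: str
    embedding_dim: int


def check_compatible(index_meta: IndexMetadata, query_model: str, query_dim: int) -> None:
    """Raise if the query-time model differs from the one used for indexing."""
    if (index_meta.embedding_model, index_meta.embedding_dim) != (query_model, query_dim):
        raise ValueError(
            f"Index was built with {index_meta.embedding_model} "
            f"({index_meta.embedding_dim} dims) but queries use {query_model} "
            f"({query_dim} dims). Re-index the corpus before switching models."
        )


meta = IndexMetadata(embedding_model="text-embedding-3-large", embedding_dim=3072)
check_compatible(meta, "text-embedding-3-large", 3072)  # passes silently
```

Vectors from different models are not comparable even when the dimensions happen to match, which is why the check compares the model name and not just the dimension.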
Chunking strategy: where most RAG projects go wrong
Chunking is the step that determines whether retrieval actually works, and it gets less attention than it deserves. There's no universally correct chunking strategy. The right approach depends on your document structure.
Fixed-size chunking is the simplest approach. You split every document into chunks of N tokens with K tokens of overlap between adjacent chunks. The overlap is important: it prevents a retrieval miss when the answer straddles a chunk boundary.
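# Simple fixed-size chunking with overlap.
# Sizes are counted in whitespace-separated words, a rough proxy for tokens.
def chunk_text(text, chunk_size=512, overlap=50):
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break  # last chunk reached; avoid a tail made only of overlap
        start += chunk_size - overlap  # next chunk re-reads the last `overlap` words
    return chunks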
This works acceptably for uniform content like support tickets or short articles. It fails on structured documents like legal contracts, API documentation, or technical specifications, where meaning is tied to document structure rather than raw token position.
Semantic chunking uses embedding similarity to decide where to split. Rather than cutting at fixed intervals, you split where the meaning changes significantly. The result is chunks that each cover a coherent concept, even if they vary in length. LlamaIndex and LangChain both have semantic chunker implementations.
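The idea can be sketched in a few lines. This toy version compares adjacent sentences with Jaccard word overlap as a stand-in for embedding similarity, and the `threshold` value is an arbitrary assumption. It is not how the LlamaIndex or LangChain chunkers are implemented, which compare real embeddings, but the splitting logic has the same shape.

```python
import re


def words(sentence: str) -> set[str]:
    """Lowercased word set for one sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(a: str, b: str) -> float:
    """Toy stand-in for embedding similarity: Jaccard overlap of word sets."""
    wa, wb = words(a), words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.1) -> list[str]:
    """Start a new chunk wherever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    chunks: list[str] = []
    current = [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if similarity(prev, sent) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks


chunks = semantic_chunks([
    "Refunds are processed within 14 days.",
    "Refunds over 100 dollars are reviewed manually.",
    "Passwords must contain twelve characters.",
])
print(chunks)
```

Here the two refund sentences stay together and the password sentence starts a new chunk, because it shares no words with the sentence before it.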
Document-aware chunking respects the actual structure of your documents. For Markdown, you chunk at heading boundaries. For PDFs with structured sections, you use those section breaks. For HTML documentation, you respect the DOM structure. If your documents have consistent structure, this produces the best retrieval results because each chunk corresponds to a meaningful unit.
For a corpus of Markdown documentation files, my recommended default is document-aware chunking at the H2/H3 heading level with a maximum chunk size of 800 tokens. For mixed or unstructured documents, semantic chunking with a 400-600 token target size is a reasonable starting point.
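As a rough illustration of the Markdown default, the sketch below splits at H2/H3 headings and falls back to fixed-size splitting for oversized sections. It counts whitespace-separated words as a proxy for tokens, so `max_words` is an assumed stand-in for a token limit, and it does not handle heading-like lines inside fenced code.

```python
import re

# Matches lines that start an H2 or H3 section ("## " or "### ").
HEADING = re.compile(r"^#{2,3} ")


def chunk_markdown(text: str, max_words: int = 600) -> list[str]:
    """Split Markdown at H2/H3 headings, then cap each section's size."""
    sections: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        if HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    chunks: list[str] = []
    for section in sections:
        if not section:
            continue
        tokens = section.split()
        if len(tokens) <= max_words:
            chunks.append(section)
        else:
            # Oversized section: fall back to fixed-size pieces.
            for i in range(0, len(tokens), max_words):
                chunks.append(" ".join(tokens[i:i + max_words]))
    return chunks


doc = (
    "# Title\nIntro text.\n"
    "## Install\nRun the installer.\n"
    "### Options\nUse --fast.\n"
    "## Usage\nCall run()."
)
chunks = chunk_markdown(doc)
print(len(chunks))
```

Keeping the heading line inside each chunk is deliberate: it gives the embedding model and the language model a label for what the section is about.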
Retrieval: beyond naive similarity search
The basic retrieval step is: embed the query, find the top-K most similar chunks, return them. This is a fine starting point but has known failure modes.
Hybrid search combines dense vector search (semantic similarity) with sparse keyword search (BM25 or similar). This matters because vector search sometimes misses exact matches. If a user asks about "GDPR Article 17" and that phrase appears verbatim in your documents, keyword search will find it reliably while pure vector search might not. Weaviate and several other vector databases support hybrid search natively. Use it.
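If your database does not fuse results for you, one common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below is a generic version of that technique, not a description of what any particular database does internally. The chunk IDs are made up, and `k=60` is the constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one fused ranking.

    Each list contributes 1 / (k + rank) for every ID it contains, so IDs
    that rank well in more than one list rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


vector_hits = ["c1", "c2", "c3"]   # hypothetical dense-search results
keyword_hits = ["c3", "c1", "c4"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

RRF only needs ranks, not raw scores, which sidesteps the problem that cosine similarities and BM25 scores live on different scales.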
Reranking adds a second pass after the initial retrieval. You retrieve the top-20 chunks by similarity, then pass all 20 to a reranking model that scores them for relevance to the specific query. The top-5 reranked results are more accurate than the top-5 raw similarity results. Cohere's Rerank API makes this easy to add to an existing pipeline.
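The two-stage shape looks roughly like this. The `score` callable is a placeholder for whatever reranking model you use, and `overlap_score` is a toy scorer included only so the sketch runs on its own. Neither reflects the interface of a specific reranking API.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Second pass: rescore first-stage candidates and keep the best top_n."""
    return sorted(candidates, key=lambda chunk: score(query, chunk), reverse=True)[:top_n]


def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score: number of query words that appear in the chunk."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


candidates = [  # pretend these came back from the first-stage similarity search
    "How to cancel a meeting invite",
    "Steps to cancel your subscription",
    "Billing overview",
]
best = rerank("cancel subscription", candidates, overlap_score, top_n=2)
print(best)
```

The first stage stays cheap and broad, while the more expensive scorer only ever sees a small candidate set.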
Query expansion generates multiple versions of the user's question before retrieval, then combines results. A user asking "how do I cancel" might mean "cancel a subscription," "cancel a meeting," or "cancel an order": the same words, different intent. Expanding the query before retrieval catches more of the relevant chunks.
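A minimal sketch of the combining step, assuming the query variants have already been generated (typically by an LLM). `expanded_retrieve` is a hypothetical helper and `fake_retrieve` stands in for your real retriever. Results are interleaved by rank so that no single variant dominates the merged list.

```python
from typing import Callable


def expanded_retrieve(
    variants: list[str],
    retrieve: Callable[[str], list[str]],
    limit: int = 5,
) -> list[str]:
    """Retrieve for every query variant and merge results without duplicates."""
    results = [retrieve(query) for query in variants]
    merged: list[str] = []
    seen: set[str] = set()
    longest = max((len(r) for r in results), default=0)
    for rank in range(longest):
        # Take the rank-th hit from each variant in turn.
        for hits in results:
            if rank < len(hits) and hits[rank] not in seen:
                seen.add(hits[rank])
                merged.append(hits[rank])
    return merged[:limit]


# Hypothetical retriever backed by a lookup table, for illustration only.
FAKE_RESULTS = {
    "cancel a subscription": ["s1", "s2"],
    "cancel an order": ["o1", "s1"],
    "cancel a meeting": ["m1"],
}


def fake_retrieve(query: str) -> list[str]:
    return FAKE_RESULTS.get(query, [])


merged = expanded_retrieve(list(FAKE_RESULTS), fake_retrieve)
print(merged)
```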
Framework options: LlamaIndex, LangChain, Haystack
All three frameworks cover the full RAG pipeline. The choice comes down to what fits your existing codebase and how much flexibility you need.
LlamaIndex is the most RAG-focused of the three. It's built specifically for document indexing and retrieval, and it shows. The document loaders, node parsers (their term for chunking), and retrieval abstractions are more mature and more RAG-specific than anything in LangChain. If RAG is your primary use case, start here.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
# Load and index documents
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)
# Query
query_engine = index.as_query_engine()
response = query_engine.query("What is our refund policy?")
print(response)
LangChain has broader scope. It covers RAG but also chains, tool use, and agent orchestration. If your agent needs retrieval as one capability among many (it also needs to call APIs, run code, manage state), LangChain may give you a more consistent abstraction across the whole system. The RAG components are solid, if not as specialized as LlamaIndex's.
Haystack from deepset has the most production-focused pipeline design. Its composable pipeline model makes it straightforward to add preprocessing steps, swap retrieval strategies, and modify the pipeline without rewriting core logic. For teams building document processing systems at scale, Haystack's design is cleaner than LlamaIndex or LangChain once you go beyond the basic use case.
For a first RAG project, LlamaIndex. For an existing LangChain codebase, stay there. For a production document processing pipeline with complex retrieval requirements, evaluate Haystack.
Evaluation: knowing when your RAG actually works
Teams skip this step and then wonder why their chatbot gives wrong answers. Before shipping a RAG agent, you need to measure retrieval quality and answer quality separately.
Retrieval quality: For a set of test questions, did the retrieval step actually surface the chunk that contains the answer? A simple way to check: generate 50 question-answer pairs from your documents manually, run retrieval, and check whether the answer-containing chunk appears in the top-3 results. Anything below 80% means your chunking or embedding setup needs work.
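The retrieval check described above can be scripted in a few lines. In this sketch, `test_cases` pairs each question with the ID of the chunk that contains its answer, and `retrieve` is whatever function returns ranked chunk IDs in your system. Both are assumptions about how you have organized your data.

```python
from typing import Callable


def hit_rate_at_k(
    test_cases: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
    k: int = 3,
) -> float:
    """Fraction of questions whose answer-containing chunk is in the top k."""
    if not test_cases:
        raise ValueError("test_cases must not be empty")
    hits = sum(
        1 for question, expected_id in test_cases
        if expected_id in retrieve(question)[:k]
    )
    return hits / len(test_cases)


# Hypothetical retriever output, for illustration only.
FAKE_RANKINGS = {
    "What is the refund window?": ["c7", "c2", "c9", "c1"],
    "How do I reset my password?": ["c4", "c8", "c3", "c5"],
}
cases = [
    ("What is the refund window?", "c2"),   # found at rank 2: hit
    ("How do I reset my password?", "c5"),  # found at rank 4: miss for k=3
]
rate = hit_rate_at_k(cases, lambda q: FAKE_RANKINGS[q], k=3)
print(rate)
```

Rerunning the same test set after every chunking or embedding change gives you a number to compare, rather than an impression.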
Answer quality: Did the model produce a correct answer given the retrieved chunks? This can be automated by using another LLM as a judge, or scored manually for a smaller test set.
Treat these separately. If retrieval is finding the right content but the answers are still wrong, the problem is in the generation step: your prompt may not be giving the model enough instruction on how to use the retrieved context. If retrieval is missing relevant chunks, fix the chunking before touching anything else.
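On the generation side, the prompt is usually the first thing to inspect. Below is one possible shape for a prompt that tells the model how to use retrieved context. The wording is an illustrative assumption, not a tested template, and `build_prompt` is a hypothetical helper.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt from the question and the retrieved chunks."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered context below.\n"
        "Cite the passages you used by number, for example [1].\n"
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "What is the refund window?",
    ["Refunds are issued within 14 days.", "Passwords are reset from settings."],
)
print(prompt)
```

Numbering the chunks makes citations checkable, and the explicit "say you don't know" instruction gives the model an alternative to guessing when retrieval comes back empty-handed.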
A minimal working implementation
Putting it together: here's the shape of a minimal RAG agent. This isn't production code, but it captures the structure.
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding

# 1. Configure embedding model
embed_model = OpenAIEmbedding(model="text-embedding-3-large")

# 2. Connect to vector store (add user, password, and port for your setup)
vector_store = PGVectorStore.from_params(
    database="mydb",
    host="localhost",
    table_name="document_chunks",
    embed_dim=3072,  # must match the embedding model's output dimension
)
# The index receives the vector store through a storage context
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# 3. Load and index documents
documents = SimpleDirectoryReader("./knowledge_base").load_data()
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
)

# 4. Query
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("How do I reset my password?")
The real complexity is in the chunking configuration, the retrieval strategy, and the prompt that instructs the model how to use retrieved context. Those pieces are what you iterate on after you have this basic structure running.
Where to go from here
The LlamaIndex documentation and Haystack documentation are both genuinely good. For multi-agent systems where RAG is one component among several, the agent frameworks comparison guide covers how the retrieval-focused frameworks fit into broader orchestration.
One last thing: RAG is not a product feature you ship once. The documents change, the user questions evolve, and retrieval quality drifts. Build evaluation into your process from the start, not as an afterthought when users start complaining.