Agentic RAG Patterns in 2026: Beyond Basic Vector Search

March 31, 2026 · Editorial Team · 7 min read · ai-engineering rag agentic-rag

Basic RAG (Retrieval-Augmented Generation) is straightforward: embed the user's query, search a vector store, retrieve the top-k chunks, stuff them into the context, and generate a response. This works well for simple factual questions against a clean knowledge base. It fails noticeably on questions that require reasoning across multiple documents, involve ambiguous terminology, or need information that's spread across the knowledge base in non-obvious ways.

Agentic RAG is what happens when you give the retrieval process a brain. Instead of a single fixed retrieval step, the agent decides what to retrieve, evaluates what it found, retrieves again if needed, and reasons across multiple retrieved documents before producing an answer.

Here's what that actually looks like in practice, and when it's worth the additional complexity.

Where basic RAG fails

Before getting into agentic patterns, it's worth being precise about the failure modes basic RAG has.

Single-hop limitation. Basic RAG does one retrieval and generates from the result, but many real questions require chaining. Take "What was the company's revenue growth in Q3 2024, and how did that compare to the guidance they'd given in Q2?" Answering it means finding the Q3 report, finding the Q2 guidance statement, comparing them, and synthesizing. A single top-k retrieval over a mixed document corpus will likely surface one or the other, not both.

Query-document vocabulary mismatch. Embedding models map semantic meaning, not keywords. But sometimes the user's phrasing and the document's phrasing don't land close together in embedding space even when they refer to the same thing. "How do I cancel my subscription?" might not retrieve a document titled "Terminating your account," depending on how the embedding model handles those terms. Basic RAG with no query transformation can miss this.

Lost information in dense chunks. Chunking strategies for vector search typically divide documents into pieces of 200-800 tokens. But answering a question sometimes requires understanding how information at the start of a document connects to information at the end, and fixed-size chunks lose this long-range document structure.
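To make the long-range problem concrete, here is a minimal sketch of fixed-size chunking. It splits on whitespace as a stand-in for a real tokenizer, and the document text and chunk size are made up for illustration.

```python
def fixed_size_chunks(text: str, chunk_tokens: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_tokens tokens.

    Whitespace splitting is an assumption here; a real pipeline would use
    the embedding model's tokenizer.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i : i + chunk_tokens])
        for i in range(0, len(tokens), chunk_tokens)
    ]


# Toy document: a definition at the start, a dependent clause at the end.
document = (
    "The Fee means 40 USD per seat "
    + "filler " * 20
    + "customer owes the Fee monthly"
)
chunks = fixed_size_chunks(document, chunk_tokens=10)

# Locate the chunk holding the definition and the chunk that relies on it.
definition_chunk = next(i for i, c in enumerate(chunks) if "40 USD" in c)
usage_chunk = next(i for i, c in enumerate(chunks) if "owes" in c)
print(len(chunks), definition_chunk, usage_chunk)
```

The definition and the clause that depends on it end up in different chunks, so a retrieval that surfaces only the later chunk has no way to say what "the Fee" is.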

No quality verification. Basic RAG retrieves and uses. There's no step that checks whether the retrieved documents actually contain relevant information, or whether the retrieved content is sufficient to answer the question. The model generates based on whatever was retrieved, even if the retrieved chunks are only tangentially related.

Query rewriting

The first agentic improvement to add is query rewriting: before executing the retrieval, have the model reformulate the query to improve retrieval quality.

Query rewriting serves two purposes. First, it can generate multiple phrasings of the same question, catching vocabulary mismatches. "Cancel my subscription" might be rewritten as "terminate account," "stop billing," and "end membership." Run all three as separate queries and merge the results.

Second, it can decompose complex questions into sub-queries. "Compare the pricing of Plan A and Plan B and tell me which is better for a team of 10" breaks into "pricing for Plan A" and "pricing for Plan B," with the comparison happening after both are retrieved.

Implementation is cheap: a single call to a small model (Claude Haiku or GPT-4o mini, for example) that takes the original query and returns 2-5 reformulations, plus a decomposition if the query is multi-part. Add this step before vector search, run parallel searches for each reformulation, and deduplicate the results. This alone fixes a large fraction of basic RAG failure cases.
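The fan-out and deduplication half of that step can be sketched as follows. This is a minimal illustration, not a reference implementation: `fan_out_search`, `fake_search`, and the canned index are hypothetical names, the rewriting model call is assumed to have already produced the variants, and merging by "best score per chunk" is one simple choice among several.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# A search hit: (chunk_id, similarity score). Higher score means more similar.
Hit = tuple[str, float]


def fan_out_search(
    queries: list[str],
    search: Callable[[str], list[Hit]],
    top_k: int = 10,
) -> list[Hit]:
    """Run one search per query variant in parallel, then dedupe by chunk id."""
    with ThreadPoolExecutor(max_workers=max(1, len(queries))) as pool:
        result_lists = list(pool.map(search, queries))

    best: dict[str, float] = {}
    for hits in result_lists:
        for chunk_id, score in hits:
            # Keep the best score seen for each chunk across all variants.
            if chunk_id not in best or score > best[chunk_id]:
                best[chunk_id] = score

    merged = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return merged[:top_k]


# Hypothetical stand-in for a vector store: query string -> canned hits.
FAKE_INDEX: dict[str, list[Hit]] = {
    "cancel my subscription": [("faq-12", 0.61)],
    "terminate account": [("kb-terminate", 0.88), ("faq-12", 0.55)],
    "stop billing": [("kb-billing", 0.74), ("kb-terminate", 0.70)],
}


def fake_search(query: str) -> list[Hit]:
    return FAKE_INDEX.get(query, [])


variants = ["cancel my subscription", "terminate account", "stop billing"]
merged_hits = fan_out_search(variants, fake_search)
print(merged_hits)
```

In this toy index the original phrasing never reaches the "kb-terminate" chunk on its own; the reformulations do, which is the vocabulary-mismatch case the rewriting step is meant to catch.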

Multi-hop retrieval

Multi-hop retrieval is the pattern for questions that require following a chain of references across documents.

The agent retrieves an initial document, identifies references or implicit connections to other documents, and retrieves those in turn. This continues until the agent has enough information to answer the question or determines it can't.

A concrete example: a compliance question like "Is our data handling consistent with GDPR Article 17 requirements?" This might require:

Retrieving the relevant section of GDPR Article 17.
Retrieving the company's data handling policy.
Retrieving any exceptions or exemptions the company has documented.
Potentially retrieving court decisions or regulatory guidance on Article 17 interpretation.

Each retrieval step informs the next. The agent doesn't know it needs the regulatory guidance until it's read the policy and identified a potentially ambiguous provision.

Naive multi-hop retrieval can get expensive quickly, because each hop is a retrieval operation plus a model call. Two practical constraints: a maximum hop count (typically 3-5) and a check at each step to confirm the retrieval is making progress toward answering the question rather than wandering into tangents. Without these, some queries trigger retrieval loops that exhaust context and budget without producing an answer.
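Those two constraints fit in a short control loop. The sketch below is written under assumptions: `retrieve` and `next_query` are hypothetical callables standing in for the vector store and the model call, and "progress" is approximated as "this hop returned at least one document we had not seen," which is cruder than a model-based relevance check.

```python
from typing import Callable, Optional


def multi_hop_retrieve(
    question: str,
    retrieve: Callable[[str], list[str]],
    next_query: Callable[[str, list[str]], Optional[str]],
    max_hops: int = 4,
) -> tuple[list[str], str]:
    """Follow a chain of retrievals until done, stalled, or out of hops.

    retrieve(query) returns document ids. next_query(question, docs) returns
    a follow-up query, or None when the agent judges it has enough.
    Returns (collected document ids, stop reason).
    """
    collected: list[str] = []
    query = question
    for _ in range(max_hops):
        new_docs = [d for d in retrieve(query) if d not in collected]
        if not new_docs:
            # Progress check: a hop that adds nothing new suggests a loop
            # or a tangent, so stop instead of spending more budget.
            return collected, "stalled"
        collected.extend(new_docs)
        follow_up = next_query(question, collected)
        if follow_up is None:
            return collected, "answered"
        query = follow_up
    return collected, "hop_limit"


# Hypothetical corpus and follow-up plan for the GDPR example.
CORPUS = {
    "gdpr article 17 compliance": ["gdpr-art-17"],
    "company data handling policy": ["policy-data-handling"],
    "documented erasure exemptions": ["policy-exemptions"],
}
FOLLOW_UPS = {
    "gdpr-art-17": "company data handling policy",
    "policy-data-handling": "documented erasure exemptions",
}


def fake_retrieve(query: str) -> list[str]:
    return CORPUS.get(query, [])


def fake_next_query(question: str, docs: list[str]) -> Optional[str]:
    # Stand-in for a model call that reads the latest document.
    return FOLLOW_UPS.get(docs[-1])


docs, reason = multi_hop_retrieve(
    "gdpr article 17 compliance", fake_retrieve, fake_next_query
)
print(docs, reason)
```

Returning the stop reason alongside the documents lets the caller tell a complete chain from a truncated one, which matters when deciding whether to answer or acknowledge a gap.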

Re-ranking retrieved results

Vector similarity search returns the top-k documents by embedding distance. Embedding distance is a proxy for semantic relevance, but an imperfect one. The third most similar chunk by embedding might be more relevant to the specific question than the most similar one.

Re-ranking adds a second relevance judgment after vector search. The re-ranker receives the query and the top-k retrieved chunks and produces a new relevance score for each one, typically using a cross-encoder model that attends to both the query and the document together (rather than comparing embeddings separately).

Common re-ranking options:

Cohere Rerank (API, $1 per 1,000 re-rankings): Production-ready, low-latency, integrates well with most vector stores.
bge-reranker-v2-m3 (open-source, self-hosted): A strong multilingual cross-encoder that is competitive with commercial re-rankers on quality. Can run on CPU at low volumes; higher throughput generally calls for a GPU.
LLM-based re-ranking: Ask a small model to score each retrieved chunk's relevance to the query on a 1-5 scale. Slower and more expensive than dedicated re-rankers but doesn't require additional infrastructure.

Re-ranking usually improves retrieval precision for complex queries. In practice, you keep 3-5 documents after re-ranking instead of 10-20, and those 3-5 are more consistently relevant. This reduces context window usage and improves generation quality.
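The shape of the re-ranking step is the same whichever scorer you pick: score each (query, chunk) pair jointly, sort, and keep the top few. Here is a minimal sketch with the scorer passed in as a function. `rerank` and `toy_overlap_score` are hypothetical names, and the word-overlap scorer is only a placeholder so the example runs without a model; it is not a cross-encoder.

```python
from typing import Callable


def rerank(
    query: str,
    chunks: list[str],
    score_pair: Callable[[str, str], float],
    keep: int = 3,
) -> list[str]:
    """Re-score each chunk against the query and keep the top `keep`."""
    scored = [(score_pair(query, chunk), chunk) for chunk in chunks]
    # Stable sort: chunks with equal scores keep their vector-search order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:keep]]


def toy_overlap_score(query: str, chunk: str) -> float:
    """Placeholder scorer: fraction of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


# Chunks in the order a vector search might have returned them.
candidates = [
    "our plans include monthly and annual billing options",
    "contact support for help with your account",
    "refund policy for annual plans is 30 days",
]
top_chunks = rerank(
    "refund policy for annual plans", candidates, toy_overlap_score, keep=2
)
print(top_chunks)
```

In a real pipeline, `score_pair` would wrap a cross-encoder model or a hosted rerank API; keeping it as a parameter makes it easy to swap scorers and compare them on the same candidate set.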

Self-RAG: agents that check their own retrieval

Self-RAG is a pattern where the agent evaluates the quality of what it retrieved and decides whether to retrieve again, reformulate, or generate from the current results.

The evaluation step is a small model call (or a few tokens from the main model) that answers: "Given this query and these retrieved documents, am I ready to generate a reliable answer?" If the answer is no, the agent can:

Retrieve with a different query formulation.
Try a different data source.
Retrieve at a different granularity (whole document instead of chunks, or more specific chunks).
Escalate to a human if it can't find reliable information.

Self-RAG can substantially reduce hallucination on factual question-answering tasks, because the system stops generating from context it has already judged inadequate. A system that confidently answers from poor retrieved context is more dangerous than one that says "I couldn't find reliable information about that." The self-evaluation step makes the boundary explicit.

Implementation: a simple router prompt before generation. "Here is the question: [Q]. Here are the retrieved documents: [D]. On a scale of 1-5, how confident are you that these documents contain enough information to answer the question accurately? Respond with just the number." If below 3, re-retrieve or acknowledge the gap.
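A minimal sketch of that router follows. The prompt text is taken from the paragraph above; `parse_confidence` and `route` are hypothetical helper names, and the model call itself is left out. One assumption worth flagging: an unparseable reply is treated as low confidence, so the router fails closed.

```python
import re
from typing import Optional

ROUTER_PROMPT = (
    "Here is the question: {question}. "
    "Here are the retrieved documents: {documents}. "
    "On a scale of 1-5, how confident are you that these documents contain "
    "enough information to answer the question accurately? "
    "Respond with just the number."
)


def parse_confidence(reply: str) -> Optional[int]:
    """Pull a standalone 1-5 digit out of the model's reply, if present."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def route(reply: str, threshold: int = 3) -> str:
    """Return 'generate' at or above the threshold, else 're_retrieve'."""
    score = parse_confidence(reply)
    if score is None or score < threshold:
        # Covers both low confidence and replies we could not parse.
        return "re_retrieve"
    return "generate"


prompt = ROUTER_PROMPT.format(
    question="How do I cancel my subscription?",
    documents="[doc-1] Terminating your account ...",
)
print(route("4"), route("2"), route("no idea"))
```

Models do not always respect "just the number," so parsing defensively is cheaper than debugging a router that silently treats "I'd say 2" as a pass.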

Citations: making retrieval outputs verifiable

Citations are what separate RAG that users can trust from RAG that just sounds authoritative. A response that includes "Based on Section 3.2 of the Employee Handbook (retrieved document ID #47)..." is auditable. A response with no source attribution isn't.

Implementing citations in agentic RAG:

Tag each retrieved chunk with a document identifier, section name, and page number during retrieval.
Pass the tags to the model along with the content.
Instruct the model to include inline citations using the provided identifiers.
Post-process the response to extract citations and render them as links or footnotes.

The challenge: models sometimes cite documents that don't contain the claimed information, especially when the relevant context is spread across multiple chunks. A post-generation citation verification step, where you check each cited claim against the cited document, catches this. It adds latency but is worth it for high-stakes applications (legal, medical, compliance).
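The tagging and extraction steps, plus the cheapest possible verification, can be sketched like this. The `Chunk` fields, the `[doc-id]` citation syntax, and the function names are all assumptions made for illustration. Note the limit of the check: it only confirms that every cited identifier was actually retrieved. Confirming that the cited text supports the claim needs a further comparison, typically another model call.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    section: str
    page: int
    text: str


def format_context(chunks: list[Chunk]) -> str:
    """Render chunks with their tags so the model can cite by identifier."""
    return "\n\n".join(
        f"[{c.doc_id}] ({c.section}, p. {c.page})\n{c.text}" for c in chunks
    )


# Assumed inline citation syntax: an identifier in square brackets.
CITATION_PATTERN = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def extract_citations(answer: str) -> list[str]:
    """Return cited identifiers in order of first appearance, without repeats."""
    seen: list[str] = []
    for doc_id in CITATION_PATTERN.findall(answer):
        if doc_id not in seen:
            seen.append(doc_id)
    return seen


def unknown_citations(answer: str, chunks: list[Chunk]) -> list[str]:
    """Identifiers the answer cites that were never retrieved."""
    known = {c.doc_id for c in chunks}
    return [d for d in extract_citations(answer) if d not in known]


retrieved = [
    Chunk("doc-47", "Section 3.2", 12, "Employees accrue 20 vacation days."),
]
answer = "Employees accrue 20 days [doc-47]. Carryover is unlimited [doc-99]."
context = format_context(retrieved)
print(extract_citations(answer), unknown_citations(answer, retrieved))
```

A citation to an identifier that was never in the context is a clear signal to regenerate or strip the claim before the response reaches the user.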

LlamaIndex and LangChain both provide citation support for their RAG pipelines (LlamaIndex's CitationQueryEngine, for example), which is a practical reason to use them rather than rolling your own if you're building citation-enabled RAG.

When agentic RAG is worth the complexity

Agentic RAG adds cost and latency compared to basic RAG. The added infrastructure is justified when:

Questions frequently require multiple documents to answer (multi-hop).
The knowledge base uses inconsistent terminology (query rewriting helps).
Users expect auditable, source-cited responses.
False positives (confidently wrong answers) have significant consequences.
The knowledge base is large enough that a single top-k retrieval misses relevant content.

Basic RAG is the right choice when:

Questions are factual and answerable from a single document.
The knowledge base is small and clean.
Latency is tight (each agentic step typically adds 200-800 ms).
The cost of additional model calls matters at your volume.
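For the latency point, a back-of-the-envelope calculation is usually enough to decide. The sketch below assumes the steps run sequentially and that each one falls in the 200-800 ms range quoted above; parallel steps or cached results would change the numbers.

```python
# Per-step overhead range in milliseconds, taken from the estimate above.
STEP_OVERHEAD_MS = (200, 800)


def added_latency_ms(
    num_steps: int, per_step: tuple[int, int] = STEP_OVERHEAD_MS
) -> tuple[int, int]:
    """Best-case and worst-case added latency for sequential agentic steps."""
    low, high = per_step
    return num_steps * low, num_steps * high


# Example: query rewriting, re-ranking, and a self-evaluation check.
low_ms, high_ms = added_latency_ms(3)
print(f"added latency: {low_ms}-{high_ms} ms")
```

Under those assumptions, three added steps cost somewhere between 0.6 and 2.4 seconds on top of basic RAG, which is easy to absorb in an async workflow and hard to absorb in an interactive one.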

The typical upgrade path: start with basic RAG to validate the use case, then add query rewriting (cheap and high-impact), then re-ranking, then multi-hop as needed. Don't implement full agentic RAG from day one unless you already know you need it. The complexity is real and the debugging is harder than basic RAG.

Many production RAG systems in 2026 use at least query rewriting and re-ranking. Full multi-hop with self-evaluation is less common but increasingly available through frameworks like LlamaIndex's agentic pipeline and LangGraph's retrieval graph patterns.