Retrieval-Augmented Generation (RAG) is an architectural pattern that combines information retrieval with large language models (LLMs) to produce answers grounded in external knowledge. By injecting relevant documents into the model’s context, it lowers the risk of hallucinations, enables up-to-date responses without retraining, and allows domain-specific adaptation.

What RAG Is

RAG is a two-stage pipeline:

1) A retriever selects relevant information from a knowledge source (e.g., files, webpages, databases).
2) A generator (LLM) conditions on the retrieved passages to produce a final answer.

By augmenting the LLM with retrieval, RAG turns the model from a closed-book system into an open-book system.

Why RAG Is Used

Grounding: Reduces hallucinations by citing or conditioning on real documents.
Freshness: Answers can reflect the latest data without retraining the base model.
Domain Adaptation: Leverages proprietary or niche corpora that the base model didn’t see in pretraining.
Cost and Speed: Updating an index is often cheaper and faster than fine-tuning the model when only the underlying knowledge changes.
Explainability: Retrieved passages can be shown to users for transparency and trust.

How RAG Works (Retriever + Generator)

Query understanding

The user query is optionally reformulated (e.g., with query expansion or rewriting).

Retrieval

The retriever searches a knowledge index (semantic/vector and/or lexical) and returns top-k passages.

Generation

The LLM reads the query plus retrieved passages and produces an answer, often with citations.

Optional feedback

Re-ranking refines the current result set, while iterative retrieval or self-reflection can trigger another retrieval round when the first pass is insufficient.

Simple example:

Question: “What are the health benefits of green tea?”
Retriever: Finds passages from trusted sources describing catechins, antioxidants, and potential cardiovascular benefits.
Generator: Summarizes: “Green tea contains catechins that may support heart health and reduce oxidative stress; evidence suggests modest benefits when consumed regularly.” It may include citations to the retrieved passages.
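
The retrieve-then-generate flow above can be sketched in a few lines. This is a minimal illustration, not a production design: the retriever is a toy term-overlap scorer, and `generate` is a hypothetical stand-in for a real LLM call. The function names and the three sample passages are assumptions made for this example.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Toy lexical retriever: rank passages by the number of shared query terms."""
    query_terms = set(tokenize(query))
    scored = [(len(query_terms & set(tokenize(p))), p) for p in passages]
    # Highest overlap first; Python's sort is stable, so ties keep corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]

def build_prompt(query: str, context: list[str]) -> str:
    """Place numbered passages ahead of the question so the answer can cite them."""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(context))
    return (
        "Answer using only the context below and cite passage numbers.\n\n"
        f"Context:\n{numbered}\n\nQuestion: {query}\nAnswer:"
    )

def generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; swap in a real client here."""
    return f"(LLM output for a prompt of {len(prompt)} characters)"

passages = [
    "Green tea contains catechins, a class of antioxidants.",
    "Regular green tea consumption may support cardiovascular health.",
    "Coffee is brewed from roasted beans.",
]
question = "What are the health benefits of green tea?"
top = retrieve(question, passages, k=2)
prompt = build_prompt(question, top)
answer = generate(prompt)
```

Only the two green tea passages reach the prompt; the coffee passage shares no terms with the question and is dropped before generation.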

What Is Indexing

Indexing is the process of preparing a knowledge base for fast retrieval. It typically includes:

Parsing content (PDFs, HTML, docs)
Chunking into manageable passages
Vectorization (embedding each chunk into a numerical representation)
Building a search structure (e.g., an HNSW graph or an inverted-file index, as provided by libraries such as FAISS) for efficient nearest-neighbor queries
Storing metadata (source, titles, timestamps, permissions) for filtering and display
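
As a rough sketch of the steps above, the snippet below chunks plain-text documents, turns each chunk into a vector, and stores it with metadata. Several simplifications are assumed: parsing is skipped (inputs are already plain text), `embed` produces sparse term-count vectors rather than embeddings from a real model, and a brute-force scan stands in for a FAISS or HNSW index. All names here are illustrative.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass

def embed(text: str) -> Counter:
    """Toy vectorizer: sparse term counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def chunk(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

@dataclass
class Entry:
    vector: Counter
    text: str
    metadata: dict

def build_index(docs: dict[str, str], size: int = 8) -> list[Entry]:
    """Chunk every document, vectorize each chunk, and keep source metadata."""
    index = []
    for source, text in docs.items():
        for position, piece in enumerate(chunk(text, size)):
            index.append(Entry(embed(piece), piece, {"source": source, "chunk": position}))
    return index

def search(index: list[Entry], query: str, k: int = 1) -> list[Entry]:
    """Brute-force nearest-neighbor search; a real system would use an ANN index."""
    q = embed(query)
    return sorted(index, key=lambda e: cosine(q, e.vector), reverse=True)[:k]

docs = {
    "tea.txt": "Green tea contains catechins and other antioxidants. "
               "Regular consumption may support heart health.",
    "coffee.txt": "Coffee is brewed from roasted beans.",
}
index = build_index(docs, size=8)
hits = search(index, "catechins antioxidants", k=1)
```

The metadata stored with each entry is what later allows filtering by source or displaying where an answer came from.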

Why We Perform Vectorization

Vectorization converts text into dense embeddings that capture semantic meaning. Benefits:

Semantic search: Finds conceptually similar content even when wording differs from the query.
Robustness: Handles synonyms and paraphrases better than pure keyword search, and can handle multilingual content when the embedding model is multilingual.
Ranking: Enables similarity scoring and hybrid approaches (combine vector and keyword signals).
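
To make the similarity scoring concrete, here is a small sketch of cosine similarity, a common way to compare embeddings. The three-dimensional vectors are invented for illustration; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up "embeddings": the first two point in nearly the same direction.
embeddings = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

def rank(query_vec: list[float], vectors: dict[str, list[float]]) -> list[str]:
    """Order items from most to least similar to the query vector."""
    return sorted(vectors, key=lambda name: cosine_similarity(query_vec, vectors[name]), reverse=True)

ranked = rank(embeddings["car"], embeddings)
```

With these vectors, "automobile" scores close to "car" even though the two words share no characters, which is the behavior keyword matching cannot provide.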

Why RAG Exists

LLMs alone have limits:

Static knowledge: Models “freeze” at training cutoff dates.
Hallucinations: Fluent but incorrect answers without grounding.
Proprietary data: Sensitive or private corpora aren’t in public training sets.

RAG addresses these limits by integrating dynamic, organization-specific, and verified sources at inference time.

Why We Perform Chunking

Chunking splits long documents into smaller passages that fit the model’s context window and improve retrieval precision. Reasons:

Context window limits: LLMs can only read a limited number of tokens.
Targeted retrieval: Smaller, focused chunks increase the chance the right passage is retrieved.
Reduced noise: Avoids stuffing entire documents where only a section is relevant.
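
One simple way to chunk, sketched below, is to pack whole sentences into chunks up to a word budget. The word count is used as a rough proxy for tokens, and the sentence splitter is a naive regex; both are assumptions for this example rather than recommendations.

```python
import re

def chunk_by_sentences(text: str, max_words: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `max_words` words.

    A single sentence longer than the budget becomes its own oversized chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

text = (
    "Green tea contains catechins. Catechins are antioxidants. "
    "Coffee contains caffeine. Caffeine is a stimulant."
)
chunks = chunk_by_sentences(text, max_words=7)
```

Here the two tea sentences land in one chunk and the two coffee sentences in another, so a question about caffeine retrieves only the relevant half.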

Why Overlapping Is Used in Chunking

Overlapping adds a small shared region between consecutive chunks (e.g., roughly 15–25% of the chunk length, tuned per corpus):

Preserves continuity: Important sentences that straddle boundaries still appear intact in at least one chunk.
Improves recall: Increases the likelihood the relevant context is captured and retrieved.
Reduces boundary errors: Minimizes missing definitions, references, or context carried across paragraphs.
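
A sliding window is one straightforward way to implement overlap. The sketch below works on a list of words; the `size` and `overlap` values are chosen only to keep the example readable.

```python
def chunk_with_overlap(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Slide a window of `size` words, sharing `overlap` words between neighbors."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        # Stop once the window has reached the end of the text.
        if start + size >= len(words):
            break
    return chunks

words = [f"w{i}" for i in range(10)]
chunks = chunk_with_overlap(words, size=4, overlap=1)
```

With `size=4` and `overlap=1` (25%), each chunk starts with the last word of the previous one, so a phrase such as "w3 w4" that would straddle a hard boundary still appears intact in the second chunk.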

Where RAG Fails (and Quick Mitigations)

Even well-designed RAG systems can fail. Below are common failure modes with their symptoms, root causes, and lightweight mitigations.

1) Poor Recall (Relevant Documents Not Retrieved)

Symptoms:

The answer ignores key facts known to exist in the corpus.
Retrieved passages are off-topic or too generic.

Root causes:

Embedding model mismatch with domain language.
Suboptimal chunk sizes or overlap.
Sparse-only or dense-only retrieval missing signals.
Low top-k or aggressive filters reducing candidate coverage.

Quick mitigations:

Use hybrid retrieval (dense + lexical/keyword) to balance semantics and exact matches.
Increase top-k and apply a cross-encoder re-ranker for precision after broad recall.
Tune chunk size (e.g., 200–500 tokens) and overlap (e.g., 15–25%) for the corpus.
Adopt domain-tuned embeddings or multilingual embeddings for mixed-language corpora.
Add query expansion or rewriting (synonyms, acronyms, entity linking).
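
Hybrid retrieval needs a way to merge the dense and lexical result lists. One widely used option is reciprocal rank fusion (RRF), sketched here; it is offered as one possible fusion method, not as the approach this article prescribes. The document IDs are placeholders, and `k=60` is a conventional default for the smoothing constant.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_results = ["a", "b", "c"]    # e.g., from vector search
lexical_results = ["c", "a", "d"]  # e.g., from keyword search
fused = reciprocal_rank_fusion([dense_results, lexical_results])
```

Documents that appear in both lists ("a" and "c") rise above those found by only one retriever, which is the balance between semantic and exact matching described above.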

2) Bad Chunking (Context Lost or Fragmented)

Symptoms:

The model answers incompletely or misses definitions/examples.
Passages lack self-contained meaning; key tables or figures split awkwardly.

Root causes:

Fixed-size chunking that ignores structure (headings, sections, bullets, tables).
Missing or insufficient overlap, causing losses at chunk boundaries.
Overly large chunks whose filler content dilutes retrieval precision.

Quick mitigations:

Use structure-aware chunking: split by headings, sections, semantic breaks; keep tables whole.
Calibrate chunk size and overlap empirically per corpus.
Create “summary headers” or “topic descriptors” stored as metadata to aid retrieval.
For code and API docs, keep function/class blocks intact; for PDFs, fix layout extraction before chunking.
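
As an illustration of structure-aware chunking, the sketch below splits a document at headings and keeps each heading as metadata for its section. It assumes Markdown-style `#` headings in the input, which is an assumption for the example; other formats need their own structure detection.

```python
import re

def split_by_headings(markdown_text: str) -> list[dict]:
    """Split text at Markdown headings; each section keeps its heading as metadata."""
    sections = []
    heading, lines = "", []

    def flush() -> None:
        body = "\n".join(lines).strip()
        if heading or body:
            sections.append({"heading": heading, "text": body})

    for line in markdown_text.splitlines():
        if re.match(r"#{1,6}\s", line):
            flush()  # close the previous section before starting a new one
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    flush()
    return sections

doc = (
    "# Install\n"
    "Run the installer.\n"
    "\n"
    "# Usage\n"
    "| flag | meaning |\n"
    "| -v | verbose |\n"
)
sections = split_by_headings(doc)
```

Because splits happen only at headings, the table under "Usage" stays in one piece, and the stored heading can serve as the kind of topic descriptor mentioned above.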

3) Query Drift (Retriever Fetches Irrelevant but Semantically Similar Content)

Symptoms:

Retrieved passages share vocabulary but don’t answer the user’s intent.
Model produces generic or tangential answers.

Root causes:

Short or underspecified queries whose embeddings match many loosely related passages.
Ambiguous user questions without constraints.
No re-ranking, or only shallow fusion of retrieval signals.

Quick mitigations:

Add instruction-tuned query rewriting to clarify entities, time ranges, and constraints.
Introduce filters on metadata (time, source, author, product, geography).
Use multi-stage retrieval: broad recall → cross-encoder re-rank → LLM check (e.g., “Does this passage answer the question?”).
Prompt the LLM to identify missing specifics and run a second retrieval pass.
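
The filtering and multi-stage ideas above can be combined roughly as follows. This is a sketch under stated assumptions: `term_overlap` stands in for a first-stage retriever, `phrase_score` is a hypothetical stand-in for a cross-encoder re-ranker, and the sample corpus and metadata keys are invented.

```python
from dataclasses import dataclass, field

@dataclass
class Passage:
    text: str
    metadata: dict = field(default_factory=dict)

def matches(passage: Passage, filters: dict) -> bool:
    """True when every filter key has exactly the requested value."""
    return all(passage.metadata.get(key) == value for key, value in filters.items())

def term_overlap(query: str, text: str) -> int:
    """Cheap first-stage score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def phrase_score(query: str, text: str) -> int:
    """Stand-in for a cross-encoder: counts query bigrams found verbatim in the text."""
    words = query.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return sum(1 for bigram in bigrams if bigram in text.lower())

def retrieve(query: str, passages: list[Passage], filters: dict,
             recall_k: int = 10, final_k: int = 1) -> list[Passage]:
    """Filter by metadata, recall broadly, then re-rank the shortlist."""
    candidates = [p for p in passages if matches(p, filters)]
    candidates.sort(key=lambda p: term_overlap(query, p.text), reverse=True)
    shortlist = candidates[:recall_k]
    shortlist.sort(key=lambda p: phrase_score(query, p.text), reverse=True)
    return shortlist[:final_k]

corpus = [
    Passage("refund policy for annual plans is 30 days", {"product": "pro", "year": 2024}),
    Passage("annual plans policy: the refund window was 14 days", {"product": "pro", "year": 2022}),
    Passage("refund policy for annual plans is 60 days", {"product": "enterprise", "year": 2024}),
]
top = retrieve("refund policy annual plans", corpus, {"product": "pro"})
```

The metadata filter removes the enterprise passage even though its wording is nearly identical to the best match, which is exactly the kind of similar-but-wrong content that causes drift.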

4) Outdated Indexes (Stale or Missing New Content)

Symptoms:

Answers contradict recent policy changes, prices, or versions.
Newly added documents aren’t reflected in results.

Root causes:

Infrequent indexing pipelines.
Missing change detection or timestamp-based filtering.

Quick mitigations:

Automate incremental indexing with webhooks or scheduled jobs.
Store and filter by timestamps; prefer recent docs when freshness matters.
Maintain versioned corpora and retire deprecated content.
For time-sensitive queries, include “last updated” disclaimers or prompt the LLM to prefer recent sources.
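
Incremental indexing depends on detecting what changed since the last run. One simple approach, sketched below, compares content hashes; the function names and the sample documents are assumptions for illustration.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_index_update(indexed: dict[str, str], current_docs: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored hashes with current content and list what needs work.

    `indexed` maps document ID to the hash recorded at the last indexing run;
    `current_docs` maps document ID to the current text.
    """
    current = {doc_id: content_hash(text) for doc_id, text in current_docs.items()}
    return {
        "add": sorted(d for d in current if d not in indexed),
        "update": sorted(d for d in current if d in indexed and indexed[d] != current[d]),
        "delete": sorted(d for d in indexed if d not in current),
    }

indexed = {
    "pricing": content_hash("Plan costs $10"),
    "faq": content_hash("Old FAQ"),
    "legacy": content_hash("Deprecated page"),
}
current_docs = {
    "pricing": "Plan costs $12",
    "faq": "Old FAQ",
    "news": "New launch",
}
plan = plan_index_update(indexed, current_docs)
```

Only the changed pricing page is re-embedded, the new page is added, and the retired page is removed; the unchanged FAQ is left alone, which keeps scheduled runs cheap.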

5) Hallucinations from Weak Context

Symptoms:

Confident but incorrect statements not supported by retrieved passages.
Fabricated citations or invented numbers.

Root causes:

Low-quality or irrelevant retrieval.
Too little retrieved context, or context diluted by noisy passages.
Prompts that reward fluency over faithfulness.

Quick mitigations:

Retrieval-first prompting: “Answer only using the provided context. If insufficient, say so.”
Use constrained generation: citation-required prompts, quote-then-summarize patterns.
Lower the temperature and cap the output length; prefer deterministic decoding for factual tasks.
Add answer verification: have the model justify each claim with passage spans; reject if unsupported.
Consider smaller, instruction-tuned models for extractive tasks or use extractive QA before generative summarization.
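
Two of these mitigations, retrieval-first prompting and span-level verification, can be sketched together. The example assumes the model has been asked to return the exact quotes supporting its answer as a list, which is a hypothetical output format; the check itself is a plain substring match after whitespace and case normalization.

```python
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Prompt that restricts the answer to the supplied context."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer only using the provided context. "
        "If the context is insufficient, say so.\n"
        "Quote the exact passage text that supports each claim.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so minor formatting differences don't matter."""
    return " ".join(text.lower().split())

def unsupported_quotes(quotes: list[str], passages: list[str]) -> list[str]:
    """Return the quotes that do not appear verbatim in any retrieved passage."""
    haystacks = [normalize(p) for p in passages]
    return [q for q in quotes if not any(normalize(q) in h for h in haystacks)]

passages = ["Green tea contains catechins that may support heart health."]
prompt = build_grounded_prompt("Is green tea good for the heart?", passages)

# Quotes as they might come back from the model (invented for this example).
model_quotes = ["may support  heart health", "cures all diseases"]
missing = unsupported_quotes(model_quotes, passages)
```

A non-empty `missing` list is a signal to reject or regenerate the answer. Substring matching only catches fabricated quotes; it does not verify that a paraphrased claim is faithful to the passage.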

Additional Failure Modes and Fixes

Duplicated or Near-Duplicate Chunks

- Issue: Reduces diversity in top-k results.
- Fix: Deduplicate by hash/similarity during indexing; penalize near-duplicates during re-ranking.

Security and Permissions Leaks

- Issue: Retrieval crosses tenant or ACL boundaries.
- Fix: Enforce per-document ACLs at query time; propagate identity/entitlements to the retriever.

Multimodal Mismatch (Tables, Code, Images)

- Issue: Text-only embeddings miss key information in tables/diagrams or code semantics.
- Fix: Use modality-aware parsing (table extractors, code-aware chunking) and specialized embeddings.

Tool/Agentic Loops

- Issue: Iterative retrieval/generation loops drift off-topic and waste tokens.
- Fix: Add step limits and intermediate grounding checks; summarize and refocus between hops.
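
Two of the fixes above, near-duplicate removal and ACL enforcement, are small enough to sketch. Jaccard similarity over word sets is used here as one simple similarity measure, and the `allowed_groups` metadata key is a hypothetical field name; real systems will have their own permission model.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts, from 0.0 (disjoint) to 1.0 (identical)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def drop_near_duplicates(chunks: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a chunk only if it is not too similar to any chunk already kept."""
    kept: list[str] = []
    for chunk in chunks:
        if all(jaccard(chunk, other) < threshold for other in kept):
            kept.append(chunk)
    return kept

def is_allowed(chunk_metadata: dict, user_groups: set[str]) -> bool:
    """Deny by default: the user must share at least one group with the chunk."""
    return bool(set(chunk_metadata.get("allowed_groups", [])) & user_groups)

chunks = [
    "Reset your password from the settings page",
    "reset your password from the settings page",
    "Contact support to close your account",
]
unique = drop_near_duplicates(chunks)
```

The pairwise comparison is quadratic in the number of chunks, so it suits small batches or per-query result lists; deduplicating a large corpus at indexing time usually calls for hashing-based methods instead.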

