Intro to RAG (Retrieval-Augmented Generation) and Where It Fails
Introduction to Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is an architectural pattern that combines information retrieval with large language models (LLMs) to produce answers grounded in external knowledge. By injecting relevant documents into the model’s context, it can reduce hallucinations, keep responses up to date without retraining, and adapt the model to a specific domain.
What RAG Is
RAG is a two-stage pipeline:
A retriever selects relevant information from a knowledge source (e.g., files, webpages, databases).
A generator (LLM) conditions on the retrieved passages to produce a final answer.
By augmenting the LLM with retrieval, RAG turns the model from a closed-book system into an open-book system.
Why RAG Is Used
Grounding: Reduces hallucinations by citing or conditioning on real documents.
Freshness: Answers can reflect the latest data without retraining the base model.
Domain Adaptation: Leverages proprietary or niche corpora that the base model didn’t see in pretraining.
Cost and Speed: Often cheaper and faster than fine-tuning when the goal is to update what the model knows.
Explainability: Retrieved passages can be shown to users for transparency and trust.
How RAG Works (Retriever + Generator)
- Query understanding
  - The user query is optionally reformulated (e.g., with query expansion or rewriting).
- Retrieval
  - The retriever searches a knowledge index (semantic/vector and/or lexical) and returns top-k passages.
- Generation
  - The LLM reads the query plus retrieved passages and produces an answer, often with citations.
- Optional feedback
  - Re-ranking, iterative retrieval, or self-reflection improve the next round.
Simple example:
Question: “What are the health benefits of green tea?”
Retriever: Finds passages from trusted sources describing catechins, antioxidants, and potential cardiovascular benefits.
Generator: Summarizes: “Green tea contains catechins that may support heart health and reduce oxidative stress; evidence suggests modest benefits when consumed regularly.” It may include citations to the retrieved passages.
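The retrieve-then-generate flow can be sketched in a few lines. This is a minimal illustration, not a production design: the corpus, the word-overlap scoring, and the names `retrieve` and `build_prompt` are all assumptions made for the example, and the LLM call itself is left out.

```python
import re

# Toy corpus; a real system would load parsed and chunked documents.
CORPUS = [
    "Green tea contains catechins, a class of antioxidants.",
    "Regular green tea consumption may support cardiovascular health.",
    "Black coffee is a common source of caffeine.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by word overlap with the query and return the top k."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Assemble the text an LLM would condition on (the LLM call is omitted)."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer with citations:"

question = "What are the health benefits of green tea?"
passages = retrieve(question, CORPUS)
prompt = build_prompt(question, passages)
```

Word overlap stands in for a real retriever here only to keep the sketch dependency-free; the numbered passages in the prompt are what make citations possible.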
What Is Indexing
Indexing is the process of preparing a knowledge base for fast retrieval. It typically includes:
Parsing content (PDFs, HTML, docs)
Chunking into manageable passages
Vectorization (embedding each chunk into a numerical representation)
Building a search structure (e.g., FAISS, HNSW) for efficient nearest-neighbor queries
Storing metadata (source, titles, timestamps, permissions) for filtering and display
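The indexing steps above might look roughly like this. It is a sketch under stated assumptions: `embed` is a hashed bag-of-words stand-in for a learned embedding model, the plain Python list stands in for a search structure such as FAISS or HNSW, and `IndexedChunk`, `index_document`, and the file name are hypothetical.

```python
import zlib
from dataclasses import dataclass, field

DIM = 64  # toy embedding size; real models use hundreds or thousands of dimensions

def embed(text: str) -> list[float]:
    """Hashed bag-of-words vector: a stand-in for a learned embedding model."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % DIM] += 1.0
    return vec

@dataclass
class IndexedChunk:
    text: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)

def index_document(text: str, source: str, updated: str, chunk_words: int = 50) -> list[IndexedChunk]:
    """Chunk an already-parsed document, embed each chunk, and attach metadata."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]
    return [
        IndexedChunk(c, embed(c), {"source": source, "updated": updated, "position": n})
        for n, c in enumerate(chunks)
    ]

# A 120-word placeholder document yields chunks of 50, 50, and 20 words.
index = index_document("word " * 120, source="handbook.pdf", updated="2024-01-01")
```

Keeping metadata next to each vector is what later allows filtering by source, timestamp, or permissions at query time.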

Why We Perform Vectorization
Vectorization converts text into dense embeddings that capture semantic meaning. Benefits:
Semantic search: Finds conceptually similar content even when wording differs from the query.
Robustness: Handles synonyms, paraphrases, and multilingual content better than pure keyword search.
Ranking: Enables similarity scoring and hybrid approaches (combine vector and keyword signals).
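Similarity between embeddings is commonly scored with cosine similarity. The sketch below uses hand-made three-dimensional vectors purely for illustration; real embeddings come from a model, and the document titles and values are invented for the example.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)

# Illustrative vectors: the query "car repair" shares no words with the first
# title, yet their (made-up) embeddings point in nearly the same direction.
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "automobile maintenance guide": [0.8, 0.2, 0.1],
    "green tea brewing tips": [0.0, 0.1, 0.9],
}
ranked = sorted(doc_vecs, key=lambda d: cosine_similarity(query_vec, doc_vecs[d]), reverse=True)
```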
Why RAG Exists
LLMs alone have limits:
Static knowledge: Models “freeze” at training cutoff dates.
Hallucinations: Fluent but incorrect answers without grounding.
Proprietary data: Sensitive or private corpora aren’t in public training sets.
RAG addresses these limits by supplying current, organization-specific, and vetted sources at inference time.
Why We Perform Chunking
Chunking splits long documents into smaller passages that fit model context and improve retrieval precision. Reasons:
Context window limits: LLMs can only read a limited number of tokens.
Targeted retrieval: Smaller, focused chunks increase the chance the right passage is retrieved.
Reduced noise: Avoids stuffing entire documents where only a section is relevant.
Why Overlapping Is Used in Chunking
Overlap repeats a small region of text between consecutive chunks (e.g., 20–30% of the chunk length, tuned per corpus):
Preserves continuity: Important sentences that straddle boundaries still appear intact in at least one chunk.
Improves recall: Increases the likelihood the relevant context is captured and retrieved.
Reduces boundary errors: Minimizes missing definitions, references, or context carried across paragraphs.
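A sliding window is one simple way to implement overlapping chunks. The sketch below assumes the text is already tokenized; `chunk_with_overlap` and its parameters are illustrative names, and the sizes are chosen only to make the overlap easy to see.

```python
def chunk_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

tokens = [f"t{i}" for i in range(10)]
# size=4 with overlap=1 is a 25% overlap: each chunk repeats the last token of the previous one.
chunks = chunk_with_overlap(tokens, size=4, overlap=1)
```

Here the token `t3` closes the first chunk and opens the second, so a sentence crossing that boundary would not be cut off in both.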
Where RAG Fails (and Quick Mitigations)
Even well-designed RAG systems can fail. Below are common failure modes, with their symptoms, root causes, and lightweight mitigations.
1) Poor Recall (Relevant Documents Not Retrieved)
Symptoms:
The answer ignores key facts known to exist in the corpus.
Retrieved passages are off-topic or too generic.
Root causes:
Embedding model mismatch with domain language.
Suboptimal chunk sizes or overlap.
Sparse-only or dense-only retrieval that misses signals the other would catch.
Low top-k or aggressive filters reducing candidate coverage.
Quick mitigations:
Use hybrid retrieval (dense + lexical/keyword) to balance semantics and exact matches.
Increase top-k and apply a cross-encoder re-ranker for precision after broad recall.
Tune chunk size (e.g., 200–500 tokens) and overlap (e.g., 15–25%) for the corpus.
Adopt domain-tuned embeddings or multilingual embeddings for mixed-language corpora.
Add query expansion or rewriting (synonyms, acronyms, entity linking).
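One common way to combine dense and lexical results is reciprocal rank fusion (RRF), which needs only the rank positions from each retriever. This is a sketch of that one technique, not the only way to build hybrid retrieval; the document IDs and the two input rankings are hypothetical, and k=60 is the constant conventionally used with RRF.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

dense_hits = ["doc_a", "doc_b", "doc_c"]    # hypothetical vector-search results
lexical_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical keyword-search results
fused = reciprocal_rank_fusion([dense_hits, lexical_hits])
```

Documents found by both retrievers rise to the top, while documents found by only one still stay in the candidate set for a later re-ranker.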
2) Bad Chunking (Context Lost or Fragmented)
Symptoms:
The model answers incompletely or misses definitions/examples.
Passages lack self-contained meaning; key tables or figures split awkwardly.
Root causes:
Fixed-size chunking that ignores structure (headings, sections, bullets, tables).
No or too little overlap causing boundary losses.
Overly large chunks whose filler content reduces retrieval precision.
Quick mitigations:
Use structure-aware chunking: split by headings, sections, semantic breaks; keep tables whole.
Calibrate chunk size and overlap empirically per corpus.
Create “summary headers” or “topic descriptors” stored as metadata to aid retrieval.
For code/docs, keep function/class blocks intact; for PDFs, correct layout extraction first.
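Splitting at headings is the simplest form of structure-aware chunking. The sketch below assumes the source has already been converted to Markdown-style text with `#` headings; `split_by_headings` is an illustrative name, and tables or code blocks would need additional handling.

```python
import re

def split_by_headings(markdown_text: str) -> list[dict]:
    """Split text at Markdown headings so each chunk keeps its heading as metadata."""
    chunks, heading, body = [], "(no heading)", []
    for line in markdown_text.splitlines():
        if re.match(r"#{1,6} ", line):
            # A new heading closes the previous section, if it had any content.
            if any(l.strip() for l in body):
                chunks.append({"heading": heading, "text": "\n".join(body).strip()})
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    if any(l.strip() for l in body):
        chunks.append({"heading": heading, "text": "\n".join(body).strip()})
    return chunks

doc = "# Install\nRun the installer.\n\n# Usage\nStart the service.\nCheck the logs.\n"
sections = split_by_headings(doc)
```

Storing the heading alongside each chunk also gives the retriever a short topic descriptor to match against.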
3) Query Drift (Retriever Fetches Irrelevant but Semantically Similar Content)
Symptoms:
Retrieved passages share vocabulary but don’t answer the user’s intent.
Model produces generic or tangential answers.
Root causes:
Short or vague queries whose embeddings match many loosely related passages.
Ambiguous user questions without constraints.
No re-ranking, or only shallow fusion of retrieval signals.
Quick mitigations:
Add instruction-tuned query rewriting to clarify entities, time ranges, and constraints.
Introduce filters on metadata (time, source, author, product, geography).
Use multi-stage retrieval: broad recall → cross-encoder re-rank → LLM check (e.g., “Does this passage answer the question?”).
Prompt the LLM to identify missing specifics and run a second retrieval pass.
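Metadata filters can remove passages that are semantically close but out of scope before any ranking happens. A minimal sketch, assuming each chunk carries `product` and `updated` fields; the field names and sample data are hypothetical.

```python
from datetime import date

chunks = [
    {"text": "Pricing for Product A, 2022 edition.", "product": "A", "updated": date(2022, 3, 1)},
    {"text": "Pricing for Product A, 2024 edition.", "product": "A", "updated": date(2024, 5, 1)},
    {"text": "Pricing for Product B, 2024 edition.", "product": "B", "updated": date(2024, 6, 1)},
]

def filter_chunks(chunks: list[dict], product: str, not_before: date) -> list[dict]:
    """Keep only chunks for the given product updated on or after `not_before`."""
    return [c for c in chunks if c["product"] == product and c["updated"] >= not_before]

# All three chunks look alike to an embedding model; the filter keeps only the in-scope one.
candidates = filter_chunks(chunks, product="A", not_before=date(2023, 1, 1))
```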
4) Outdated Indexes (Stale or Missing New Content)
Symptoms:
Answers contradict recent policy changes, prices, or versions.
Newly added documents aren’t reflected in results.
Root causes:
Infrequent indexing pipelines.
Missing change detection or timestamp-based filtering.
Quick mitigations:
Automate incremental indexing with webhooks or scheduled jobs.
Store and filter by timestamps; prefer recent docs when freshness matters.
Maintain versioned corpora and retire deprecated content.
For time-sensitive queries, include “last updated” disclaimers or prompt the LLM to prefer recent sources.
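Change detection for incremental indexing can be as simple as comparing content hashes. The sketch below assumes the index stores one hash per document ID; `plan_reindex` and the file names are hypothetical, and the actual re-embedding and deletion steps are left out.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(indexed_hashes: dict[str, str], current_docs: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored hashes with current content and list what to add, update, or delete."""
    plan: dict[str, list[str]] = {"add": [], "update": [], "delete": []}
    for doc_id, text in current_docs.items():
        if doc_id not in indexed_hashes:
            plan["add"].append(doc_id)
        elif indexed_hashes[doc_id] != content_hash(text):
            plan["update"].append(doc_id)
    plan["delete"] = [d for d in indexed_hashes if d not in current_docs]
    return plan

indexed = {"policy.md": content_hash("v1 policy"), "old.md": content_hash("retired")}
current = {"policy.md": "v2 policy", "pricing.md": "new prices"}
plan = plan_reindex(indexed, current)
```

A scheduled job or webhook handler could run a comparison like this and re-embed only the documents listed under `add` and `update`.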
5) Hallucinations from Weak Context
Symptoms:
Confident but incorrect statements not supported by retrieved passages.
Fabricated citations or invented numbers.
Root causes:
Low-quality or irrelevant retrieval.
Too little retrieved context, or context diluted by noisy passages.
Prompts that reward fluency over faithfulness.
Quick mitigations:
Retrieval-first prompting: “Answer only using the provided context. If insufficient, say so.”
Use constrained generation: citation-required prompts, quote-then-summarize patterns.
Calibrate temperature and length; prefer deterministic decoding for factual tasks.
Add answer verification: have the model justify each claim with passage spans; reject if unsupported.
Consider smaller, instruction-tuned models for extractive tasks or use extractive QA before generative summarization.
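Retrieval-first prompting and a basic support check can be sketched as follows. The assumptions are significant: the model is asked to return verbatim quotes for its claims, and verification is a plain substring match, which is much cruder than span-level or entailment-based checking. Function names and sample text are illustrative.

```python
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Number the passages and instruct the model to stay within them."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer only using the provided context. If insufficient, say so.\n"
        "Quote the supporting span for each claim.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def unsupported_quotes(quotes: list[str], passages: list[str]) -> list[str]:
    """Return quotes that do not appear verbatim (case-insensitive) in any passage."""
    lowered = [p.lower() for p in passages]
    return [q for q in quotes if not any(q.lower() in p for p in lowered)]

passages = ["Green tea contains catechins that may support heart health."]
prompt = build_grounded_prompt("Is green tea healthy?", passages)
# Quotes as a model might return them; the second one is fabricated.
bad = unsupported_quotes(["may support heart health", "cures insomnia"], passages)
```

An answer with a non-empty `bad` list could be rejected or regenerated rather than shown to the user.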

Additional Failure Modes and Fixes
Duplicated or Near-Duplicate Chunks
Issue: Reduces diversity in top-k results.
Fix: Deduplicate by hash/similarity during indexing; penalize near-duplicates during re-ranking.
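Deduplication at indexing time might combine an exact-match hash with a similarity threshold. The sketch below uses Jaccard similarity over word sets as an assumed near-duplicate measure; the 0.8 threshold is arbitrary, and the pairwise comparison is quadratic, so large corpora would need something like MinHash instead.

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def jaccard(a: str, b: str) -> float:
    """Share of distinct words two texts have in common."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def deduplicate(chunks: list[str], threshold: float = 0.8) -> list[str]:
    """Drop exact duplicates by hash, then near-duplicates by Jaccard similarity."""
    seen_hashes, kept = set(), []
    for chunk in chunks:
        norm = normalize(chunk)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        if any(jaccard(norm, normalize(k)) >= threshold for k in kept):
            continue
        seen_hashes.add(digest)
        kept.append(chunk)
    return kept

chunks = [
    "Refunds are processed within five business days.",
    "Refunds  are processed within five business days.",          # extra space only
    "Refunds are usually processed within five business days.",   # one added word
    "Shipping takes two weeks.",
]
unique = deduplicate(chunks)
```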
Security and Permissions Leaks
Issue: Retrieval crosses tenant or ACL boundaries.
Fix: Enforce per-document ACLs at query time; propagate identity/entitlements to the retriever.
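Query-time ACL enforcement amounts to filtering candidates against the caller's identity before anything reaches the prompt. A minimal sketch, assuming each chunk stores a `tenant` and a set of `allowed_groups`; these field names, the tenants, and the group names are hypothetical.

```python
def allowed(chunk: dict, user_groups: set[str], tenant: str) -> bool:
    """A chunk is visible only within its tenant and to at least one permitted group."""
    return chunk["tenant"] == tenant and bool(user_groups & chunk["allowed_groups"])

def retrieve_with_acl(candidates: list[dict], user_groups: set[str], tenant: str, k: int = 3) -> list[dict]:
    """Filter before truncating to top-k so restricted chunks never reach the prompt."""
    visible = [c for c in candidates if allowed(c, user_groups, tenant)]
    return visible[:k]

candidates = [  # assumed to be already ranked by the retriever
    {"id": "hr-1", "tenant": "acme", "allowed_groups": {"hr"}},
    {"id": "eng-1", "tenant": "acme", "allowed_groups": {"eng", "hr"}},
    {"id": "eng-9", "tenant": "globex", "allowed_groups": {"eng"}},
]
results = retrieve_with_acl(candidates, user_groups={"eng"}, tenant="acme")
```

Filtering before the top-k cut matters: filtering afterwards can leave the user with fewer than k results even when permitted documents exist.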
Multimodal Mismatch (Tables, Code, Images)
Issue: Text-only embeddings miss key information in tables/diagrams or code semantics.
Fix: Use modality-aware parsing (table extractors, code-aware chunking) and specialized embeddings.
Tool/Agentic Loops
Issue: Iterative retrieval/generation loops drift off-topic and waste tokens.
Fix: Add step limits and intermediate grounding checks; summarize and refocus between hops.
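A step budget with a sufficiency check between hops can be sketched as a small loop. The `retrieve`, `is_sufficient`, and `refine` callables are placeholders supplied by the caller; the stubs below exist only to make the example runnable and would be replaced by a real retriever and LLM calls.

```python
def iterative_retrieve(question, retrieve, is_sufficient, refine, max_steps: int = 3):
    """Retrieve repeatedly, refining the query, until context suffices or the budget runs out."""
    query, gathered = question, []
    for step in range(1, max_steps + 1):
        gathered.extend(retrieve(query))
        if is_sufficient(question, gathered):
            return gathered, step
        # Refine from the original question, not the previous query, to limit drift.
        query = refine(question, gathered)
    return gathered, max_steps

# Stub components for illustration only.
fake_index = {"rag failures": ["note on recall"], "rag failures recall": ["note on re-ranking"]}
passages, steps = iterative_retrieve(
    "rag failures",
    retrieve=lambda q: fake_index.get(q, []),
    is_sufficient=lambda q, ctx: len(ctx) >= 2,
    refine=lambda q, ctx: q + " recall",
)
```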



