Intro to RAG (Retrieval Augmented Generation) and Where it fails

Introduction to Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is an architectural pattern that combines information retrieval with large language models (LLMs) to produce answers grounded in external knowledge. By injecting relevant documents into the model’s context, it can reduce hallucinations, enable up-to-date responses without retraining, and support domain-specific adaptation.

What RAG Is

RAG is a two-stage pipeline:

  • A retriever selects relevant information from a knowledge source (e.g., files, webpages, databases).

  • A generator (LLM) conditions on the retrieved passages to produce a final answer.

By augmenting the LLM with retrieval, RAG turns the model from a closed-book system into an open-book system.

Why RAG Is Used

  • Grounding: Reduces hallucinations by citing or conditioning on real documents.

  • Freshness: Answers can reflect the latest data without retraining the base model.

  • Domain Adaptation: Leverages proprietary or niche corpora that the base model didn’t see in pretraining.

  • Cost and Speed: Updating an index is usually cheaper and faster than fine-tuning the model when the goal is to add or refresh knowledge.

  • Explainability: Retrieved passages can be shown to users for transparency and trust.

How RAG Works (Retriever + Generator)

  1. Query understanding
  • The user query is optionally reformulated (e.g., with query expansion or rewriting).
  2. Retrieval
  • The retriever searches a knowledge index (semantic/vector and/or lexical) and returns top-k passages.
  3. Generation
  • The LLM reads the query plus retrieved passages and produces an answer, often with citations.
  4. Optional feedback
  • Re-ranking, iterative retrieval, or self-reflection improve the next round.

Simple example:

  • Question: “What are the health benefits of green tea?”

  • Retriever: Finds passages from trusted sources describing catechins, antioxidants, and potential cardiovascular benefits.

  • Generator: Summarizes: “Green tea contains catechins that may support heart health and reduce oxidative stress; evidence suggests modest benefits when consumed regularly.” It may include citations to the retrieved passages.
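The retrieve-then-generate flow above can be sketched in a few lines. This is a minimal illustration, not a production design: the three-sentence `CORPUS`, the word-overlap scoring in `retrieve`, and the prompt wording in `build_prompt` are all assumptions made for the example, and the actual LLM call is left out.

```python
from collections import Counter

# Hypothetical toy corpus; a real system would query an index instead.
CORPUS = [
    "Green tea contains catechins, a class of antioxidants.",
    "Regular green tea consumption may support cardiovascular health.",
    "Espresso is brewed by forcing hot water through ground coffee.",
]

PUNCT = ".,?!\"'"


def tokenize(text: str) -> list[str]:
    """Lowercase words with surrounding punctuation removed."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by word overlap with the query (stand-in for a real retriever)."""
    q = Counter(tokenize(query))
    scored = [(sum((q & Counter(tokenize(p))).values()), p) for p in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]


def build_prompt(query: str, passages: list[str]) -> str:
    """Assemble the text that a generator LLM would condition on."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the context below and cite passage numbers.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


question = "What are the health benefits of green tea?"
passages = retrieve(question, CORPUS)
prompt = build_prompt(question, passages)
print(prompt)  # this prompt would be sent to the LLM of your choice
```

The numbered passages in the prompt are what allow the generator to cite its sources.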

What Is Indexing

Indexing is the process of preparing a knowledge base for fast retrieval. It typically includes:

  • Parsing content (PDFs, HTML, docs)

  • Chunking into manageable passages

  • Vectorization (embedding each chunk into a numerical representation)

  • Building a search structure for efficient nearest-neighbor queries (e.g., an HNSW graph index, available in libraries such as FAISS)

  • Storing metadata (source, titles, timestamps, permissions) for filtering and display
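As a rough sketch of how these indexing steps fit together, the example below chunks plain-text documents, embeds each chunk, and stores metadata next to the vector. Everything here is an assumption for illustration: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, `DIM` is arbitrary, and the brute-force `search` stands in for an approximate nearest-neighbor structure.

```python
import hashlib
import math
from dataclasses import dataclass, field

DIM = 64  # hypothetical vector size for the toy embedder


def toy_embed(text: str) -> list[float]:
    """Hashed bag-of-words vector, L2-normalized (not a real embedding model)."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class IndexEntry:
    vector: list[float]
    text: str
    metadata: dict = field(default_factory=dict)


def build_index(docs: dict[str, str], chunk_words: int = 50) -> list[IndexEntry]:
    """Chunk each document, embed each chunk, and keep source metadata with it."""
    index = []
    for source, text in docs.items():
        words = text.split()
        for start in range(0, len(words), chunk_words):
            chunk = " ".join(words[start:start + chunk_words])
            meta = {"source": source, "start_word": start}
            index.append(IndexEntry(toy_embed(chunk), chunk, meta))
    return index


def search(index: list[IndexEntry], query: str, k: int = 3) -> list[IndexEntry]:
    """Brute-force search by dot product (equals cosine for normalized vectors)."""
    q = toy_embed(query)
    ranked = sorted(
        index,
        key=lambda e: sum(a * b for a, b in zip(q, e.vector)),
        reverse=True,
    )
    return ranked[:k]
```

In practice the list of entries would be replaced by a vector store, but the shape of the data (vector, text, metadata) stays much the same.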

Why We Perform Vectorization

Vectorization converts text into dense embeddings that capture semantic meaning. Benefits:

  • Semantic search: Finds conceptually similar content even when wording differs from the query.

  • Robustness: Handles synonyms, paraphrases, and multilingual content better than pure keyword search.

  • Ranking: Enables similarity scoring and hybrid approaches (combine vector and keyword signals).
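Similarity between embeddings is commonly measured with cosine similarity. The sketch below shows the computation; the three-dimensional vectors in `EMBEDDINGS` are hand-written, hypothetical values chosen only to illustrate that related terms score higher than unrelated ones. Real embeddings come from a trained model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, written by hand for illustration only.
EMBEDDINGS = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "banana": [0.0, 0.2, 0.95],
}

related = cosine_similarity(EMBEDDINGS["car"], EMBEDDINGS["automobile"])
unrelated = cosine_similarity(EMBEDDINGS["car"], EMBEDDINGS["banana"])
print(f"car vs automobile: {related:.3f}")
print(f"car vs banana:     {unrelated:.3f}")
```

A keyword search would treat "car" and "automobile" as unrelated strings; with embeddings, their closeness is captured by the score.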

Why RAG Exists

LLMs alone have limits:

  • Static knowledge: Models “freeze” at training cutoff dates.

  • Hallucinations: Fluent but incorrect answers without grounding.

  • Proprietary data: Sensitive or private corpora aren’t in public training sets.

RAG addresses these limits by integrating dynamic, organization-specific, and verified sources at inference time.

Why We Perform Chunking

Chunking splits long documents into smaller passages that fit model context and improve retrieval precision. Reasons:

  • Context window limits: LLMs can only read a limited number of tokens.

  • Targeted retrieval: Smaller, focused chunks increase the chance the right passage is retrieved.

  • Reduced noise: Avoids stuffing entire documents where only a section is relevant.

Why Overlapping Is Used in Chunking

Overlapping repeats a small region of text at the end of one chunk and the start of the next (e.g., 20–30% of the chunk length):

  • Preserves continuity: Important sentences that straddle boundaries still appear intact in at least one chunk.

  • Improves recall: Increases the likelihood the relevant context is captured and retrieved.

  • Reduces boundary errors: Minimizes missing definitions, references, or context carried across paragraphs.
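A sliding window is one simple way to implement overlapping chunks. The sketch below works on a list of tokens; the function name `chunk_with_overlap` and its parameters are illustrative, and a real pipeline would use the tokenizer that matches its embedding model.

```python
def chunk_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


tokens = [f"t{i}" for i in range(10)]
for chunk in chunk_with_overlap(tokens, size=4, overlap=1):
    print(chunk)
# ['t0', 't1', 't2', 't3']
# ['t3', 't4', 't5', 't6']
# ['t6', 't7', 't8', 't9']
```

Each boundary token (`t3`, `t6`) appears in two chunks, so a sentence that straddles a boundary is less likely to be cut in both.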


Where RAG Fails (and Quick Mitigations)

Even well-designed RAG systems can fail. Below are common failure modes, their symptoms and root causes, and lightweight mitigations for each.

1) Poor Recall (Relevant Documents Not Retrieved)

Symptoms:

  • The answer ignores key facts known to exist in the corpus.

  • Retrieved passages are off-topic or too generic.

Root causes:

  • Embedding model mismatch with domain language.

  • Suboptimal chunk sizes or overlap.

  • Sparse-only or dense-only retrieval missing signals.

  • Low top-k or aggressive filters reducing candidate coverage.

Quick mitigations:

  • Use hybrid retrieval (dense + lexical/keyword) to balance semantics and exact matches.

  • Increase top-k and apply a cross-encoder re-ranker for precision after broad recall.

  • Tune chunk size (e.g., 200–500 tokens) and overlap (e.g., 15–25%) for the corpus.

  • Adopt domain-tuned embeddings or multilingual embeddings for mixed-language corpora.

  • Add query expansion or rewriting (synonyms, acronyms, entity linking).
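One common way to combine dense and lexical results is reciprocal rank fusion (RRF), which only needs the rank of each document in each list. The sketch below is a minimal version; the document IDs are made up, and `k = 60` is a commonly used smoothing constant rather than a value taken from this article.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each document scores sum(1 / (k + rank))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical result lists from a dense retriever and a keyword retriever.
dense = ["d3", "d1", "d7"]
lexical = ["d1", "d9", "d3"]

fused = reciprocal_rank_fusion([dense, lexical])
print(fused)  # ['d1', 'd3', 'd9', 'd7']
```

Documents that appear in both lists (`d1`, `d3`) rise to the top, which is the behavior hybrid retrieval is after. The fused list can then be passed to a re-ranker.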

2) Bad Chunking (Context Lost or Fragmented)

Symptoms:

  • The model answers incompletely or misses definitions/examples.

  • Passages lack self-contained meaning; key tables or figures split awkwardly.

Root causes:

  • Fixed-size chunking that ignores structure (headings, sections, bullets, tables).

  • No or too little overlap causing boundary losses.

  • Overly large chunks with filler content reduce retrieval precision.

Quick mitigations:

  • Use structure-aware chunking: split by headings, sections, semantic breaks; keep tables whole.

  • Calibrate chunk size and overlap empirically per corpus.

  • Create “summary headers” or “topic descriptors” stored as metadata to aid retrieval.

  • For code and API documentation, keep function/class blocks intact; for PDFs, fix layout extraction before chunking.
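Structure-aware chunking can be as simple as splitting on headings so that each chunk is a self-contained section. The sketch below assumes the source has already been converted to Markdown-style text with `#` headings, which is an assumption about the input and not something every corpus provides.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown_text: str) -> list[dict]:
    """Split text into sections, one per heading, keeping the heading as metadata."""
    sections = []
    current = {"heading": "", "lines": []}

    def has_content(section: dict) -> bool:
        return bool(section["heading"]) or any(l.strip() for l in section["lines"])

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            if has_content(current):
                sections.append(current)
            current = {"heading": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)
    if has_content(current):
        sections.append(current)

    return [
        {"heading": s["heading"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]


doc = "# Setup\nInstall it.\n\n## Usage\nRun it.\n| a | b |\n| 1 | 2 |\n"
for section in split_by_headings(doc):
    print(section)
```

Because the split happens only at headings, the table rows under "Usage" stay in a single chunk. Sections that are still too long can be passed through a size-based splitter afterwards.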

3) Query Drift (Retriever Fetches Irrelevant but Semantically Similar Content)

Symptoms:

  • Retrieved passages share vocabulary but don’t answer the user’s intent.

  • Model produces generic or tangential answers.

Root causes:

  • Short or underspecified queries whose embeddings match many loosely related topics.

  • Ambiguous user questions without constraints.

  • No re-ranking or shallow signal fusion.

Quick mitigations:

  • Add instruction-tuned query rewriting to clarify entities, time ranges, and constraints.

  • Introduce filters on metadata (time, source, author, product, geography).

  • Use multi-stage retrieval: broad recall → cross-encoder re-rank → LLM check (e.g., “Does this passage answer the question?”).

  • Prompt the LLM to identify missing specifics and run a second retrieval pass.
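The filter-then-re-rank idea can be sketched as two small stages. The `Passage` fields (`product`, `year`) and the `overlap_score` function are hypothetical; in a real system the scorer would typically be a cross-encoder or an LLM relevance check rather than word overlap.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Passage:
    text: str
    product: str
    year: int


def filter_by_metadata(
    passages: list[Passage],
    product: Optional[str] = None,
    min_year: Optional[int] = None,
) -> list[Passage]:
    """Stage 1: drop candidates that violate explicit constraints."""
    kept = []
    for p in passages:
        if product is not None and p.product != product:
            continue
        if min_year is not None and p.year < min_year:
            continue
        kept.append(p)
    return kept


def rerank(
    query: str,
    passages: list[Passage],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> list[Passage]:
    """Stage 2: re-score the survivors with a more precise scorer."""
    ranked = sorted(passages, key=lambda p: score_fn(query, p.text), reverse=True)
    return ranked[:top_n]


def overlap_score(query: str, text: str) -> float:
    """Toy scorer (shared lowercase words); a stand-in for a cross-encoder."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


candidates = [
    Passage("app settings overview", "mobile", 2024),
    Passage("reset password in app settings", "mobile", 2024),
    Passage("reset password via admin console", "web", 2024),
    Passage("reset password by email link", "mobile", 2019),
]
kept = filter_by_metadata(candidates, product="mobile", min_year=2023)
best = rerank("reset password", kept, overlap_score, top_n=1)
print(best[0].text)
```

Filtering first removes passages that are semantically similar but about the wrong product or time period, so the re-ranker only has to order plausible candidates.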

4) Outdated Indexes (Stale or Missing New Content)

Symptoms:

  • Answers contradict recent policy changes, prices, or versions.

  • Newly added documents aren’t reflected in results.

Root causes:

  • Infrequent indexing pipelines.

  • Missing change detection or timestamp-based filtering.

Quick mitigations:

  • Automate incremental indexing with webhooks or scheduled jobs.

  • Store and filter by timestamps; prefer recent docs when freshness matters.

  • Maintain versioned corpora and retire deprecated content.

  • For time-sensitive queries, include “last updated” disclaimers or prompt the LLM to prefer recent sources.
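Incremental indexing depends on knowing what changed. One lightweight approach, sketched below, is to store a content hash per document and compare it on each run. The function names and the in-memory dictionaries are illustrative; a real pipeline would persist the hashes next to the index.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_index_update(
    current_docs: dict[str, str],
    indexed_hashes: dict[str, str],
) -> dict[str, list[str]]:
    """Compare current documents against stored hashes and list the work to do."""
    to_add, to_update = [], []
    for doc_id, text in current_docs.items():
        if doc_id not in indexed_hashes:
            to_add.append(doc_id)
        elif indexed_hashes[doc_id] != content_hash(text):
            to_update.append(doc_id)
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return {
        "add": sorted(to_add),
        "update": sorted(to_update),
        "delete": sorted(to_delete),
    }


# Hypothetical state: what the index saw last time vs. what exists now.
indexed = {"a": content_hash("v1"), "b": content_hash("same"), "c": content_hash("old")}
current = {"a": "v2", "b": "same", "d": "new"}

plan = plan_index_update(current, indexed)
print(plan)  # {'add': ['d'], 'update': ['a'], 'delete': ['c']}
```

Only documents in the `add` and `update` lists need to be re-chunked and re-embedded, and entries in `delete` are removed so retired content stops appearing in results.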

5) Hallucinations from Weak Context

Symptoms:

  • Confident but incorrect statements not supported by retrieved passages.

  • Fabricated citations or invented numbers.

Root causes:

  • Low-quality or irrelevant retrieval.

  • Retrieved context that is too short to cover the question, or padded with noisy passages.

  • Prompts that reward fluency over faithfulness.

Quick mitigations:

  • Retrieval-first prompting: “Answer only using the provided context. If insufficient, say so.”

  • Use constrained generation: citation-required prompts, quote-then-summarize patterns.

  • Calibrate temperature and length; prefer deterministic decoding for factual tasks.

  • Add answer verification: have the model justify each claim with passage spans; reject if unsupported.

  • Consider smaller, instruction-tuned models for extractive tasks or use extractive QA before generative summarization.
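Two of these mitigations, retrieval-first prompting and answer verification, can be sketched without any model in the loop. The prompt wording beyond the quoted instruction and the helper names are assumptions, and the check below only confirms that quoted spans exist in the retrieved passages. It does not judge whether the claim built on a quote is correct.

```python
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Prompt that restricts the model to the supplied context and asks for quotes."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer only using the provided context. If insufficient, say so.\n"
        "Support each claim with an exact quote from the context.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so minor formatting differences are ignored."""
    return " ".join(text.lower().split())


def unsupported_quotes(quotes: list[str], passages: list[str]) -> list[str]:
    """Return quotes that do not appear in any retrieved passage."""
    haystack = [normalize(p) for p in passages]
    return [q for q in quotes if not any(normalize(q) in h for h in haystack)]


passages = ["The refund window is 30 days from delivery."]
# Hypothetical quotes extracted from a model's answer.
quotes = ["refund window is 30 days", "refund window is 90 days"]

print(build_grounded_prompt("How long is the refund window?", passages))
print(unsupported_quotes(quotes, passages))  # ['refund window is 90 days']
```

An answer with any unsupported quote can be rejected or regenerated before it reaches the user.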

Additional Failure Modes and Fixes

  • Duplicated or Near-Duplicate Chunks

    • Issue: Reduces diversity in top-k results.

    • Fix: Deduplicate by hash/similarity during indexing; penalize near-duplicates during re-ranking.

  • Security and Permissions Leaks

    • Issue: Retrieval crosses tenant or ACL boundaries.

    • Fix: Enforce per-document ACLs at query time; propagate identity/entitlements to the retriever.

  • Multimodal Mismatch (Tables, Code, Images)

    • Issue: Text-only embeddings miss key information in tables/diagrams or code semantics.

    • Fix: Use modality-aware parsing (table extractors, code-aware chunking) and specialized embeddings.

  • Tool/Agentic Loops

    • Issue: Iterative retrieval/generation loops drift off-topic and waste tokens.

    • Fix: Add step limits and intermediate grounding checks; summarize and refocus between hops.
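Two of the fixes above, deduplication at indexing time and ACL enforcement at query time, are small enough to sketch directly. The `allowed_groups` field and the deny-by-default rule for results without an ACL are assumptions of this example. Hash-based deduplication only catches copies that are identical after normalization, so near-duplicates would need a similarity threshold instead.

```python
import hashlib


def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Drop chunks identical to an earlier one after case and whitespace normalization."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        normalized = " ".join(chunk.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


def filter_by_acl(results: list[dict], user_groups: set[str]) -> list[dict]:
    """Keep results whose allowed groups intersect the caller's groups.

    Results without an `allowed_groups` entry are dropped (deny by default).
    """
    return [r for r in results if user_groups & set(r.get("allowed_groups", []))]


print(dedupe_chunks(["Hello  World", "hello world", "Other"]))

# Hypothetical retrieval results with per-document ACL metadata.
results = [
    {"id": 1, "allowed_groups": ["hr"]},
    {"id": 2, "allowed_groups": ["eng", "ops"]},
    {"id": 3},
]
print([r["id"] for r in filter_by_acl(results, {"eng"})])  # [2]
```

Applying the ACL filter inside the retriever, before passages reach the prompt, keeps restricted text out of the model's context entirely.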