Retrieval Is Not Intelligence | Dhruvil Patel

June 7, 2026

Q: Why is my RAG system stuck at 80% accuracy no matter what I improve?

Audit your failures before tuning anything. In most enterprise systems, the residual failures are dominated by assembly questions - aggregations, joins, temporal diffs, absence checks - whose answers exist in no chunk. Retrieval improvements only address the minority of failures where a stated answer wasn't found.
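A failure audit can be as simple as labelling each failing question and counting. The sketch below is a minimal illustration, not the author's tooling: the questions, the labels, and the `audit` helper are all hypothetical, and the category names just reuse the vocabulary above.

```python
from collections import Counter

# Hypothetical hand-labelled failures: (question, category).
# The rows are invented for illustration only.
FAILURES = [
    ("What is the payment term in the Acme MSA?", "lookup"),
    ("What is our total liability across active contracts?", "aggregation"),
    ("Which vendors have an expired compliance certificate?", "join"),
    ("What changed between the 2023 and 2024 amendments?", "temporal"),
    ("Does any contract grant exclusivity?", "absence"),
    ("How many invoices exceed their contract cap?", "aggregation"),
]

# Categories whose answers must be constructed rather than found.
ASSEMBLY = {"aggregation", "join", "temporal", "absence"}


def audit(failures):
    """Return per-category counts and the share of failures that are assembly questions."""
    counts = Counter(category for _, category in failures)
    assembly = sum(n for category, n in counts.items() if category in ASSEMBLY)
    share = assembly / len(failures) if failures else 0.0
    return counts, share


counts, assembly_share = audit(FAILURES)
for category, n in counts.most_common():
    print(f"{category:12s} {n}")
print(f"assembly share: {assembly_share:.0%}")
```

If the assembly share dominates, as it does in this toy data, further retrieval tuning can only move the small lookup slice.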

Q: What's the difference between lookup and assembly questions?

Lookup: the answer is written somewhere; the task is finding it. Assembly: the answer must be constructed - summed, joined, diffed, or inferred - from evidence spread across documents that individually don't resemble the question. Retrieval solves lookup; assembly needs structure and computation built at ingestion time.

Q: Can better embedding models fix multi-hop and aggregation failures?

Partially for multi-hop: QA-trained embedders bridge mild question-evidence gaps. Not for aggregation, absence, or temporal questions - their gap is structural, not representational. No embedding places "total liability" near two hundred scattered clauses in a way that computes their sum.
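To make the structural point concrete: once liability clauses have been extracted into a table at ingestion, the sum is a query rather than a search. This is a sketch under stated assumptions. The table name `liability_clause`, its columns, and the sample rows are hypothetical, and it presumes an extractor has already produced the rows.

```python
import sqlite3

# Hypothetical rows an ingestion-time extractor might have written:
# (contract_id, clause_ref, liability_cap_usd). Values are invented.
EXTRACTED = [
    ("C-001", "12.3", 250_000),
    ("C-002", "9.1", 100_000),
    ("C-003", "14.2", 75_000),
]


def build_liability_table(rows):
    """Load extracted clauses into an in-memory derived table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE liability_clause ("
        "contract_id TEXT, clause_ref TEXT, cap_usd INTEGER)"
    )
    conn.executemany("INSERT INTO liability_clause VALUES (?, ?, ?)", rows)
    return conn


def total_liability(conn):
    """Sum the caps and return the clause count so the answer can state its scope."""
    total, n = conn.execute(
        "SELECT COALESCE(SUM(cap_usd), 0), COUNT(*) FROM liability_clause"
    ).fetchone()
    return total, n


conn = build_liability_table(EXTRACTED)
total, n_clauses = total_liability(conn)
print(f"total liability cap: {total} USD across {n_clauses} clauses")
```

The arithmetic happens in the database, so the answer does not depend on which clauses happened to rank in a top-k result.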

Q: Is GraphRAG or a knowledge graph worth it?

Worth it when your failure audit shows join and traversal questions - and only as a narrow, validated graph over the entities your questions actually traverse. An aspirational everything-ontology fails; vendors-contracts-amendments-obligations with clean edges pays for itself quickly.
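A narrow graph of this kind does not need a graph database to illustrate. The following sketch uses plain typed edges over vendors, contracts, amendments, and obligations; the relation names (`HAS_CONTRACT`, `AMENDED_BY`, `IMPOSES`) and all identifiers are assumptions made up for the example, not a schema from the essay.

```python
from collections import defaultdict

# Typed edges: (source, relation, target). Identifiers are hypothetical.
EDGES = [
    ("vendor:acme", "HAS_CONTRACT", "contract:C-001"),
    ("vendor:acme", "HAS_CONTRACT", "contract:C-002"),
    ("contract:C-001", "AMENDED_BY", "amendment:A-7"),
    ("amendment:A-7", "IMPOSES", "obligation:quarterly-audit"),
    ("contract:C-002", "IMPOSES", "obligation:net-30-payment"),
]


def build_index(edges):
    """Index edges by (source, relation) for constant-time hops."""
    index = defaultdict(list)
    for src, rel, dst in edges:
        index[(src, rel)].append(dst)
    return index


def follow(index, starts, relation):
    """One traversal hop: all targets reachable from `starts` via `relation`."""
    return [dst for node in starts for dst in index.get((node, relation), [])]


def obligations_for_vendor(index, vendor):
    """Join vendor -> contracts -> amendments -> obligations."""
    contracts = follow(index, [vendor], "HAS_CONTRACT")
    amendments = follow(index, contracts, "AMENDED_BY")
    # Obligations may come from a contract directly or from its amendments.
    return sorted(follow(index, contracts + amendments, "IMPOSES"))


index = build_index(EDGES)
print(obligations_for_vendor(index, "vendor:acme"))
```

The obligation introduced by the amendment never mentions the vendor, which is why a traversal finds it and a similarity search on the vendor's name would likely not.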

Q: How should a system answer "does X exist?" questions?

Never via retrieval alone - similarity search cannot verify absence. Absence answers require enumeration over a defined universe with known extraction coverage ("all 214 active contracts examined; none contain exclusivity language"), and the answer must state its scope and confidence, because "no" and "not found" are different claims.
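One way to keep "no" and "not found" apart is to enumerate the universe and report extraction coverage with the verdict. The sketch below assumes a hypothetical `ContractRecord` with an `extracted` flag set at ingestion; the field names and the three verdict strings are illustrative choices, not an established API.

```python
from dataclasses import dataclass


@dataclass
class ContractRecord:
    contract_id: str
    extracted: bool                 # did clause extraction succeed for this document?
    has_exclusivity: bool = False   # only meaningful when extracted is True


def check_absence(universe):
    """Answer 'does any contract contain exclusivity language?' with explicit scope."""
    examined = [c for c in universe if c.extracted]
    hits = [c.contract_id for c in examined if c.has_exclusivity]
    unexamined = [c.contract_id for c in universe if not c.extracted]
    if hits:
        verdict = "present"
    elif unexamined:
        verdict = "not found"   # weaker claim: part of the universe was not examined
    else:
        verdict = "absent"      # closed-world "no" over the whole universe
    return {
        "verdict": verdict,
        "examined": len(examined),
        "total": len(universe),
        "hits": hits,
        "unexamined": unexamined,
    }


full = [ContractRecord("C-001", True), ContractRecord("C-002", True)]
partial = [ContractRecord("C-001", True), ContractRecord("C-002", False)]
print(check_absence(full))
print(check_absence(partial))
```

Only the fully examined universe earns the verdict "absent"; the partially examined one is reported as "not found" together with the list of contracts that were never checked.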

Q: What does this essay explain about the plateau?

Every RAG system I have ever built or reviewed has the same learning curve, and nobody talks about its shape. It goes like this. The first month is euphoric - you go from nothing to answering real questions over real documents, and stakeholders see magic. The next few months are solid engineering progress: better chunking, hybrid search, a reranker, and you climb from rough to genuinely good.

Q: What does this essay explain about upgrading the wrong component?

First, the part where we wasted a quarter, so you don't have to. The system was a document intelligence platform for procurement and vendor management. Contracts, invoices, compliance certificates, correspondence - the usual enterprise sediment, tens of thousands of documents deep.

Q: What does this essay explain about the night we read every failure?

The turning point wasn't clever. It was a spreadsheet. One evening, tired of tuning, two of us pulled all 190 failing questions from the eval set and read them. Not the retrieval scores. The *questions* - and what a correct answer would actually require.

Your RAG system plateaued because similarity search answers lookup questions, and the questions worth money are assembly questions. A failure taxonomy.

Key Takeaways

RAG accuracy plateaus in the low eighties because the residual failures are mostly assembly questions - aggregation, joins, temporal diffs, absence - whose answers exist in no retrievable chunk.
Retrieval is a resemblance detector, and assembly questions have a question-evidence gap: their evidence doesn't resemble them. Retrieval therefore fails quietly, returning question-shaped material instead of answer-bearing material, which the model then fluently synthesizes into misleading partial answers.
The question ladder (lookup → multi-hop → aggregation → temporal → absence → synthesis) locates each failure and the capability it demands.
Breaking the plateau means answering at write time: extract structure at ingestion - knowledge graph edges for joins, derived tables for aggregates, versioned states for diffs - and demote retrieval to an evidence-citation tool inside an agentic loop for the long tail.
Absence deserves special care: it's a closed-world claim requiring enumeration and explicit epistemics, not a retrieval result.
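
As an illustration of the "versioned states for diffs" takeaway, a temporal question reduces to comparing two stored states once each contract version has been captured at ingestion. This is a minimal sketch: the `VERSIONS` data, the field names, and both helper functions are hypothetical.

```python
from datetime import date

# Hypothetical versioned states for one contract: (effective_date, fields).
VERSIONS = [
    (date(2023, 1, 1), {"payment_terms": "net 30", "liability_cap_usd": 100_000}),
    (
        date(2024, 1, 1),
        {"payment_terms": "net 45", "liability_cap_usd": 100_000, "auto_renew": True},
    ),
]


def state_as_of(versions, when):
    """Latest state whose effective date is on or before `when`, else None."""
    current = None
    for effective, fields in sorted(versions, key=lambda v: v[0]):
        if effective <= when:
            current = fields
    return current


def diff_states(old, new):
    """Map each changed field to its (before, after) pair; None means the field was not set."""
    old, new = old or {}, new or {}
    changes = {}
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


changes = diff_states(
    state_as_of(VERSIONS, date(2023, 6, 30)),
    state_as_of(VERSIONS, date(2024, 6, 30)),
)
print(changes)
```

The diff is computed over structured states, so retrieval is only needed afterwards, to cite the clauses behind each changed field.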
