Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Give a language model an open book.
A language model keeps its facts locked inside its weights, where you cannot see them or edit them. RAG bolts on a second memory it can actually read: a searchable index of Wikipedia, consulted fresh for every question.
What if a model didn't have to memorize every fact, and could look things up instead?
Ask a large language model who won some minor election last year and it will often answer with total confidence and a completely invented name. The facts a model knows are the facts that happened to get pressed into its weights while it trained. That store has a name worth keeping: parametric memory, knowledge held in the parameters. It is genuinely impressive, and it is also a sealed box. You cannot look inside to see which fact the model used. You cannot add a fact, or correct one, without retraining. And when the model does not know something, it rarely admits it. It guesses, which is the polite term for a hallucination.
The fix is the oldest one there is, which is to let the model use a book. Keep a separate pile of text, a searchable collection of documents, sitting outside the weights. Call that non-parametric memory: knowledge held in data you can read and edit. Before the model answers, it looks something up, and then it writes its answer with the relevant page open in front of it.
Retrieval-Augmented Generation, RAG, is the recipe that married these two memories for ordinary text generation. The parametric half is a pre-trained sequence-to-sequence model; the non-parametric half is a dense index of Wikipedia reached through a neural retriever; and the two are fine-tuned together. Two earlier systems, REALM and ORQA, had already wired a learned retriever to a masked language model, but only for extractive question answering, where the answer is a span copied straight out of a retrieved document. RAG brought the open book to a general generator, one that writes its answer rather than copying it.
We build it up in pieces: how a question finds the right pages, why the model reads several pages at once instead of betting on one, the two ways it can blend them, how the retriever learns which pages help without ever being told, and what the whole arrangement buys you in the end.
Two kinds of memory
Picture a closed-book exam. Everything you can write down has to already be in your head, and if you misremember a date, nobody can tell whether you reasoned badly or recalled wrong. A language model with only parametric memory sits that exam every time. It is a remarkable student. It has read most of the internet. But it carries the three weaknesses of any closed-book taker: it cannot look anything up, it cannot show its source, and under pressure it will confidently write down something that is not true.
Now hand the same student a library and let the exam go open-book. Three things change at once. The student can answer questions about events that happened after they studied, because the facts live on the shelf, not in their memory. The student can point at the page they used, so a grader can check the work. And when the library simply does not cover it, that is visible too, instead of being papered over with a guess. Non-parametric memory is that library: a body of text the model consults at answer time rather than swallowing in advance.
This matters most for what the paper calls knowledge-intensive tasks, the ones a person could not reasonably do without looking something up. Who held a particular office, what a specific study found, whether a claim checks out against the record. The knowledge is too specific and too changeable to expect any fixed set of weights to hold it all. The interesting question is how to fuse the two memories so the model can write fluent, general answers and stay anchored to real documents. That fusion is RAG.
Retrieve, then generate
The whole system is two learned components and one fixed library, run in sequence. First a retriever reads the question and pulls a handful of likely-relevant passages off the shelf. Then a generator reads the question together with those passages and writes the answer. Both are pre-trained models that already know how to do their jobs; fine-tuning teaches them to do those jobs for each other.
Two probability distributions name the two halves. The retriever is $p_\eta(z \mid x)$: given the question $x$, how likely is passage $z$ to be the one worth reading, with the retriever's parameters written $\eta$. The generator is $p_\theta(y \mid x, z)$: given the question and a passage, how likely is the answer $y$, with the generator's parameters $\theta$. The retrieved passage $z$ is never observed in the training data. It is a latent variable, an intermediate quantity the model introduces on its own and has to reason about, and almost everything interesting about RAG comes from how it handles that.
The pipeline, end to end. A question is encoded into a vector, a fast search over the document index lights up the top few passages, those passages and the question flow into the generator, and the generator's predictions are combined into one answer.
The generator is BART, a pre-trained encoder-decoder Transformer (a sequence-to-sequence, or seq2seq, model: read an input sequence and write an output sequence) with roughly 400 million parameters. It was pre-trained to repair deliberately corrupted text, which makes it a strong general writer. To feed it a retrieved passage, RAG does the simplest possible thing: it concatenates the passage onto the question and lets BART read both. Nothing clever. The cleverness is all in the retriever and in how the passages are combined, which is where we go next.
Finding the right pages
How do you search 21 million passages for the few that bear on a question, in a fraction of a second, when the question and the passage may not share a single word? The answer is to stop matching words and start matching meaning, by turning text into geometry.
An embedding is a vector, a list of numbers, that a model assigns to a piece of text so that text with similar meaning lands at nearby points. RAG's retriever is DPR, Dense Passage Retrieval, and it uses a bi-encoder: two separate networks, each a BERT-base model. One embeds documents, $\mathbf{d}(z) = \mathrm{BERT}_d(z)$; the other embeds the question, $\mathbf{q}(x) = \mathrm{BERT}_q(x)$. The relevance of a passage is the inner product of the two vectors, large when they point the same way. Exponentiate and you have a score you can read as a probability:

$$p_\eta(z \mid x) \propto \exp\big(\mathbf{d}(z)^\top \mathbf{q}(x)\big) \tag{1}$$
Think of a vast library where books about the same thing sit on nearby shelves, in every direction at once. The question is a coordinate dropped into that space, and the passages you want are the ones standing closest to it. Finding those nearest neighbors, the passages whose embedding vectors sit closest to the question's, among 21 million points is a Maximum Inner Product Search (MIPS), and there are index structures (RAG uses FAISS with an HNSW graph) that solve it approximately in sub-linear time, so you never score all 21 million. The name is heavier than the idea: MIPS is nothing more than "find the vectors most aligned with the question," run fast enough to work at that scale.
The equation hides a detail. That softmax is taken over the top-$K$ passages the search returned, never over all 21 million; normalizing across the whole index would be hopeless, and dodging it is the entire point of retrieving a short list first. The $\exp$ in (1) is RAG recasting DPR's raw inner product as a probability. DPR on its own ranks passages by that dot product. The document embeddings are computed once with a pre-trained DPR encoder (itself trained to find answer-bearing passages for Natural Questions and TriviaQA) and then frozen into the index, which will matter a great deal when we get to training.
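A minimal sketch of that scoring step may help. It assumes tiny made-up 3-dimensional embeddings and an exhaustive scan in place of the approximate FAISS search; the helper names (`dot`, `softmax`, `retrieve`) are illustrative, not part of DPR or RAG.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def retrieve(query_vec, doc_vecs, k):
    """Return (doc ids, weights) for the top-k passages by inner product.

    This scans every document, which a real index avoids; the point is
    only that the softmax runs over the k survivors, not the whole index.
    """
    scored = sorted(((dot(d, query_vec), i) for i, d in enumerate(doc_vecs)),
                    reverse=True)
    top = scored[:k]
    ids = [i for _, i in top]
    weights = softmax([s for s, _ in top])
    return ids, weights

# Toy embeddings (made-up numbers, not real DPR outputs).
doc_vecs = [
    [0.9, 0.1, 0.0],   # doc 0: closely aligned with the query
    [0.8, 0.3, 0.1],   # doc 1: also aligned
    [0.0, 1.0, 0.2],   # doc 2: mostly unrelated
    [-0.5, 0.2, 0.9],  # doc 3: points away from the query
]
query_vec = [1.0, 0.2, 0.0]

ids, weights = retrieve(query_vec, doc_vecs, k=2)
print(ids, [round(w, 3) for w in weights])
```

The weights always sum to one over the shortlist, whatever the rest of the index contains, which is why a passage that misses the top $K$ contributes nothing at all.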
Don't bet on one page
The retriever hands back several passages, not one. The obvious move is to keep the top passage and throw the rest away. RAG does not do that, and the reason is the heart of the method.
You do not know which retrieved passage holds the answer. Maybe the second one does. Maybe the answer is spread across two of them. Betting everything on the top hit throws that uncertainty away. So instead of choosing, RAG keeps all $K$ passages and lets each one vote. Each passage produces its own answer through the generator, and the votes are added up, weighted by how relevant the retriever judged each passage to be. There is no safety net for a wrong choice either: nothing in the training data says which passage was right, so committing to the top hit would gamble the whole training signal on a guess. Spreading the bet across all of them, each in proportion to its retrieval score, hedges it, and that hedge is what the operation below formalizes. In the simplest version, called RAG-Sequence, one passage is assumed responsible for the whole answer, and the probability of an answer is that weighted sum:

$$p_{\text{RAG-Sequence}}(y \mid x) \approx \sum_{z \in \text{top-}K(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\, p_\theta(y \mid x, z) = \sum_{z \in \text{top-}K(p_\eta(\cdot \mid x))} p_\eta(z \mid x) \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1}) \tag{2}$$
This operation, summing a quantity over every value a hidden variable could take, each weighted by its probability, is called marginalizing out the latent $z$. It is the standard way to deal with something you cannot observe: account for every possibility in proportion to how likely it is. Picture a panel of advisors, each having read a different passage. You ask all of them, and you trust each one's answer in proportion to how on-topic their passage was. The right-hand expansion spells out that $p_\theta(y \mid x, z)$ is itself a product of per-token probabilities, the usual way a generator scores a whole sequence.
This is not a vague safety margin; it is a principled move. Treat "which passage holds the answer" as an unknown with a probability attached, namely the retrieval weight $p_\eta(z \mid x)$, and the correct way to score the answer is to average the generator's answer over that unknown, weighting each passage by how likely it is to be the right one. That is Bayesian averaging, and (2) is exactly it. The top-1 passage is wrong often enough that committing to it gambles the whole answer on a single guess that came with no label confirming it; averaging over the top $K$ spends the retriever's own confidence as the hedge, so a near-miss at rank 1 is recovered by a correct passage at rank 2 or 3 instead of being lost.
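A small worked example makes the hedge concrete. The numbers are invented for illustration: three retrieved passages, a two-token gold answer, and a top-ranked passage that happens to be off-topic.

```python
import math

def sequence_prob(token_probs):
    """p_theta(y | x, z): product of per-token probabilities for one passage."""
    return math.prod(token_probs)

def rag_sequence(p_eta, per_doc_token_probs):
    """Weighted sum over passages of p_eta(z|x) * p_theta(y|x,z)."""
    return sum(w * sequence_prob(tp)
               for w, tp in zip(p_eta, per_doc_token_probs))

# Made-up values for K = 3 passages and a two-token gold answer.
p_eta = [0.5, 0.3, 0.2]   # retriever weights over the shortlist, sum to 1
per_doc_token_probs = [
    [0.1, 0.2],   # rank 1: off-topic, the generator is unsure
    [0.9, 0.8],   # rank 2: actually holds the answer
    [0.3, 0.3],   # rank 3: loosely related
]

marginal = rag_sequence(p_eta, per_doc_token_probs)
top1_only = sequence_prob(per_doc_token_probs[0])
print(f"marginal over top-3: {marginal:.3f}")
print(f"top-1 passage only:  {top1_only:.3f}")
```

With these numbers the marginal is 0.5·0.02 + 0.3·0.72 + 0.2·0.09 = 0.244, against 0.02 for the top hit alone: the rank-2 passage rescues the answer even though the retriever ranked it second.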
The hedge is only over $K$ passages, never the full 21 million, and that cutoff is a real tradeoff. Pick $K$ too small and the right passage may never make the shortlist, so no amount of averaging can recover it and the gradient has nothing correct to reinforce. Pick $K$ too large and the sum fills with off-topic passages that each carry a small but nonzero weight, diluting the average with noise and blunting the signal that should sharpen onto the passage that helped. RAG trains with $K = 5$ or $K = 10$, small enough to stay cheap and clean, large enough that the answer-bearing passage is usually somewhere in the list.
Now the question that should be nagging. The retriever is supposed to learn to find good passages, but the training data only ever contains questions and gold answers. Nobody writes down which passage was the right one to retrieve. So what teaches the retriever?
The marginal does. Because $p_\eta(z \mid x)$ sits inside the sum in (2), the gradient of the answer's likelihood flows back into the retriever weights, and retrieval, which feels like a discrete, un-learnable lookup, becomes a smooth, differentiable knob. That is the quiet trick that makes RAG trainable end to end.
The gradient flows in a specific, sensible direction: a passage's weight is pushed up in proportion to how much that passage raised the probability of the gold answer, relative to the others. A passage whose generator confidently produced the correct answer gets reinforced, a distractor that led nowhere gets suppressed, and the retriever, never told which passage was right, infers it from which passages helped.
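That direction can be written down and checked numerically. If the retriever weights are a softmax over raw scores, differentiating the log of the marginal with respect to one passage's score gives that passage's share of the answer probability minus its current weight. The sketch below holds the generator probabilities fixed at made-up values and runs plain gradient ascent on the scores alone, a simplification of real training, which also updates the generator.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_marginal(scores, gen_probs):
    """log sum_z p_eta(z|x) * p_theta(y|x,z), with p_eta = softmax(scores)."""
    p_eta = softmax(scores)
    return math.log(sum(p * g for p, g in zip(p_eta, gen_probs)))

def score_gradient(scores, gen_probs):
    """d log p(y|x) / d score_z = p_eta(z) * g_z / marginal - p_eta(z)."""
    p_eta = softmax(scores)
    marginal = sum(p * g for p, g in zip(p_eta, gen_probs))
    return [p * g / marginal - p for p, g in zip(p_eta, gen_probs)]

# Made-up generator probabilities of the gold answer under each passage.
gen_probs = [0.02, 0.72, 0.09]
scores = [0.0, 0.0, 0.0]          # the retriever starts with flat weights

first_grad = score_gradient(scores, gen_probs)

# Gradient ascent on the scores only; no label says passage 1 is correct.
for _ in range(50):
    grad = score_gradient(scores, gen_probs)
    scores = [s + g for s, g in zip(scores, grad)]

final_weights = softmax(scores)
print([round(g, 3) for g in first_grad])
print([round(w, 3) for w in final_weights])
```

The passage that made the gold answer likely gets a positive gradient and the others a negative one, and the gradients sum to zero: the retriever only ever reshuffles weight among the passages it retrieved.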
One source, or one per word
RAG-Sequence assumes a single passage is behind the entire answer. That is the right assumption for a short factual reply, but it is limiting when an answer needs to weave together facts that live in different documents. So the paper offers a second model, RAG-Token, that marginalizes at a finer grain: it can draw a different passage for every token it writes. The sum over passages moves inside the product over positions:

$$p_{\text{RAG-Token}}(y \mid x) \approx \prod_{i=1}^{N} \sum_{z \in \text{top-}K(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1}) \tag{3}$$
Compare (2) and (3) and the difference is exactly where the sum sits. In RAG-Sequence the sum is outside: pick one passage, write the whole answer, then average over which passage it was. In RAG-Token the sum is inside: at each token, blend all the passages, then take the next step. RAG-Sequence is a book report from a single source. RAG-Token is an essay that can cite a different source in each sentence. That difference makes a prediction before any benchmark does: Token should win when an answer needs facts stitched from different passages, Sequence should hold its own when one passage carries the whole answer, because re-voting at every word only pays when the right source changes mid-sentence.
The paper has a lovely picture of this. Asked to write a Jeopardy clue for the answer "Hemingway," the model produces a sentence naming two of his books and the prize he won, and you can watch which passage drives each token. The passage about A Farewell to Arms spikes as that title begins; the passage about The Sun Also Rises takes over as the next one starts.
Notice what happens after the first word of each title. The posterior over documents flattens. Once "The Sun" is on the page, the generator can finish "Also Rises" from its own parametric memory, without leaning on any particular passage. The two memories are working together: the retrieved document supplies the specific fact, and the weights fill in the rest. That cooperation is the reason RAG-Token tends to win when an answer combines several facts, and the paper finds exactly that on Jeopardy generation.
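The prediction that Token wins on stitched answers and Sequence holds its own on single-source ones can be checked on toy numbers. The sketch below assumes two passages with equal retrieval weight and invented per-token probabilities, and it treats those probabilities as fixed regardless of the prefix, which a real generator would not.

```python
import math

def rag_sequence(p_eta, token_probs):
    """Sum outside: one passage writes the whole answer, then average."""
    return sum(w * math.prod(tp) for w, tp in zip(p_eta, token_probs))

def rag_token(p_eta, token_probs):
    """Sum inside: blend all passages at each position, then multiply."""
    result = 1.0
    for i in range(len(token_probs[0])):
        result *= sum(w * tp[i] for w, tp in zip(p_eta, token_probs))
    return result

p_eta = [0.5, 0.5]

# Case 1: each passage supports a different token of the answer.
split = [[0.9, 0.1], [0.1, 0.9]]
# Case 2: one passage supports both tokens, the other supports neither.
single = [[0.9, 0.9], [0.1, 0.1]]
# Case 3: a one-token answer, as in classification.
one_token = [[0.7], [0.2]]

for name, probs in [("split", split), ("single", single), ("one_token", one_token)]:
    print(name, round(rag_sequence(p_eta, probs), 4), round(rag_token(p_eta, probs), 4))
```

When the facts are split across passages, RAG-Token scores the answer at 0.25 against 0.09 for RAG-Sequence; when one passage carries everything, the order flips to 0.25 against 0.41; and for a one-token answer the two agree.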
Both models reduce to the same thing for classification. Treat the class label as a one-token answer and the sum-inside and sum-outside forms coincide, so RAG handles "is this claim true or false" with the same machinery it uses to write. In code, the only difference between the two is whether you marginalize once for the sequence or once per step:
```python
# One RAG answer: retrieve, run the generator per document, marginalize.
# query_encoder, mips_topk, softmax, dot, p_theta, and p_theta_next stand in
# for the real model components; x is the question and index the document index.
K = 5
q = query_encoder(x)                            # the query vector q(x)
docs = mips_topk(index, q, k=K)                 # top-K passages by inner product d·q
p_eta = softmax([dot(d.vec, q) for d in docs])  # retriever weights pη(z|x)

# RAG-Sequence: one document per whole answer, averaged over documents
def seq_prob(y):
    return sum(p_eta[i] * p_theta(y, x, docs[i]) for i in range(K))

# RAG-Token: a fresh document mixture at every step
def tok_next(y_prev):
    return sum(p_eta[i] * p_theta_next(x, docs[i], y_prev) for i in range(K))
```

No answer key for retrieval
Training is the part that should sound impossible. It is routine. You have pairs of questions and gold answers, $(x_j, y_j)$, and nothing else. No labels saying which passage to fetch. The objective is to make the gold answers likely under the marginal, which means minimizing the negative log of the marginal probability summed over the training set:

$$\sum_j -\log p(y_j \mid x_j)$$
That single objective trains both halves at once, with plain Adam. The generator learns to write good answers from retrieved passages, and the retriever, through the gradient of the marginal described earlier, learns to fetch passages that make good answers likely. The supervision for retrieval is entirely indirect, coming from the answer, not from any annotation of the evidence.
The training is affordable because RAG freezes the expensive half. Re-embedding 21 million documents every time the document encoder changed would mean rebuilding the whole search index over and over, which is the bill REALM pays during its pre-training. RAG never pays it: the document encoder and the index stay frozen, and fine-tuning touches only the query encoder and the generator. The retriever still improves, because what matters is the relative geometry, not which side moves. Pulling the question vectors toward the right documents changes the nearest neighbors exactly as well as pushing the documents toward the question would, turning the compass instead of rotating the map. The paper finds this is plenty; re-indexing the documents is not worth the cost.
```python
# One training step: nobody labels which document is correct.
# sample_pair, query_encoder, mips_topk, marginal_prob, log, and opt stand in
# for the real components.
x, y = sample_pair()                     # question, gold answer
q = query_encoder(x)                     # only this encoder is trained
docs = mips_topk(index, q, k=5)          # the doc encoder + index are frozen
loss = -log(marginal_prob(y, x, docs))   # negative marginal log-likelihood
opt.zero_grad()                          # clear gradients from the previous step
loss.backward()                          # gradient reaches q through pη
opt.step()                               # updates query encoder + generator
```

Decoding, turning the model into an actual answer at test time, differs between the two variants. RAG-Token is a standard autoregressive generator (one that writes the answer one token at a time, each token conditioned on the last) once you fold the per-token document mixture into its transition probability, so ordinary beam search works. RAG-Sequence does not factor so neatly, since a whole-sequence marginal cannot be scored token by token, so the paper runs beam search separately for each document and then collects the candidates. The careful version ("Thorough Decoding") runs an extra forward pass to score any candidate a given document's beam missed; the cheap version ("Fast Decoding") assumes a missing candidate has probability zero. A small engineering knob, with the usual accuracy-for-speed trade.
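The two RAG-Sequence decoding modes differ in one line, which a toy sketch can show. Everything here is assumed for illustration: each document's beam is a dict from candidate answer to generator probability, and the hypothetical `score_fn(z, y)` stands in for the extra forward pass that scores candidate `y` under document `z`.

```python
def fast_decode(p_eta, beams):
    """Treat a candidate missing from a document's beam as probability zero."""
    candidates = set().union(*beams)
    scores = {y: sum(w * beam.get(y, 0.0) for w, beam in zip(p_eta, beams))
              for y in candidates}
    return max(scores, key=scores.get), scores

def thorough_decode(p_eta, beams, score_fn):
    """Score missing candidates with an extra generator pass per document."""
    candidates = set().union(*beams)
    scores = {}
    for y in candidates:
        total = 0.0
        for z, (w, beam) in enumerate(zip(p_eta, beams)):
            p = beam[y] if y in beam else score_fn(z, y)  # the extra forward pass
            total += w * p
        scores[y] = total
    return max(scores, key=scores.get), scores

# Made-up example: two documents, beam size 2 each.
p_eta = [0.6, 0.4]
full_probs = [
    {"Paris": 0.50, "Lyon": 0.30, "Nice": 0.05},   # document 0
    {"Lyon": 0.45, "Nice": 0.30, "Paris": 0.25},   # document 1
]
beams = [
    {"Paris": 0.50, "Lyon": 0.30},   # document 0's beam never saw "Nice"
    {"Lyon": 0.45, "Nice": 0.30},    # document 1's beam never saw "Paris"
]

fast_best, fast_scores = fast_decode(p_eta, beams)
thorough_best, thorough_scores = thorough_decode(
    p_eta, beams, lambda z, y: full_probs[z][y])
print("fast:", fast_best, "thorough:", thorough_best)
```

In this example the fast mode undercounts "Paris" because one beam dropped it, and picks "Lyon"; the thorough mode recovers the missing 0.4·0.25 and picks "Paris". The price is one additional generator pass for every candidate a beam missed.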
What it actually does
On open-domain question answering, where the model gets a question and must produce the answer with no passages handed to it, RAG sets the state of the art on Natural Questions, WebQuestions, and CuratedTrec, beating both the closed-book giants (an 11-billion-parameter T5) and the open-book retrieve-and-extract pipelines. The metric is Exact Match, the fraction of questions whose generated answer string matches the gold answer exactly. And it does so while generating answers rather than extracting spans. A generator can assemble an answer from clues scattered across several passages, and it can even produce a correct answer that appears verbatim in none of the retrieved documents. On Natural Questions, RAG still reaches 11.8% accuracy on the questions where no retrieved document contains the answer, exactly the cases where an extractive reader, which can only copy, scores zero.
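For reference, here is a minimal sketch of an Exact Match scorer. The normalization (lowercasing, dropping punctuation and English articles, collapsing whitespace) is an assumption borrowed from common QA evaluation practice, not something the text above specifies, and the function names are illustrative.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """True if the prediction equals any gold answer after normalization."""
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

def em_score(predictions, references):
    """Fraction of questions answered with an exact match."""
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return hits / len(predictions)

predictions = ["The Eiffel Tower.", "Lyon"]
references = [["Eiffel Tower"], ["Marseille", "Marseilles"]]
print(em_score(predictions, references))
```

Exact Match gives no partial credit, so a generator that paraphrases a correct answer is scored as wrong; that makes the metric a conservative one for a model that writes rather than copies.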
TriviaQA carries one caveat the paper is careful about, and so are we. RAG's headline TriviaQA win is on the split designed to compare against T5; on the standard open-domain split, the plain DPR system edges it out. The honest summary is "state of the art on three of the four QA datasets," not all four.
The generation results are where the open book shows its character. On the MS-MARCO abstractive QA task, RAG-Sequence beats the BART baseline by 2.6 points on both BLEU-1 and ROUGE-L (overlap-with-reference scores for generation quality). On Jeopardy question generation, RAG-Token wins, and human raters preferred its output by a wide margin: shown a RAG clue and a BART clue, they judged RAG more factual in 42.7% of pairs versus 7.1% for BART, and more specific in 37.4% versus 16.8%. RAG also generated more diverse text, more distinct phrasings, without any of the usual diversity-promoting decoding tricks. Grounding the model in real passages made it both more accurate and less repetitive.
On FEVER fact verification, classifying whether Wikipedia supports or refutes a claim, RAG lands within 4.3% of the state of the art on the three-way task and within 2.7% on the two-way task. The systems it trails are elaborate pipelines trained with explicit supervision on which sentences are the evidence. RAG uses none of that. It retrieves its own evidence and is told only the claim and the verdict, and its retrieved documents still come from the correct Wikipedia article 71% of the time in the top hit and 90% of the time within the top ten.
Editing what it knows
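Because the knowledge lives in the index and not the weights, you can edit it without touching the model. Swap the library and you have changed what the model knows, instantly, with no retraining. The paper demonstrates this with a clean experiment. It builds two indices, one from a 2016 Wikipedia snapshot and one from 2018, takes 82 world leaders who changed in that window, and asks each model "Who is the {position}?" With the 2016 index, RAG answers 70% of the questions about 2016 leaders correctly; with the 2018 index, 68% of the questions about 2018 leaders. Mismatch the index and the era and accuracy collapses: 12% for the 2018 index on 2016 leaders, and 4% for the 2016 index on 2018 leaders.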
The mismatched numbers are the real evidence. If RAG were answering from parametric memory, the index would not matter and the 2016 and 2018 results would look the same; instead the answer tracks the index almost perfectly, and falls apart when the index disagrees with the question's era. A model that recites would sail through the mismatch untouched. Only a model that reads can be sabotaged by handing it the wrong book, and it visibly is. That is also what gives RAG its provenance: you can point at the passage it retrieved and check whether the answer follows, which a purely parametric model can never offer. The model is reading the book.
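The hot swap is easy to mimic in miniature. The sketch below is a toy under loud assumptions: the town and names are fictional, a word-overlap lookup stands in for the dense retriever, and the "generator" simply returns the answer stored alongside the retrieved passage. What it does show is that the answering function never changes, only the index passed to it.

```python
def tokenize(text):
    """Lowercased word set with basic punctuation stripped."""
    return set(text.lower().replace("?", "").replace(".", "").split())

def retrieve(question, index):
    """Stand-in retriever: the passage sharing the most words with the question."""
    q = tokenize(question)
    return max(index, key=lambda passage: len(q & tokenize(passage["text"])))

def answer(question, index):
    """Stand-in generator: return the stored answer and the passage it came from."""
    passage = retrieve(question, index)
    return passage["answer"], passage["text"]

# Two snapshots of a fictional library; only the first passage differs.
index_old = [
    {"text": "The mayor of Exampleville is Alice Adams.", "answer": "Alice Adams"},
    {"text": "Exampleville is known for its clock tower.", "answer": "clock tower"},
]
index_new = [
    {"text": "The mayor of Exampleville is Bob Brown.", "answer": "Bob Brown"},
    {"text": "Exampleville is known for its clock tower.", "answer": "clock tower"},
]

question = "Who is the mayor of Exampleville?"
old_answer, old_source = answer(question, index_old)
new_answer, new_source = answer(question, index_new)
print(old_answer, "|", old_source)
print(new_answer, "|", new_source)
```

Returning the source passage next to the answer is the provenance half of the story: a reader can check the claim against the text that produced it.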
The limits are worth stating plainly. RAG hallucinates less, but it does not stop hallucinating; if the retrieved passage is wrong or the index is biased, the grounded answer inherits that. The retriever is initialized from a DPR model that did see retrieval supervision on Natural Questions and TriviaQA, so the "no labels" story applies to RAG's own fine-tuning, not to the entire history of its retriever. And dense retrieval is not always best: on FEVER, whose claims are heavy with named entities, a plain word-overlap retriever (BM25) wins, because exact entity matching is what that task rewards. The authors point at joint pre-training of the retriever and generator from scratch as the obvious next step, which is roughly the direction the field then took.
Step back and the idea is small and durable. A model does not have to hold every fact in its head. Give it a retriever and an index, treat the retrieved passage as a latent variable, marginalize over the top few, and train the whole thing on answers alone. The retriever learns what to fetch from nothing but whether the answers come out right, and the facts stay in a book you can correct and replace. Most of what people now call "RAG," in every chatbot that cites its sources, is a descendant of this one.
Questions you might still have
If nobody labels which document is correct, how does the retriever ever learn?
Through the marginal-likelihood gradient. A document’s weight rises in proportion to how much it raised the probability of the gold answer, so the documents that help get reinforced and the distractors fade. No retrieval labels are needed.
What is the real difference between RAG-Sequence and RAG-Token?
RAG-Sequence commits to one document for the whole answer; RAG-Token can draw a different document for each token. Token can stitch facts from several sources into one sentence, which is why it wins when an answer combines separate facts.
Does RAG have to retrain when the facts change?
No. The document encoder and index are frozen even during training, and at test time you can swap the whole index for a newer one. The answers update with no change to any weight. That is the hot-swap experiment.
Why generate the answer instead of extracting a span from a document?
Generation can combine clues spread across documents, and it can produce a correct answer that appears verbatim in none of them. On Natural Questions, RAG still reaches 11.8% accuracy on the questions where no retrieved document contains the answer, where an extractive reader would score zero.
Footnotes & further reading
- The paper: Lewis, Perez, Piktus, Petroni, Karpukhin, Goyal, Küttler, Lewis, Yih, Rocktäschel, Riedel, Kiela, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Facebook AI Research / UCL / NYU, NeurIPS 2020). Code and a demo live in HuggingFace Transformers.
- The retriever: Karpukhin et al., Dense Passage Retrieval for Open-Domain Question Answering (DPR). The bi-encoder, the dot-product relevance, and the pre-trained index RAG builds on.
- The generator: Lewis et al., BART: Denoising Sequence-to-Sequence Pre-training. BART-large is about 406M parameters; the RAG paper rounds it to 400M.
- The predecessors: Guu et al., REALM, and Lee et al., Latent Retrieval (ORQA), which paired a learned retriever with a masked language model for extractive QA.
- The search: Johnson et al., Billion-scale similarity search (FAISS), and Malkov & Yashunin, HNSW, the graph index RAG uses for fast MIPS.
- The benchmarks: open Natural Questions, TriviaQA, WebQuestions, and CuratedTrec for QA; MS-MARCO for abstractive QA; and FEVER for fact verification.