Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Give a language model an open book.

A language model keeps its facts locked inside its weights, where you cannot see them or edit them. RAG bolts on a second memory it can actually read: a searchable index of Wikipedia, consulted fresh for every question.

Explaining the paper: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Lewis, Perez, Piktus, et al. · Facebook AI Research · NeurIPS 2020 · arXiv:2005.11401

A model that has memorized everything still cannot tell you what it does not know. So let it look instead.

Ask a large language model who won some minor election last year and it will often answer with total confidence and a completely invented name. The facts a model knows are the facts that happened to get pressed into its weights while it trained. That store has a name: parametric memory, knowledge held in the parameters. It is impressive, and it is also a sealed box. You cannot look inside to see which fact the model used. You cannot add a fact, or correct one, without retraining. And when the model does not know something, it rarely admits it. It guesses, and a confident wrong guess is a hallucination.

The oldest remedy, then, is to let the model use a book. Keep a separate pile of text, a searchable collection of documents, sitting outside the weights. Call that non-parametric memory: knowledge held in data you can read and edit. Before the model answers, it looks something up, and then it writes its answer with the relevant page open in front of it.

Retrieval-Augmented Generation, RAG, is the recipe that married these two memories for ordinary text generation. The parametric half is a pre-trained sequence-to-sequence model; the non-parametric half is a dense index of Wikipedia reached through a neural retriever; and the two are fine-tuned together. Two earlier systems, REALM and ORQA, had already wired a learned retriever to a masked language model, but only for extractive question answering, where the answer is a span copied straight out of a retrieved document. RAG brought the open book to a general generator, one that writes its answer rather than copying it.

We build it up in pieces: how a question finds the right pages, why the model reads several pages at once instead of betting on one, the two ways it can blend them, how the retriever learns which pages help without ever being told, and what that arrangement delivers at the end.

Two kinds of memory

Picture a closed-book exam. Everything you can write down has to already be in your head, and if you misremember a date, nobody can tell whether you reasoned badly or recalled wrong. A language model with only parametric memory sits that exam every time. It is a remarkable student. It has read most of the internet. But it carries the three weaknesses of any closed-book taker: it cannot look anything up, it cannot show its source, and under pressure it will confidently write down something that is not true.

Now hand the same student a library and let the exam go open-book. Three things change at once. The student can answer questions about events that happened after they studied, because the facts live on the shelf, not in their memory. The student can point at the page they used, so a grader can check the work. And when the library simply does not cover it, that is visible too, instead of being papered over with a guess. Non-parametric memory is that library: a body of text the model consults at answer time rather than swallowing in advance.

The paper calls these knowledge-intensive tasks, the ones a person could not reasonably do without looking something up. Who held a particular office, what a specific study found, whether a claim checks out against the record. The knowledge is too specific and too changeable to expect any fixed set of weights to hold it all. The interesting question is how to fuse the two memories so the model can write fluent, general answers and stay anchored to real documents. RAG is one way to build that fusion.

Retrieve, then generate

The whole system is two learned components and one fixed library, run in sequence. First a retriever reads the question and pulls a handful of likely-relevant passages off the shelf. Then a generator reads the question together with those passages and writes the answer. Both are pre-trained models that already know how to do their jobs; fine-tuning teaches them to do those jobs for each other.

Two probability distributions name the two halves. The retriever is $p_\eta(z \mid x)$ : given the question $x$ , how likely is passage $z$ to be the one worth reading, with the retriever's parameters written $\eta$ . The generator is $p_\theta(y \mid x, z)$ : given the question and a passage, how likely is the answer $y$ , with the generator's parameters $\theta$ . The retrieved passage $z$ is never observed in the training data. It is a latent variable, an intermediate quantity the model introduces on its own and has to reason about, and almost everything interesting about RAG comes from how it handles that.

The pipeline, end to end. A question is encoded into a vector, a fast search over the document index lights up the top few passages, those passages and the question flow into the generator, and the generator's predictions are combined into one answer:

Figure 1 · the RAG pipeline

A question x is embedded by the query encoder; a Maximum Inner Product Search over a frozen 21M-passage index lights up the top-K passages; those passages plus the question are read by the BART generator, whose per-document predictions are combined into one answer y. The retriever pη(z|x) and generator pθ(y|x,z) are the only learned pieces.

The generator is BART, a pre-trained encoder-decoder Transformer (a sequence-to-sequence, or seq2seq, model: read an input sequence and write an output sequence) with roughly 400 million parameters. It was pre-trained to repair deliberately corrupted text, which makes it a strong general writer. To feed it a retrieved passage, RAG does the simplest possible thing: it concatenates the passage onto the question and lets BART read both. Nothing clever. The cleverness is all in the retriever and in how the passages are combined.

Finding the right pages

How do you search 21 million passages for the few that bear on a question, in a fraction of a second, when the question and the passage may not share a single word? Stop matching words and start matching meaning, by turning text into geometry.

An embedding is a vector, a list of numbers, that a model assigns to a piece of text so that text with similar meaning lands at nearby points. RAG's retriever is DPR, Dense Passage Retrieval, and it uses a bi-encoder: two separate networks, each a BERT-base model (a pre-trained Transformer encoder that maps text to a contextual vector). One embeds documents, $\mathbf{d}(z)$ ; the other embeds the question, $\mathbf{q}(x)$ . The relevance of a passage is the inner product of the two vectors, large when they point the same way. Exponentiate and you have a score you can read as a probability:

p_\eta(z \mid x) \;\propto\; \exp\!\big(\mathbf{d}(z)^\top \mathbf{q}(x)\big), \qquad \mathbf{d}(z) = \mathrm{BERT}_d(z), \quad \mathbf{q}(x) = \mathrm{BERT}_q(x)

(1)

The dense vector space works like a vast library where books about the same thing sit on nearby shelves, in every direction at once. The question is a coordinate dropped into that space, and the passages you want are the ones standing closest to it. Finding those nearest neighbors among 21 million points — the passages whose embedding vectors sit closest to the question's — is a Maximum Inner Product Search (MIPS), and there are index structures (RAG uses FAISS with an HNSW graph) that solve it approximately in sub-linear time, so you never score all 21 million. The name is heavier than the idea: MIPS is nothing more than "find the vectors most aligned with the question," run fast enough to work at that scale. Pick a question below and watch which passages it pulls in, and how the retrieval weights shift:

Figure 2 · retrieval as nearest neighbors

Every passage is a point in the dense vector space; the question is its own vector. The top-K passages are the ones whose vectors are most aligned with the query by inner product, ringed and connected with brightness set by the softmax weight pη(z|x). Pick a question, and a K. With normalized vectors, largest inner product means nearest point, which is why this looks like nearest-neighbor search.

The equation leaves out a detail. That softmax is taken over the top-K passages the search returned, never over all 21 million; normalizing across the whole index would be hopeless, and retrieving a short list first is precisely how you avoid having to. The $\propto$ in (1) is RAG recasting DPR's raw inner product as a probability. DPR on its own ranks passages by that dot product. The document embeddings are computed once with a pre-trained DPR encoder (itself trained to find answer-bearing passages for Natural Questions and TriviaQA) and then frozen into the index, which will matter a great deal for how the model is trained.
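A minimal sketch of equation (1) with the top-K softmax made explicit. It scores every document by brute force, which is only workable for a toy; the real system uses an approximate FAISS index instead. The vectors, the helper names `inner` and `retrieve_top_k`, and the 3-dimensional embeddings are all invented for illustration.

```python
import math

def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(query_vec, doc_vecs, k):
    """Brute-force MIPS: score every document, keep the k best.

    Returns (doc_index, weight) pairs. The weights are a softmax taken
    over the k retrieved scores only, not over the whole collection.
    """
    scores = [(inner(d, query_vec), i) for i, d in enumerate(doc_vecs)]
    top = sorted(scores, reverse=True)[:k]
    m = max(s for s, _ in top)                  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s, _ in top]
    total = sum(exps)
    return [(i, e / total) for (_, i), e in zip(top, exps)]

# Hypothetical 3-d embeddings, purely for illustration.
docs = [
    [0.9, 0.1, 0.0],    # doc 0
    [0.8, 0.3, 0.1],    # doc 1
    [0.0, 1.0, 0.0],    # doc 2
    [-0.5, 0.2, 0.7],   # doc 3
]
query = [1.0, 0.2, 0.0]

top2 = retrieve_top_k(query, docs, k=2)
for doc_id, weight in top2:
    print(doc_id, round(weight, 3))
```

Docs 0 and 1 point most nearly the same way as the query, so they make the shortlist, and their two weights sum to one because the normalization never looks past the shortlist.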

Don't bet on one page

The retriever hands back several passages, not one. You could keep the top passage and throw the rest away. RAG does not.

You do not know which retrieved passage holds the answer. Maybe the second one does. Maybe the answer is spread across two of them. Betting everything on the top hit throws that uncertainty away. So instead of choosing, RAG keeps all $K$ passages and lets each one vote. Each passage produces its own answer through the generator, and the votes are added up, weighted by how relevant the retriever judged each passage to be. There is no safety net for a wrong choice either: nothing in the training data says which passage was right, so committing to the top hit would gamble the training signal on a guess. Spreading the bet across all of them, each in proportion to its retrieval score, cushions the risk, and the operation below formalizes it. In the simplest version, called RAG-Sequence, one passage is assumed responsible for the whole answer, and the probability of an answer is that weighted sum:

p_{\text{RAG-Seq}}(y \mid x) \;\approx\; \sum_{z \in \text{top-}k} p_\eta(z \mid x)\, p_\theta(y \mid x, z) \;=\; \sum_{z \in \text{top-}k} p_\eta(z \mid x) \prod_{i}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})

(2)

This operation, summing a quantity over every value a hidden variable could take, each weighted by its probability, is called marginalizing out the latent $z$ . It is the standard way to deal with something you cannot observe: account for every possibility in proportion to how likely it is. It works like a panel of advisors, each having read a different passage. You ask all of them, and you trust each one's answer in proportion to how on-topic their passage was. The right-hand expansion spells out that $p_\theta(y \mid x, z)$ is itself a product of per-token probabilities, the usual way a generator scores a whole sequence.

This is not a vague safety margin but a principled averaging of the generator over an unknown. Treat "which passage holds the answer" as an unknown with a probability attached, namely the retrieval weight $p_\eta(z \mid x)$ , and the correct way to score the answer is to average the generator's answer over that unknown, weighting each passage by how likely it is to be the right one. That is Bayesian averaging, and (2) is exactly it. The top-1 passage is wrong often enough that committing to it gambles the answer on a single guess that came with no label confirming it; averaging over the top $K$ spends the retriever's own confidence as the hedge, so a near-miss at rank 1 is recovered by a correct passage at rank 2 or 3 instead of being lost.

The averaging is only over $K$ passages, never the full 21 million, and that cutoff is a real tradeoff. Pick $K$ too small and the right passage may never make the shortlist, so no amount of averaging can recover it and the gradient has nothing correct to reinforce. Pick $K$ too large and the sum fills with off-topic passages that each carry a small but nonzero weight, diluting the average with noise and blunting the signal that should sharpen onto the passage that helped. RAG uses $K = 5$ or $10$ , small enough to stay cheap and clean, large enough that the answer-bearing passage is usually somewhere in the list. Slide $K$ on the toy below and watch the marginal answer probability trace out that tradeoff: flat-wrong at $K=1$ , a climb as the right passage enters, a clear best $K$ , then a gentle slide back down as off-topic passages dilute it:

Figure 3 · how many passages to marginalize over

A ranked list of ten candidate passages, each with a fixed retrieval logit that decays with rank and a hidden answer-prob pθ(ans|z) (does the passage actually contain the answer). The gold passage is at rank 2; rank 1 is a high-scoring near-miss, ranks 3 and 4 partial, the rest off-topic. The slider sets K. The amber bars are the retrieval softmax over the included top-K, renormalized as K moves, and the readout is the marginal p(answer) = Σ pη·pθ(ans|z). At K=1 it bets on the wrong top hit (0.10); it rises as the gold passage enters and peaks at the best K=3 (0.42), then falls and plateaus near 0.30 as off-topic passages dilute the softmax mass away from gold. The toy uses a handful of passages; the real RAG marginalizes the top K (K = 5 or 10) of millions, document index frozen.
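The same tradeoff can be sketched in a few lines. The logits and answer probabilities below are invented and are not the values behind the figure; `marginal_answer_prob` is a hypothetical helper that evaluates the weighted sum of equation (2) for one candidate answer, restricted to the top K.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def marginal_answer_prob(retrieval_logits, answer_probs, k):
    """Weighted vote over the top k passages for a single candidate answer.

    retrieval_logits: retrieval scores, already sorted best first.
    answer_probs:     p_theta(answer | x, z) for each passage, same order.
    """
    weights = softmax(retrieval_logits[:k])     # renormalized over the top k only
    return sum(w * p for w, p in zip(weights, answer_probs[:k]))

# Invented numbers: rank 1 is a near-miss, rank 2 holds the answer,
# rank 3 is partial, and the rest are off-topic.
logits = [3.0, 2.8, 2.0, 1.5, 1.4, 1.3, 1.2, 1.1]
answer_probs = [0.10, 0.90, 0.40, 0.02, 0.02, 0.02, 0.02, 0.02]

curve = {k: marginal_answer_prob(logits, answer_probs, k)
         for k in range(1, len(logits) + 1)}
for k, p in curve.items():
    print(k, round(p, 3))
```

With these particular numbers the marginal is lowest at K=1, jumps once the answer-bearing passage joins the list, and then slides as each extra off-topic passage takes a share of the softmax mass.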

Now the question that should be nagging. The retriever is supposed to learn to find good passages, but the training data only ever contains questions and gold answers. Nobody writes down which passage was the right one to retrieve. So what teaches the retriever?

The marginal does. Because $p_\eta(z \mid x)$ sits inside the sum in (2), the gradient of the answer's likelihood flows back into the retriever weights, and retrieval, which feels like a discrete, un-learnable lookup, becomes a smooth, differentiable knob. This is why RAG can be trained end to end. Slide the training control below and watch the weights drift from flat toward the passages that support the gold answer, and the marginal sharpen onto it:

Figure 4 · the marginal, and learning to retrieve

RAG marginalizes across documents as a weighted vote: p(y|x) = Σ pη·pθ. Each document carries an amber retriever weight pη and a teal generator distribution pθ over candidate answers. As training advances, the weights drift from uniform toward the documents whose generator supports the gold answer, and the marginal sharpens onto it, with no label on which document was correct.

The gradient flows in a specific, sensible direction: a passage's weight is pushed up in proportion to how much that passage raised the probability of the gold answer, relative to the others. A passage whose generator confidently produced the correct answer gets reinforced, a distractor that led nowhere gets suppressed, and the retriever, never told which passage was right, infers it from which passages helped. The figure below runs that gradient live, with five documents and no retrieval labels; the scores move on their own:

Figure 5 · the gradient that teaches retrieval

A toy run of the true mechanism: 5 candidate documents with fixed gold-answer usefulness pθ(gold|x,z) (teal values), retrieval scores starting near-uniform, and real gradient steps on log of the marginal answer likelihood. The arrows are each document's current gradient push, up when it raises the gold answer's probability, down when it does not, with magnitude tracking its posterior weight, so the softmax pη(z|x) reshapes toward the helpful document and p(answer) climbs. Watch doc B, partially helpful: pulled up early, pushed down once the marginal probability rises above its contribution. The real model marginalizes the top K of 21M documents, and only the query encoder trains, the document index stays frozen.
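A rough sketch of that toy, under stated assumptions: the usefulness values are invented, the retrieval scores are free numbers rather than outputs of an encoder, and the update is plain gradient ascent with a fixed step size. Writing the marginal as M = Σ pη(z)·g(z), with g(z) the fixed usefulness of document z, the derivative of log M with respect to document z's score is pη(z)·(g(z) − M)/M, which is the quantity `grad_log_marginal` computes.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_marginal(scores, usefulness):
    """Gradient of log(sum_z p_eta(z) * g(z)) with respect to each retrieval score."""
    p_eta = softmax(scores)
    marginal = sum(w * g for w, g in zip(p_eta, usefulness))
    grads = [w * (g - marginal) / marginal for w, g in zip(p_eta, usefulness)]
    return grads, marginal

# Invented p_theta(gold | x, z) for five documents, held fixed throughout.
usefulness = [0.05, 0.40, 0.90, 0.05, 0.10]
scores = [0.0] * len(usefulness)        # start from uniform retrieval weights
learning_rate = 1.0                     # arbitrary choice for the toy
history = []
first_grads = None

for step in range(80):
    grads, marginal = grad_log_marginal(scores, usefulness)
    if first_grads is None:
        first_grads = grads
    history.append(marginal)
    scores = [s + learning_rate * g for s, g in zip(scores, grads)]

last_grads, final_marginal = grad_log_marginal(scores, usefulness)
final_weights = softmax(scores)
print([round(w, 3) for w in final_weights], round(final_marginal, 3))
```

The sign of each push depends only on whether a document's usefulness sits above or below the current marginal. That is why the partially helpful document (index 1 here) is pulled up at the start and pushed down once the marginal has climbed past 0.40.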

One source, or one per word

RAG-Sequence assumes a single passage is behind the entire answer. That is the right assumption for a short factual reply, but it is limiting when an answer needs to weave together facts that live in different documents. So the paper offers a second model, RAG-Token, that marginalizes at a finer grain: it can draw a different passage for every token it writes. The sum over passages moves inside the product over positions:

p_{\text{RAG-Tok}}(y \mid x) \;\approx\; \prod_{i}^{N} \sum_{z \in \text{top-}k} p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})

(3)

Compare (2) and (3) and the difference is exactly where the sum sits. In RAG-Sequence the sum is outside: pick one passage, write the whole answer, then average over which passage it was. In RAG-Token the sum is inside: at each token, blend all the passages, then take the next step. RAG-Sequence is a book report from a single source. RAG-Token is an essay that can cite a different source in each sentence. That difference makes a prediction before any benchmark does: Token should win when an answer needs facts stitched from different passages, Sequence should hold its own when one passage carries the whole answer, because re-voting at every word only pays when the right source changes mid-sentence.

The paper illustrates this with a concrete example. Asked to write a Jeopardy clue for the answer "Hemingway," the model produces a sentence naming two of his books and the prize he won, and you can watch which passage drives each token. The passage about A Farewell to Arms spikes as that title begins; the passage about The Sun Also Rises takes over as the next one starts. Switch between the two models below and watch the document posterior shift across the sentence, or stay locked:

Figure 6 · which document writes which word

Writing a Jeopardy clue for Hemingway, token by token. In RAG-Token a different document can win each token: doc 1 (A Farewell to Arms) lights up as that title starts, doc 2 (The Sun Also Rises) as the next does, and each fades once the title is underway, because parametric memory finishes it. In RAG-Sequence one document is locked across the whole answer.

After the first word of each title, the posterior over documents (for each token, how much credit each passage gets for producing it, given what has already been written) flattens. Once "The Sun" is on the page, the generator can finish "Also Rises" from its own parametric memory, without leaning on any particular passage. The two memories are working together: the retrieved document supplies the specific fact, and the weights fill in the rest. That cooperation is the reason RAG-Token tends to win when an answer combines several facts, and the paper finds exactly that on Jeopardy generation.

Both models reduce to the same thing for classification. Treat the class label as a one-token answer and the sum-inside and sum-outside forms coincide, so RAG handles "is this claim true or false" with the same machinery it uses to write. In code, the only difference between the two is whether you marginalize once for the sequence or once per step:

# one RAG answer: retrieve, run the generator per doc, marginalize
# (pseudocode: query_encoder, mips_topk, softmax, dot, p_theta, p_theta_next stand in for the real models)
K    = 5
q    = query_encoder(x)               # the query vector q(x)
docs = mips_topk(index, q, k=K)       # top-K passages by inner product d·q
p_eta = softmax([dot(d.vec, q) for d in docs])   # retriever weights pη over the top K

# RAG-Sequence: commit to one document for the whole answer, then average over which
def seq_prob(y):
    return sum(w * p_theta(y, x, d) for w, d in zip(p_eta, docs))

# RAG-Token: a fresh document mixture at every step
def tok_next(y_prev):
    return sum(w * p_theta_next(x, d, y_prev) for w, d in zip(p_eta, docs))
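To make the difference between (2) and (3) concrete, here is a small runnable sketch with invented numbers. It assumes the per-token generator probabilities for one fixed answer are already available as a table, `token_probs[z][i]`, which stands in for pθ(y_i | x, z, y_{1:i-1}); the function names are hypothetical.

```python
import math

def rag_sequence_prob(p_eta, token_probs):
    """Sum outside: for each document, multiply its token probs, then average."""
    return sum(w * math.prod(per_doc) for w, per_doc in zip(p_eta, token_probs))

def rag_token_prob(p_eta, token_probs):
    """Sum inside: at each position, mix the documents, then multiply across positions."""
    n_tokens = len(token_probs[0])
    result = 1.0
    for i in range(n_tokens):
        result *= sum(w * per_doc[i] for w, per_doc in zip(p_eta, token_probs))
    return result

p_eta = [0.5, 0.5]
# A two-token answer whose two tokens are supported by different documents.
token_probs = [
    [0.9, 0.1],   # doc 0 supports token 1 but not token 2
    [0.1, 0.9],   # doc 1 supports token 2 but not token 1
]
seq = rag_sequence_prob(p_eta, token_probs)   # 0.5*0.09 + 0.5*0.09
tok = rag_token_prob(p_eta, token_probs)      # (0.45 + 0.05) * (0.05 + 0.45)
print(round(seq, 4), round(tok, 4))

# A one-token answer (a class label): the two forms give the same number.
label_probs = [[0.9], [0.1]]
seq_label = rag_sequence_prob(p_eta, label_probs)
tok_label = rag_token_prob(p_eta, label_probs)
print(round(seq_label, 4), round(tok_label, 4))
```

When each token's support lives in a different document, the per-token mixture scores the answer much higher (0.25 against 0.09 here), because no single document has to explain the whole sequence.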

No answer key for retrieval

Training has one thing and one thing only: pairs of questions and gold answers. No labels saying which passage to fetch. The objective is to make the gold answers likely under the marginal, which means minimizing the negative log of the marginal probability summed over the training set:

\mathcal{L} \;=\; \sum_{j} -\log p(y_j \mid x_j)

(4)

That single objective trains both halves at once, with plain Adam (a standard adaptive gradient-descent optimizer). The generator learns to write good answers from retrieved passages, and the retriever, by the mechanism of Figures 4 and 5, learns to fetch passages that make good answers likely. The supervision for retrieval is entirely indirect, coming from the answer, not from any annotation of the evidence.

The training is affordable because RAG freezes the expensive half. Re-embedding 21 million documents every time the document encoder changed would mean rebuilding the whole search index over and over, which is the bill REALM pays during its pre-training. RAG never pays it: the document encoder and the index stay frozen, and fine-tuning touches only the query encoder and the generator. The retriever still improves, because what matters is the relative geometry, not which side moves. Pulling the question vectors toward the right documents changes the nearest neighbors exactly as well as pushing the documents toward the question would. The paper finds this is plenty; re-indexing the documents is not worth the cost.

# one training step: nobody labels which document is correct
# (pseudocode: sample_pair, query_encoder, mips_topk, marginal_prob, opt are stand-ins)
x, y  = sample_pair()                 # question, gold answer
q     = query_encoder(x)              # only this encoder is trained
docs  = mips_topk(index, q, k=5)      # the doc encoder + index are frozen
loss  = -log(marginal_prob(y, x, docs))  # negative marginal log-likelihood
opt.zero_grad()                       # clear gradients left over from the previous step
loss.backward()                       # gradient reaches q through pη
opt.step()                            # updates query encoder + generator
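A toy version of the frozen-index idea, as a sketch rather than the real training loop: the document vectors never change, the query vector itself plays the role of the query encoder's output, and the usefulness values are invented. With M = Σ pη(z)·g(z) and pη a softmax over d(z)·q, the gradient of log M with respect to q is Σ pη(z)·(g(z) − M)/M · d(z).

```python
import math

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

DOCS = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # frozen document vectors (invented)
USEFULNESS = [0.05, 0.90, 0.05]                # invented p_theta(gold | x, z)

def marginal(q):
    """Marginal gold-answer probability and the retrieval weights for query q."""
    p_eta = softmax([inner(d, q) for d in DOCS])
    return sum(w * g for w, g in zip(p_eta, USEFULNESS)), p_eta

def grad_wrt_query(q):
    """Gradient of log marginal with respect to the query vector only."""
    m, p_eta = marginal(q)
    coeffs = [w * (g - m) / m for w, g in zip(p_eta, USEFULNESS)]
    return [sum(c * d[j] for c, d in zip(coeffs, DOCS)) for j in range(len(q))]

q = [1.0, 0.0]                 # starts aligned with the unhelpful doc 0
start, start_weights = marginal(q)
for _ in range(50):
    g = grad_wrt_query(q)
    q = [qi + 0.5 * gi for qi, gi in zip(q, g)]   # gradient ascent, step size 0.5
end, end_weights = marginal(q)
print(round(start, 3), round(end, 3), [round(w, 3) for w in end_weights])
```

Only `q` is updated, yet the top-ranked document changes from doc 0 to doc 1: moving the query is enough to reorder the neighbors, and nothing has to be re-embedded.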

Decoding, turning the model into an actual answer at test time, differs between the two variants. RAG-Token is a standard autoregressive generator (one that writes the answer one token at a time, each token conditioned on the tokens before it) once you fold the per-token document mixture into its transition probability, so ordinary beam search works. RAG-Sequence does not factor so neatly, since a whole-sequence marginal cannot be scored token by token, so the paper runs beam search separately for each document and then collects the candidates. The careful version ("Thorough Decoding") runs an extra forward pass to score any candidate a given document's beam missed; the cheap version ("Fast Decoding") assumes a missing candidate has probability zero. A small engineering knob, with the usual accuracy-for-speed trade.
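The two RAG-Sequence decoding modes can be sketched as follows. This is an illustration under assumptions, not the paper's code: each document's beam search is taken as already done and summarized as a dict from candidate string to pθ(y | x, z), and the hypothetical `score` callable stands in for the extra generator forward pass.

```python
def rag_sequence_decode(p_eta, beams, score, thorough):
    """Pick the best candidate under the RAG-Sequence marginal.

    p_eta:    retrieval weight per document.
    beams:    beams[z] maps each candidate found by document z's beam
              search to p_theta(y | x, z).
    score:    callable (y, z) -> p_theta(y | x, z), used only when a
              candidate is missing from document z's beam.
    thorough: True scores missing candidates; False treats them as zero.
    """
    candidates = set().union(*[set(b) for b in beams])
    totals = {}
    for y in candidates:
        total = 0.0
        for z, (w, beam) in enumerate(zip(p_eta, beams)):
            if y in beam:
                p = beam[y]
            elif thorough:
                p = score(y, z)     # extra forward pass for the missing candidate
            else:
                p = 0.0             # fast mode: assume probability zero
            total += w * p
        totals[y] = total
    return max(totals, key=totals.get), totals

# Invented numbers, with a beam width of 1 per document.
p_eta = [0.6, 0.4]
beams = [{"A": 0.5}, {"B": 0.6}]
full_table = {("A", 0): 0.5, ("A", 1): 0.1, ("B", 0): 0.45, ("B", 1): 0.6}

def score(y, z):
    return full_table[(y, z)]

fast_best, fast_totals = rag_sequence_decode(p_eta, beams, score, thorough=False)
thorough_best, thorough_totals = rag_sequence_decode(p_eta, beams, score, thorough=True)
print(fast_best, thorough_best)
```

In this toy the two modes disagree: candidate "B" narrowly missed document 0's beam, so the fast mode gives it no credit there and picks "A", while the thorough mode recovers that credit and picks "B".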

Where the open book wins

On open-domain question answering, where the model gets a question and must produce the answer with no passages handed to it, RAG sets the state of the art on Natural Questions, WebQuestions, and CuratedTrec, beating both the closed-book giants (an 11-billion-parameter T5) and the open-book retrieve-and-extract pipelines. The metric is Exact Match, the fraction of questions whose generated answer string matches the gold answer exactly. It does this while generating answers rather than extracting spans. A generator can assemble an answer from clues scattered across several passages, and it can even produce a correct answer that appears verbatim in none of the retrieved documents. On Natural Questions that happens for 11.8% of RAG's correct answers, exactly the cases where an extractive reader, which can only copy, scores zero.
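For reference, a minimal sketch of an Exact Match scorer. The normalization step (lowercasing, dropping punctuation and English articles) is an assumption borrowed from common open-domain QA evaluation practice, not something this explainer specifies, and the example strings are invented.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1 if the normalized prediction equals any normalized gold answer, else 0."""
    return int(any(normalize(prediction) == normalize(g) for g in gold_answers))

def em_score(predictions, references):
    """Fraction of questions answered with an exact match."""
    hits = sum(exact_match(p, golds) for p, golds in zip(predictions, references))
    return hits / len(predictions)

predictions = ["The Old Man and the Sea", "Paris, France", "1954"]
references = [["Old Man and the Sea"], ["Paris"], ["1954", "in 1954"]]
score = em_score(predictions, references)
print(round(score, 3))
```

The second prediction contains the right answer but still scores zero, which shows how strict the metric is toward a model that generates free text.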

TriviaQA carries one caveat the paper is careful about, and so are we. RAG's headline TriviaQA result is on the split designed to compare against T5; on the standard open-domain split, the plain DPR system edges it out. So the claim is "state of the art on three of the four QA datasets," not all four.

In the generation results, the open book shows its character. On the MS-MARCO abstractive QA task, RAG-Sequence beats the BART baseline by 2.6 points on both BLEU-1 and ROUGE-L (overlap-with-reference scores for generation quality). On Jeopardy question generation, RAG-Token wins, and human raters preferred its output by a wide margin: shown a RAG clue and a BART clue, they judged RAG more factual in 42.7% of pairs versus 7.1% for BART, and more specific in 37.4% versus 16.8%. RAG also generated more diverse text, more distinct phrasings, without any of the usual diversity-promoting decoding tricks. Grounding the model in real passages made it both more accurate and less repetitive.

On FEVER fact verification, classifying whether Wikipedia supports or refutes a claim, RAG lands within 4.3% of the state of the art on the three-way task and within 2.7% on the two-way task. The systems it trails are elaborate pipelines trained with explicit supervision on which sentences are the evidence. RAG uses none of that. It retrieves its own evidence and is told only the claim and the verdict, and its retrieved documents still come from the correct Wikipedia article 71% of the time in the top hit and 90% of the time within the top ten.

Editing what it knows

Because the knowledge lives in the index and not the weights, you can edit it without touching the model. Swap the library and you have changed what the model knows, instantly, with no retraining. The paper demonstrates this with a clean experiment. It builds two indices, one from a 2016 Wikipedia snapshot and one from 2018, takes 82 world leaders who changed in that window, and asks each model "Who is the {position}?" Try it:

Figure 7 · hot-swapping the index

The same question, the same frozen model, a different book. Switching the Wikipedia index between its 2016 and 2018 snapshots changes the retrieved passage and so the answer, with no retraining. In the paper: the model answers 70% correctly with the 2016 index for 2016 leaders and 68% with the 2018 index for 2018 leaders. With a mismatched index the answers collapse (12% and 4%), demonstrating that RAG reads the index rather than reciting from memory. The shuffled option is our illustrative negative control in the same spirit, unrelated passages in, no grounded answer out.

The mismatched numbers are the real evidence. If RAG were reciting from parametric memory, the index would not matter and the mismatched runs would score about as well as the matched ones. Instead the answer tracks the index: 70% and 68% correct when the snapshot matches the question's era, 12% and 4% when it does not. Reading the index also makes RAG auditable: you can point at the retrieved passage and check whether the answer follows, which a purely parametric model cannot offer.
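The shape of the hot-swap experiment fits in a few lines. Everything here is a deliberately crude stand-in: the retriever matches on word overlap rather than dense vectors, the "generator" just copies a name out of the retrieved passage, and the places and names are invented placeholders. The point is only that the code stays fixed while the index changes.

```python
def retrieve(index, question):
    """Toy retriever: return the passage sharing the most words with the question."""
    q_words = set(question.lower().replace("?", "").split())
    return max(index, key=lambda passage: len(q_words & set(passage.lower().split())))

def answer(index, question):
    """Toy reader: copy whatever follows ' is ' in the retrieved passage."""
    passage = retrieve(index, question)
    return passage.split(" is ", 1)[1].rstrip(".")

# Placeholder passages with invented names, standing in for two snapshots.
INDEX_2016 = [
    "The president of Examplia is Alice Arbor.",
    "The mayor of Sampleton is Carol Cho.",
]
INDEX_2018 = [
    "The president of Examplia is Bob Birch.",
    "The mayor of Sampleton is Dana Diaz.",
]

question = "Who is the president of Examplia?"
answer_2016 = answer(INDEX_2016, question)
answer_2018 = answer(INDEX_2018, question)
print(answer_2016)
print(answer_2018)
```

Neither function changes between the two calls; only the list passed as `index` does, and the answer follows it.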

The limits. RAG hallucinates less, but it does not stop hallucinating; if the retrieved passage is wrong or the index is biased, the grounded answer inherits that. The retriever is initialized from a DPR model that did see retrieval supervision on Natural Questions and TriviaQA, so the "no labels" story applies to RAG's own fine-tuning, not to the entire history of its retriever. And dense retrieval is not always best: on FEVER, whose claims are heavy with named entities, a plain word-overlap retriever (BM25) wins, because that task rewards exact entity matching. The authors point at joint pre-training of the retriever and generator from scratch as the obvious next step, which is roughly the direction the field then took.

What RAG actually proposed is durable because it is so small. A model does not have to hold every fact in its head. Give it a retriever and an index, treat the retrieved passage as a latent variable, marginalize over the top few, and train on answers alone. The retriever learns what to fetch from nothing but whether the answers come out right, and the facts stay in a book you can correct and replace. Every chatbot that now cites its sources is running some descendant of this.

Provenance Verified against primary literature

DPR (2020). Bi-encoder retriever: pη(z|x) ∝ exp(d(z)·q(x)) with BERT-base encoders, searched by MIPS.

BART (2019). The pre-trained denoising seq2seq generator, ~406M parameters (the paper rounds to 400M).

REALM (2020) / ORQA (2019). Earlier masked-LM models with a learned retriever, explored only for extractive QA.

Marginal gradient. The retriever learns with no labels: a document is upweighted by how much it raised the gold answer’s probability.

Correction. On TriviaQA, RAG sets the state of the art only on the T5-comparable Wiki split (68.0 vs 60.5). On the standard open-domain split, DPR (57.9 EM) edges out RAG-Sequence (56.8). We keep that distinction; a careful reader should not read “SOTA on all four QA tasks.”

Questions you might still have

If nobody labels which document is correct, how does the retriever ever learn?
Through the marginal-likelihood gradient. A document’s weight rises in proportion to how much it raised the probability of the gold answer, so the documents that help get reinforced and the distractors fade. No retrieval labels are needed.

What is the real difference between RAG-Sequence and RAG-Token?
RAG-Sequence commits to one document for the whole answer; RAG-Token can draw a different document for each token. Token can stitch facts from several sources into one sentence, which is why it wins when an answer combines separate facts.

Does RAG have to retrain when the facts change?
No. The document encoder and index are frozen even during training, and at test time you can swap the whole index for a newer one. The answers update with no change to any weight. That is the hot-swap experiment.

Why generate the answer instead of extracting a span from a document?
Generation can combine clues spread across documents, and it can produce a correct answer that appears verbatim in none of them. On Natural Questions that happens for 11.8% of RAG’s right answers, where an extractive reader would score zero.

Footnotes & further reading

The paper: Lewis, Perez, Piktus, Petroni, Karpukhin, Goyal, Küttler, Lewis, Yih, Rocktäschel, Riedel, Kiela, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Facebook AI Research / UCL / NYU, NeurIPS 2020). Code and a demo live in HuggingFace Transformers.
The retriever: Karpukhin et al., Dense Passage Retrieval for Open-Domain Question Answering (DPR). The bi-encoder, the dot-product relevance, and the pre-trained index RAG builds on.
The generator: Lewis et al., BART: Denoising Sequence-to-Sequence Pre-training. BART-large is about 406M parameters; the RAG paper rounds it to 400M.
The predecessors: Guu et al., REALM, and Lee et al., Latent Retrieval (ORQA), which paired a learned retriever with a masked language model for extractive QA.
The search: Johnson et al., Billion-scale similarity search (FAISS), and Malkov & Yashunin, HNSW, the graph index RAG uses for fast MIPS.
The benchmarks: open Natural Questions, TriviaQA, WebQuestions, and CuratedTrec for QA; MS-MARCO for abstractive QA; and FEVER for fact verification.
