
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Every task is the same task: text in, text out.

T5 does not invent a clever new mechanism. It makes one bet, that every language problem can be written as text-to-text, and then uses that uniformity to run the experiment the field had been too fragmented to run: hold everything fixed and ask what matters.

Explaining the paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Raffel, Shazeer, Roberts, Lee, Narang, Matena, Zhou, Li, Liu · Google · JMLR 2020 · arXiv:1910.10683

Write translation, sentiment, and question answering as one problem, and you can finally measure what makes a model better.

By late 2019 transfer learning had taken over natural language processing, and the field was a mess of good ideas that nobody could compare. BERT pre-trained an encoder and bolted a classifier on top. GPT-2 ran a decoder and predicted the next word. XLNet, RoBERTa, UniLM, MASS, SpanBERT: each shipped its own architecture, its own pre-training objective, its own unlabeled corpus, its own fine-tuning trick, often all at once. When a new model beat the last one, you could rarely say why. Was it the objective? The data? The extra billion tokens? The architecture? The contributions were tangled together, and that was slowing the field down.

T5, out of Google, is the paper that untangled it. The authors are blunt that they are not proposing a new method. They are running a survey, the careful kind, where you fix everything you can and change one thing at a time. To make that possible they needed a single setup flexible enough to express every task they wanted to study, so they could swap objectives and architectures and corpora without ever changing the wrapper around them. That setup is the text-to-text framework, and everything else is built on it. Once it is in place, the paper becomes a long sequence of controlled experiments, and at the end the winners get stacked together and scaled up into a model the authors call T5 (the Text-to-Text Transfer Transformer), which then sets the state of the art on most of the benchmarks it touches.

The path follows the paper. First the framework that makes everything comparable. Then the backbone Transformer and the three architectures it can take. Then the pre-training objective and the search that picks it. Then the data, the scaling, and what the finished model does. The individual pieces are mostly borrowed and unsurprising; the paper's contribution is comparing them cleanly.

Every task is text in, text out

Everything rests on one idea. A model that classifies sentiment, a model that translates, and a model that answers questions normally look nothing alike: different output layers, different losses, different decoding. T5 discards all of that. Every task is reduced to the same shape. You feed the model some text, and you train it to produce some text.

What makes this work for tasks that are not obviously "generate text" is a task prefix: a few words at the front of the input that name the job. There is no machinery behind the prefix; it is ordinary conditioning text. Inputs that begin "translate English to German:" are always paired with German targets in training, so the model learns that those words predict German output. The prefix steers the decoder the way any context steers a language model, with no task-specific head required. You feed the model the input on the left and train it to emit the target on the right:

# every task is one (input string -> target string) pair
"translate English to German: That is good."   ->  "Das ist gut."
"cola sentence: The professor talked us."       ->  "unacceptable"
"mnli premise: I hate pigeons. hypothesis: ..." ->  "entailment"
"stsb sentence1: A plane lands. sentence2: ..." ->  "4.0"
"summarize: <a long news article> ..."          ->  "<a short summary>"

For a classification task the target is the label written out as a word. On the MNLI inference benchmark, where you are given two sentences, a premise and a hypothesis, and decide whether the premise entails, contradicts, or is neutral toward the hypothesis, the prefix is mnli and the target is the single word entailment. Even a regression task fits the format: STS-B asks for a similarity score from 1 to 5, so T5 rounds it to the nearest $0.2$ and predicts the number as a literal string like 3.8, which turns the regression into a 21-way choice over string targets. (If the model ever emits something that is not a valid label or number, it is scored wrong, which the authors say never happened.)
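The casting step can be sketched in a few lines of Python. This is a minimal illustration, not the paper's preprocessing code: the helper names `stsb_target` and `to_text_pair` are hypothetical, the premise and hypothesis are made up, and the helper only covers tasks whose inputs are named fields (the translation prefix, which ends in a colon and has no field names, would need its own branch).

```python
def stsb_target(score):
    """Round a 1-5 similarity score to the nearest 0.2 and render it as text."""
    return f"{round(score * 5) / 5:.1f}"

def to_text_pair(task, fields, target):
    """Build one (input string, target string) pair: prefix first, then named fields."""
    body = " ".join(f"{name}: {text}" for name, text in fields.items())
    return f"{task} {body}", target

# made-up example sentences, in the same format as the mnli row above
pair = to_text_pair(
    "mnli",
    {"premise": "A dog is running.", "hypothesis": "An animal is moving."},
    "entailment",
)

# every string the STS-B target can take: 1.0, 1.2, ..., 5.0
stsb_labels = sorted({stsb_target(k / 5) for k in range(5, 26)})
```

Enumerating `stsb_labels` gives 21 distinct strings, which is the 21-way choice the paragraph describes.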

Figure 1 · one model, every task

Pick a task. The teal prefix names the job; the input rides in behind it; the same model emits the answer as plain text, a German sentence, a class label, a similarity number, or a summary. The translate and mnli examples are verbatim from the paper; the rest are illustrative in the same format.

Because the shape never changes, the training objective never changes either. Every task, pre-training and fine-tuning alike, is trained with the same maximum-likelihood loss: the model predicts the target one token at a time, is shown the true previous token while learning (this is called teacher forcing), and is penalized by the cross-entropy between its predicted next-token distribution and the truth. Decoding never changes either. The same model, the same loss, the same hyperparameters, the same decoder, for translation and grammar-checking and summarization.
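A minimal sketch of that loss, assuming a hypothetical `predict` callable that returns a next-token distribution given the true target prefix. A real model would also condition on the encoded input; the stand-in below hides that for brevity.

```python
import math

def teacher_forced_loss(predict, target_ids):
    """Average cross-entropy over the target, feeding the TRUE prefix at every step."""
    total = 0.0
    for t, gold in enumerate(target_ids):
        probs = predict(target_ids[:t])   # condition on true previous tokens, not model samples
        total += -math.log(probs[gold])   # cross-entropy against the one correct next token
    return total / len(target_ids)

# hypothetical stand-in model: uniform over a 4-token vocabulary
uniform = lambda prefix: [0.25, 0.25, 0.25, 0.25]
loss = teacher_forced_loss(uniform, [2, 0, 3])
```

With a uniform model the loss is the log of the vocabulary size, the baseline any trained model has to beat. Nothing in the function depends on which task produced `target_ids`, which is the point of the uniform format.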

That uniformity is what makes the paper's experiments controlled. When the wrapper around every task is identical, you can change the pre-training objective and measure the effect with nothing else moving. You can swap the architecture underneath and compare it honestly. You can change the corpus and read off the difference. Task-specific machinery would have been a confound in every one of those comparisons, and the text-to-text format removes it. The rest of the paper spends that budget.

Two edges to flag. A handful of tasks need a little massaging to fit (the Winograd pronoun puzzles get recast as "predict the noun the starred pronoun refers to"). And a plain string interface cannot express every inductive bias a task might want. Translation, in particular, relies on alignment structure that a generic text-to-text model has no way to absorb, and T5 later falls short of the best specialized systems precisely on translation.

The backbone: an encoder-decoder Transformer

Under the text-to-text wrapper sits a Transformer, and T5 deliberately keeps it close to the original 2017 design. Two stacks. An encoder reads the input and builds a representation of it, with each token free to look at every other token. A decoder writes the output one token at a time, looking back at its own partial output and, through a cross-attention step, at the encoder's reading of the input. The engine in both stacks is self-attention, which rebuilds each position as a weighted average of the others:

$$y_i = \sum_j w_{ij}\, x_j, \qquad w_{ij} = \mathrm{softmax}_j\!\left(\frac{q_i \cdot k_j}{\sqrt{d_{kv}}}\right) \tag{1}$$

Here $x_j$ are the incoming token vectors, $q_i$ and $k_j$ are the query and key projections, and $w_{ij}$ is how much position $i$ attends to position $j$. The weights are a softmax, so they are non-negative and sum to one across $j$: each output token is a convex blend of the inputs. (Equation (1) blends the raw $x_j$ to keep the picture simple; in the full model the blend is over learned value projections of them.) That is the entire mechanism, repeated across many parallel heads and stacked into blocks of attention plus a small feed-forward network. Cross-attention is the same operation with one twist: the decoder's queries attend to the encoder's keys and values rather than its own, and that is the only channel through which the two stacks exchange information.
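Equation (1) is small enough to run by hand. The sketch below is a single head in plain Python, with toy vectors that are made up and with the raw token vectors standing in for queries and keys (no learned projections), so it illustrates the blend rather than a real attention layer.

```python
import math

def attention_weights(q, keys, d_kv):
    """Softmax over scaled dot products: one row of w_ij from equation (1)."""
    logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_kv) for k in keys]
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def attend(q, keys, xs, d_kv):
    """y_i = sum_j w_ij x_j: a convex blend of the token vectors."""
    w = attention_weights(q, keys, d_kv)
    return [sum(wj * x[c] for wj, x in zip(w, xs)) for c in range(len(xs[0]))]

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # toy token vectors (made up)
w = attention_weights(xs[0], xs, d_kv=2)      # how much position 0 attends to each position
y = attend(xs[0], xs, xs, d_kv=2)             # the rebuilt vector for position 0
```

The weights are positive and sum to one, so every coordinate of `y` lands between the smallest and largest value that coordinate takes across `xs`.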

T5 makes three small changes to that backbone, and they are worth describing precisely, because two became standard equipment in later models. First, it simplifies the layer normalization: instead of re-centering activations and adding a learned bias, it only rescales them by their root-mean-square. There is no mean subtraction and no additive term. The wager is that normalization's load-bearing half is the rescale, the part that keeps activations bounded, and that the centering and the learned bias contribute little, so T5 keeps the rescale and drops the other two.

$$\mathrm{LN}(x) = \gamma \odot \frac{x}{\sqrt{\tfrac{1}{d}\sum_k x_k^2 + \epsilon}} \tag{2}$$

This is exactly what a paper published the same week would name RMSNorm. T5 does not use that name (the two were contemporaneous, not derivative), but the operation is identical, and it is the normalization that LLaMA and most of what followed adopted. Second, T5 puts the normalization before each sublayer rather than after it, so a block computes $x + \text{sublayer}(\mathrm{LN}(x))$ and the residual skip bypasses the norm entirely. This is the "pre-norm" arrangement, and it trains more stably than the original "post-norm" $\mathrm{LN}(x + \text{sublayer}(x))$.
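Both changes fit in a few lines. This is a sketch under stated assumptions: the `eps` value and the toy `sublayer` (a plain doubling) are placeholders chosen for illustration, not values taken from the paper.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """Equation (2): rescale by the root-mean-square; no mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]

def pre_norm_block(x, sublayer, gamma):
    """x + sublayer(LN(x)): the residual path skips the normalization."""
    return [a + b for a, b in zip(x, sublayer(rms_norm(x, gamma)))]

x = [3.0, -4.0, 0.0, 5.0]
normed = rms_norm(x, gamma=[1.0] * 4)
out = pre_norm_block(x, sublayer=lambda h: [2.0 * v for v in h], gamma=[1.0] * 4)
```

With `gamma` set to ones, `normed` has a root-mean-square of about one but keeps a nonzero mean, which is the difference from standard layer normalization.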

Third, and most distinctive, T5 throws out absolute position embeddings. The original Transformer told each token where it sat with a fixed sinusoid; T5 instead learns a relative position bias: a single scalar, added directly to the attention logit in (1), that depends only on the offset $i - j$ between the query and key. Offsets are sorted into 32 buckets (exact for small distances, then growing logarithmically, with everything past 128 tokens collapsed into one far-away bucket), so the whole bias table stays tiny while still capturing how far apart two tokens are. Each bucket owns one learned scalar, and the table is shared across all layers while each attention head keeps its own copy. It is a deliberate simplification of the relative-position scheme from Shaw et al., which added learned vectors rather than a single scalar. The authors note these three tweaks are orthogonal to the study and leave ablating them to future work.

Drag the offset below and watch which bucket catches it: 5 gets a bucket of its own, 90 lands in a band 27 offsets wide, and everything from 91 out shares the last one.

Figure 2 · 32 buckets for every offset

The band boundaries are computed with T5's actual bucketing function (relative_position_bucket, num_buckets=32, max_distance=128), not sketched. Offsets 0 through 7 each own a bucket, the bands then widen logarithmically, and one terminal bucket catches every offset from 91 out, which is how everything past the 128 horizon collapses into one. The 32 buckets split 16 per sign of $i-j$; the positive half is shown, the negative half mirrors it. Each bucket owns one learned scalar per head, added to the attention logit; the bias values here are illustrative, not trained.
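For readers who want to check the boundaries themselves, here is a plain-Python re-derivation of the bucketing rule for the bidirectional case. It follows the scheme described above but is a sketch, not the library function: in particular, which sign of the offset gets the upper 16 buckets is an assumption made here.

```python
import math

def relative_position_bucket(offset, num_buckets=32, max_distance=128):
    """Bucket index for a signed offset, bidirectional case (16 buckets per sign)."""
    half = num_buckets // 2                 # 16 buckets for each direction
    bucket = half if offset < 0 else 0      # sign-to-half mapping is assumed, see lead-in
    n = abs(offset)
    max_exact = half // 2                   # offsets 0..7 get one bucket each
    if n < max_exact:
        return bucket + n
    # logarithmically wider bands between max_exact and max_distance
    log_pos = math.log(n / max_exact) / math.log(max_distance / max_exact)
    large = max_exact + int(log_pos * (half - max_exact))
    return bucket + min(large, half - 1)    # the last bucket absorbs everything beyond

near = relative_position_bucket(5)          # small offsets map to themselves
band = relative_position_bucket(90)         # last offset before the terminal bucket
far = relative_position_bucket(91)          # first offset in the terminal bucket
```

The clamp on the last line is what makes the terminal bucket start at 91 rather than at 128: the logarithmic bands run out of buckets before they reach `max_distance`.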

The baseline that anchors every experiment is sized to be familiar: the encoder and decoder are each about a BERT-Base stack, twelve blocks apiece, a hidden width of $d_{\text{model}}=768$ (the size of each token vector), twelve heads, roughly 220 million parameters in total. That is about twice BERT-Base, because there are two stacks instead of one, and that doubling costs almost no extra compute, for a reason the parameter-and-compute accounting makes precise.

Three shapes of attention

The text-to-text format does not actually require an encoder-decoder. A single decoder-only stack, the GPT shape, can do text-to-text too: glue the input and the target into one long sequence and predict it left to right. So before settling on encoder-decoder, T5 compares the architectures head to head, and the architectures differ in their attention mask, the rule for who is allowed to look at whom.

Self-attention treats its input as a set, so on its own it encodes no ordering of past and future. The mask imposes one by setting forbidden attention logits to negative infinity before the softmax, which drives the corresponding weights to zero. Three patterns matter. A fully-visible mask lets every position attend to every other; it is what an encoder uses, and what BERT uses. A causal mask forbids looking ahead, setting $w_{ij}=0$ for every $j>i$, so position $i$ sees only itself and the past; it is what a decoder uses when it generates. (The convention here is that a zeroed weight means "cannot attend".) The third, causal with a prefix, is the interesting hybrid: fully visible over an initial chunk of the sequence, then causal for the rest.

Those three masks define three architectures. The encoder-decoder uses a fully-visible encoder and a causal decoder joined by cross-attention. A language model is a single causal stack over the glued input-plus-target. And a prefix LM is a single stack that treats the input as a fully-visible prefix and only goes causal once it starts generating the target. The figure shows the masks over a short [input | target] sequence: notice that the language model's mask needlessly blocks earlier input positions from attending to later input tokens (its prefix block is a triangle), while the encoder-decoder and the prefix LM both give the input full visibility (a solid square).

Figure 3 · the mask makes the model

Who can attend to whom, over five input tokens then four target tokens. The plain language model is one causal triangle, so even within the input no token can see its successors. The encoder-decoder and prefix LM light up the whole input block. All three cost the same compute $M$; only the encoder-decoder spends $2P$ parameters, and it wins GLUE (the nine-task English understanding suite; from the paper's Table 2, denoising objective).
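The masks in the figure can be rebuilt directly. A minimal sketch, assuming boolean masks where `True` means "may attend"; the function name and the `kind` strings are hypothetical.

```python
def build_mask(n_input, n_target, kind):
    """mask[i][j] is True when position i may attend to position j."""
    n = n_input + n_target
    if kind == "full":
        return [[True] * n for _ in range(n)]
    if kind not in ("causal", "prefix"):
        raise ValueError(kind)
    mask = [[j <= i for j in range(n)] for i in range(n)]   # causal: self and the past
    if kind == "prefix":
        for i in range(n):
            for j in range(n_input):
                mask[i][j] = True                           # the whole input is visible to everyone
    return mask

causal = build_mask(5, 4, "causal")     # plain language model over [input | target]
prefix = build_mask(5, 4, "prefix")     # prefix LM over the same sequence
encoder = build_mask(5, 0, "full")      # encoder side of the encoder-decoder
decoder = build_mask(0, 4, "causal")    # decoder self-attention side
```

Comparing `causal[0][4]` with `prefix[0][4]` shows the difference the paragraph points at: under the causal mask the first input token cannot see the last one, under the prefix mask it can.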

Here is the accounting that makes the encoder-decoder attractive. Stack the encoder's 12 layers on top of the decoder's 12 and you have twice the parameters of a 12-layer decoder-only model. But you do not pay twice the compute: the encoder only ever runs over the input and the decoder only ever runs over the target, so each token passes through one stack, not both, whereas a decoder-only model runs all of its layers over the input and target concatenated together. The upshot, in the paper's bookkeeping, is that an $L+L$ encoder-decoder has about $2P$ parameters but roughly the same compute $M$ as an $L$-layer language model with only $P$ parameters. (It is not exactly equal: the cross-attention adds about 10% of the parameters and the attention cost depends on the sequence lengths, but it is close.) You get to spend twice the parameters at about the same cost.

The doubling is real, not an accounting trick, which the other side of the comparison makes plain: hold the encoder-decoder against a decoder-only model with the same total $2P$ parameters. In the encoder-decoder those $2P$ parameters are split into two separate stacks of $P$ each, and the two stacks see disjoint tokens: an input token is read by the encoder and never touches the decoder's weights, a target token is written by the decoder and never touches the encoder's. So each token is processed by exactly one stack of $P$ parameters, and the cost per token is set by $P$, not $2P$. A decoder-only model that wants the same $2P$ parameters has to put them all in one stack, and a single stack has no way to route some tokens around half of itself, so every token, input and target alike, is pushed through all $2P$ of them. Same parameter budget, but the decoder-only model pays for all of it on every token while the encoder-decoder pays for half. That is why the encoder-decoder buys the second $P$ at roughly the same per-token compute $M$: the parameters are doubled, but the work each token does is not.
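The bookkeeping reduces to counting how many parameters each token passes through. The sketch below does only that, in units of $P$; it deliberately ignores cross-attention and the sequence-length term in attention, and the token counts are made up.

```python
def per_example_cost(n_input, n_target, model):
    """Parameter-touches per example, in units of P (one stack of L layers)."""
    if model == "enc-dec L+L":        # 2P parameters, split across two stacks
        return n_input * 1 + n_target * 1
    if model == "decoder-only L":     # P parameters, one stack
        return (n_input + n_target) * 1
    if model == "decoder-only 2L":    # 2P parameters, one stack
        return (n_input + n_target) * 2
    raise ValueError(model)

n_in, n_out = 400, 100                # hypothetical input and target lengths
costs = {m: per_example_cost(n_in, n_out, m)
         for m in ("enc-dec L+L", "decoder-only L", "decoder-only 2L")}
```

The encoder-decoder and the $L$-layer decoder-only model come out equal, and the $2P$ decoder-only model costs twice as much, whatever lengths are plugged in.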

The extra parameters also buy better scores, not just a bigger parameter count. The scores here are on GLUE, a suite of nine English language-understanding tasks (grammaticality, entailment, paraphrase, sentiment, and the like) averaged into a single number. Across every task, the encoder-decoder with a denoising objective came out on top: GLUE $83.28$, against $81.82$ for the prefix LM and $74.70$ for the plain language model. Sharing the encoder and decoder weights (halving the parameter count back down to $P$) barely dented it, to $82.81$, which is a useful trick when memory is tight. The prefix LM beat the plain LM, which suggests that the damage in a decoder-only model is largely that causal masking hides the input from itself. And in every architecture, the denoising objective beat a plain language-modeling objective.

Pre-training: fill in the blanks

Pre-training needs a task you can run on raw text with no human labels, that still teaches the model something general. The historically obvious choice is language modeling: predict the next word. But by 2019 the field had largely moved to denoising objectives, where you corrupt the input and ask the model to repair it. BERT's "masked language modeling" is the famous example: hide some words and predict them from both sides. Denoising consistently beat next-word prediction on downstream tasks, and T5 confirms it.

T5's own objective is a denoising variant tuned for an encoder-decoder. It drops 15% of the tokens. Then every consecutive run of dropped tokens is replaced in the input by a single sentinel token, a special symbol unique within the example (<X>, <Y>, and so on). The target is not the full repaired sentence. It is only the dropped spans, each tagged with the sentinel that stands in for it, ended by one final sentinel. The figure runs the transformation on the paper's own example sentence.

Figure 4 · span corruption


Drop 15% of the tokens; collapse each consecutive dropped span into one sentinel; train the decoder to emit only the dropped spans, each tagged by its sentinel, with a final sentinel to stop. The default shows the paper's exact example. Reroll to resample, or push the rate up. The target is always far shorter than the input, which is the design goal.

Two design choices hide in there, and both are about speed, not accuracy. Collapsing a whole span into one sentinel, and predicting only the corrupted tokens rather than reconstructing the entire sentence, both make the target sequence short. Shorter targets mean less work in the decoder, so pre-training runs faster, and the paper will use that to train for far longer later. The decoder's share of pre-training compute scales with the tokens it must produce, and a target that lists only the dropped spans is several times shorter than the full sentence, so the same budget yields correspondingly more pre-training. In pseudocode the transformation is almost trivial:

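# span corruption: each consecutive run of dropped tokens becomes one sentinel
def span_corrupt(tokens, drop):
    """tokens: list of str; drop: set of token indices to corrupt."""
    src, tgt, k = [], [], 0
    for i, tok in enumerate(tokens):
        if i not in drop:
            src.append(tok)                  # kept tokens stay in src, in their original place
            continue
        if i - 1 not in drop:                # a new run of dropped tokens starts here
            sentinel = f"<extra_id_{k}>"; k += 1
            src.append(sentinel)             # one sentinel replaces the whole span
            tgt.append(sentinel)             # the same sentinel tags the span in the target
        tgt.append(tok)                      # target lists the span back, in order
    tgt.append(f"<extra_id_{k}>")            # a final sentinel ends the target
    return src, tgt
# the paper samples 15% of tokens, in spans of mean length 3; here the dropped
# indices are fixed so the output matches the example ("for inviting", "last")
tokens = "Thank you for inviting me to your party last week .".split()
src, tgt = span_corrupt(tokens, drop={2, 3, 8})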

The paper's "BERT-style" comparison point is not actually BERT's recipe. Real BERT, of the 15% of tokens it selects, replaces 80% with a mask token, 10% with a random token, and leaves 10% unchanged, and it predicts only those selected positions at the output of its single encoder. T5's reimplemented "BERT-style" objective simplifies this to 90% mask and 10% random, and because it is an encoder-decoder, it reconstructs the entire uncorrupted sequence at the decoder. T5's final span-corruption objective then goes further and predicts only the spans. Three different recipes, and T5's "BERT-style" baseline is an adaptation, not BERT itself. (The tokenizer underneath all of this is SentencePiece, a subword tokenizer that learns its units directly from raw text with no language-specific rules, which is why one shared 32,000-token vocabulary can cover English, German, French, and Romanian; the sentinels are extra symbols added to it.)

A controlled search for the objective

The text-to-text instrument pays off in the experiments. With the wrapper fixed, T5 can run a staged search over the space of denoising objectives, changing one knob at a time. The figure walks the four stages, and the GLUE axis is deliberately held fixed across all of them so you can see the shape of the result.

Figure 5 · the objective search, on one axis


Four stages of the search (the paper's Figure 5), on a fixed GLUE axis. Stage one, the choice of approach, spans ten points. Every stage after it moves things by a point or two at most: the big lever is whether to denoise at all. Step through the stages. Numbers are verbatim from Tables 4 through 7.

The first stage compares genuinely different ideas: prefix language modeling (split the text, predict the second half), a BERT-style denoiser, and deshuffling (scramble the words, predict the original order). The denoiser wins at $82.96$, the prefix LM trails at $80.69$, and deshuffling is well back at $73.17$. That is a ten-point spread, and it is the only large gap the search will turn up.

From there the differences shrink to almost nothing. Simplifying the BERT objective (drop the random swaps, replace spans with sentinels, or drop corrupted tokens entirely) moves GLUE by no more than a point or two. Dropping tokens completely actually scores the highest GLUE at $84.44$, but it does worse on SuperGLUE (a harder sibling benchmark, built specifically to stay difficult for transfer-learning systems), so T5 keeps the sentinel-span version that produces short, well-behaved targets. Varying the corruption rate barely matters until you reach 50%, which finally hurts, so 15% stays. And varying the average span length nudges things by tenths of a point, with a length of 3 edging the others on the harder tasks. Each of those late stages is a near-flat row of bars.

Among denoising objectives the specific choice hardly matters, which is more useful than any single winner. So you should pick whichever is cheapest to train, which means whichever produces the shortest targets. Only the high-level decision changed the outcome: denoise rather than predict the next word. A controlled study that mostly returns "these knobs do not matter" is not a failed study. It tells you where to stop looking. And it licenses frugality: a choice shown to be inert may be made on cost alone, which is exactly what the paper does in keeping the shortest targets, while the attention it frees up concentrates on the one factor that moved, denoising over plain left-to-right prediction.

C4: a clean slice of the web

A pre-training objective needs text to run on, and the amount and quality of that text are variables in their own right. So the authors built a corpus of their own. The Colossal Clean Crawled Corpus, C4, starts from a single month of Common Crawl, the public archive that scrapes the open web (April 2019, about 20 terabytes of raw text), and cleans it hard with a set of blunt heuristics. Keep only lines that end in real punctuation. Drop pages with fewer than three sentences, and any line under five words. Drop any line containing the word "Javascript" (the "please enable Javascript" notices). Throw out pages that contain code (any page with a curly brace), placeholder "lorem ipsum", or words from a list of obscenities. Deduplicate repeated three-sentence spans. Keep only text a language detector calls English with at least 99% confidence. What survives is about 750 gigabytes of reasonably clean English, orders of magnitude larger than most of the pre-training sets that came before it.
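The line-level and page-level rules are blunt enough to sketch directly. This is an illustration of the heuristics as described above, not the released C4 pipeline: the set of terminal punctuation marks, the punctuation-counting stand-in for sentence splitting, the order in which rules are applied, and the `bad_words` argument are all assumptions, and deduplication and language detection are left out.

```python
TERMINAL = (".", "!", "?", '"')    # assumed set of "real punctuation" line endings

def clean_page(text, bad_words=frozenset()):
    """Return the cleaned page text, or None if the page is discarded."""
    lowered = text.lower()
    if "{" in text or "lorem ipsum" in lowered:
        return None                                   # code-like or placeholder page
    if any(w in lowered.split() for w in bad_words):
        return None                                   # blocklist hit
    kept = [
        line.strip() for line in text.splitlines()
        if line.strip().endswith(TERMINAL)            # must end in terminal punctuation
        and len(line.split()) >= 5                    # at least five words
        and "javascript" not in line.lower()          # drop "enable Javascript" notices
    ]
    # crude sentence count: one per terminal mark (a real splitter would do better)
    n_sentences = sum(line.count(".") + line.count("!") + line.count("?") for line in kept)
    if n_sentences < 3:
        return None                                   # too short to keep
    return "\n".join(kept)

page = "\n".join([
    "Home | About | Contact",
    "The quick brown fox jumps over the lazy dog.",
    "Please enable Javascript to view this page properly.",
    "This second line also has more than five words!",
    "And here is a third sentence to keep the page?",
])
cleaned = clean_page(page)
```

On this made-up page the menu line and the Javascript notice are dropped and the three remaining lines survive. A page containing a curly brace is rejected outright, which is also how the filter throws away some legitimate text that merely mentions code.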

The experiments on the data are as careful as the rest. Skipping the cleaning step makes performance worse, so the filtering earns its place. A more in-domain corpus can beat the diverse C4 on a related task: pre-training on Wikipedia plus a books corpus lifts SuperGLUE from $71.36$ to $73.24$, almost entirely from a reading-comprehension task whose passages come from books. But a single domain is by nature smaller, and a small corpus, repeated enough times during a long pre-training run, starts to hurt: the model begins to memorize rather than generalize, which shows up in the training loss. That is the argument for a corpus like C4, big and varied enough that a long run never has to repeat itself. The authors released it, and it went on to feed a large fraction of the models that came after.

You were given 4× compute. How should you spend it?

Everything so far has held compute roughly fixed. The last lever is to spend more of it, and the paper frames the question as sharply as anyone has: you were just handed four times the compute; what do you do with it? The backdrop is Rich Sutton's "bitter lesson", that general methods which soak up more computation tend to win out over hand-engineered ones. The question is which way of soaking it up pays best.

Figure 6 · spending 4× compute

Six ways to spend four times the baseline compute, scored on SuperGLUE or GLUE. Bigger model (teal) beats more steps or a bigger batch (amber), and a 2× model run 2× as long ties a 4× model. The ensemble of four separate models is orthogonal: strong on generation, flat on SuperGLUE. Numbers verbatim from Table 13.

The options are: train the baseline four times as long, train it with a four-times-larger batch, train a 2× model for 2× as long, train a 4× model for the baseline duration, or ensemble four models, either pre-trained and fine-tuned separately or fine-tuned separately from one shared pre-trained model. The findings are clear-cut. More training and a bigger model both help, and they are complementary, so a 2× model trained 2× as long and a 4× model trained for the baseline duration (both of which spend the same 4× compute) land right next to each other (SuperGLUE $77.18$ vs $78.04$). Ensembling is orthogonal: four fully separate models, averaged, do well on the generation tasks but barely move SuperGLUE.

It is easy to read too much into this section. T5 is reporting an empirical table of tradeoffs, not fitting a law. The power-law relationships between loss, model size, and compute (the Kaplan scaling laws, then Chinchilla) came after this paper. T5's contribution here is the practical ranking: when in doubt, make the model bigger, and ensembling gives you something extra on top.

Stack the winners, scale up

The finale stacks the winners. Span corruption with mean span length 3. The big C4 corpus, trained for a full million steps on a large batch, which works out to about one trillion tokens, roughly 32 times the baseline. A pass of multi-task pre-training (pre-train on a mixture of the unsupervised objective and all the downstream tasks at once) before fine-tuning on each task on its own. Beam search, which keeps several candidate sequences alive while decoding instead of greedily taking the top token. It helps only on the open-ended generation tasks, where the output is a long free-form string; the single-word classification and number targets gain nothing from it. And then the model is scaled across five sizes, from a 60-million-parameter Small up to an 11-billion-parameter giant. (The largest sizes grow mostly by widening the feed-forward network, the fully-connected layer inside each block, to $d_{\text{ff}}=65{,}536$ at 11B, because the accelerators of the day were happiest doing big dense matrix multiplies.)
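The token count is quick to check by hand. The step counts, batch sizes, and sequence length below are the figures I recall from the paper's experimental setup rather than anything stated in this explainer, so treat them as assumptions.

```python
SEQ_LEN = 512                                    # tokens per sequence (assumed)
baseline_tokens = 2**19 * 128 * SEQ_LEN          # baseline: 2^19 steps, 128 sequences per batch
final_tokens = 1_000_000 * 2**11 * SEQ_LEN       # final run: 1M steps, 2^11 sequences per batch
ratio = final_tokens / baseline_tokens
```

Under those assumptions the baseline sees $2^{35}$ tokens (about 34 billion), the final run sees a little over one trillion, and the ratio is about 30.5, or exactly 32 if "a million steps" is read as $2^{20}$.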

Figure 7 · scale climbs toward human

The same recipe at five sizes, on SuperGLUE. Small to 11B climbs from $63.3$ to $88.9$, clearing the previous best ($84.6$) and stopping just short of the human baseline ($89.8$). Only scale changes between the bars. Numbers verbatim from Table 14.

T5-11B set the state of the art on 18 of the 24 tasks the paper measures: all of GLUE, most of SuperGLUE, SQuAD, summarization, and others, though not translation. It scored $90.3$ on GLUE and $91.26$ on SQuAD, a reading-comprehension test where the model answers questions about a passage. The SQuAD number is exact match, the fraction of answers that match the reference word for word; human performance is about $82.3$, so the headroom may be mostly gone. The most striking number is on SuperGLUE: T5 pushed the state of the art from $84.6$ to $88.9$, within $0.9$ points of the human baseline of $89.8$. The biggest single ingredient in getting there was the jump to 11 billion parameters.

The gaps matter as much as the wins, and the paper is forthright about them. T5 did not reach the state of the art on any of the WMT translation tasks (the standard machine-translation benchmarks, on language pairs like English to German). Its English-only pre-training and its lack of backtranslation (training on extra synthetic data made by translating target-language text back into the source) left it behind systems built for the job, and no amount of scale closed that gap. The reason is mechanical: translation depends on source-to-target alignments, which words map to which, and a model can only absorb those from parallel or back-translated text; an English-only pre-training corpus simply never contains them, so scaling it up buys more English fluency and no more alignment. This is the cost of the text-to-text choice: a generic string interface gives up the task-specific structure that translation rewards.

For a landmark paper, T5 introduced almost no new mechanism. What it introduced instead was a way to ask the question cleanly, a corpus big enough to answer it, and the discipline to report what did not matter alongside what did. The text-to-text format it standardized became the default framing for the instruction-tuned models that followed; the C4 corpus it released fed a generation of them. A survey done this well leaves behind a clear picture of which choices matter, and the field had been missing one. T5 supplied it, then scaled the model as far as compute allowed to test the limit.

Provenance: verified against primary literature

Vaswani et al. (2017): The encoder-decoder Transformer T5 builds on, almost unchanged.

BERT / MLM (2018): The denoising objective T5 adapts; T5’s "BERT-style" is a simplified version.

RMSNorm (2019): T5’s rescale-only LayerNorm is what this contemporaneous paper named RMSNorm.

Shaw et al. (2018): Relative positions; T5 shrinks the learned vectors to a per-head scalar bias.

Sutton, "Bitter Lesson": The premise the scaling experiments put to the test.

Correction: T5’s "BERT-style" objective is not BERT’s. BERT corrupts 15% of tokens as 80% mask / 10% random / 10% unchanged and predicts only the selected positions at the encoder. T5’s reimplementation uses 90% mask / 10% random and rebuilds the entire sequence; its final objective predicts only the dropped spans. The explainer teaches each recipe separately.

Questions you might still have

Why an encoder-decoder, when decoder-only models later won?
On T5’s own apples-to-apples comparison, the encoder-decoder beat both the decoder-only language model and the prefix LM on every task, and it carries twice the parameters at about the same compute (the encoder only touches the input, the decoder only the output). Decoder-only models won later for other reasons (raw scale, simplicity, in-context learning), not because they were better on this controlled test.

If the objective barely matters, why search for it at all?
The search’s payoff was the negative result. The first stage is the only one with a large gap: the BERT-style denoiser beats prefix language modeling by about two GLUE points and deshuffling by about ten. After that, every denoising variant lands within a point or two of every other. Knowing the choice does not matter lets you pick the cheapest one (short targets), and knowing the one variable that does matter (denoise versus not) is itself a finding.

Is "T5" the model or the paper?
Both. The paper is a controlled survey of transfer-learning choices; T5 is the final artifact that bolts the survey’s winners onto scale. The lasting contribution is as much the comparison and the C4 corpus as the weights.

Why did it miss state-of-the-art on translation?
English-only pre-training and no backtranslation. The best WMT systems lean on cross-lingual data and a heavy data-augmentation scheme that the plain text-to-text recipe did not include. Scale alone did not close that gap, which the paper reports plainly.

Footnotes & further reading

The paper: Raffel, Shazeer, Roberts, Lee, Narang, Matena, Zhou, Li, Liu, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Google, JMLR 2020). Code and the C4 data set.
The backbone: Vaswani et al., Attention Is All You Need (2017).
The denoising objective T5 adapts: Devlin et al., BERT (2018); the span-masking idea, Joshi et al., SpanBERT (2019), and the masked seq2seq objective, Song et al., MASS (2019).
The rescale-only normalization, named the same week: Zhang and Sennrich, Root Mean Square Layer Normalization (2019).
Relative position representations: Shaw et al., Self-Attention with Relative Position Representations (2018).
The prefix-LM architecture: Dong et al., Unified Language Model Pre-training (UniLM) (2019).
The scaling premise: Rich Sutton, The Bitter Lesson (2019); and the optimizer, Shazeer and Stern, Adafactor (2018).
