Verified · arXiv:1910.10683 · 26 min
LLMs · Transfer learning

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Every task is the same task: text in, text out.

T5 does not invent a clever new mechanism. It makes one bet, that every language problem can be written as text-to-text, and then uses that uniformity to run the experiment the field had been too fragmented to run: hold everything fixed and ask what actually matters.

Explaining the paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer · Raffel, Shazeer, Roberts, Lee, Narang, Matena, Zhou, Li, Liu · Google · JMLR 2020 · arXiv:1910.10683

What if translation, sentiment, and question answering were all one problem?

By late 2019 transfer learning had taken over natural language processing, and the field was a mess of good ideas that nobody could compare. BERT pre-trained an encoder and bolted a classifier on top. GPT-2 ran a decoder and predicted the next word. XLNet, RoBERTa, UniLM, MASS, SpanBERT: each shipped its own architecture, its own pre-training objective, its own unlabeled corpus, its own fine-tuning trick, often all at once. When a new model beat the last one, you could rarely say why. Was it the objective? The data? The extra billion tokens? The architecture? The contributions were tangled together, and the tangle was slowing the field down.

T5, out of Google, is the paper that untangled it. The authors are blunt that they are not proposing a new method. They are running a survey, the careful kind, where you fix everything you can and change one thing at a time. To make that possible they needed a single setup flexible enough to express every task they wanted to study, so they could swap objectives and architectures and corpora without ever changing the wrapper around them. That setup is the text-to-text framework, and it is the whole foundation. Once it is in place, the paper becomes a long, honest list of controlled experiments, and at the end the winners get stacked together and scaled up into a model the authors call T5 (the Text-to-Text Transfer Transformer), which then sets the state of the art on most of the benchmarks it touches.

The path follows the paper. First the framework that makes everything comparable. Then the backbone Transformer and the three architectures it can wear. Then the pre-training objective and the search that picks it. Then the data, the scaling, and what the finished model does. None of the pieces is hard; the discipline of comparing them cleanly is the contribution.

Every task is text in, text out

Start with the one idea everything rests on. A model that classifies sentiment, a model that translates, and a model that answers questions normally look nothing alike: different output layers, different losses, different decoding. T5 refuses all of that. Every task is reduced to the same shape. You feed the model some text, and you train it to produce some text. Nothing else.

The trick that makes this work for tasks that are not obviously "generate text" is a task prefix: a few words at the front of the input that name the job. There is no machinery behind the prefix; it is ordinary conditioning text. Inputs that begin "translate English to German:" are always paired with German targets in training, so the model learns that those words predict German output, and the prefix steers the decoder the way any context steers a language model. No task-specific head is required. You feed the model the input on the left and train it to emit the target on the right:

# every task is one (input string -> target string) pair
"translate English to German: That is good."   ->  "Das ist gut."
"cola sentence: The professor talked us."       ->  "unacceptable"
"mnli premise: I hate pigeons. hypothesis: ..." ->  "entailment"
"stsb sentence1: A plane lands. sentence2: ..." ->  "4.0"
"summarize: <a long news article> ..."          ->  "<a short summary>"

For a classification task the target is just the label written out as a word. On the MNLI inference benchmark, where you decide whether a premise entails, contradicts, or is neutral toward a hypothesis, the prefix is mnli and the target is the single word entailment. Even a regression task bends into the format: STS-B asks for a similarity score from 1 to 5, so T5 rounds it to the nearest 0.2 and predicts the number as a literal string like 3.8, which quietly turns the regression into a 21-way choice over string targets. (If the model ever emits something that is not a valid label or number, it is scored wrong, which the authors say never actually happened.)
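The casting described above can be sketched in a few lines. This is a minimal illustration, not the paper's preprocessing code: `to_text_pair` and `stsb_target` are hypothetical helper names, and the rounding rule is simply "nearest 0.2, printed with one decimal".

```python
from typing import Tuple


def to_text_pair(prefix: str, text: str, target: str) -> Tuple[str, str]:
    """Cast a task example as one (input string, target string) pair."""
    return f"{prefix}: {text}", target


def stsb_target(score: float) -> str:
    """Round a 1-to-5 similarity score to the nearest 0.2 and render it as text."""
    return f"{round(score * 5) / 5:.1f}"


pairs = [
    to_text_pair("translate English to German", "That is good.", "Das ist gut."),
    to_text_pair("cola sentence", "The professor talked us.", "unacceptable"),
]

# Sweep scores from 1.00 to 5.00 and collect the distinct string targets.
stsb_labels = sorted({stsb_target(1 + i / 100) for i in range(401)})

for source, target in pairs:
    print(repr(source), "->", repr(target))
print(len(stsb_labels), "distinct STS-B targets, from", stsb_labels[0], "to", stsb_labels[-1])
```

Sweeping the score range shows where the 21-way choice comes from: every real-valued score lands on one of the strings 1.0, 1.2, and so on up to 5.0.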

Figure 1 · one model, every task
Pick a task. The teal prefix names the job; the input rides in behind it; the same model emits the answer as plain text, a German sentence, a class label, a similarity number, or a summary. The translate and mnli examples are verbatim from the paper; the rest are illustrative in the same format.

Because the shape never changes, the training objective never changes either. Every task, pre-training and fine-tuning alike, is trained with the same maximum-likelihood loss: the model predicts the target one token at a time, is shown the true previous token while learning (this is called teacher forcing), and is penalized by the cross-entropy between its predicted next-token distribution and the truth. Decoding never changes either. The same model, the same loss, the same hyperparameters, the same decoder, for translation and grammar-checking and summarization.
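To make the shared loss concrete, here is a small pure-Python sketch of teacher-forced cross-entropy. It assumes the model has already produced one row of vocabulary scores per target position; the function names and the toy four-token vocabulary are illustrative, not taken from the paper.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_forced_loss(step_logits, target_ids):
    """Mean cross-entropy over the target positions.

    step_logits[t] holds the model's scores at step t. Under teacher forcing
    those scores were computed with the TRUE tokens target_ids[:t] as decoder
    input, never the model's own earlier guesses.
    """
    losses = []
    for logits, gold in zip(step_logits, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[gold]))
    return sum(losses) / len(losses)


# Toy vocabulary of 4 tokens and a target of length 2.
uniform = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
confident = [[9.0, 0.0, 0.0, 0.0], [0.0, 0.0, 9.0, 0.0]]
loss_uniform = teacher_forced_loss(uniform, [0, 2])      # equals log(4): pure guessing
loss_confident = teacher_forced_loss(confident, [0, 2])  # close to zero
print(loss_uniform, loss_confident)
```

The same function would score a translation, a class label, or a summary, because in every case the target is just a sequence of token ids.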

That uniformity is the paper's experimental instrument. When the wrapper around every task is identical, you can change the pre-training objective and measure the effect with nothing else moving. You can swap the architecture underneath and compare it honestly. You can change the corpus and read off the difference. Task-specific machinery would have been a confound in every one of those comparisons, and the text-to-text format removes it. The rest of the paper spends that budget.

Two honest edges. A handful of tasks need a little massaging to fit (the Winograd pronoun puzzles get recast as "predict the noun the starred pronoun refers to"). And a plain string interface cannot express every inductive bias a task might want. Translation, in particular, leans on alignment structure that a generic text-to-text model does not get for free, and translation is exactly where T5 later falls short of the best specialized systems. The format buys comparability, not omnipotence.

The backbone: an encoder-decoder Transformer

Under the text-to-text wrapper sits a Transformer, and T5 deliberately keeps it close to the original 2017 design. Two stacks. An encoder reads the input and builds a representation of it, with each token free to look at every other token. A decoder writes the output one token at a time, looking back at its own partial output and, through a cross-attention step, at the encoder's reading of the input. The engine in both stacks is self-attention, which rebuilds each position as a weighted average of the others:

$$y_i = \sum_j w_{ij}\, x_j, \qquad w_{ij} = \mathrm{softmax}_j\!\left(\frac{q_i \cdot k_j}{\sqrt{d_{kv}}}\right) \tag{1}$$

Here $x_j$ are the incoming token vectors, $q_i$ and $k_j$ are the query and key projections, and $w_{ij}$ is how much position $i$ attends to position $j$. The weights are a softmax, so they are non-negative and sum to one across $j$: each output token is a convex blend of the inputs. That is the entire mechanism, repeated across many parallel heads and stacked into blocks of attention plus a small feed-forward network. Cross-attention is the same operation with one twist: the decoder's queries attend to the encoder's keys and values rather than its own, and that is the only channel through which the two stacks talk.
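A minimal single-head sketch of (1) in plain Python may help. To keep it short it assumes the query and key projections are the identity (so $q_i = k_i = x_i$), which is a simplification for illustration only; a real layer learns separate projection matrices.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(z - m) for z in row]
    s = sum(exps)
    return [e / s for e in exps]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def self_attention(x):
    """Single-head attention as in (1), with q_i = k_i = x_i (an assumption)."""
    d_kv = len(x[0])
    # weights[i][j]: how much position i attends to position j
    weights = [softmax([dot(q, k) / math.sqrt(d_kv) for k in x]) for q in x]
    # each output is a weighted average of the input vectors
    y = [[sum(w * xj[c] for w, xj in zip(row, x)) for c in range(d_kv)]
         for row in weights]
    return y, weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y, w = self_attention(x)
for row in w:
    print([round(v, 3) for v in row], "sum =", round(sum(row), 6))
```

Each row of weights sums to one, so every output coordinate stays between the smallest and largest input value in that coordinate, which is what "convex blend" means in practice.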

T5 makes three small changes to that backbone, and it is worth naming them precisely because two of them became standard equipment in later models. First, it simplifies the layer normalization: instead of re-centering activations and adding a learned bias, it only rescales them by their root-mean-square. There is no mean subtraction and no additive term. The wager is that normalization's load-bearing half is the rescale, the part that keeps activations bounded; the centering and the bias turn out to be passengers, so T5 keeps the wheel and drops the passengers.

$$\mathrm{LN}(x) = \gamma \odot \frac{x}{\sqrt{\tfrac{1}{d}\sum_k x_k^2 + \epsilon}} \tag{2}$$

This is exactly what a paper published the same week would name RMSNorm. T5 does not use that name (the two were contemporaneous, not derivative), but the operation is identical, and it is the normalization that LLaMA and most of what followed adopted. Second, T5 puts the normalization before each sublayer rather than after it, so a block computes $x + \text{sublayer}(\mathrm{LN}(x))$ and the residual skip bypasses the norm entirely. This is the "pre-norm" arrangement, and it trains more stably than the original "post-norm" $\mathrm{LN}(x + \text{sublayer}(x))$.
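The two normalization changes are small enough to write out. The sketch below implements (2) and the pre-norm and post-norm block orderings in plain Python; the function names and the `eps` default are illustrative choices, and `sublayer` stands in for attention or the feed-forward network.

```python
import math


def rms_norm(x, gamma, eps=1e-6):
    """Equation (2): rescale by the root-mean-square; no centering, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]


def pre_norm_block(x, sublayer, gamma):
    """x + sublayer(LN(x)): the residual path skips the norm."""
    normed = rms_norm(x, gamma)
    return [xi + si for xi, si in zip(x, sublayer(normed))]


def post_norm_block(x, sublayer, gamma):
    """LN(x + sublayer(x)): the original Transformer ordering."""
    summed = [xi + si for xi, si in zip(x, sublayer(x))]
    return rms_norm(summed, gamma)


def zero_sublayer(h):
    return [0.0] * len(h)


gamma = [1.0, 1.0, 1.0]
constant = rms_norm([2.0, 2.0, 2.0], gamma)
passthrough = pre_norm_block([3.0, 4.0, 0.0], zero_sublayer, gamma)
print(constant)      # roughly [1, 1, 1]; a mean-centering norm would give zeros
print(passthrough)   # unchanged input: the skip connection bypasses the norm
```

A constant vector shows the missing centering step, and the zero sublayer shows why pre-norm leaves an untouched identity path through the block.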

Third, and most distinctive, T5 throws out absolute position embeddings. The original Transformer told each token where it sat with a fixed sinusoid; T5 instead learns a relative position bias: a single scalar, added directly to the attention logit in (1), that depends only on the offset $i - j$ between the query and key. Offsets are sorted into 32 buckets (exact for small distances, then growing logarithmically, with everything past 128 tokens collapsed into one far-away bucket), so the whole bias table stays tiny while still capturing how far apart two tokens are. Each bucket owns one learned scalar, and the table is shared across all layers while each attention head keeps its own copy. It is a deliberate simplification of the relative-position scheme from Shaw et al., which added learned vectors rather than a single scalar. The authors note these three tweaks are orthogonal to the study and leave ablating them to future work.

Drag the offset below and watch which bucket catches it: 5 gets a bucket of its own, 90 lands in a band 27 offsets wide, and everything from 91 out shares the last one.

Figure 2 · 32 buckets for every offset
The band boundaries are computed with T5's actual bucketing function (relative_position_bucket, num_buckets=32, max_distance=128), not sketched. Offsets 0 through 7 each own a bucket, the bands then widen logarithmically, and one terminal bucket catches every offset from 91 out, which is how everything past the 128 horizon collapses into one. The 32 buckets split 16 per sign of $i-j$; the positive half is shown, the negative half mirrors it. Each bucket owns one learned scalar per head, added to the attention logit; the bias values here are illustrative, not trained.
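For readers who want to check the band boundaries themselves, here is a scalar Python re-implementation of the bucketing rule described above. It is a sketch modeled on the bidirectional case; the sign convention (argument is key index minus query index, with positive values mapped to the upper 16 buckets) is an assumption of this sketch, and the real implementations operate on tensors rather than single integers.

```python
import math


def relative_position_bucket(relative_position, num_buckets=32, max_distance=128):
    """Map one signed offset (key index minus query index) to a bucket id."""
    half = num_buckets // 2                 # 16 buckets per sign
    offset = half if relative_position > 0 else 0
    n = abs(relative_position)
    max_exact = half // 2                   # distances 0..7 get a bucket each
    if n < max_exact:
        return offset + n
    # beyond max_exact, bucket width grows logarithmically up to max_distance
    log_pos = math.log(n / max_exact) / math.log(max_distance / max_exact)
    bucket = max_exact + int(log_pos * (half - max_exact))
    return offset + min(bucket, half - 1)   # everything far away shares one bucket


for distance in (5, 70, 90, 91, 500):
    print(distance, "->", relative_position_bucket(-distance))
```

With these defaults a distance of 5 keeps its own bucket, 90 falls in the last wide band, and 91 onward shares the terminal bucket, matching the figure's description.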

The baseline that anchors every experiment is sized to be familiar: the encoder and decoder are each about a BERT-Base stack, twelve blocks apiece, a hidden width of $d_{\text{model}}=768$ (the size of each token vector), twelve heads, roughly 220 million parameters in total. That is about twice BERT-Base, because there are two stacks instead of one, and that doubling turns out to be nearly free, which the next section makes precise.

Three shapes of attention

The text-to-text format does not actually require an encoder-decoder. A single decoder-only stack, the GPT shape, can do text-to-text too: just glue the input and the target into one long sequence and predict it left to right. So before settling on encoder-decoder, T5 compares the architectures head to head, and the thing that separates them is the attention mask, the rule for who is allowed to look at whom.

Self-attention treats its input as a set, so on its own it has no notion of past and future. The mask imposes one by forcing forbidden weights to zero (in practice, by setting their logits to negative infinity before the softmax). Three patterns matter. A fully-visible mask lets every position attend to every other; it is what an encoder uses, and what BERT uses. A causal mask forbids looking ahead, setting $w_{ij}=0$ for every $j>i$, so position $i$ sees only itself and the past; it is what a decoder uses when it generates. (The convention here is that a zeroed weight means "cannot attend".) The third, causal with a prefix, is the interesting hybrid: fully visible over an initial chunk of the sequence, then causal for the rest.

Those three masks define three architectures. The encoder-decoder uses a fully-visible encoder and a causal decoder joined by cross-attention. A language model is a single causal stack over the glued input-plus-target. And a prefix LM is a single stack that treats the input as a fully-visible prefix and only goes causal once it starts generating the target. The figure shows the masks over a short [input | target] sequence: notice that the language model needlessly hides earlier input tokens from later ones (its prefix block is a triangle), while the encoder-decoder and the prefix LM both give the input full visibility (a solid square).

Figure 3 · the mask makes the model
Who can attend to whom, over five input tokens then four target tokens. The plain language model is one causal triangle, so even within the input each token hides from its successors. The encoder-decoder and prefix LM light up the whole input block. All three cost the same compute $M$; only the encoder-decoder spends $2P$ parameters, and it wins GLUE (from the paper's Table 2, denoising objective).
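The three mask patterns are easy to build by hand. The sketch below constructs each one over a single [input | target] sequence of five plus four positions, using 1 for "may attend" and 0 for "cannot attend"; `attention_mask` and its `kind` strings are names invented for this illustration. The fully-visible pattern is what the encoder applies to its input, while the causal and prefix patterns are what the two single-stack models apply to the glued sequence.

```python
def attention_mask(n_input, n_target, kind):
    """mask[i][j] == 1 means position i may attend to position j."""
    n = n_input + n_target
    if kind == "fully_visible":
        return [[1] * n for _ in range(n)]
    if kind == "causal":
        return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
    if kind == "prefix":
        # full visibility over the input block, causal afterwards
        return [[1 if (j <= i or j < n_input) else 0 for j in range(n)]
                for i in range(n)]
    raise ValueError(f"unknown mask kind: {kind}")


causal = attention_mask(5, 4, "causal")
prefix = attention_mask(5, 4, "prefix")

for name, mask in (("causal", causal), ("prefix", prefix)):
    print(name)
    for row in mask:
        print("".join("#" if v else "." for v in row))
```

Printed side by side, the causal mask is one triangle, while the prefix mask shows a solid block over the first five columns and a triangle only over the target.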

Now the accounting that makes the encoder-decoder attractive. Stack the encoder's 12 layers on top of the decoder's 12 and you have twice the parameters of a 12-layer decoder-only model. But you do not pay twice the compute, because the encoder only ever runs over the input and the decoder only ever runs over the target, whereas a decoder-only model runs all of its layers over the input and target concatenated together. Each token visits one stack, not both: input tokens flow through the encoder, target tokens through the decoder, so doubling the parameters does not double the work any single token pays for. The upshot, in the paper's bookkeeping, is that an $L+L$ encoder-decoder has about $2P$ parameters but roughly the same compute $M$ as an $L$-layer language model with only $P$ parameters. (It is not exactly equal: the cross-attention adds about 10% of the parameters and the attention cost depends on the sequence lengths, but it is close.) You get to spend twice the parameters at about the same cost.

The reason this is a free lunch and not an accounting trick is worth seeing from the other side, comparing the encoder-decoder to a decoder-only model that holds the same total $2P$ parameters. In the encoder-decoder those $2P$ parameters are split into two separate stacks of $P$ each, and the two stacks see disjoint tokens: an input token is read by the encoder and never touches the decoder's weights, a target token is written by the decoder and never touches the encoder's. So each token is processed by exactly one stack of $P$ parameters, and the cost per token is set by $P$, not $2P$. A decoder-only model that wants the same $2P$ parameters has to put them all in one stack, and a single stack has no way to route some tokens around half of itself, so every token, input and target alike, is pushed through all $2P$ of them. Same parameter budget, but the decoder-only model pays for all of it on every token while the encoder-decoder pays for half. That is why the encoder-decoder buys the second $P$ at roughly the same per-token compute $M$: the parameters are doubled, but the work each token does is not.
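The bookkeeping above reduces to a few lines of arithmetic. The sketch below uses a deliberately crude proxy, work proportional to parameters touched times tokens processed, and ignores cross-attention and the sequence-length dependence of attention itself; the function names and the example sizes are illustrative assumptions, not figures from the paper.

```python
def stack_cost(params_in_stack, tokens_through_stack):
    """Crude proxy: work ~ parameters touched x tokens processed."""
    return params_in_stack * tokens_through_stack


def encoder_decoder_cost(p, n_input, n_target):
    # input tokens visit only the encoder, target tokens only the decoder
    return stack_cost(p, n_input) + stack_cost(p, n_target)


def decoder_only_cost(total_params, n_input, n_target):
    # one stack: every token passes through all of its parameters
    return stack_cost(total_params, n_input + n_target)


P, n_in, n_out = 110_000_000, 512, 114
enc_dec = encoder_decoder_cost(P, n_in, n_out)         # holds 2P parameters
small_lm = decoder_only_cost(P, n_in, n_out)           # holds P parameters
big_lm = decoder_only_cost(2 * P, n_in, n_out)         # holds 2P parameters
print(enc_dec == small_lm, big_lm / enc_dec)
```

Under this proxy the encoder-decoder costs the same as the decoder-only model with half its parameters, and half as much as the decoder-only model with the same parameter count.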

And the parameters earn their keep. The scores here are on GLUE, a suite of nine English language-understanding tasks (grammaticality, entailment, paraphrase, sentiment, and the like) averaged into a single number. Across every task, the encoder-decoder with a denoising objective came out on top: GLUE 83.28, against 81.82 for the prefix LM and 74.70 for the plain language model. Sharing the encoder and decoder weights (halving the parameter count back down to $P$) barely dented it, to 82.81, which is a useful trick when memory is tight. The prefix LM beat the plain LM, confirming that the damage in a decoder-only model is largely that causal masking hides the input from itself. And in every architecture, the denoising objective beat a plain language-modeling objective, which is the thread we pick up next.

Pre-training: fill in the blanks

Pre-training needs a task you can run on raw text with no human labels, that still teaches the model something general. The historically obvious choice is language modeling: predict the next word. But by 2019 the field had largely moved to denoising objectives, where you corrupt the input and ask the model to repair it. BERT's "masked language modeling" is the famous example: hide some words and predict them from both sides. Denoising consistently beat next-word prediction on downstream tasks, and T5 confirms it.

T5's own objective is a denoising variant tuned for an encoder-decoder. It drops 15% of the tokens. Then every consecutive run of dropped tokens is replaced in the input by a single sentinel token, a special symbol unique within the example (<X>, <Y>, and so on). The target is not the whole repaired sentence. It is only the dropped spans, each tagged with the sentinel that stands in for it, ended by one final sentinel. The figure runs the transformation on the paper's own example sentence.

Figure 4 · span corruption
Drop 15% of the tokens; collapse each consecutive dropped span into one sentinel; train the decoder to emit only the dropped spans, each tagged by its sentinel, with a final sentinel to stop. The default shows the paper's exact example. Reroll to resample, or push the rate up. The target is always far shorter than the input, which is the whole point.

Two design choices hide in there, and both are about speed, not accuracy. Collapsing a whole span into one sentinel, and predicting only the corrupted tokens rather than reconstructing the entire sentence, both make the target sequence short. Shorter targets mean less work in the decoder, so pre-training runs faster, and the paper will lean on that to train for far longer later. The arithmetic is the point: the decoder's share of pre-training compute scales with the tokens it must produce, and a target that lists only the dropped spans is several times shorter than the full sentence, so the same budget buys correspondingly more pre-training. In pseudocode the transformation is almost trivial:

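# span corruption: drop about 15% of tokens in spans of length 3 (simplified sampler, not the paper's)
import random
def span_corrupt(tokens, rate=0.15, span_len=3, seed=0):
    rng = random.Random(seed)
    n_spans = max(1, round(len(tokens) * rate / span_len))
    starts = rng.sample(range(len(tokens)), n_spans)     # random span starts; spans may merge or clip
    drop = {i for s in starts for i in range(s, min(s + span_len, len(tokens)))}
    src, tgt, k = [], [], 0
    for i, tok in enumerate(tokens):
        if i not in drop:
            src.append(tok)                  # every token that was NOT dropped stays in src, in place
            continue
        if i - 1 not in drop:                # first token of a run of dropped tokens
            sentinel = f"<extra_id_{k}>"; k += 1
            src.append(sentinel)             # one sentinel replaces the whole span
            tgt.append(sentinel)             # the same sentinel tags the span in the target
        tgt.append(tok)                      # target lists the span back, in order
    tgt.append(f"<extra_id_{k}>")            # a final sentinel ends the target
    return src, tgt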
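The speed argument above is really a statement about sequence lengths, and the arithmetic can be sketched directly. The helper names below are hypothetical, and the worked case assumes the paper's example sentence ("Thank you for inviting me to your party last week.") is counted as ten tokens with three of them dropped in two spans.

```python
def corrupted_lengths(n_tokens, n_dropped, n_spans):
    """Lengths of the corrupted input and the target after span corruption."""
    src_len = n_tokens - n_dropped + n_spans   # kept tokens plus one sentinel per span
    tgt_len = n_dropped + n_spans + 1          # dropped tokens, their sentinels, final sentinel
    return src_len, tgt_len


def expected_lengths(n_tokens, rate=0.15, mean_len=3):
    """Typical lengths for a given corruption rate and mean span length."""
    n_dropped = round(n_tokens * rate)
    n_spans = round(n_dropped / mean_len)
    return corrupted_lengths(n_tokens, n_dropped, n_spans)


example = corrupted_lengths(n_tokens=10, n_dropped=3, n_spans=2)
typical = expected_lengths(512)
print(example)   # lengths for the ten-token example
print(typical)   # lengths for a 512-token sequence
```

For a 512-token sequence the target comes out at roughly a fifth of the original length, which is the sense in which the decoder has several times less to produce than it would if it rebuilt the whole sentence.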

One thing to pin down: the paper's "BERT-style" comparison point is not actually BERT's recipe. Real BERT, of the 15% of tokens it selects, replaces 80% with a mask token, 10% with a random token, and leaves 10% unchanged, and it predicts only those selected positions at the output of its single encoder. T5's reimplemented "BERT-style" objective simplifies this to 90% mask and 10% random, and because it is an encoder-decoder, it reconstructs the entire uncorrupted sequence at the decoder. T5's final span-corruption objective then goes further and predicts only the spans. Three different recipes, and T5's "BERT-style" baseline is an adaptation, not BERT itself. (The tokenizer underneath all of this is SentencePiece with a shared 32,000-token vocabulary covering English, German, French, and Romanian; the sentinels are extra symbols added to it.)
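The difference between the recipes is mostly a pair of probabilities plus a choice of what the target is. The sketch below is illustrative only: `corrupt_selected`, the `<M>` mask symbol, and the toy vocabulary are assumptions made for this example, and the selection of which 15% of positions to corrupt is taken as given.

```python
import random

MASK = "<M>"


def corrupt_selected(tokens, selected, p_mask, p_random, vocab, rng):
    """Corrupt the selected positions: mask, swap for a random token, or keep."""
    out = list(tokens)
    for i in selected:
        r = rng.random()
        if r < p_mask:
            out[i] = MASK
        elif r < p_mask + p_random:
            out[i] = rng.choice(vocab)
        # otherwise the token is left unchanged
    return out


def bert_recipe(tokens, selected, vocab, rng):
    """80% mask / 10% random / 10% unchanged; predict only selected positions."""
    corrupted = corrupt_selected(tokens, selected, 0.8, 0.1, vocab, rng)
    return corrupted, {i: tokens[i] for i in selected}


def t5_bert_style(tokens, selected, vocab, rng):
    """90% mask / 10% random; the target is the whole original sequence."""
    corrupted = corrupt_selected(tokens, selected, 0.9, 0.1, vocab, rng)
    return corrupted, list(tokens)


tokens = "Thank you for inviting me to your party last week .".split()
vocab = ["apple", "river", "seven"]          # toy vocabulary, disjoint from the sentence
rng = random.Random(0)
bert_in, bert_tgt = bert_recipe(tokens, [2, 8], vocab, rng)
t5_in, t5_tgt = t5_bert_style(tokens, [2, 8], vocab, rng)
print(bert_in, bert_tgt)
print(t5_in, t5_tgt)
```

The inputs look similar, but the BERT target covers only the selected positions, while the T5 reimplementation's target is the full sentence; span corruption then shortens that target again.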

The text-to-text instrument pays off in the experiments. With the wrapper fixed, T5 can run a clean, staged search over the space of denoising objectives, changing one knob at a time. The figure walks the four stages, and the GLUE axis is deliberately held fixed across all of them so you can see the shape of the result.

Figure 5 · the objective search, on one axis
Four stages of the search (the paper's Figure 5), on a fixed GLUE axis. Stage one, the choice of approach, spans ten points. Every stage after it moves things by barely one. The big lever was "denoise"; the rest is bookkeeping. Step through the stages. Numbers are verbatim from Tables 4 through 7.

The first stage compares genuinely different ideas: prefix language modeling (split the text, predict the second half), a BERT-style denoiser, and deshuffling (scramble the words, predict the original order). The denoiser wins at 82.96, the prefix LM trails at 80.69, and deshuffling is well back at 73.17. That is a ten-point spread, and it is the only large gap in the whole search.

From there the differences shrink to almost nothing. Simplifying the BERT objective (drop the random swaps, replace spans with sentinels, or drop corrupted tokens entirely) moves GLUE around inside a single point. Dropping tokens completely actually scores the highest GLUE at 84.44, but it does worse on SuperGLUE (a harder sibling benchmark, built specifically to stay difficult for transfer-learning systems), so T5 keeps the sentinel-span version that produces short, well-behaved targets. Varying the corruption rate barely matters until you reach 50%, which finally hurts, so 15% stays. And varying the average span length nudges things by tenths of a point, with a length of 3 edging the others on the harder tasks. Each of those late stages is a near-flat row of bars.

The meta-result is more useful than any single winner. Among denoising objectives, the specific choice hardly matters, so you should pick whichever is cheapest to train, which means whichever produces the shortest targets. The one decision that moved the needle was the high-level one: denoise, do not just predict the next word. A controlled study that mostly returns "these knobs do not matter" is not a failed study. It tells you where to stop looking. And it licenses frugality: a choice shown to be inert may be made on cost alone, which is exactly what the paper does in keeping the shortest targets, while the attention it frees up concentrates on the one lever that moved, denoising over plain left-to-right prediction.

C4: a clean slice of the web

A pre-training objective needs text to run on, and the amount and quality of that text is its own variable. So the paper built one. The Colossal Clean Crawled Corpus, C4, starts from a single month of Common Crawl, the public archive that scrapes the open web (April 2019, about 20 terabytes of raw text), and cleans it hard with a set of blunt heuristics. Keep only lines that end in real punctuation. Drop pages with fewer than three sentences, or any line under five words. Throw out pages with code (any page containing a curly brace), boilerplate ("terms of use", "privacy policy"), placeholder "lorem ipsum", or words from a list of obscenities. Deduplicate repeated three-sentence spans. Keep only text a language detector calls English with at least 99% confidence. What survives is about 750 gigabytes of reasonably clean English, orders of magnitude larger than the pre-training sets that came before it.
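A few of these heuristics can be sketched as a single filter function. This is a simplified illustration of the rules as this section describes them, not the released C4 pipeline: it skips the obscenity list, the three-sentence deduplication, and the language detector, and it counts sentences naively by terminal punctuation. All names are hypothetical.

```python
import re

PAGE_REJECT_MARKERS = ("{", "lorem ipsum", "terms of use", "privacy policy")
TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def clean_page(text, min_words=5, min_sentences=3):
    """Return the cleaned page text, or None if the page should be dropped."""
    lowered = text.lower()
    if any(marker in lowered for marker in PAGE_REJECT_MARKERS):
        return None                                   # code, placeholder, or boilerplate page
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue                                  # keep only lines ending in real punctuation
        if len(line.split()) < min_words:
            continue                                  # drop very short lines
        kept.append(line)
    cleaned = "\n".join(kept)
    n_sentences = len(re.findall(r"[.!?]", cleaned))  # naive sentence count (an assumption)
    return cleaned if n_sentences >= min_sentences else None


page = "\n".join([
    "The model reads the whole input at once.",
    "Menu",
    "It then writes the answer one token at a time.",
    "Read more",
    "Every task uses the same loss and decoder.",
])
print(clean_page(page))
print(clean_page("function f() { return 1; }"))
```

On the sample page the navigation fragments are stripped and the three full sentences survive, while the snippet containing a curly brace is rejected outright.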

The experiments on the data are as careful as the rest. Skipping the cleaning step makes performance worse, so the filtering earns its place. A more in-domain corpus can beat the diverse C4 on a related task: pre-training on Wikipedia plus a books corpus lifts SuperGLUE from 71.36 to 73.24, almost entirely from a reading-comprehension task whose passages come from books. But a single domain is by nature smaller, and a small corpus, repeated enough times during a long pre-training run, starts to hurt: the model begins to memorize rather than generalize, and the training loss gives it away. That is the argument for a corpus like C4, big and varied enough that a long run never has to repeat itself. The authors released it, and it went on to feed a large fraction of the models that came after.

You were given 4× compute. How should you spend it?

Everything so far has held compute roughly fixed. The last lever is to spend more of it, and the paper frames the question as sharply as anyone has: you were just handed four times the compute; what do you do with it? The honest backdrop is Rich Sutton's "bitter lesson", that general methods which soak up more computation tend to win out over hand-engineered ones. The question is which way of soaking it up pays best.

Figure 6 · spending 4× compute
Six ways to spend four times the baseline compute, scored on SuperGLUE or GLUE. Bigger model (teal) beats more steps or a bigger batch (amber), and a 2× model run 2× as long ties a 4× model. The ensemble of four separate models is orthogonal: strong on generation, flat on SuperGLUE. Numbers verbatim from Table 13.

The options are: train the baseline four times as long, train it with a four-times-larger batch, train a 2× model for 2× as long, train a 4× model for the baseline duration, or ensemble four separately trained models. The findings are clean. More training and a bigger model both help, and they are complementary, so a 2× model trained 2× as long (SuperGLUE 77.18) lands right next to a 4× model trained for the baseline duration (78.04). Ensembling is a genuinely orthogonal lever: four fully separate models, averaged, do well on the generation tasks but barely move SuperGLUE.

One framing note, because it is easy to read too much into this section. T5 is reporting an empirical table of tradeoffs, not fitting a law. The clean power-law relationships between loss, model size, and compute (the Kaplan scaling laws, then Chinchilla) came after this paper. T5's contribution here is the practical ranking: when in doubt, make the model bigger, and ensembling buys you something extra on top.

So what does it actually do

The finale stacks the winners. Span corruption with mean span length 3. The big, clean C4 corpus, trained for a full million steps on a large batch, which works out to about one trillion tokens, roughly 32 times the baseline. A pass of multi-task pre-training (fine-tune on a mixture of all the downstream tasks at once) before fine-tuning on each task on its own. Beam search, which keeps several candidate sequences alive while decoding instead of greedily taking the top token, for the open-ended generation tasks. And then the model is scaled across five sizes, from a 60-million-parameter Small up to an 11-billion-parameter giant. (The largest sizes grow mostly by widening the feed-forward network, the fully-connected layer inside each block, to $d_{\text{ff}}=65{,}536$ at 11B, because the accelerators of the day were happiest doing big dense matrix multiplies.)
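The token budget can be checked with a line or two of arithmetic. The step counts, batch sizes, and sequence length below are the settings as recalled from the paper (a baseline of 2^19 steps at batch size 128, and a final run of one million steps at batch size 2^11, both with 512-token sequences); they are not stated in this explainer, so treat them as assumptions of the sketch.

```python
SEQ_LEN = 512                                  # tokens per sequence (assumed setting)

baseline_tokens = 2**19 * 128 * SEQ_LEN        # baseline steps x batch size x length
final_tokens = 1_000_000 * 2**11 * SEQ_LEN     # final-run steps x batch size x length
ratio = final_tokens / baseline_tokens

print(f"baseline: {baseline_tokens:,} tokens")
print(f"final:    {final_tokens:,} tokens")
print(f"ratio:    {ratio:.1f}x")
```

Under those settings the baseline sees about 34 billion tokens and the final run a little over one trillion, a ratio slightly above 30, in line with the "roughly 32 times" quoted above.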

Figure 7 · scale climbs toward human
The same recipe at five sizes, on SuperGLUE. Small to 11B climbs from 63.3 to 88.9, clearing the previous best (84.6) and stopping just short of the human baseline (89.8). Only scale changes between the bars. Numbers verbatim from Table 14.

T5-11B set the state of the art on 18 of the 24 tasks the paper measures: all of GLUE, most of SuperGLUE, SQuAD, summarization, and others, though not, as we are about to see, translation. It scored 90.3 on GLUE and 91.26 on SQuAD, a reading-comprehension test where the model answers questions about a passage, scored by exact match, the fraction of answers that match the reference word for word (human performance is about 82.3, so the headroom may be mostly gone). The most striking number is on SuperGLUE: T5 pushed the state of the art from 84.6 to 88.9, within 0.9 points of the human baseline of 89.8. The biggest single ingredient in getting there was the jump to 11 billion parameters.

The honest gaps matter as much as the wins, and the paper is forthright about them. T5 did not reach the state of the art on any of the WMT translation tasks (the standard machine-translation benchmarks, on language pairs like English to German). Its English-only pre-training and its lack of backtranslation (training on extra synthetic data made by translating target-language text back into the source) left it behind systems built for the job, and no amount of scale closed that gap. The why is mechanical: translation lives on source-to-target alignments, which words map to which, and a model can only absorb those from parallel or back-translated text; an English-only pre-training corpus simply never contains them, so scaling it up buys more English fluency and no more alignment. That is the text-to-text bet showing its edge: a generic string interface gives up the task-specific structure that translation rewards. The framework is a leveller, and levelling cuts both ways.

Step back and the shape of the contribution is unusual. T5 introduced almost no new mechanism. What it introduced was a way to ask the question cleanly, a corpus big enough to answer it, and the discipline to report what did not matter alongside what did. The text-to-text format it standardized became the default framing for the instruction-tuned models that followed; the C4 corpus it released fed a generation of them. The lasting product of a survey, done well, is a map. T5 drew the one the field had been missing, and then walked to the far edge of it to see how far scale alone would go.

Provenance · Verified against primary literature
Vaswani et al. (2017): The encoder-decoder Transformer T5 builds on, almost unchanged.
BERT / MLM (2018): The denoising objective T5 adapts; T5’s "BERT-style" is a simplified version.
RMSNorm (2019): T5’s rescale-only LayerNorm is what this contemporaneous paper named RMSNorm.
Shaw et al. (2018): Relative positions; T5 shrinks the learned vectors to a per-head scalar bias.
Sutton, "Bitter Lesson": The premise the scaling experiments put to the test.
Correction: T5’s "BERT-style" objective is not BERT’s. BERT corrupts 15% of tokens as 80% mask / 10% random / 10% unchanged and predicts only the masked positions at the encoder. T5’s reimplementation uses 90% mask / 10% random and rebuilds the whole sequence; its final objective predicts only the dropped spans. The explainer teaches each recipe separately.

Questions you might still have

Why an encoder-decoder, when decoder-only models later won?
On T5’s own apples-to-apples comparison, the encoder-decoder beat both the decoder-only language model and the prefix LM on every task, and it carries twice the parameters at about the same compute (the encoder only touches the input, the decoder only the output). Decoder-only models won later for other reasons (raw scale, simplicity, in-context learning), not because they were better on this controlled test.

If the objective barely matters, why search for it at all?
The search’s payoff was the negative result. The first stage spans about ten GLUE points (the BERT-style denoiser at 82.96 against deshuffling at 73.17, with prefix language modeling a couple of points back at 80.69), but every denoising variant lands within about a point of every other. Knowing the choice does not matter lets you pick the cheapest one (short targets), and knowing the one lever that does matter (denoise versus not) is itself a finding.

Is "T5" the model or the paper?
Both. The paper is a controlled survey of transfer-learning choices; T5 is the final artifact that bolts the survey’s winners onto scale. The lasting contribution is as much the comparison and the C4 corpus as the weights.

Why did it miss state-of-the-art on translation?
English-only pre-training and no backtranslation. The best WMT systems lean on cross-lingual data and a heavy data-augmentation scheme that the plain text-to-text recipe did not include. Scale alone did not close that gap, which is the paper’s most honest result.

Footnotes & further reading

  1. The paper: Raffel, Shazeer, Roberts, Lee, Narang, Matena, Zhou, Li, Liu, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Google, JMLR 2020). Code and the C4 data set.
  2. The backbone: Vaswani et al., Attention Is All You Need (2017).
  3. The denoising objective T5 adapts: Devlin et al., BERT (2018); the span-masking idea, Joshi et al., SpanBERT (2019), and the masked seq2seq objective, Song et al., MASS (2019).
  4. The rescale-only normalization, named the same week: Zhang and Sennrich, Root Mean Square Layer Normalization (2019).
  5. Relative position representations: Shaw et al., Self-Attention with Relative Position Representations (2018).
  6. The prefix-LM architecture: Dong et al., Unified Language Model Pre-training (UniLM) (2019).
  7. The scaling premise: Rich Sutton, The Bitter Lesson (2019); and the optimizer, Shazeer and Stern, Adafactor (2018).