BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

To read a sentence both ways, hide a word and predict it.

A language model reads left to right, so each word's representation depends only on what came before it. BERT aims for a representation in which every word is informed by its full sentence, left and right at once. An obstacle stood in the way, and getting around it is most of the paper.

Explaining the paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Devlin, Chang, Lee, Toutanova · Google AI · NAACL 2019 · arXiv:1810.04805

One model, pre-trained once on plain text, then fine-tuned for an hour apiece, set records on eleven different language benchmarks at once.

By 2018 one idea had taken over natural language processing, borrowed from how computer vision used ImageNet. Pre-train a big model on a mountain of unlabeled data, then adapt it to your actual task with a small amount of labeled data. The pre-training teaches the model what language is like in general; the fine-tuning specializes it. The expensive part happens once, and everyone downstream gets to start from it.

Two flavors of this recipe were already working. ELMo (a 2018 model that builds contextual word vectors from a deep LSTM) was feature-based: it produced pre-trained word vectors that you fed, as extra inputs, into whatever task-specific architecture you had already built. OpenAI GPT was fine-tuning: it had almost no task-specific parts, and you adapted the entire pre-trained network by continuing to train it on your task. BERT is a fine-tuning method, like GPT, and it does not depart from this recipe. What it breaks is a constraint that both ELMo and GPT had accepted.

BERT is an encoder: it turns text into vectors that a task layer can read, and that is all it does. It is not a generative model. It has no decoder bolted on, it was never trained to write the next word, and you cannot prompt it to continue a sentence.

Pre-train once, fine-tune for anything

The entire system has a simple shape. In phase one, pre-training, BERT reads enormous amounts of unlabeled text and learns from the two self-supervised games described below: guess a hidden word, and guess whether one sentence follows another. No human labels are involved, so the text can be the entire English Wikipedia and a corpus of books. In phase two, fine-tuning, you take that same network, keep its weights, and bolt one output layer on top for your actual task. Then you train briefly on your labeled data and you are done.

The striking part is how little changes between tasks. The body, the deep stack of Transformer layers, is identical for sentiment classification, for entailment, for named-entity tagging, for question answering. Only the input framing and the small head on top differ. Figure 1 shows the encoder staying put while the head swaps:

Figure 1 · the recipe

Pre-training on unlabeled text: predict masked words (MLM) and whether sentence B follows A (NSP).

One BERT encoder, pre-trained once with a masked-word head and a next-sentence head. For any downstream task you keep the same body and attach a single thin head: the [CLS] output for classification, or every token's output for tagging and span answers. The weights are shared; the head costs almost nothing.

The result was eleven language tasks at a new state of the art, each obtained by adding essentially one layer to the same pre-trained model and fine-tuning for about an hour. Whether that is even possible depends on the quality of the pre-trained representation, which rests on one design decision.

Why you cannot just read both ways

A standard language model is unidirectional. It predicts each word from the words before it, left to right, so the representation it builds for a given position encodes only that position's left context. GPT works exactly this way. So does each half of ELMo; the backward half is the same thing flipped to read right to left. This is fine for generating text, where you genuinely only have the past to go on. It is a real handicap for understanding text, where the word after often settles the meaning of the word before. To tag "bank" as a riverbank or a financial institution, you want to see what comes next.

So why not train a deep model that conditions on both sides at every layer? The paper's answer is one sentence: bidirectional conditioning would let each word indirectly see itself, so the model could trivially predict the target word in a multi-layered context. The key is that last phrase: the problem is not bidirectionality in the abstract, it is bidirectionality combined with stacked layers.

The leak takes a two-step path. Suppose you want to predict the word at some position from the representation your network builds above that position. If the network attends in both directions, then at the first layer the word to the left of the target builds a representation that has looked at the target and absorbed it. At the second layer, the target's own representation looks back at that left neighbor and gets the answer handed back. The information took a detour through a neighbor and came home. You did not even have to let the target attend to itself directly; the stack routes the answer around any such block. Counting the hops shows why this is a stacking problem: the detour needs one layer for a neighbor to absorb the target and a second layer to hand it back. A single bidirectional layer that blocks the target from reading itself has no path to cheat with, and the leak only opens once layers stack. With the leak open, the model stops learning language and starts copying. The amber path in Figure 2 traces how information moves between positions in each mode:

Figure 2 · the see-itself leak

Predict the word at the middle position from the output built above it. In bidirectional mode the answer routes from the input up through a neighbor and back to its own output, so prediction is trivial. Left-only cuts the right side and breaks the leak, but the target now never sees what follows it. Masked deletes the word from the input entirely: no leak, and both sides are still visible.
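The hop counting above can be checked mechanically. What follows is a minimal sketch, not anything from the paper: it treats each layer as a rule for which positions may read which, ignores residual connections, and assumes a position is never allowed to read itself directly. The names `compose`, `reachable`, `bidirectional`, and `left_only` are illustrative.

```python
def compose(step, reach):
    """One more layer: position i can see input j if it reads some k that already sees j."""
    n = len(step)
    return [[any(step[i][k] and reach[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def reachable(allowed, layers):
    """reach[i][j] is True if the output at position i after `layers` layers
    can depend on the input token at position j."""
    reach = allowed
    for _ in range(layers - 1):
        reach = compose(allowed, reach)
    return reach


n, target = 5, 2
# Assumption: no position may read itself directly.
bidirectional = [[i != j for j in range(n)] for i in range(n)]  # read every other position
left_only = [[j < i for j in range(n)] for i in range(n)]       # read strictly earlier positions

leak_one_layer = reachable(bidirectional, 1)[target][target]
leak_two_layers = reachable(bidirectional, 2)[target][target]
leak_left_only = any(reachable(left_only, k)[target][target] for k in range(1, 6))

print(leak_one_layer, leak_two_layers, leak_left_only)  # False True False
```

Under these assumptions the bidirectional stack has no route from the target's input to its own output with one layer, gains one with two layers, and the left-only stack never has one at any depth.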

That is why prior pre-training methods were unidirectional. Going left-to-right makes the leak impossible, because a word's representation never has access to the word it is about to predict. ELMo incorporated both directions by training a separate left-to-right model and a separate right-to-left model and gluing their outputs together at the end. That is real, but it is shallow. The two halves are never combined during pre-training; the model never builds a single representation that jointly weighs both sides. ("Shallow" here is about that top-level concatenation, not about network depth. ELMo's language models are perfectly deep LSTMs (stacked recurrent networks).)

The third mode in the figure avoids the leak. If you remove the target word from the input and replace it with a placeholder, there is no answer anywhere in the input for the stack to route home. With the word gone, nothing stops you from letting every other position attend in both directions. You get deep, jointly bidirectional context, and the prediction is valid because the thing being predicted is absent from the input.

Masked language modeling

BERT's first pre-training task is the masked language model (MLM). It is an old idea from psycholinguistics called the Cloze task (Taylor, 1953): delete some words from a passage and ask a reader to fill the blanks from context. BERT does exactly that. It picks 15% of the tokens in a sequence at random, hides them, and trains the network to predict the originals. Because the hidden word is gone from the input, the network is free to read the entire rest of the sequence, both sides, at every layer.

The prediction itself is a plain classification over the vocabulary. BERT works in WordPiece tokens, a fixed set of 30,000 subword pieces rather than whole words, so that a rare word like "unfreeze" splits into known parts ("un", "##freeze") and nothing is ever fully out of vocabulary. The final hidden vector at a masked position is pushed through a softmax over those 30,000 pieces (a softmax turns the raw scores into probabilities that sum to one), and the model is trained with ordinary cross-entropy against the piece that was actually there. Written out over the set $\mathcal{M}$ of masked positions, with $\tilde{\mathbf{x}}$ the corrupted sequence the network actually sees:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log p_{\theta}\!\left(x_i \mid \tilde{\mathbf{x}}\right) \tag{1}$$
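Equation (1) can be sketched numerically. The snippet below is an illustration under stated assumptions, not the paper's code: the logits are made up, the vocabulary has 5 entries instead of 30,000, and the names `log_softmax` and `mlm_loss` are hypothetical helpers.

```python
import math


def log_softmax(scores):
    """Log-probabilities from raw scores, computed stably."""
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_norm for s in scores]


def mlm_loss(logits, original_ids, masked_positions):
    """Sum of -log p(x_i | corrupted input) over the masked positions only."""
    return -sum(log_softmax(logits[i])[original_ids[i]] for i in masked_positions)


vocab_size = 5
original_ids = [0, 3, 1, 4]  # the tokens that were really there

# A model with no idea: every vocabulary entry gets the same score.
uniform_logits = [[0.0] * vocab_size for _ in original_ids]
uniform_loss = mlm_loss(uniform_logits, original_ids, masked_positions=[1, 2])

# A model that is confident and right at position 1.
confident_logits = [row[:] for row in uniform_logits]
confident_logits[1][3] = 10.0
confident_loss = mlm_loss(confident_logits, original_ids, masked_positions=[1])
```

With uniform scores each masked position contributes log of the vocabulary size, so the loss over two positions is 2·log 5; a confident correct prediction drives its term toward zero. Positions outside the masked set contribute nothing.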

The 30,000-piece vocabulary makes that softmax a fixed, finite classification rather than a guess over an open-ended dictionary. Common words are a single piece each, so "the" and "cat" map to one token. Rare or made-up words split into known subword pieces. As an illustration (the exact split depends on the learned vocabulary), "unfreeze" might become "un" and "##freeze", and "embeddings" might become "embed", "##ding", "##s", where the "##" marks a piece that continues the previous one rather than starting a new word. Because essentially any string can be spelled out of these 30,000 pieces, out-of-vocabulary words all but disappear, and every token BERT sees is one of a known, fixed set. As a result, the masked position can be predicted by a plain softmax over exactly 30,000 choices: hide a piece, and filling the blank is choosing the right entry from that fixed list, scored with cross-entropy. A model facing a truly open vocabulary could not write that loss down at all.
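The splitting described above can be sketched with greedy longest-match-first lookup, which is how WordPiece tokenizers are commonly applied at inference time. The seven-entry vocabulary here is hypothetical, chosen only to reproduce the illustrative splits in the text; a real vocabulary is learned from data and may split these words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest known pieces, left to right.
    Pieces after the first carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known piece fits: fall back to an unknown marker
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, not BERT's real one.
toy_vocab = {"the", "cat", "un", "##freeze", "embed", "##ding", "##s"}

print(wordpiece("cat", toy_vocab))         # ['cat']
print(wordpiece("unfreeze", toy_vocab))    # ['un', '##freeze']
print(wordpiece("embeddings", toy_vocab))  # ['embed', '##ding', '##s']
```

Whatever the split, the output is always a sequence drawn from the fixed vocabulary, which is what lets the masked-word prediction be a classification over a closed set.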

Figure 3 steps the mask along a sentence. The context arcs feeding each prediction arrive from both sides, which is the context BERT can use. A left-to-right model predicting "mat" in "the cat sat on the ___" only has the left half; BERT also gets to use anything that comes after.

Figure 3 · masked prediction

One word is hidden as [MASK]. Context arcs converge on it from every other token, left and right, and the model emits a softmax over the vocabulary with the true word tallest. Step to mask a different word. The whole pre-training signal is this: fill the blank from the surrounding sentence.

It is tempting to read this as a denoising autoencoder, a network that corrupts its input and then rebuilds all of it from the corrupted version. The paper draws that contrast explicitly: BERT predicts only the masked positions, never the entire sequence. Reconstructing just the blanks is less work, and it is enough: 15% of the tokens per pass, over a 3.3-billion-word corpus and a million steps, is plenty of supervision.

There is a cost to this trick, and the paper is upfront about it. A left-to-right model gets a learning signal at every position of every sentence, because every position predicts the next token. BERT only learns from the 15% it masked, so it needs to see more text to converge.

The 80/10/10 compromise

If you have been paying attention you have spotted a new problem. The masked language model leans on a special [MASK] token. But that token only exists during pre-training. When you fine-tune on a real task, there are no blanks; the model sees ordinary, complete sentences. So the model has spent all of pre-training learning to behave a certain way around a symbol it will never encounter again. That is a mismatch between pre-training and fine-tuning, and it is the paper's stated reason for the next move. Concretely, a model trained on [MASK] alone is free to learn the crutch that the symbol itself marks the prediction site; fine-tuning never shows it that symbol, so the crutch breaks exactly when the representation needs to work.

BERT's response is to not always use [MASK]. When a token is chosen for prediction, BERT replaces it with [MASK] only 80% of the time. 10% of the time it swaps in a random token from the vocabulary, and 10% of the time it leaves the token untouched. In all three cases the model still has to predict the original. Figure 4 samples these fates and tallies the split:

Figure 4 · the corruption split

Of the 15% of tokens chosen for prediction, 80% become [MASK], 10% become a random word, and 10% are left unchanged. All three are still predicted. The running counter beneath the boxes approaches 80 / 10 / 10. Because of the random and unchanged cases, the model can never be sure a given token was not chosen.

Here is the precise arithmetic. 15% of all tokens are chosen. Of those, 80% get the mask, which is 12% of all tokens. The random and unchanged cases are 1.5% of all tokens each. The masking rate is therefore 12%, while the 15% figure is the fraction chosen for prediction.
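The same arithmetic, written out with exact fractions so no rounding creeps in. This is only a check of the numbers above; the variable names are illustrative.

```python
from fractions import Fraction

chosen = Fraction(15, 100)  # share of all tokens chosen for prediction
fates = {
    "[MASK]": Fraction(80, 100),
    "random": Fraction(10, 100),
    "unchanged": Fraction(10, 100),
}

# Share of ALL tokens that meets each fate.
share_of_all_tokens = {name: chosen * p for name, p in fates.items()}

for name, share in share_of_all_tokens.items():
    print(f"{name}: {float(share):.1%} of all tokens")
```

The three shares add back up to the 15% chosen for prediction, and only the first of them is the true masking rate.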

The paper cites the mismatch above as its reason: by sometimes showing the model a real or random word in a position it must predict, you stop it from assuming that a prediction site always carries the [MASK] symbol, which softens the gap with fine-tuning. A second reason, given in the paper's appendix and widely cited, follows naturally. Because 1 in 10 chosen tokens looks completely normal and 1 in 10 is an ordinary-looking wrong word, the model cannot tell by looking which tokens it will be graded on, so it has to keep a reliable, contextual representation of every token, not only the blanked ones. That habit, treating every position as potentially the one that matters, is exactly the part fine-tuning inherits, since downstream tasks read representations of ordinary tokens with no blanks anywhere. The paper's own ablations note that this masking recipe matters more when BERT is used as a feature extractor (like ELMo) than when fine-tuning the entire model, where BERT is fairly robust to the exact percentages.

The three fates show up together in one example. The sentence "the cat sat on the mat" has six tokens. BERT chooses 15% of the tokens at random to predict; on this short sentence that rounds to one token, say it lands on "sat". Now a single die roll decides what the network actually sees in that slot. With probability 80% the slot becomes "the cat [MASK] on the mat". With probability 10% it becomes a random word, "the cat apple on the mat". With the remaining 10% it is left alone, "the cat sat on the mat", unchanged. In all three the training target is the same, the original token "sat", and in all three the loss is the cross-entropy between the model's softmax at that position and "sat". The 80/10/10 only changes the input the network reads; it never changes the answer it is graded against. Note the rates are of the chosen 15%, not of the full sentence: across a real corpus, 12% of all tokens get [MASK], 1.5% get a random word, and 1.5% are left as they were.

The input, and a second task

A lot of the tasks people care about are about pairs of sentences. Does this hypothesis follow from that premise? Is this question answered by that passage? A model that only ever saw single sentences would have no notion of how two of them relate. So BERT's input is built to hold a pair, and it gets a second pre-training task aimed squarely at the relationship between them.

Every input starts with a special [CLS] token. Two segments are packed into one sequence with a [SEP] token between them and after the last. To tell the two segments apart, BERT adds a learned segment embedding to every token marking it as sentence A or sentence B. The segment embedding makes the pair structure legible everywhere: the [SEP] token only marks the boundary at one position, while the segment vector tags every token directly, so an attention head never has to infer which half of the pair it is reading. BERT also adds a learned position embedding (learned from data, rather than the fixed sine-wave position codes the original Transformer hard-wired in). The vector that actually enters the network at each position is the sum of three learned embeddings:

$$\mathbf{e}_i = \mathbf{e}^{\text{token}}_i + \mathbf{e}^{\text{segment}}_i + \mathbf{e}^{\text{position}}_i \tag{2}$$
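Equation (2) and the packing rule can be sketched together. This is a toy under stated assumptions: the embedding tables are random rather than learned, the hidden size is 4 rather than 768, and the segment assignment ([CLS] and the first [SEP] tagged as A, the final [SEP] as B) is an assumed convention that the text above does not spell out. `pack_pair`, `random_table`, and `embed` are hypothetical helper names.

```python
import random


def pack_pair(tokens_a, tokens_b):
    """Build [CLS] A [SEP] B [SEP] plus a segment id (0 = A, 1 = B) per token."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segments = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segments


def random_table(rows, hidden, rng):
    """Stand-in for a learned embedding table."""
    return [[rng.gauss(0.0, 1.0) for _ in range(hidden)] for _ in range(rows)]


def embed(token_ids, segment_ids, token_table, segment_table, position_table):
    """Per position: token embedding + segment embedding + position embedding."""
    return [
        [t + s + p for t, s, p in zip(token_table[tok], segment_table[seg], position_table[pos])]
        for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids))
    ]


rng = random.Random(0)
hidden = 4
tokens, segments = pack_pair(["the", "cat", "sat"], ["on", "the", "mat"])
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[t] for t in tokens]

token_table = random_table(len(vocab), hidden, rng)
segment_table = random_table(2, hidden, rng)
position_table = random_table(len(tokens), hidden, rng)
vectors = embed(token_ids, segments, token_table, segment_table, position_table)
```

The word "the" appears once in each segment. Both occurrences share a token embedding, but their input vectors differ because the segment and position terms differ.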

The second task is next sentence prediction (NSP). For half the training pairs, sentence B really is the sentence that followed A in the source document (labeled IsNext). For the other half, B is a random sentence pulled from somewhere else in the corpus (labeled NotNext). The final hidden vector of the [CLS] token, written $\mathbf{C}$, is fed to a small classifier that has to make the call. Figure 5 shows the packed input with both kinds of sentence B:

Figure 5 · input and next-sentence prediction

The packed input [CLS] A [SEP] B [SEP]. Each token's vector is the sum of a token, a segment (A teal, B amber), and a position embedding. The entire sequence runs through BERT; the [CLS] output decides whether B truly follows A. Swap B for a random sentence and the verdict flips.
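Generating NSP examples needs nothing but plain text. The sketch below is one possible way to do it, with assumptions labeled: the three toy documents are invented, every document is assumed to have at least two sentences, and the negative sentence is drawn from a different document so it can never be the true continuation. `make_nsp_pair` is a hypothetical helper, not the paper's data pipeline.

```python
import random


def make_nsp_pair(documents, rng):
    """Return (sentence_a, sentence_b, label) with a 50/50 IsNext / NotNext split."""
    doc = rng.choice(documents)
    i = rng.randrange(len(doc) - 1)  # any sentence except the last
    sentence_a = doc[i]
    if rng.random() < 0.5:
        return sentence_a, doc[i + 1], "IsNext"
    # Assumption: pick the negative from a different document.
    other = rng.choice([d for d in documents if d is not doc])
    return sentence_a, rng.choice(other), "NotNext"


# Invented toy corpus: each document is a list of sentences.
documents = [
    ["the cat sat on the mat", "then it fell asleep", "it woke at noon"],
    ["prices rose in march", "analysts blamed energy costs"],
    ["the river flooded", "the bank gave way", "the town was evacuated"],
]

rng = random.Random(0)
pairs = [make_nsp_pair(documents, rng) for _ in range(2000)]
is_next_share = sum(label == "IsNext" for _, _, label in pairs) / len(pairs)
```

No human labeling is involved: the label falls out of where sentence B was taken from.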

NSP is easy to generate from any plain text and the final model reaches 97 to 98% accuracy on it. The paper itself flags a caveat here: $\mathbf{C}$ is only a useful summary of the pair after fine-tuning. Straight out of pre-training it is not a general-purpose sentence embedding, so do not reach for it as one.

NSP is also the part of BERT that aged the least well. BERT's own ablation argued NSP helped, and given BERT's training budget and data pipeline it did. Later work, most directly RoBERTa (2019), found that you can drop next-sentence prediction entirely, train the masked language model longer on full documents, and do just as well or better, a finding from after the fact, not something the BERT paper claims. Both can be true: NSP helped inside BERT's recipe and was dispensable once people trained harder.

Encoder, not decoder

All of this rides on the Transformer from Attention Is All You Need. BERT is its encoder stack, nothing exotic. The single knob that separates BERT from GPT is which entries of the attention matrix are allowed to be nonzero. Attention is an $n \times n$ grid: row $i$ is the token doing the looking, column $j$ is a token it might look at. GPT, a decoder, masks out everything above the diagonal so each token sees only itself and its past. BERT, an encoder, leaves the grid open so each token sees the entire sequence. Figure 6 shows both settings:

Figure 6 · the attention grid

With the mask off, every cell is live: this is BERT's encoder, where each position attends to all positions in both directions. With the mask on, the future is blacked out and only a lower triangle survives: that is GPT's decoder.
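The difference between the two grids fits in one flag. This is a minimal sketch with made-up attention scores; `attention_weights` is an illustrative helper, and a real model would compute the scores from query and key vectors.

```python
import math


def attention_weights(scores, causal):
    """Row-wise softmax over an n x n score grid.
    With causal=True, row i may only look at columns j <= i."""
    n = len(scores)
    weights = []
    for i in range(n):
        visible = [j for j in range(n) if not causal or j <= i]
        top = max(scores[i][j] for j in visible)
        exps = {j: math.exp(scores[i][j] - top) for j in visible}
        total = sum(exps.values())
        weights.append([exps.get(j, 0.0) / total for j in range(n)])
    return weights


# Made-up scores for a 5-token sequence.
scores = [[0.1 * (i + 1) * (j - 2) for j in range(5)] for i in range(5)]

encoder = attention_weights(scores, causal=False)  # BERT-style: every cell is live
decoder = attention_weights(scores, causal=True)   # GPT-style: lower triangle only
```

In the encoder grid every entry is positive; in the decoder grid everything above the diagonal is exactly zero, so the first token can attend only to itself.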

BERT comes in two sizes. BERT_BASE has $L{=}12$ layers, hidden size $H{=}768$, $A{=}12$ attention heads, and 110M parameters. It was sized deliberately to match OpenAI GPT, so the two could be compared with the architecture held fixed and only the unidirectional-versus-bidirectional choice varying. BERT_LARGE goes to $L{=}24$, $H{=}1024$, $A{=}16$, and 340M parameters. (The feed-forward inner width is $4H$ in both, the usual Transformer ratio.) Pre-training ran on the BooksCorpus, 800M words, plus English Wikipedia, 2,500M words, for a total of 3.3 billion words. It took a million steps of the Adam optimizer at batches of 128,000 tokens. On the hardware of the day that was four days on 16 Cloud TPUs (64 TPU chips) for the large model.
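The quoted parameter counts can be roughly reconstructed from $L$ and $H$. The sketch below is a back-of-the-envelope estimate, not the paper's accounting, and several details are assumptions: a maximum of 512 positions, two segments, biases on every linear layer, two layer norms per block plus one after the embeddings, and one extra $H \times H$ layer on the [CLS] output. The function name `bert_parameter_count` is illustrative.

```python
def bert_parameter_count(layers, hidden, vocab_size=30000, max_positions=512, segments=2):
    """Approximate parameter count for a BERT-style encoder (assumptions in the lead-in)."""
    layer_norm = 2 * hidden  # scale and shift
    embeddings = (vocab_size + max_positions + segments) * hidden + layer_norm

    attention = 4 * (hidden * hidden + hidden)  # query, key, value, output projections
    inner = 4 * hidden                          # feed-forward inner width is 4H
    feed_forward = (hidden * inner + inner) + (inner * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm

    pooler = hidden * hidden + hidden           # assumed extra layer on the [CLS] output
    return embeddings + layers * per_layer + pooler


base = bert_parameter_count(layers=12, hidden=768)
large = bert_parameter_count(layers=24, hidden=1024)
print(f"BERT_BASE  ~ {base / 1e6:.0f}M parameters")
print(f"BERT_LARGE ~ {large / 1e6:.0f}M parameters")
```

Under these assumptions the estimates land close to the quoted 110M and 340M. The number of attention heads does not enter the count, because the heads split the same $H \times H$ projections rather than adding to them.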

One model, many tasks

Now the recipe from Figure 1 pays off. Whatever the task, you keep the pre-trained body and add a lightweight head, and the head is usually a single matrix. For sentence or sentence-pair classification, you read the [CLS] vector $\mathbf{C}$ into a softmax over labels. For sequence tagging, you read each token's output $\mathbf{T}_i$ into a per-token label. The input is reshaped to fit: a single sentence pairs with an empty second segment; a sentence pair uses both segments; a question and a passage become segments A and B.
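To make "the head is usually a single matrix" concrete, here is a minimal sketch with toy sizes. The weights and vectors are invented numbers, the hidden size is 4 instead of 768, and `softmax` and `head` are illustrative helpers rather than any library's API.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def head(vector, weight):
    """Apply one K x H matrix to an H-dimensional output, then softmax over K labels."""
    return softmax([sum(w * x for w, x in zip(row, vector)) for row in weight])


hidden, num_labels = 4, 3
# Invented head weights: K rows of H numbers each.
weight = [[0.1 * (k + 1) * (h - 1) for h in range(hidden)] for k in range(num_labels)]

cls_vector = [0.5, -1.0, 2.0, 0.0]                                # stands in for C
token_vectors = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]      # stand in for T_i

sentence_probs = head(cls_vector, weight)            # classification: one label per input
tag_probs = [head(t, weight) for t in token_vectors]  # tagging: one label per token
new_parameters = num_labels * hidden
```

The same function serves both cases; only what it is applied to changes. The head adds $K \times H$ new weights, which is tiny next to the pre-trained body.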

Question answering needs the smallest head. On SQuAD, the model is given a question and a passage and must point at the span of the passage that answers it. BERT introduces exactly two new vectors at fine-tuning, a start vector $\mathbf{S}$ and an end vector $\mathbf{E}$, both living in the same $H$-dimensional space as the token outputs. The probability that token $i$ begins the answer is a softmax of dot products across the passage:

$$P^{\text{start}}_i = \frac{e^{\mathbf{S}\cdot\mathbf{T}_i}}{\sum_{j} e^{\mathbf{S}\cdot\mathbf{T}_j}} \tag{3}$$

with the identical formula for the end, using $\mathbf{E}$. A candidate answer running from $i$ to $j$ scores $\mathbf{S}\cdot\mathbf{T}_i + \mathbf{E}\cdot\mathbf{T}_j$, and the highest-scoring span with $j \ge i$ is the prediction. The $j \ge i$ constraint keeps the end from landing before the start, which two independent softmaxes would not guarantee on their own. The entire question-answering architecture is those two vectors on top of the pre-trained model.

# SQuAD: learn one start vector S and one end vector E
T = bert(question, passage)          # T[i] = final hidden of token i
p_start = softmax(T @ S)             # P(token i starts the answer)
p_end   = softmax(T @ E)             # P(token i ends the answer)
# pick the span that maximizes S.T[i] + E.T[j] with j >= i
i, j = best_span(T @ S, T @ E)
answer = tokens[i : j + 1]
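The `best_span` step in the pseudocode above is left undefined there. One straightforward way to fill it in, as a sketch rather than the paper's implementation, is an exhaustive search over valid pairs. The scores below are invented to show a case where the two independent argmaxes would give an invalid span.

```python
def best_span(start_scores, end_scores):
    """Return (i, j) with j >= i maximizing start_scores[i] + end_scores[j]."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, len(end_scores)):
            score = s + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


# Invented per-token scores, standing in for S.T[i] and E.T[j].
start_scores = [0.1, 2.0, 0.3, 0.0]
end_scores = [3.0, 0.2, 1.5, 0.1]

naive = (start_scores.index(max(start_scores)), end_scores.index(max(end_scores)))
span = best_span(start_scores, end_scores)
print(naive)  # (1, 0): the end lands before the start
print(span)   # (1, 2): the best span that respects j >= i
```

The double loop is quadratic in passage length. That is fine for a sketch; a practical version would typically also cap the span length.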

And here, for symmetry, is the corruption that builds a masked-LM example, the 80/10/10 from Figure 4 in code:

# build one masked-LM example from a tokenized sequence
ids    = tokenize(text)              # WordPiece ids
chosen = sample(positions(ids), rate=0.15)  # 15% to predict
labels = [IGNORE] * len(ids)         # IGNORE = "not a target"
for i in chosen:
    labels[i] = ids[i]               # the original id is the answer
    r = random()
    if   r < 0.80: ids[i] = MASK_ID  # 80%: replace with [MASK]
    elif r < 0.90: ids[i] = rand_id()  # 10%: a random token
    # else 10%: leave ids[i] as is (still predicted)
loss = cross_entropy(model(ids), labels)  # only the chosen count
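A runnable version of the corruption step, leaving out the model and the loss. The constants are assumptions for illustration: `MASK_ID = 103` and `IGNORE = -100` are placeholder values, the random replacement is drawn uniformly from 30,000 ids, and at least one token is always chosen so short sequences still produce a target. `mask_for_mlm` is a hypothetical helper name.

```python
import random

MASK_ID = 103       # assumed id for [MASK]
IGNORE = -100       # assumed "not a target" marker
VOCAB_SIZE = 30000


def mask_for_mlm(ids, rng, rate=0.15):
    """Return (corrupted_ids, labels). Labels hold the original id at chosen
    positions and IGNORE everywhere else."""
    corrupted = list(ids)
    labels = [IGNORE] * len(ids)
    n_chosen = max(1, round(rate * len(ids)))
    for i in rng.sample(range(len(ids)), n_chosen):
        labels[i] = ids[i]  # the original id is always the answer
        roll = rng.random()
        if roll < 0.80:
            corrupted[i] = MASK_ID                     # 80%: replace with [MASK]
        elif roll < 0.90:
            corrupted[i] = rng.randrange(VOCAB_SIZE)   # 10%: a random token
        # else 10%: leave the token as is (still predicted)
    return corrupted, labels


original = list(range(1000, 21000))  # 20,000 stand-in token ids
corrupted, labels = mask_for_mlm(original, random.Random(0))
chosen = [i for i, label in enumerate(labels) if label != IGNORE]
masked = sum(corrupted[i] == MASK_ID for i in chosen)
print(len(chosen) / len(original), masked / len(original))
```

On a long sequence the chosen share is 15% and the masked share comes out near 12%, matching the arithmetic from the 80/10/10 section. The function returns a copy, so the original ids stay available as ground truth.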

Fine-tuning is cheap. Every result in the paper can be reproduced in at most an hour on a single Cloud TPU, starting from the one shared pre-trained model. That asymmetry motivates the entire approach: pay once, up front, for the representation; pay almost nothing per task after that.

Eleven benchmarks, and the ablation that explains them

BERT set a new state of the art on eleven NLP tasks. On the GLUE benchmark, a suite of nine language-understanding tasks, it pushed the leaderboard score to 80.5, up 7.7 points from OpenAI GPT's 72.8. On SQuAD v1.1 its best configuration reached 93.2 test F1 (F1 is the standard precision/recall accuracy score for these tasks; that number is an ensemble with extra question-answering data, while a single fine-tuned model lands around 91 dev F1). On the harder SQuAD v2.0, which lets a question have no answer, it took test F1 to 83.1, a 5.1-point jump. On SWAG, a commonsense sentence-completion task, BERT_LARGE hit 86.3, past OpenAI GPT by 8.3 points and within a hair of the 88.0 that five human annotators reach together (and above the 85.0 of a single expert).

The numbers need an explanation, and the ablation gives one, by isolating which choice did the work. The ablation holds the same BERT_BASE model, the same data, and the same fine-tuning, varying only the pre-training objective. Remove next-sentence prediction. Then remove bidirectionality too, training a plain left-to-right model. Then bolt a BiLSTM (a bidirectional recurrent layer that reads the sequence both ways) back on top to claw some right context back. Figure 7 shows what happens, and on which tasks:

Figure 7 · the bidirectionality ablation

LTR, No NSP · A plain left-to-right LM. Bidirectionality is gone; every token sees only its left.

Same model size, same data; only the pre-training objective changes (Table 5, Dev set). Dropping bidirectionality (the left-only bars) barely dents sentiment, where left context is enough, but collapses SQuAD and MRPC (a sentence-pair paraphrase task), where a token genuinely needs to see what follows it. Switch tasks to feel where the right context matters.

The figure shows a clear pattern. On SST-2, sentiment, going left-only costs almost nothing, because the polarity of a review is usually clear from a left-to-right read. On SQuAD it is a catastrophe: the left-only model falls from 88.5 to 77.8 F1, because answering a question about a passage means weighing words that come after the candidate answer as much as before it. The right context was not a luxury; for the hard tasks it was most of the signal. That gap, and the fact that it shows up exactly where you would predict, is the paper's evidence that deep bidirectionality, not the incidental details, made BERT work.

BERT also scaled cleanly with size, including on small datasets where bigger pre-trained models had been expected to overfit, and it worked as a feature extractor too: freezing BERT and feeding a concatenation of its top four layers into a task model landed within 0.3 F1 of full fine-tuning on named-entity recognition. The representation was good enough that you could use it either way.

Under all the engineering, BERT is two sentences. Reading both directions makes a better representation than reading one. You cannot train that directly, because the answer leaks back through the layers, so you hide the word and predict it from what is left. The 80/10/10 split, the next-sentence task, the packed input, the task heads, all of it serves those two claims. The model that fell out of them reset the state of the art across the field and turned "pre-train a big encoder, fine-tune for your task" into the default way to do NLP.

Provenance: verified against primary literature

Transformer (2017): BERT is the encoder stack with full, non-causal self-attention.

ELMo (2018): Feature-based and only shallowly bidirectional: two LMs concatenated at the top.

OpenAI GPT (2018): Fine-tuning, but left-to-right. BERT_BASE matches its size on purpose.

Cloze task (Taylor, 1953): The fill-the-blank objective the masked LM revives.

RoBERTa (2019): Later showed next-sentence prediction is not actually needed.

Correction: It is widely said BERT replaces 15% of tokens with [MASK]. It does not: 15% are chosen for prediction, and only 80% of those (12% of all tokens) become [MASK]; 10% become a random word and 10% are left unchanged. We teach the exact split and why it exists.

Questions you might still have

Is BERT a chatbot? Does it generate text?
No. BERT is an encoder: it turns text into vectors. It has no decoder, was never trained to write the next word, and cannot be prompted to continue a sentence. It predicts hidden words and feeds a lightweight task head.

Why can’t you just train a deep model to read both directions?
In a multi-layer bidirectional model, the word you are trying to predict leaks back to its own output through neighboring positions across the stack, so the model copies it instead of learning. Masking the word removes the answer from the input, which makes valid bidirectional prediction possible.

Why 80/10/10 instead of always using [MASK]?
The [MASK] token never appears at fine-tuning, so always masking would create a pre-train/fine-tune mismatch. Replacing 10% with a random word and leaving 10% unchanged softens that gap and, as a useful side effect, forces the model to keep a good representation of every token, not just the blanked ones.

Does next-sentence prediction actually help?
BERT’s own ablation said yes. Later work (RoBERTa, 2019) found you can drop NSP, train the masked LM longer on full documents, and match or beat BERT. So NSP helped within BERT’s budget but proved dispensable.

Can I use the [CLS] vector as a sentence embedding?
Not off the shelf. The paper warns that the [CLS] output is not a meaningful sentence representation without fine-tuning. Sentence-BERT (2019) was built later precisely to produce good sentence embeddings.

Footnotes & further reading

The paper: Devlin, Chang, Lee, Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Google AI, NAACL 2019). Code and models.
The architecture is the encoder from Vaswani et al., Attention Is All You Need.
The two pre-training predecessors BERT positions itself against: Peters et al., Deep contextualized word representations (ELMo), and Radford et al., Improving Language Understanding by Generative Pre-Training (GPT).
The masked-word objective revives the Cloze task: Taylor, "Cloze procedure": a new tool for measuring readability (1953). The denoising-autoencoder contrast is Vincent et al. (2008); subword tokenization is WordPiece (Wu et al., 2016); the activation is GELU (Hendrycks & Gimpel, 2016).
On next-sentence prediction being dispensable, and on [CLS] not being an off-the-shelf sentence vector: Liu et al., RoBERTa (2019), and Reimers & Gurevych, Sentence-BERT (2019).
The evaluation suites: GLUE (Wang et al., 2018), SQuAD (Rajpurkar et al., 2016), and SWAG (Zellers et al., 2018).
