BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
To read a sentence both ways, hide a word and predict it.
A language model reads left to right, so each word only knows what came before it. BERT wanted every word informed by its whole sentence, left and right at once. The obstacle, and the one trick that gets around it, is the whole story.
What if the same model, pre-trained once on plain text, could be fine-tuned to top eleven different language benchmarks?
By 2018 one idea had taken over natural language processing, borrowed from how computer vision used ImageNet. Pre-train a big model on a mountain of cheap, unlabeled data, then adapt it to your actual task with a small amount of labeled data. The pre-training teaches the model what language is like in general; the fine-tuning specializes it. The expensive part happens once, and everyone downstream gets to start from it.
Two flavors of this recipe were already working. ELMo was feature-based: it produced pre-trained word vectors that you fed, as extra inputs, into whatever task-specific architecture you had already built. OpenAI GPT was fine-tuning: it had almost no task-specific parts, and you adapted the whole pre-trained network by continuing to train it on your task. BERT is a fine-tuning method, like GPT. Its quarrel is not with the recipe. It is with a constraint both ELMo and GPT accepted, and that constraint is where we have to start.
BERT is an encoder, and the label matters because it gets muddled: it turns text into vectors that a thin task layer can read, and that is all it does. It is not a generative model. It has no decoder bolted on, it was never trained to write the next word, and you cannot prompt it to continue a sentence. Keep that straight and the rest follows.
Pre-train once, fine-tune for anything
The whole system has a simple shape. In phase one, pre-training, BERT reads enormous amounts of unlabeled text and learns from two self-supervised games we build up below: guess a hidden word, and guess whether one sentence follows another. No human labels are involved, so the text can be the entire English Wikipedia and a corpus of books. In phase two, fine-tuning, you take that same network, keep its weights, and bolt one thin output layer on top for your actual task. Then you train briefly on your labeled data and you are done.
The striking part is how little changes between tasks. The body, the deep stack of Transformer layers, is identical for sentiment classification, for entailment, for named-entity tagging, for question answering. Only the input framing and the small head on top differ. Switch the task below and watch the encoder stay put while the head swaps:
Figure: pre-training on unlabeled text, predicting masked words (MLM) and whether sentence B follows A (NSP).
The result was eleven language tasks at a new state of the art, each obtained by adding essentially one layer to the same pre-trained model and fine-tuning for about an hour. Whether that is even possible comes down to how good the pre-trained representation is, and that comes down to one design decision.
Why you cannot just read both ways
A standard language model is unidirectional. It predicts each word from the words before it, left to right, so the representation it builds for a given position only knows about that position's left context. GPT works exactly this way. So does the backward half of ELMo, just flipped. This is fine for generating text, where you genuinely only have the past to go on. It is a real handicap for understanding text, where the word after often settles the meaning of the word before. To tag bank as a riverbank or a financial institution, you want to see what comes next.
So why not train a deep model that conditions on both sides at every layer? The paper's answer is one sentence, and it is the crux of the whole design: bidirectional conditioning would let each word indirectly see itself, so the model could trivially predict the target word in a multi-layered context. The weight sits on that last phrase: the problem is not bidirectionality in the abstract, it is bidirectionality combined with stacked layers.
The leak takes a two-step path. Suppose you want to predict the word at some position from the representation your network builds above that position. If the network attends in both directions, then at the first layer the word to the left of the target builds a representation that has looked at the target and absorbed it. At the second layer, the target's own representation looks back at that left neighbor and gets the answer handed back. The information took a detour through a neighbor and came home. You did not even have to let the target attend to itself directly; the stack routes the answer around any such block. Counting the hops shows why this is a stacking problem: the detour needs one layer for a neighbor to absorb the target and a second layer to hand it back, so a single bidirectional layer predicting a removed token has no path to cheat with, and the leak only opens once layers stack. The model stops learning language and starts copying. Toggle the modes and follow the amber path that traces how information moves between positions:
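The hop counting can be checked mechanically. The sketch below is a toy, not BERT: it assumes idealized attention-only layers with no residual connections, and it tracks only which input positions each output could depend on. The names `leaks`, `both_ways`, and `left_only` are made up for the illustration.

```python
def leaks(can_attend, layers, target):
    """Return True if the input at `target` can influence the output at
    `target` after `layers` stacked attention layers.

    can_attend[i][j] is True when position i may read position j.
    Assumption: attention-only layers, no residual path.
    """
    n = len(can_attend)
    # sources[i] = input positions that output i currently depends on
    sources = [{i} for i in range(n)]
    for _ in range(layers):
        sources = [
            set().union(*(sources[j] for j in range(n) if can_attend[i][j]))
            for i in range(n)
        ]
    return target in sources[target]


n, target = 5, 2
# bidirectional, but a position is blocked from attending to itself
both_ways = [[i != j for j in range(n)] for i in range(n)]
# left-to-right: the slot predicting token i reads only positions before i
left_only = [[j < i for j in range(n)] for i in range(n)]

one_layer = leaks(both_ways, 1, target)    # no path back to the target yet
two_layers = leaks(both_ways, 2, target)   # neighbor absorbs it, then hands it back
causal = leaks(left_only, 4, target)       # stacking never opens a path
```

With one bidirectional layer the target has no route to itself, with two it does, and the left-to-right mask stays closed however many layers are stacked.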
That is why the field had settled for unidirectional pre-training. Going left-to-right makes the leak impossible, because a word's representation is never allowed to peek at the very word it is about to predict. ELMo got a taste of both directions by training a separate left-to-right model and a separate right-to-left model and gluing their outputs together at the end. That is real, but it is shallow. The two halves never talk during pre-training; the model never builds a single representation that jointly weighs both sides. ("Shallow" here is about that top-level concatenation, not about network depth. ELMo's language models are perfectly deep LSTMs.)
The third mode in the figure is the way out. If you remove the target word from the input and replace it with a placeholder, there is no answer anywhere in the input for the stack to route home. With the word gone, nothing stops you from letting every other position attend in both directions. You get genuine, deep, jointly bidirectional context, and the prediction is honest because the thing being predicted is not in the room. That placeholder has a name.
Masked language modeling
BERT's first pre-training task is the masked language model (MLM). It is an old idea from psycholinguistics called the Cloze task (Taylor, 1953): delete some words from a passage and ask a reader to fill the blanks from context. BERT does exactly that. It picks 15% of the tokens in a sequence at random, hides them, and trains the network to predict the originals. Because the hidden word is gone from the input, the network is free to read the entire rest of the sequence, both sides, at every layer.
The prediction itself is a plain classification over the vocabulary. BERT works in WordPiece tokens, a fixed set of 30,000 subword pieces rather than whole words, so that a rare word like "unfreeze" splits into known parts ("un", "##freeze") and nothing is ever fully out of vocabulary. The final hidden vector at a masked position is pushed through a softmax over those 30,000 pieces (a softmax just turns the raw scores into probabilities that sum to one), and the model is trained with ordinary cross-entropy against the piece that was actually there. Written out over the set $M$ of masked positions, with $\tilde{w}$ the corrupted sequence the network actually sees and $w_i$ the original piece at position $i$:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log p(w_i \mid \tilde{w})$$
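To make the loss concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: `mlm_loss` and the toy numbers are invented for the example, the vocabulary has three pieces instead of 30,000, and the loss is averaged over the masked positions (a sum would differ only by a constant factor).

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to one."""
    m = max(scores)                      # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mlm_loss(logits, original_ids, masked_positions):
    """Mean cross-entropy over the masked positions only.

    logits[i] holds one score per vocabulary piece at position i;
    original_ids[i] is the piece that was really there.
    """
    losses = []
    for i in masked_positions:
        probs = softmax(logits[i])
        losses.append(-math.log(probs[original_ids[i]]))
    return sum(losses) / len(losses)


# toy example: 4 positions, a 3-piece vocabulary, position 2 was masked
logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
original_ids = [1, 2, 0, 1]
loss = mlm_loss(logits, original_ids, masked_positions=[2])
```

Positions 0, 1, and 3 never enter the loss, which is the sense in which only the masked positions are predicted.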
The 30,000-piece vocabulary is what makes that softmax a fixed, finite classification rather than a guess over an open-ended dictionary. Common words are a single piece each, so "the" and "cat" map to one token. Rare or made-up words split into known subword pieces: "unfreeze" becomes "un" and "##freeze", and "embeddings" becomes "embed", "##ding", "##s", where the "##" marks a piece that continues the previous one rather than starting a new word. Because any string can be spelled out of these 30,000 pieces, there is no such thing as an out-of-vocabulary word, and every token BERT sees is one of a known, fixed set. That is what lets the masked position be predicted by a plain softmax over exactly 30,000 choices: hide a piece, and filling the blank is choosing the right entry from that fixed list, scored with cross-entropy. A model facing a truly open vocabulary could not write that loss down at all.
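The splitting itself can be sketched as a greedy longest-match-first scan, which is how WordPiece vocabularies are applied at tokenization time. Everything below is illustrative: `toy_vocab` is a hypothetical seven-entry vocabulary chosen to reproduce the examples in the text, not BERT's real one.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces.

    Pieces after the first carry a '##' prefix, marking a continuation.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:               # try the longest remaining substring first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:                # nothing matched, give up on the whole word
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# hypothetical toy vocabulary, just large enough for the examples in the text
toy_vocab = {"the", "cat", "un", "##freeze", "embed", "##ding", "##s"}

common = wordpiece("cat", toy_vocab)
rare = wordpiece("unfreeze", toy_vocab)
longer = wordpiece("embeddings", toy_vocab)
```

In this toy, a word with no matching pieces falls back to `[UNK]`; a full-size vocabulary also contains single characters, which is why such fallbacks are rare in practice.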
Step the mask along the sentence below. Notice that the context arcs feeding each prediction arrive from both sides, which is the entire point. A left-to-right model predicting mat in "the cat sat on the ___" only has the left half; BERT also gets to use anything that comes after.
A common slip is to treat this as a denoising autoencoder, a network that corrupts its input and then rebuilds all of it from the corrupted version. The paper draws that contrast explicitly: BERT predicts only the masked positions, never the whole sequence. Reconstructing just the blanks is cheaper, and it is enough: 15% of the tokens per pass, over a 3.3-billion-word corpus and a million steps, is plenty of supervision.
There is a cost to this trick, and the paper is upfront about it. A left-to-right model gets a learning signal at every position of every sentence, because every position predicts the next token. BERT only learns from the 15% it masked, so it needs to see more text to converge. The bet is that deep bidirectional context is worth the slower start, and the results say it is.
The 80/10/10 compromise
If you have been paying attention you have spotted a new problem. The masked language model leans on a special [MASK] token. But that token only exists during pre-training. When you fine-tune on a real task, there are no blanks; the model sees ordinary, complete sentences. So the model has spent all of pre-training learning to behave a certain way around a symbol it will never encounter again. That is a mismatch between pre-training and fine-tuning, and it is the paper's stated reason for the next move. Concretely, a model trained on [MASK] alone is free to learn the crutch that the symbol itself marks the prediction site; fine-tuning never shows it that symbol, so the crutch breaks exactly when the representation is supposed to pay off.
The fix is to not always use [MASK]. When a token is chosen for prediction, BERT replaces it with [MASK] only 80% of the time. 10% of the time it swaps in a random word from the vocabulary, and 10% of the time it leaves the word untouched. In all three cases the model still has to predict the original. Sample the fates below and watch the split:
Figure: of the tokens chosen for prediction, 80% become [MASK], 10% become a random word, and 10% are left unchanged. All three are still predicted. The running counter beneath the boxes approaches 80 / 10 / 10. The point is the model can never be sure a given token was not chosen.

The arithmetic is worth getting exactly right. 15% of all tokens are chosen. Of those, 80% get the mask, which is 12% of all tokens. The random and unchanged cases are 1.5% of all tokens each. It is wrong to say BERT replaces 15% of the input with [MASK]; the real masking rate is 12%.
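As a worked check, here are the same rates applied to a 512-token sequence (512 is the paper's maximum input length; the figures are expected counts, so any single sequence will land near them rather than on them):

```latex
\begin{align*}
\text{chosen}    &= 0.15 \times 512 = 76.8 \\
\text{[MASK]}    &= 0.80 \times 76.8 = 61.44 \quad (12\% \text{ of } 512) \\
\text{random}    &= 0.10 \times 76.8 = 7.68 \quad (1.5\% \text{ of } 512) \\
\text{unchanged} &= 0.10 \times 76.8 = 7.68 \quad (1.5\% \text{ of } 512)
\end{align*}
```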
The paper's stated reason is the mismatch above: by sometimes showing the model a real or random word in a position it must predict, you stop it from assuming that a prediction site always carries the [MASK] symbol, which softens the gap with fine-tuning. A second reason is not stated in the paper, but it follows naturally and is widely cited. Because 1-in-10 chosen tokens looks completely normal and 1-in-10 looks like a plausible wrong word, the model cannot tell by looking which tokens it will be graded on, so it has to keep an honest, contextual representation of every token, not only the blanked ones. That habit, treating every position as potentially the one that matters, is exactly the part fine-tuning inherits, since downstream tasks read representations of ordinary tokens with no blanks anywhere. The paper's own ablations note this masking recipe matters more when BERT is used as a feature extractor (like ELMo) than when fine-tuning the whole model, where BERT turns out to be fairly robust to the exact percentages.
Walk one example through to see the three fates in one place. Take the sentence "the cat sat on the mat", six tokens. BERT chooses 15% of the tokens at random to predict; on this short sentence that rounds to one token, say it lands on "sat". Now a single die roll decides what the network actually sees in that slot. With probability 80% the slot becomes "the cat [MASK] on the mat". With probability 10% it becomes a random word, "the cat apple on the mat". With the remaining 10% it is left alone, "the cat sat on the mat", unchanged. In all three the training target is the same, the original token "sat", and in all three the loss is the cross-entropy between the model's softmax at that position and "sat". The 80/10/10 only changes the input the network reads; it never changes the answer it is graded against. Note the rates are of the chosen 15%, not of the whole sentence: across a real corpus, 12% of all tokens get [MASK], 1.5% get a random word, and 1.5% are left as they were.
The input, and a second task
A lot of the tasks people care about are about pairs of sentences. Does this hypothesis follow from that premise? Is this question answered by that passage? A model that only ever saw single sentences would have no notion of how two of them relate. So BERT's input is built to hold a pair, and it gets a second pre-training task aimed squarely at the relationship between them.
Every input starts with a special [CLS] token. Two segments are packed into one sequence with a [SEP] token between them and after the last. To tell the two segments apart, BERT adds a learned segment embedding to every token marking it as sentence A or sentence B. The segment embedding is what makes the pair structure legible everywhere: the [SEP] token only marks the boundary at one position, while the segment vector tags every token directly, so an attention head never has to infer which half of the pair it is reading. It also adds a learned position embedding (learned, not the fixed sinusoids of the original Transformer). The vector that actually enters the network at each position is the sum of three learned embeddings, one looked up by the token $w_i$, one by its segment $s_i$, and one by the position $i$ itself:

$$e_i = \mathrm{Tok}(w_i) + \mathrm{Seg}(s_i) + \mathrm{Pos}(i)$$
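The packing and the three-way sum can be sketched in a few lines. This is a toy under stated assumptions: the hidden size is 4 rather than 768, the "learned" tables are just random numbers, and `pack`, `embed`, and the seven-word `vocab` are hypothetical names for the illustration.

```python
import random

H = 4                      # toy hidden size
rng = random.Random(0)     # fixed seed so the toy tables are reproducible


def table(rows):
    """Stand-in for a learned embedding table: `rows` random vectors of length H."""
    return [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(rows)]


vocab = {"[CLS]": 0, "[SEP]": 1, "the": 2, "cat": 3, "sat": 4, "it": 5, "purred": 6}
token_emb, segment_emb, position_emb = table(len(vocab)), table(2), table(16)


def pack(sent_a, sent_b):
    """Build [CLS] A [SEP] B [SEP] and a segment id (0 = A, 1 = B) per token."""
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    segments = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, segments


def embed(tokens, segments):
    """Input vector at each position = token + segment + position embedding."""
    return [
        [t + s + p for t, s, p in
         zip(token_emb[vocab[tok]], segment_emb[seg], position_emb[pos])]
        for pos, (tok, seg) in enumerate(zip(tokens, segments))
    ]


tokens, segments = pack(["the", "cat", "sat"], ["it", "purred"])
x = embed(tokens, segments)   # one H-dimensional vector per token
```

Note that the first [SEP] is tagged as segment A and the final one as segment B, so every position carries a segment id.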
The second task is next sentence prediction (NSP). For half the training pairs, sentence B really is the sentence that followed A in the source document (labeled IsNext). For the other half, B is a random sentence pulled from somewhere else in the corpus (labeled NotNext). The final hidden vector of the [CLS] token, written $C$, is fed to a small classifier that has to call it. Toggle sentence B below:
Figure: the input [CLS] A [SEP] B [SEP]. Each token's vector is the sum of a token, a segment (A teal, B amber), and a position embedding. The whole sequence runs through BERT; the [CLS] output decides whether B truly follows A. Swap B for a random sentence and the verdict flips.

NSP is cheap to generate from any plain text and the final model reaches 97 to 98% accuracy on it. One caveat the paper itself flags: $C$ is only a useful summary of the pair after fine-tuning. Straight out of pre-training it is not a general-purpose sentence embedding, so do not reach for it as one.
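Generating NSP examples from plain text takes little more than a coin flip per pair. The sketch below is a simplification, not the paper's data pipeline: `nsp_pairs` is a hypothetical helper, and the "random" sentence is drawn from the whole toy corpus with only the true next sentence and A itself excluded.

```python
import random


def nsp_pairs(documents, rng):
    """Build (sentence_a, sentence_b, label) triples, about half IsNext.

    documents: list of documents, each an ordered list of sentences.
    Assumes the corpus has enough sentences to draw a distractor from.
    """
    all_sentences = [s for doc in documents for s in doc]
    pairs = []
    for doc in documents:
        for a, true_next in zip(doc, doc[1:]):
            if rng.random() < 0.5:
                pairs.append((a, true_next, "IsNext"))
            else:
                # distractor: any sentence except A and the real continuation
                pool = [s for s in all_sentences if s != true_next and s != a]
                pairs.append((a, rng.choice(pool), "NotNext"))
    return pairs


docs = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]
pairs = nsp_pairs(docs, random.Random(0))
```

No labels are needed beyond the order the sentences already had in the document, which is what makes the task self-supervised.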
NSP is also the part of BERT that aged the least well. BERT's own ablation argued NSP helped, and given BERT's training budget and data pipeline it did. Later work, most directly RoBERTa (2019), found that you can drop next-sentence prediction entirely, train the masked language model longer on full documents, and do just as well or better, a finding from after the fact, not something the BERT paper claims. Both can be true: NSP earned its keep inside BERT's recipe and turned out to be dispensable once people trained harder.
Encoder, not decoder
All of this rides on the Transformer from Attention Is All You Need. BERT is its encoder stack, nothing exotic. The single knob that separates BERT from GPT is which entries of the attention matrix are allowed to be nonzero. Attention is an $n \times n$ grid: row $i$ is the token doing the looking, column $j$ is a token it might look at. GPT, a decoder, masks out everything above the diagonal so each token sees only itself and its past. BERT, an encoder, leaves the whole grid open so each token sees the entire sequence. Flip the mask:
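The two masks are easy to write down. In this sketch the attention scores are set to zero everywhere so that the effect of the mask alone is visible; `attention_weights`, `encoder_mask`, and `decoder_mask` are names chosen for the example, and a real model would compute the scores from queries and keys.

```python
import math


def attention_weights(scores, mask):
    """Row-wise softmax restricted to the positions the mask allows.

    scores[i][j]: how strongly token i wants to look at token j.
    mask[i][j]:   True if token i is allowed to look at token j.
    """
    weights = []
    for row, allowed in zip(scores, mask):
        exps = [math.exp(s) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


n = 4
scores = [[0.0] * n for _ in range(n)]                          # uniform, for clarity
encoder_mask = [[True] * n for _ in range(n)]                   # BERT: whole grid open
decoder_mask = [[j <= i for j in range(n)] for i in range(n)]   # GPT: nothing above the diagonal

bert_w = attention_weights(scores, encoder_mask)
gpt_w = attention_weights(scores, decoder_mask)
```

Under the open mask every row spreads its weight over all four tokens; under the causal mask the first token can only attend to itself and each later row widens by one.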
BERT comes in two sizes. $\text{BERT}_{\text{BASE}}$ has $L = 12$ layers, hidden size $H = 768$, $A = 12$ attention heads, and 110M parameters. It was sized deliberately to match OpenAI GPT, so the two could be compared with the architecture held fixed and only the unidirectional-versus-bidirectional choice varying. $\text{BERT}_{\text{LARGE}}$ goes to $L = 24$, $H = 1024$, $A = 16$, and 340M parameters. (The feed-forward inner width is $4H$ in both, the usual Transformer ratio.) Pre-training ran on the BooksCorpus, 800M words, plus English Wikipedia, 2,500M words, for a total of 3.3 billion words, over a million steps of Adam at batches of 128,000 tokens. On the hardware of the day that was four days on 16 TPU chips for the base model and four days on 64 TPU chips for the large one.
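The quoted parameter counts can be roughly recovered from the shapes alone (12 layers at hidden size 768 for the base model, 24 at 1024 for the large one). The tally below is a back-of-the-envelope sketch with several assumptions: the 30,000-piece vocabulary from the text, 512 positions, two segments, biases on every projection, two layer norms per layer plus one on the embeddings, and a pooling layer on [CLS]. The pre-training heads are left out, and `bert_params` is a hypothetical helper.

```python
def bert_params(L, H, vocab=30_000, max_pos=512, segments=2):
    """Approximate parameter count of a BERT-style encoder."""
    ffn = 4 * H                                      # feed-forward inner width
    layer_norm = 2 * H                               # scale and shift
    embeddings = (vocab + max_pos + segments) * H + layer_norm
    attention = 4 * (H * H + H)                      # Q, K, V, output projections
    feed_forward = (H * ffn + ffn) + (ffn * H + H)   # two linear maps with biases
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = H * H + H                               # dense layer on the [CLS] vector
    return embeddings + L * per_layer + pooler


base = bert_params(L=12, H=768)
large = bert_params(L=24, H=1024)
```

Under these assumptions the totals come out near 109M and 335M, in line with the quoted 110M and 340M. The number of attention heads does not appear because splitting a projection into heads leaves its size unchanged.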
One model, many tasks
Now the recipe from Figure 1 pays off. Whatever the task, you keep the pre-trained body and add a thin head, and the head is usually a single matrix. For sentence or sentence-pair classification, you read the [CLS] vector into a softmax over labels. For sequence tagging, you read each token's output into a per-token label. The input is reshaped to fit: a single sentence pairs with an empty second segment; a sentence pair uses both segments; a question and a passage become segments A and B.
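For classification, the "single matrix" is a $K \times H$ weight matrix applied to the [CLS] vector, followed by a softmax over the $K$ labels. A minimal sketch, with a hypothetical `classify` helper and toy numbers standing in for a real [CLS] output:

```python
import math


def classify(C, W):
    """Classification head: softmax over the K rows of W applied to C.

    C: the [CLS] final hidden vector, length H.
    W: K x H matrix, the only new parameters added for the task.
    """
    logits = [sum(w * c for w, c in zip(row, C)) for row in W]
    m = max(logits)                          # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


C = [1.0, 0.0, -1.0]                         # toy [CLS] vector, H = 3
W = [[1.0, 0.0, 0.0],                        # label 0
     [0.0, 0.0, 1.0]]                        # label 1
probs = classify(C, W)
```

During fine-tuning both `W` and the encoder that produced `C` are updated; the sketch shows only the forward pass of the head.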
Question answering is the prettiest case because the head is almost nothing. On SQuAD, the model is given a question and a passage and must point at the span of the passage that answers it. BERT introduces exactly two new vectors at fine-tuning, a start vector $S$ and an end vector $E$, both living in the same $H$-dimensional space as the token outputs $T_i$. The probability that token $i$ begins the answer is a softmax of dot products across the passage:

$$P_i = \frac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}$$

with the identical formula for the end, using $E$. A candidate answer running from $i$ to $j$ scores $S \cdot T_i + E \cdot T_j$, and the highest-scoring span with $j \ge i$ is the prediction. The constraint keeps the end from landing before the start, which two independent softmaxes would not guarantee on their own. That is the entire question-answering architecture: two new vectors on top of the pre-trained encoder.
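```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                  # shift by the max for stability
    return e / e.sum()

# SQuAD: learn one start vector S and one end vector E
def answer_span(T, S, E):                    # T[i] = final hidden of token i, shape (n, H)
    p_start, p_end = softmax(T @ S), softmax(T @ E)   # P(token i starts / ends the answer)
    # pick the span that maximizes S.T[i] + E.T[j] with j >= i
    scores = np.where(np.tri(len(T), dtype=bool).T, (T @ S)[:, None] + (T @ E)[None, :], -np.inf)
    i, j = np.unravel_index(np.argmax(scores), scores.shape)
    return int(i), int(j), p_start, p_end    # answer = tokens[i : j + 1]
```

And here, for symmetry, is the corruption that builds a masked-LM example, the 80/10/10 from Figure 4 in code: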
```python
import random

MASK_ID, IGNORE, VOCAB_SIZE = 103, -100, 30_000    # placeholder values for illustration

# build one masked-LM example from a tokenized sequence
def mask_tokens(ids, rate=0.15):
    ids = list(ids)                                # WordPiece ids (copied, input left intact)
    k = min(len(ids), max(1, round(rate * len(ids))))
    chosen = random.sample(range(len(ids)), k)     # 15% to predict
    labels = [IGNORE] * len(ids)                   # IGNORE = "not a target"
    for i in chosen:
        labels[i] = ids[i]                         # the original id is the answer
        r = random.random()
        if r < 0.80:
            ids[i] = MASK_ID                       # 80%: replace with [MASK]
        elif r < 0.90:
            ids[i] = random.randrange(VOCAB_SIZE)  # 10%: a random token
        # else 10%: leave ids[i] as is (still predicted)
    return ids, labels

# loss = cross_entropy(model(ids), labels)         # only the chosen positions count
```

Fine-tuning is cheap. Every result in the paper can be reproduced in at most an hour on a single Cloud TPU, starting from the one shared pre-trained model. That asymmetry is the point of the whole approach: pay once, up front, for the representation; pay almost nothing per task after that.
What it actually did
BERT set a new state of the art on eleven NLP tasks. On the GLUE benchmark, a suite of nine language-understanding tasks, it pushed the leaderboard score to 80.5, up 7.7 points from OpenAI GPT's 72.8. On SQuAD v1.1 its best configuration reached 93.2 test F1 (that number is an ensemble with extra question-answering data; a single fine-tuned model lands around 91 dev F1). On the harder SQuAD v2.0, which lets a question have no answer, it took test F1 to 83.1, a 5.1-point jump. On SWAG, a commonsense sentence-completion task, $\text{BERT}_{\text{LARGE}}$ hit 86.3, past OpenAI GPT by 8.3 points and within a hair of the 88.0 that five human annotators reach together (and above the 85.0 of a single expert).
Numbers alone do not explain themselves. The ablation does, by isolating which choice did the work. Take the same $\text{BERT}_{\text{BASE}}$ model, the same data, the same fine-tuning, and vary only the pre-training objective. Remove next-sentence prediction. Then remove bidirectionality too, training a plain left-to-right model. Then bolt a BiLSTM back on top to claw some right context back. Watch what happens, and on which tasks:
Figure: ablation panel "LTR, No NSP", a plain left-to-right LM. Bidirectionality is gone; every token sees only its left.
The story the figure tells is clean. On SST-2, sentiment, going left-only costs almost nothing, because the polarity of a review is usually clear from a left-to-right read. On SQuAD it is a catastrophe: the left-only model falls from 88.5 to 77.8 F1, because answering a question about a passage means weighing words that come after the candidate answer as much as before it. The right context was not a luxury; for the hard tasks it was most of the signal. That gap, and the fact that it shows up exactly where you would predict, is the paper's evidence that deep bidirectionality, not the incidental details, is what made BERT work.
BERT also scaled cleanly with size, including on small datasets where bigger pre-trained models had been expected to overfit, and it worked as a feature extractor too: freezing BERT and feeding a concatenation of its top four layers into a task model landed within 0.3 F1 of full fine-tuning on named-entity recognition. The representation was good enough that you could use it either way.
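The feature-extractor variant amounts to reading out the last four layers and gluing them together per token. The sketch below shows only that concatenation step; the layout of `hidden_states` (a list of layers, each a list of per-token vectors, last layer last) is an assumption made for the example, and the task model that consumes the features is omitted.

```python
def top_four_features(hidden_states):
    """Concatenate the top four layers' vectors for each token.

    hidden_states[l][i] is layer l's output vector for token i.
    Returns one vector of length 4 * H per token.
    """
    top = hidden_states[-4:]
    n_tokens = len(top[0])
    return [[x for layer in top for x in layer[i]] for i in range(n_tokens)]


# toy input: 6 layers, 2 tokens, hidden size 3; each entry encodes its layer number
hidden_states = [[[float(l)] * 3 for _ in range(2)] for l in range(6)]
features = top_four_features(hidden_states)
```

Because the encoder stays frozen in this setup, the features can be computed once and reused across experiments.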
Step back and BERT is two ideas wearing a lot of engineering. Reading both directions makes a better representation than reading one. You cannot train that directly, because the answer leaks back through the layers, so you hide the word and predict it from what is left. Everything else, the 80/10/10, the next-sentence task, the packed input, the thin task heads, is in service of those two sentences. The model that fell out of them reset the state of the art across the field and turned "pre-train a big encoder, fine-tune for your task" into the default way to do NLP.
Questions you might still have
Is BERT a chatbot? Does it generate text?
No. BERT is an encoder: it turns text into vectors. It has no decoder, was never trained to write the next word, and cannot be prompted to continue a sentence. It predicts hidden words and feeds a thin task head.
Why can’t you just train a deep model to read both directions?
In a multi-layer bidirectional model, the word you are trying to predict leaks back to its own output through neighboring positions across the stack, so the model copies it instead of learning. Masking the word removes the answer from the input, which makes honest bidirectional prediction possible.
Why 80/10/10 instead of always using [MASK]?
The [MASK] token never appears at fine-tuning, so always masking would create a pre-train/fine-tune mismatch. Replacing 10% with a random word and leaving 10% unchanged softens that gap and, as a useful side effect, forces the model to keep a good representation of every token, not just the blanked ones.
Does next-sentence prediction actually help?
BERT’s own ablation said yes. Later work (RoBERTa, 2019) found you can drop NSP, train the masked LM longer on full documents, and match or beat BERT. So NSP helped within BERT’s budget but turned out to be dispensable.
Can I use the [CLS] vector as a sentence embedding?
Not off the shelf. The paper warns that the [CLS] output is not a meaningful sentence representation without fine-tuning. Sentence-BERT (2019) was built later precisely to produce good sentence embeddings.
Footnotes & further reading
- The paper: Devlin, Chang, Lee, Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Google AI, NAACL 2019). Code and models.
- The architecture is the encoder from Vaswani et al., Attention Is All You Need.
- The two pre-training predecessors BERT positions itself against: Peters et al., Deep contextualized word representations (ELMo), and Radford et al., Improving Language Understanding by Generative Pre-Training (GPT).
- The masked-word objective revives the Cloze task: Taylor, "Cloze procedure": a new tool for measuring readability (1953). The denoising-autoencoder contrast is Vincent et al. (2008); subword tokenization is WordPiece (Wu et al., 2016); the activation is GELU (Hendrycks & Gimpel, 2016).
- On next-sentence prediction being dispensable, and on [CLS] not being an off-the-shelf sentence vector: Liu et al., RoBERTa (2019), and Reimers & Gurevych, Sentence-BERT (2019).
- The evaluation suites: GLUE (Wang et al., 2018), SQuAD (Rajpurkar et al., 2016), and SWAG (Zellers et al., 2018).