Attention Is All You Need
Stop reading one word at a time. Let every token look at every other token, at once.
The Transformer, built from the ground up: embeddings, self-attention, the softmax scaling trick, multiple heads, sinusoidal positions, and the encoder-decoder. Intuition first, then the exact math.
What if a model could read a whole sentence at once, and learn which words matter to which, instead of trudging through it one word at a time?
Almost everything you have heard of in modern AI runs on the architecture in this one 2017 paper. GPT, BERT, the chatbots and code assistants and image generators you have used: under the hood they are Transformers, and the Transformer is what this paper introduced. It is built from a small number of ideas, each simple once you see it from the right angle.
The setup the authors were working in is machine translation. You feed in an English sentence and you want a German one out. For years the standard tool was the recurrent network, which reads a sentence the way you might read it aloud: one word at a time, carrying a running summary in your head and updating it at each step. It worked, but it had a structural problem that no amount of tuning could fix, and that problem is where the whole paper starts.
The Transformer's answer was to throw recurrence out completely and keep only one mechanism: attention, the ability of each word to look directly at every other word and decide what is relevant. No reading left to right. It trained much faster and scored higher on translation than the recurrent systems before it. Getting there takes a handful of ideas: turning words into vectors, what attention computes, the division by $\sqrt{d_k}$, running several heads at once, restoring word order, and assembling an encoder and a decoder. None of it is hard on its own.
The cost of reading left to right
A recurrent network processes a sequence step by step. It reads the first word, updates a hidden state, reads the second word, updates the state again, and so on to the end. That state is the running summary. The hidden state $h_t$ at position $t$ is a function of the previous state $h_{t-1}$ and the current word. That little dependency, each step needing the one before it, is the whole trouble.
Two consequences follow, and both are about distance. The first is speed: because step $t$ cannot begin until step $t-1$ has finished, a sentence of length $n$ takes $n$ sequential steps no matter how many processors you own. A GPU is a machine for doing thousands of things at once, and recurrence forces it to wait in line. The second is memory, in the human sense. For the first word to influence the last, its information has to survive being passed hand to hand through every word in between. Across a long sentence that signal gets diluted, and the model struggles to connect things that are far apart.
Self-attention removes both problems in one move. Instead of a chain, it wires every position directly to every other position. The longest path between any two words is a single hop, and every position is computed at the same time. As the sentence gets longer, the recurrent path grows with $n$ while the attention path stays pinned at one.
This is the trade the paper makes explicit in its complexity table, which we will come back to at the end. A self-attention layer needs more arithmetic per layer (it compares every word with every other, which is $n^2$ comparisons), but it spends that arithmetic in parallel, and it makes the path between distant words constant instead of growing with the sentence. For the sentence lengths used in translation, that is a trade worth making many times over.
Words become vectors
Before any attention happens, the words have to become numbers. A Transformer keeps a lookup table with one row per token in its vocabulary, and each row is a learned vector of length $d_{\text{model}}$, which in the base model is $512$. The word "animal" becomes a specific point in 512-dimensional space; "road" becomes another. These embeddings are learned during training, and similar words drift toward similar regions of the space.
From here on, a token is just its vector, and a sentence is a stack of vectors, a matrix of shape $n \times d_{\text{model}}$ for a sentence of $n$ words. Every layer of the Transformer reads a stack of vectors and writes out a new stack of the same shape, gradually refining what each position "means" in context. The word "it" starts as a generic pronoun vector and, layer by layer, picks up information about which noun it refers to.
One small detail that matters later: the paper multiplies these embedding vectors by $\sqrt{d_{\text{model}}}$ before anything else. The paper states the multiplication without giving a reason. One common reading is that it keeps the embedding and the position signal we add in a moment on a similar scale, so neither dominates early in training. Either way, keep it separate from the $\sqrt{d_k}$ inside attention: two different square roots, two different jobs.
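The lookup and the scaling can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the paper's code: the four-word vocabulary is made up, the table is random where a real model's would be learned, and the initial scale of `d_model ** -0.5` is one common choice that the text above does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "animal": 1, "crossed": 2, "road": 3}  # toy vocabulary (assumption)
d_model = 512                                             # base-model width

# One row per token. Random stand-ins for learned rows; the d_model ** -0.5
# initial scale is an assumption, not something the text specifies.
embedding_table = rng.normal(size=(len(vocab), d_model)) * d_model ** -0.5

def embed(tokens):
    ids = [vocab[t] for t in tokens]                      # token -> row index
    return embedding_table[ids] * np.sqrt(d_model)        # [n, d_model], scaled

X = embed(["the", "animal", "crossed", "the", "road"])
print(X.shape)  # (5, 512)
```

Both occurrences of "the" map to the same row, which is why position has to be added separately later.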
Attention is a soft lookup
Here is the central idea. Think about how a dictionary works. You arrive with a word you want to look up (a query), you scan the entries until you find a matching headword (a key), and you read off its definition (a value). A dictionary is a hard lookup: one key matches, you take its value, done.
Attention is the soft version of that. Every word produces a query, a key, and a value (three different vectors, each a learned linear projection of the word's embedding, which just means the embedding vector multiplied by a learned matrix). To compute the new representation of a word, you take its query and compare it against the keys of all the words, including itself. The comparison gives each word a relevance score. You turn those scores into weights that sum to one, and your answer is the weighted blend of all the value vectors, and that blended vector becomes the word's new representation. Instead of reading one definition, you read a little from every entry, in proportion to how well it matches.
The comparison is a dot product. The dot product of two vectors is large and positive when they point the same way, near zero when they are unrelated, negative when they point apart. So a query and a key that "agree" produce a high score. Turning the scores into weights is the job of the softmax function, which takes a list of numbers and squashes them into positive values that sum to one, with the largest input getting the largest share. High score, high weight, more of that word's value in the blend.
The classic example is a pronoun. In "the animal crossed the road because it got tired," the word "it" needs to figure out what it refers to. Its query, compared against every key, scores highest against "animal," so the new representation of "it" is built mostly from the value of "animal." The model has resolved the reference by attending to the right word.
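Here is the soft lookup for a single query, as a minimal sketch. The two-dimensional keys, values, and query are hand-picked for illustration (a trained model would learn them), and the scaling discussed in the next section is left out.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # subtract the max for numerical stability
    return e / e.sum()

# Hand-picked toy vectors (assumption): three words, 2-d keys and values.
words = ["animal", "road", "it"]
keys = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
values = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
query_it = np.array([4.0, 0.0])      # points the same way as the key of "animal"

scores = keys @ query_it             # one dot product per word: [4, 0, 2]
weights = softmax(scores)            # positive, sums to one, largest on "animal"
new_it = weights @ values            # soft lookup: weighted blend of all values

hard = values[np.argmax(scores)]     # dictionary-style hard lookup, for contrast
```

With these numbers roughly 0.87 of the weight lands on "animal", so `new_it` sits close to, but not exactly on, the value that the hard lookup returns.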
When all the words' queries, keys, and values are stacked into matrices $Q$, $K$, and $V$, the whole operation is two matrix multiplications with a softmax in between. This is the paper's equation (1), and once you have the picture above, you can read it straight off:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \tag{1}$$

Read it left to right. $QK^\top$ is every query dotted with every key, an $n \times n$ grid of raw scores. Dividing by $\sqrt{d_k}$ is the scaling we explain next. The softmax turns each row of scores into a row of weights that sums to one. Multiplying by $V$ replaces each word with the weighted blend of all the values. That is the whole computation. In code it is just as short:
```python
# scaled dot-product attention, one head
# Q: [n, d_k]   K: [m, d_k]   V: [m, d_v]
scores  = Q @ K.T / sqrt(d_k)        # [n, m] query-key match
weights = softmax(scores, axis=-1)   # each row sums to 1
out     = weights @ V                # [n, d_v] weighted values
```

The naming trips people up. When the queries, keys, and values all come from the same sequence, this is self-attention: the sentence attending to itself. Later we will see the same equation used with queries from one sequence and keys and values from another, which is how the decoder reads the encoder. The mechanism never changes; only the source of the three inputs does.
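To make the self-attention versus cross-attention distinction concrete, here is a runnable sketch that calls one function both ways. It assumes random stand-in inputs and skips the learned projections, so it shows shapes and wiring only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # stable exponentials
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # [n, m]
    return weights @ V, weights

rng = np.random.default_rng(0)
n, m, d_k = 5, 7, 64
target = rng.normal(size=(n, d_k))   # stand-in for decoder-side vectors (assumption)
source = rng.normal(size=(m, d_k))   # stand-in for encoder-side vectors (assumption)

# Self-attention: all three inputs come from the same sequence.
self_out, self_w = attention(target, target, target)     # weights: [n, n]
# Cross-attention: queries from one sequence, keys and values from another.
cross_out, cross_w = attention(target, source, source)   # weights: [n, m]
```

The output always has one row per query, whatever the length of the sequence being read.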
Why divide by the square root of d_k
That $\sqrt{d_k}$ in the denominator looks arbitrary, and it is the one part of equation (1) that a newcomer cannot guess. It is there to keep the softmax from saturating, collapsing onto one word so hard that learning stalls, and the reasoning is short.
A score is a dot product of a query and a key, each a vector of length $d_k$ (in the base model, $d_k = 64$). Suppose, as a rough model, that the components of each are independent with mean zero and variance one. The dot product is a sum of $d_k$ products of such components. Each product has mean zero and variance one, and variances of independent things add, so the sum has variance $d_k$:

$$q \cdot k = \sum_{i=1}^{d_k} q_i k_i, \qquad \mathrm{Var}(q \cdot k) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i k_i) = d_k$$

So the typical size of a raw score grows like $\sqrt{d_k}$. With $d_k = 64$ the scores are spread out by a factor of eight before the softmax ever sees them. That matters because softmax is sensitive to the scale of its inputs. Feed it numbers that are all close together and it returns a gentle, spread-out distribution. Feed it numbers that are far apart and it returns a near-spike: one weight close to one, the rest close to zero. A spike is bad for learning, because the gradient of softmax, the signal that says how to change the weights, is tiny wherever it has saturated, so the model can barely adjust which word it attends to. Dividing by $\sqrt{d_k}$ cancels the growth exactly, pulling the variance back to one and keeping the distribution soft.
Why $\sqrt{d_k}$ and not $d_k$? The spread of the scores is set by their standard deviation, which is $\sqrt{d_k}$. Dividing by the variance instead would over-correct and crush the scores too far toward zero. This is also the only difference between this and the older "dot-product attention" it descends from: same idea, missing the scale factor, which is why it underperformed at large $d_k$.
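The variance argument is easy to check numerically. The sketch below assumes exactly the rough model from the text (independent components, mean zero, variance one) and uses random vectors, not anything from a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

variances = {}
for d_k in (4, 64, 512):
    q = rng.normal(size=(10_000, d_k))       # components: mean 0, variance 1
    k = rng.normal(size=(10_000, d_k))
    raw = (q * k).sum(axis=1)                # 10,000 sample dot products
    variances[d_k] = (raw.var(), (raw / np.sqrt(d_k)).var())
    print(d_k, variances[d_k])               # first value near d_k, second near 1

# Effect on the softmax: one query against 10 random keys at d_k = 64.
d_k = 64
q = rng.normal(size=d_k)
K = rng.normal(size=(10, d_k))
unscaled = softmax(K @ q)                    # tends toward a spike
scaled = softmax(K @ q / np.sqrt(d_k))       # softer distribution
print(unscaled.max(), scaled.max())
```

Dividing the scores by any factor above one can only lower the largest softmax weight, so `scaled.max()` never exceeds `unscaled.max()`; how much lower depends on the random draw.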
Many heads at once
A single attention computation has a built-in limit. It produces one set of weights per word, so each word ends up with one weighted blend. But the relationships in a sentence are not one thing. The word "it" cares about the noun it refers to; a verb cares about its subject and its object. One distribution cannot point at both at once. Forced to choose, it averages them together.
Multi-head attention is the fix, and it is the obvious one: run several attentions in parallel and let each specialize. Each head projects the full $d_{\text{model}}$-dimensional vector down into its own smaller $d_k$-dimensional subspace ($d_k = d_{\text{model}}/h = 64$ with $h = 8$ heads); the split is by projection, not by slicing the vector into chunks. Each head gets its own learned projections, so each can choose its own notion of relevance. One head can track the previous word; another can latch onto the subject of a verb several words away.

[Interactive figure: four heads attending to a chosen query in four different ways, with the single-head average at the bottom.]

The eight heads run independently, then their outputs are concatenated back into a single $d_{\text{model}}$-dimensional vector and passed through one more learned projection that mixes them. Because each head works in the reduced dimension $d_k = d_{\text{model}}/h$, the total cost is about the same as one full-size attention. In math, with $\mathrm{head}_i$ being attention computed on the $i$-th set of projections:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$

Here $W_i^Q$, $W_i^K$, and $W_i^V$ are the per-head projections, and $W^O$ is the final mix. In code the structure is exactly as it sounds:
```python
# multi-head attention: h heads run in parallel
def mha(X, h):                                        # X: [n, d_model]
    heads = []
    for i in range(h):                                # each head, its own projections
        Q = X @ Wq[i]; K = X @ Wk[i]; V = X @ Wv[i]   # [n, d_k]
        heads.append(attention(Q, K, V))              # [n, d_v]
    return concat(heads, axis=-1) @ Wo                # back to [n, d_model]
```

The paper checked that the heads matter: a single-head model scored about $0.9$ points lower than the best multi-head setting on a held-out English-to-German tuning set (the development set), measured by BLEU, the translation-quality score we define later. Be honest about the interpretation, though. The clean "this head does syntax" stories hold only sometimes. Later work showed that many heads can be pruned after training with little loss, so read the tidy roles in the figure as an illustration of what heads sometimes do.
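For readers who want to run the multi-head structure, here is a self-contained NumPy sketch. The weights are random stand-ins with an assumed initial scale, and passing them as arguments (instead of globals) is a choice made here for testability. It also counts parameters to back the claim that eight small heads cost about the same as one full-width head.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def mha(X, Wq, Wk, Wv, Wo):
    # Wq, Wk, Wv: [h, d_model, d_k]   Wo: [h * d_k, d_model]
    heads = [attention(X @ Wq[i], X @ Wk[i], X @ Wv[i]) for i in range(len(Wq))]
    return np.concatenate(heads, axis=-1) @ Wo        # back to [n, d_model]

rng = np.random.default_rng(0)
n, d_model, h = 6, 512, 8
d_k = d_model // h                                    # 64 per head
# Random stand-ins for learned weights; the scale is an assumption.
Wq, Wk, Wv = (rng.normal(size=(h, d_model, d_k)) * d_model ** -0.5 for _ in range(3))
Wo = rng.normal(size=(h * d_k, d_model)) * d_model ** -0.5
X = rng.normal(size=(n, d_model))

out = mha(X, Wq, Wk, Wv, Wo)                          # [n, d_model]
multi_head_params = Wq.size + Wk.size + Wv.size + Wo.size
single_head_params = 4 * d_model * d_model            # one full-width Wq, Wk, Wv, Wo
print(out.shape, multi_head_params, single_head_params)
```

Both counts come to $4 \cdot 512^2 = 1{,}048{,}576$ (biases ignored), because $h$ projections of width $d_{\text{model}}/h$ add up to one projection of width $d_{\text{model}}$.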
Putting the order back
Throwing out recurrence cost us something we have not paid for yet: word order. Look again at the attention equation. If you shuffle the words of the sentence, you shuffle the rows of $Q$, $K$, and $V$, and the output is shuffled the same way but otherwise identical. Attention treats the input as a bag of words. To it, "dog bites man" and "man bites dog" are the same set of vectors, which is clearly wrong.
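The shuffle claim can be verified directly. In this sketch the projections are left out (queries, keys, and values are all the raw input), which is a simplification; with learned projections applied row by row the same property holds.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # Projections omitted (assumption) to keep the sketch short.
    return softmax(X @ X.T / np.sqrt(X.shape[-1])) @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))        # rows stand in for "dog", "bites", "man"
perm = np.array([2, 1, 0])         # reorder to "man", "bites", "dog"

out = self_attention(X)
out_shuffled = self_attention(X[perm])

# Shuffling the input only shuffles the output rows the same way.
same_up_to_order = np.allclose(out_shuffled, out[perm])
print(same_up_to_order)  # True
```

Each word gets the same new vector in either sentence, so nothing downstream can tell the two orders apart without a position signal.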
The fix is to stamp each token with its position before the first layer. The paper adds a positional encoding, a vector the same size as the embedding, to each token's embedding. The choice of vector is the clever part. Instead of learning one, they use a fixed pattern of sines and cosines at a range of frequencies:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Unpacking it: $pos$ is the position in the sentence and $i$ indexes the dimension. Even dimensions get a sine, odd dimensions the matching cosine. The frequency falls off as you move through the dimensions, so the leftmost dimensions oscillate quickly as you step through positions and the rightmost ones barely move. The wavelengths run in a geometric progression from $2\pi$ up to about $10000 \cdot 2\pi$. Think of a row of clocks ticking at wildly different speeds: read all their hands at once and you get a code that is unique to each position.
Why sinusoids rather than just learning a position vector? The authors give two reasons. The practical one is extrapolation: a fixed formula is defined at every position, so the model is at least defined at positions beyond any it saw in training (whether it translates them well is another matter). The elegant one is about relative position. For a fixed offset $k$, the encoding of position $pos + k$ is a fixed linear function of the encoding of position $pos$. Each sine-cosine pair is just a point on a circle, and stepping forward by $k$ rotates it by a fixed angle, the same rotation regardless of where you started. That makes "three words back" a consistent linear operation, which is exactly the kind of relationship attention can learn to use. (They also tried learned position embeddings and got nearly identical results. The sinusoids are a reasonable default; they did not clearly beat the learned version.)
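Both the encoding and the rotation property fit in a short sketch. The function name `positional_encoding` and the helper `shift` are names chosen here for illustration; the formula is the sine and cosine pattern described above.

```python
import numpy as np

def positional_encoding(n_positions, d_model=512):
    pos = np.arange(n_positions)[:, None]          # [n, 1]
    i = np.arange(d_model // 2)[None, :]           # [1, d_model / 2]
    angle = pos / 10000 ** (2 * i / d_model)       # [n, d_model / 2]
    pe = np.empty((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dimensions
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions
    return pe

pe = positional_encoding(100)

# Relative position: stepping forward by k is the same rotation everywhere.
k = 3
omega = 1 / 10000 ** (2 * np.arange(256) / 512)    # one frequency per pair
c, s = np.cos(k * omega), np.sin(k * omega)

def shift(row):
    sin_p, cos_p = row[0::2], row[1::2]
    out = np.empty_like(row)
    out[0::2] = sin_p * c + cos_p * s              # sin(a + b)
    out[1::2] = cos_p * c - sin_p * s              # cos(a + b)
    return out

print(np.allclose(shift(pe[10]), pe[13]), np.allclose(shift(pe[50]), pe[53]))
```

The same `c` and `s` move position 10 to 13 and position 50 to 53, which is the "fixed linear function" in action: the rotation depends on the offset, never on the starting point.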
The rest of the block
Attention is the headline, but a Transformer layer has a second half and some connective tissue that does real work. The second half is a feed-forward network applied to each position on its own, identically. After attention has let a word gather information from the rest of the sentence, the feed-forward network sits and thinks about that word in isolation. It is two linear layers with a ReLU (the function $\max(0, x)$, which zeroes out negatives) in between:

$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$

It expands from $d_{\text{model}} = 512$ up to an inner dimension of $d_{ff} = 2048$ and back down, so it is wide in the middle. Inside each layer the feed-forward network holds about twice as many parameters as the attention, so a lot of the raw capacity sits here, not in the attention everyone talks about.
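A small sketch makes the "applied to each position on its own" and "about twice as many parameters" claims checkable. The weights are random stand-ins with an arbitrary scale, and the attention count ignores biases.

```python
import numpy as np

d_model, d_ff = 512, 2048

def ffn(x, W1, b1, W2, b2):
    # ReLU between two linear layers; every row (position) is handled independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
# Random stand-ins for learned weights; the 0.02 scale is an assumption.
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

x = rng.normal(size=(5, d_model))
y = ffn(x, W1, b1, W2, b2)                            # [5, d_model]

ffn_params = W1.size + b1.size + W2.size + b2.size    # 2,099,712
attn_params = 4 * d_model * d_model                   # 1,048,576 (Wq, Wk, Wv, Wo)
print(ffn_params / attn_params)                       # about 2
```

Feeding one row alone gives the same result as that row inside the batch, which is what "position-wise" means here.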
Now the connective tissue, which is much of why the network trains. Around each of the two sublayers (the attention and the feed-forward) the Transformer wraps a residual connection: it adds the sublayer's input back to its output. So a sublayer never has to produce the full answer from scratch; it only has to produce a correction to what it received. That single trick, borrowed from ResNets, is most of why very deep networks are trainable, because it gives the gradient a clean path straight back through the stack. Then each sum is run through layer normalization, which rescales the vector to a stable mean and variance so the numbers flowing through the network do not blow up or vanish. In the paper's notation each sublayer computes $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$.
```python
# one encoder layer (post-LN, the 2017 convention)
a = LayerNorm(x + MultiHeadSelfAttention(x))   # sublayer 1
z = LayerNorm(a + FeedForward(a))              # sublayer 2
# modern code is usually pre-LN: x + Sublayer(LayerNorm(x))
```

This ordering is the one place where the original paper and modern practice disagree. The 2017 design puts the normalization after the residual add, outside the residual branch. This is called post-LN, and it has a quirk: it is unstable to train unless you ramp the learning rate up slowly at the start, the "warmup" the paper uses for its first $4000$ steps. Almost every Transformer since (GPT-2 onward) moves the normalization inside the residual branch instead, computing $x + \mathrm{Sublayer}(\mathrm{LayerNorm}(x))$. This is pre-LN, and it trains stably with far less dependence on warmup. If you read the original equations and then read a modern codebase and they look subtly different here, this is why.
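The two orderings differ by where one function call sits, which a runnable sketch shows plainly. This version assumes a simplified layer norm without the learned gain and bias, and `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to mean 0 and variance 1.
    # The learned gain and bias of a real LayerNorm are omitted (assumption).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln(x, sublayer):          # 2017 paper: LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

def pre_ln(x, sublayer):           # later convention: x + Sublayer(LayerNorm(x))
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16)) * 0.1

def toy_sublayer(v):               # hypothetical stand-in for attention or FFN
    return np.tanh(v @ W)

x = rng.normal(size=(4, 16)) * 5.0
a = post_ln(x, toy_sublayer)       # output is always renormalized
b = pre_ln(x, toy_sublayer)        # input x passes through untouched, plus a correction
```

In post-LN the residual stream itself is rescaled at every sublayer; in pre-LN the identity path from input to output is never normalized, which is the structural difference the paragraph above describes.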
Encoder, decoder, and the mask
Now assemble it. The translation model has two stacks, each six layers tall. The encoder reads the source sentence. Each of its layers is exactly the block above: multi-head self-attention, then a feed-forward network, each wrapped in a residual and a norm. After six layers, the encoder has turned the English sentence into a stack of context-rich vectors, one per source word.
The decoder writes the translation, one word at a time, and it is where attention shows its range, because the decoder uses it in two different ways inside each layer. Counting the encoder, the paper uses attention in three distinct roles:
- Encoder self-attention. The source sentence attends to itself. Queries, keys, and values all come from the previous encoder layer. Every word can see every other word.
- Decoder self-attention (masked). The translation-so-far attends to itself, but with a restriction we get to in a moment.
- Encoder-decoder attention (cross-attention). Here the queries come from the decoder and the keys and values come from the encoder's output. This is how each word being generated reaches back and reads the source sentence. Same equation (1), different sources for the three inputs.
The restriction on decoder self-attention is the last essential idea, and it is a one-line change with a big purpose. When the model generates the translation, it produces word $i$ before it has produced word $i+1$. So while it is deciding word $i$, it must not be allowed to look at words $i+1$ and beyond, because at inference time those do not exist yet. During training, though, the whole target sentence is present at once (that is what makes training parallel), so we have to actively forbid each position from peeking ahead. That is the causal mask: before the softmax, every score that would let a position attend to a later position is set to $-\infty$, which softmax turns into a weight of exactly zero. The attention grid becomes a lower triangle. (The mask alone is not the whole story: the target is also fed in shifted one position to the right, so the model is always asked to predict the next token from the tokens strictly before it.) With the mask off, attention behaves as in the encoder and sees everything; with it on, the decoder can look only backward.
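The mask really is a one-line change. The sketch below assumes random scores in place of a real $QK^\top/\sqrt{d_k}$, and the German tokens and `<s>` start symbol are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n = 5
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))                     # stand-in for Q @ K.T / sqrt(d_k)

future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True where key comes after query
masked = np.where(future, -np.inf, scores)           # the one-line change
weights = softmax(masked, axis=-1)                   # lower triangle, rows sum to 1

# Shifted-right input (illustrative tokens): position i is fed the token
# before the one it must predict.
target = ["Das", "Tier", "war", "müde"]
decoder_input = ["<s>"] + target[:-1]                # ['<s>', 'Das', 'Tier', 'war']
```

The first row of `weights` is forced to put all its weight on position 0, since that is the only position it may see.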
With the mask in place, the decoder can be trained on an entire target sentence in one parallel pass, while still behaving as if it were generating left to right. At the top of the decoder, a final linear layer and a softmax turn each output vector into a probability distribution over the vocabulary, the model's guess for the next word. The same weight matrix is reused in three places (the source embedding, the decoder's target embedding, and the final projection that turns vectors back into word scores). Tying them saves parameters and tends to help, because mapping a word to a vector and scoring a vector against words are mirror-image problems.
Why it won
The motivation from the first section was a claim about complexity, and the paper makes it precise. For a sequence of length $n$ with representation dimension $d$, here is how a self-attention layer compares with the recurrent and convolutional layers it replaced ($k$ is the convolution kernel width and $r$ is the neighborhood size in restricted self-attention):
| Layer type | Complexity / layer | Sequential ops | Max path length |
|---|---|---|---|
| Self-attention | O(n²·d) | O(1) | O(1) |
| Recurrent | O(n·d²) | O(n) | O(n) |
| Convolutional | O(k·n·d²) | O(1) | O(logₖ n) |
| Self-attention (restricted) | O(r·n·d) | O(1) | O(n/r) |
Read the two right-hand columns. Self-attention does all its work in a constant number of sequential steps and connects any two positions in one hop, while recurrence needs $O(n)$ of each. It does pay for this in the first column: its cost grows with $n^2$ because it compares every pair of words, whereas recurrence grows with $n$. The authors point out that this trade favors self-attention whenever the sentence length $n$ is smaller than the representation dimension $d$, which is the usual case for translation. (The $n^2$ term is also exactly what later work on long-context models spends its effort trying to reduce.)
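The crossover at $n = d$ falls straight out of the first column of the table. This sketch plugs numbers into the order-of-growth expressions only; constant factors are dropped, so the values are ratios, not real operation counts.

```python
def self_attention_ops(n, d):
    return n * n * d     # O(n^2 * d): every pair of positions, a d-dimensional dot product

def recurrent_ops(n, d):
    return n * d * d     # O(n * d^2): one d-by-d matrix-vector product per step

d = 512
ratios = {}
for n in (32, 128, 512, 2048):
    ratios[n] = self_attention_ops(n, d) / recurrent_ops(n, d)   # simplifies to n / d
    print(n, ratios[n])
# 32 0.0625
# 128 0.25
# 512 1.0
# 2048 4.0
```

Below $n = d$ self-attention does less arithmetic per layer than recurrence; above it the $n^2$ term takes over, which is where the long-context work mentioned above comes in.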
The translation numbers backed it up. On the WMT 2014 English-to-German task the big Transformer reached $28.4$ BLEU (the standard translation-quality score, higher is better), beating every previously published model, including ensembles, by more than $2$ BLEU. On English-to-French it reached $41.8$ BLEU, a new single-model state of the art. The base model alone scored $27.3$ and $38.1$. And it was cheap by the standards of the day: the big model trained in $3.5$ days on eight P100 GPUs, a fraction of the cost of the recurrent and convolutional systems it beat, which is the parallelism from the first section showing up on the clock.
What the paper could not have stated is the part we now know. The Transformer turned out to be more general than a translation architecture. Drop the encoder and keep the masked decoder and you get the family GPT belongs to. Drop the decoder and keep the encoder and you get BERT. Feed it image patches instead of words and you get a vision Transformer. The reason the same backbone keeps working is the reason it won here: it is a general way to let a set of elements exchange information in parallel, with the path between any two held constant. The title was a slight overstatement (you still need the embeddings, the feed-forward layers, the positions, and the residuals) but as a bet on which mechanism mattered most, it has held up unusually well.
Questions you might still have
If attention has no sense of order, why not just feed in the position as a single number?
A lone scalar would be one dimension competing with 512 others, and it would not compose. The sinusoids spread position across all dimensions, and because a fixed offset is a fixed rotation of each sine-cosine pair, relative position becomes a clean linear relationship the model can attend by.
Why divide by √dₖ exactly, not by dₖ?
The spread of a dot-product score is set by its standard deviation, √dₖ. The variance, dₖ, is the wrong scale: dividing by it would over-shrink the scores toward zero and make every key look equally relevant.
Is the original Transformer the same thing as GPT?
No. The 2017 paper is a full encoder-decoder for translation. GPT is decoder-only: it keeps the masked self-attention stack and drops the encoder and the cross-attention. BERT is the opposite, encoder-only. Same building blocks, different subsets.
Do attention heads really specialize the way the figure shows?
Some do, learning roughly positional or syntactic patterns, but only as a tendency. Later work pruned many heads after training with little loss in quality, so the tidy per-head roles are best read as illustration.
Footnotes & further reading
- The paper: Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, Attention Is All You Need (NeurIPS 2017). For a line-by-line implementation, Harvard NLP's The Annotated Transformer.
- The attention idea this builds on: Bahdanau, Cho, Bengio, Neural Machine Translation by Jointly Learning to Align and Translate (additive attention), and Luong, Pham, Manning, Effective Approaches to Attention-based NMT (multiplicative).
- Residual connections: He, Zhang, Ren, Sun, Deep Residual Learning for Image Recognition. Layer normalization: Ba, Kiros, Hinton, Layer Normalization.
- Post-LN vs pre-LN: Xiong et al., On Layer Normalization in the Transformer Architecture, which explains why the original ordering needs learning-rate warmup and the rearranged one does not.
- Weight tying between the embeddings and the output projection: Press & Wolf, Using the Output Embedding to Improve Language Models.
- On how interpretable heads really are: Michel, Levy, Neubig, Are Sixteen Heads Really Better than One?, and Voita et al., Analyzing Multi-Head Self-Attention.
- What the architecture became: Devlin et al., BERT (encoder-only); Brown et al., GPT-3 (decoder-only); Dosovitskiy et al., An Image is Worth 16x16 Words (vision).