
Attention Is All You Need

Stop reading one word at a time. Let every token look at every other token, at once.

The Transformer, built from the ground up: embeddings, self-attention, the softmax scaling trick, multiple heads, sinusoidal positions, and the encoder-decoder. Intuition first, then the exact math.

Explaining the paper · Attention Is All You Need · Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin · Google · NeurIPS 2017 · arXiv:1706.03762

What if a model could read a whole sentence at once, and learn which words matter to which, instead of trudging through it one word at a time?

Almost everything you have heard of in modern AI runs on the architecture in this one 2017 paper. GPT, BERT, the chatbots and code assistants and image generators you have used: under the hood they are Transformers, and the Transformer is what this paper introduced. It is built from a small number of ideas, each simple once you see it from the right angle.

The setup the authors were working in is machine translation. You feed in an English sentence and you want a German one out. For years the standard tool was the recurrent network, which reads a sentence the way you might read it aloud: one word at a time, carrying a running summary in your head and updating it at each step. It worked, but it had a structural problem that no amount of tuning could fix, and that problem is where the whole paper starts.

The Transformer's answer was to throw recurrence out completely and keep only one mechanism: attention, the ability of each word to look directly at every other word and decide what is relevant. No reading left to right. It trained much faster and scored higher on translation than the recurrent systems before it. Getting there takes a handful of ideas: turning words into vectors, what attention computes, the division by $\sqrt{d_k}$, running several heads at once, restoring word order, and assembling an encoder and a decoder. None of it is hard on its own.

The cost of reading left to right

A recurrent network processes a sequence step by step. It reads the first word, updates a hidden state, reads the second word, updates the state again, and so on to the end. That state is the running summary. The hidden state $h_t$ at position $t$ is a function of the previous state $h_{t-1}$ and the current word. That little dependency, each step needing the one before it, is the whole trouble.

Two consequences follow, and both are about distance. The first is speed: because step $t$ cannot begin until step $t-1$ has finished, a sentence of length $n$ takes $n$ sequential steps no matter how many processors you own. A GPU is a machine for doing thousands of things at once, and recurrence forces it to wait in line. The second is memory, in the human sense. For the first word to influence the last, its information has to survive being passed hand to hand through every word in between. Across a long sentence that signal gets diluted, and the model struggles to connect things that are far apart.
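
To make the dependency concrete, here is a minimal NumPy sketch of a recurrent update. The tanh cell and the weight names `W_h` and `W_x` are illustrative assumptions (a plain Elman-style network), not something the paper specifies; the point is only that the loop body cannot start step `t` until step `t-1` has produced `h`.

```python
import numpy as np

def rnn_forward(X, W_h, W_x):
    """Run a plain tanh RNN over X ([n, d]) and return every hidden state ([n, d])."""
    n, d = X.shape
    h = np.zeros(d)                    # h_0: an empty running summary
    states = []
    for t in range(n):                 # step t needs h from step t-1: inherently sequential
        h = np.tanh(W_h @ h + W_x @ X[t])
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
n, d = 6, 4                            # toy sizes, chosen only for illustration
X = rng.normal(size=(n, d))
W_h = rng.normal(size=(d, d)) * 0.5
W_x = rng.normal(size=(d, d)) * 0.5
H = rnn_forward(X, W_h, W_x)
print(H.shape)                         # (6, 4)
```

Information from `X[0]` reaches `H[-1]` only by passing through every intermediate `h`, which is the long path the paragraph above describes.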

Self-attention removes both problems in one move. Instead of a chain, it wires every position directly to every other position. The longest path between any two words is a single hop, and every position is computed at the same time. Toggle between the two below and drag the sentence length. Watch the recurrent path grow with $n$ while the attention path stays pinned at one.

Figure 1 · path length
In a recurrent layer, information from the first token reaches the last only by hopping through every token between them, and the steps run in sequence. In a self-attention layer every token is wired straight to every other, so the longest path is one hop and all positions compute in parallel.

This is the trade the paper makes explicit in its complexity table, which we will come back to at the end. A self-attention layer needs more arithmetic per layer (it compares every word with every other, which is $n^2$ comparisons), but it spends that arithmetic in parallel, and it makes the path between distant words constant instead of growing with the sentence. For the sentence lengths used in translation, that is a trade worth making many times over.

Words become vectors

Before any attention happens, the words have to become numbers. A Transformer keeps a lookup table with one row per token in its vocabulary, and each row is a learned vector of length $d_{\text{model}}$, which in the base model is 512. The word "animal" becomes a specific point in 512-dimensional space; "road" becomes another. These embeddings are learned during training, and similar words drift toward similar regions of the space.

From here on, a token is just its vector, and a sentence is a stack of vectors, a matrix of shape $n \times d_{\text{model}}$ for a sentence of $n$ words. Every layer of the Transformer reads a stack of vectors and writes out a new stack of the same shape, gradually refining what each position "means" in context. The word "it" starts as a generic pronoun vector and, layer by layer, picks up information about which noun it refers to.

One small detail that matters later: the paper multiplies these embedding vectors by $\sqrt{d_{\text{model}}}$ before anything else. The paper just states the multiply. One common reading is that it keeps the embedding and the position signal we add in a moment on a similar scale, so neither dominates early in training. Either way, keep it separate from the $1/\sqrt{d_k}$ inside attention: two different square roots, two different jobs.
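
The lookup and the scaling can be sketched in a few lines of NumPy. The five-word vocabulary, the random table `E` (standing in for learned weights), and the small `d_model` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "animal", "crossed", "road", "it"]   # toy vocabulary (assumption)
d_model = 8                                          # 512 in the base model
E = rng.normal(size=(len(vocab), d_model))           # stands in for the learned table

token_to_id = {tok: i for i, tok in enumerate(vocab)}
sentence = ["the", "animal", "crossed", "the", "road"]
ids = np.array([token_to_id[tok] for tok in sentence])

X = E[ids] * np.sqrt(d_model)    # row lookup, then the sqrt(d_model) scaling
print(X.shape)                   # (5, 8): n x d_model
```

Both occurrences of "the" get the same row at this stage; it is the later layers (and the position signal) that make them differ.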

Attention is a soft lookup

Here is the central idea. Think about how a dictionary works. You arrive with a word you want to look up (a query), you scan the entries until you find a matching headword (a key), and you read off its definition (a value). A dictionary is a hard lookup: one key matches, you take its value, done.

Attention is the soft version of that. Every word produces a query, a key, and a value (three different vectors, each a learned linear projection of the word's embedding, which just means the embedding vector multiplied by a learned matrix). To compute the new representation of a word, you take its query and compare it against the keys of all the words, including itself. The comparison gives each word a relevance score. You turn those scores into weights that sum to one, and your answer is the weighted blend of all the value vectors, and that blended vector becomes the word's new representation. Instead of reading one definition, you read a little from every entry, in proportion to how well it matches.

The comparison is a dot product. The dot product of two vectors is large and positive when they point the same way, near zero when they are unrelated, negative when they point apart. So a query and a key that "agree" produce a high score. Turning the scores into weights is the job of the softmax function, which takes a list of numbers and squashes them into positive values that sum to one, with the largest input getting the largest share. High score, high weight, more of that word's value in the blend.
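
A tiny worked example, in plain Python, of the two ingredients just described. The two-dimensional query and keys are made-up numbers chosen so the three cases (same direction, unrelated, opposite) are easy to see.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to one."""
    m = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

query = [1.0, 0.0]
keys = [[1.0, 0.0],      # points the same way as the query
        [0.0, 1.0],      # unrelated (orthogonal)
        [-1.0, 0.0]]     # points the opposite way

scores = [dot(query, k) for k in keys]        # [1.0, 0.0, -1.0]
weights = softmax(scores)
print(scores, [round(w, 3) for w in weights])
```

The key that agrees with the query receives the largest weight, the opposing key the smallest, and the three weights sum to one.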

The classic example is a pronoun. In "the animal crossed the road because it got tired," the word "it" needs to figure out what it refers to. Its query, compared against every key, scores highest against "animal," so the new representation of "it" is built mostly from the value of "animal." The model has resolved the reference by attending to the right word. Pick a query token below and watch where its attention goes.

Figure 2 · scaled dot-product attention
Each token sends a query and gets a weight on every other token, the softmax of their scaled dot products. The output is the weighted sum of the value vectors. For the query "it," the weight piles onto "animal," the noun it refers to. (Weights here are illustrative; the mechanism is exact.)

When all the words' queries, keys, and values are stacked into matrices $Q$, $K$, and $V$, the whole operation is two matrix multiplications with a softmax in between. This is the paper's equation (1), and once you have the picture above, you can read it straight off:

$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{1}$$

Read it left to right. $QK^{\top}$ is every query dotted with every key, an $n \times n$ grid of raw scores. Dividing by $\sqrt{d_k}$ is the scaling we explain next. The softmax turns each row of scores into a row of weights that sums to one. Multiplying by $V$ replaces each word with the weighted blend of all the values. That is the whole computation. In code it is just as short:

# scaled dot-product attention, one head
# Q: [n, d_k]   K: [m, d_k]   V: [m, d_v]
scores  = Q @ K.T / sqrt(d_k)       # [n, m] query-key match
weights = softmax(scores, axis=-1)  # each row sums to 1
out     = weights @ V               # [n, d_v] weighted values
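
For readers who want to run it, here is one way the pseudocode above could be fleshed out in NumPy. The hand-written `softmax` helper and the random toy inputs are assumptions added for the sketch; the three lines inside `attention` mirror the pseudocode.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention. Q: [n, d_k], K: [m, d_k], V: [m, d_v]."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # [n, m] query-key match
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights               # [n, d_v], [n, m]

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))                   # 3 queries
K = rng.normal(size=(5, 4))                   # 5 keys
V = rng.normal(size=(5, 2))                   # 5 values
out, weights = attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))
```

Returning the weights alongside the output is a convenience for inspection, not part of equation (1).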

The naming trips people up. When the queries, keys, and values all come from the same sequence, this is self-attention: the sentence attending to itself. Later we will see the same equation used with queries from one sequence and keys and values from another, which is how the decoder reads the encoder. The mechanism never changes; only where the three inputs come from.

Why divide by the square root of d_k

That $\sqrt{d_k}$ in the denominator looks arbitrary, and it is the one part of equation (1) that a newcomer cannot guess. It is there to keep the softmax from saturating, collapsing onto one word so hard that learning stalls, and the reasoning is short.

A score is a dot product of a query and a key, each a vector of length $d_k$ (in the base model, 64). Suppose, as a rough model, that the components of each are independent with mean zero and variance one. The dot product is a sum of $d_k$ products of such components. Each product has mean zero and variance one, and variances of independent things add, so the sum has variance $d_k$:

$$q \cdot k = \sum_{i=1}^{d_k} q_i k_i, \qquad \mathrm{Var}(q \cdot k) = d_k$$

So the typical size of a raw score grows like $\sqrt{d_k}$. With $d_k = 64$ the scores are spread out by a factor of eight before the softmax ever sees them. That matters because softmax is sensitive to the scale of its inputs. Feed it numbers that are all close together and it returns a gentle, spread-out distribution. Feed it numbers that are far apart and it returns a near-spike: one weight close to one, the rest close to zero. A spike is bad for learning, because the gradient of softmax, the signal that says how to change the weights, is tiny wherever it has saturated, so the model can barely adjust which word it attends to. Dividing by $\sqrt{d_k}$ cancels the growth exactly, pulling the variance back to one and keeping the distribution soft. Drag $d_k$ and compare the two:

Figure 3 · the scaling trick
The same eight scores, run through softmax two ways. Without scaling, the spread grows like $\sqrt{d_k}$ and the distribution collapses onto one key as $d_k$ rises, where the gradient is nearly flat. Dividing by $\sqrt{d_k}$ holds the spread near one, keeping attention soft and trainable.

Why $\sqrt{d_k}$ and not $d_k$? The spread of the scores is set by their standard deviation, which is $\sqrt{d_k}$. Dividing by the variance $d_k$ instead would over-correct and crush the scores too far toward zero. This is also the only difference between this and the older "dot-product attention" it descends from: same idea, missing the scale factor, which is why it underperformed at large $d_k$.
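
The variance argument is easy to check numerically. This sketch draws queries and keys under the rough model above (independent components, mean zero, variance one); the sample count and the seed are arbitrary choices.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=(10_000, d_k))      # unit-variance components
k = rng.normal(size=(10_000, d_k))
raw = (q * k).sum(axis=1)               # 10,000 sample dot products
scaled = raw / np.sqrt(d_k)

print(raw.var(), scaled.var())          # roughly d_k, and roughly 1

# Softmax over the same eight scores, unscaled versus scaled
w_raw = softmax(raw[:8])
w_scaled = softmax(scaled[:8])
print(w_raw.max(), w_scaled.max())      # the unscaled distribution is the peakier one
```

Dividing every score by the same positive constant greater than one can only flatten the softmax, so the largest unscaled weight is never smaller than the largest scaled one.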

Many heads at once

A single attention computation has a built-in limit. It produces one set of weights per word, so each word ends up with one weighted blend. But the relationships in a sentence are not one thing. The word "it" cares about the noun it refers to; a verb cares about its subject and its object. One distribution cannot point at both at once. Forced to choose, it averages them together.

Multi-head attention is the fix, and it is the obvious one: run several attentions in parallel and let each specialize. Each head projects the full 512-dimensional vector down into its own smaller 64-dimensional subspace ($d_k = d_v = d_{\text{model}}/h = 64$ with $h = 8$ heads); the split is by projection, not by slicing the vector into chunks. Each head gets its own learned projections, so each can choose its own notion of relevance. One head can track the previous word; another can latch onto the subject of a verb several words away. Pick a query and see four heads attend four different ways, with the single-head average at the bottom:

Figure 4 · multi-head attention
Each head attends in its own subspace, so different heads lock onto different relations at once: here h1 the previous token, h2 the next, h3 the nouns, h4 a broad neighborhood. A lone head, with only one distribution to spend, blurs them into the single average in the last row. The roles shown are illustrative; in a trained model not every head is this tidy.

The eight heads run independently, then their outputs are concatenated back into a single 512-dimensional vector and passed through one more learned projection $W^O$ that mixes them. Because each head works in $1/8$ the dimension, the total cost is about the same as one full-size attention. In math, with $\mathrm{head}_i$ being attention computed on the $i$-th set of projections:

$$\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O$$
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$

Here $W_i^Q, W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ are the per-head projections, and $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$ is the final mix. In code the structure is exactly as it sounds:

# multi-head attention: h heads run in parallel
def mha(X, h):                          # X: [n, d_model]
    heads = []
    for i in range(h):                  # each head, its own projections
        Q = X @ Wq[i]; K = X @ Wk[i]; V = X @ Wv[i]   # [n, d_k]
        heads.append(attention(Q, K, V))             # [n, d_v]
    return concat(heads, axis=-1) @ Wo  # back to [n, d_model]
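
A runnable variant of the same structure, as a sketch. The weights are random stand-ins for learned parameters, passed in explicitly rather than read from globals, and the sizes are shrunk from the base model's so the example stays small.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def mha(X, Wq, Wk, Wv, Wo):
    """X: [n, d_model]; Wq, Wk, Wv: [h, d_model, d_k]; Wo: [h * d_k, d_model]."""
    heads = [attention(X @ Wq[i], X @ Wk[i], X @ Wv[i])   # each head: [n, d_k]
             for i in range(Wq.shape[0])]
    return np.concatenate(heads, axis=-1) @ Wo            # back to [n, d_model]

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4           # toy sizes; the base model uses d_model=512, h=8
d_k = d_model // h
Wq, Wk, Wv = (rng.normal(size=(h, d_model, d_k)) / np.sqrt(d_model) for _ in range(3))
Wo = rng.normal(size=(h * d_k, d_model)) / np.sqrt(d_model)
X = rng.normal(size=(n, d_model))
Y = mha(X, Wq, Wk, Wv, Wo)
print(Y.shape)                     # (5, 16)
```

Real implementations usually fuse the per-head loop into one batched matrix multiply; the loop is kept here because it matches the equations one to one.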

The paper checked that the heads matter: a single-head model scored about 0.9 points lower on a held-out English-to-German tuning set (the development set), measured by BLEU, the translation-quality score we define later. Be honest about the interpretation, though. The clean "this head does syntax" stories hold only sometimes. Later work showed that many heads can be pruned after training with little loss, so read the tidy roles in the figure as an illustration of what heads sometimes do.

Putting the order back

Throwing out recurrence cost us something we have not paid for yet: word order. Look again at the attention equation. If you shuffle the words of the sentence, you shuffle the rows of $Q$, $K$, and $V$, and the output is shuffled the same way but otherwise identical. Attention treats the input as a bag of words. To it, "dog bites man" and "man bites dog" are the same set of vectors, which is clearly wrong.
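
The shuffle claim can be verified directly. In this sketch the projection matrices are random stand-ins and the permutation is arbitrary; self-attention on the reordered input returns exactly the reordered output.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))                    # four token vectors, no position signal
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = np.array([2, 0, 3, 1])                  # an arbitrary reordering of the tokens
out = self_attention(X, Wq, Wk, Wv)
out_shuffled = self_attention(X[perm], Wq, Wk, Wv)

print(np.allclose(out_shuffled, out[perm]))    # True
```

Each token ends up with the same vector wherever it sits in the sentence, which is why order has to be injected separately.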

The fix is to stamp each token with its position before the first layer. The paper adds a positional encoding, a vector the same size as the embedding, to each token's embedding. The choice of vector is the clever part. Instead of learning one, they use a fixed pattern of sines and cosines at a range of frequencies:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Unpacking it: $pos$ is the position in the sentence and $i$ indexes the dimension. Even dimensions get a sine, odd dimensions the matching cosine. The frequency falls off as you move through the dimensions, so the leftmost dimensions oscillate quickly as you step through positions and the rightmost ones barely move. The wavelengths run in a geometric progression from $2\pi$ up to about $10000 \cdot 2\pi$. Think of a row of clocks ticking at wildly different speeds: read all their hands at once and you get a code that is unique to each position. The heatmap shows it; drag the position to read off one row's fingerprint:

Figure 5 · positional encoding
Each position gets a fingerprint of sines and cosines across the $d_{\text{model}}$ dimensions. Wavelengths run geometrically from $2\pi$ (fast, left columns) to $10000 \cdot 2\pi$ (slow, right). Teal marks positive values, amber negative, so each position row is a distinct striped fingerprint.
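
The formula translates almost symbol for symbol into NumPy. This is a sketch that assumes an even `d_model`; the function name is ours.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal encodings, shape [n_positions, d_model]; d_model assumed even."""
    pos = np.arange(n_positions)[:, None]            # [n, 1]
    i = np.arange(d_model // 2)[None, :]             # [1, d_model/2]
    angles = pos / (10000 ** (2 * i / d_model))      # [n, d_model/2]
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions get the sine
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions the matching cosine
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)      # (50, 512)
print(pe[0, :4])     # position 0: sin(0) = 0 and cos(0) = 1, alternating
```

The encoding is added to the (scaled) embedding matrix row by row, so it needs the same shape as the stack of token vectors.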

Why sinusoids rather than just learning a position vector? The authors give two reasons. The practical one is extrapolation: a fixed formula is defined at every position, so the model is at least defined at positions longer than any it saw in training (whether it translates them well is another matter). The elegant one is about relative position. For a fixed offset $k$, the encoding of position $pos+k$ is a fixed linear function of the encoding of position $pos$. Each sine-cosine pair is just a point on a circle, and stepping forward by $k$ rotates it by a fixed angle, the same rotation regardless of where you started. That makes "three words back" a consistent linear operation, which is exactly the kind of relationship attention can learn to use. (They also tried learned position embeddings and got nearly identical results. The sinusoids are a reasonable default; they did not clearly beat the learned version.)
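
The rotation claim follows from the angle-addition identities, and it can be checked in a few lines. In this sketch `omega` holds one frequency per sine-cosine pair and `R` holds one 2x2 matrix per frequency; the names, the toy `d_model`, and the offset are assumptions for illustration.

```python
import numpy as np

d_model, k = 8, 3                                    # toy size and offset
omega = 1.0 / 10000 ** (2 * np.arange(d_model // 2) / d_model)   # one frequency per pair

def pe_pairs(pos):
    """Encoding of one position as [d_model/2, 2] rows of (sin, cos)."""
    return np.stack([np.sin(omega * pos), np.cos(omega * pos)], axis=-1)

# One 2x2 rotation per frequency; it depends on the offset k only, never on pos
c, s = np.cos(omega * k), np.sin(omega * k)
R = np.stack([np.stack([c, s], axis=-1),
              np.stack([-s, c], axis=-1)], axis=-2)  # [d_model/2, 2, 2]

for pos in (0, 7, 40):
    shifted = np.einsum("fij,fj->fi", R, pe_pairs(pos))
    print(np.allclose(shifted, pe_pairs(pos + k)))   # True for every starting position
```

The same `R` maps position 0 to 3, 7 to 10, and 40 to 43, which is what "a fixed linear function" means here.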

The rest of the block

Attention is the headline, but a Transformer layer has a second half and some connective tissue that does real work. The second half is a feed-forward network applied to each position on its own, identically. After attention has let a word gather information from the rest of the sentence, the feed-forward network sits and thinks about that word in isolation. It is two linear layers with a $\mathrm{ReLU}$ (the function $\max(0, x)$, which zeroes out negatives) in between:

$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2 \tag{2}$$

It expands from $d_{\text{model}} = 512$ up to an inner dimension of $d_{ff} = 2048$ and back down, so it is wide in the middle. Inside each layer the feed-forward network holds about twice as many parameters as the attention, so a lot of the raw capacity sits here, not in the attention everyone talks about.
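
Both claims, that the network acts on each position independently and that it holds about twice the parameters of the attention, can be checked with a short sketch. The weights are random stand-ins, and the attention count ignores biases, which is an assumption about how one chooses to count.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: equation (2) applied to each row of x."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff, h = 512, 2048, 8
d_k = d_v = d_model // h

rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

X = rng.normal(size=(4, d_model))                      # four positions
Y = ffn(X, W1, b1, W2, b2)
print(Y.shape)                                         # (4, 512)
print(np.allclose(Y[2], ffn(X[2], W1, b1, W2, b2)))    # True: row 2 never sees the other rows

# Parameter counts for one layer (attention biases ignored)
attn_params = 3 * h * d_model * d_k + (h * d_v) * d_model
ffn_params = W1.size + b1.size + W2.size + b2.size
print(attn_params, ffn_params)                         # 1048576 2099712
```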

Now the connective tissue, which is much of why the network trains. Around each of the two sublayers (the attention and the feed-forward) the Transformer wraps a residual connection: it adds the sublayer's input back to its output. So a sublayer never has to produce the full answer from scratch; it only has to produce a correction to what it received. That single trick, borrowed from ResNets, is most of why very deep networks are trainable, because it gives the gradient a clean path straight back through the stack. Then each sum is run through layer normalization, which rescales the vector to a stable mean and variance so the numbers flowing through the network do not blow up or vanish. In the paper's notation each sublayer computes $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$.

# one encoder layer (post-LN, the 2017 convention)
a = LayerNorm(x + MultiHeadSelfAttention(x))   # sublayer 1
z = LayerNorm(a + FeedForward(a))              # sublayer 2
# modern code is usually pre-LN: x + Sublayer(LayerNorm(x))

This ordering is the one place where the original paper and modern practice disagree. The 2017 design puts the normalization after the residual add, outside the loop. This is called post-LN, and it has a quirk: it is unstable to train unless you ramp the learning rate up slowly at the start, the "warmup" the paper uses for its first 4000 steps. Almost every Transformer since (GPT-2 onward) moves the normalization inside the residual branch instead, computing $x + \mathrm{Sublayer}(\mathrm{LayerNorm}(x))$. This is pre-LN, and it trains stably without the warmup. If you read the original equations and then read a modern codebase and they look subtly different here, this is why.
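
The two orderings differ by where one function call sits. In this sketch the layer norm omits its learned gain and bias, and `toy_sublayer` is a made-up stand-in for attention or the feed-forward network; both are simplifications for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (learned gain and bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln(x, sublayer):          # the 2017 ordering
    return layer_norm(x + sublayer(x))

def pre_ln(x, sublayer):           # the ordering most later code uses
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.1

def toy_sublayer(x):               # hypothetical stand-in, not real attention
    return np.maximum(0, x @ W)

x = rng.normal(size=(3, 8)) * 5.0
print(post_ln(x, toy_sublayer).std(axis=-1))   # about 1 per row: the output is normalized
print(pre_ln(x, toy_sublayer).std(axis=-1))    # not normalized: x passes through untouched
```

In pre-LN the residual stream is never rescaled inside the layer, so the identity path from input to output stays clean; that is the structural difference the stability argument rests on.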

Encoder, decoder, and the mask

Now assemble it. The translation model has two stacks, each six layers tall. The encoder reads the source sentence. Each of its layers is exactly the block above: multi-head self-attention, then a feed-forward network, each wrapped in a residual and a norm. After six layers, the encoder has turned the English sentence into a stack of context-rich vectors, one per source word.

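The decoder writes the translation, one word at a time, and it is where attention shows its range, because the decoder uses it in two different ways inside each layer. Counting the encoder, the paper uses attention in three distinct roles: self-attention in the encoder, where every source position attends to every other source position; masked self-attention in the decoder, where each target position attends only to itself and earlier target positions; and encoder-decoder attention, where the queries come from the decoder and the keys and values come from the encoder's output, so every target position can look at the whole source sentence.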

The restriction on decoder self-attention is the last essential idea, and it is a one-line change with a big purpose. When the model generates the translation, it produces word $i$ before it has produced word $i+1$. So while it is deciding word $i$, it must not be allowed to look at words $i+1$ and beyond, because at inference time those do not exist yet. During training, though, the whole target sentence is present at once (that is what makes training parallel), so we have to actively forbid each position from peeking ahead. That is the causal mask: before the softmax, every score that would let position $i$ attend to a later position is set to $-\infty$, which softmax turns into a weight of exactly zero. The attention grid becomes a lower triangle. (The mask alone is not the whole story: the target is also fed in shifted one position to the right, so the model is always asked to predict the next token from the tokens strictly before it.) Toggle the mask on and off below: off is the encoder seeing everything; on is the decoder allowed to look only backward.

Figure 6 · the causal mask
Attention as an $n \times n$ grid: row $i$ is the query, column $j$ the key. The encoder leaves the whole grid live. The decoder sets every cell above the diagonal to $-\infty$ before the softmax, so each position attends only to itself and the past. Step the position to watch the allowed region grow.
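
The mask really is a one-line change. In this sketch the random `scores` matrix stands in for the scaled query-key products of a five-token target sentence.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n = 5
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))                     # stand-in for Q @ K.T / sqrt(d_k)

future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
masked = np.where(future, -np.inf, scores)           # forbid attending to later positions
weights = softmax(masked, axis=-1)

print(np.round(weights, 2))                          # lower-triangular; each row sums to 1
```

The first row has a single live cell, so the first position puts all of its weight on itself; every later row spreads its weight over one more position than the row above it.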

With the mask in place, the decoder can be trained on an entire target sentence in one parallel pass, while still behaving as if it were generating left to right. At the top of the decoder, a final linear layer and a softmax turn each output vector into a probability distribution over the vocabulary, the model's guess for the next word. The same weight matrix is reused in three places (the source embedding, the decoder's target embedding, and the final projection that turns vectors back into word scores). Tying them saves parameters and tends to help, because mapping a word to a vector and scoring a vector against words are mirror-image problems.
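
A minimal sketch of the two bookkeeping details around the decoder: the right shift of the target and the reuse of one matrix for embedding and output scoring. The start-of-sequence id `BOS`, the toy token ids, and the random matrix `E` are hypothetical, and the decoder stack itself is replaced by a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8                   # toy sizes
E = rng.normal(size=(vocab_size, d_model))    # one shared matrix (stands in for learned weights)

BOS = 0                                       # hypothetical start-of-sequence id
target = np.array([4, 7, 2, 9])               # token ids of the target sentence
decoder_input = np.concatenate([[BOS], target[:-1]])   # shifted right by one

X = E[decoder_input] * np.sqrt(d_model)       # the embedding lookup uses E ...
H = X                                         # placeholder for the decoder stack's output
logits = H @ E.T                              # ... and the output projection reuses it
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)    # softmax over the vocabulary

print(decoder_input, probs.shape)             # [0 4 7 2] (4, 10)
```

At position `t` the decoder sees `decoder_input[:t+1]` and is trained to predict `target[t]`, so with the causal mask in place it never sees the token it is asked to produce.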

Why it won

The motivation from the first section was a claim about complexity, and the paper makes it precise. For a sequence of length $n$ with representation dimension $d$, here is how a self-attention layer compares with the recurrent and convolutional layers it replaced:

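| Layer type | Complexity per layer | Sequential operations | Maximum path length |
| --- | --- | --- | --- |
| Self-attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k n)$ |
| Self-attention (restricted) | $O(r \cdot n \cdot d)$ | $O(1)$ | $O(n/r)$ |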

Read the two right-hand columns. Self-attention does all its work in a constant number of sequential steps and connects any two positions in one hop, while recurrence needs $n$ of each. It does pay for this in the first column: its cost grows with $n^2$ because it compares every pair of words, whereas recurrence grows only linearly with $n$ (though quadratically with $d$). The authors point out that this trade favors self-attention whenever the sentence length $n$ is smaller than the representation dimension $d$, which is the usual case for translation. (The $n^2$ term is also exactly what later work on long-context models spends its effort trying to reduce.)
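
Plugging numbers into the first column shows where the crossover sits. These are operation-count proxies with all constant factors dropped, so treat them as a rough comparison, not a benchmark.

```python
def self_attention_cost(n, d):
    return n * n * d          # O(n^2 * d): every pair of positions, d multiplies each

def recurrent_cost(n, d):
    return n * d * d          # O(n * d^2): one d x d matrix product per step

d = 512
for n in (32, 128, 512, 2048):
    sa, rnn = self_attention_cost(n, d), recurrent_cost(n, d)
    verdict = "self-attention cheaper" if sa < rnn else "recurrent cheaper or equal"
    print(n, sa, rnn, verdict)
```

The two costs meet at $n = d$; below it self-attention does less arithmetic per layer, above it more, and in both regimes it keeps the advantage in the sequential-operations column.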

The translation numbers backed it up. On the WMT 2014 English-to-German task the big Transformer reached 28.4 BLEU (the standard translation-quality score, higher is better), beating every previously published model, including ensembles, by more than 2.0 BLEU. On English-to-French it reached 41.8 BLEU, a new single-model state of the art. The base model alone scored 27.3 and 38.1. And it was cheap by the standards of the day: the big model trained in 3.5 days on eight P100 GPUs, a fraction of the cost of the recurrent and convolutional systems it beat, which is the parallelism from the first section showing up on the clock.

What the paper could not have stated is the part we now know. The Transformer turned out to be more general than a translation architecture. Drop the encoder and keep the masked decoder and you get the family GPT belongs to. Drop the decoder and keep the encoder and you get BERT. Feed it image patches instead of words and you get a vision Transformer. The reason the same backbone keeps working is the reason it won here: it is a general way to let a set of elements exchange information in parallel, with the path between any two held constant. The title was a slight overstatement (you still need the embeddings, the feed-forward layers, the positions, and the residuals) but as a bet on which mechanism mattered most, it has held up unusually well.

Provenance · Verified against primary literature
Vaswani et al. (2017) · The Transformer: scaled dot-product and multi-head attention, the encoder-decoder, sinusoidal positions.
Bahdanau et al. (2015) · Additive attention, the alignment idea the dot-product version streamlines.
He et al. (2015) · Residual connections, the +x wrapped around every sublayer.
Ba, Kiros, Hinton (2016) · Layer normalization, applied after each residual add (post-LN).
Xiong et al. (2020) · Pre-LN: moving the norm inside the residual removes the need for warmup. The modern default.
Correction · The paper's abstract and Table 2 report 41.8 BLEU for the big model on English-to-French, while the body text in Section 6.1 says 41.0. We use the table value, 41.8, and note the discrepancy.

Questions you might still have

If attention has no sense of order, why not just feed in the position as a single number?
A lone scalar would be one dimension competing with 512 others, and it would not compose. The sinusoids spread position across all dimensions, and because a fixed offset is a fixed rotation of each sine-cosine pair, relative position becomes a clean linear relationship the model can attend by.

Why divide by √dₖ exactly, not by dₖ?
The spread of a dot-product score is set by its standard deviation, √dₖ. The variance, dₖ, is the wrong scale: dividing by it would over-shrink the scores toward zero and make every key look equally relevant.

Is the original Transformer the same thing as GPT?
No. The 2017 paper is a full encoder-decoder for translation. GPT is decoder-only: it keeps the masked self-attention stack and drops the encoder and the cross-attention. BERT is the opposite, encoder-only. Same building blocks, different subsets.

Do attention heads really specialize the way the figure shows?
Some do, learning roughly positional or syntactic patterns, but only as a tendency. Later work pruned many heads after training with little loss in quality, so the tidy per-head roles are best read as illustration.

Footnotes & further reading

  1. The paper: Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, Attention Is All You Need (NeurIPS 2017). For a line-by-line implementation, Harvard NLP's The Annotated Transformer.
  2. The attention idea this builds on: Bahdanau, Cho, Bengio, Neural Machine Translation by Jointly Learning to Align and Translate (additive attention), and Luong, Pham, Manning, Effective Approaches to Attention-based NMT (multiplicative).
  3. Residual connections: He, Zhang, Ren, Sun, Deep Residual Learning for Image Recognition. Layer normalization: Ba, Kiros, Hinton, Layer Normalization.
  4. Post-LN vs pre-LN: Xiong et al., On Layer Normalization in the Transformer Architecture, which explains why the original ordering needs learning-rate warmup and the rearranged one does not.
  5. Weight tying between the embeddings and the output projection: Press & Wolf, Using the Output Embedding to Improve Language Models.
  6. On how interpretable heads really are: Michel, Levy, Neubig, Are Sixteen Heads Really Better than One?, and Voita et al., Analyzing Multi-Head Self-Attention.
  7. What the architecture became: Devlin et al., BERT (encoder-only); Brown et al., GPT-3 (decoder-only); Dosovitskiy et al., An Image is Worth 16x16 Words (vision).