Attention Is All You Need

Stop reading one word at a time. Let every token look at every other token, at once.

The Transformer, built from the ground up: embeddings, self-attention, the softmax scaling trick, multiple heads, sinusoidal positions, and the encoder-decoder. Intuition first, then the exact math.

Explaining the paper: Attention Is All You Need. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin · Google · NeurIPS 2017 · arXiv:1706.03762

Throw out the part of the network that reads one word at a time, keep only the part that lets words look at each other, and translation gets both faster and better.

Almost everything you have heard of in modern AI runs on the architecture in this one 2017 paper. GPT, BERT, the chatbots and code assistants and image generators you have used: under the hood they are Transformers, and this paper introduced the Transformer. It is built from a small number of ideas, each simple once you see it from the right angle.

The setup the authors were working in is machine translation. You feed in an English sentence and you want a German one out. For years the standard tool was the recurrent network, which reads a sentence the way you might read it aloud: one word at a time, carrying a running summary in your head and updating it at each step. It worked, but it had a structural problem that no amount of tuning could fix, and the paper starts from that problem.

The Transformer's answer was to throw recurrence out completely and keep only one mechanism: attention, the ability of each word to look directly at every other word and decide what is relevant. No reading left to right. It trained much faster and scored higher on translation than the recurrent systems before it. Getting there takes a handful of ideas: turning words into vectors, what attention computes, the division by $\sqrt{d_k}$, running several heads at once, restoring word order, and assembling an encoder and a decoder. We take them in that order.

The cost of reading left to right

A recurrent network processes a sequence step by step. It reads the first word, updates a hidden state, reads the second word, updates the state again, and so on to the end. That state is the running summary. The hidden state $h_t$ at position $t$ is a function of the previous state $h_{t-1}$ and the current word. That little dependency, each step needing the one before it, is the source of the problem.

Two consequences follow, and both are about distance. The first is speed: because step $t$ cannot begin until step $t-1$ has finished, a sentence of length $n$ takes $n$ sequential steps no matter how many processors you own. A GPU is a machine for doing thousands of things at once, and recurrence makes it process the steps one at a time. The second is memory, in the human sense. For the first word to influence the last, its information has to survive being passed hand to hand through every word in between. Across a long sentence that signal gets diluted, and the model is less able to connect things that are far apart.
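The step-by-step dependency can be made concrete with a minimal NumPy sketch. The tanh update rule, the weight names `W_h` and `W_x`, and the toy sizes are illustrative assumptions, not details from the paper; the only point is that the loop body for step $t$ cannot run until step $t-1$ has produced its hidden state.

```python
import numpy as np

def rnn_forward(X, W_h, W_x):
    """Run a plain tanh recurrence over X: [n, d_in]; return all hidden states [n, d_h]."""
    h = np.zeros(W_h.shape[0])            # h_0: an empty running summary
    states = []
    for t in range(X.shape[0]):           # n sequential steps; step t needs h from step t-1
        h = np.tanh(W_h @ h + W_x @ X[t])
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
n, d_in, d_h = 9, 4, 6                    # toy sizes, chosen only for illustration
X = rng.normal(size=(n, d_in))
W_h = rng.normal(size=(d_h, d_h)) * 0.1
W_x = rng.normal(size=(d_h, d_in)) * 0.1
H = rnn_forward(X, W_h, W_x)
print(H.shape)                            # (9, 6)
```

Nothing inside the loop can be handed to a second processor, because each iteration reads the `h` written by the one before it.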

Self-attention removes both problems in one move. Instead of a chain, it wires every position directly to every other position. The longest path between any two words is a single hop, and every position is computed at the same time. Toggle between the two below and drag the sentence length. Watch the recurrent path grow with $n$ while the attention path stays pinned at one.

Figure 1 · path length

In a recurrent layer, information from the first token reaches the last only by hopping through every token between them, and the steps run in sequence. In a self-attention layer every token is wired straight to every other, so the longest path is one hop and all positions compute in parallel.

This is the trade the paper makes explicit in its complexity table, which we come back to at the end. A self-attention layer's arithmetic grows with the square of the sentence length (it compares every word with every other, which is $n^2$ comparisons), but it spends that arithmetic in parallel, and it makes the path between distant words constant instead of growing with the sentence. For the sentence lengths used in translation, that trade pays off many times over.

Words become vectors

Before any attention happens, the words have to become numbers. A Transformer keeps a lookup table with one row per token in its vocabulary, and each row is a learned vector of length $d_{\text{model}}$, which in the base model is $512$. The word "animal" becomes a specific point in 512-dimensional space; "road" becomes another. These embeddings are learned during training, and similar words drift toward similar regions of the space.

From here on, a token is its vector, and a sentence is a stack of vectors, a matrix of shape $n \times d_{\text{model}}$ for a sentence of $n$ words. Every layer of the Transformer reads a stack of vectors and writes out a new stack of the same shape, gradually refining what each position "means" in context. The word "it" starts as a generic pronoun vector and, layer by layer, picks up information about which noun it refers to.

A small detail that matters later: the paper multiplies these embedding vectors by $\sqrt{d_{\text{model}}}$ before anything else. The paper states the multiply without justifying it. A common reading is that it keeps the embedding and the position signal on a similar scale, so neither dominates early in training. Either way, keep it separate from the $1/\sqrt{d_k}$ inside attention.
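As a rough sketch of the lookup and the scale step, assuming a three-word toy vocabulary and a random table standing in for learned embeddings (both hypothetical; only $d_{\text{model}} = 512$ and the $\sqrt{d_{\text{model}}}$ multiply come from the text):

```python
import numpy as np

d_model = 512                                   # base-model width
vocab = {"the": 0, "animal": 1, "road": 2}      # hypothetical toy vocabulary
rng = np.random.default_rng(0)
table = rng.normal(size=(len(vocab), d_model))  # stand-in for the learned embedding rows

def embed(tokens):
    ids = [vocab[t] for t in tokens]
    return table[ids] * np.sqrt(d_model)        # row lookup, then the sqrt(d_model) scale

X = embed(["the", "animal"])
print(X.shape)                                  # (2, 512)
```

The result is the stack of vectors described above: one row per token, each of length $d_{\text{model}}$.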

Attention is a soft lookup

The central idea works the way a dictionary does. You arrive with a word you want to look up (a query), you scan the entries until you find a matching headword (a key), and you read off its definition (a value). A dictionary is a hard lookup: one key matches, you take its value, done.

Attention is the soft version of that. Every word produces a query, a key, and a value (three different vectors, each a learned linear projection of the word's embedding, which means the embedding vector multiplied by a learned matrix). To compute the new representation of a word, you take its query and compare it against the keys of all the words, including itself. The comparison gives each word a relevance score. You turn those scores into weights that sum to one, and the weighted sum of all the value vectors becomes the word's new representation. Instead of reading one definition, you read a little from every entry, in proportion to how well it matches.

The comparison is a dot product. The dot product of two vectors is large and positive when they point the same way, near zero when they are unrelated, negative when they point apart. So a query and a key that "agree" produce a high score. Turning the scores into weights is the job of the softmax function, which takes a list of numbers and squashes them into positive values that sum to one, with the largest input getting the largest share.
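A tiny worked example of the soft lookup, with hand-picked two-dimensional vectors (all numbers are made up for illustration, and the learned projections and the $\sqrt{d_k}$ scaling are left out on purpose):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))   # subtracting the max avoids overflow
    return e / e.sum()

query = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0],              # points the same way as the query
                 [0.0, 1.0],              # unrelated (orthogonal)
                 [-1.0, 0.0]])            # points the opposite way
values = np.array([[10.0], [20.0], [30.0]])

scores = keys @ query                     # dot products: 1, 0, -1
weights = softmax(scores)                 # positive, sum to one
output = weights @ values                 # a blend of all three values
print(scores, weights, output)
```

The key that agrees with the query gets the largest weight, but the other two values still contribute a little, which is what makes the lookup soft.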

The classic example is a pronoun. In "the animal crossed the road because it got tired," the new representation of "it" must resolve what it refers to. Its query, compared against every key, scores highest against "animal," so the new representation of "it" is built mostly from the value of "animal." The model has resolved the reference by attending to the right word. Pick a query token below and watch where its attention goes.

Figure 2 · scaled dot-product attention

Each token sends a query and gets a weight on every other token, the softmax of their scaled dot products. The output is the weighted sum of the value vectors. For the query "it," the weight piles onto "animal," the noun it refers to. (Weights here are illustrative; the mechanism is exact.)

When all the words' queries, keys, and values are stacked into matrices $Q$, $K$, and $V$, the operation is two matrix multiplications with a softmax in between. This is the paper's equation (1), and once you have the picture above, you can read it straight off:

\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{1}

The equation reads left to right. $QK^{\top}$ is every query dotted with every key, an $n \times n$ grid of raw scores. Dividing by $\sqrt{d_k}$ rescales the scores. The softmax turns each row of scores into a row of weights that sums to one. Multiplying by $V$ replaces each word with the weighted blend of all the values. In code it is equally short:

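# scaled dot-product attention, one head (NumPy)
import numpy as np

def attention(Q, K, V):
    # Q: [n, d_k]   K: [m, d_k]   V: [m, d_v]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                # [n, m] query-key match
    scores = scores - scores.max(axis=-1, keepdims=True)   # stabilizes exp; softmax unchanged
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # each row sums to 1
    return weights @ V                                     # [n, d_v] weighted values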

When the queries, keys, and values all come from the same sequence, this is self-attention: the sentence attending to itself. The same equation also runs with queries from one sequence and keys and values from another, which is how the decoder reads the encoder.

Why divide by the square root of d_k

That $\sqrt{d_k}$ in the denominator looks arbitrary, and it is the one part of equation (1) that a newcomer cannot guess. It is there to keep the softmax from saturating, collapsing onto one word so hard that learning stalls, and the reasoning is short.

A score is a dot product of a query and a key, each a vector of length $d_k$ (in the base model, $64$). Suppose, as a rough model, that the components of each are independent with mean zero and variance one. The dot product is a sum of $d_k$ products of such components. Each product has mean zero and variance one, and variances of independent things add, so the sum has variance $d_k$:

q \cdot k = \sum_{i=1}^{d_k} q_i k_i, \qquad \mathrm{Var}(q \cdot k) = d_k

So the typical size of a raw score grows like $\sqrt{d_k}$. With $d_k = 64$ the scores are spread out by a factor of eight before the softmax ever sees them. That matters because softmax is sensitive to the scale of its inputs. Feed it numbers that are all close together and it returns a gentle, spread-out distribution. Feed it numbers that are far apart and it returns a near-spike: one weight close to one, the rest close to zero. A spike is bad for learning, because the gradient of softmax, the signal that says how to change the weights, is tiny wherever it has saturated, so the model can barely adjust which word it attends to. Dividing by $\sqrt{d_k}$ cancels the growth exactly, pulling the variance back to one and keeping the distribution soft. Drag $d_k$ and compare the two:

Figure 3 · the scaling trick

The same eight scores, run through softmax two ways. Without scaling, the spread grows like $\sqrt{d_k}$ and the distribution collapses onto one key as $d_k$ rises, where the gradient is nearly flat. Dividing by $\sqrt{d_k}$ holds the spread near one, keeping attention soft and trainable.

Why $\sqrt{d_k}$ and not $d_k$? The spread of the scores is set by their standard deviation, which is $\sqrt{d_k}$. Dividing by the variance $d_k$ instead would over-correct and crush the scores too far toward zero. The scale factor is also the only difference between this scaled attention and the older "dot-product attention" it descends from: same idea, missing the scale factor, which is why the older form underperformed at large $d_k$.
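The variance argument can be checked numerically. This is a simulation under the same rough model as the text (independent components with mean zero and variance one); the sample count and the two values of $d_k$ are arbitrary choices, and trained queries and keys need not follow this model.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 10000                        # arbitrary; larger gives a tighter estimate

def score_variance(d_k, scaled):
    q = rng.standard_normal((num_samples, d_k))
    k = rng.standard_normal((num_samples, d_k))
    scores = np.sum(q * k, axis=1)         # one query-key dot product per sample
    if scaled:
        scores = scores / np.sqrt(d_k)     # the division from equation (1)
    return scores.var()

for d_k in (16, 64):
    # unscaled variance lands near d_k, scaled variance near 1
    print(d_k, score_variance(d_k, scaled=False), score_variance(d_k, scaled=True))
```

Dividing by $d_k$ instead would push the scaled variance down to about $1/d_k$, which is the over-correction described above.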

Many heads at once

A single attention computation has a built-in limit. It produces one set of weights per word, so each word ends up with one weighted blend. But the relationships in a sentence are not one thing. The word "it" relates to the noun it refers to; a verb relates to its subject and its object. One distribution cannot point at both at once. Forced to choose, it averages them together.

Multi-head attention solves this in the obvious way: run several attentions in parallel and let each specialize. Each head projects the full $512$-dimensional vector down into its own smaller $64$-dimensional subspace ($d_k = d_v = d_{\text{model}}/h = 64$ with $h = 8$ heads); the split is by projection, not by slicing the vector into chunks. Each head gets its own learned projections, so each can learn its own notion of relevance. One head can track the previous word; another can latch onto the subject of a verb several words away. Pick a query and see four heads attend four different ways, with the single-head average at the bottom:

Figure 4 · multi-head attention

Each head attends in its own subspace, so different heads lock onto different relations at once: here h1 the previous token, h2 the next, h3 the nouns, h4 a broad neighborhood. A lone head, with only one distribution to spend, blurs them into the single average in the last row. The roles shown are illustrative; in a trained model not every head is this tidy.

The eight heads run independently, then their outputs are concatenated back into a single $512$-dimensional vector and passed through one more learned projection $W^O$ that mixes them. Because each head works in $1/8$ the dimension, the total cost is about the same as one full-size attention. In math, with $\mathrm{head}_i$ being attention computed on the $i$-th set of projections:

\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O

\mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)

Here $W_i^Q, W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ are the per-head projections, and $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$ is the final mix. In code the structure is exactly as it sounds:

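# multi-head attention: h heads run in parallel (NumPy)
import numpy as np

def softmax(s):                             # row-wise, numerically stable
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mha(X, Wq, Wk, Wv, Wo):                 # X: [n, d_model]; Wq, Wk, Wv: [h, d_model, d_k]
    heads = []
    for q, k, v in zip(Wq, Wk, Wv):         # each head, its own projections
        Q, K, V = X @ q, X @ k, X @ v       # [n, d_k]
        heads.append(softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V)   # [n, d_v]
    return np.concatenate(heads, axis=-1) @ Wo   # Wo: [h*d_v, d_model], back to [n, d_model]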

The paper checked that the heads matter: a single-head model scored about $0.9$ points lower on a held-out English-to-German tuning set (the development set), measured by BLEU, a standard translation-quality score. The interpretation needs care, though. The clean "this head does syntax" stories hold only sometimes. Later work showed that many heads can be pruned after training with little loss, so read the roles in the figure as an illustration of what heads sometimes do.

Putting the order back

Throwing out recurrence cost us something we have not paid for yet: word order. The attention equation makes this clear. If you shuffle the words of the sentence, you shuffle the rows of $Q$, $K$, and $V$, and the output is shuffled the same way but otherwise identical. Attention treats the input as a bag of words. To it, "dog bites man" and "man bites dog" are the same set of vectors, which is clearly wrong.
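The shuffling claim is easy to check in a few lines. This sketch uses random vectors as stand-ins for the three word embeddings and random projection matrices (all hypothetical), with no positional signal added:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
n, d_model, d_k = 3, 8, 4                        # toy sizes
X = rng.normal(size=(n, d_model))                # stand-in for "dog bites man"
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

perm = np.array([2, 1, 0])                       # reorder to "man bites dog"
out = self_attention(X, Wq, Wk, Wv)
out_shuffled = self_attention(X[perm], Wq, Wk, Wv)
print(np.allclose(out_shuffled, out[perm]))      # True
```

Each word receives exactly the same output vector in both orders, so nothing downstream can tell the two sentences apart.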

Each token has to be stamped with its position before the first layer. The paper adds a positional encoding, a vector the same size as the embedding, to each token's embedding. Instead of learning that vector, they use a fixed pattern of sines and cosines at a range of frequencies:

PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)

Unpacking it: $pos$ is the position in the sentence and $i$ indexes a sine-cosine pair, so dimensions $2i$ and $2i+1$ share one frequency. Even dimensions get a sine, odd dimensions the matching cosine. The frequency falls off as you move through the dimensions, so the leftmost dimensions oscillate quickly as you step through positions and the rightmost ones barely move. The wavelengths run in a geometric progression from $2\pi$ up to about $10000 \cdot 2\pi$. It is like a row of clocks ticking at wildly different speeds: reading all their hands at once gives a code that is unique to each position. The heatmap shows it; drag the position to read off one row's signature:

Figure 5 · positional encoding

Each position gets a code of sines and cosines across the $d_{\text{model}}$ dimensions. Wavelengths run geometrically from $2\pi$ (fast, left columns) to $10000 \cdot 2\pi$ (slow, right). Teal marks positive values, amber negative, so each position row is a distinct striped fingerprint.

Why sinusoids rather than learning a position vector? The authors give two reasons. The practical one is extrapolation: a fixed formula is defined at every position, so the model is at least defined at positions beyond any it saw in training (whether it translates them well is another matter). The elegant one is about relative position. For a fixed offset $k$, the encoding of position $pos+k$ is a fixed linear function of the encoding of position $pos$. Each sine-cosine pair is a point on a circle, and stepping forward by $k$ rotates it by a fixed angle, the same rotation regardless of where you started. That makes "three words back" a consistent linear operation, which is exactly the kind of relationship attention can learn to use. (They also tried learned position embeddings and got nearly identical results. The sinusoids are a reasonable default; they did not clearly beat the learned version.)
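Both the encoding and the rotation property can be sketched in NumPy. The function and variable names here (`positional_encoding`, `shift`, `freqs`) are illustrative choices, and the offset $k = 3$ and positions 10 and 13 are arbitrary; the formula itself is the one given above.

```python
import numpy as np

def positional_encoding(num_positions, d_model):
    pos = np.arange(num_positions)[:, None]        # [num_positions, 1]
    i = np.arange(d_model // 2)[None, :]           # pair index, [1, d_model/2]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

d_model, k = 512, 3
pe = positional_encoding(50, d_model)

# Rotation that advances every (sin, cos) pair by k positions; it depends on k only.
freqs = 1.0 / (10000 ** (2 * np.arange(d_model // 2) / d_model))
c, s = np.cos(k * freqs), np.sin(k * freqs)

def shift(row):
    sin_part, cos_part = row[0::2], row[1::2]
    out = np.empty_like(row)
    out[0::2] = sin_part * c + cos_part * s        # sin(a + b)
    out[1::2] = cos_part * c - sin_part * s        # cos(a + b)
    return out

print(np.allclose(shift(pe[10]), pe[13]))          # True
```

The same `shift` maps position 20 to position 23, or any position to the one three steps later, because `c` and `s` never look at the starting position.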

The rest of the block

Attention gets most of the discussion, but a Transformer layer has a second half and some connective tissue that does real work. The second half is a feed-forward network applied to each position on its own, identically. After attention has let a word gather information from the rest of the sentence, the feed-forward network then transforms that word's vector in isolation. It is two linear layers with a $\mathrm{ReLU}$ (the function $\max(0, x)$, which zeroes out negatives) in between:

\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2 \tag{2}

It expands from $d_{\text{model}} = 512$ up to an inner dimension of $d_{ff} = 2048$ and back down, so it is wide in the middle. Inside each layer the feed-forward network holds about twice as many parameters as the attention, so a lot of the raw capacity sits here, not in the attention everyone talks about.
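The "about twice" figure follows from the base-model sizes by simple counting. A small sketch, assuming the attention projections carry no bias terms (the multi-head formulas above show none) and counting the feed-forward biases $b_1$ and $b_2$ separately:

```python
d_model, d_ff, h = 512, 2048, 8
d_k = d_v = d_model // h

# Attention: h sets of Q, K, V projections, plus the output mix W^O.
attn_params = h * (2 * d_model * d_k + d_model * d_v) + (h * d_v) * d_model

# Feed-forward: the two weight matrices W_1 and W_2, then the two bias vectors.
ffn_weights = d_model * d_ff + d_ff * d_model
ffn_biases = d_ff + d_model

print(attn_params)                   # 1048576
print(ffn_weights + ffn_biases)      # 2099712
print(ffn_weights / attn_params)     # 2.0
```

This counts one attention sublayer against one feed-forward sublayer, as in an encoder layer.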

Now the connective tissue, which is much of why the network trains. Around each of the two sublayers (the attention and the feed-forward) the Transformer wraps a residual connection: it adds the sublayer's input back to its output. So a sublayer never has to produce the full answer from scratch; it only has to produce a correction to what it received. That single trick, borrowed from ResNets (residual networks, the image-classification architecture), is most of why very deep networks are trainable, because it gives the gradient a clean path straight back through the stack. Then each sum is run through layer normalization, which rescales the vector to a stable mean and variance so the numbers flowing through the network do not blow up or vanish. In the paper's notation each sublayer computes $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$.

# one encoder layer (post-LN, the 2017 convention)
a = LayerNorm(x + MultiHeadSelfAttention(x))   # sublayer 1
z = LayerNorm(a + FeedForward(a))              # sublayer 2
# modern code is usually pre-LN: x + Sublayer(LayerNorm(x))

This ordering is a place where the original paper and modern practice visibly disagree. The 2017 design puts the normalization after the residual add, outside the residual branch. This is called post-LN, and it has a quirk: it is unstable to train unless you ramp the learning rate up slowly at the start, the "warmup" the paper uses for its first $4000$ steps. Almost every Transformer since (GPT-2 onward) moves the normalization inside the residual branch instead, computing $x + \mathrm{Sublayer}(\mathrm{LayerNorm}(x))$. This is pre-LN, and it trains stably without the warmup. If you read the original equations and then read a modern codebase and they look subtly different here, this is why.
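The two orderings differ by where one function call sits. A minimal sketch, assuming a layer norm without its learned gain and bias and a small random ReLU layer as a hypothetical stand-in for attention or the feed-forward network:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector; the learned gain and bias are omitted here.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln(x, sublayer):        # 2017 ordering: norm after the residual add
    return layer_norm(x + sublayer(x))

def pre_ln(x, sublayer):         # later ordering: norm inside the residual branch
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16)) * 0.1
sublayer = lambda v: np.maximum(0.0, v @ W)    # hypothetical stand-in sublayer

x = rng.normal(size=(5, 16)) * 10.0            # deliberately large inputs
print(post_ln(x, sublayer).std(axis=-1))       # close to 1 at every position
print(pre_ln(x, sublayer).std(axis=-1))        # the residual stream keeps its own scale
```

In post-LN everything, including the residual path, passes through the normalization; in pre-LN the input `x` reaches the output untouched and only the branch is normalized. The sketch shows the structural difference only and says nothing about training stability by itself.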

Encoder, decoder, and the mask

Now assemble it. The translation model has two stacks, each six layers tall. The encoder reads the source sentence. Each of its layers is exactly the block above: multi-head self-attention, then a feed-forward network, each wrapped in a residual and a norm. After six layers, the encoder has turned the English sentence into a stack of context-rich vectors, one per source word.

The decoder writes the translation one word at a time, and it uses attention in two different ways inside each layer. Counting the encoder, the paper uses attention in three distinct roles:

Encoder self-attention. The source sentence attends to itself. Queries, keys, and values all come from the previous encoder layer. Every word can see every other word.
Decoder self-attention (masked). The translation-so-far attends to itself, but each position is barred from looking at words that come after it.
Encoder-decoder attention (cross-attention). Here the queries come from the decoder and the keys and values come from the encoder's output. This is how each word being generated reaches back and reads the source sentence. Same equation (1), different sources for the three inputs.

Decoder self-attention needs one more restriction, a one-line change with a big purpose. When the model generates the translation, it produces word $i$ before it has produced word $i+1$. So while it is deciding word $i$, it must not be allowed to look at words $i+1$ and beyond, because at inference time those do not exist yet. During training, though, the entire target sentence is present at once, which is what allows parallel training, so we have to actively forbid each position from peeking ahead. A model that learned to lean on future tokens would be useless at generation time. The mask makes training play by inference's rules, so the model practices exactly the prediction problem it will face. That is the causal mask: before the softmax, every score that would let position $i$ attend to a later position is set to $-\infty$, which softmax turns into a weight of exactly zero. The attention grid becomes a lower triangle. (The mask alone is not the full picture: the target is also fed in shifted one position to the right, so the model is always asked to predict the next token from the tokens strictly before it.) Toggle the mask on and off below: off is the encoder seeing everything; on is the decoder allowed to look only backward.

Figure 6 · the causal mask

Attention as an $n \times n$ grid: row $i$ is the query, column $j$ the key. The encoder leaves the whole grid live. The decoder sets every cell above the diagonal to $-\infty$ before the softmax, so each position attends only to itself and the past. Step the position to watch the allowed region grow.
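The causal mask can be sketched in a few lines of NumPy. The random `scores` matrix is a stand-in for $QK^{\top}/\sqrt{d_k}$, and the function name is an illustrative choice:

```python
import numpy as np

def causal_attention_weights(scores):
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
    masked = np.where(future, -np.inf, scores)           # forbid looking ahead
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 5))          # stand-in for Q @ K.T / sqrt(d_k)
weights = causal_attention_weights(scores)
print(np.round(weights, 2))               # lower-triangular, each row sums to 1
```

The first row has only one allowed cell, so its whole weight sits on the first token; each later row spreads its weight over one more position.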

With the mask in place, the decoder can be trained on an entire target sentence in one parallel pass, while still behaving as if it were generating left to right. At the top of the decoder, a final linear layer and a softmax turn each output vector into a probability distribution over the vocabulary, the model's guess for the next word. The same weight matrix is reused in three places (the source embedding, the decoder's target embedding, and the final projection that turns vectors back into word scores). Tying them saves parameters and tends to help, because mapping a word to a vector and scoring a vector against words are mirror-image problems.

Why it won

The motivation, the cost of reading left to right, was a claim about complexity, and the paper makes it precise. For a sequence of length $n$ with representation dimension $d$, here is how a self-attention layer compares with the recurrent and convolutional layers it replaced:

Layer type	Complexity / layer	Sequential ops	Max path length
Self-attention	O(n²·d)	O(1)	O(1)
Recurrent	O(n·d²)	O(n)	O(n)
Convolutional	O(k·n·d²)	O(1)	O(logₖ n)
Self-attention (restricted)	O(r·n·d)	O(1)	O(n/r)

The two right-hand columns carry the comparison that motivated the design. Self-attention does all its work in a constant number of sequential steps and connects any two positions in one hop, while recurrence needs $n$ of each. It does pay for this in the first column: its cost grows with $n^2$ because it compares every pair of words, whereas recurrence grows with $n$. The authors point out that the trade is advantageous for self-attention whenever the sentence length $n$ is smaller than the representation dimension $d$, which is the usual case for translation. (The $n^2$ term is also exactly what later work on long-context models spends its effort trying to reduce.)
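The $n < d$ crossover can be read off the first column of the table by plugging in numbers. This sketch uses only the leading-order terms with constants dropped, and the sentence lengths are arbitrary illustrations, so it says nothing about measured wall-clock time:

```python
def per_layer_ops(n, d):
    # Leading-order operation counts from the table, constants dropped.
    return {"self_attention": n * n * d, "recurrent": n * d * d}

for n in (50, 512, 4096):                 # sentence lengths chosen for illustration
    ops = per_layer_ops(n, d=512)
    print(n, ops["self_attention"] / ops["recurrent"])   # the ratio equals n / d
```

With $d = 512$, a 50-token sentence costs self-attention roughly a tenth of the recurrent layer's arithmetic, the two match at $n = d$, and beyond that the $n^2$ term takes over.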

The translation numbers backed it up. On the WMT 2014 English-to-German task (the standard machine-translation benchmark from the annual Workshop on Machine Translation) the big Transformer reached $28.4$ BLEU (the standard translation-quality score, higher is better), beating every previously published model, including ensembles, by more than $2.0$ BLEU. On English-to-French it reached $41.8$ BLEU, a new single-model state of the art. The base model alone scored $27.3$ and $38.1$ on the two tasks. And it was cheap by the standards of the day: the big model trained in $3.5$ days on eight P100 GPUs, a fraction of the cost of the recurrent and convolutional systems it beat, because every position trains in parallel instead of in sequence.

What the paper could not have stated is the part we now know. The Transformer proved more general than a translation architecture. Drop the encoder and keep the masked decoder and you get the family GPT belongs to. Drop the decoder and keep the encoder and you get BERT. Feed it image patches instead of words and you get a vision Transformer. The same backbone keeps working for the same reason it won here: it is a general way to let a set of elements exchange information in parallel, with the path between any two held constant. The title was a slight overstatement (you still need the embeddings, the feed-forward layers, the positions, and the residuals) but as a bet on which mechanism mattered most, it has held up unusually well.

Provenance · Verified against primary literature

Vaswani et al. (2017). The Transformer: scaled dot-product and multi-head attention, the encoder-decoder, sinusoidal positions.

Bahdanau et al. (2015). Additive attention, the alignment idea the dot-product version streamlines.

He et al. (2015). Residual connections, the +x wrapped around every sublayer.

Ba, Kiros, Hinton (2016). Layer normalization, applied after each residual add (post-LN).

Xiong et al. (2020). Pre-LN: moving the norm inside the residual removes the need for warmup. The modern default.

Correction. The paper's abstract and Table 2 report 41.8 BLEU for the big model on English-to-French, while the body text in Section 6.1 says 41.0. We use the table value, 41.8, and note the discrepancy.

Questions you might still have

If attention has no sense of order, why not feed in the position as a single number?
A lone scalar would be one dimension competing with 512 others, and it would not compose. The sinusoids spread position across all dimensions, and because a fixed offset is a fixed rotation of each sine-cosine pair, relative position becomes a clean linear relationship the model can attend by.

Why divide by √dₖ exactly, not by dₖ?
The spread of a dot-product score is set by its standard deviation, √dₖ. The variance, dₖ, is the wrong scale: dividing by it would over-shrink the scores toward zero and make every key look equally relevant.

Is the original Transformer the same thing as GPT?
No. The 2017 paper is a full encoder-decoder for translation. GPT is decoder-only: it keeps the masked self-attention stack and drops the encoder and the cross-attention. BERT is the opposite, encoder-only. Same building blocks, different subsets.

Do attention heads really specialize the way the figure shows?
Some do, learning roughly positional or syntactic patterns, but only as a tendency. Later work pruned many heads after training with little loss in quality, so the per-head roles are best read as illustration.

Footnotes & further reading

The paper: Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, Attention Is All You Need (NeurIPS 2017). For a line-by-line implementation, Harvard NLP's The Annotated Transformer.
The attention idea this builds on: Bahdanau, Cho, Bengio, Neural Machine Translation by Jointly Learning to Align and Translate (additive attention), and Luong, Pham, Manning, Effective Approaches to Attention-based NMT (multiplicative).
Residual connections: He, Zhang, Ren, Sun, Deep Residual Learning for Image Recognition. Layer normalization: Ba, Kiros, Hinton, Layer Normalization.
Post-LN vs pre-LN: Xiong et al., On Layer Normalization in the Transformer Architecture, which explains why the original ordering needs learning-rate warmup and the rearranged one does not.
Weight tying between the embeddings and the output projection: Press & Wolf, Using the Output Embedding to Improve Language Models.
On how interpretable heads really are: Michel, Levy, Neubig, Are Sixteen Heads Really Better than One?, and Voita et al., Analyzing Multi-Head Self-Attention.
What the architecture became: Devlin et al., BERT (encoder-only); Brown et al., GPT-3 (decoder-only); Dosovitskiy et al., An Image is Worth 16x16 Words (vision).
