Neural Machine Translation by Jointly Learning to Align and Translate

The attention mechanism, born in a translator.

A decoder that reads from one fixed summary of the source forgets the start of a long sentence by the time it reaches the end. This paper lets it look back at every source word, weighted, for each word it writes. That weighted look is attention.

Explaining the paper: Neural Machine Translation by Jointly Learning to Align and Translate. Bahdanau, Cho, Bengio · Jacobs University / Université de Montréal · ICLR 2015 · arXiv:1409.0473

The model is never told which source word aligns with which translated word. It learns that on its own, as a by-product of learning to translate.

In 2014 the new way to translate with a neural network was to read the entire source sentence into a single vector and then write the translation out of that vector. One network, the encoder, consumed the English word by word and ended with a fixed list of numbers meant to hold the entire meaning. A second network, the decoder, generated the French from that list alone. Train the pair end to end on millions of sentence pairs and it worked well enough to rival systems built from a dozen hand-tuned parts. The whole sentence, however long, had to fit in that one fixed list.

This paper broke that constraint, and the attention mechanism that now runs every large language model first appears in it. Bahdanau et al. stop forcing the source into one vector. They keep a separate vector for every source word, and let the decoder decide, for each word it emits, which of those vectors to read and how much. The authors call it learning to align (which source words matter now) and translate (what to write) at the same time. The sections that follow cover the encoder–decoder it extends, the per-word context vector that replaces the fixed one, the softmax that decides the weights, the bidirectional encoder that supplies the per-word vectors, and why the model still trains with plain gradient descent.

One vector, however long the sentence

The paper sets out to fix the encoder–decoder setup, where the encoder reads the source and produces one fixed-length vector, written $c$, and the decoder generates the entire translation conditioned on that one $c$. The vector has the same size whether the source is four words or forty. For a short sentence that is plenty of room. For a long one it is too small: everything the decoder will ever need has to survive compression into a list whose length never grows.

The authors are careful about how strongly they claim this hurts. They conjecture that the fixed-length vector is a bottleneck, and they point to a companion study (Cho et al., 2014b) that measured the symptom: a basic encoder–decoder translates short sentences well and then degrades sharply as the source gets longer. So the bottleneck is a hypothesis motivated by evidence, not a proven theorem, and the rest of the paper is the experiment that tests it. Drag the length below and switch between the two designs. In the fixed-vector design the box never widens, so each word's share of it falls as $1/L$; in the soft-search design the encoder keeps one annotation per word, and that strip grows with the sentence:

Figure 1 · the fixed-vector bottleneck

A source of L words is read by an encoder. RNNencdec squeezes all of it into one fixed vector c whose width never changes, so each word's share is $1/L$ and shrinks as the sentence grows; every output word reads the same c. RNNsearch keeps one annotation per word, a strip that grows with the sentence, and the decoder reads a different slice of it per output word. Drag L to the long end and watch the fixed box crowd.

The fixed vector has a capacity that does not grow with the sentence; the annotations add capacity for every word. The rest of the paper builds the machinery for the right panel: building one vector per source word, and teaching the decoder to read the right ones.

The encoder–decoder it extends

Changing the model precisely means first naming the baseline precisely. An RNN reads the source one symbol at a time, folding each new word into a running hidden state:

$$h_t = f(x_t, h_{t-1}), \qquad c = q(\{h_1, \dots, h_{T_x}\}) \tag{1}$$

Here $x_t$ is the $t$-th source word, $h_t$ is the hidden state after reading it, and $f$ is the recurrent cell that updates the state. After the last word the encoder summarizes all of the states into the context $c$ through some function $q$. In the systems this paper builds on, $q$ takes the final state: Sutskever and colleagues used an LSTM for $f$ and set $c = h_{T_x}$, the state after the last word.
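To make the fixed width concrete, here is a minimal sketch of equation (1), with a plain tanh cell standing in for $f$ (the systems discussed use an LSTM or a gated unit instead) and $q$ taking the final state. The weights, the 2-dimensional word vectors and the helper names `rnn_step` and `encode` are made up for illustration.

```python
import math

def rnn_step(x, h, W_x, W_h):
    """One step of a plain tanh RNN: h_t = tanh(W_x x_t + W_h h_{t-1})."""
    return [
        math.tanh(sum(wx * xi for wx, xi in zip(W_x[k], x))
                  + sum(wh * hi for wh, hi in zip(W_h[k], h)))
        for k in range(len(h))
    ]

def encode(xs, W_x, W_h, n):
    """Run the RNN over the source; return every state plus c = h_{T_x}."""
    h = [0.0] * n
    states = []
    for x in xs:
        h = rnn_step(x, h, W_x, W_h)
        states.append(h)
    return states, states[-1]   # q simply picks the final state

# Hypothetical toy sizes: 2-dim word vectors, 3-dim hidden state.
W_x = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
W_h = [[0.2, 0.0, -0.1], [0.0, 0.3, 0.1], [0.1, -0.2, 0.4]]
short = [[1.0, 0.0]] * 4                    # a 4-word source
long_ = [[1.0, 0.0], [0.0, 1.0]] * 20       # a 40-word source

_, c_short = encode(short, W_x, W_h, 3)
_, c_long = encode(long_, W_x, W_h, 3)
print(len(c_short), len(c_long))            # 3 3
```

Both sources end up in a vector of the same width; only the per-word states returned alongside it grow with the sentence.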

The decoder turns that vector back into a sentence. It defines the probability of a translation $\mathbf{y}$ by emitting one word at a time, each conditioned on the words so far and on the same $c$:

$$p(\mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) \tag{2}$$

$$p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c) \tag{3}$$

The second piece matters because the change comes down to one symbol. The next-word distribution depends on the previous word $y_{t-1}$, the decoder's own hidden state $s_t$, and the context $c$. That $c$ carries no subscript: it is identical for the first word and the fortieth. The decoder is reading from a frozen summary it cannot refresh. The change the paper makes is to give $c$ a subscript, and to recompute it for every word. The remaining work is building a good $c_i$.

A context vector for every word

The new decoder uses a distinct context vector $c_i$ for each target word $y_i$:

$$p(y_i \mid y_1, \dots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, s_i, c_i), \qquad s_i = f(s_{i-1}, y_{i-1}, c_i) \tag{4}$$

Same shape as before, one index richer. The decoder state $s_i$ and the output for word $i$ now see $c_i$ , a context built specifically for this step, rather than the one frozen $c$ . So the question becomes: where does $c_i$ come from? It is a weighted sum of the source representations. Suppose the encoder has turned the source into a sequence of vectors $(h_1, \dots, h_{T_x})$ , one per word, which the paper calls annotations. Each $h_j$ summarizes the entire sentence with a focus on the words around position $j$ (the encoder constructs them below; for now take them as given, one rich vector per source word). The context for target word $i$ is then:

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij}\, h_j \tag{5}$$

The weights $\alpha_{ij}$ are nonnegative and sum to one across the source (the softmax below sets them), so $c_i$ is a convex combination of the annotations: a weighted average that lands somewhere inside the cloud of source vectors, pulled toward whichever ones carry the most weight right now. Put all the weight on one annotation and $c_i$ is essentially that source word; spread it evenly and $c_i$ is a blur of the entire sentence. The paper calls $c_i$ an expected annotation, the average annotation under the weight distribution $\alpha_{i\cdot}$ . That expectation is over the model's own learned weights, a deterministic mean it computes, not a Bayesian posterior over some true hidden alignment.

Equation (5) gives the decoder a fresh summary instead of one fixed vector. At every step it gets to mix a new one, drawn from the parts of the source that matter for the word it is about to write. The fixed vector is gone; the remaining choice is how to set the mixing weights.
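The two extremes described for equation (5) can be checked with a few lines. This is a sketch with made-up 2-dimensional annotations for a 3-word source; the function name `context` is ours, not the paper's.

```python
def context(alpha, h):
    """c_i = sum_j alpha[j] * h[j], for one target step i."""
    dim = len(h[0])
    return [sum(a * h_j[d] for a, h_j in zip(alpha, h)) for d in range(dim)]

# Hypothetical annotations, one per source word.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

one_hot = context([0.0, 1.0, 0.0], h)          # all weight on the second word
uniform = context([1 / 3, 1 / 3, 1 / 3], h)    # weight spread evenly
print(one_hot)                                 # [0.0, 1.0]
```

With all the weight on one position the context is that annotation; with uniform weights it is the mean of the three, which sits inside the region they span.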

The weights come from a softmax over alignment scores

The decoder needs, at each step $i$, a number for each source word saying how relevant it is right now. Call that raw score the energy $e_{ij}$, and turn the energies into weights with a softmax:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad e_{ij} = a(s_{i-1}, h_j) \tag{6}$$

The softmax does two jobs at once. It makes every weight positive and forces the row to sum to one, so the weights are a genuine distribution over source positions and the context (5) is a proper average. And it is smooth: a small change in an energy nudges all the weights a little, which lets gradients flow back into whatever produced the energies. The energy itself is an alignment model $a$ , a small network that scores how well the source around position $j$ matches what the decoder is about to do. It reads two things: the decoder's hidden state $s_{i-1}$ , the state just before emitting $y_i$ , and the annotation $h_j$ .
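Both jobs of the softmax are easy to see numerically. The sketch below uses invented energies for a 4-word source; subtracting the maximum before exponentiating is a common numerical-stability habit and not something the paper specifies.

```python
import math

def softmax(e):
    """Turn energies into weights. Subtracting max(e) avoids overflow
    and leaves the result unchanged."""
    m = max(e)
    exps = [math.exp(x - m) for x in e]
    total = sum(exps)
    return [x / total for x in exps]

energies = [2.0, 0.5, -1.0, 0.0]             # hypothetical e_ij for one step i
alpha = softmax(energies)
bumped = softmax([2.1, 0.5, -1.0, 0.0])      # nudge the first energy by 0.1
```

Every entry of `alpha` is positive and the row sums to one. Nudging one energy raises its own weight slightly and lowers all the others slightly, which is the smoothness the gradients rely on.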

The score is computed against a specific state. The query is $s_{i-1}$ , the previous decoder state, available before $y_i$ exists, so the model can choose where to look and only then commit to a word. (A later variant, Luong's, scores against the current state instead; the Transformer drops the recurrence and scores a query against keys with a scaled dot product.) The alignment model is parametrized as a one-hidden-layer network with a $\tanh$ :

$$a(s_{i-1}, h_j) = v_a^\top \tanh\!\big(W_a\, s_{i-1} + U_a\, h_j\big) \tag{7}$$

This is additive attention: combine the query $s_{i-1}$ and the annotation $h_j$ inside a $\tanh$ , then project to a single number with $v_a$ . The same annotation $h_j$ plays two roles, the thing being scored (the key) and the thing being averaged in (the value), with no separate projections for the two. A practical detail makes this affordable: $U_a h_j$ does not depend on the target step $i$ , so it is computed once per source word and reused for every output word. The alignment model and the rest of the translator are trained together, from the same loss; alignment is not a separate pre-processing step or a fixed table.
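The caching detail is easiest to see in code. Below is a rough sketch of equation (7) with the $U_a h_j$ projections computed once and reused across target steps. All sizes and weight values are invented, and `matvec` and `additive_scores` are illustrative helper names.

```python
import math

def matvec(M, v):
    """Matrix-vector product on plain lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def additive_scores(s_prev, Uh, W_a, v_a):
    """e_ij = v_a . tanh(W_a s_{i-1} + U_a h_j) for every j, given cached U_a h_j."""
    Ws = matvec(W_a, s_prev)                   # computed once per target step
    return [
        sum(v * math.tanh(w + u) for v, w, u in zip(v_a, Ws, Uh_j))
        for Uh_j in Uh
    ]

# Hypothetical sizes: decoder state 2, annotation 2, alignment hidden layer 3.
W_a = [[0.4, -0.2], [0.1, 0.3], [-0.5, 0.2]]
U_a = [[0.3, 0.1], [-0.2, 0.6], [0.0, 0.5]]
v_a = [1.0, -1.0, 0.5]
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

Uh = [matvec(U_a, h_j) for h_j in h]           # once per source word
e_step1 = additive_scores([0.1, 0.2], Uh, W_a, v_a)
e_step2 = additive_scores([-0.3, 0.4], Uh, W_a, v_a)   # reuses Uh unchanged
```

Only the $W_a s_{i-1}$ term is recomputed when the decoder state changes; the list `Uh` is shared by every target step.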

Below is one decoder step under your control. Step through the target sentence and watch the energies become a row of weights that sums to one, then watch those weights gather the annotations into the context. The distribution slides along the source as the decoder advances, mostly tracking the matching word:

Figure 2 · one attention step

For each target word, the source annotations h₁…h₆ are scored against the decoder state, softmaxed into weights α that sum to 1 (the bars), and pooled into the context c, drawn as a stacked bar whose segments are each source word's share. Press Play or scrub: the attention slides along the source as the decoder writes the translation.

With (5), (6) and (7) the decoder has what it needs: a relevance score per source word, a normalized set of weights, and a context that is their weighted blend, all recomputed every step and all differentiable. The annotations $h_j$ still have to be defined; the encoder produces them below.

Annotations that read from both directions

Each annotation $h_j$ should describe its own source word together with the context it sits in, and context runs both ways: the words before $x_j$ and the words after. A plain forward RNN only sees the words before, so the paper reads the source with a bidirectional RNN. A forward RNN reads left to right and produces states $\overrightarrow{h}_1, \dots, \overrightarrow{h}_{T_x}$ ; a backward RNN reads right to left and produces $\overleftarrow{h}_1, \dots, \overleftarrow{h}_{T_x}$ . The annotation for word $j$ is the two stuck together:

$$h_j = \big[\, \overrightarrow{h}_j^\top \,;\; \overleftarrow{h}_j^\top \,\big]^\top \tag{9}$$

The forward half has read everything up to and including word $j$ ; the backward half has read everything from the end back to word $j$ . Together they summarize the entire sentence. And because an RNN leans on its most recent input, each half is sharpest near $j$ , so the concatenated $h_j$ stays focused on the words around position $j$ while still carrying the global picture. That local focus is exactly what makes the alignment weights interpretable later: when the decoder puts weight on $h_j$ , it is mostly looking at the region around the $j$ -th source word. Concatenation makes $h_j$ twice as wide as a single direction's state, which is why the alignment model's $U_a$ and the decoder's context weights act on a $2n$ -wide vector. Hover a word to see its annotation assembled from the two passes:

Figure 3 · bidirectional annotations

A forward RNN reads left to right, a backward RNN right to left. The annotation for a word is the two states concatenated, $h_j = [\overrightarrow{h}_j; \overleftarrow{h}_j]$. Hover a word: the words to its left glow as its forward context and the words to its right as its backward context, brightest at the word itself, so the annotation sees the whole sentence yet stays focused on its own position.

The decoder's first state $s_0$ is initialized from the first backward annotation, $s_0 = \tanh(W_s \overleftarrow{h}_1)$ . The backward RNN reaches position 1 last, after sweeping the entire sentence right to left, so $\overleftarrow{h}_1$ is the state that has just seen all of the source. The decoder, in other words, begins from a vector that has already seen the entire input, and then refines its view word by word through attention.
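A toy version of the two passes shows which half of an annotation sees which side of the sentence. This sketch assumes a scalar tanh RNN per direction and scalar "word vectors", purely to keep it short; the weights, `W_s` value and function names are invented, and the paper uses gated units with vector states.

```python
import math

def rnn_pass(xs, w_x, w_h):
    """Scalar-state tanh RNN over xs, returning the state after each input."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def annotations(xs):
    """h_j = [forward_j ; backward_j], with separate weights per direction."""
    forward = rnn_pass(xs, w_x=0.7, w_h=0.5)                  # left to right
    # Read the reversed source, then flip so index j lines up with word j.
    backward = rnn_pass(xs[::-1], w_x=-0.4, w_h=0.6)[::-1]
    return [[f, b] for f, b in zip(forward, backward)]

H = annotations([1.0, 2.0, 3.0])
H_changed = annotations([1.0, 2.0, 9.0])       # only the last word differs

W_s = 0.8                                      # hypothetical scalar stand-in for W_s
s0 = math.tanh(W_s * H[0][1])                  # s_0 from the first backward state
```

Changing the last word leaves the forward half of the first annotation untouched but changes its backward half, which is why the first backward state is a reasonable place to start the decoder from.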

Why soft alignment trains at all

What makes the alignment learnable at all is the word soft. Older alignment ideas in translation treated the alignment as a hidden choice: word $i$ comes from source word $j$ , a discrete decision you would have to search over or sample. A discrete choice is a step function, and you cannot push a gradient through a step, so you are forced into sampling and high-variance estimators (this is what later hard attention does, training the picked location with a REINFORCE-style estimator: the policy-gradient trick that learns a discrete choice from samples of it, paying for the non-differentiability with high gradient variance). This paper makes the alignment soft instead. The softmax (6) produces a smooth distribution, the context (5) is a differentiable average, and the alignment is explicitly not a latent variable. So the gradient of the translation loss flows straight back through the weights into the alignment model and the encoder. Alignment and translation are trained by the same backprop, on the same objective, with nothing special bolted on.
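The difference between the two choices shows up in a finite-difference check. The sketch below compares a soft context with a hard argmax pick on invented scalar annotations and energies; it illustrates the gradient argument only and is not how either model is trained.

```python
import math

def softmax(e):
    m = max(e)
    exps = [math.exp(x - m) for x in e]
    total = sum(exps)
    return [x / total for x in exps]

def soft_context(e, h):
    """Differentiable: a softmax-weighted average of the annotations."""
    return sum(a * h_j for a, h_j in zip(softmax(e), h))

def hard_context(e, h):
    """Discrete: pick the single annotation with the highest energy."""
    return h[max(range(len(e)), key=e.__getitem__)]

h = [1.0, 5.0, 9.0]        # hypothetical scalar annotations
e = [0.2, 1.0, 0.4]        # hypothetical energies
eps = 1e-4
nudged = [e[0] + eps, e[1], e[2]]

soft_slope = (soft_context(nudged, h) - soft_context(e, h)) / eps
hard_slope = (hard_context(nudged, h) - hard_context(e, h)) / eps
```

The soft context moves when an energy is nudged, so the loss can tell the alignment model which way to adjust. The hard pick does not move at all until the argmax flips, so its slope here is exactly zero and carries no learning signal.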

That single decision, soft rather than hard, is doing several jobs. It removes the need for a separate aligner or a fixed alignment table. It lets the model hedge, splitting weight across two or three source words when it is unsure, instead of betting everything on one. And it keeps the entire system one differentiable function from source to loss, which is the property that let a single network match systems assembled from many tuned parts. It does run up a bill: the alignment model is evaluated once for every (source word, target word) pair, so the work scales with the product $T_x \times T_y$ of the two sentence lengths. The paper judges that acceptable for sentences of 15 to 40 words and flags it as a possible limit elsewhere. It is also the first appearance of the quadratic cost that attention research has been fighting ever since.

The recurrent cell $f$ is the gated hidden unit from Cho et al. (the unit later called the GRU), not the LSTM that Sutskever used. It keeps a running state with two gates: an update gate $z_i$ that decides how much of the old state to carry forward unchanged, and a reset gate $r_i$ that decides how much of the old state to ignore when proposing a new one. Carrying the old state through nearly untouched lets gradients survive over many steps, the same long-memory trick the LSTM buys with more machinery. The output layer is a single maxout hidden layer (it takes the max over pairs of pre-activations, a cheap way to learn a flexible nonlinearity) feeding a softmax over the target vocabulary. In code, one decoder step is the five lines we have been describing:

# one decoder step i: emit target word y_i (bias terms omitted)
# h[j]: source annotations (2n-dim), one per source word, j in range(Tx) ; s_prev: state s_{i-1}
e   = [v_a @ tanh(W_a @ s_prev + U_a @ h[j]) for j in range(Tx)]  # (7)
a   = softmax(e)                       # weights alpha_ij, sum to 1   (6)
c   = sum(a[j] * h[j] for j in range(Tx))   # context c_i = blend     (5)
s   = gru(s_prev, embed(y_prev), c)    # new decoder state s_i
y_i = softmax(maxout(s, embed(y_prev), c))  # next-word distribution  (4)

None of these pieces is the contribution; they are the solid 2014 parts the contribution is built from. The three lines in the middle (score, normalize, sum) carry the paper's change. Swap the fixed $c$ for that per-step context and the translator preserves the start of long sentences.
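For readers who want to see inside the `gru` call in the step above, here is a scalar sketch of a gated update with an update gate and a reset gate. It assumes the convention in which the new state is `(1 - z) * s_prev + z * candidate`; other write-ups swap the roles of `z` and `1 - z`. All parameter names and values are invented.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(s_prev, y_emb, c, p):
    """Scalar gated unit fed by the previous state, previous word embedding and context."""
    z = sigmoid(p["wz"] * y_emb + p["uz"] * s_prev + p["cz"] * c + p["bz"])  # update gate
    r = sigmoid(p["wr"] * y_emb + p["ur"] * s_prev + p["cr"] * c + p["br"])  # reset gate
    cand = math.tanh(p["w"] * y_emb + p["u"] * (r * s_prev) + p["c"] * c)    # proposed state
    return (1.0 - z) * s_prev + z * cand

# Hypothetical parameters; bz is a bias on the update gate used to force it shut or open.
base = {"wz": 0.2, "uz": 0.1, "cz": 0.3, "bz": 0.0,
        "wr": 0.4, "ur": -0.2, "cr": 0.1, "br": 0.0,
        "w": 0.5, "u": 0.4, "c": 0.3}

kept = gru_step(0.3, 1.0, 0.5, dict(base, bz=-50.0))        # gate shut: old state carried through
overwritten = gru_step(0.3, 1.0, 0.5, dict(base, bz=50.0))  # gate open: candidate replaces it
```

With the gate shut the state passes through almost unchanged, which is the path that lets gradients survive many steps.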

What the alignments show

The test is WMT'14 English-to-French, trained on 348M words with a 30,000-word vocabulary per language. Two models, matched in size at 1000 hidden units: the fixed-vector baseline (RNNencdec) and the soft-search model (RNNsearch), each trained once on sentences up to 30 words and once up to 50. Quality is BLEU, a modified n-gram precision against reference translations, with a penalty for being too short; higher is better.
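As a rough illustration of what that score measures, here is a simplified sentence-level BLEU against a single reference, without smoothing. Reported BLEU is computed over a whole test corpus, so treat this as a sketch of the two ingredients (clipped n-gram precision and the brevity penalty) and not as the paper's evaluation script.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(cand, ref, n):
    """Candidate n-gram counts are clipped by how often each appears in the reference."""
    c, r = ngrams(cand, n), ngrams(ref, n)
    clipped = sum(min(count, r[g]) for g, count in c.items())
    return clipped / max(sum(c.values()), 1)

def bleu(cand, ref, max_n=4):
    precisions = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0                                   # no smoothing in this sketch
    # Brevity penalty: candidates shorter than the reference are scaled down.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat is on the mat".split()
print(modified_precision(["the"] * 4, ref, 1))       # 0.5
print(bleu(ref, ref))                                # 1.0
```

Repeating "the" four times earns credit for only the two occurrences in the reference, and a candidate that stops one word early keeps perfect precision but is scaled down by the brevity penalty.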

On the full test set RNNsearch-50 scores 26.75 BLEU against RNNencdec-50's 17.82, a gap of nearly nine points, roughly the difference between a translation you can read and one you have to decode. The robustness shows up most clearly against sentence length: drag the marker below and the curves match what the bottleneck conjecture would predict. The fixed-vector model climbs for short sentences and then falls off as the source outgrows its one vector; the soft-search model trained to length 50 stays essentially flat past 50 words, while the one trained only to 30 holds until its training length and then declines. The gap is near zero for short sentences (where one vector is enough) and widens past ten BLEU for long ones (where it is not):

Figure 4 · quality vs sentence length

BLEU against source length, following the trend of the paper's Figure 2. RNNsearch-50 stays flat even past 50 words; RNNencdec-50 (dashed) climbs then collapses as the sentence outgrows its fixed vector; RNNsearch-30 holds to its training length then dips. Drag the marker to read the gap at any length. Plateaus are anchored to the Table 1 scores.

Two smaller numbers make the point sharper. RNNsearch-30, trained only on short sentences, still beats RNNencdec-50 overall (21.50 versus 17.82): the better architecture trained on less wins. And trained longer, RNNsearch-50's best run reaches 28.45 on all sentences, 36.15 once sentences with unknown words are excluded. That second figure edges the conventional phrase-based system Moses, at 35.63, which had years of engineering and an extra 418M words of monolingual text behind it. Be careful with that comparison, though: on the full test set, unknown words and all, Moses still wins 33.30 to 28.45. The accurate claim, and the one the paper makes, is that a single neural network drew level with phrase-based translation on known-word sentences, not that it beat it outright.

The clearest evidence is visual rather than numeric. Because the weights $\alpha_{ij}$ are a distribution over source positions, you can lay them out as a heatmap and read off, for each target word, which source words it attended to. The alignments come out mostly monotonic, a bright diagonal, which is what you expect between two fairly similar languages. The interesting cells are off the diagonal, where French reorders the words. Hover the rows below: when the model writes "zone économique européenne" for "European Economic Area," it jumps to "Area" first and then walks the adjectives back, producing the anti-diagonal block instead of a straight line:

Figure 5 · the alignment heatmap

Attention weights laid out as a grid: columns are the English source, rows are the French target, brighter cells carry more weight. The main diagonal is the word-for-word alignment; over "European Economic Area" the bright cells form an anti-diagonal because French reverses the order. Hover a French word to trace which source word it reads from. This is the kind of soft alignment the paper shows in its Figure 3.

Nobody told the model that "européenne" aligns with "European." It learned the alignment as a by-product of learning to translate, because reading the right source word lets it predict the right target word. That the learned weights agree with how a person would align the sentences, reordering and all, is the evidence that the mechanism is doing what its name claims, attending to the relevant source.
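Reading an alignment off the weights amounts to taking the brightest cell in each row. The sketch below does that for the reordered phrase, using invented weights chosen to mimic the anti-diagonal pattern; they are not values from the paper.

```python
source = ["European", "Economic", "Area"]
target = ["zone", "économique", "européenne"]

# Hypothetical attention weights: one row per target word, one column per source word.
alpha = [
    [0.05, 0.10, 0.85],
    [0.10, 0.80, 0.10],
    [0.75, 0.15, 0.10],
]

def read_alignment(alpha, source, target):
    """For each target word, report the source word carrying the most weight."""
    pairs = []
    for i, row in enumerate(alpha):
        j = max(range(len(row)), key=row.__getitem__)
        pairs.append((target[i], source[j]))
    return pairs

print(read_alignment(alpha, source, target))
# [('zone', 'Area'), ('économique', 'Economic'), ('européenne', 'European')]
```

Taking the argmax throws away the hedging that soft weights allow, so it is a reading aid for the heatmap and not part of the model.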

The fix traces back to one substitution with consequences. A fixed-length context vector loses accuracy on long sentences; replace it with a new context per output word, built as a softmax-weighted sum of one annotation per source word, and keep the weighting soft so the entire system still trains by ordinary backprop. That bought a translator that retained the start of long sentences and drew level with a decade of phrase-based engineering. The mechanism it introduced, a learned soft search over a set of vectors, proved far more general than translation. Three years later it would be the only thing a Transformer is made of, and the encoder and decoder RNNs around it would be thrown away.

Provenance · Verified against primary literature

Sutskever et al. (2014). The seq2seq baseline: an LSTM encoder folds the source into one fixed vector (q = h_T), reading the input reversed.

Cho et al. (2014a). The RNN Encoder–Decoder and the "gated hidden unit" (the GRU) this paper uses for f.

Cho et al. (2014b). The empirical finding that a fixed-vector encoder–decoder degrades as the source grows. It motivates the bottleneck conjecture.

Schuster & Paliwal (1997). The bidirectional RNN that produces the per-word annotations.

Luong et al. (2015). The multiplicative attention that scores the CURRENT decoder state, a later contrast to this additive scorer.

Vaswani et al. (2017). The Transformer’s scaled dot-product over separate Q, K, V, which drops the RNN this paper still relies on.

Correction. The attention here is ADDITIVE: it scores against the PREVIOUS decoder state $s_{i-1}$ with a one-layer tanh MLP, and each annotation is both the key and the value. That is not the Transformer’s scaled dot-product over separate Q/K/V (2017), nor Luong’s multiplicative score on the current state (2015). The recurrent cell is the GRU ("gated hidden unit"), not the LSTM Sutskever used. The paper says "attention" only in a single passing aside (three times in one sentence); its own working terms are "alignment model" and "soft-search," and the model is "RNNsearch." We never say it beats Moses overall: it draws level only on known-word sentences.

Questions you might still have

Is this the same attention as in the Transformer?
The idea is the same (a softmax-weighted average of source vectors), but the parts differ. Here the score is a one-layer tanh MLP of the previous decoder state and each annotation, and the same annotation is both key and value. The Transformer (2017) scores with a scaled dot-product over separately projected Q, K, V and removes the RNN entirely. This is the ancestor; that is the descendant.

If the weights are learned, what stops them from drifting away from real alignments?
Nothing forces them. The alignment is never supervised and is not a latent variable; it falls out of maximizing translation likelihood. The weights end up matching linguistic alignment because attending to the right source word helps predict the next target word. The heatmaps confirm it after the fact.

Why a weighted average instead of just picking the single best source word?
A hard pick is a step function, so you cannot backpropagate through it; you would need to sample a position and use a higher-variance estimator (REINFORCE), which is exactly what hard attention does. The soft average is differentiable, trains with ordinary backprop, and can hedge across several words when it is unsure.

Does attention remove the fixed-length-vector limit at no cost?
It removes the single bottleneck, but it adds cost: the alignment model is evaluated once for every (source word, target word) pair, so the work grows with the product of the two lengths. For 15-to-40-word sentences that is cheap; at thousands of tokens it is the quadratic-attention cost that later work spends years fighting.

Did this paper invent the word "attention"?
It uses "attention" in just one passing aside (three times in a single sentence), as an intuition for what the decoder is doing. Its working names are "alignment model" and "(soft-)search," and the model is called RNNsearch. The name "attention" stuck afterward, and "Bahdanau attention" is the retrospective label for this additive form.

Footnotes & further reading

The paper: Bahdanau, Cho, Bengio, Neural Machine Translation by Jointly Learning to Align and Translate (ICLR 2015). Code (GroundHog).
The seq2seq baseline with a single fixed vector: Sutskever, Vinyals, Le, Sequence to Sequence Learning with Neural Networks.
The RNN Encoder–Decoder and the gated hidden unit (GRU): Cho et al., Learning Phrase Representations using RNN Encoder–Decoder, and the length-degradation study that motivates the bottleneck, Cho et al., On the Properties of Neural Machine Translation.
The bidirectional RNN: Schuster & Paliwal, Bidirectional Recurrent Neural Networks (1997).
Later cousins of this attention: Luong et al., Effective Approaches to Attention-based NMT (multiplicative, current state), Xu et al., Show, Attend and Tell (hard vs soft attention), and Vaswani et al., Attention Is All You Need (the Transformer, which keeps only the attention).
BLEU: Papineni et al., BLEU: a Method for Automatic Evaluation of Machine Translation (2002).
