Verified · arXiv:1409.0473 · 17 min
Architecture · NLP

Neural Machine Translation by Jointly Learning to Align and Translate

The attention mechanism, born in a translator.

A decoder that reads from one fixed summary of the source forgets the start of a long sentence by the time it reaches the end. This paper lets it look back at every source word, weighted, for each word it writes. That weighted look is attention.

Explaining the paper: Neural Machine Translation by Jointly Learning to Align and Translate. Bahdanau, Cho, Bengio · Jacobs University / Université de Montréal · ICLR 2015 · arXiv:1409.0473

Instead of cramming a whole sentence into one vector, let the model look back at the source for every word it writes. That look is where attention starts.

In 2014 the new way to translate with a neural network was to read the whole source sentence into a single vector and then write the translation out of that vector. One network, the encoder, consumed the English word by word and ended with a fixed list of numbers meant to hold the entire meaning. A second network, the decoder, generated the French from that list alone. Train the pair end to end on millions of sentence pairs and it worked well enough to rival systems built from a dozen hand-tuned parts. The whole sentence, however long, had to fit in that one fixed list.

This paper is where that constraint broke, and it is where the attention mechanism that now runs every large language model first appears. Bahdanau et al. stop forcing the source into one vector. They keep a separate vector for every source word, and let the decoder decide, fresh for each word it emits, which of those vectors to read and how much. The authors call it learning to align (which source words matter now) and translate (what to write) at the same time. To see why it works, a few ideas carry it: the encoder–decoder it extends, the per-word context vector that replaces the fixed one, the softmax that decides the weights, the bidirectional encoder that supplies the per-word vectors, and why the whole thing still trains with plain gradient descent.

One vector, however long the sentence

Start with the thing the paper set out to fix. In the encoder–decoder setup, the encoder reads the source and produces one fixed-length vector, written $c$, and the decoder generates the entire translation conditioned on that one $c$. The vector has the same size whether the source is four words or forty. For a short sentence that is plenty of room. For a long one it is a funnel: everything the decoder will ever need about the source has to survive being compressed into a list of numbers whose length never grows.

The authors are careful about how strongly they claim this hurts. They conjecture that the fixed-length vector is a bottleneck, and they point to a companion study (Cho et al., 2014b) that measured the symptom: a basic encoder–decoder translates short sentences well and then degrades sharply as the source gets longer. So the bottleneck is a hypothesis motivated by evidence, not a proven theorem, and the rest of the paper is the experiment that puts it to the test. Drag the length below and switch between the two designs. In the fixed-vector design the box never widens, so each word's share of it falls as $1/L$; in the soft-search design the encoder keeps one annotation per word, and that strip grows with the sentence:

Figure 1 · the fixed-vector bottleneck
A source of $L$ words is read by an encoder. RNNencdec squeezes all of it into one fixed vector $c$ whose width never changes, so each word's share is $1/L$ and shrinks as the sentence grows; every output word reads the same $c$. RNNsearch keeps one annotation per word, a strip that grows with the sentence, and the decoder blends a fresh slice of it per output word. Drag $L$ to the long end and watch the fixed box crowd.

That contrast is the paper in one picture. The fixed vector is a budget that does not grow; the annotations are a budget per word that does. Everything that follows is the machinery for the right panel: building one vector per source word, and teaching the decoder to read the right ones.

The encoder–decoder it extends

To change the model precisely, name the baseline precisely. An RNN reads the source one symbol at a time, folding each new word into a running hidden state:

$$h_t = f(x_t, h_{t-1}), \qquad c = q(\{h_1, \dots, h_{T_x}\}) \tag{1}$$

Here $x_t$ is the $t$-th source word, $h_t$ is the hidden state after reading it, and $f$ is the recurrent cell that updates the state. After the last word the encoder summarizes the whole run of states into the context $c$ through some function $q$. In the systems this paper builds on, $q$ just takes the final state: Sutskever and colleagues used an LSTM for $f$ and set $c = h_{T_x}$, the state after the last word. One vector, end of source.

The decoder turns that vector back into a sentence. It defines the probability of a translation $\mathbf{y}$ by emitting one word at a time, each conditioned on the words so far and on the same $c$:

$$p(\mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c), \qquad p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c) \tag{2, 3}$$

Read the second piece slowly, because the fix lives in one symbol. The next-word distribution depends on the previous word $y_{t-1}$, the decoder's own hidden state $s_t$, and the context $c$. That $c$ carries no subscript: it is identical for the first word and the fortieth. The decoder is reading from a frozen summary it cannot refresh. The change the paper makes is to give $c$ a subscript, and to recompute it for every word. Once you see that the design turns on a missing index, the rest is working out how to build a good $c_i$.
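To make the frozen summary concrete, here is a minimal NumPy sketch of the baseline encoder in (1). It is an illustration under stated assumptions, not the paper's code: a plain tanh cell stands in for the LSTM or gated unit, the weights are random, and the names `encode_fixed`, `W_x`, `W_h` and the sizes `d`, `n` are hypothetical.

```python
import numpy as np

def encode_fixed(xs, W_x, W_h):
    """Fold a sequence of word vectors into one final hidden state (q = last state)."""
    h = np.zeros(W_h.shape[0])
    for x in xs:                          # one source word at a time
        h = np.tanh(W_x @ x + W_h @ h)    # h_t = f(x_t, h_{t-1}); tanh is a stand-in for f
    return h                              # c = h_{T_x}

rng = np.random.default_rng(0)
d, n = 4, 8                               # hypothetical embedding and state sizes
W_x = rng.normal(size=(n, d)) * 0.1
W_h = rng.normal(size=(n, n)) * 0.1

c_short = encode_fixed(rng.normal(size=(4, d)), W_x, W_h)    # 4-word source
c_long = encode_fixed(rng.normal(size=(40, d)), W_x, W_h)    # 40-word source
print(c_short.shape, c_long.shape)        # (8,) (8,)
```

Whatever the source length, the decoder receives the same number of values, which is the constraint the per-step context removes.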

A context vector for every word

The new decoder uses a distinct context vector $c_i$ for each target word $y_i$:

$$p(y_i \mid y_1, \dots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, s_i, c_i), \qquad s_i = f(s_{i-1}, y_{i-1}, c_i) \tag{4}$$

Same shape as before, one index richer. The decoder state $s_i$ and the output for word $i$ now see $c_i$, a context built specifically for this step, rather than the one frozen $c$. So the question becomes: where does $c_i$ come from? It is a weighted sum of the source representations. Suppose the encoder has turned the source into a sequence of vectors $(h_1, \dots, h_{T_x})$, one per word, which the paper calls annotations. Each $h_j$ summarizes the whole sentence with a focus on the words around position $j$ (a later section builds them; for now take them as given, one rich vector per source word). The context for target word $i$ is then:

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij}\, h_j \tag{5}$$

The weights $\alpha_{ij}$ are nonnegative and sum to one across the source (the next section shows how), so $c_i$ is a convex combination of the annotations: a weighted average that lands somewhere inside the cloud of source vectors, pulled toward whichever ones carry the most weight right now. Put all the weight on one annotation and $c_i$ is essentially that source word; spread it evenly and $c_i$ is a blur of the whole sentence. The paper calls $c_i$ an expected annotation, the average annotation under the weight distribution $\alpha_{i\cdot}$. Worth keeping straight: that expectation is over the model's own learned weights, a deterministic mean it computes, not a Bayesian posterior over some true hidden alignment.

The point of (5) is that the decoder is no longer stuck with one summary. At every step it gets to mix a new one, drawn from the parts of the source that matter for the word it is about to write. The fixed funnel is gone; what remains is to decide the mix.
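The two extremes of (5) are easy to check numerically. The sketch below uses made-up annotations and hand-picked weights (the helper name `context` and the sizes are hypothetical) to show that a one-hot weight row recovers a single annotation and a uniform row gives the plain average.

```python
import numpy as np

def context(alpha, H):
    """c_i = sum_j alpha_ij * h_j, with H of shape (T_x, dim)."""
    return alpha @ H

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 6))               # 5 hypothetical annotations, 6 dims each

one_hot = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
uniform = np.full(5, 1 / 5)

# All weight on the third word: the context is exactly that annotation.
assert np.allclose(context(one_hot, H), H[2])
# Even weights: the context is the average of every annotation.
assert np.allclose(context(uniform, H), H.mean(axis=0))

# Any convex combination stays inside the per-dimension range of the annotations.
mixed = context(np.array([0.1, 0.6, 0.1, 0.1, 0.1]), H)
assert np.all(mixed >= H.min(axis=0) - 1e-12)
assert np.all(mixed <= H.max(axis=0) + 1e-12)
```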

The weights come from a softmax over alignment scores

The decoder needs, at each step $i$, a number for each source word saying how relevant it is right now. Call that raw score the energy $e_{ij}$, and turn the energies into weights with a softmax:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad e_{ij} = a(s_{i-1}, h_j) \tag{6}$$

The softmax does two jobs at once. It makes every weight positive and forces the row to sum to one, so the weights are a genuine distribution over source positions and the context (5) is a proper average. And it is smooth: a small change in an energy nudges all the weights a little, which is what lets gradients flow back into whatever produced the energies. The energy itself is an alignment model $a$, a small network that scores how well the source around position $j$ matches what the decoder is about to do. It reads two things: the decoder's hidden state $s_{i-1}$, the state just before emitting $y_i$, and the annotation $h_j$.
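Both jobs of the softmax in (6) can be seen on a toy row of energies. This is a small sketch with made-up numbers; subtracting the maximum before exponentiating is a standard numerical-stability step and is not something the article discusses.

```python
import numpy as np

def softmax(e):
    """Turn a vector of energies into weights that are positive and sum to 1."""
    z = np.exp(e - np.max(e))   # subtracting the max avoids overflow; the result is unchanged
    return z / z.sum()

e = np.array([2.0, 0.5, -1.0, 0.0])       # made-up energies for 4 source words
a = softmax(e)
assert np.all(a > 0) and np.isclose(a.sum(), 1.0)

# Smoothness: nudging one energy up raises its weight and lowers all the others a little.
a2 = softmax(e + np.array([0.01, 0.0, 0.0, 0.0]))
assert a2[0] > a[0] and np.all(a2[1:] < a[1:])
```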

Notice which state the alignment model scores against. The query is $s_{i-1}$, the previous decoder state, available before $y_i$ exists, so the model can choose where to look and only then commit to a word. (A later variant, Luong's, scores against the current state instead; the Transformer drops the recurrence and scores a query against keys with a scaled dot product. Same family, different wiring, and worth not conflating.) The alignment model is parametrized as a one-hidden-layer network with a $\tanh$:

$$a(s_{i-1}, h_j) = v_a^\top \tanh\big(W_a\, s_{i-1} + U_a\, h_j\big) \tag{7}$$

This is additive attention: combine the query $s_{i-1}$ and the annotation $h_j$ inside a $\tanh$, then project to a single number with $v_a$. The same annotation $h_j$ plays two roles, the thing being scored (the key) and the thing being averaged in (the value), with no separate projections for the two. A practical detail makes this affordable: $U_a h_j$ does not depend on the target step $i$, so it is computed once per source word and reused for every output word. The alignment model and the rest of the translator are trained together, from the same loss; alignment is not a separate pre-processing step or a fixed table.
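The caching detail is easier to see in code. The following NumPy sketch is an illustration with random weights and hypothetical sizes (`n_align` for the hidden width of the alignment model, the function name `attend`): it computes $U_a h_j$ once for the whole source and reuses it across several decoder states.

```python
import numpy as np

def softmax(e):
    z = np.exp(e - e.max())
    return z / z.sum()

def attend(s_prev, H, UH, W_a, v_a):
    """One attention step: energies (7), weights (6), context (5)."""
    e = np.tanh(W_a @ s_prev + UH) @ v_a   # shape (T_x,): one energy per source word
    alpha = softmax(e)
    return alpha, alpha @ H

rng = np.random.default_rng(0)
T_x, n, n_align = 6, 4, 5                  # hypothetical sizes; annotations are 2n wide
H = rng.normal(size=(T_x, 2 * n))
W_a = rng.normal(size=(n_align, n))
U_a = rng.normal(size=(n_align, 2 * n))
v_a = rng.normal(size=n_align)

UH = H @ U_a.T                             # U_a h_j for every j, computed once per sentence

for s_prev in rng.normal(size=(3, n)):     # three made-up decoder states
    alpha, c = attend(s_prev, H, UH, W_a, v_a)
    assert np.isclose(alpha.sum(), 1.0) and c.shape == (2 * n,)
```

Only the $W_a s_{i-1}$ term changes between output words, so the per-step work is one small matrix-vector product plus a broadcast add over the cached rows.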

Below is one decoder step under your control. Step through the target sentence and watch the energies become a row of weights that sums to one, then watch those weights blend the annotations into the context. The distribution slides along the source as the decoder advances, mostly tracking the matching word:

Figure 2 · one attention step
For each target word, the source annotations h₁…h₆ are scored against the decoder state, softmaxed into weights α that sum to 1 (the bars), and blended into the context c, drawn as a stacked bar whose segments are each source word's share. Press Play or scrub: the attention slides along the source as the decoder writes the translation.

With (5), (6) and (7) the decoder has what it needs: a relevance score per source word, a normalized set of weights, and a context that is their weighted blend, all recomputed every step and all differentiable. The encoder still owes us the annotations $h_j$.

Annotations that read from both directions

Each annotation $h_j$ should describe its own source word together with the context it sits in, and context runs both ways: the words before $x_j$ and the words after. A plain forward RNN only sees the words before, so the paper reads the source with a bidirectional RNN. A forward RNN reads left to right and produces states $\overrightarrow{h}_1, \dots, \overrightarrow{h}_{T_x}$; a backward RNN reads right to left and produces $\overleftarrow{h}_1, \dots, \overleftarrow{h}_{T_x}$. The annotation for word $j$ is the two stuck together:

$$h_j = \big[\, \overrightarrow{h}_j^\top \,;\; \overleftarrow{h}_j^\top \,\big]^\top \tag{9}$$

The forward half has read everything up to and including word $j$; the backward half has read everything from the end back to word $j$. Together they summarize the whole sentence. And because an RNN leans on its most recent input, each half is sharpest near $j$, so the concatenated $h_j$ stays focused on the words around position $j$ while still carrying the global picture. That local focus is exactly what makes the alignment weights interpretable later: when the decoder puts weight on $h_j$, it is mostly looking at the region around the $j$-th source word. One consequence of concatenation: $h_j$ is twice as wide as a single direction's state, which is why the alignment model's $U_a$ and the decoder's context weights act on a $2n$-wide vector. Hover a word to see its annotation assembled from the two passes:

Figure 3 · bidirectional annotations
A forward RNN reads left to right, a backward RNN right to left. The annotation for a word is the two states concatenated, $h_j = [\overrightarrow{h}_j; \overleftarrow{h}_j]$. Hover a word: the words to its left glow as its forward context and the words to its right as its backward context, brightest at the word itself, so the annotation sees the whole sentence yet stays focused on its own position.

One small asymmetry is easy to get backwards, so catch it now. The decoder's first state $s_0$ is initialized from the first backward annotation, $s_0 = \tanh(W_s \overleftarrow{h}_1)$. The backward RNN reaches position 1 last, after sweeping the entire sentence right to left, so $\overleftarrow{h}_1$ is the state that has just seen all of the source. The decoder, in other words, begins from a vector that already glimpsed the whole input, and then refines its view word by word through attention.
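The indexing is the part that tends to go wrong, so here is a small NumPy sketch of the annotations in (9) and the decoder initialization. It rests on assumptions: a plain tanh cell stands in for the gated unit, the weights are random, and `run_rnn`, `annotate` and all sizes are hypothetical names chosen for this example.

```python
import numpy as np

def run_rnn(xs, W_x, W_h):
    """Return the hidden state after each input, in the order the inputs were given."""
    h, states = np.zeros(W_h.shape[0]), []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)     # tanh cell as a stand-in for the gated unit
        states.append(h)
    return np.stack(states)

def annotate(xs, fwd, bwd):
    """h_j = [forward_j ; backward_j] for every source position j."""
    h_fwd = run_rnn(xs, *fwd)
    h_bwd = run_rnn(xs[::-1], *bwd)[::-1]  # read right to left, then restore source order
    return np.concatenate([h_fwd, h_bwd], axis=1), h_bwd

rng = np.random.default_rng(0)
T_x, d, n = 5, 3, 4                        # hypothetical sentence length and sizes
xs = rng.normal(size=(T_x, d))
fwd = (rng.normal(size=(n, d)), rng.normal(size=(n, n)) * 0.5)
bwd = (rng.normal(size=(n, d)), rng.normal(size=(n, n)) * 0.5)
W_s = rng.normal(size=(n, n))

H, h_bwd = annotate(xs, fwd, bwd)
s0 = np.tanh(W_s @ h_bwd[0])               # backward state at position 1 has seen the whole source
print(H.shape, s0.shape)                   # (5, 8) (4,)
```

Reversing the backward states back into source order is what keeps row `j` of `H` about word `j` in both halves.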

Why soft alignment trains at all

What makes the alignment learnable at all is the word soft. Older alignment ideas in translation treated the alignment as a hidden choice: target word $i$ comes from source word $j$, a discrete decision you would have to search over or sample. A discrete choice is a step function, and you cannot push a gradient through a step, so you are forced into sampling and high-variance estimators (this is what later hard attention does, training the picked location with a REINFORCE-style estimator). This paper makes the alignment soft instead. The softmax (6) produces a smooth distribution, the context (5) is a differentiable average, and the alignment is explicitly not a latent variable. So the gradient of the translation loss flows straight back through the weights into the alignment model and the encoder. Alignment and translation are trained by the same backprop, on the same objective, with nothing special bolted on.
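A finite-difference check makes the step-function argument tangible. In this sketch (made-up energies and annotations, hypothetical helper names) one energy is nudged slightly: the soft context moves, so the energy receives a learning signal, while the hard pick does not change at all.

```python
import numpy as np

def softmax(e):
    z = np.exp(e - e.max())
    return z / z.sum()

def soft_context(e, H):
    return softmax(e) @ H                  # differentiable blend of all annotations

def hard_context(e, H):
    return H[np.argmax(e)]                 # pick one annotation: a step function of e

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))                # 4 made-up annotations
e = np.array([1.0, 0.2, -0.5, 0.3])        # made-up energies
eps = 1e-4
bump = np.array([eps, 0.0, 0.0, 0.0])      # nudge the first energy only

soft_slope = (soft_context(e + bump, H) - soft_context(e, H)) / eps
hard_slope = (hard_context(e + bump, H) - hard_context(e, H)) / eps

print(np.abs(soft_slope).max() > 0)        # True: the energy gets a learning signal
print(np.abs(hard_slope).max() == 0)       # True: the pick did not change, so no signal
```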

That single decision, soft rather than hard, is doing several jobs. It removes the need for a separate aligner or a fixed alignment table. It lets the model hedge, splitting weight across two or three source words when it is unsure, instead of betting everything on one. And it keeps the entire system one differentiable function from source to loss, which is the property that let a single network match systems assembled from many tuned parts. It does run up a bill: the alignment model is evaluated once for every (source word, target word) pair, so the work scales with the product $T_x \times T_y$ of the two sentence lengths. The paper judges that acceptable for sentences of 15 to 40 words and flags it as a possible limit elsewhere. It is also the first appearance of the quadratic cost that attention research has been fighting ever since.

The recurrent cell $f$ is the gated hidden unit from Cho et al. (the unit later called the GRU), not the LSTM that Sutskever used. It keeps a running state with two gates: an update gate $z_i$ that decides how much of the old state to carry forward unchanged, and a reset gate $r_i$ that decides how much of the old state to ignore when proposing a new one. Carrying the old state through nearly untouched is what lets gradients survive over many steps, the same long-memory trick the LSTM buys with more machinery. The output layer is a single maxout hidden layer (it takes the max over pairs of pre-activations, a cheap way to learn a flexible nonlinearity) feeding a softmax over the target vocabulary. In code, one decoder step is the five lines we have been describing:

# one decoder step i: emit target word y_i (pseudocode, bias terms omitted)
# h[j]: source annotations (2n-dim), j = 0..Tx-1 ; s_prev: state s_{i-1} ; y_prev: word y_{i-1}
e   = [v_a @ tanh(W_a @ s_prev + U_a @ h[j]) for j in range(Tx)]  # (7)
a   = softmax(e)                       # weights alpha_ij, sum to 1   (6)
c   = sum(a[j] * h[j] for j in range(Tx))   # context c_i = blend     (5)
s   = gru(s_prev, embed(y_prev), c)    # new decoder state s_i
y_i = softmax(maxout(s, embed(y_prev), c))  # next-word distribution  (4)

None of these pieces is the contribution; they are the solid 2014 parts the contribution is built from. The contribution is the three lines in the middle: score, normalize, blend. Swap the fixed $c$ for that per-step blend and the translator stops forgetting the start of long sentences.
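For readers who want to run the step, here is one way to fill in the pseudocode with NumPy. Treat it as a sketch under assumptions rather than the authors' implementation: weights are random, biases are omitted, the embedding and context share one input matrix per gate (equivalent to separate matrices added together), the gate convention `s = (1 - z) * s_prev + z * s_tilde` is one common choice, and every name in the `shapes` dictionary and every size is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(e):
    z = np.exp(e - e.max())
    return z / z.sum()

def decoder_step(s_prev, y_prev_emb, H, p):
    """One decoder step: attention, gated state update, maxout output layer."""
    # score, normalize, blend: equations (7), (6), (5)
    e = np.tanh(p["W_a"] @ s_prev + H @ p["U_a"].T) @ p["v_a"]
    alpha = softmax(e)
    c = alpha @ H
    # gated hidden unit: update gate z, reset gate r, candidate state s_tilde
    x = np.concatenate([y_prev_emb, c])
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ s_prev)
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ s_prev)
    s_tilde = np.tanh(p["W_s"] @ x + p["U_s"] @ (r * s_prev))
    s = (1 - z) * s_prev + z * s_tilde
    # maxout: max over pairs of pre-activations, then a softmax over the vocabulary
    t = p["W_t"] @ np.concatenate([s, y_prev_emb, c])
    t = t.reshape(-1, 2).max(axis=1)
    probs = softmax(p["W_o"] @ t)
    return s, alpha, probs

rng = np.random.default_rng(0)
T_x, n, m, n_align, l, V = 6, 4, 3, 5, 6, 10   # all sizes are made up for the demo
shapes = {
    "W_a": (n_align, n), "U_a": (n_align, 2 * n), "v_a": (n_align,),
    "W_z": (n, m + 2 * n), "U_z": (n, n),
    "W_r": (n, m + 2 * n), "U_r": (n, n),
    "W_s": (n, m + 2 * n), "U_s": (n, n),
    "W_t": (2 * l, n + m + 2 * n), "W_o": (V, l),
}
p = {name: rng.normal(size=shape) * 0.1 for name, shape in shapes.items()}
H = rng.normal(size=(T_x, 2 * n))              # stand-in annotations, 2n wide

s, alpha, probs = decoder_step(np.zeros(n), rng.normal(size=m), H, p)
print(alpha.shape, probs.shape)                # (6,) (10,)
```

Calling `decoder_step` in a loop, feeding back `s` and the embedding of the chosen word, would give greedy decoding; the paper's beam search is not shown here.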

What the alignments show

The test is WMT'14 English-to-French, trained on 348M words with a 30,000-word vocabulary per language. Two models, matched in size at 1000 hidden units: the fixed-vector baseline (RNNencdec) and the soft-search model (RNNsearch), each trained once on sentences up to 30 words and once up to 50. Quality is BLEU, a modified n-gram precision against reference translations, with a penalty for being too short; higher is better.
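As a rough guide to what the metric measures, the sketch below computes a simplified BLEU for one candidate against one reference. It is an assumption-laden illustration: real evaluations aggregate n-gram counts over a whole test corpus and usually report the result on a 0 to 100 scale, while this version is sentence-level, unsmoothed, and uses a single reference. The function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count every n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Single-reference, sentence-level BLEU without smoothing (a simplification)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # "modified" precision: each candidate n-gram is credited at most
        # as many times as it appears in the reference
        clipped = sum(min(count, ref[g]) for g, count in cand.items())
        total = sum(cand.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # brevity penalty: candidates shorter than the reference are scaled down
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat".split()
print(bleu(reference, reference))                       # 1.0
print(bleu("the the the the the the".split(), reference))  # 0.0
```

A candidate such as "the cat sat on" matches every n-gram it contains but is shorter than the reference, so only the brevity penalty lowers its score.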

On the full test set RNNsearch-50 scores 26.75 BLEU against RNNencdec-50's 17.82, a gap of nearly nine points, roughly the difference between a translation you can read and one you have to decode. The robustness shows up most clearly against sentence length: drag the marker below and the curves tell the story the bottleneck predicted. The fixed-vector model climbs for short sentences and then falls off as the source outgrows its one vector; the soft-search model trained to length 50 stays essentially flat past 50 words, while the one trained only to 30 holds until its training length and then declines. The gap is near zero for short sentences (where one vector is enough) and widens past ten BLEU for long ones (where it is not):

Figure 4 · quality vs sentence length
BLEU against source length, following the trend of the paper's Figure 2. RNNsearch-50 stays flat even past 50 words; RNNencdec-50 (dashed) climbs then collapses as the sentence outgrows its fixed vector; RNNsearch-30 holds to its training length then dips. Drag the marker to read the gap at any length. Plateaus are anchored to the Table 1 scores.

Two smaller numbers make the point sharper. RNNsearch-30, trained only on short sentences, still beats RNNencdec-50 overall (21.50 versus 17.82): the better architecture trained on less wins. And trained longer, RNNsearch-50's best run reaches 28.45 on all sentences, 36.15 once sentences with unknown words are excluded. That second figure edges the conventional phrase-based system Moses, at 35.63, which had years of engineering and an extra 418M words of monolingual text behind it. Be careful with that comparison, though: on the full test set, unknown words and all, Moses still wins 33.30 to 28.45. The honest claim, and the one the paper makes, is that a single neural network drew level with phrase-based translation on known-word sentences, not that it beat it outright.

The most telling result is not a number, it is a picture. Because the weights $\alpha_{ij}$ are a distribution over source positions, you can lay them out as a heatmap and read off, for each target word, which source words it attended to. The alignments come out mostly monotonic, a bright diagonal, which is what you expect between two fairly similar languages. The interesting cells are off the diagonal, where French reorders the words. Hover the rows below: when the model writes "zone économique européenne" for "European Economic Area," it jumps to "Area" first and then walks the adjectives back, producing the anti-diagonal block instead of a straight line:

Figure 5 · the alignment heatmap
Attention weights laid out as a grid: columns are the English source, rows are the French target, brighter cells carry more weight. The main diagonal is the word-for-word alignment; over "European Economic Area" the bright cells form an anti-diagonal because French reverses the order. Hover a French word to trace which source word it reads from. This is the kind of soft alignment the paper shows in its Figure 3.

Nobody told the model that "européenne" aligns with "European." It learned the alignment as a by-product of learning to translate, because reading the right source word is what lets it predict the right target word. That the learned weights agree with how a person would align the sentences, reordering and all, is the evidence that the mechanism is doing what its name claims, attending to the relevant source.
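Reading an alignment off the weights is a one-line operation per target word. The matrix below is invented for illustration and is not taken from the paper; it only mimics the anti-diagonal pattern described for this phrase.

```python
import numpy as np

source = ["European", "Economic", "Area"]
target = ["zone", "économique", "européenne"]

# Made-up attention weights (rows: target words, columns: source words).
A = np.array([
    [0.05, 0.10, 0.85],
    [0.10, 0.80, 0.10],
    [0.90, 0.05, 0.05],
])
assert np.allclose(A.sum(axis=1), 1.0)     # each row is a distribution over the source

for word, row in zip(target, A):
    # the brightest cell in the row is the source word this target word mostly reads
    print(word, "->", source[int(np.argmax(row))])
# zone -> Area
# économique -> Economic
# européenne -> European
```

Taking the argmax discards the hedging that soft weights allow, so it is a reading aid for the heatmap rather than part of the model.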

Trace the fix back and it is one substitution with consequences. A fixed-length context vector chokes on long sentences; replace it with a fresh context per output word, built as a softmax-weighted blend of one annotation per source word, and keep the weighting soft so the whole system still trains by ordinary backprop. That bought a translator that stopped forgetting the start of long sentences and drew level with a decade of phrase-based engineering. The mechanism it introduced, a learned soft search over a set of vectors, proved far more general than translation. Three years later it would be the only thing a Transformer is made of, and the encoder and decoder RNNs around it would be thrown away.

Provenance · Verified against primary literature
Sutskever et al. (2014): The seq2seq baseline: an LSTM encoder folds the source into one fixed vector (q = h_T), reading the input reversed.
Cho et al. (2014a): The RNN Encoder–Decoder and the "gated hidden unit" (the GRU) this paper uses for f.
Cho et al. (2014b): The empirical finding that a fixed-vector encoder–decoder degrades as the source grows. It motivates the bottleneck conjecture.
Schuster & Paliwal (1997): The bidirectional RNN that produces the per-word annotations.
Luong et al. (2015): The multiplicative attention that scores the CURRENT decoder state, a later contrast to this additive scorer.
Vaswani et al. (2017): The Transformer’s scaled dot-product over separate Q, K, V, which drops the RNN this paper still relies on.
Correction: The attention here is ADDITIVE: it scores against the PREVIOUS decoder state $s_{i-1}$ with a one-layer tanh MLP, and each annotation is both the key and the value. That is not the Transformer’s scaled dot-product over separate Q/K/V (2017), nor Luong’s multiplicative score on the current state (2015). The recurrent cell is the GRU ("gated hidden unit"), not the LSTM Sutskever used. The paper says "attention" only in a single passing aside (three times in one sentence); its own working terms are "alignment model" and "soft-search," and the model is "RNNsearch." We never say it beats Moses overall: it draws level only on known-word sentences.

Questions you might still have

Is this the same attention as in the Transformer?
The idea is the same (a softmax-weighted blend of source vectors), but the parts differ. Here the score is a one-layer tanh MLP of the previous decoder state and each annotation, and the same annotation is both key and value. The Transformer (2017) scores with a scaled dot-product over separately projected Q, K, V and removes the RNN entirely. This is the ancestor; that is the descendant.

If the weights are learned, what stops them from drifting away from real alignments?
Nothing forces them. The alignment is never supervised and is not a latent variable; it falls out of maximizing translation likelihood. The weights end up matching linguistic alignment because attending to the right source word is what helps predict the next target word. The heatmaps confirm it after the fact.

Why a weighted average instead of just picking the single best source word?
A hard pick is a step function, so you cannot backpropagate through it; you would need to sample a position and use a higher-variance estimator (REINFORCE), which is exactly what hard attention does. The soft average is differentiable, trains with ordinary backprop, and can hedge across several words when it is unsure.

Does attention remove the fixed-length-vector limit for free?
It removes the single bottleneck, but it adds cost: the alignment model is evaluated once for every (source word, target word) pair, so the work grows with the product of the two lengths. For 15-to-40-word sentences that is cheap; at thousands of tokens it is the quadratic-attention cost that later work spends years fighting.

Did this paper invent the word "attention"?
It uses "attention" in just one passing aside (three times in a single sentence), as an intuition for what the decoder is doing. Its working names are "alignment model" and "(soft-)search," and the model is called RNNsearch. The name "attention" stuck afterward, and "Bahdanau attention" is the retrospective label for this additive form.

Footnotes & further reading

  1. The paper: Bahdanau, Cho, Bengio, Neural Machine Translation by Jointly Learning to Align and Translate (ICLR 2015). Code (GroundHog).
  2. The seq2seq baseline with a single fixed vector: Sutskever, Vinyals, Le, Sequence to Sequence Learning with Neural Networks.
  3. The RNN Encoder–Decoder and the gated hidden unit (GRU): Cho et al., Learning Phrase Representations using RNN Encoder–Decoder, and the length-degradation study that motivates the bottleneck, Cho et al., On the Properties of Neural Machine Translation.
  4. The bidirectional RNN: Schuster & Paliwal, Bidirectional Recurrent Neural Networks (1997).
  5. Later cousins of this attention: Luong et al., Effective Approaches to Attention-based NMT (multiplicative, current state), Xu et al., Show, Attend and Tell (hard vs soft attention), and Vaswani et al., Attention Is All You Need (the Transformer, which keeps only the attention).
  6. BLEU: Papineni et al., BLEU: a Method for Automatic Evaluation of Machine Translation (2002).