Sequence to Sequence Learning with Neural Networks
One network reads a sentence into a vector; another writes the translation from it.
Two LSTMs and a single fixed vector turned translation into next-word prediction. One change to the data, reversing the source sentence, made it train.
A single neural network, trained only to predict the next French word, out-translated a hand-built statistical system that had been refined for a decade.
In 2014 a deep neural network could label a photo or transcribe a sound, but only when the input and the output were both fixed in size. An image is a fixed grid of pixels; a class label is one of a fixed list. Feed-forward and convolutional networks need that: the input layer has a set number of slots, the output layer has a set number of slots, and training fills in the weights between. Translation breaks the assumption on both ends. An English sentence can be three words or thirty, its French translation can be a different length again, and which French word lines up with which English word shifts from sentence to sentence. There is no fixed grid to feed in and no fixed grid to read out.
This paper, from Ilya Sutskever, Oriol Vinyals, and Quoc Le at Google, gives a single recipe for any such problem, with almost no assumptions about the structure of the sequences. One network reads the input one token at a time and compresses everything it read into a fixed-length vector. A second network takes that vector and writes the output one token at a time. Translation, summarization, question answering: anything that maps a sequence to a sequence becomes the same shape, an encode step followed by a decode step.
On the WMT'14 English-to-French task, an ensemble of five of these networks scored 34.81 BLEU, clearing the 33.30 of a mature phrase-based statistical machine translation system, and doing it with none of the alignment tables or phrase dictionaries that system was built from.
A few ideas carry it: a language model that writes one word at a time, two LSTMs joined by a single vector, the gating that lets an LSTM remember a whole sentence, and one strange data trick, reversing the source words, that turned a model that barely trained into one that worked. Each is simple on its own.
Variable length in, variable length out
Start with the recurrent neural network, the natural way to push a feed-forward network along a sequence. An RNN keeps a hidden state $h_t$, a vector of numbers that is its running memory, and updates it one input at a time:
$$h_t = \sigma\left(W^{hx} x_t + W^{hh} h_{t-1}\right), \qquad y_t = W^{yh} h_t$$
Read it left to right. At step $t$ the network mixes the new input $x_t$ with everything it has seen so far, summarized in $h_{t-1}$, squashes the result through a sigmoid to get the new state $h_t$, and reads an output $y_t$ off that state. The same weight matrices are reused at every step, so the network handles a sequence of any length with a fixed number of parameters. (The $y_t$ here are raw scores, not probabilities yet; a softmax turns them into a distribution, which matters in a moment.)
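The recurrence can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's code: the matrix names `W_hx`, `W_hh`, `W_yh`, the toy sizes, and the weight values are all assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rnn_step(W_hx, W_hh, W_yh, x_t, h_prev):
    """One step: mix the new input with the old state, squash, read out scores."""
    pre = [a + b for a, b in zip(matvec(W_hx, x_t), matvec(W_hh, h_prev))]
    h_t = [sigmoid(z) for z in pre]
    y_t = matvec(W_yh, h_t)              # raw scores, not probabilities
    return h_t, y_t

def run_rnn(W_hx, W_hh, W_yh, xs, h0):
    """Apply the same weights at every step of a sequence of any length."""
    h, ys = h0, []
    for x_t in xs:
        h, y_t = rnn_step(W_hx, W_hh, W_yh, x_t, h)
        ys.append(y_t)
    return h, ys

# Toy sizes (assumed): 2-d inputs, 3-d hidden state, 2-d outputs.
W_hx = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
W_hh = [[0.2, 0.0, -0.1], [0.0, 0.3, 0.1], [0.4, -0.2, 0.0]]
W_yh = [[1.0, -1.0, 0.5], [0.3, 0.3, 0.3]]
h0 = [0.0, 0.0, 0.0]

h_short, ys_short = run_rnn(W_hx, W_hh, W_yh, [[1.0, 0.0]] * 3, h0)
h_long, ys_long = run_rnn(W_hx, W_hh, W_yh, [[1.0, 0.0]] * 30, h0)
```

The same three matrices serve the 3-step and the 30-step input; only the number of loop iterations changes, and the final state has the same size either way.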
An RNN maps a sequence to a sequence cleanly when the two line up one-to-one and you know the alignment in advance, tagging each word with a part of speech, say. Translation is not like that. The output length differs from the input length, and the word order scrambles. You cannot emit the first French word after reading the first English word, because the first French word might depend on the last English one. The model has to read the entire source before it can responsibly start writing.
So separate the two jobs. Use one RNN to read the entire source and boil it down to a fixed-length vector, then use a second RNN to expand that vector into the target. The reader never has to commit to an output; the writer never has to look at the raw source. Everything turns on that fixed-length vector in the middle: one network reads the source into it, another writes the target out of it, and that split is what this paper contributed.
A model that writes one word at a time
First, what the model computes. The decoder is a conditional language model: a next-word predictor that has been told what the source said. A plain language model assigns a probability to a sentence by the chain rule of probability, splitting the joint probability of all the words into a product of one-word-at-a-time conditionals:
$$p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \ldots, y_{t-1}) \tag{1}$$
Here $x_1, \ldots, x_T$ is the source of length $T$, the target $y_1, \ldots, y_{T'}$ has its own length $T'$ that need not match, and $v$ is the fixed vector the encoder produced. Read the right-hand side as a procedure: the probability of the entire translation is the probability of the first word, times the probability of the second word given the first, times the third given the first two, and so on. This factorization is not an approximation or a modeling choice; it is an exact identity, true of any distribution over sequences. What is a choice is how each factor gets computed: the decoder LSTM reads $v$ and the words written so far, and outputs one factor per step.
Each factor is a softmax over the entire output vocabulary, all 80,000 words. The decoder's state at step $t$ produces one score $s_t(w)$ per vocabulary word $w$, and the softmax exponentiates and normalizes those scores into a probability distribution that sums to one. Every step is therefore an 80,000-way classification, the most expensive operation in the model, which is why it later gets four GPUs to itself:
$$p(y_t = w \mid v, y_1, \ldots, y_{t-1}) = \frac{\exp(s_t(w))}{\sum_{w'} \exp(s_t(w'))}$$
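The two pieces, a softmax per step and a product of per-step factors, fit in a short sketch. The four-word vocabulary and the scores are made up for illustration; subtracting the maximum score before exponentiating is a standard numerical-stability step, not something the paper describes.

```python
import math

def softmax(scores):
    """Exponentiate and normalize scores into a distribution that sums to one."""
    m = max(scores)                          # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(step_scores, target_ids):
    """Sum of per-step log-probabilities: the log of the chain-rule product."""
    return sum(math.log(softmax(s)[y]) for s, y in zip(step_scores, target_ids))

# Toy example (assumed values): 4-word vocabulary, 3-step target.
step_scores = [
    [2.0, 0.1, -1.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.5, 0.5, 0.5, 4.0],
]
target_ids = [0, 1, 3]

p0 = softmax(step_scores[0])
log_p = sequence_log_prob(step_scores, target_ids)
```

Working in log space turns the product into a sum, which is why the training objective and the search scores are both written as log-probabilities.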
Training maximizes the log-probability the model assigns to the correct translations. Over a training set of source-target pairs:
$$\frac{1}{|\mathcal{S}|} \sum_{(T, S) \in \mathcal{S}} \log p(T \mid S)$$
The paper reuses letters here: $S$ is the source sentence and $T$ its target translation (not the input length $T$ from (1)), and $\mathcal{S}$ is the training set. Because the cross-entropy of a prediction against the single correct word is exactly the negative log-probability the model gave that word, maximizing this average log-probability and minimizing the summed cross-entropy are the same objective with the sign flipped, nothing changing but which way you push. During training the model is fed the true previous word at each step rather than its own guess, a shortcut called teacher forcing that turns sequence generation into a pile of independent next-word classification problems and lets the entire target be scored in one pass. At test time there is no true previous word to feed, so the model consumes its own output, and a single wrong guess early can feed itself forward until the rest of the sentence drifts, because the model was never trained on its own mistakes. This train-test gap is called exposure bias, and beam search softens it by keeping several guesses alive instead of betting everything on one.
One detail makes variable length work. Every sentence ends with a special <EOS> token, the end-of-sentence marker, and the decoder emits words until it produces that token, which ends the translation. Because <EOS> is itself one of the words the softmax can pick, the model can produce outputs of any length, with the stopping point set by the prediction rather than fixed in advance.
Two LSTMs and one vector
Now the architecture that computes (1). The model is two separate LSTMs joined at one point. The first, the encoder, reads the source token by token and ignores its outputs; only its final state survives. That final state is the vector $v$, a fixed-length summary of the entire source. The second, the decoder, is initialized with $v$ and runs as the conditional language model above, emitting target words until <EOS>.
Why two LSTMs rather than one shared network for both halves? Reading a sentence down to a summary and writing one back out are different jobs, so giving each its own weights lets each specialize instead of compromising, at almost no extra compute, and it lets you train one encoder against many decoders, one per language pair, sharing the reader across translation directions. And why a deep stack rather than a single layer? Each added layer cut perplexity by nearly ten percent in their experiments, so they settled on four layers of 1000 cells. (Perplexity is roughly how many words the model is effectively choosing among at each step; lower means a sharper next-word predictor.) Depth, not width, was where the gains were.
That depth fixes the size of $v$. Four layers of 1000 cells, and because an LSTM layer carries two state vectors (a cell state and a hidden state, two vectors the LSTM keeps separate), the handoff is $4 \times 1000 \times 2 = 8000$ real numbers. Eight thousand numbers to hold an entire sentence of any length, and they are the only channel between reader and writer. A three-word sentence and a thirty-word sentence get the same 8000 numbers. This is the model's defining constraint, the fixed-vector bottleneck: everything the decoder will ever know about the source has to fit through it, and a long sentence has to survive being crushed into the same 8000 numbers as a short one.
For scale, the model has 384M parameters, of which 64M are the recurrent connections (32M in the encoder, 32M in the decoder). The rest is mostly lookup and read-out: a 160,000-word source embedding (160M), an 80,000-word target embedding (80M), and the 80,000-word output softmax (80M), which sums with the 64M recurrent to the 384M total. So five-sixths of the parameters are vocabulary lookup tables; the recurrent machinery that actually reads and writes the sequence is a comparatively small 64M, a modest engine bolted onto a large dictionary.
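The totals can be checked with a few multiplications. The per-layer figure below is an assumption about how the 32M breaks down: four gate blocks per layer, each reading a 1000-wide input and the 1000-wide previous hidden state, with biases and peephole weights ignored as negligible. The vocabulary sizes, layer count, cell count, and 1000-dimensional embeddings are from the paper.

```python
CELLS = 1000           # LSTM cells per layer
LAYERS = 4
EMBED = 1000           # embedding width
SRC_VOCAB = 160_000
TGT_VOCAB = 80_000

# Assumed breakdown: 4 gate blocks, each a (input + recurrent) x cells matrix.
per_layer = 4 * (EMBED + CELLS) * CELLS
encoder = LAYERS * per_layer
decoder = LAYERS * per_layer
recurrent = encoder + decoder

src_embed = SRC_VOCAB * EMBED          # source lookup table
tgt_embed = TGT_VOCAB * EMBED          # target lookup table
softmax_out = CELLS * TGT_VOCAB        # read-out over the target vocabulary

total = recurrent + src_embed + tgt_embed + softmax_out
vocab_share = (src_embed + tgt_embed + softmax_out) / total
```

Under that assumption each layer comes to 8M weights, each four-layer stack to 32M, and the three vocabulary-sized tables account for 320M of the 384M.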
Why an LSTM, not a plain RNN
The encoder has to carry the first word of the source all the way to the end, across every intervening word, without it fading. A plain RNN cannot. Propagating an error signal back through the plain RNN above multiplies it by the recurrent matrix at every step, and a product of many similar factors either shrinks toward zero or grows without bound, the vanishing and exploding gradient problem. With vanishing gradients the network never learns that word one mattered for word fifty: the signal connecting them has decayed to nothing by the time it arrives.
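A one-line calculation shows how fast repeated multiplication drifts. This is a deliberately crude sketch: a single scalar `factor` stands in for the effect of the recurrent matrix, which in a real network is a matrix product whose behavior depends on its spectrum and on the nonlinearity.

```python
def signal_after(steps, factor):
    """Size of a unit error signal after being multiplied by `factor` at every step."""
    return factor ** steps

shrink = signal_after(50, 0.9)    # factor slightly below 1: the signal fades
grow = signal_after(50, 1.1)      # factor slightly above 1: the signal blows up
```

Fifty steps at 0.9 leaves well under one percent of the signal, while fifty steps at 1.1 multiplies it more than a hundredfold; there is very little room between the two regimes.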
The LSTM (long short-term memory) fixes this with a second, protected lane. Alongside the hidden state it keeps a cell state, and instead of overwriting the cell each step it adds to it. The cell has a near-pass-through self-connection, so information, and the gradient flowing backward through it, can ride along for hundreds of steps without being squashed. Three small sigmoid-valued controllers, the gates, regulate the lane: an input gate decides how much new information to write, a forget gate decides how much of the old memory to keep, and an output gate decides how much of the memory to expose at this step. Because storing is separated from reading and writing, the cell can hold a fact untouched and release it exactly when needed, which is what carrying a sentence to its end demands. (This paper uses the LSTM formulation from Graves, which includes the forget gate and peephole connections, additions that came after the original 1997 design.)
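The additive cell update is easiest to see with a single scalar cell. The sketch below uses the common textbook gate equations without peephole connections, so it is simpler than the Graves formulation the paper uses; the weight layout in `w` and the bias values are hypothetical, picked to force the gates fully open or closed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, w):
    """One scalar LSTM step. `w` maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = w[name]
        return w_x * x + w_h * h + bias
    i = sigmoid(pre("input"))       # how much new information to write
    f = sigmoid(pre("forget"))      # how much of the old memory to keep
    o = sigmoid(pre("output"))      # how much of the memory to expose
    g = math.tanh(pre("cand"))      # candidate value to write
    c_new = f * c + i * g           # additive update of the protected lane
    h_new = o * math.tanh(c_new)
    return h_new, c_new

# Gates biased to "keep everything, write nothing" (assumed values).
hold = {"input": (0.0, 0.0, -20.0), "forget": (0.0, 0.0, 20.0),
        "output": (0.0, 0.0, 20.0), "cand": (1.0, 0.0, 0.0)}
h, c = 0.0, 0.75
for _ in range(200):
    h, c = lstm_step(1.0, h, c, hold)

# Same cell with the forget gate closed: the memory is wiped in one step.
wipe = dict(hold, forget=(0.0, 0.0, -20.0))
_, c_wiped = lstm_step(1.0, 0.0, 0.75, wipe)
```

With the forget gate near one and the input gate near zero, the stored 0.75 survives 200 steps essentially unchanged, which is the behavior a plain RNN's overwrite-every-step update cannot provide.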
One precision the paper is careful about: the LSTM mitigates vanishing gradients, it does not abolish them, and it can still suffer the opposite problem. Gradients can explode along the input and gate paths even when the cell-state lane is well behaved. That is why the training recipe still clips gradients, a step covered below, rather than trusting the architecture alone.
Reverse the source sentence
With the architecture in place, the model trained, but not well. The fix was not in the network. The authors reversed the word order of every source sentence, feeding the encoder $x_T, \ldots, x_1$ instead of $x_1, \ldots, x_T$ while leaving the target in its normal order (reverse the target too and the effect below disappears). One change to the data, no change to the model, and the test perplexity fell from 5.8 to 4.7 while the test BLEU rose from 25.9 to 30.6. Nearly five BLEU points, larger than most architectural tweaks ever buy, from reversing a list.
Why reversing helps is subtle. Take a source and a target of $n$ words each. The model reads the entire source and then writes the target as one long chain of timesteps: source words occupy steps 1 through $n$, target words steps $n+1$ through $2n$. The time lag of a word pair is how many steps separate a source word from its matching target word, and a long lag is what backpropagation has to bridge to link them. With the source forward, word $i$ sits at step $i$ and its translation at step $n+i$, so every pair is exactly $n$ steps apart. Reverse the source and word $i$ moves to step $n+1-i$, so its lag becomes $2i-1$: the first pair collapses to a single step while the last pair stretches to $2n-1$.
The popular telling says reversing brings corresponding words closer together. It does not. The average lag is $n$ either way, because the average of $2i-1$ over $i = 1, \ldots, n$ is exactly $n$, the same as the constant $n$ of the forward case. Reversing does not move words closer on average; it redistributes the lags, collapsing the minimal one from $n$ to 1 at the cost of stretching the maximal one.
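The lag arithmetic is quick to verify. The sketch assumes a source and target of the same length `n` with word `i` of the source matching word `i` of the target, which is the idealized monotone alignment the argument uses; real sentence pairs are messier.

```python
def lags(n, reverse_source):
    """Steps between source word i and target word i, for i = 1..n."""
    out = []
    for i in range(1, n + 1):
        src_step = n + 1 - i if reverse_source else i   # where source word i is read
        tgt_step = n + i                                # where target word i is written
        out.append(tgt_step - src_step)
    return out

n = 10
forward = lags(n, reverse_source=False)
rev = lags(n, reverse_source=True)

forward_mean = sum(forward) / n
rev_mean = sum(rev) / n
```

For `n = 10` the forward lags are all 10, while the reversed lags run 1, 3, 5, up to 19. Both lists average 10; only the spread differs.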
Why does a short minimal lag help so much when the average is unchanged? Early in training the network knows nothing, and a long gradient path decays before it carries any usable signal. Putting the first few word pairs one step apart gives backpropagation a short, undecayed path to learn from immediately. The authors describe it as letting the model "establish communication" between source and target: once the opening words are linked, the rest of the alignment bootstraps from there. The forward model has no such foothold, every pair is equally far, so it struggles to get started at all.
The authors expected reversing to help only the beginning of each translation and to hurt the end, where lags now stretch longest. The opposite happened. Reversed models did better on long sentences, not worse, which they attribute to the network learning to use its memory more effectively once it had a way in.
Training the network
The optimizer is plainer than anything in a modern recipe. Stochastic gradient descent with no momentum (the paper does not justify dropping it; with gradients clipped and the rate hand-tuned, the bare update was enough), a fixed learning rate of 0.7 held for five epochs, then halved every half-epoch through 7.5 epochs total. All parameters initialized uniformly in $[-0.08, 0.08]$. Adam, the adaptive optimizer that would later train almost everything, did not exist yet (it was published months after this work), so the schedule was tuned by hand.
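The schedule is simple enough to write as a function of the epoch count. One detail is an assumption here: the paper says halving began after five epochs, and this sketch takes the first halving to apply at epoch 5.0 exactly rather than at 5.5. The function and parameter names are illustrative.

```python
def learning_rate(epoch, base=0.7, hold_epochs=5.0, halve_every=0.5):
    """Learning rate at a fractional epoch count.

    Assumption: the first halving takes effect at epoch 5.0 exactly.
    """
    if epoch < hold_epochs:
        return base
    halvings = int((epoch - hold_epochs) / halve_every) + 1
    return base * 0.5 ** halvings

# Rates at every half-epoch from 0.0 to 7.0.
schedule = [(e / 2, learning_rate(e / 2)) for e in range(0, 15)]
```

Under this reading the rate is halved five times over the last 2.5 epochs, ending at 0.7 / 32, a little under 0.022.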
The one safeguard the LSTM still needs is against exploding gradients. The paper enforces a hard cap on the gradient's overall size. For each batch, take the gradient already averaged over the 128 sequences, call it $g$, and measure its length $s = \lVert g \rVert_2$. If $s$ exceeds 5, rescale the whole vector down to length 5:
$$g \leftarrow \frac{5\,g}{s} \quad \text{if } s > 5$$
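As a standalone function the rescaling looks like this. The sketch treats the gradient as one flat list of numbers, which is a simplification: a real model would compute the norm across all parameter tensors together. The function name is illustrative.

```python
import math

def clip_by_global_norm(grad, max_norm=5.0):
    """Rescale the whole gradient by one scalar if its L2 norm exceeds max_norm."""
    s = math.sqrt(sum(g * g for g in grad))
    if s > max_norm:
        return [max_norm * g / s for g in grad], s
    return list(grad), s

small, s_small = clip_by_global_norm([0.3, -0.4])     # norm 0.5: passes through
big, s_big = clip_by_global_norm([30.0, -40.0])       # norm 50: scaled down to norm 5
```

The large gradient comes back as roughly `[3.0, -4.0]`: the same direction, one tenth the length.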
This is global-norm clipping, not a per-weight clamp: it shrinks the entire gradient vector by one scalar factor, so the direction of the step is preserved exactly and only its magnitude is capped. Most steps fall under the threshold and pass through untouched; clipping acts only on the rare spike that would otherwise blow up the weights. The full step, teacher-forced, for one pair:
# one SGD step on a (source, target) pair, teacher-forced
src = reverse(tokenize(source))              # reverse the SOURCE only
tgt = tokenize(target) + ["<EOS>"]           # target stays in order
h, c = zeros(4, 1000), zeros(4, 1000)        # 4 layers of LSTM state, two separate arrays
for w in src + ["<EOS>"]:                    # read the source, one token a step
    h, c = encoder_lstm(embed_src[w], h, c)
v = (h, c)                                   # the fixed vector: 4*1000*2 = 8000 nums
h, c = v                                     # seed the decoder with v
loss, prev = 0, "<EOS>"                      # <EOS> also starts the output
for w in tgt:                                # teacher forcing: feed the TRUE word
    h, c = decoder_lstm(embed_tgt[prev], h, c)
    p = softmax(W_out @ h[-1] + b)           # top layer's state -> distribution over 80k words
    loss += -log(p[w])                       # cross-entropy, summed over the target
    prev = w
g = grad(loss) / 128                         # this pair's share of the 128-sequence batch average
s = norm(g)                                  # global L2 norm of the gradient
if s > 5:
    g = 5 * g / s                            # global-norm gradient clip at 5
params -= 0.7 * g                            # plain SGD, no momentum
Two practical notes finish the recipe. Sentences vary wildly in length, so a random batch of 128 mixes a few long sentences with many short ones, and the long ones stall the rest; sorting each batch to hold sentences of similar length gave a clean 2× speedup. And a single GPU managed only about 1,700 words per second, too slow, so the model was split across an 8-GPU machine: one LSTM layer per GPU on four of them, the 80,000-word softmax split across the other four, each owning a quarter of the vocabulary, so each multiplies the 1000-wide state by a 1000×20,000 slice. That reached 6,300 words per second, and even so, training ran about ten days.
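Length-grouped batching can be sketched as sort, chunk, shuffle. The paper only says that sentences in a minibatch were kept at roughly the same length; sorting by source length and shuffling the batch order, as done here, is one plausible way to get that and is an assumption, as are the function names and the synthetic data.

```python
import random

def length_bucketed_batches(pairs, batch_size=128, seed=0):
    """Group (source, target) pairs so each batch holds sources of similar length."""
    ordered = sorted(pairs, key=lambda p: len(p[0]))
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    random.Random(seed).shuffle(batches)     # shuffle batch order, not batch contents
    return batches

def padded_steps(batches):
    """Timesteps spent if every batch runs as long as its longest source."""
    return sum(max(len(src) for src, _ in b) * len(b) for b in batches)

# Synthetic corpus (assumed): 1280 pairs with lengths between 1 and 50.
rng = random.Random(1)
pairs = []
for _ in range(1280):
    n = rng.randint(1, 50)
    pairs.append((["w"] * n, ["v"] * n))

random_batches = [pairs[i:i + 128] for i in range(0, len(pairs), 128)]
bucketed = length_bucketed_batches(pairs)

random_cost = padded_steps(random_batches)
bucketed_cost = padded_steps(bucketed)
```

In a random batch nearly every sentence waits for a near-maximum-length one, so `random_cost` sits close to the worst case; the bucketed batches spend far fewer wasted steps. The size of the saving on real data depends on the length distribution.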
Reading the translation out
Training scores known translations. Producing a new one means finding the target the model thinks is most likely:
$$\hat{T} = \arg\max_{T} \, p(T \mid S)$$
That maximization is intractable: there are $|V|^{T'}$ possible targets, the vocabulary size raised to the output length, so even a 10-word sentence over the 80,000-word vocabulary is $80{,}000^{10} \approx 10^{49}$ candidates, far too many to enumerate. The tempting shortcut is greedy decoding: at each step emit the single highest-probability next word. It fails because the most probable first word need not begin the most probable sentence. Picking the locally best word now can paint the model into a corner where every continuation is poor, and the product of per-step probabilities ends up lower than a path that started with a slightly worse word.
Beam search is the standard middle ground between greedy and exhaustive. It keeps the $B$ best partial translations alive at once. At each step it extends every kept hypothesis by every possible next word, scores each extension by its cumulative log-probability, and prunes back to the $B$ best; a hypothesis that emits <EOS> is set aside as complete. With $B = 1$ this is greedy decoding, which can commit to a dead end; a wider beam can recover the better sentence.
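A toy model makes the difference between a beam of 1 and a beam of 2 concrete. Everything here is made up for illustration: the `TOY` table of next-word probabilities stands in for the decoder, and `max_len` is a safety cap added for the sketch.

```python
import math

EOS = "<EOS>"

# Hypothetical next-word probabilities, keyed by the prefix written so far.
TOY = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.55, "dog": 0.45},
    ("a",): {"cat": 0.9, "dog": 0.1},
}

def next_probs(prefix):
    """After two words the toy model always ends the sentence."""
    return TOY.get(tuple(prefix), {EOS: 1.0})

def beam_search(next_probs, width, max_len=10):
    beam = [([], 0.0)]                       # (tokens, cumulative log-prob)
    done = []
    for _ in range(max_len):
        cand = [(toks + [w], lp + math.log(p))
                for toks, lp in beam
                for w, p in next_probs(toks).items()]
        cand.sort(key=lambda t: -t[1])       # best cumulative log-prob first
        beam = []
        for toks, lp in cand[:width]:        # prune to the `width` best
            (done if toks[-1] == EOS else beam).append((toks, lp))
        if not beam:
            break
    return max(done or beam, key=lambda t: t[1])

greedy_tokens, greedy_lp = beam_search(next_probs, width=1)
best_tokens, best_lp = beam_search(next_probs, width=2)
```

With `width=1` the search takes "the" (0.6) and is stuck with a 0.33 sentence. With `width=2` it also keeps "a" (0.4) and finds "a cat" at 0.36, a better sentence that starts with a worse word.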
The width barely needs to be wide. A beam of 1 already worked, and a beam of 2 captured most of the benefit: the ensemble's BLEU went 33.00 at beam 1, 34.50 at beam 2, and only 34.81 at beam 12, sharp diminishing returns after the second hypothesis. A small beam being enough is convenient, since width costs compute linearly. The decode loop, fed the same fixed vector as in training:
# decode: search for the most likely target, beam width B
def translate(source, B):
    v = encode(reverse(tokenize(source)))            # same fixed vector as training
    beam = [(["<EOS>"], 0.0, v)]                     # (tokens, log-prob, state)
    done = []
    while beam:
        cand = []
        for toks, lp, (h, c) in beam:
            h, c = decoder_lstm(embed_tgt[toks[-1]], h, c)
            p = softmax(W_out @ h[-1] + b)           # top layer's state -> 80k next-word probabilities
            for w in vocab:                          # extend by EVERY word, then prune
                cand.append((toks + [w], lp + log(p[w]), (h, c)))
        cand.sort(key=lambda t: -t[1])               # rank by cumulative log-prob
        beam = []
        for hyp in cand[:B]:                         # keep only the B best prefixes
            if hyp[0][-1] == "<EOS>":
                done.append(hyp)                     # finished: set aside as complete
            else:
                beam.append(hyp)                     # still open: extend next step
    return max(done, key=lambda t: t[1])             # best finished hypothesis
What it scored
Translation quality is measured in BLEU, which counts how many short word sequences the candidate shares with a reference translation, from single words up to four-grams, multiplied by a penalty for being too short. It rewards surface overlap, not meaning, but it correlates well enough to have anchored the field for years. It runs from 0 to 100; in practice anything in the 30s is a usable translation, and a single point is a difference a human reader can notice, so the nearly 5 points from reversing and the 1.5-point gap over the baseline are real, not rounding. The paper reports cased BLEU computed with the standard multi-bleu.pl script.
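The shape of the metric, clipped n-gram overlap times a brevity penalty, can be sketched for a single sentence pair. This is a simplified illustration and not multi-bleu.pl: real BLEU pools counts over a whole test corpus, and the function names here are made up for the example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    """Simplified single-sentence, single-reference BLEU on a 0-100 scale."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())  # clipped matches
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0                       # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return 100.0 * brevity * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat".split()
exact = simple_bleu(reference, reference)
truncated = simple_bleu("the cat sat on".split(), reference)
unrelated = simple_bleu("dogs bark loudly at night".split(), reference)
```

The truncated candidate matches every n-gram it contains, so all four precisions are 1, yet it scores about 61 rather than 100: the brevity penalty is what stops a system from winning by saying less.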
The headline 34.81 is an ensemble of five LSTMs that differ only in random initialization and minibatch order, decoded with a beam of 12. A single reversed LSTM scored 30.59, and a single forward LSTM (the one without the reversing trick) only 26.17; only the ensemble cleared the baseline. And the scope of the claim is precise: 34.81 beats the phrase-based SMT baseline of 33.30, not the 37.0 best system in the competition. The achievement is that a purely neural translator beat a strong phrase-based system at all, which had never happened on a task this large, not that it won the bake-off.
There was a second way to use the model. Instead of translating from scratch, take the statistical system's 1000 best candidate translations and re-score each one with an even, equal-weight average of the system's own score and the LSTM's log-probability. That hybrid reached 36.5, within half a BLEU point of the 37.0 best result. The LSTM did not have to win on its own to be useful; even as a re-ranker bolted onto an existing system, it added more than three BLEU points over the 33.30 baseline.
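The re-ranking step is a small piece of arithmetic. In this sketch the candidate sentences and both sets of scores are invented, and it assumes the two scores are on comparable scales so that a plain 50/50 average is meaningful; the function name is illustrative.

```python
def rescore(nbest, lstm_log_prob):
    """Pick the candidate with the best even average of SMT score and LSTM log-prob."""
    scored = [(cand, 0.5 * smt + 0.5 * lstm_log_prob(cand)) for cand, smt in nbest]
    return max(scored, key=lambda t: t[1])[0]

# Hypothetical n-best list: (candidate, statistical system's score).
nbest = [
    ("le chat est assis", -4.0),
    ("le chat assis", -3.5),
    ("chat le assis est", -3.8),
]
# Hypothetical LSTM log-probabilities for the same candidates.
toy_lstm = {
    "le chat est assis": -2.0,
    "le chat assis": -5.0,
    "chat le assis est": -9.0,
}

smt_choice = max(nbest, key=lambda t: t[1])[0]
rescored_choice = rescore(nbest, toy_lstm.get)
```

The statistical system alone prefers "le chat assis"; after averaging in the LSTM's opinion the fluent "le chat est assis" wins. The LSTM never has to generate anything, only to score what it is shown.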
It did not break on long sentences
The fixed vector looked like it should fail here. If the entire source has to fit in 8000 numbers, a long sentence should overflow it and translation quality should collapse as sentences grow, which is what earlier fixed-vector models had shown. Instead the LSTM held steady.
Reversing the source is why: by getting the optimization started, it let the network learn to pack a long sentence into the vector without losing the early words. The fixed vector is a real limit, but a less crippling one than everyone expected, and the next year's work on attention would lift it.
What the one vector learned
If a sentence is squeezed into 8000 numbers, what survives the squeeze? You cannot eyeball 8000 dimensions, so PCA flattens them to the two directions that vary most across a handful of phrases, a lossy shadow but enough to see whether sentences cluster by meaning or by surface form. Swap who does what to whom and the point jumps a long way; rewrite the same event in the passive voice and it barely moves.
That the vector separates "Mary admires John" from "John admires Mary" means it is not a bag-of-words; a model that just counted words would place those identical. That it keeps an active sentence near its passive paraphrase means it has caught something closer to meaning than to surface form. This is a qualitative read of a hand-picked set of phrases, not a measured benchmark, and the paper hedges accordingly ("fairly insensitive" to voice, not flatly invariant). Suggestive, not proof, but it is the kind of representation the single-vector bet was hoping for.
The fixed vector runs forward from here as the thread through everything after. Cho and colleagues, the same year, had used a similar encoder-decoder to re-score a statistical system rather than translate outright. The following year, Bahdanau and colleagues named the fixed vector as the bottleneck and removed it: their decoder learns to look back over the whole source and attend to a different part of it for each word it writes, the mechanism now known as Bahdanau attention. Attention generalized, recurrence was dropped, and the encoder-decoder became the Transformer. The single vector this paper crushed a sentence into became the constraint worth attacking, and most of what came next was the attack on it.
Questions you might still have
If reversing the source does not change the average distance, why does it help so much?
Because early in training a long gradient path decays before it carries any signal. Reversing collapses the minimal lag, putting the first source and target words one step apart, so backpropagation has a short, undecayed path to learn from and can "establish communication" between the two languages. The rest of the alignment bootstraps from there. It also, unexpectedly, improved performance on long sentences.
Is the 34.81 BLEU from a single network?
No. 34.81 is an ensemble of 5 reversed LSTMs decoded with a beam of 12. A single reversed LSTM scored 30.59, and a single forward LSTM only 26.17. The ensemble beat the phrase-based SMT baseline (33.30) but not the best WMT’14 system (37.0); using the LSTM to rescore the statistical system’s 1000-best reached 36.5.
Why crush an entire sentence into one fixed vector instead of letting the decoder look back at the source?
This model deliberately uses one fixed vector, which is why long sentences were expected to fail. Letting the decoder look back at the source, attending to a different part for each output word, is exactly what attention added the next year (the Bahdanau attention paper), and it is the change that removed this bottleneck.
Why plain SGD and not Adam?
Adam did not exist yet; it was published months after this work. The recipe is stochastic gradient descent with no momentum, a fixed learning rate of 0.7 that is halved every half-epoch after epoch five, and a global-norm gradient clip at 5 to catch the LSTM’s occasional exploding gradient.
Footnotes & further reading
- The paper: Sutskever, Vinyals, Le, Sequence to Sequence Learning with Neural Networks (Google, NeurIPS 2014). The version explained here is arXiv v3; it reports 384M parameters in total.
- The LSTM cell: Hochreiter & Schmidhuber, Long Short-Term Memory (1997); the forget gate is from Gers, Schmidhuber & Cummins (2000). This paper uses the formulation in Graves, Generating Sequences with Recurrent Neural Networks (2013).
- Global-norm gradient clipping: Pascanu, Mikolov & Bengio, On the difficulty of training Recurrent Neural Networks (2013).
- The contemporary encoder-decoder: Cho et al., Learning Phrase Representations using RNN Encoder-Decoder (2014), which used a GRU and re-scored a statistical system.
- Attention, which removed the fixed-vector bottleneck: Bahdanau, Cho & Bengio, Neural Machine Translation by Jointly Learning to Align and Translate (2014).
- BLEU: Papineni, Roukos, Ward & Zhu, BLEU: a Method for Automatic Evaluation of Machine Translation (2002). The 37.0 best-WMT'14 figure is the authors' own recomputation (under multi-bleu.pl) of Durrani et al.'s system, originally reported as 35.8.