Playing Atari with Deep Reinforcement Learning
Raw pixels in, joystick out.
In 2013 a small London lab plugged a convolutional network into a learning rule from 1989, fed it nothing but the screen and the score, and it learned seven Atari games. The same architecture, the same hyperparameters, trained fresh on each game. This is the paper that made deep reinforcement learning a field.
Explaining the paperPlaying Atari with Deep Reinforcement LearningCan a network learn to act, not just to label, when the only feedback is a game score that arrives late, rarely, and already scrambled by its own behavior?
By 2013, deep learning had a winning recipe: collect a mountain of labeled examples, define a fixed target for each one, and grind the error down with stochastic gradient descent. ImageNet fell to it in 2012. Speech recognition fell the year before. The recipe worked because the problems were polite. Every input came with an answer key, the examples were independent of each other, and the thing being learned never changed the data it was learning from.
Reinforcement learning is the impolite version. An agent acts, the world responds, and the only feedback is a scalar reward that can arrive thousands of steps after the action that earned it. The data points are not independent; each frame of a game is nearly identical to the last. And the data distribution is not fixed, because the agent's current behavior decides what it sees next. Every assumption the deep-learning recipe rests on is violated.
This paper ignored the mismatch and made the marriage work anyway. A convolutional network watches the raw Atari screen, outputs a value for every joystick action, and is trained with a variant of Q-learning, a rule from 1989. One mechanism, experience replay, holds the whole thing together. The result beat the best published learning baselines on all seven games when scored the same way, beat every previous approach on six of the seven, and beat an expert human on three. Thirty-eight days after the arXiv posting, Google bought DeepMind. (The timing is documented; how much this particular paper had to do with the price is folklore, so I'll leave it at the timing.)
The plan: build the classical reinforcement-learning tower first (reward, return, the function , the Bellman equation), see why bolting a neural network onto it was considered reckless, and then watch one buffer of old memories defuse the problem. None of the pieces is hard. The trick is that they only work together.
The game only tells you the score
Strip the setup to what the agent actually gets. At each time-step it sees an image , the raw Atari screen (210×160 pixels). It picks an action from the legal joystick moves for that game, between 4 and 18 of them. The emulator advances, and the agent receives a reward : the change in the game score on that step. That is everything. No labels, no demonstrations, no access to the emulator's internal state, no hand-built features saying "this blob is the ball." The paper is explicit that the network learns from what a human player gets: the video, the score, a game-over signal, and the set of buttons.
A policy is whatever rule maps what you've seen to what you press. The agent in this paper uses a deliberately simple one, called -greedy: with probability press a random button, otherwise press the button the network currently rates highest (the paper's phrasing; the random draw can also land on the greedy button, a formality that changes nothing here). The random fraction is the exploration budget. A purely greedy agent only ever tastes the dishes it already likes, so it can never discover that some unrated action is better. During training is annealed from 1 down to 0.1 over the first million frames: start as a pure tourist, end as a professional with a 10% twitch of curiosity. The schedule never reaches zero; a residual 10% of randomness keeps the data varied for as long as training runs, and why that costs nothing in what is learned is a property of Q-learning we'll meet shortly.
The figure below runs this loop on a toy Breakout. Drag to both ends and hold it there. At you are watching minute one of training: random flailing, quick deaths. At the paddle plays its best (its best, not the best). The bars on the right are the agent's action-values, the quantity this whole paper exists to learn.
Keep one number in mind for scale: feedback about a good move can arrive thousands of time-steps after the move. In Breakout, the points for a brick arrive while the ball is still mid-flight from a paddle hit you made seconds ago; in Seaquest, surfacing for oxygen pays off half a minute later. Supervised learning never has this problem, because the label sits right next to the input. Here, the first job is bookkeeping: deciding which of the last thousand actions deserves credit for the point that just landed.
Q: the number every decision needs
The classical answer to delayed credit starts by defining what an action is worth. Future rewards are summed with a discount per step:
is the return: the total discounted haul from now until the episode ends at time . The discount works like inflation on future money. A point promised steps from now is worth points today, so with a reward 100 steps out is worth about 37 cents on the dollar. Why discount at all? Atari episodes end, so the sum is finite anyway; what buys is a bounded scale (no return can exceed however long the episode drags on) and a planning horizon: the agent effectively cares about the next steps and is nearly blind past them. One honest footnote here: this paper never states the it used. The 0.99 everyone quotes comes from the 2015 Nature follow-up.
Now the central object. For a state and action , define
in words: if I take action now and play perfectly forever after, what return should I expect? The max is over all policies , so doesn't depend on how anyone actually plays; it is a fact about the game. And it is exactly the number a player needs. If you had on a cheat sheet, optimal play would be mechanical: at every state, press the action with the biggest entry. The entire problem of playing Atari reduces to estimating one function.
Why value state-action pairs instead of just states? Because turning a state value into a decision requires asking where each action leads, and that requires a model of the game's dynamics, exactly the thing this approach refuses to learn. bakes the one-step lookahead into the function itself, so acting greedily is just comparing entries. The cost contrast is the real argument: with , choosing greedily is one forward pass that scores every action at once, with you would have to simulate each candidate action forward just to know where it lands, one rollout per button, before you could even ask the value function its opinion. (It's also why TD-Gammon, which we'll meet later, could afford to learn : backgammon's rules are known, so it could simulate every legal move and evaluate where each one lands.)
What makes learnable is that it obeys a self-consistency law, the Bellman equation:
Read it as a GPS planning backwards: the fastest time from your corner equals the first leg plus the fastest time from wherever that leg ends. The value of acting now is the reward you collect immediately, plus times the value of the best action available in the state you land in (the expectation averages over the emulator's randomness about where you land; Atari itself is nearly deterministic, a fact that comes back when a competitor gets graded on its single best run). No other function satisfies this identity, which is what makes it usable as a target: if your estimate violates equation (1) somewhere, you know it's wrong there.
It also hands you an algorithm. Treat (1) as an assignment instead of an identity and sweep it over all states:
This is value iteration, and it provably converges, . The reason is worth one sentence, because it explains the figure: each sweep shrinks the worst-case error by a factor of , like photocopying a photocopy at 99% scale, so errors decay geometrically and value information radiates outward from wherever reward actually lives, one step per sweep. Watch it happen, and then drag to its edges: at value cannot travel at all (only the squares touching the reward ever learn anything), and near the whole board lights up almost evenly, because distance has almost stopped mattering.
So why doesn't this solve Atari on the spot? Because value iteration needs a table with one entry per state-action pair, it visits all of them repeatedly, and each backup takes an expectation over the emulator's transition probabilities, a map nobody has. (The gridworld above cheats on both counts: tiny, and the map known.) Atari's state space is every screen the game can show (more precisely every history of screens, as we'll see), which is not a number you store a table for. The paper calls the tabular approach "totally impractical" and points at the standard escape: replace the table with a function approximator, , here a neural network they call a Q-network. A table memorizes; a network generalizes. Two screens that differ by one pixel of scenery share their value estimate, and that sharing is the entire reason this can work with only ten million frames of experience. It is also the reason everyone expected it to explode.
Making a frame into a state
Before learning , there's a quiet problem with itself. Everything above assumed a Markov state, meaning the state carries all the information needed to predict what happens next; given the present, the past is redundant. A single Atari frame is not that. It shows the ball's position and says nothing about its velocity. Two situations with the same frame (ball rising, ball falling) demand different actions but look identical, which the paper calls the screen being perceptually aliased. Feed a single frame to any learner and you've asked it to make distinctions its input does not contain.
The formal fix in the paper is heavy: define the state as the entire history of screens and actions so far. Histories are always Markov (the past can't be missing from the present if the present includes the past), and since episodes end in finitely many steps, this gives a large but finite Markov decision process, the structure that licenses everything in the previous section. The practical fix is light: a function that stacks the last 4 preprocessed frames into one 84×84×4 input. Four frames are a cheap, truncated stand-in for the full history. The truncation is also a ceiling, fixed before training starts: four frames cover well under a second of play, so the state sees reactions, not plans, and anything strategic that unfolds over seconds is invisible by construction; that bound will show up plainly in which games the agent wins and which it loses. They reveal velocity (two frames pin down direction and speed), a bit of acceleration, and they also help with a Stone Age hardware quirk: many Atari games flicker their sprites (the console could only draw a few objects per scanline, so games alternated them across frames), and any single frame can be missing an object that very much exists. The stack catches a blinker only if it shows up in at least one of the four sampled frames; hold that caveat, it comes back as the paper's one per-game tweak.
The payoff: with in place, "one stacked input" behaves approximately like "one state," and the Bellman machinery applies to it. The stack is also a confession of scope. Four frames span well under a second of game time, so anything strategic that lives on longer scales (your oxygen meter has been falling for twenty seconds) is invisible to the state. File that away; it predicts exactly which games DQN loses.
Q-learning: regression on your own guesses
Now the learning rule. We want the network to satisfy the Bellman equation, but we can't evaluate the right side of (1): the expectation needs the emulator's transition probabilities, and we refuse to model them (this is what model-free means: learn values from sampled experience directly, never learn physics). What we do have is samples. Play, and every step hands you a transition : one real observation of "from , doing earned and led to ."
So turn the identity into a regression. Build a target out of the sample and the network's own estimate of the future, and pull the prediction toward it:
Squared error, like any regression, with two strange residents. The distribution , the behaviour distribution, is just "whatever states and actions the agent actually encounters while playing" (textbooks say behaviour policy; same idea). And the label is manufactured from the network's own previous prediction at the next state. One step of measured reality, , plus times a guess. This is called bootstrapping: improving guesses using other guesses, anchored to reality one reward at a time. The paper's own framing of the oddity is the right one: in supervised learning the targets are fixed before training starts; here the targets depend on the weights being trained. Move and the labels move too. (The subscripts: indexes rounds of this regression, and is last round's weights, held fixed while you fit . Hold the notation loosely; in the actual algorithm it collapses to the same live weights, a wrinkle we come back to.)
Differentiate (2), treating the target as a constant, and you get the update direction:
The parenthesis is the TD error (temporal-difference error): how far today's prediction missed the one-step-corrected target. "Treating the target as a constant" deserves its sentence, because it is a choice, not an accident. The target contains too (through the max at ), and equation (3) deliberately does not differentiate through it; this is called a semi-gradient. You could differentiate through the target instead (Baird's "residual gradient" method does), which buys provable convergence but pays for it with slower learning and a bias on random environments, so practice settled on the semi-gradient and its risks. The consequence of the choice: equation (3) is not gradient descent on any fixed objective. It is archery at a goalpost you secretly control: aim at where the target is now, pretending it's bolted down, even though your own adjustment will shift it.
Replace the expectations with single sampled transitions and step after every move, and (3) collapses into the classic tabular rule Watkins wrote down in 1989, Q-learning:
(With one parameter per table cell, is just an indicator, and the SGD step is literally this rule.) For lookup tables, Watkins and Dayan proved in 1992 that this converges to with probability 1, provided every state-action pair keeps being visited and the step sizes decay at the usual stochastic-approximation rate. Hold on to the fine print: the theorem is tabular only. Their paper says outright that it may fail for other representations. One more clause from the same theorem earns its keep later: the experience does not have to arrive in order; in Watkins and Dayan's words, the episodes "need not form a continuous sequence." And DQN is about to use a very other representation.
Watch the rule run on a corridor before scaling it to pixels. Six cells, +1 at the right end, and a table of Q-values learning from single transitions. Scrub slowly around the first time the agent reaches the goal: that update is the only one with real signal in it (terminal states have no future, so the target is just ), and every later update drags that signal one cell further left, exactly the Figure 2 wave rebuilt from samples instead of sweeps.
To feel why a moving target is hard to chase, follow one fixed transition through training, the network seeing an enemy on screen and firing, then watch its TD target and its prediction at three points in the run. The transition never changes; what changes is the weights underneath both numbers. Early, the weights are near-random, so the prediction is noise, and the target is worse than noise: it is built from the network's equally random estimate at the next state, which the ongoing updates are also yanking around. So the target swings violently from one update to the next, not because reality changed but because the estimator it leans on is changing fast. The prediction is chasing a goalpost that is itself sprinting. Midway, the next-state estimate has been anchored to real rewards enough times that it has steadied somewhat, so the target still moves but in smaller jumps, and the prediction has closed most of the gap to it. Late, the network has propagated the reward back along the path many times, the next-state value it bootstraps from has nearly converged, and the target sits almost still; the prediction and the target now nearly agree, which is exactly the Bellman identity in equation (1) being satisfied at this transition.
That arc is qualitative, but it is the whole reason the two stabilizers exist. The early wild swings come from two sources the design attacks separately. One is that consecutive frames are nearly identical, so a stream of them keeps shoving the weights in one direction and amplifies whatever the target is currently doing; sampling old, scattered transitions from the replay buffer instead breaks that correlation and averages the swing down. The other is that the target is rebuilt from the same weights being updated, so every step that moves the prediction also moves the target it was aiming at; the 2015 follow-up's frozen target network holds those target-side weights still for thousands of steps, so the goalpost at least stops sprinting while the prediction runs at it. Both fixes are aimed at the same thing this three-snapshot arc makes visible: making a self-referential target settle down faster and with less thrashing.
One more property, easy to miss and load-bearing for everything that follows. Look at whose behavior the target evaluates: scores the best action at , regardless of what the agent actually did next. Q-learning is a food critic traveling with an unadventurous crowd: the crowd (the ε-greedy behaviour) decides which restaurants get visited, but the review always records "the best dish on the menu here," not the fries your companions actually ordered. This makes Q-learning off-policy: it learns about the greedy policy while following a different, more exploratory one. Its on-policy sibling, Sarsa, instead plugs the action actually taken into the target, diary rather than review. The distinction sounds academic for exactly one more section; it is the property that will make experience replay legal.
So Q-learning is regression where you manufacture your own labels out of your own predictions, and the labels move as you learn. With a table, the theorem holds anyway. The cracks appear when the table becomes a network.
Why everyone expected this to blow up
The idea "neural network + temporal-difference learning" was not new in 2013. It had already had its one great success, twenty years earlier: Tesauro's TD-Gammon, a backgammon program with a one-hidden-layer network that trained purely by self-play and reached the level of the best human players. (For the record, it differed from DQN in both knobs that matter here: it learned a state value rather than , and it learned on-policy from its own games. The DQN paper's phrase "similar to Q-learning" is loose, and the paper itself footnotes the differences.)
Then came the embarrassing decade. The same approach was tried on chess, Go, and checkers, and produced nothing comparable. The field settled on a deflating explanation: backgammon itself had done the heavy lifting. The dice force exploration that a deterministic game never provides, and they smooth the value landscape. Pollack and Blair sharpened the skepticism by showing that even crude hill-climbing, no temporal-difference learning at all, got surprisingly far at backgammon. The lesson the field absorbed: TD-Gammon was a property of backgammon, not a recipe.
Theory then made the pessimism respectable. Tsitsiklis and Van Roy proved that TD methods with nonlinear function approximation can diverge, and Baird built a famous small counterexample (a six-state star) where off-policy updates with a linear approximator drive every weight to infinity, with all rewards zero and the true answer perfectly representable. Nothing about the task is hard in these examples. The divergence comes from the learning dynamics themselves, and you can hear the mechanism in one analogy: audio feedback. The network's estimate at one state is the speaker; through bootstrapped targets it feeds the microphone, the network's estimates at other states, because with shared weights an update here moves predictions everywhere. If the update distribution keeps re-amplifying the same loop, a small overestimate becomes a bigger target becomes a bigger overestimate, and the system squeals.
The modern name for the risky cocktail is the deadly triad: function approximation, bootstrapping, and off-policy data. (The term is Sutton and Barto's, from 2018; this paper predates its own diagnosis.) DQN runs all three at once, with a 680,000-parameter network. And there is a fourth ingredient the triad doesn't even count, the one the paper worries about most: the agent makes its own dataset. Imagine the current weights say "move left." The agent moves left, so the next thousand frames all come from the left side of the screen, so the next thousand updates are about left-side states, so whatever the network believes about the left gets reinforced and whatever it believed about the right decays. Training data generated by the thing being trained is a feedback loop wearing a dataset costume. The paper's citation for where such loops end is Tsitsiklis and Van Roy: oscillation, a poor local rut, or divergence.
This is why serious reinforcement learning had retreated to linear function approximators with convergence guarantees, and why the existing convergent alternatives (gradient-TD methods, which do honest gradient descent on a well-defined objective, the projected Bellman error) had only been shown for evaluation or linear control, never at deep-network scale on a hard domain. The closest prior attempt was Riedmiller's neural fitted Q-learning (NFQ), which trains a neural network on exactly the loss in (2), but as a batch method that refits on the whole accumulated dataset each iteration, at a cost that grows with it; its visual variant first compressed the screen with a deep autoencoder rather than learning from pixels end to end. DQN's claim to "first" is precise: end to end from raw pixels, with cheap constant-cost SGD updates. (Riedmiller is an author here; the lineage is literal.) The 2013 move was to run every warning light at once and add a damper nobody considered sufficient: a buffer of old memories.
Experience replay: train on your past, not your present
The mechanism is almost insultingly simple. At every step, store the transition in a fixed-size replay memory (the paper keeps the most recent million frames). To learn, do not use the transition that just happened. Instead sample a minibatch of 32 transitions uniformly at random from the buffer and regress on those. The freshest experience just joins the library like everything else.
The paper gives three reasons, and each one strikes at a different failure from the last section. First, data efficiency: each transition can be replayed in many updates instead of being used once and discarded, and at ten million frames you want every drop. Second, correlation: consecutive frames are nearly identical, so consecutive-sample minibatches contain one effective data point and 31 echoes; sampling at random breaks those correlations and, in the paper's phrasing, reduces the variance of the updates (32 spread-out samples genuinely average where 32 clones don't). Third, and this is the one aimed at the feedback loop: the training distribution stops being "whatever the current policy is doing this second" and becomes an average over the last million steps of behavior. The left-moving agent still adds left-side frames to the buffer, but it trains on a mixture of every phase it has been through, so the loop's gain drops below the squeal threshold. Old flashcards from the whole semester, shuffled, instead of re-reading the sentence you just wrote and then burning it.
In the figure, the band is your past and the trays are minibatches. The buffer-size slider runs the full range down to its degenerate edge: at the buffer holds only "just now" and replay collapses into exactly the online learning it was built to replace. That edge is the claim, in reverse: everything replay buys lives in the gap between the two trays.
Replay has a price, and it's the punchline of the Q-learning section. A transition sampled from the buffer was generated by an old version of the agent, sometimes a much worse one. A method that learns "the value of how I currently behave" (on-policy, like Sarsa) would be poisoned by this stale data; it would keep computing the value of a policy that no longer exists. The paper states the requirement plainly: learning by replay forces you to learn off-policy, "which motivates the choice of Q-learning." Q-learning never asks who chose the action. A stored is a perfectly valid sample of "what happens if you do in " whether it was chosen by genius or by the dice, and the target's max asks the same question of every sample: what's the best you could do from where it landed? The critic doesn't care that the crowd picked the restaurant two months ago.
Credit where due: replay is not a DeepMind invention. Long-Ji Lin introduced it in 1992, two decades earlier, for robot learning with neural networks; the paper cites his thesis. Lin was actually more cautious, recommending you replay only experiences consistent with the current policy. DQN drops the caution, leans on off-policy Q-learning to make stale data safe, and scales the idea up. The paper also names its own limitation: uniform sampling treats a pivotal, surprising transition and the ten-thousandth frame of nothing happening as equally worth study, and it points at prioritized sweeping as the obvious upgrade. Two years later, the same lab built exactly that (prioritized experience replay). Reading a paper's future-work section after the future happened is one of the small pleasures of this job.
One absence is worth flagging loudly, because the internet routinely misremembers it. The famous frozen target network, a second copy of the weights used to compute targets and refreshed only every 10,000 updates, is not in this paper. That is the 2015 Nature paper's contribution. Here, Algorithm 1 bootstraps from the same live it is updating (the in equation (2) is a fitted-iteration formality that degenerates to the live weights in the actual algorithm). In 2013, experience replay carries the stability burden alone.
The network that reads the screen
The perception machinery is 2013-vintage and pleasingly small. Raw frames are 210×160 with a 128-color palette. Preprocessing converts to grayscale (color triples the input volume and these games barely need it; the paper's stated goal for the whole step is shrinking the input), downsamples to 110×84, and crops an 84×84 square around the playing area. The square crop has a wonderfully unprincipled reason: the team reused the GPU convolution code from Krizhevsky's AlexNet work, and it required square inputs. The last four such frames are stacked, giving the 84×84×4 state from Figure 3.
Then a convolutional network: 16 filters of size 8×8 with stride 4 (ReLU), 32 filters of 4×4 with stride 2 (ReLU), a fully-connected layer of 256 units (ReLU), and a final linear layer. Do the arithmetic and the network is tiny by any modern standard: 4,112 weights in the first conv layer, 8,224 in the second, 663,808 in the fully-connected layer (97.5% of the total), and a head of at most 4,626, for roughly 680,000 parameters in all. This is the network the paper christens the Deep Q-Network: DQN. The agent that beat humans at Pong fits in under three megabytes of float32, smaller than a single attention layer of a small modern language model.
The head is the architectural idea worth stealing. The obvious design, used by prior work, takes state and action as input and outputs one scalar, which means evaluating 18 actions costs 18 forward passes. DQN instead takes only the state and outputs all action values at once, one output unit per legal action: a menu with every price printed, rather than asking the waiter one dish at a time. The max over actions, which the target in (3) and the greedy policy both need constantly, becomes a single forward pass plus an argmax over a vector. Across games, only this head changes width (4 to 18 outputs); every layer upstream is identical.
The training recipe, identical for every game: RMSProp (divide each weight's step by a running average of its own recent gradient magnitudes, so no parameter's update dwarfs another's; at the time citable only as a slide from Hinton's Coursera course, and the paper states no learning rate either, the 0.00025 in common implementations being a 2015 number) with minibatches of 32; annealed 1 → 0.1 over the first million frames, then held at 0.1; ten million frames of training per game; the million-frame replay memory. And to be precise about the headline claim: one architecture and one hyperparameter sheet, but a fresh network is trained from scratch for each game; no weights are shared across the seven. Two more tricks earn their keep. Rewards are clipped to +1, 0, or −1 during training, because raw score scales differ wildly across games (a Breakout brick is 1 to 7 points, a Q*bert ledge is 25) and clipping lets one learning rate serve all seven; the admitted cost is that the agent can't tell a 100-point opportunity from a 1-point one. The mechanism behind "serve all seven": TD targets inherit the reward scale, so without clipping the targets, and with them the gradients, would differ across games as wildly as the raw scores do, and a step size gentle enough for Breakout's small targets would detonate on Q*bert's large ones; clipping puts every game's error signal on one scale, and one learning rate fits it. And the agent acts only every 4th frame, repeating its last action on skipped frames, which buys roughly 4× more episodes of experience per unit of compute since the emulator is much cheaper than the network. (Frame skipping is unrelated to frame stacking; one is about acting cheaply at 15Hz, the other about making the input Markov.) The lone exception to "identical everywhere": Space Invaders used skip 3 instead of 4, because at skip 4 the blinking lasers landed exactly on the skipped frames and became invisible (a skipped frame never enters the stack, so stacking could not save them; the sampling rate itself had to change). The entire field of per-game tuning, in this paper, is that one integer.
Assembled, the whole algorithm fits in a screenful:
# deep Q-learning with experience replay (Algorithm 1)
D = ReplayMemory(N) # holds (phi, a, r, phi2) tuples
theta = init_random()
for episode in range(M):
phi = preprocess(frames) # the 84x84x4 stack
for t in range(T):
a = random_action() if rand() < eps \
else argmax(Q(phi, theta)) # one forward pass, all actions
r, frame = emulator.step(a) # r = clipped score change
frames.append(frame); phi2 = preprocess(frames)
D.store((phi, a, r, phi2))
# learn from the past, not from this moment:
batch = D.sample(32) # uniform random minibatch
ys = [r_j if terminal(p2)
else r_j + gamma * max(Q(p2, theta)) # same live theta,
for (p, a_j, r_j, p2) in batch] # no frozen copy
preds = [Q(p, theta)[a_j] for (p, a_j, r_j, p2) in batch]
sgd_step(grad(mean((ys - preds) ** 2), # one step per batch
wrt=theta, treating_ys_as_constant=True))
phi = phi2Two honest notes on the listing. The published Algorithm 1 writes the greedy line as max_a Q*(φ(s),a;θ); it means argmax (a max returns a value, not a button), and the star is a slip, since the learned network is , not the unknown . We render it corrected. And the marked same live theta is the 2013 signature discussed above: targets come from the very weights being updated, no frozen copy.
That is the entire agent: one small network, one buffer, one regression loop, and a single integer of per-game tuning. All that remains is whether it plays.
What it learned
Before the scoreboard, the paper shows something better: evidence the network learned values in the Bellman sense rather than mere reflexes. They run a trained Seaquest agent and plot its predicted value over a 30-frame stretch. The reward in that stretch is a single spike, the moment a torpedo connects. The prediction is not a spike. It jumps the moment an enemy appears on screen, climbs while the torpedo is in the water, peaks just before impact, and falls back to baseline once the enemy is gone. That curve is equation (1) made visible: the +1 has been paid forward, through , into every state on the path that leads to it. Anticipation is what a value function is.
That same predicted-Q quantity solved a practical problem: how do you even monitor training? Episode scores are so noisy (a small weight change shifts which states the policy visits, which swings the score wildly) that progress plots look like static. The paper's fix: collect a fixed set of states once, with a random policy, before training, and track the average max predicted Q on that held-out set. That curve rises smoothly on every game, and, the claim that mattered most against the divergence history: they report no divergence in any experiment. The paper keeps the register honest: no theorem, no guarantee, just an empirical "the needle stayed on the road." An epoch on their 2013 hardware was 50,000 minibatch updates, about 30 minutes; the paper never totals the run, but at that rate the training curves it plots work out to roughly a day or two of GPU time per game.
The scoreboard, with its evaluation rules stated first since they decide what counts: the learning methods (Sarsa, Contingency, DQN) are scored by average total reward under an ε-greedy policy with ε = 0.05 (so the agent must survive its own occasional random twitch), and "Human" is the median score of an expert after about two hours of practice per game (the paper notes its human numbers run much higher than the ones in the benchmark's own paper). The arena is the Arcade Learning Environment (ALE), the shared harness that gives every method exactly the same screen, score, and joystick. The baselines are not strawmen, and that's the point: Sarsa and Contingency are the best published methods on ALE, both armed with hand-engineered visual features, including tricks like treating each of the 128 colors as an object channel. DQN sees raw pixels and beats both, on all seven games, by large margins.
The abstract says it "outperforms all previous approaches on six of the games," and the missing seventh is worth understanding precisely. The one method DQN doesn't clear everywhere is HNeat, a neuroevolution approach, and the comparison is lopsided by design: HNeat's reported score is its single best episode, replayed in a then-deterministic emulator (the same fixed start state every time, so a memorized action sequence scores forever), while DQN's numbers are averages with random actions injected 5% of the time. Speedrun tape versus season average. Even so, DQN's best episodes beat HNeat's everywhere except Space Invaders (1,075 vs 1,720), which is the "six."
The wins and losses against the human line trace one boundary. DQN wins where value is short-horizon: Breakout (168 vs the human 31, five times the human median), Enduro (470 vs 368), Pong (20 vs −3, where 21 is a shutout), and it comes close on the fourth short-horizon game, Beam Rider (4,092 vs 7,456; the paper calls that close to human performance). It loses badly where strategy is long-horizon: Q*bert (1,952 vs 18,900), Seaquest (1,705 vs 28,010, where playing well means managing oxygen and surfacing, minutes-scale structure), Space Invaders (581 vs 3,690). The paper says it directly: those games "require a strategy that extends over long time scales." Remember the two blinders we installed earlier, a four-frame state that sees a quarter-second of history and a discount horizon of order steps: the failure cases are exactly the games whose structure lives beyond both.
What happened next
The 2015 Nature follow-up scaled the recipe to 49 games and added the one stabilizer this paper lacked: the frozen target network, a copy of the weights used only for computing targets and refreshed every 10,000 updates. That turns the archery range back into something like supervised learning for 10,000 steps at a time: the goalpost is still your own older guess, but at least it holds still while you aim. (The Nature version also grew the network to three conv layers and 512 hidden units, clipped the TD error, trained five times longer, and stated , the number this paper never specifies.) Its ablation, run at this paper's 10-million-frame scale on Breakout, is the cleanest verdict on what mattered: with both fixes the score is 316.8; replay alone keeps 240.7; the target network alone collapses to 10.2; neither, 3.2. The 2013 bet that a memory buffer could carry the stability burden was, in hindsight, the right bet; the target network is a refinement on top of replay, not a substitute for it.
The rest of the lineage reads like the to-do list this paper left in its own margins. Uniform sampling became prioritized experience replay (Schaul et al., the upgrade this paper itself gestures at via prioritized sweeping). The max in the target turned out to systematically overestimate, fixed by Double DQN. The head split into value and advantage streams in Dueling DQN, and Rainbow stacked the lot. The long-horizon games that embarrassed DQN here, and Montezuma's Revenge after it, drove years of work on exploration and memory. And the basic recipe (a value network, a replay buffer, off-policy updates) is still the skeleton of most value-based deep RL today.
Step back and the paper's shape is simple. Q-learning gives a way to learn what actions are worth from delayed scores. Deep networks give a way to read a screen. Each had a known, fatal incompatibility with the other: the network needs decorrelated data with stable targets; the learning rule generates a correlated stream and moves its own targets. Experience replay sits exactly in the crack: it decorrelates the stream, averages away the feedback loop, and is only legal because Q-learning never cared who chose the action. One buffer, three problems. The deep-learning recipe didn't need to be rebuilt for reinforcement learning; it needed a better librarian.
Questions you might still have
Why does the max in the target make Q-learning off-policy?
The target r + γ·max Q(s′,·) evaluates the best available action at s′, whether or not the agent took it. So the answer being learned ("what would greedy play earn from here?") never depends on who generated the data, and a transition collected by an old, worse policy is still a valid sample. That indifference is exactly what makes training on a replay buffer legal.
Does the Q-learning convergence proof cover DQN?
No. Watkins and Dayan’s theorem is for lookup tables, with decaying step sizes and every state-action pair visited forever; their paper warns it may fail for other representations. DQN has no convergence guarantee. The paper’s evidence is empirical: the held-out average max-Q rises smoothly and no run diverged.
Why not just imitate human play with supervised learning?
You’d need labeled demonstrations for every game, and the ceiling would be the demonstrator. The score signal is free, needs no human, and carries no ceiling: DQN beat the human median on three games. The price is everything in this article: delayed credit, correlated data, and self-made targets.
If replay fixes correlation, what fixes the moving target?
In this paper, nothing, and it worked anyway at 10M-frame scale. The 2015 Nature version froze a copy of the weights for 10,000 updates at a time to hold targets still. Its own ablation says replay was the bigger fix: replay alone scored 240.7 on Breakout, target network alone 10.2.
Footnotes & further reading
- The paper: Mnih, Kavukcuoglu, Silver, Graves, Antonoglou, Wierstra, Riedmiller, Playing Atari with Deep Reinforcement Learning (DeepMind, NIPS Deep Learning Workshop 2013).
- Q-learning: Watkins, Learning from Delayed Rewards (PhD thesis, 1989); the convergence proof is Watkins & Dayan, Q-learning (Machine Learning, 1992). Tabular only, by its own statement.
- The textbook for everything in the first half: Sutton & Barto, Reinforcement Learning: An Introduction. The "deadly triad" framing is from the 2018 second edition, named five years after this paper ran all three.
- Experience replay: Lin, Reinforcement Learning for Robots Using Neural Networks (CMU thesis / technical report, 1993; the technique itself appears in his 1992 Machine Learning paper).
- The divergence results: Tsitsiklis & Van Roy, An Analysis of Temporal-Difference Learning with Function Approximation (IEEE TAC, 1997), and Baird, Residual Algorithms: Reinforcement Learning with Function Approximation (ICML 1995).
- The prehistory: Tesauro, Temporal Difference Learning and TD-Gammon (CACM, 1995), and the skeptical postmortem, Pollack & Blair, Why did TD-Gammon Work? (NIPS 1996).
- The benchmark: Bellemare, Naddaf, Veness, Bowling, The Arcade Learning Environment (JAIR, 2013). The HNeat rows come from Hausknecht et al.'s 2013 neuroevolution manuscript (journal version: IEEE TCIAIG 2014).
- The sequel: Mnih et al., Human-level control through deep reinforcement learning (Nature, 2015): target network, 49 games, and the ablation quoted above. The prioritization this paper predicted: Schaul, Quan, Antonoglou, Silver, Prioritized Experience Replay (ICLR 2016).
How could this explainer be improved? Found an error, or something unclear? I read every message.