Reinforcement learning · Deep learning

Playing Atari with Deep Reinforcement Learning

Raw pixels in, joystick out.

In 2013 a small London lab plugged a convolutional network into a learning rule from 1989, fed it nothing but the screen and the score, and it learned seven Atari games. The same architecture, the same hyperparameters, trained fresh on each game. This is the paper that made deep reinforcement learning a field.

Explaining the paperPlaying Atari with Deep Reinforcement LearningMnih, Kavukcuoglu, Silver, Graves, Antonoglou, Wierstra, Riedmiller · DeepMind · NIPS Deep Learning Workshop 2013 · arXiv:1312.5602 ↗

A network learns to act, not just to label, with no feedback but a game score that arrives late, rarely, and already scrambled by its own behavior.

By 2013, deep learning had a winning recipe: collect a mountain of labeled examples, define a fixed target for each one, and grind the error down with stochastic gradient descent. ImageNet fell to it in 2012. Speech recognition fell the year before. The recipe worked because the problems were polite. Every input came with an answer key, the examples were independent of each other, and the thing being learned never changed the data it was learning from.

Reinforcement learning is the impolite version. An agent acts, the world responds, and the only feedback is a scalar reward that can arrive thousands of steps after the action that earned it. The data points are not independent; each frame of a game is nearly identical to the last. And the data distribution is not fixed, because the agent's current behavior decides what it sees next. Every assumption the deep-learning recipe rests on is violated.

This paper ignored the mismatch and made the marriage work anyway. A convolutional network watches the raw Atari screen, outputs a value for every joystick action, and is trained with a variant of Q-learning, a rule from 1989. The whole thing is held together by experience replay. The result beat the best published learning baselines on all seven games when scored the same way, beat every previous approach on six of the seven, and beat an expert human on three. Thirty-eight days after the arXiv posting, Google bought DeepMind. (The timing is documented; how much this particular paper had to do with the price is folklore, so I'll leave it at the timing.)

Start with the classical reinforcement-learning pieces (reward, return, the function $Q^*$ , the Bellman equation), see why bolting a neural network onto it was considered reckless, and then watch one buffer of old memories defuse the problem. Taken apart they are familiar; the danger is in running them together.

The game only tells you the score

The setup reduces to what the agent actually gets. At each time-step it sees an image $x_t$ , the raw Atari screen (210×160 pixels). It picks an action $a_t$ from the legal joystick moves for that game, between 4 and 18 of them. The emulator advances, and the agent receives a reward $r_t$ : the change in the game score on that step. That is everything. No labels, no demonstrations, no access to the emulator's internal state, no hand-built features saying "this blob is the ball." The paper is explicit that the network learns from what a human player gets: the video, the score, a game-over signal, and the set of buttons.

A policy is whatever rule maps what you've seen to what you press. The agent in this paper uses a deliberately simple one, called $\varepsilon$ -greedy: with probability $\varepsilon$ press a random button, otherwise press the button the network currently rates highest (the paper's phrasing; the random draw can also land on the greedy button, a formality that changes nothing here). The random fraction is the exploration budget. A purely greedy agent only ever tastes the dishes it already likes, so it can never discover that some unrated action is better. During training $\varepsilon$ is annealed from 1 down to 0.1 over the first million frames: start as a pure tourist, end as a professional with a 10% twitch of curiosity. The schedule never reaches zero; a residual 10% of randomness keeps the data varied for as long as training runs, and it costs nothing in what's learned, for a reason rooted in Q-learning that we'll meet shortly.

The figure below runs this loop on a toy Breakout. Drag $\varepsilon$ to both ends and hold it there. At $\varepsilon = 1$ you are watching minute one of training: random flailing, quick deaths. At $\varepsilon = 0$ the paddle plays its best (its best, not the best). The bars on the right are the agent's action-values, the quantity this paper exists to learn.

Figure 1 · the loop

ε0.10

The reinforcement-learning loop on a toy Breakout. The agent sees pixels, scores every action with illustrative Q-bars (a hand-scripted stand-in for the values DQN will learn), and acts ε-greedily (violet = a random exploration step). The only feedback flowing back is r = the change in score. Training anneals ε from 1 to 0.1: the right end of the slider is how every run begins.

Feedback about a good move can arrive thousands of time-steps after the move. In Breakout, the points for a brick arrive while the ball is still mid-flight from a paddle hit you made seconds ago; in Seaquest, surfacing for oxygen pays off half a minute later. Supervised learning never has this problem, because the label sits right next to the input. Here, the first job is bookkeeping: deciding which of the last thousand actions deserves credit for the point that just landed.

Q: the number every decision needs

The classical answer to delayed credit starts by defining what an action is worth. Future rewards are summed with a discount $\gamma$ per step:

R_t \;=\; r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \cdots \;=\; \sum_{t'=t}^{T} \gamma^{\,t'-t}\, r_{t'}

$R_t$ is the return: the total discounted haul from now until the episode ends at time $T$ . The discount works like inflation on future money. A point promised $k$ steps from now is worth $\gamma^k$ points today, so with $\gamma = 0.99$ a reward 100 steps out is worth about 37 cents on the dollar. Why discount at all? Atari episodes end, so the sum is finite anyway; what $\gamma$ buys is a bounded scale (no return can exceed $r_{\max}/(1-\gamma)$ however long the episode drags on) and a planning horizon: the agent effectively cares about the next $1/(1-\gamma)$ steps and is nearly blind past them. This paper never states the $\gamma$ it used. The 0.99 everyone quotes comes from the 2015 Nature follow-up.

Now the central object. For a state $s$ and action $a$ , define

Q^*(s,a) \;=\; \max_{\pi}\; \mathbb{E}\big[\, R_t \;\big|\; s_t = s,\; a_t = a,\; \pi \,\big]

in words: if I take action $a$ now and play perfectly forever after, what return should I expect? The max is over all policies $\pi$ , so $Q^*$ doesn't depend on how anyone actually plays; it is a fact about the game. And it is the number a player needs. If you had $Q^*$ on a cheat sheet, optimal play would be mechanical: at every state, press the action with the biggest entry. The entire problem of playing Atari reduces to estimating one function.

Why value state-action pairs instead of just states? Because turning a state value $V(s)$ into a decision requires asking where each action leads, and that requires a model of the game's dynamics, the very thing this approach refuses to learn. $Q$ bakes the one-step lookahead into the function itself, so acting greedily reduces to comparing entries. The cost contrast settles it: with $Q$ , choosing greedily is one forward pass that scores every action at once, with $V$ you would have to simulate each candidate action forward just to know where it lands, one rollout per button, before you could even ask the value function its opinion. (It's also why TD-Gammon, which we'll meet later, could afford to learn $V$ : backgammon's rules are known, so it could simulate every legal move and evaluate where each one lands.)

What makes $Q^*$ learnable is that it obeys a self-consistency law, the Bellman equation:

Q^*(s,a) \;=\; \mathbb{E}_{s'\sim\mathcal{E}}\big[\, r + \gamma \max_{a'} Q^*(s',a') \;\big|\; s,a \,\big]

(1)

The equation works like a GPS planning backwards: the fastest time from your corner equals the first leg plus the fastest time from wherever that leg ends. The value of acting now is the reward you collect immediately, plus $\gamma$ times the value of the best action available in the state you land in (the expectation $\mathbb{E}_{s'\sim\mathcal{E}}$ averages over the emulator's randomness about where you land; Atari itself is nearly deterministic, a fact that comes back when a competitor gets graded on its single best run). No other function satisfies this identity, so it works as a target: if your estimate violates equation (1) somewhere, you know it's wrong there.

It also hands you an algorithm. Treating (1) as an assignment instead of an identity, and sweeping it over all states, gives an algorithm:

Q_{i+1}(s,a) \;=\; \mathbb{E}\big[\, r + \gamma \max_{a'} Q_i(s',a') \;\big|\; s,a \,\big]

This is value iteration, and it provably converges, $Q_i \to Q^*$ . Each sweep shrinks the worst-case error by a factor of $\gamma$ , like photocopying a photocopy at 99% scale, so errors decay geometrically and value information spreads one step outward per sweep from the states where reward occurs. The figure shows this, and dragging $\gamma$ to its edges makes the dependence explicit: at $\gamma = 0$ value does not propagate at all (only the squares touching the reward ever learn anything), and near $\gamma = 1$ the board lights up almost evenly, because distance has almost stopped mattering.

Figure 2 · value washes backward

sweep 0

γ0.90

Value iteration on a gridworld with one +1; the heat is V(s), each cell's best-action value. Each Bellman sweep carries value one step further back from the reward, shrunk by γ per step, flowing around the wall. The γ slider spans the full range: at 0 the board stays dark beyond the reward's neighbors; at 0.99 it saturates. Delayed reward becomes local arithmetic.

So why doesn't this solve Atari on the spot? Because value iteration needs a table with one entry per state-action pair, it visits all of them repeatedly, and each backup takes an expectation over the emulator's transition probabilities, a map nobody has. (The gridworld above cheats on both counts: tiny, and the map known.) Atari's state space is every screen the game can show (more precisely every history of screens, as we'll see), which is not a number you store a table for. The paper calls the tabular approach "totally impractical" and points at the standard escape: replace the table with a function approximator, $Q(s,a;\theta) \approx Q^*(s,a)$ , here a neural network they call a Q-network. Two screens that differ by one pixel of scenery share their value estimate, and that sharing is the entire reason this can work with only ten million frames of experience. It is also the reason everyone expected it to explode.

Making a frame into a state

Before learning $Q$ , there's an unresolved problem with $s$ itself. Everything above assumed a Markov state, meaning the state carries all the information needed to predict what happens next; given the present, the past is redundant. A single Atari frame is not that. It shows the ball's position and says nothing about its velocity. Two situations with the same frame (ball rising, ball falling) demand different actions but look identical, which the paper calls the screen being perceptually aliased. Feed a single frame to any learner and you've asked it to make distinctions its input does not contain.

The paper's formal treatment is heavy: it defines the state as the entire history of screens and actions so far. Histories are always Markov (the past can't be missing from the present if the present includes the past), and since episodes end in finitely many steps, this gives a large but finite Markov decision process, the structure that licenses everything in the previous section. In practice the implementation is light: a function $\phi$ stacks the last 4 preprocessed frames into one 84×84×4 input. Four frames are a cheap, truncated stand-in for the full history. The truncation is also a ceiling, fixed before training starts: four frames cover well under a second of play, so the state sees reactions, not plans, and anything strategic that unfolds over seconds is invisible by construction; that bound will show up plainly in which games the agent wins and which it loses. The frames reveal velocity (two frames pin down direction and speed), a bit of acceleration, and they also help with a Stone Age hardware quirk: many Atari games flicker their sprites (the console could only draw a few objects per scanline, so games alternated them across frames), and any single frame can be missing an object that very much exists. The stack catches a blinker only if it shows up in at least one of the four sampled frames; this caveat returns as the paper's one per-game tweak.

Figure 3 · one frame is not a state

frames1

Slide the stack depth from 1 to 4. With one frame the ball's velocity is unknowable (the grey fan of candidate futures) and the blinking laser isn't there half the time. With two or more, the velocity arrow is pinned, and the stack catches this laser in at least one frame (its blink is out of step with the sampling here; in the real Space Invaders it wasn't, which forced the paper's lone per-game change). DQN ships with 4.

The payoff: with $\phi$ in place, "one stacked input" behaves approximately like "one state," and the Bellman machinery applies to it. The stack is also a confession of scope. Four frames span well under a second of game time, so anything strategic that lives on longer scales (your oxygen meter has been falling for twenty seconds) is invisible to the state. This scope limit predicts which games DQN loses.

Q-learning: regression on your own guesses

Now the learning rule. We want the network $Q(s,a;\theta)$ to satisfy the Bellman equation, but we can't evaluate the right side of (1): the expectation needs the emulator's transition probabilities, and we refuse to model them (this is what model-free means: learn values from sampled experience directly, never learn physics). What we do have is samples. Play, and every step hands you a transition $(s, a, r, s')$ : one real observation of "from $s$ , doing $a$ earned $r$ and led to $s'$ ."

So turn the identity into a regression. Build a target out of the sample and the network's own estimate of the future, and pull the prediction toward it:

L_i(\theta_i) \;=\; \mathbb{E}_{s,a\sim\rho(\cdot)}\Big[\big(y_i - Q(s,a;\theta_i)\big)^2\Big]

(2)

y_i \;=\; \mathbb{E}_{s'\sim\mathcal{E}}\big[\, r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) \;\big|\; s,a \,\big]

Squared error, like any regression, with two strange residents. The distribution $\rho$ , the behaviour distribution, names "whatever states and actions the agent actually encounters while playing" (textbooks say behaviour policy; same idea). And the label $y_i$ is manufactured from the network's own previous prediction at the next state. One step of measured reality, $r$ , plus $\gamma$ times a guess. This is called bootstrapping: improving guesses using other guesses, anchored to reality one reward at a time. The paper frames the oddity well: in supervised learning the targets are fixed before training starts; here the targets depend on the weights being trained. Move $\theta$ and the labels move too. (The subscripts: $i$ indexes rounds of this regression, and $\theta_{i-1}$ is last round's weights, held fixed while you fit $\theta_i$ . The notation is loose here; in the actual algorithm it collapses to the same live weights.)

Differentiate (2), treating the target as a constant, and you get the update direction:

\nabla_{\theta_i} L_i \;=\; \mathbb{E}_{s,a\sim\rho(\cdot);\,s'\sim\mathcal{E}}\Big[\big(\, r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) - Q(s,a;\theta_i)\big)\, \nabla_{\theta_i} Q(s,a;\theta_i)\Big]

(3)

The parenthesis is the TD error (temporal-difference error): how far today's prediction missed the one-step-corrected target. "Treating the target as a constant" is a choice, not an accident. The target contains $\theta$ too (through the max at $s'$ ), and equation (3) deliberately does not differentiate through it; this is called a semi-gradient. You could differentiate through the target instead (Baird's "residual gradient" method does), which buys provable convergence but pays for it with slower learning and a bias on random environments, so practice settled on the semi-gradient and its risks. The consequence of the choice: equation (3) is not gradient descent on any fixed objective. It is archery at a target you yourself keep moving: aim at where the target is now, pretending it's bolted down, even though your own adjustment will shift it.

Replace the expectations with single sampled transitions and step after every move, and (3) collapses into the classic tabular rule Watkins wrote down in 1989, Q-learning:

Q(s,a) \;\leftarrow\; Q(s,a) + \alpha\big[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\big]

(With one parameter per table cell, $\nabla Q$ reduces to an indicator, and the SGD step is literally this rule.) For lookup tables, Watkins and Dayan proved in 1992 that this converges to $Q^*$ with probability 1, provided every state-action pair keeps being visited and the step sizes decay at the usual stochastic-approximation rate. The fine print matters: the theorem is tabular only. Their paper says outright that it may fail for other representations. The same theorem has another clause that matters later: the experience does not have to arrive in order; in Watkins and Dayan's words, the episodes "need not form a continuous sequence." And DQN is about to use a very other representation.

The figure runs the rule on a corridor before scaling it to pixels. Six cells, +1 at the right end, and a table of Q-values learning from single transitions. Scrub slowly around the first time the agent reaches the goal: that update is the only one with real signal in it (terminal states have no future, so the target equals $y = r$ ), and every later update drags that signal one cell further left, the same Figure 2 wave rebuilt from samples instead of sweeps.

Figure 4 · the target is built from a guess

update 1/160

Tabular Q-learning on a corridor, one update at a time. The formula line assembles each target y = r + γ·max Q(s′,·) from one real reward and the table's own next-state guess. Early targets are zeros teaching zeros; the first arrival at +1 plants truth in the last cell, and repetition drags it leftward onto the dashed Q* ticks.

To see why a moving target is hard to chase, follow one fixed transition through training, the network seeing an enemy on screen and firing, and track its TD target and its prediction at three points in the run. The transition never changes; what changes is the weights underneath both numbers. Early, the weights are near-random, so the prediction $Q(s,a)$ is noise, and the target $r + \gamma \max_{a'} Q(s',a')$ is worse than noise: it is built from the network's equally random estimate at the next state, which the ongoing updates are also yanking around. So the target swings violently from one update to the next, not because reality changed but because the estimator it leans on is changing fast. The prediction moves toward a target that is itself shifting just as fast. Midway, the next-state estimate has been anchored to real rewards enough times that it has steadied somewhat, so the target still moves but in smaller jumps, and the prediction has closed most of the gap to it. Late, the network has propagated the reward back along the path many times, the next-state value it bootstraps from has nearly converged, and the target sits almost still; the prediction and the target now nearly agree, which is the Bellman identity in equation (1) being satisfied at this transition.

That arc is qualitative, but it is the reason the two stabilizers exist. The early wild swings come from two sources the design attacks separately. One is that consecutive frames are nearly identical, so a stream of them keeps shoving the weights in one direction and amplifies whatever the target is currently doing; sampling old, scattered transitions from the replay buffer instead breaks that correlation and averages the swing down. The other is that the target is rebuilt from the same weights being updated, so every step that moves the prediction also moves the target it was aiming at; the 2015 follow-up's frozen target network holds those target-side weights still for thousands of steps, so the target at least holds still while the prediction is fit toward it. Both fixes are aimed at the same thing this three-snapshot arc makes visible: making a self-referential target settle down faster and with less thrashing.

Whose behavior the target evaluates matters: $\max_{a'} Q(s',a')$ scores the best action at $s'$ , regardless of what the agent actually did next. Q-learning is a food critic traveling with an unadventurous crowd: the crowd (the ε-greedy behaviour) decides which restaurants get visited, but the review always records "the best dish on the menu here," not the fries your companions actually ordered. This makes Q-learning off-policy: it learns about the greedy policy while following a different, more exploratory one. Its on-policy sibling, Sarsa, instead plugs the action actually taken into the target, diary rather than review. The distinction sounds academic for now; it is the property that makes experience replay legal.

So Q-learning is regression where you manufacture your own labels out of your own predictions, and the labels move as you learn. With a table, the theorem holds anyway. The cracks appear when the table becomes a network.

Why everyone expected this to blow up

The idea "neural network + temporal-difference learning" was not new in 2013. It had already had its one great success, twenty years earlier: Tesauro's TD-Gammon, a backgammon program with a one-hidden-layer network that trained purely by self-play and reached the level of the best human players. (For the record, it differed from DQN in both knobs that matter here: it learned a state value $V(s)$ rather than $Q(s,a)$ , and it learned on-policy from its own games. The DQN paper's phrase "similar to Q-learning" is loose, and the paper itself footnotes the differences.)

Then came the embarrassing decade. The same approach was tried on chess, Go, and checkers, and produced nothing comparable. The field settled on a deflating explanation: backgammon itself had done the heavy lifting. The dice force exploration that a deterministic game never provides, and they smooth the value landscape. Pollack and Blair sharpened the skepticism by showing that even crude hill-climbing, no temporal-difference learning at all, got surprisingly far at backgammon. The field came away believing that TD-Gammon was a property of backgammon, not a recipe.

Theory then made the pessimism respectable. Tsitsiklis and Van Roy proved that TD methods with nonlinear function approximation can diverge, and Baird built a famous small counterexample (a six-state star) where off-policy updates with a linear approximator drive every weight to infinity, with all rewards zero and the true answer perfectly representable. Nothing about the task is hard in these examples. The divergence comes from the learning dynamics themselves, and you can hear the mechanism in one analogy: audio feedback. The network's estimate at one state is the speaker; through bootstrapped targets it feeds the microphone, the network's estimates at other states, because with shared weights an update here moves predictions everywhere. If the update distribution keeps re-amplifying the same loop, a small overestimate becomes a bigger target becomes a bigger overestimate, and the estimate diverges.

The modern name for the risky cocktail is the deadly triad: function approximation, bootstrapping, and off-policy data. (The term is Sutton and Barto's, from 2018; this paper predates its own diagnosis.) DQN runs all three at once, with a 680,000-parameter network. And there is a fourth ingredient the triad doesn't even count, the one the paper worries about most: the agent makes its own dataset. Suppose the current weights say "move left." The agent moves left, so the next thousand frames all come from the left side of the screen, so the next thousand updates are about left-side states, so whatever the network believes about the left gets reinforced and whatever it believed about the right decays. Training data generated by the thing being trained is a feedback loop, not the fixed dataset the learning rule assumes. The paper's citation for where such loops end is Tsitsiklis and Van Roy: oscillation, a poor local rut, or divergence.

Figure 5 · the deadly triad, three toggles

no learner active

Schematic illustration, not a measured experiment. A one-dimensional toy error

e_{t+1} = a\, e_t + \text{noise}

stands in for a value estimate; each ingredient nudges the gain

a

closer to 1. Zero on: nothing updates, the line stays flat. One or two on:

a < 1

, the error converges (noisier with two). All three on:

a > 1

, the error runs away. The label "deadly triad" is from Sutton and Barto's 2018 textbook; this figure teaches the conceptual risk, not a measured DQN run. Real DQN training diverges much less often than this toy suggests, because replay (this paper) and the frozen target network (the 2015 sequel) break exactly the feedback loop the all-three case caricatures.

This is why serious reinforcement learning had retreated to linear function approximators with convergence guarantees, and why the existing convergent alternatives (gradient-TD methods, which do true gradient descent on a well-defined objective, the projected Bellman error) had only been shown for evaluation or linear control, never at deep-network scale on a hard domain. The closest prior attempt was Riedmiller's neural fitted Q-learning (NFQ), which trains a neural network on exactly the loss in (2), but as a batch method that refits on the accumulated dataset each iteration, at a cost that grows with it; its visual variant first compressed the screen with a deep autoencoder rather than learning from pixels end to end. DQN's claim to "first" is precise: end to end from raw pixels, with cheap constant-cost SGD updates. (Riedmiller is an author here; the lineage is literal.) The 2013 move was to run every warning light at once and add a damper nobody considered sufficient: a buffer of old memories.

Experience replay: train on your past, not your present

The mechanism is almost insultingly simple. At every step, store the transition $e_t = (s_t, a_t, r_t, s_{t+1})$ in a fixed-size replay memory $\mathcal{D}$ (the paper keeps the most recent million frames). To learn, do not use the transition that just happened. Instead sample a minibatch of 32 transitions uniformly at random from the buffer and regress on those. The freshest experience just joins the library like everything else.

The paper gives three reasons, and each one strikes at a different failure mode: data efficiency, correlation, and the self-made feedback loop. First, data efficiency: each transition can be replayed in many updates instead of being used once and discarded, and at ten million frames you want every drop. Second, correlation: consecutive frames are nearly identical, so consecutive-sample minibatches contain one effective data point and 31 echoes; sampling at random breaks those correlations and, in the paper's phrasing, reduces the variance of the updates (32 spread-out samples average where 32 clones don't). Third, and this is the one aimed at the feedback loop: the training distribution stops being "whatever the current policy is doing this second" and becomes an average over the last million steps of behavior. The left-moving agent still adds left-side frames to the buffer, but it trains on a mixture of every phase it has been through, so the feedback loop no longer reinforces itself hard enough to diverge. Old flashcards from across the semester, shuffled, instead of re-reading the sentence you just wrote and then burning it.

In the figure, the band is your past and the trays are minibatches. The buffer-size slider runs the full range down to its degenerate edge: at $N = 32$ the buffer holds only "just now" and replay collapses into the online learning it was built to replace. That degenerate edge restates the claim in reverse: everything replay buys lives in the gap between the two trays.

Figure 6 · what the minibatch sees

buffer N1,000,000

The band is the agent's past million frames, the unit the paper counts its memory in; each tick is one transition (teal = just now, through violet, amber = ancient; in training, color is also "what the policy was doing then"). Online learning trains on one contiguous teal sliver. Replay draws 32 transitions spanning the full window, resampled every couple of seconds. Shrink N to 32 and the difference vanishes, along with the benefit.

Replay has a price, and it explains why Q-learning, not Sarsa, had to be the learning rule. A transition sampled from the buffer was generated by an old version of the agent, sometimes a much worse one. A method that learns "the value of how I currently behave" (on-policy, like Sarsa) would be poisoned by this stale data; it would keep computing the value of a policy that no longer exists. The paper states the requirement plainly: learning by replay forces you to learn off-policy, "which motivates the choice of Q-learning." Q-learning never asks who chose the action. A stored $(s,a,r,s')$ is a perfectly valid sample of "what happens if you do $a$ in $s$ " whether it was chosen by genius or by the $\varepsilon$ dice, and the target's max asks the same question of every sample: what's the best you could do from where it landed? The critic doesn't care that the crowd picked the restaurant two months ago.

Credit where due: replay is not a DeepMind invention. Long-Ji Lin introduced it in 1992, two decades earlier, for robot learning with neural networks; the paper cites his thesis. Lin was actually more cautious, recommending you replay only experiences consistent with the current policy. DQN drops the caution, leans on off-policy Q-learning to make stale data safe, and scales the idea up. The paper also names its own limitation: uniform sampling treats a pivotal, surprising transition and the ten-thousandth frame of nothing happening as equally worth study, and it points at prioritized sweeping as the obvious upgrade. Two years later, the same lab built exactly that (prioritized experience replay).

The famous frozen target network, a second copy of the weights used to compute targets and refreshed only every 10,000 updates, is not in this paper. That is the 2015 Nature paper's contribution. Here, Algorithm 1 bootstraps from the same live $\theta$ it is updating (the $\theta_{i-1}$ in equation (2) is a fitted-iteration formality that degenerates to the live weights in the actual algorithm). In 2013, experience replay carries the stability burden alone.

The network that reads the screen

The perception machinery is 2013-vintage and pleasingly small. Raw frames are 210×160 with a 128-color palette. Preprocessing $\phi$ converts to grayscale (color triples the input volume and these games barely need it; the paper's stated goal for this step is shrinking the input), downsamples to 110×84, and crops an 84×84 square around the playing area. The square crop has a wonderfully unprincipled reason: the team reused the GPU convolution code from Krizhevsky's AlexNet work, and it required square inputs. The last four such frames are stacked, giving the 84×84×4 state from Figure 3.

Then a convolutional network: 16 filters of size 8×8 with stride 4 (ReLU), 32 filters of 4×4 with stride 2 (ReLU), a fully-connected layer of 256 units (ReLU), and a final linear layer. Do the arithmetic and the network is tiny by any modern standard: 4,112 weights in the first conv layer, 8,224 in the second, 663,808 in the fully-connected layer (97.5% of the total), and a head of at most 4,626, for roughly 680,000 parameters in all. This is the network the paper christens the Deep Q-Network: DQN. The agent that beat humans at Pong fits in under three megabytes of float32, smaller than a single attention layer of a small modern language model.

The output head carries the design choice that later work reused. The obvious design, used by prior work, takes state and action as input and outputs one scalar, which means evaluating 18 actions costs 18 forward passes. DQN instead takes only the state and outputs all action values at once, one output unit per legal action: a menu with every price printed, rather than asking the waiter one dish at a time. The max over actions, which the target in (3) and the greedy policy both need constantly, becomes a single forward pass plus an argmax over a vector. Across games, only this head changes width (4 to 18 outputs); every layer upstream is identical.

Figure 7 · pixels to Q-values

actions K18 (full joystick)

The full pipeline: color screen → grayscale 110×84 → square 84×84 crop → stack of 4 → conv1 (16 filters → 20×20×16) → conv2 (32 filters → 9×9×32) → 256 fully connected → one Q-value per action, the max highlighted. Weight counts sit above the learned layers; the head's count tracks K. Drag K from 4 to 18: only the head's width changes across games; every layer upstream has the same shape.

The training recipe, identical for every game: RMSProp (divide each weight's step by a running average of its own recent gradient magnitudes, so no parameter's update dwarfs another's; at the time citable only as a slide from Hinton's Coursera course, and the paper states no learning rate either, the 0.00025 in common implementations being a 2015 number) with minibatches of 32; $\varepsilon$ annealed 1 → 0.1 over the first million frames, then held at 0.1; ten million frames of training per game; the million-frame replay memory. And to be precise about the headline claim: one architecture and one hyperparameter sheet, but a fresh network is trained from scratch for each game; no weights are shared across the seven. Two more tricks do real work. Rewards are clipped to +1, 0, or −1 during training, because raw score scales differ wildly across games (a Breakout brick is 1 to 7 points, a Q*bert ledge is 25) and clipping lets one learning rate serve all seven; the admitted cost is that the agent can't tell a 100-point opportunity from a 1-point one. The mechanism behind "serve all seven": TD targets inherit the reward scale, so without clipping the targets, and with them the gradients, would differ across games as much as the raw scores do, and a step size gentle enough for Breakout's small targets would detonate on Q*bert's large ones; clipping puts every game's error signal on one scale, and one learning rate fits it. And the agent acts only every 4th frame, repeating its last action on skipped frames, which buys roughly 4× more episodes of experience per unit of compute since the emulator is much cheaper than the network. (Frame skipping is unrelated to frame stacking; one is about acting cheaply at 15Hz, the other about making the input Markov.) The lone exception to "identical everywhere": Space Invaders used skip 3 instead of 4, because at skip 4 the blinking lasers landed exactly on the skipped frames and became invisible (a skipped frame never enters the stack, so stacking could not save them; the sampling rate itself had to change). The entire field of per-game tuning, in this paper, is that one integer.

Assembled, the algorithm fits in a screenful:

# deep Q-learning with experience replay (Algorithm 1)
D = ReplayMemory(N)                       # holds (phi, a, r, phi2) tuples
theta = init_random()
for episode in range(M):
    phi = preprocess(frames)              # the 84x84x4 stack
    for t in range(T):
        a = random_action() if rand() < eps \
            else argmax(Q(phi, theta))    # one forward pass, all actions
        r, frame = emulator.step(a)       # r = clipped score change
        frames.append(frame); phi2 = preprocess(frames)
        D.store((phi, a, r, phi2))
        # learn from the past, not from this moment:
        batch = D.sample(32)              # uniform random minibatch
        ys = [r_j if terminal(p2)
              else r_j + gamma * max(Q(p2, theta))  # same live theta,
              for (p, a_j, r_j, p2) in batch]       # no frozen copy
        preds = [Q(p, theta)[a_j] for (p, a_j, r_j, p2) in batch]
        sgd_step(grad(mean((ys - preds) ** 2),      # one step per batch
                 wrt=theta, treating_ys_as_constant=True))
        phi = phi2

Two notes on the listing. The published Algorithm 1 writes the greedy line as max_a Q*(φ(s),a;θ); it means argmax (a max returns a value, not a button), and the star is a slip, since the learned network is $Q$ , not the unknown $Q^*$ . We render it corrected. And the marked same live theta is the 2013 signature discussed above: targets come from the very weights being updated, no frozen copy.

That is the entire agent: one small network, one buffer, one regression loop, and a single integer of per-game tuning. What follows is whether it plays.

What it learned

Before the scoreboard, the paper shows something better: evidence the network learned values in the Bellman sense rather than mere reflexes. They run a trained Seaquest agent and plot its predicted value over a 30-frame stretch. The reward in that stretch is a single spike, the moment a torpedo connects. The prediction is not a spike. It jumps the moment an enemy appears on screen, climbs while the torpedo is in the water, peaks just before impact, and falls back to baseline once the enemy is gone. That curve is equation (1) made visible: the +1 has propagated, through $\gamma$ , into every state on the path that leads to it.

Figure 8 · value is anticipation

frame 18/30

A Seaquest-like episode, scrubbable (after the paper's Figure 3). The reward is one late spike at the kill. The predicted value rises when the enemy appears (A), peaks just before the torpedo lands (B), and settles after the enemy is gone (C). The discounting spreads the future reward backward across the states that lead to it.

That same predicted-Q quantity solved a practical problem: how do you even monitor training? Episode scores are so noisy (a small weight change shifts which states the policy visits, which swings the score wildly) that progress plots look like static. The paper's fix: collect a fixed set of states once, with a random policy, before training, and track the average max predicted Q on that held-out set. That curve rises smoothly on every game, and, the claim that mattered most against the divergence history: they report no divergence in any experiment. The paper claims no more than it shows: no theorem, no guarantee, just an empirical "the needle stayed on the road." An epoch on their 2013 hardware was 50,000 minibatch updates, about 30 minutes; the paper never totals the run, but at that rate the training curves it plots work out to roughly a day or two of GPU time per game.

The scoreboard, with its evaluation rules stated first since they decide what counts: the learning methods (Sarsa, Contingency, DQN) are scored by average total reward under an ε-greedy policy with ε = 0.05 (so the agent must survive its own occasional random twitch), and "Human" is the median score of an expert after about two hours of practice per game (the paper notes its human numbers run much higher than the ones in the benchmark's own paper). The arena is the Arcade Learning Environment (ALE), the shared harness that gives every method the same screen, score, and joystick. The baselines are not strawmen: Sarsa and Contingency are the best published methods on ALE, both armed with hand-engineered visual features, including tricks like treating each of the 128 colors as an object channel. DQN sees raw pixels and beats both, on all seven games, by large margins.

Figure 9 · Table 1, explorable

gameBreakout

Average scores per game, verbatim from the paper: Random, Sarsa and Contingency (hand-built features), DQN, and the human median. DQN beats every learned baseline on all seven games, beats the human on Breakout, Enduro, and Pong, and loses by an order of magnitude on Q*bert and Seaquest. Pong is a margin score, so its chart goes negative.

The abstract says it "outperforms all previous approaches on six of the games," and the missing seventh is the one to get right. The lone method DQN doesn't clear everywhere is HNeat, a neuroevolution approach, and the comparison is lopsided by design: HNeat's reported score is its single best episode, replayed in a then-deterministic emulator (the same fixed start state every time, so a memorized action sequence scores forever), while DQN's numbers are averages with random actions injected 5% of the time. Speedrun tape versus season average. Even so, DQN's best episodes beat HNeat's everywhere except Space Invaders (1,075 vs 1,720), which is the "six."

The wins and losses against the human line trace one boundary. DQN wins on games where value is short-horizon: Breakout (168 vs the human 31, five times the human median), Enduro (470 vs 368), Pong (20 vs −3, where 21 is a shutout), and it comes close on the fourth short-horizon game, Beam Rider (4,092 vs 7,456; the paper calls that close to human performance). It loses badly where strategy is long-horizon: Q*bert (1,952 vs 18,900), Seaquest (1,705 vs 28,010, where playing well means managing oxygen and surfacing, minutes-scale structure), Space Invaders (581 vs 3,690). The paper says it directly: those games "require a strategy that extends over long time scales." Two blinders constrain the agent, a four-frame state that sees a quarter-second of history and a discount horizon of order $1/(1-\gamma)$ steps: the failure cases are precisely the games whose structure lives beyond both.

What happened next

The 2015 Nature follow-up scaled the recipe to 49 games and added the one stabilizer this paper lacked: the frozen target network, a copy $\theta^-$ of the weights used only for computing targets and refreshed every 10,000 updates. That turns the archery range back into something like supervised learning for 10,000 steps at a time: the goalpost is still your own older guess, but at least it holds still while you aim. (The Nature version also grew the network to three conv layers and 512 hidden units, clipped the TD error, trained five times longer, and stated $\gamma = 0.99$ , the number this paper never specifies.) Its ablation, run at this paper's 10-million-frame scale on Breakout, isolates what mattered: with both fixes the score is 316.8; replay alone keeps 240.7; the target network alone collapses to 10.2; neither, 3.2. The 2013 bet that a memory buffer could carry the stability burden paid off, in hindsight; the target network is a refinement on top of replay, not a substitute for it.

The rest of the lineage reads like the to-do list this paper left in its own margins. Uniform sampling became prioritized experience replay (Schaul et al., the upgrade this paper itself gestures at via prioritized sweeping). The max in the target systematically overestimates, which Double DQN later fixed. The head split into value and advantage streams in Dueling DQN, and Rainbow stacked the lot. The long-horizon games that embarrassed DQN here, and Montezuma's Revenge after it, drove years of work on exploration and memory. And the basic recipe (a value network, a replay buffer, off-policy updates) is still the skeleton of most value-based deep RL today.

The shape of the paper is simple once you have the pieces. Q-learning gives a way to learn what actions are worth from delayed scores. Deep networks give a way to read a screen. Each had a known, fatal incompatibility with the other: the network needs decorrelated data with stable targets; the learning rule generates a correlated stream and moves its own targets. Experience replay sits right in the crack: it decorrelates the stream, averages away the feedback loop, and is only legal because Q-learning never cared who chose the action. The deep-learning recipe didn't need to be rebuilt for reinforcement learning; it needed a better librarian.

Provenance Verified against primary literature

Watkins 1989 / Watkins & Dayan 1992Q-learning and its tabular convergence proof, with its fine print.

Sutton & BartoMDPs, returns, Bellman optimality, value iteration, on/off-policy.

Lin 1992 / 1993Experience replay, two decades before DQN.

Tsitsiklis & Van Roy 1997 / Baird 1995Why TD with function approximation can diverge.

Mnih et al., Nature 2015The target network, 49 games, and the replay-vs-target ablation.

correctionThe frozen target network is NOT in this paper: Algorithm 1 bootstraps from the same live weights (the 2015 Nature paper added θ⁻). We also silently correct Algorithm 1's printed greedy step, max_a Q*(φ,a;θ): argmax is meant, and the learned Q carries no star. The paper never states its γ; 0.99 is the 2015 value.

Questions you might still have

Why does the max in the target make Q-learning off-policy?
The target r + γ·max Q(s′,·) evaluates the best available action at s′, whether or not the agent took it. So the answer being learned ("what would greedy play earn from here?") never depends on who generated the data, and a transition collected by an old, worse policy is still a valid sample. That indifference is exactly what makes training on a replay buffer legal.

Does the Q-learning convergence proof cover DQN?
No. Watkins and Dayan’s theorem is for lookup tables, with decaying step sizes and every state-action pair visited forever; their paper warns it may fail for other representations. DQN has no convergence guarantee. The paper’s evidence is empirical: the held-out average max-Q rises smoothly and no run diverged.

Why not just imitate human play with supervised learning?
You’d need labeled demonstrations for every game, and the ceiling would be the demonstrator. The score signal is free, needs no human, and carries no ceiling: DQN beat the human median on three games. The price is everything in this article: delayed credit, correlated data, and self-made targets.

If replay fixes correlation, what fixes the moving target?
In this paper, nothing, and it worked anyway at 10M-frame scale. The 2015 Nature version froze a copy of the weights for 10,000 updates at a time to hold targets still. Its own ablation says replay was the bigger fix: replay alone scored 240.7 on Breakout, target network alone 10.2.

Footnotes & further reading

The paper: Mnih, Kavukcuoglu, Silver, Graves, Antonoglou, Wierstra, Riedmiller, Playing Atari with Deep Reinforcement Learning (DeepMind, NIPS Deep Learning Workshop 2013).
Q-learning: Watkins, Learning from Delayed Rewards (PhD thesis, 1989); the convergence proof is Watkins & Dayan, Q-learning (Machine Learning, 1992). Tabular only, by its own statement.
The textbook for everything in the first half: Sutton & Barto, Reinforcement Learning: An Introduction. The "deadly triad" framing is from the 2018 second edition, named five years after this paper ran all three.
Experience replay: Lin, Reinforcement Learning for Robots Using Neural Networks (CMU thesis / technical report, 1993; the technique itself appears in his 1992 Machine Learning paper).
The divergence results: Tsitsiklis & Van Roy, An Analysis of Temporal-Difference Learning with Function Approximation (IEEE TAC, 1997), and Baird, Residual Algorithms: Reinforcement Learning with Function Approximation (ICML 1995).
The prehistory: Tesauro, Temporal Difference Learning and TD-Gammon (CACM, 1995), and the skeptical postmortem, Pollack & Blair, Why did TD-Gammon Work? (NIPS 1996).
The benchmark: Bellemare, Naddaf, Veness, Bowling, The Arcade Learning Environment (JAIR, 2013). The HNeat rows come from Hausknecht et al.'s 2013 neuroevolution manuscript (journal version: IEEE TCIAIG 2014).
The sequel: Mnih et al., Human-level control through deep reinforcement learning (Nature, 2015): target network, 49 games, and the ablation quoted above. The prioritization this paper predicted: Schaul, Quan, Antonoglou, Silver, Prioritized Experience Replay (ICLR 2016).

Got feedback?

How could this explainer be improved? Found an error, or something unclear? I read every message.