VerifiedarXiv:1712.0181524 min
Reinforcement learning · Search

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

One algorithm, given only the rules, taught itself chess, shogi, and Go.

Starting from random play and the rules alone, one network learns by playing itself, with a tree search that turns each of its guesses into a stronger move and then teaches that move back to the network.

Explaining the paperMastering Chess and Shogi by Self-Play with a General Reinforcement Learning AlgorithmSilver, Hubert, Schrittwieser, et al. · DeepMind · 2017 · arXiv:1712.01815

In 1997 it took purpose-built hardware and decades of encoded human chess knowledge to beat the world champion. Twenty years later one program, handed nothing but the rules, taught itself chess, shogi, and Go in a single day and beat the strongest engine in each.

For seventy years the way to make a computer play chess was the same. You write down what human grandmasters know, a few hundred hand-tuned numbers for material, king safety, pawn structure, mobility, then you search: explore the tree of possible moves as deep as the clock allows, scoring the leaf positions with those numbers and assuming both sides play their best reply. Deep Blue beat Garry Kasparov this way in 1997. The strongest engine of 2016, Stockfish, is the same idea refined for two more decades: a fast alpha-beta search over a handcrafted evaluation, plus opening books, endgame tables, and a thick stack of domain-specific heuristics. It is enormously strong and almost entirely human-designed. None of it transfers to another game.

AlphaZero throws all of it out. There is no handcrafted evaluation, no opening book, no search heuristics, no human games to learn from. There is one neural network and one general-purpose tree search, and the only thing the system is told is the rules: which moves are legal, and who has won. From random initial weights it plays itself millions of times, and out of those games it learns to play chess, shogi, and Go better than the best programs ever written for any of them. The word the authors use is tabula rasa, a blank slate. The same algorithm, unchanged, does all three games. This is the direct ancestor of MuZero, which a couple of years later dropped even the requirement of knowing the rules.

A handful of ideas carry the method, and none of them is hard on its own: a network that turns a position into a hunch about which move to play and a guess at who is winning; a tree search that spends a few hundred simulations turning that hunch into a demonstrably better move; a self-play loop in which the search keeps handing the network better targets to copy; a board-and-move encoding plain enough to give any game to the same network; and one argument for why a search that averages beats the classical search that takes a minimax, once the thing doing the evaluating is a noisy neural network. Take them in order.

The self-play loop: search teaches the network

Everything rests on one network, written (p,v)=fθ(s)(\mathbf{p}, v) = f_\theta(s). Hand it a board position ss and it returns two things at once: a vector of move probabilities p\mathbf{p}, where pa=Pr(as)p_a = \Pr(a \mid s) is its instinct for how likely each move aa is to be the right one, and a single number vv, its estimate of the eventual outcome from this position, vE[zs]v \approx \mathbb{E}[z \mid s]. The outcome zz is +1+1 for a win, 00 for a draw, 1-1 for a loss. So p\mathbf{p} is a fast opinion about moves and vv is a fast opinion about who is winning, both produced in a single pass of the network, with no looking ahead at all.

On its own that fast opinion is mediocre, the way a strong player's first glance at a position is good but beatable. So before committing to a move, AlphaZero searches: it runs 800 simulated games from the current position, a Monte-Carlo tree search guided by the network, and returns a new distribution over moves, written π\boldsymbol{\pi}, built from how often the search visited each move. The mechanics come a couple of sections below. Only one fact matters here, and the paper turns on it: the searched policy π\boldsymbol{\pi} is stronger than the network's raw policy p\mathbf{p}. It has to be: the search took p\mathbf{p} as a starting hunch and then spent compute checking the consequences, so it picks moves the one-glance network would have missed. In the language AlphaGo Zero introduced, the search is a policy-improvement operator: it reliably converts a so-so policy into a better one.

That single fact closes a loop. Play a full game choosing moves by the improved policy π\boldsymbol{\pi}, and you get two kinds of training signal for free. Every position visited comes with the searched policy π\boldsymbol{\pi} that was computed there, a better target than the network's own p\mathbf{p}, and at the end of the game you learn the true outcome zz, a target for the value. Train the network to move p\mathbf{p} toward π\boldsymbol{\pi} and vv toward zz, and the next network is a little stronger. A stronger network makes the next search start from a better hunch, which makes the search's output better still, which makes the next training target better. Round and round, from random weights up to superhuman, with the game's own rules as the only ground truth.

This is reinforcement learning as policy iteration, the classic alternation of evaluating a policy and then improving it, with the network playing the role of the fast, learned policy and the search playing the role of the slow, deliberate improvement on top of it. The same fast-versus-slow split is the heart of the contemporaneous Expert Iteration paper, where the search is the deliberate "expert" and the network is the fast "apprentice" trained to imitate it. Watch the four stages circulate, and watch the strength bar fill as each lap of the loop hands the next one a better starting point:

Figure 1 · the self-play loop
net
The cycle that bootstraps a player from random weights. The network turns a position into a prior p and value v; search sharpens p into a stronger policy π; self-play follows π to an outcome z; training pushes p toward π and v toward z. Each lap makes the next search better, so the strength bar climbs.

One subtlety matters here, because it is the reason this works without a single human game. The search is better than the network at fixed weights, before any training happens. The improvement is bought with compute, not with data: looking ahead 800 times is more informative than glancing once. Training does not create the improvement; it only banks it, copying the search's better judgment into the fast network so the next search starts from higher ground. In code the loop is short:

# one self-play game of AlphaZero, then a learning update
s = start_position()
history = []
while not s.terminal():
    pi = mcts(s, net, sims=800)     # 800 PUCT simulations -> visit policy
    history.append((s, pi))         # store the SEARCH policy as the target
    a = sample(pi)                  # play in proportion to visit counts
    s = s.apply(a)
z = s.outcome()                     # +1 win / 0 draw / -1 loss
for (s, pi) in history:             # z taken from that player's view
    grad_step(net, s, pi, z)        # min (z-v)^2 - pi . log p + c||theta||^2

The outcome zz at the end is the same number for every position in the game, taken from the perspective of whichever player was to move there. There is no separate reward along the way, no human label, no bootstrapped estimate standing in for the truth: just the result of a game the system played against itself under the real rules, propagated back to every position it passed through.

One network, two heads, one loss

The network is a deep residual network, a tall stack of convolutional residual blocks reading the board as a grid, with two small "heads" on top of a shared body: a policy head that outputs p\mathbf{p} and a value head that outputs vv. The architecture is inherited wholesale from AlphaGo Zero, which the paper defers to for every detail it does not restate. One design choice in it is worth calling out, because the earlier AlphaGo did the opposite: the policy and the value share one body rather than living in two separate networks. Folding them together means a single forward pass produces both outputs, and, more usefully, the two objectives shape one representation. The features that help judge who is winning and the features that help rank moves reinforce each other, and the combined network plays stronger than either separate one did.

Training is ordinary gradient descent on a single loss. Given a position ss, its searched policy π\boldsymbol{\pi}, and the game's outcome zz, minimize

l=(zv)2    πlogp  +  cθ2l = (z - v)^2 \;-\; \boldsymbol{\pi}^\top \log \mathbf{p} \;+\; c\,\lVert\theta\rVert^2
(1)

Three terms, each doing one job. The first, (zv)2(z - v)^2, is a squared error that pulls the value vv toward the actual result zz; it is the part that learns to read who is winning. The second, πlogp-\boldsymbol{\pi}^\top \log \mathbf{p}, is a cross-entropy between the search policy and the network policy. Because π\boldsymbol{\pi} is a probability distribution (the visit counts, normalized), this term bottoms out exactly when p=π\mathbf{p} = \boldsymbol{\pi} (any mismatch only adds to it), so minimizing it drags the network's instinct toward the search's improved policy. That leading minus sign turns it into a proper cross-entropy rather than its negative; the expression is a loss to drive down, and the paper's prose phrase "maximize the similarity of p\mathbf{p} to π\boldsymbol{\pi}" is the same objective stated with the sign flipped, not a second one. The third term, cθ2c\,\lVert\theta\rVert^2, is plain L2 weight decay, with cc setting how hard it pushes the weights toward zero.

Two questions a careful reader asks here, both with one-sentence answers. Why train p\mathbf{p} toward the whole distribution π\boldsymbol{\pi} instead of toward the single move the search ended up playing? Because π\boldsymbol{\pi} is the entire improved policy, and the full distribution of how the search spread its attention carries far more signal than one sampled move. And why use the final game outcome zz as the value target rather than some cleverer running estimate? Because zz is the honest result of the searched policy playing itself, an unbiased sample of what actually happens, with nothing to tune. (Its successor MuZero later replaces zz with the search's own value estimate, which lowers variance; AlphaZero keeps the simpler, unbiased target.)

The configuration around this loss is deliberately spare. One learning rate schedule for all three games, starting at 0.20.2 and dropped three times to 0.020.02, 0.0020.002, then 0.00020.0002; mini-batches of 4,0964{,}096 positions; 700,000700{,}000 gradient steps in total. The compute that fed it was not spare: 5,0005{,}000 first-generation TPUs generating self-play games and 6464 second-generation TPUs running the training. Nearly all the compute goes into generating games, since each move costs 800 network evaluations while a gradient step costs one batch. Hold onto one fact before any chess is played: every position the network ever trains on was labeled by the network's own search, which was guided by an earlier version of the same network. No teacher outside the system ever corrects it.

Now the part deferred above: how 800 simulations turn the network's hunch into a better move. The search itself is not a contribution of this paper. AlphaZero takes it unchanged from AlphaGo Zero, which is where the formulas below live; the only search settings this paper touches are the simulation count and a single exploration knob. It still pays to see how it works, because the "policy improvement" claim rests on it.

The search grows a tree of positions, rooted at the position you actually have to move from. Each edge of the tree, meaning a move aa from a position ss, keeps four running numbers: a visit count N(s,a)N(s,a), the total value W(s,a)W(s,a) backed up through it, the mean value Q(s,a)=W(s,a)/N(s,a)Q(s,a) = W(s,a)/N(s,a), and the prior P(s,a)P(s,a) the network assigned that move when the position was first seen. A single simulation does three things. It selects a path down the tree, at each step taking the move that maximizes

Q(s,a)  +  U(s,a),U(s,a)=cpuctP(s,a)bN(s,b)1+N(s,a)Q(s,a) \;+\; U(s,a), \qquad U(s,a) = c_{\text{puct}}\, P(s,a)\, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}
(2)

until it reaches a position the tree has not expanded yet. There it expands: the network evaluates that leaf once, producing a value vv and priors p\mathbf{p} for the new edges. And then it backs up, adding that one value vv to every edge on the path it took and bumping each visit count. No random playout to the end of the game, the way the original AlphaGo and classical Monte-Carlo search did; the leaf is scored by the value head and that estimate propagates. "Monte-Carlo" here means averaging values over a tree, not playing out random games.

The selection rule (2) does the real work, so read it slowly. The first term, Q(s,a)Q(s,a), is exploitation: prefer the move whose explored consequences have looked best. The second term, U(s,a)U(s,a), is exploration, and it has a specific shape. It is large when the network's prior P(s,a)P(s,a) is high, so the search starts by trusting the network's instinct, and it shrinks as N(s,a)N(s,a) grows, so a move that has already been visited a lot stops getting the bonus and the search moves on. (This is the PUCT rule, after Rosin's 2011 bandit work. Note there is no logarithm in it, unlike the textbook UCB rule; the numerator is the square root of the total visits to the parent, and the +1+1 in the denominator keeps an unvisited move's bonus finite.) Early simulations follow the prior; later ones follow the values; and where the network was wrong, enough visits let QQ override its mistaken prior.

After 800 simulations, the move AlphaZero actually plays is chosen not by which has the best value but by which was visited most. The search policy is

π(as)    N(s,a)1/τ\boldsymbol{\pi}(a \mid s) \;\propto\; N(s,a)^{1/\tau}
(3)

with a temperature τ\tau (high τ\tau keeps the choice near the raw visit shares; low τ\tau sharpens it toward the single most-visited move) that is 11 early in self-play, to keep some variety in the openings, and sent toward 00 later and during real games, which means "play the most-visited move." Using the visit count rather than the raw value as the output is deliberate: the visit count is robust, because the search only pours visits into a move after repeatedly confirming it is good, whereas a single QQ can be a lucky or unlucky one-off. And this same visit distribution is the π\boldsymbol{\pi} the network is trained to imitate.

One more knob, the only one AlphaZero tunes per game. To make sure self-play keeps exploring instead of grooving one line forever, a dash of Dirichlet noise is mixed into the prior at the root only, P0.75p+0.25ηP \leftarrow 0.75\,\mathbf{p} + 0.25\,\boldsymbol{\eta} with ηDir(α)\boldsymbol{\eta} \sim \text{Dir}(\alpha). The amount α\alpha is scaled to the game's branching factor, set inversely to the typical number of legal moves: α=0.3\alpha = 0.3 for chess (around 35 legal moves in a position), 0.150.15 for shogi, 0.030.03 for Go (closer to 250). A game with more choices spreads the same exploration over more moves, so it needs a smaller α\alpha per move.

Watch the improvement happen below. The amber ticks are the network's raw prior over six candidate moves; its favorite is move a. The teal bars are the search policy, the share of the 800 visits each move collects, with the looked-ahead value QQ printed above each. Drag the simulations up from zero and watch the visits start out following the prior, then pile onto move d, the move with the best look-ahead value, which the one-glance network had ranked second:

Figure 2 · search improves the policy
160 sims
MCTS as a policy-improvement operator. Early simulations follow the network's prior P; as visits accumulate the exploration bonus decays and the backed-up value takes over, so the search policy π sharpens onto the move with the best look-ahead, here a different move than the prior's favorite. The network is then trained to imitate this sharper π.

In pseudocode a simulation is select, expand, back up:

# one of the 800 simulations, from the root down
node, path = root, []
while node.is_expanded:                       # SELECT by PUCT
    a = argmax_a Q(node,a) + c*P(node,a)*sqrt(sum_b N(node,b))/(1+N(node,a))
    path.append((node, a)); node = node.child(a)
p, v = net(node.state)                        # EXPAND: one network eval, no rollout
node.expand(priors=p)
for (s, a) in path:                           # BACK UP the value
    N(s,a) += 1; W(s,a) += v; Q(s,a) = W(s,a) / N(s,a)

Run that 800 times, read off the visit counts, and you have both the move to play and the training target for the policy. The search has done the looking-ahead the single network cannot.

The board, and the moves, as stacks of planes

For a convolutional network to read a board, the board has to arrive as a grid of numbers. AlphaZero encodes both the position and the set of possible moves as stacks of N×NN \times N planes, one binary plane per fact, which is the natural shape for a convnet because the same small filter can slide across the whole board. This encoding, plus the bare rules, is the only game-specific knowledge the system is given.

The input is an N×N×(MT+L)N \times N \times (M T + L) stack. The MM per-step planes describe one position (one plane per piece type for each player, plus a couple for repetitions), and they are stacked over a history of the last T=8T = 8 moves so the network can see how the position arrived. On top sit LL constant planes for things that are not per-square: whose turn it is, the move count, castling rights, the no-progress counter. The arithmetic is easy to get wrong, so do it once: only the per-step block is repeated eight times, and the constants are added once. Chess comes to 14×8+7=11914 \times 8 + 7 = 119 planes, shogi to 45×8+2=36245 \times 8 + 2 = 362, Go to a slim 2×8+1=172 \times 8 + 1 = 17.

The moves are encoded the same way, as a stack of planes over the board. A move is "pick up the piece on this square and do this with it," so each of the 8×88 \times 8 squares is crossed with a set of move-type planes. Chess uses 73 of them: 56 for sliding moves (every straight-line slide, 7 distances in each of 8 compass directions, which is the shared geometry of queens, rooks, bishops, and the short steps of kings and pawns), 8 for knight jumps, and 9 for under-promotions. That makes 8×8×73=4,6728 \times 8 \times 73 = 4{,}672 possible moves, almost all of them illegal in any given position and simply masked out. Shogi needs 9×9×139=11,2599 \times 9 \times 139 = 11{,}259, the extra planes covering promotions and the piece drops that let a captured piece re-enter the board, the rule that makes shogi's tree so much wider. Toggle through the three games and hover a band of the move bar to see what each group of planes encodes:

Figure 3 · board and moves as planes
119 input · 4,672 moves
The only domain knowledge beyond the rules. The input is an N×N×(M·8+L) stack: per-step feature planes over an 8-move history plus a few constants. The policy output is an N×N×P stack, every square crossed with the move-types. Toggle the game; hover a move-group for what it means. Counts are exact.

One detail in that 73 catches people out: the 9 under-promotion planes cover promoting a pawn to a knight, bishop, or rook (three pieces) along three directions (push, capture left, capture right). Promotion to a queen, the usual case, is not a separate plane at all; it is the ordinary sliding-move plane for that pawn. And the 56 "queen moves" are not about the queen specifically; they are every straight-line slide of any length and direction, the geometry shared by every sliding piece. None of these counts is tuned. The paper notes a plain flat list of moves works almost as well; the plane layout just trains a little faster.

Why an averaging search beats minimax

A genuine puzzle sits in the results. AlphaZero searches far less than Stockfish does. In chess it looks at about 80 thousand positions per second; Stockfish looks at 70 million, nearly a thousand times more, because AlphaZero spends a whole network evaluation judging each position while Stockfish spends a cheap handcrafted formula. In shogi it is 40 thousand against the 35 million of Elmo, the strongest shogi engine. By the brute-force logic that has defined computer chess, AlphaZero should lose badly. It wins, and it wins because of a careful argument about how the two searches combine their position evaluations, the paper's central justification for the approach.

Classical engines compute a minimax: down at the leaves they score positions, then back the scores up assuming you pick your best move and your opponent picks theirs, so a node's value is the max or the min of its children. (Alpha-beta pruning, the standard speedup, does not change the answer; it skips branches that provably cannot affect it, and the exact minimax value comes back.) That is the right thing to do when the leaf scores are accurate. The trouble starts when the evaluator is a deep network, which is far more expressive than Stockfish's handcrafted linear formula but also noisier, prone to the occasional badly wrong score on a position it has not learned well. Minimax is built out of max and min, and max and min are operators that seek the extreme: a single leaf the network wildly overrates can become the maximum at its level and ride straight up to the root, flipping the engine's decision. So the worst single error, not the average, sets the verdict.

AlphaZero's tree search does the opposite. It averages the values within a subtree rather than taking their min or max, and averaging makes independent errors cancel: an overestimate on one leaf and an underestimate on another wash out, so the root estimate stays close to the truth even when individual leaves are noisy. That is why a powerful-but-noisy network pairs naturally with an averaging search and badly with minimax. The figure makes the contrast concrete. The leaf values are evaluated by a noisy network; drag the noise up and watch the minimax backup swing and even change which move it prefers, while the averaging backup barely moves:

Figure 4 · minimax amplifies a noisy evaluator
0.50
A two-ply tree with a noisy leaf evaluator. Minimax (min then max) latches onto the most extreme leaf, so its root value swings with the noise and can flip the chosen move. The average backup cancels independent errors and barely moves. Drag the noise to compare the two errors; alpha-beta returns the exact minimax, so the noise is the evaluator's, not the search's.

Two honest caveats keep this from being more than it is. The error-cancellation works cleanly only if the network's mistakes are roughly independent and unbiased; a systematic blind spot the network shares across many similar positions will survive averaging, so this is the authors' intuition for why their search tolerates a noisy evaluator, not a theorem that it always will. And the contrast is Claude Shannon's, drawn in 1950: the brute-force "Type A" search that examines everything, versus the selective "Type B" search that a human uses, looking hard at a few promising moves and ignoring the rest. AlphaZero is the Type B player, leaning on a strong evaluator to search a thousand times less but in the right places. (Stockfish's evaluator was handcrafted and linear, by the way; the neural NNUE evaluators that engines use now arrived in 2020, after this paper.)

The selectivity pays off in a way that surprised people. Give both players more time per move and AlphaZero's advantage grows, the opposite of the long-standing belief that alpha-beta's deep brute-force search must eventually overtake a slower, more selective one. Toggle the game and slide the clock:

Figure 5 · scaling with thinking time
vs Stockfish
Strength versus thinking time per move, relative to the engine at 40ms. Both improve, but AlphaZero's search gains faster than the alpha-beta engine's, so its lead widens with more time, despite searching a thousand times fewer positions. Slopes are illustrative of the paper's reported shape; the positions-per-second figures are exact.

What makes it general: subtraction

AlphaZero is barely a new algorithm. It is AlphaGo Zero with four assumptions removed, and the removals are exactly what let one method cover three very different games. It helps to see them as deletions, because the generality came from taking things out.

Draws. AlphaGo Zero estimated the probability of winning, which is fine for Go, where under the scoring used there are no draws. Chess and shogi draw constantly, and the best play in chess is widely believed to be a draw. So AlphaZero estimates the expected outcome instead, with z{1,0,+1}z \in \{-1, 0, +1\} and a draw scoring a clean 00. Notice the loss in (1) did not change at all; only the meaning of the target did. A draw is an outcome of 00 for the value to learn.

Symmetry. Go's board is the same under rotations and reflections, and AlphaGo Zero exploited that twice: it multiplied its training data eightfold with the eight symmetries, and it randomly rotated or reflected the board before each network evaluation to average out bias. Chess and shogi have no such symmetry. A mirrored chess board is a different position: pawns march one way and not the other, kingside castling is not queenside castling. So AlphaZero drops both tricks, augmenting nothing and transforming nothing.

The best-player gate. AlphaGo Zero generated its self-play games with the best network seen so far, and only promoted a new network after it beat the incumbent by a margin in a separate evaluation match. AlphaZero throws away that whole machinery and keeps a single network that is updated continuously, generating its self-play from the latest weights with no evaluation step and no promotion gate. Continuous improvement did not need the checkpoint after all.

Per-game tuning. AlphaGo Zero tuned its search hyper-parameters for Go with Bayesian optimization. AlphaZero reuses one set of hyper-parameters across all three games, with the single exception of the exploration-noise scale α\alpha from the search section, which has to track each game's branching factor. Everything else is shared.

Two clarifications so the generality is not oversold. The five things AlphaZero is still given are the plane encoding of board and moves, the rules used inside the search to generate moves and detect a finished game, the rules used to fill the special input planes, the typical legal-move count that sets α\alpha, and a maximum game length after which a game is called a draw. And "one algorithm, three games" does not mean one brain: a separate network is trained from scratch for each game. The generality is in the algorithm, not in a single set of weights.

The results, and the fine print

From random play, AlphaZero crossed each program's level fast. In chess it passed Stockfish after 4 hours of training (300 thousand steps); in shogi it passed Elmo, the reigning computer champion, after less than 2 hours (110 thousand steps); in Go it passed the version of AlphaGo that beat Lee Sedol after 8 hours (165 thousand steps). The wall-clock differs from the step count because a Go training step is much slower to generate: full training took 9 hours for chess, 12 for shogi, 34 for Go, all to the same 700 thousand steps. Toggle a game and scrub the training to watch the crossover:

Figure 6 · learning from scratch
420k · 9h
AlphaZero's strength as it trains, reproducing the qualitative shape of the paper's Figure 1. The teal curve rises from random play; the dashed line is the world-champion baseline. The marker shows where the paper reports it passed: chess at 300k steps, shogi at 110k, Go at 165k. The y-axis is relative strength, not the paper's exact Elo.

Then the head-to-head matches, 100 games each at one minute per move. Against Stockfish, AlphaZero did not lose a single game: 25 wins and 25 draws as White, 3 wins and 47 draws as Black, 28 wins, 72 draws, 0 losses overall. The split says something real: with the first move it won freely, and with Black it could not be beaten, only held to draws. Against Elmo it won 90, drew 2, and lost 8. Against the published three-day AlphaGo Zero in Go it won 60 and lost 40. So "undefeated" is true only of the Stockfish match; the shogi and Go results are dominant, but not perfect.

And then the fine print, which matters because this preprint became famous fast. The match conditions drew real criticism, mostly from the computer-chess community. Stockfish was run at a fixed one minute per move, taking away the time-management that a normal engine relies on; it was given only a 1 GB hash table for a 64-thread search, which is small; and the paper's silence on an opening book and endgame tablebases (which Stockfish normally uses) suggests it had neither. None of this means AlphaZero did not win, but it does mean the headline match was not a clean tournament. The authors clearly heard this: the peer-reviewed version a year later (Silver et al., Science, 2018) reran the chess match under tournament conditions, against a much stronger Stockfish, with proper time controls, hash, and tablebases, over a thousand games, and AlphaZero still won convincingly. The tabula-rasa result held; the original presentation of it had been generous to itself.

A small reporting note, since the figures above talk about strength. The paper measures playing strength in Elo, writing the probability that player aa beats player bb as a logistic in their rating gap:

p(a beats b)=(1+exp(celo(e(b)e(a))))1,celo=1400p(a \text{ beats } b) = \Big(1 + \exp\big(c_{\text{elo}}\,(e(b) - e(a))\big)\Big)^{-1}, \qquad c_{\text{elo}} = \tfrac{1}{400}

That looks like the familiar Elo formula but is written in base ee, not the textbook base 10, so the usual shorthand "a 400-point gap means about 10-to-1 odds" does not follow from this version of the equation. It is a reporting convention, not a result, but it is the kind of thing to read carefully rather than assume.

What lasts from this paper is not a chess rating. It is the demonstration that a single learning algorithm, given a game's rules and nothing else, can teach itself to play three of the hardest games humans invented better than any program built by hand for them, and that the engine of that learning is a search good enough to keep improving on the network it is built from. MuZero took the next step of learning the rules too; the line continues into protein folding (AlphaFold), matrix multiplication (AlphaTensor), and algorithm discovery (AlphaDev). The blank slate turned out to be a remarkably good place to start.

Provenance Verified against primary literature
AlphaGo Zero (2017)The MCTS, PUCT rule, dual-head network, and policy-iteration framing, taken unchanged.
Expert Iteration (2017)The same expert-search / apprentice-network idea, found independently (Anthony et al., on Hex).
AlphaGo (2016)The earlier program: human games, rollouts, separate policy and value networks.
Shannon (1950)Type A (brute-force) vs Type B (selective) search, the framing for the positions-per-second gap.
caveatThe preprint's headline chess match gave Stockfish a fixed one minute per move, a 1 GB hash, and (by its silence) no opening book or endgame tables, conditions later criticized as not a clean tournament. The peer-reviewed 2018 Science version reran it under tournament controls against a stronger Stockfish and AlphaZero still won.

Questions you might still have

?

Why is the search stronger than the network it is built from?
Because it spends compute looking ahead. At fixed weights, before any training, running 800 simulations is simply more informative than the network’s single glance, so the searched policy picks better moves. Training does not create that edge; it copies the search’s better judgment into the fast network so the next search starts from higher ground.

?

If the network starts random, what stops self-play from learning nonsense?
The only ground truth is the game outcome under the real rules, which the system cannot fake. Whatever the current network is, the search improves on it, so each round’s training targets are a little better than the network that produced them. That is ordinary policy iteration: evaluate, improve, repeat, climbing from random play.

?

Did AlphaZero really beat Stockfish fairly?
In the 2017 preprint, not under clean tournament conditions: Stockfish ran at a fixed minute per move with a small hash and no opening book or tablebases. But the result held up. The 2018 Science version reran the match under proper tournament controls against a stronger Stockfish, over a thousand games, and AlphaZero still won.

?

How is this different from MuZero?
AlphaZero is given the rules and uses them inside the search to simulate moves and detect wins. Its successor, the MuZero explainer on this site, removes that: it learns a model of the game’s dynamics and plans inside the learned model, so it also works on Atari, where the rules are not handed to it.

Footnotes & further reading

  1. The paper: Silver, Hubert, Schrittwieser, et al., Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (DeepMind, arXiv, December 2017). The numbers on this page are from this preprint.
  2. The peer-reviewed version, which reran the matches under tournament conditions: Silver et al., A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play (Science 362, 2018).
  3. Where the search and architecture come from: Silver et al., Mastering the game of Go without human knowledge (AlphaGo Zero, Nature 550, 2017). AlphaZero defers to it for every detail of MCTS, PUCT, and the network.
  4. The contemporaneous, independent version of the expert-search / apprentice-network idea: Anthony, Tian, Barber, Thinking Fast and Slow with Deep Learning and Tree Search (Expert Iteration, NeurIPS 2017), demonstrated on Hex.
  5. The PUCT selection rule traces to Christopher Rosin, Multi-armed bandits with episode context (2011); the selective-search framing to Claude Shannon, "Programming a Computer for Playing Chess" (1950).
  6. The direct successor, which learns the rules too: the MuZero explainer on this site.