
Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

Plan ahead in a game without ever being told its rules.

MuZero learns three small functions that predict only what planning needs (the reward of a move, how good a position is, and which move to try), then plans with them. The same algorithm reaches superhuman play in Go, chess, and shogi, and a new state of the art across 57 Atari games, and for the board games it is never told the rules.

Explaining the paper: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Schrittwieser, Antonoglou, Hubert, Simonyan, … Silver · DeepMind · Nature 2020 · arXiv:1911.08265

One agent reached superhuman play in Go, chess, and shogi, and a new state of the art across 57 Atari games. For the board games, nobody ever wrote down the rules for it.

To plan, you need a model. A chess engine looks ahead by playing moves out on an internal board, so it has to know the rules: which moves are legal, where each one leads, when the game is over. AlphaZero, the program that beat the strongest humans and engines at chess, shogi, and Go, leaned on exactly such a rulebook. It searched thousands of futures by consulting a perfect simulator of the game and scoring the resulting positions with a neural network. Take the rulebook away and the search has nothing to look ahead with.

Most of the real world has no rulebook. There is no perfect simulator of a robot's kitchen, a power grid, or an Atari screen full of moving sprites whose dynamics you were never handed. So for visually rich, rule-free domains the field leaned on a different family: model-free reinforcement learning, which skips the model and learns a policy or value directly from experience. That family won Atari and lost the board games, because without lookahead it cannot do the precise, deep planning that chess and Go demand.

MuZero, from DeepMind, is the algorithm that got both at once: it keeps AlphaZero's search but stops relying on a given model and learns one instead. The learned model stays tractable because of how little it has to predict: not the rules, not the pixels, only the three quantities a planner consumes. A few ideas build the method: what a learned model has to predict and what it can ignore, how three small networks produce it, how Monte-Carlo tree search runs inside it, and how the search trains the very networks it runs on.

A model of only what planning needs

Classical model-based reinforcement learning tries to predict the world. You learn a model that, given a state and an action, tells you the next state, and you check it by how well it reproduces what actually happens, often down to the pixels of the next frame. This is the obvious thing to build, and on visually rich domains like Atari it had stubbornly underperformed. Predicting frames is hard, the errors compound as you roll the model forward, and most of the model's capacity goes into rendering detail that has nothing to do with whether you are winning. Pixel-predicting world models stayed well below the model-free state of the art for exactly this reason.

MuZero asks a sharper question: what does a planner read off a model? A tree search only ever needs three things at each imagined step. The reward of a move, so it can add up the score of a line. The value of a position, so it can judge a line without playing it to the end. And a policy, a hint about which moves are worth trying, so the search does not waste time. Nothing in a search ever asks the model to draw the board. So MuZero trains its model to predict only those three, and frees it from everything else.

This idea carries the method, and it is easy to over- or under-claim. The model's internal state is not required to reconstruct the observation, and it is not required to match the true state of the game either. It carries no semantics at all. It is only a vector the network is free to shape however makes the three predictions most accurate. A grandmaster reading a line ahead does not hallucinate the wood grain of each future board; they hold an abstract sense of "this line is winning, that square is weak," just enough to judge it. MuZero's hidden state works the same way: it holds enough to judge a line and cannot be turned back into a screen. (Where the analogy frays: a human can redraw the literal board on request and uses the real rules of movement; MuZero's hidden state genuinely cannot be decoded into a screen, and it invents its own dynamics.) Because the prediction is all that is graded, the model is free to throw away anything that does not change a reward, a value, or a policy.

The figure makes the trade concrete. The left panel is a tiny screen with one relevant agent and goal plus a dial of task-irrelevant clutter. A model that has to reconstruct the screen must encode every lit cell, so its job grows as you turn the dial up. MuZero's three outputs depend only on the agent and the goal, so they do not move at all. Drag the dial and watch the reconstruction load climb while the planning answer stays put.

Figure 1 · what the model is graded on

Turn up the irrelevant detail. A model graded on reconstructing the observation must encode every cell, so its load (amber) climbs. MuZero is graded only on reward, value, and policy, which depend on the agent and goal alone, so its outputs are unchanged. The hidden state can ignore everything that does not move a prediction.

Those three predictions are cheap to grade, and the gains are all downstream. The model is small, because it never spends capacity on appearance. It rolls forward cleanly, because abstract states do not accumulate rendering error the way predicted frames do. And it works in domains where there is no clean state to reconstruct in the first place. You give up one thing: there is no decoder, so you cannot watch MuZero's imagined future as a movie. With nothing to reconstruct, you no longer need to.

Three functions: representation, dynamics, prediction

The model is three networks with one job each, trained together as a single system written $\mu_\theta$ .

s^0 = h_\theta(o_1,\dots,o_t), \qquad r^k, s^k = g_\theta(s^{k-1}, a^k), \qquad p^k, v^k = f_\theta(s^k)

The representation $h_\theta$ takes the past observations (the last frames of the screen, or the recent board positions) and compresses them into a root hidden state $s^0$ . The dynamics $g_\theta$ is the imagination step: hand it the current hidden state and an action you are considering, and it returns the immediate reward $r^k$ of that action and the next hidden state $s^k$ . It is recurrent, the same $g_\theta$ applied over and over, so you can roll the model forward as many imagined steps as you like. The prediction $f_\theta$ reads a policy $p^k$ and a value $v^k$ off any hidden state, the same role the policy-value network plays in AlphaZero, except here it looks at an invented hidden state rather than a real board.

The superscript $k$ counts imagined steps forward from the present, not real time. At $k=0$ you are at the actual current position. At $k=1$ you have imagined one move, at $k=2$ two, and so on, all inside the model, without touching the environment. Drag the unroll depth below and watch one recurrent $g_\theta$ reused down the chain, reading a reward, a value, and a policy off each imagined state. Notice the hidden states are drawn as abstract vectors, because that is all they are.

Figure 2 · the model, unrolled

h turns past observations into the root hidden state s0. The same recurrent g then steps forward, each step taking an action and emitting a reward and the next hidden state; f reads a policy and value off every state. Drag K to unroll further. The states are abstract vectors, not pictures.
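
The calling pattern in Figure 2 can be sketched with toy stand-ins. In the snippet below, `represent`, `dynamics`, `predict`, and `unroll` are hypothetical names, and the arithmetic inside them is arbitrary placeholder math rather than the paper's networks. Only the shape of the computation follows the equations above: one call to h, then the same g reused once per imagined action, with f read off every state.

```python
import math

def represent(observations):
    """Toy h: squash a list of observation numbers into a 2-element hidden state."""
    total = sum(observations)
    return [math.tanh(total), math.tanh(total / 2.0)]

def dynamics(state, action):
    """Toy g: return (reward, next_state) for one imagined action."""
    reward = 0.1 * action + 0.5 * state[0]
    next_state = [math.tanh(state[0] + action), math.tanh(state[1] - action)]
    return reward, next_state

def predict(state, num_actions=2):
    """Toy f: return (policy, value) read off a hidden state."""
    logits = [state[0] * (a + 1) for a in range(num_actions)]
    normalizer = sum(math.exp(l) for l in logits)
    policy = [math.exp(l) / normalizer for l in logits]
    value = state[1]
    return policy, value

def unroll(observations, actions):
    """Apply h once, then g once per action, reading f at every state."""
    state = represent(observations)
    policy, value = predict(state)
    # k = 0 is the real present: it has a policy and a value but no reward yet.
    steps = [{"k": 0, "reward": None, "policy": policy, "value": value}]
    for k, action in enumerate(actions, start=1):
        reward, state = dynamics(state, action)   # the same g at every step
        policy, value = predict(state)
        steps.append({"k": k, "reward": reward, "policy": policy, "value": value})
    return steps

steps = unroll(observations=[0.2, 0.1, 0.4], actions=[1, 0, 1])
for step in steps:
    print(step["k"], step["reward"], step["value"])
```

The hidden state here is just a short list of floats with no meaning attached, which is the point: nothing in `unroll` ever asks what the state looks like.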

Each of the three predictions is trained against its own target. The reward $r^k_t$ should match the real reward observed $k$ steps later, $u_{t+k}$. The value $v^k_t$ should match a value target $z_{t+k}$ built from what actually happened. The policy $p^k_t$ should match the move distribution the search settled on, $\pi_{t+k}$. The model is judged only on these three predictions, unrolled from a hidden state that has no obligation to mean anything.

Search inside the imagined model

Given that model, planning is Monte-Carlo tree search, the same algorithm AlphaZero used, run over imagined hidden states instead of real board positions. The search builds a tree where each node is a hidden state and each edge is an action, and it grows the tree one simulation at a time. Every simulation is three steps.

Select. Start at the root and walk down the tree, at each node picking the action that looks most worth investigating, until you reach an edge the tree has not expanded yet. Expand. Apply the dynamics function once to that edge: $r^k, s^k = g_\theta(s^{k-1}, a^k)$ gives the new node its reward and hidden state, and the prediction function $f_\theta$ gives it a value and a prior policy over its own children. That is one call to $g_\theta$ and one to $f_\theta$ per simulation, so the cost is the same order as AlphaZero's. Back up. Carry the new node's value back up the path you descended, folding in the predicted reward at each edge and discounting as you go, and increment a visit counter on every edge you touched. (In a two-player game a position good for you is bad for your opponent, so the value flips sign at each ply on the way up.)

Each edge keeps running statistics as the tree grows: a visit count $N$ , a mean value $Q$ , the prior $P$ from the prediction network, the predicted reward $R$ , and the resulting hidden state $S$ . After hundreds of simulations the action visited most from the root is the move the search recommends. Step through the search below: the network's prior favors action A, but the values the search backs up favor B, and you can watch the visits pile onto B as the simulations accumulate.

Figure 3 · one simulation at a time

Each simulation selects a path (highlighted), expands a new leaf with the dynamics function, and backs its value up the path. Edge thickness is the visit count. The dashed amber ticks are the network's prior P, which favors A; the teal bars are the visit counts N(root, a), which shift onto B, the higher-value move. Play it or scrub.
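
The backup step is easy to check by hand on a single path. The sketch below covers only the single-player, discounted case (no sign flip between plies). `backup` and its list-based path are a hypothetical simplification for illustration, not the paper's tree code, and the discount of 0.5 is chosen only to keep the arithmetic readable.

```python
def backup(path_rewards, leaf_value, gamma):
    """Return the value backed up to each node on one path, root first.

    path_rewards[i] is the predicted reward on the edge leaving node i.
    The node after the last edge is the newly expanded leaf.
    """
    values = [leaf_value]
    v = leaf_value
    for reward in reversed(path_rewards):
        v = reward + gamma * v        # fold in the edge reward, then discount
        values.append(v)
    values.reverse()
    return values

values = backup(path_rewards=[1.0, 0.0, 2.0], leaf_value=10.0, gamma=0.5)
print(values)  # [2.75, 3.5, 7.0, 10.0]
```

Reading right to left: the leaf contributes 10, its parent sees 2 + 0.5 × 10 = 7, the next node up sees 0 + 0.5 × 7 = 3.5, and the root sees 1 + 0.5 × 3.5 = 2.75. Each of those numbers is what gets added to the corresponding node's running mean.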

Three things in MuZero's search differ from AlphaZero, and all three exist because there is no longer a rulebook to lean on. The search never checks legality inside the tree; it only masks illegal moves at the root, where it can still query the real environment, and trusts the network to learn not to propose moves that never occur in real games. It gives terminal positions no special treatment either. There is no game-over flag to check inside the tree, so the dynamics function is trained to keep returning the same state and zero further reward once a game has ended. A search that imagines moves past the real game's end just spins in place rather than inventing nonsense. And because Atari rewards can be any size while board-game outcomes are tidy wins and losses, the search has to handle values of wildly different scale.

Choosing what to try next

Selection is where the search spends its budget wisely. At each node MuZero scores every action and descends into the best one, balancing two pulls: exploit the action that already looks good, or explore an action the prior favors but the search has barely tried. The rule is a variant of the one AlphaZero used, called pUCT.

a^k = \arg\max_a \left[\, Q(s,a) + P(s,a)\cdot\frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}\cdot\left(c_1 + \log\frac{\sum_b N(s,b) + c_2 + 1}{c_2}\right)\right]

(2)

The first term, $Q(s,a)$, is exploitation: the mean value backed up through that action so far. The rest is the exploration bonus. It is large when the prior $P(s,a)$ favors the move, and it shrinks as that action's own visit count $N(s,a)$ grows, through the $1/(1+N(s,a))$ factor. So a promising move gets probed early, and once it has been tried enough, its bonus fades and its measured value $Q$ has to carry it. The sum $\sum_b N(s,b)$ runs over all of the node's actions including $a$, so it equals the parent's total visit count; the slowly growing $c_1 + \log(\cdots)$ factor barely moves for any realistic search. With the paper's constants $c_1 = 1.25$ and $c_2 = 19652$, and at most 800 simulations, $\sum_b N$ never comes close to $c_2$, so the log term stays close to zero (about 0.04 at 800 visits) and the whole factor sits near $c_1$ throughout. It is there to let exploration ramp up gently in much larger searches, not in these.
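
A minimal sketch of the selection score, assuming $Q$ is already on a $[0,1]$ scale. The constants are the paper's; the function names `exploration_factor` and `puct_score` and the example numbers are made up for illustration.

```python
import math

C1 = 1.25      # paper's c_1
C2 = 19652     # paper's c_2

def exploration_factor(parent_visits):
    """The slowly growing c_1 + log(...) factor from the pUCT rule."""
    return C1 + math.log((parent_visits + C2 + 1) / C2)

def puct_score(q, prior, parent_visits, child_visits):
    """Exploitation term Q plus the prior-weighted exploration bonus."""
    bonus = (prior * math.sqrt(parent_visits) / (1 + child_visits)
             * exploration_factor(parent_visits))
    return q + bonus

# The factor barely moves between an empty tree and an 800-simulation one.
factor_start = exploration_factor(0)
factor_end = exploration_factor(800)
print(factor_start, factor_end)

# Two root actions after 20 total visits (hypothetical numbers):
# A has the larger prior but a low Q, B has the smaller prior but a higher Q.
score_a = puct_score(q=0.2, prior=0.7, parent_visits=20, child_visits=15)
score_b = puct_score(q=0.6, prior=0.3, parent_visits=20, child_visits=5)
print(score_a, score_b)
```

The two printed factors come out at roughly 1.25 and 1.29, so across a whole search the bonus is essentially the prior times $\sqrt{\sum_b N}/(1+N)$ times a constant. In the second example B wins the comparison: A's 15 visits have already eaten most of its prior-driven bonus.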

Below, the two terms pull in opposite directions. The prior favors A, the learned values favor C. Early picks follow the prior because the bonus dominates, but each visit shrinks that action's bonus, the value term takes over, and the visits concentrate on C.

Figure 4 · exploit versus explore

Each action's selection score stacks exploitation Q on top of an exploration bonus (the prior, scaled by a factor that decays as that action is visited). The prior favors A; the values favor C. Early picks chase the prior, but the bonus on a visited action shrinks, so C, the high-value move, ends up most visited.

One subtlety the equation hides: AlphaZero could assume values lived in $[0,1]$ (you either win or you do not), so $Q$ and the bonus were already comparable. Atari returns can be in the thousands, and a raw $Q$ that large would swamp the bonus and make the prior irrelevant. So before plugging $Q$ into the rule, MuZero rescales it to $[0,1]$ using the smallest and largest $Q$ seen anywhere in the current tree.

\bar{Q}(s,a) = \frac{Q(s,a) - \min_{s,a\in \text{Tree}} Q}{\max_{s,a\in \text{Tree}} Q - \min_{s,a\in \text{Tree}} Q}

(5)

This normalization is not cosmetic. It lets a search engine built for the bounded values of board games transfer, unchanged, to the unbounded scores of Atari, with no game-specific reward scaling baked in. The min and max move as the tree grows, so the same raw $Q$ can normalize slightly differently between two simulations of the same search; that is fine, because all that matters is keeping $Q$ and the bonus on a comparable footing.
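
The bookkeeping behind that rescaling is small. The sketch below is modeled loosely on the min-max helper in the authors' pseudocode, but the details (returning the raw value until two distinct values have been seen, the example Q values) are assumptions made for illustration.

```python
class MinMaxStats:
    """Track the smallest and largest Q seen in the current search tree."""

    def __init__(self):
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def update(self, q):
        self.minimum = min(self.minimum, q)
        self.maximum = max(self.maximum, q)

    def normalize(self, q):
        # Only rescale once the tree has seen two distinct values;
        # before that there is no range to divide by.
        if self.maximum > self.minimum:
            return (q - self.minimum) / (self.maximum - self.minimum)
        return q

stats = MinMaxStats()
raw_qs = [120.0, 4300.0, 950.0]        # Atari-sized values, made up
for q in raw_qs:
    stats.update(q)

normalized = [stats.normalize(q) for q in raw_qs]
print(normalized)
```

The smallest value maps to 0, the largest to 1, and 950 lands a little under 0.2. A later backup that pushes the maximum higher would shift all three, which is the drift between simulations that the paragraph above describes.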

Search makes a better policy

One idea turns a search engine into a learning algorithm. The network's raw policy $p$ is a snap judgment about which move is best. The search takes that snap judgment, spends a few hundred simulations of lookahead refining it, and comes out with a sharper, better answer: the distribution of visit counts over the root's actions. Looking ahead improves the policy. The move you settle on after calculating beats your first instinct, so you train the instinct toward what calculating concluded. The search policy $\pi_t$ is that improved answer, and it becomes the training target the raw policy is pulled toward.

One precise point that is easy to get wrong: the search policy is proportional to visit counts, not to $Q$ values.

\pi_t(a) \;\propto\; N(s^0, a)

Visits and value are correlated, since pUCT sends more simulations toward high-value actions, but they are not the same thing. A deep, narrow line can have a high $Q$ and yet few visits, and reading the policy off the counts rather than the values keeps it robust. Look back at the visit-count bars under Figure 3: those bars are $\pi_t$ , sharpened away from the prior toward the move search preferred.

When the agent plays a real move, it samples from those counts, with a temperature that controls how greedy it is.

p_\alpha = \frac{N(\alpha)^{1/T}}{\sum_b N(b)^{1/T}}

(6)

The temperature enters as the exponent $1/T$ on the visit counts, so it is easy to misread. At $T=1$ the agent samples in direct proportion to the counts (not uniformly), which keeps exploration alive early in training; as $T \to 0$ it collapses to always picking the most-visited move. In Atari, MuZero anneals it on a schedule: $T=1$ for the first 500k training steps, then $0.5$, then $0.25$, so play gets greedier as the network gets stronger.
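
A short sketch of equation (6) makes the effect of the temperature visible. The helper names `visit_policy` and `sample_action` and the visit counts are hypothetical; the formula is the one above.

```python
import random

def visit_policy(visit_counts, temperature):
    """Turn root visit counts into move probabilities: N^(1/T), normalized."""
    if temperature <= 0:
        raise ValueError("use argmax over the counts for T = 0")
    weights = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(weights)
    return [w / total for w in weights]

def sample_action(visit_counts, temperature, rng):
    """Sample one move index from the temperature-adjusted visit distribution."""
    probs = visit_policy(visit_counts, temperature)
    return rng.choices(range(len(visit_counts)), weights=probs, k=1)[0]

counts = [10, 30, 60]                  # made-up root visit counts
p_one = visit_policy(counts, 1.0)      # proportional to the counts
p_half = visit_policy(counts, 0.5)     # counts squared, then normalized
p_quarter = visit_policy(counts, 0.25) # counts to the fourth power

print(p_one, p_half, p_quarter)
print(sample_action(counts, 1.0, random.Random(0)))
```

At $T=1$ the most-visited move gets probability 0.6, matching its share of the visits. Lowering the temperature to 0.5 and then 0.25 raises that share each time, which is the "greedier" direction of the schedule.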

That closes the loop. Search uses the current network to produce an improved policy and value; the network is trained toward those improved targets; a better network makes the next search better still. A slower, more careful version of the network teaches a faster one. The eventual game result anchors all of it to ground truth, so the bootstrap has something real to climb toward.

What the network learns from

Everything is trained jointly to match three targets at every imagined step, summed into one loss.

\ell_t(\theta) = \sum_{k=0}^{K} \Big[\, \ell^r(u_{t+k}, r^k_t) + \ell^v(z_{t+k}, v^k_t) + \ell^p(\pi_{t+k}, p^k_t) \,\Big] + c\,\lVert\theta\rVert^2

(1)

Read it as: take a real trajectory, unroll the model $K$ steps from some point in it, and at each unrolled step $k$ penalize the gap between the model's prediction and what really happened $k$ steps later. Reward against the observed reward, value against a value target, policy against the search policy. There is no reconstruction term anywhere in the loss; value equivalence, the idea of training an abstract model to predict value rather than observations, shows up here as that absence. One small asymmetry: the $k=0$ step has a value and a policy target but no reward target, because a reward only appears once the dynamics function has taken an action, and at the root no action has been applied yet.

The value target is where the most-repeated description of MuZero goes wrong. For a game that ends, the target is the final outcome, $z_t = u_T$ , with wins, draws, and losses coded as $+1, 0, -1$ . For a long task like Atari, you cannot wait until the end, so MuZero uses an $n$ -step return: sum the next $n$ real rewards, then bootstrap with an estimate of the rest.

z_t = u_{t+1} + \gamma\, u_{t+2} + \cdots + \gamma^{\,n-1} u_{t+n} + \gamma^{\,n}\, \nu_{t+n}

The estimate it bootstraps from is $\nu_{t+n}$ , the value the search returned at that later step, not the raw value head $v$ . Most secondary accounts get this backwards. The search value is the network's own value refined by a tree of lookahead, so it is a better estimate of the position than the value head alone, and training the value head toward it is the same cheap improvement step as the policy target. The figure lets you check both halves: toggle the bootstrap between the search value $\nu$ and the network value $v$ and watch the target shift toward the true return, and slide $n$ to see the bootstrap's weight $\gamma^n$ shrink as more real reward is folded in.

Figure 5 · the value target

The target $z_t$ sums n real rewards and then bootstraps. Toggle the bootstrap between the search value ν and the network value v: ν lands closer to the dashed true return. Raise n to fold in more real reward and lean less on the (imperfect) bootstrap. MuZero uses n=10 for Atari; for board games it skips the bootstrap and uses the final outcome (n to infinity).

Small $n$ gives a stable but biased target, since it leans heavily on a bootstrap that is still being learned; large $n$ gives an accurate but noisier one, since it waits for more real, random reward. MuZero sits at $n=10$ for Atari, with the discount $\gamma = 0.997$ inherited from the R2D2 agent (which itself used $n=5$ ); board games sit at the far end of the dial, pure outcome with no discount.
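
The n-step target can be written out directly from the formula. In the sketch below, `n_step_return` is a hypothetical helper, the indexing convention (position `i` of each list holds $u_i$ or $\nu_i$) is an assumption, and so is the choice to treat rewards and the bootstrap as zero once the index runs past the end of the episode.

```python
def n_step_return(rewards, search_values, t, n, gamma):
    """z_t = u_{t+1} + g*u_{t+2} + ... + g^(n-1)*u_{t+n} + g^n * nu_{t+n}.

    rewards[i] holds u_i (index 0 is unused padding) and search_values[i]
    holds the search value nu_i. Anything past the episode's end counts as 0.
    """
    z = 0.0
    for i in range(1, n + 1):
        if t + i < len(rewards):
            z += gamma ** (i - 1) * rewards[t + i]
    if t + n < len(search_values):
        z += gamma ** n * search_values[t + n]   # bootstrap from nu, not v
    return z

# Made-up episode; gamma = 0.5 only so the arithmetic is easy to follow.
rewards = [0.0, 1.0, 0.0, 2.0, 0.0, 1.0]
search_values = [5.0, 5.0, 4.0, 4.0, 3.0, 3.0]

z_mid = n_step_return(rewards, search_values, t=0, n=3, gamma=0.5)
z_end = n_step_return(rewards, search_values, t=4, n=3, gamma=0.5)
print(z_mid, z_end)  # 2.0 1.0

# With the paper's Atari settings the bootstrap still carries most of the weight.
bootstrap_weight = 0.997 ** 10
print(bootstrap_weight)
```

For `t=0` the target is 1 + 0.5·0 + 0.25·2 + 0.125·4 = 2.0; near the end of the episode only the one remaining reward survives. The last line shows why the quality of the bootstrap matters so much at $\gamma = 0.997$ and $n = 10$: its weight $\gamma^n$ is still about 0.97.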

One more practical problem hides in those Atari numbers. A value can be $0.3$ in one game and tens of thousands in another, and a network that has to regress such a range directly trains badly. MuZero borrows R2D2's fix: squash the value through an invertible compressor that is nearly linear near zero and square-root-like far out, so a value of 0.3 and one in the tens of thousands both land in a small range.

h(x) = \operatorname{sign}(x)\big(\sqrt{|x|+1}-1\big) + \varepsilon\, x, \qquad \varepsilon = 0.001

On the squashed axis, MuZero does not regress a number at all. It outputs a distribution over 601 integer bins from $-300$ to $300$ , and represents any value two-hot, split across its two nearest bins (a squashed value of $3.7$ becomes weight $0.3$ on bin $3$ and $0.7$ on bin $4$ ), trained with a cross-entropy loss that behaves well across scales. To read a number back, take the expectation under the bins and invert $h$ . The bins live on the squashed axis, so the modest range $[-300, 300]$ covers an enormous range of real returns: squashed $300$ corresponds to a raw return near $58{,}700$ . Drag a raw return through the pipeline below and watch it compress, split across two bins, and round-trip back.

Figure 6 · fitting any reward on one dial

A raw return is squashed by $h$ onto a tame $[-300, 300]$ axis (squashed 300 is a raw return near 58,700), then encoded two-hot across its two nearest bins. Reading it back takes the expectation and inverts $h$. The round-trip error stays near zero only because the inverse squares its bracket; drop that and it breaks.

Board games skip all of this and keep AlphaZero's plain squared-error value and no reward loss, since the only reward is the tidy outcome at the end. Atari uses the cross-entropy categorical loss for both value and reward. The policy loss is cross-entropy against the search distribution in both settings.

What it looks like in practice

Concretely, two loops run side by side. Actors play games: at each real position they run a search and play a move sampled from its visit counts, storing the trajectory along with the search policy $\pi_t$ and search value $\nu_t$ at each step. A learner samples those trajectories and updates the network. The search is the select-expand-backup loop of Figure 3, with the discounted reward folded into the value as it climbs back up the tree:

# one simulation of MuZero's search (after the official pseudocode)
node = root                       # root = h(o_1..o_t), already expanded
path = [root]
while node.expanded:              # SELECT: descend by the pUCT rule
    action, node = argmax_puct(node)
    path.append(node)
parent = path[-2]                 # parent of the unexpanded leaf we will grow
r, s = g(parent.hidden, action)   # EXPAND: imagine one step with dynamics
p, v = f(s)                       # read off policy + value at the new node
node.expand(reward=r, hidden=s, prior=p)
for n in reversed(path):          # BACK UP the value along the path
    n.visit += 1
    n.value_sum += v              # (sign-flipped for the opponent's turn)
    v = n.reward + gamma * v      # fold in the predicted reward, discount

One learner step runs the representation once to get $s^0$ , unrolls the dynamics for $K=5$ imagined steps along the trajectory's real actions, scores every head against its target, and sends one backward pass through all three functions at once. Two gradient scalings keep the recurrent core stable and do different jobs: dividing each head's loss by $K$ makes the total gradient the same size no matter how far you unroll, and halving the gradient entering the dynamics at each step keeps the repeatedly-traversed $g_\theta$ from accumulating $K$ times the gradient everything else gets.

# one training step on a sampled trajectory
# inputs: observations, real actions a, search policies pi, rewards u, values nu
s = h(obs_up_to_t)                # representation -> root hidden state s0
z = nstep_return(u, nu)           # value target: bootstrap from search nu, not v
loss = 0
for k in range(K + 1):            # K = 5 hypothetical steps
    p, v = f(s)                   # prediction head at s^k
    loss += xent(pi[t+k], p) + value_loss(z[t+k], v)
    if k > 0:
        loss += reward_loss(u[t+k], r)   # no reward target at k = 0
    if k < K:
        r, s = g(s, a[t+k+1])     # dynamics -> next reward + hidden state
        s = scale_grad(s, 0.5)    # halve the gradient entering dynamics
loss = loss / K + c * l2(theta)   # 1/K keeps the total unroll-invariant
loss.backward()                   # one backward pass through h, g, f

The shapes are ordinary. In Go the hidden state is a $19\times19\times256$ tensor; in Atari the screen (the last 32 frames at $96\times96$, plus the last 32 actions) is downsampled to a $6\times6\times256$ hidden state, and an action is encoded as a plane and stacked on before each dynamics step. The representation and dynamics functions reuse AlphaZero's convolutional residual trunk, but with 16 residual blocks rather than 20, since that trunk runs once per node in every search and a leaner one keeps the per-node cost down. One more trick keeps the unroll from drifting: the hidden state is rescaled into $[0,1]$ by its own min and max at every step, a plain bounding with no learned parameters, legitimate precisely because the hidden state has no semantics to preserve.
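
The hidden-state rescaling is plain min-max scaling over the state's own entries. A minimal sketch, assuming the state has been flattened to a list of floats; `scale_hidden_state` is a hypothetical name, and returning all zeros for a constant state is an assumption added to avoid dividing by zero.

```python
def scale_hidden_state(state):
    """Rescale a flattened hidden state into [0, 1] by its own min and max."""
    lo, hi = min(state), max(state)
    if hi == lo:
        return [0.0 for _ in state]   # assumption: a constant state maps to zeros
    return [(x - lo) / (hi - lo) for x in state]

scaled = scale_hidden_state([-2.0, 0.0, 6.0])
print(scaled)  # [0.0, 0.25, 1.0]
```

Applying this after the representation function and after every dynamics step keeps the state in the same range as the action planes stacked onto it, however many steps the model is unrolled.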

What it actually does

On the 57-game Atari benchmark MuZero set a new state of the art for both mean and median human-normalized score (where 100% is human-level play, so the numbers run far above 100). It beat the previous best model-free agent, R2D2, on 42 of the 57 games, and the previous best model-based agent, SimPLe, on all of them. Its median normalized score was 2041%, against R2D2's 1921%, so its median game sat at roughly twenty times the human score, and it did so on fewer environment frames. On chess and shogi it matched AlphaZero, and on Go it slightly exceeded it, while using a smaller network per evaluation (16 residual blocks against AlphaZero's 20) and, unlike AlphaZero, without ever being given the rules. Matching a rules-aware planner with nothing but a learned model, in the planner's own games, had never been done before.

The ablations are where you see how much of the work the planning is doing, and the answer splits by domain. In Go, searching longer keeps making the agent stronger, and the learned model tracks a perfect rules engine almost exactly, all the way out to searches two orders of magnitude longer than the 800 simulations it trained on. In Atari, more search helps too, but the gains flatten around 100 simulations, presumably because the learned model is less accurate there and deeper lookahead compounds its errors. Drag the simulation budget below and compare the two regimes.

Figure 7 · how far planning scales

In Go, playing strength keeps climbing with more search and the learned model nearly matches a perfect model far past its training budget of 800 simulations. In Atari the gains plateau near 100 simulations, and even a single simulation already plays well. Toggle the domain and drag the simulation count. (Curves reproduce the qualitative shape of the paper's Figure 3.)

That last point is often misread. By the end of training, even a single simulation, which is the raw policy with no search at all, plays Atari well. This does not mean search was unnecessary. During training, search is the teacher: networks trained with more simulations learn faster and reach higher final scores, and when the authors swapped MuZero's search-based targets for a plain Q-learning objective on Ms. Pacman, it learned slower and plateaued lower. The single-simulation result means the raw policy has, by the end, absorbed what search spent all of training teaching it.

The lineage runs AlphaGo Zero to AlphaZero to MuZero, and the paper's sample-efficient variant, Reanalyze, extends it once more by re-running search on old positions with the latest network to mint fresh targets. It reaches a 731% median while training on roughly 200 million frames, a small fraction of what the full agent used, trading peak score for sample efficiency rather than losing ground. Underneath all of it: you can have the lookahead, sample efficiency, and precision of planning without a model of the world, as long as you model only the few things a decision turns on.

Provenance · Verified against primary literature

AlphaZero (2018): The MCTS self-play loop, the prediction network f, and the pUCT search MuZero keeps; MuZero drops the perfect simulator.

AlphaGo Zero (2017): Search as a policy-improvement operator; the visit-count distribution as the policy target.

Predictron / VPN (2017): Value-equivalent models: train an abstract model to predict value, not observations. MuZero scales this to full MCTS.

R2D2 (2019): The discount 0.997 and the invertible reward/value scaling (with ε=0.001).

UCT (Kocsis & Szepesvári 2006): Bandit-based tree search and the asymptotic-optimality result the search descends from.

Correction: MuZero's value target bootstraps from the MCTS search value ν, not the network's value head v. Most secondary accounts state the network value; the paper bootstraps from the search-refined estimate, and we teach that.

Questions you might still have

Does MuZero learn the rules of the game?
No. It is never given the transition rules, and inside its search it never checks whether a move is legal or whether the game is over. It still asks the real environment which moves are legal at the current position (the root), then plans entirely inside a model it learned. That model only has to predict reward, value, and policy well enough to plan.

Can you look at a hidden state and see the board?
No decoder exists. The hidden state is trained only to make the three predictions come out right, with no obligation to resemble the screen or the true game state. That freedom is exactly what lets it discard detail that does not change a decision, which is why it works where pixel-predicting models had failed.

If one simulation plays well by the end, why bother searching?
During training, search is the teacher. More simulations learn faster and reach higher scores, and replacing search targets with plain Q-learning learns slower and plateaus lower. By the end, the raw policy has internalized what search taught it, which is why a single simulation looks fine at evaluation time but would have been far weaker to train with.

Why bootstrap the value from the search value and not the network value?
The search value is the network’s own value head refined by a tree of lookahead, so it is a sharper estimate of the position. Training the value head toward it is a cheap improvement step, the value-side twin of distilling the search policy back into the prior.

How is this different from AlphaZero?
AlphaZero used a perfect rules engine in three places: the state transitions, the legal-move list, and detecting game-over. MuZero replaces all three inside the tree with its learned model, so it works in domains with no simulator at all, like Atari, while matching AlphaZero where a simulator does exist.

Footnotes & further reading

The paper: Schrittwieser et al., Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (DeepMind, Nature 2020). The equations and constants here come from the v2 arXiv text and the authors' official pseudocode in its ancillary files.
The search and self-play lineage: Silver et al., A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go (AlphaZero), and Mastering the Game of Go without Human Knowledge (AlphaGo Zero).
Value-equivalent models, the idea MuZero scales up: Silver et al., The Predictron, and Oh et al., Value Prediction Networks. The principle was later formalized by Grimm et al., The Value Equivalence Principle, which cites MuZero as an instance.
The discount and the invertible value scaling are inherited from Kapturowski et al., Recurrent Experience Replay in Distributed RL (R2D2); the form of the scaling is from Pohlen et al., Observe and Look Further (which used ε=0.01; MuZero uses R2D2's ε=0.001).
The tree-search foundation: Kocsis & Szepesvári, Bandit Based Monte-Carlo Planning (UCT); the prior-in-the-bonus idea traces to Rosin, Multi-armed Bandits with Episode Context.
Prioritized sampling of the replay buffer follows Schaul et al., Prioritized Experience Replay, with MuZero's priority set to the gap between the search value and the n-step return.
Why the visit-count policy improves on the prior is made precise by Grill et al., Monte-Carlo Tree Search as Regularized Policy Optimization, which reads the search policy as a regularized greedy step toward high-value actions.
