Verified · arXiv:1803.10122 · 24 min
Model-based RL · Generative models

World Models

An agent that learns to act inside its own dream.

Split an agent into a large model that learns how its world looks and moves, and a tiny controller that decides what to do. The world model can be trained without any reward, and once it can predict the future, the controller can be trained entirely inside a hallucination and then dropped back into reality.

Explaining the paper: World Models · Ha, Schmidhuber · Google Brain · NNAISENSE · NeurIPS 2018 · arXiv:1803.10122

What if an agent could practice a task thousands of times without ever touching the real environment, by running the practice runs inside a model it dreamed up itself?

A batter facing a 100 mph fastball has less time to react than it takes for the visual signal to travel from the eye to the conscious brain. They swing anyway, and they connect, because they are not reacting to the ball. They are reacting to a prediction of where the ball will be. We all do this constantly. The brain carries a compressed running model of the world and acts on what that model expects to happen next, not on the raw flood of the senses, which arrives a beat too late to be useful.

World Models, by David Ha and Jürgen Schmidhuber, asks what an artificial agent built on the same principle would look like, and then builds the smallest version that works. The agent has three parts that mirror the cognitive story: something that sees and compresses each frame, something that remembers and predicts what comes next, and something that decides. The first two together are the "world model." The third is a controller so small you could fit its parameters on an index card.

That division is the whole idea, and it produces a result that sounds unlikely the first time you hear it. Because the world model learns to predict the future, you can unplug the real environment and let the model generate the agent's experience instead. The agent trains inside this dreamed-up environment, never seeing a real pixel, and the policy it learns there transfers back to the actual game. To make all of that land, we need to build the three parts in order, see how they snap together into a loop, and then watch what happens when you cut reality out of that loop.

A big world model and a tiny controller

Start with a problem that has shaped reinforcement learning for years: credit assignment. An agent takes thousands of actions and, much later, gets a single number telling it how well the whole episode went. Which actions deserve the credit? Backpropagation, the workhorse that trains huge networks elsewhere, needs a clean differentiable path from a loss back to every weight. A sparse, delayed, often non-differentiable reward gives it almost nothing to chew on. So in practice people who train policies by reinforcement learning keep the networks small, because a small network is all the weak reward signal can manage to steer.

The paper's move is to refuse the tradeoff. It splits the agent so that the part needing a rich, easy, differentiable objective and the part needing the weak reward signal are different networks, trained in different ways.

V and M are the world model. They hold essentially all of the agent's capacity (for the car-racing agent, about 4.8 million parameters between them) and they learn in a self-supervised way, predicting observations, a task that hands backpropagation a dense signal at every pixel and every time step. C holds 867 parameters and learns from reward alone. Because C is tiny, the reward signal only has to guide a tiny search problem, which means you can throw a gradient-free optimizer at it and not care that the reward is sparse. Most of the learning happens where learning is cheap; only a sliver happens where it is hard.
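To see how lopsided that split is, here is a small arithmetic sketch using the car-racing parameter counts listed in the footnotes of this explainer. The dictionary and function names are ours, chosen only for illustration.

```python
# Car-racing parameter counts as quoted in this explainer's footnotes.
PARAMS = {"V": 4_348_547, "M": 422_368, "C": 867}


def parameter_shares(params):
    """Return each component's fraction of the total parameter count."""
    total = sum(params.values())
    return {name: count / total for name, count in params.items()}


shares = parameter_shares(PARAMS)
world_model_params = PARAMS["V"] + PARAMS["M"]  # 4,770,915, about 4.8 million

for name, share in shares.items():
    print(f"{name}: {share:.4%} of all parameters")
```

The controller's share comes out at a small fraction of one percent, which is the sense in which almost the entire agent is the world model.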

Three letters, then, in dependency order: V turns a picture into a few numbers, M turns a history of those numbers into a guess about the next ones, and C turns both into an action. Build them one at a time.

V: compress what you see

The environment hands the agent a 64-by-64 color image at every step. That is 12,288 numbers, almost all of them redundant: neighboring pixels agree, the sky is the sky, the road is the road. The job of V is to throw away the redundancy and keep a compact code of what actually matters in the frame.

V is a variational autoencoder, or VAE (we have a separate explainer on the VAE if you want the full derivation). The short version: an encoder squeezes the image down to a small latent vector $z$ (32 numbers for the car-racing agent, 64 for the Doom agent), and a decoder tries to rebuild the original image from just those numbers. Train the pair to make the reconstruction match the input and the bottleneck is forced to spend its handful of numbers on whatever is most worth keeping.

The loss has two terms, and the paper minimizes their sum. The first is reconstruction error, the pixel-by-pixel squared difference between the frame and its rebuild. The second is a regularizer that pulls the distribution of codes toward a standard Gaussian:

$$\mathcal{L}_V = \underbrace{\lVert \mathbf{x} - \hat{\mathbf{x}} \rVert_2^2}_{\text{reconstruction}} \; + \; \underbrace{D_{\mathrm{KL}}\!\big(q(z \mid \mathbf{x}) \,\Vert\, \mathcal{N}(\mathbf{0}, \mathbf{I})\big)}_{\text{keep the code space tidy}}$$

That second term is what makes it a variational autoencoder rather than a plain one, and it earns its keep here for a specific reason. It caps how much information any single code can carry and it keeps the codes packed into a smooth, well-behaved blob around the origin with no strange empty pockets. That matters because of what happens next: M is going to start inventing latent codes that V never produced from a real image. A tidy latent space means those invented codes still land somewhere V's decoder can make sense of, instead of off in some undefined corner. (The paper leans on the capacity limit as the stated reason for the VAE; the robustness to invented codes is a welcome consequence of the same Gaussian prior.)
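The two-term loss can be sketched in a few lines. This assumes the usual VAE setup in which the encoder outputs a mean and a log-variance per latent dimension (a diagonal Gaussian), for which the KL term has a closed form. The function names are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def reconstruction_error(x, x_hat):
    """Sum of squared differences between a frame and its rebuild."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def vae_loss(x, x_hat, mu, logvar):
    """Reconstruction plus KL: the sum that gets minimized."""
    return reconstruction_error(x, x_hat) + kl_to_standard_normal(mu, logvar)


# A code that already matches the prior pays no KL penalty ...
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# ... while a code that wanders away from the origin does.
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))
```

The second call shows the pull toward the origin: moving one mean away from zero costs something even when the reconstruction is perfect.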

Drag across the figure to bend the track. The frame on the left is encoded into the 32-number code in the middle, then decoded back on the right. Watch what survives the squeeze and what does not:

Figure 1 · compress and rebuild
A frame is encoded into a 32-number latent z, then decoded back. The road and the car survive the bottleneck; the lane dashes and fine texture do not. V keeps the gist of the scene and spends none of its budget on detail you would not miss.

The reconstruction is visibly lossy, and that is the point. V is not a photograph, it is a thumbnail: enough to know where the road bends and where the car sits, with the texture sanded off. The whole V network is trained for a single pass over ten thousand random-play episodes and then frozen. From here on, the agent never works with pixels again. It works with $z$.

One honest caveat the paper raises about itself: because V learns with no idea what the task is, it cannot know which details will turn out to matter. It faithfully reproduces irrelevant wall textures in Doom while fumbling the task-critical tiles on the road in car racing. The reconstruction loss counts every pixel the same and never hears about the task, so V keeps whatever is large and repetitive and easy to rebuild, which is not always what the policy needs. We will come back to this.

M: predict what happens next

V compresses space. M compresses time. Its job is to look at the current code $z_t$, the action $a_t$ the agent is about to take, and a running summary of everything that came before, and predict the next code $z_{t+1}$. If V is the eye, M is the part that has watched enough of the world to know roughly what tends to happen next.

The running summary is the key object, so give it a name now. M is a recurrent network, specifically an LSTM, and at each step it carries a hidden state $h_t$. You can think of $h_t$ as M's compressed belief about the situation: not just what the last frame looked like, but where things seem to be heading. Because M is trained to predict the future, $h_t$ ends up holding exactly the information you would need to predict the future. Hold that thought, because it is what makes the controller's job easy later.

Now the part that takes a little care. M does not predict a single next code. It predicts a probability distribution over next codes. The reason is that real environments branch. From the same situation, a monster might fire or it might not; the road might continue straight or a curve might be about to appear. A network forced to output one number would have to average those futures into a mush that is none of them. So M outputs a distribution with room for several distinct possibilities at once.
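A few lines of arithmetic make the mush concrete. This is a toy illustration, not something from the paper: the two outcomes and their values (-1 for a left bend, +1 for a right bend) are invented, and squared error is assumed as the training loss for the single-number predictor.

```python
def best_single_guess(outcomes):
    """Under squared error, the best single prediction is the mean."""
    return sum(outcomes) / len(outcomes)


def mean_squared_error(guess, outcomes):
    return sum((guess - o) ** 2 for o in outcomes) / len(outcomes)


# Hypothetical latent coordinate: the road bends left (-1) or right (+1)
# equally often from the same situation.
outcomes = [-1.0, 1.0, -1.0, 1.0]

guess = best_single_guess(outcomes)
print("best single guess:", guess)
print("error of that guess:", mean_squared_error(guess, outcomes))
print("error of committing to +1:", mean_squared_error(1.0, outcomes))
```

The loss-minimizing guess sits halfway between the two futures, a value that never actually occurs. A distribution with two modes can represent both.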

The tool for that is a mixture density network, or MDN, an idea from Bishop (1994). Instead of one Gaussian, the network outputs a weighted mix of several (five, in this paper), each with its own mean and width. The whole predicted distribution for the next latent is

$$P(z_{t+1} \mid a_t, z_t, h_t) \;=\; \sum_{k=1}^{5} \pi_k \,\mathcal{N}\!\big(z_{t+1}; \, \mu_k, \, \sigma_k^2\big)$$

where the weights $\pi_k$ say how likely each branch is and sum to one. Stitch this onto the LSTM and you get the MDN-RNN, the same recipe Ha used to make a network draw sketches one stroke at a time. Each branch of the mixture can stand for a genuinely different thing that might happen next, which is exactly what you need to capture a discrete event like "a fireball appears."
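Here is a minimal sketch of that density for a single latent coordinate, together with its negative log, which is the natural score for how well the mixture explains the latent that actually came next. The five weights, means, and widths below are made-up values for illustration, not numbers from the paper.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, weights, mus, sigmas):
    """Weighted sum of Gaussian densities; weights are assumed to sum to one."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))


def negative_log_likelihood(x, weights, mus, sigmas):
    """Low when the mixture puts high density on the observed value x."""
    return -math.log(mixture_pdf(x, weights, mus, sigmas))


# Hypothetical five-component prediction for one coordinate of the next latent.
weights = [0.70, 0.15, 0.05, 0.05, 0.05]
mus = [0.0, 2.0, -1.0, 1.0, 3.0]
sigmas = [0.3, 0.3, 0.3, 0.3, 0.3]

print("NLL near the main mode:", negative_log_likelihood(0.0, weights, mus, sigmas))
print("NLL far from every mode:", negative_log_likelihood(5.0, weights, mus, sigmas))
```

An observation near a heavily weighted mode scores a low loss; one that no branch anticipated scores a high loss.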

That mixture also comes with a single dial that turns out to matter enormously, so meet it now. A temperature $\tau$ controls how much randomness you allow when you sample the next latent from the mixture. Following the sketch-drawing work, lowering $\tau$ does two things at once: it shrinks every Gaussian (temperature scales each mode's variance by $\tau$, so its width, the standard deviation, scales by $\sqrt{\tau}$) and it sharpens the choice among branches toward the single most likely one. As $\tau \to 0$ the model collapses onto its favorite branch with no spread, effectively deterministic. At $\tau = 1$ you get the distribution the network actually learned. Above one, the futures spread out and the rare branches get more airtime.
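Both effects of the dial fit in one short function. This sketch assumes the network emits unnormalized log-weights (logits) for the branches, as in the sketch-drawing convention described above: the logits are divided by the temperature before the softmax, and each standard deviation is multiplied by the square root of the temperature. The function name and the example numbers are ours.

```python
import math


def apply_temperature(logits, sigmas, tau):
    """Return (mixture weights, widths) after adjusting by temperature tau > 0."""
    scaled = [logit / tau for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    widths = [sigma * math.sqrt(tau) for sigma in sigmas]  # variance scales by tau
    return weights, widths


# Hypothetical two-branch case: a common branch (90%) and a rare one (10%).
logits = [math.log(0.9), math.log(0.1)]
sigmas = [0.4, 0.4]

for tau in (0.1, 1.0, 2.0):
    weights, widths = apply_temperature(logits, sigmas, tau)
    print(f"tau={tau}: rare-branch weight={weights[1]:.6f}, width={widths[0]:.3f}")
```

Cooling the model makes the rare branch's weight shrink toward zero while every mode narrows; heating it does the opposite.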

Slide $\tau$ and watch the distribution over one latent coordinate. The tall mode is the usual case, where nothing much changes; the small amber mode is the rare branch, where a fireball forms. Notice how the rare mode dies as you cool the model down:

Figure 2 · the next latent is a distribution
M predicts the next latent as a mixture of five Gaussians, not a single point. Lower the temperature $\tau$ and every mode narrows while the weight piles onto the likeliest one, so the rare "fireball" branch and its probability vanish. This collapse is the villain of the temperature story below.

M is trained on the same ten thousand random episodes as V, for twenty passes, with one simple objective: make the mixture assign high probability to the latent that actually came next. No reward, no planning, just next-frame prediction in the compressed space V built. After training, M is a generative model of how the world moves. That is a stronger thing to own than it sounds, and the rest of the paper is about cashing it in.

C: the part that acts

After V and M, the controller is almost an anticlimax, which is the design working as intended. C takes the current code $z_t$ and M's hidden state $h_t$, glues them into one vector, and runs a single linear layer to produce the action. That is the entire policy, and it is the one numbered equation in the paper:

$$a_t = W_c\,[\,z_t \; h_t\,] + b_c \tag{1}$$

No hidden layers, no nonlinearity beyond squashing the output into the valid action range. For the car-racing agent, $z_t$ is 32 numbers and $h_t$ is 256, so the concatenation is 288 numbers, and mapping that to three continuous controls (steer, gas, brake) takes 288 × 3 weights plus 3 biases, which is 867 parameters. That is the whole brain that touches the reward.
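The parameter count and the policy itself are short enough to write out. In this sketch the use of tanh as the output squashing is an assumption (the text above only says the output is squashed into the valid range), and the function names are ours.

```python
import math


def controller_param_count(z_dim, h_dim, action_dim):
    """Weights of one linear layer on [z, h], plus one bias per action."""
    return (z_dim + h_dim) * action_dim + action_dim


def controller_action(W, b, z, h):
    """a = W [z h] + b, squashed with tanh (assumed) into [-1, 1]."""
    x = list(z) + list(h)
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + bias)
        for row, bias in zip(W, b)
    ]


print(controller_param_count(32, 256, 3))  # 867 for the car-racing agent

# With all-zero weights the controller emits the neutral action.
z, h = [0.1] * 32, [0.2] * 256
W = [[0.0] * 288 for _ in range(3)]
b = [0.0, 0.0, 0.0]
print(controller_action(W, b, z, h))
```

Everything the reward ever adjusts is the contents of `W` and `b`.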

The controller is kept this small on purpose, because now you can optimize it with a method that needs no gradients at all. The paper uses CMA-ES, an evolution strategy. The idea is refreshingly blunt: keep a population of candidate controllers drawn from a Gaussian over the 867 numbers, run each one in the environment, see which scored well, and shift the Gaussian toward the winners, also reshaping its covariance so it stretches along the directions that have been paying off. Repeat for many generations. It only ever needs the final score of a rollout, so it does not care that the reward is sparse or that env.step is a non-differentiable black box, and it parallelizes trivially across machines. CMA-ES handles a few thousand parameters comfortably, and 867 sits well inside that.
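The sample, score, and shift loop can be sketched with a deliberately simplified evolution strategy. To be clear about what this is not: it is not CMA-ES, because it keeps a single shared step size and never adapts a covariance matrix. It only illustrates the part of the idea that needs nothing but each candidate's final score. The toy score function stands in for a full rollout, and all names and hyperparameters here are invented.

```python
import random


def evolve(score, dim, population=16, elite=4, generations=40, sigma=0.5, seed=0):
    """Simple Gaussian evolution strategy (a stand-in, not CMA-ES)."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    for _ in range(generations):
        # Draw a population of candidate parameter vectors around the mean.
        candidates = [
            [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
            for _ in range(population)
        ]
        # Rank by score only; no gradients are ever computed.
        candidates.sort(key=score, reverse=True)
        winners = candidates[:elite]
        # Shift the search distribution toward the winners.
        mean = [sum(w[i] for w in winners) / elite for i in range(dim)]
        sigma *= 0.95  # slowly narrow the search
    return mean


def toy_score(params):
    """Hypothetical black-box score: higher is better, peak at the target."""
    target = [1.0, -2.0, 0.5]
    return -sum((p - t) ** 2 for p, t in zip(params, target))


best = evolve(toy_score, dim=3)
print("score at start:", toy_score([0.0, 0.0, 0.0]))
print("score after evolution:", toy_score(best))
```

Swapping `toy_score` for a function that runs an episode and returns its total reward gives the shape of the real training loop, with the score treated as a black box throughout.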

Here is the loop all three parts make together. Each step, the environment shows a frame; V codes it to $z$; C reads $z$ and M's state $h$ and emits an action; the action moves the world and also feeds M, which rolls its hidden state forward to be ready for the next step:

Figure 3 · the agent loop
The loop, with capacity bars drawn to scale (car-racing counts). V sees, M predicts, C acts. V holds 4.35M parameters and M holds 422K; C's 867 are a sliver you can barely see. Almost the entire agent is the world model.
def rollout(controller):              # one episode in the real world
    obs = env.reset()
    h = rnn.initial_state()           # M's hidden state: the running memory
    total, done = 0, False
    while not done:
        z = vae.encode(obs)           # V:  64x64x3 frame  ->  z
        a = controller.action([z, h]) # C:  a = W_c [z, h] + b_c
        obs, reward, done = env.step(a)
        h = rnn.forward([a, z, h])    # M:  roll the memory one step forward
        total += reward
    return total

Now watch what feeding C that hidden state $h$ buys. On the car-racing task, a controller that sees only the current code $z$ drives, but badly: it wobbles and clips the sharp corners, scoring $632 \pm 251$. Give the very same linear controller M's hidden state $h$ as well, and the score jumps to $906 \pm 21$. Nothing changed except that C can now see what M expects to happen next. The driving steadies because the agent is no longer reacting to the current frame alone; it is acting on a prediction, the same trick as the batter. The information about the future that M packed into $h$ was the thing standing between wobbly and smooth.

That $906 \pm 21$ is worth pausing on. CarRacing-v0 counts as solved at an average of 900 over 100 consecutive trials, a bar no published method had cleared. Published A3C agents scored in the 591 to 652 range, and the best entry on the leaderboard managed $838 \pm 11$. A linear controller reading a learned world model was the first to actually solve it.

Training inside a dream

Everything so far used the real environment to generate the agent's experience. But look again at what M is. It takes the current latent and an action and produces the next latent. That is precisely the signature of an environment: state plus action goes in, next state comes out. M is not just predicting the world; M is a world, a self-contained one that lives entirely in latent space.

So unplug reality. Instead of asking the real game for the next frame, ask M for the next latent: sample $z_{t+1}$ from its mixture, feed that back in as if it were a real observation, and let the loop run. The controller cannot tell the difference, because it only ever saw latents to begin with. For the Doom agent the paper also trains M to predict whether the agent dies, so the dreamed environment can even end episodes. Nothing real is in the loop. The agent is dreaming, and it can act inside the dream:

Figure 4 · the dream loop
With the real environment cut out, M generates the next latent itself and the agent lives inside the hallucination: sample $\hat z$, act, feed the action back to M, repeat. Here the dreamed rollout is a corridor of fireballs the agent learns to dodge. The pixels are decoded only so we can watch; the agent trains purely on latents.
def dream(controller, tau):           # one episode inside M's hallucination
    z = sample_initial_latent()       # start from a latent, never a pixel
    h = rnn.initial_state()
    total, done = 0, False
    while not done:
        a = controller.action([z, h]) # same C, it cannot tell it is dreaming
        z_next, done = rnn.dream(a, z, h, tau)  # M samples the next z (and death)
        h = rnn.forward([a, z, h])    # roll memory forward with the z that C acted on
        z = z_next                    # the sampled latent becomes the next observation
        total += 1                    # Take Cover: +1 for each step survived
    return total

The two code blocks are the same loop with the source of the next observation swapped. In the real rollout, the next observation comes from env.step. In the dream, it comes from rnn.dream. Train the controller against the second, then deploy it against the first. On the Doom "Take Cover" task, where the agent has to survive a barrage of fireballs and solving means staying alive past 750 time steps on average, an agent trained entirely in the dream scored around 900 in the dream and then, dropped into the real game, survived for $1092 \pm 556$ steps. It learned to dodge fireballs it had only ever met in its imagination.
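To make the dream loop something you can actually run, here is a toy version with a hand-written stand-in for M. Everything about `ToyM` is invented for illustration: the latent is a single number, the dynamics are a simple drift plus noise rather than a trained MDN-RNN, and "death" is defined as the latent wandering past 1. Only the shape of the loop mirrors the text.

```python
import random


class ToyM:
    """Hypothetical stand-in for a trained world model over a 1-D latent."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def initial_state(self):
        return 0.0

    def forward(self, a, z, h):
        # Running summary of past latents (a crude substitute for an LSTM state).
        return 0.9 * h + 0.1 * z

    def dream(self, a, z, h, tau):
        # Sample the next latent; noise width grows with sqrt(tau).
        z_next = z + a + self.rng.gauss(0.0, 0.1 * tau ** 0.5)
        done = abs(z_next) > 1.0  # toy "death" signal
        return z_next, done


def dream_rollout(policy, m, tau, max_steps=200):
    """One episode inside the model: no environment is ever called."""
    z, h = 0.0, m.initial_state()
    total = 0
    for _ in range(max_steps):
        a = policy(z, h)
        z_next, done = m.dream(a, z, h, tau)
        h = m.forward(a, z, h)
        z = z_next
        total += 1  # +1 for each step survived
        if done:
            break
    return total


centering = lambda z, h: -0.5 * z  # steers the latent back toward zero
drifting = lambda z, h: 0.2        # ignores the latent and drifts away

print("centering policy survives:", dream_rollout(centering, ToyM(seed=0), tau=1.0))
print("drifting policy survives:", dream_rollout(drifting, ToyM(seed=0), tau=1.0))
```

A policy that reads the latent and corrects for it survives far longer than one that ignores it, and both are scored without touching anything outside `ToyM`.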

This is not a small curiosity. A learned dream environment runs without a game engine, without rendering pixels, without physics you have to compute. It is just a recurrent network unrolling in latent space, so you can run it on a GPU as fast as the network goes, spin up as many parallel dreams as you like, and never pay the cost of the real simulator while the controller practices. The catch is that the dream is only as honest as M, and that is where the story gets interesting.

Cheating the dream, and why noise fixes it

M is a model, which means M is wrong in places. And a controller trained inside M is under no obligation to be fair about it. It will happily find a policy that scores beautifully against M's flaws and means nothing in reality, the way a speedrunner finds a glitch that walks through a wall the designer never meant to be walkable.

The paper watched exactly this happen. In early Doom dreams the controller discovered an adversarial policy: it learned to move in a way that caused the dreamed monsters to never fire. Even when a fireball started to form, the agent could make it fizzle out, as if it had found a cheat code in its own imagination. Inside the dream the agent was untouchable. Back in the real game, where monsters do not honor the exploit, the same policy was useless.

The fix is the temperature dial from the section on M. Crank $\tau$ up and you inject extra uncertainty into M's predictions: fireballs appear less predictably, the dream behaves more erratically, and the precise sequence of events the exploit depended on stops being reliable. You cannot game a world that keeps surprising you. A noisier dream is a harder dream, and a harder dream is an honest one.

But uncertainty cuts both ways, and the paper's temperature table lays the tradeoff out plainly. Step the dial through the measured settings and compare two numbers at each: the score the controller gets in the dream, and the score the very same controller gets back in the real game:

Figure 5 · the exploitation tradeoff
Take Cover scores against temperature $\tau$ (the paper's numbers). At low $\tau$ the dream looks solved (~2086) while the real game score collapses to 193, below a random policy. Add uncertainty and the curves cross: real transfer peaks at $\tau = 1.15$ with 1092, then a harder dream starts to hurt again.

Read the two ends against each other. At $\tau = 0.1$ the dream is nearly deterministic, the monsters mode-collapse into never shooting, and the agent racks up a near-perfect 2086 by exploiting a world with no danger in it. Deployed for real, that same agent scores 193, worse than flailing at random (a random policy survives for 210). The dream score was a fantasy in the literal sense. At $\tau = 1.15$ the dream is honestly hard, the controller can no longer cheat, its dream score is a modest 918, and that is the version that transfers best, surviving 1092 steps in the real game. Push $\tau$ to 1.30 and the dream gets so chaotic there is little left to learn, and both scores sag. A dream you can game teaches the wrong lesson, and a dose of noise keeps the agent from studying for the wrong test. Too little noise and it learns exploits, too much and there is nothing left to learn, so temperature is a knob you tune.
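The gap between dream score and real score is the quantity to watch, and it is simple arithmetic. This sketch includes only the two temperature settings whose scores are quoted in the text above, plus the random-policy baseline; the remaining rows of the paper's table are deliberately left out rather than guessed.

```python
# Mean scores quoted in the text above (Take Cover, time steps survived).
RESULTS = {
    0.10: {"dream": 2086, "real": 193},
    1.15: {"dream": 918, "real": 1092},
}
RANDOM_POLICY = 210


def transfer_gap(row):
    """How much the dream overstates real performance (positive = overstated)."""
    return row["dream"] - row["real"]


best_tau = max(RESULTS, key=lambda tau: RESULTS[tau]["real"])

for tau, row in sorted(RESULTS.items()):
    beats_random = row["real"] > RANDOM_POLICY
    print(f"tau={tau}: gap={transfer_gap(row)}, beats random policy: {beats_random}")
print("best real-world transfer among these settings: tau =", best_tau)
```

A large positive gap is the signature of an exploited dream. At the setting that transfers best, the gap is negative: the dream was harder than reality.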

There is a quiet reason the mixture density network was the right choice for M, and it shows up right here. A plain deterministic predictor would have one future, the easiest possible world to exploit. Because M outputs a real distribution with separate modes, you can dial its uncertainty up with temperature and you can represent genuinely branching events like a fireball that may or may not appear. The multimodality is not decoration; it is what lets you make the dream both realistic and tamper-resistant.

What it actually does, and where it breaks

Pull back and the system is almost suspiciously simple. Compress each frame with an unsupervised autoencoder. Predict the next compressed frame with a recurrent mixture model. Bolt a linear controller onto those two and evolve its few hundred weights. Two of the three parts never see a reward, and the one that does is small enough for a gradient-free search to handle. From those pieces you get the first agent to solve CarRacing-v0, and an agent that learns to play Doom inside a hallucination and carries the skill back to the real game.

The paper also sketches where this goes for harder problems, and it is careful to mark these as proposals rather than finished results. For a world too rich to learn from random play, it suggests an iterative loop: act in the real environment, collect what you saw, improve M on the new data, train C inside the improved M, and repeat. To push the agent toward the parts of the world M understands worst, it suggests rewarding the agent for surprising M, by flipping the sign of M's own prediction loss so that high prediction error becomes something to seek out. That is a rough cousin of Schmidhuber's long line of work on artificial curiosity, though seeking raw prediction error is not quite the same as seeking learning progress, and the paper is upfront that one round was all its simple tasks needed. Whether the loop scales is left open.
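The iterative proposal is mostly control flow, so it can be outlined with placeholder callables. This is a schematic of the loop as described above, not the paper's code: the three functions passed in are hypothetical, and the toy versions below learn nothing. They exist only so the outline runs.

```python
def iterative_training(collect, fit_world_model, fit_controller, rounds):
    """Alternate real-world data collection with training inside the model."""
    data, model, controller = [], None, None  # None: no controller yet, act randomly
    for _ in range(rounds):
        data.extend(collect(controller))    # act in the real environment
        model = fit_world_model(data)       # improve the world model on all data so far
        controller = fit_controller(model)  # train C inside the improved model
    return model, controller, data


# Toy stand-ins (hypothetical) so the control flow can be executed.
def toy_collect(controller):
    return ["episode"]


def toy_fit_world_model(data):
    return {"episodes_seen": len(data)}


def toy_fit_controller(model):
    return {"trained_on": model["episodes_seen"]}


model, controller, data = iterative_training(
    toy_collect, toy_fit_world_model, toy_fit_controller, rounds=3
)
print(model, controller)
```

With `rounds=1` this reduces to the single pass the paper's experiments used: collect with a random policy, fit the world model, then train the controller.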

The limits are stated plainly, and they follow from the design. Because V is trained with no notion of the task, it spends its bottleneck on whatever looks prominent, not on whatever matters, so a feature the policy needs can get compressed away while a useless wall texture is preserved in detail. Because M is a finite recurrent network, it can only hold so much of a world, and over long iterative training it is prone to catastrophic forgetting, losing old skills as it picks up new ones. And the dream is always a model: useful exactly to the degree that M is faithful, exploitable exactly where M is not, which is why the temperature knob had to exist at all.

What survives all of that is the shape of the idea, and it has aged well. Learn a compact, predictive model of your world from cheap unsupervised data. Put nearly all of your capacity there, where a dense objective makes learning easy, and keep the reward-driven part small. Then use the model not just to inform decisions but to generate the experience you learn from, so that practice costs a forward pass instead of a trip through reality. An agent that can predict its world can rehearse inside it. That is the sentence the paper makes concrete, and it is why the line of work it belongs to keeps coming back.

Provenance: verified against primary literature
VAE (Kingma & Welling, 2013). The V model: a Gaussian-latent autoencoder, reconstruction plus a KL pull toward the prior.
MDN (Bishop, 1994) / Graves (2013). The M model: a mixture-density RNN; temperature reshapes the mixture when sampling.
CMA-ES (Hansen). Gradient-free evolution of the 867-parameter linear controller.
LSTM (Hochreiter & Schmidhuber, 1997). The recurrent memory whose hidden state h carries the predicted future.
Correction. The temperature τ follows the SketchRNN convention the paper adopts: it scales each Gaussian’s standard deviation by √τ (variance by τ) and sharpens the mixture weights toward the likeliest mode. Many summaries mention only the variance, and √τ (not τ) on the standard deviation is the easy thing to get wrong. At τ = 0.1 this collapse is what makes the dreamed monsters stop firing.

Questions you might still have

Why train the controller with evolution instead of backpropagation?
Reward is sparse, delayed, and flows through a non-differentiable environment, so backprop has no clean path to follow. By keeping the controller tiny (867 parameters), a gradient-free optimizer like CMA-ES can search it directly using only each rollout’s final score. The heavy, differentiable learning is left to V and M, which predict observations and never touch the reward.

How can an agent learn to dodge fireballs it never really saw?
M is trained to predict the next latent, so it doubles as an environment: state and action in, next state out. Sampling from M generates a rollout entirely in latent space, fireballs included. The controller trains against those dreamed rollouts and transfers back, because at deployment it reads the same latents, now produced by the real game instead of M.

If the dream is just a model, why does a noisier dream transfer better?
A near-deterministic dream has exploitable flaws, and the controller will happily learn a policy that games them (making dream-monsters never fire) and means nothing in reality. Raising the temperature injects uncertainty that breaks those exploits, so the agent has to learn a policy that works under genuine unpredictability. Too much noise, though, and the dream becomes too chaotic to learn from: best transfer is at τ = 1.15.

Why a VAE for V, and not a plain autoencoder?
The KL term caps how much information each code can carry and keeps the codes packed into a smooth Gaussian blob with no empty pockets. That tidiness pays off when M starts inventing latents that V never produced from a real frame: they still land somewhere the decoder can interpret, instead of in undefined space.

Footnotes & further reading

  1. The paper: Ha & Schmidhuber, World Models (Google Brain / NNAISENSE, NeurIPS 2018). The authors' interactive version, with playable demos, lives at worldmodels.github.io.
  2. The V model is a variational autoencoder: Kingma & Welling, Auto-Encoding Variational Bayes (we have a separate explainer). The reconstruction-plus-KL loss is minimized; the KL pulls the codes toward the standard-normal prior $\mathcal{N}(\mathbf{0},\mathbf{I})$.
  3. The M model is a mixture density network on top of an LSTM: Bishop, Mixture Density Networks (1994); Graves, Generating Sequences With RNNs (2013); and the temperature trick as used in Ha & Eck, A Neural Representation of Sketch Drawings (SketchRNN). The LSTM itself is Hochreiter & Schmidhuber (1997).
  4. The controller is evolved with CMA-ES: Hansen, The CMA Evolution Strategy: A Tutorial. Population 64, each candidate averaged over 16 rollouts; car racing was solved after about 1800 generations.
  5. The lineage of learning and planning with recurrent world models runs back through Schmidhuber, On Learning to Think (2015) and his 1990–1991 work on RNN controller-model systems, which the paper builds on directly.
  6. Numbers quoted here are from the paper: car-racing scores $632 \pm 251$ (V only), $906 \pm 21$ (full model); the Doom temperature table ($\tau$ from 0.10 to 1.30); and parameter counts 4,348,547 (V), 422,368 (M), 867 (car-racing C). The Doom controller is slightly larger at 1,088 because it is also fed the LSTM's cell state.