World Models
An agent that learns to act inside its own dream.
Split an agent into a large model that learns how its world looks and moves, and a tiny controller that decides what to do. The world model can be trained without any reward, and once it can predict the future, the controller can be trained entirely inside a hallucination and then dropped back into reality.
What if an agent could practice a task thousands of times without ever touching the real environment, by running the practice runs inside a model it dreamed up itself?
A batter facing a 100 mph fastball has less time to react than it takes for the visual signal to travel from the eye to the conscious brain. They swing anyway, and they connect, because they are not reacting to the ball. They are reacting to a prediction of where the ball will be. We all do this constantly. The brain carries a compressed running model of the world and acts on what that model expects to happen next, not on the raw flood of the senses, which arrives a beat too late to be useful.
World Models, by David Ha and Jürgen Schmidhuber, asks what an artificial agent built on the same principle would look like, and then builds the smallest version that works. The agent has three parts that mirror the cognitive story: something that sees and compresses each frame, something that remembers and predicts what comes next, and something that decides. The first two together are the "world model." The third is a controller so small you could fit its parameters on an index card.
That division is the whole idea, and it produces a result that sounds unlikely the first time you hear it. Because the world model learns to predict the future, you can unplug the real environment and let the model generate the agent's experience instead. The agent trains inside this dreamed-up environment, never seeing a real pixel, and the policy it learns there transfers back to the actual game. To make all of that land, we need to build the three parts in order, see how they snap together into a loop, and then watch what happens when you cut reality out of that loop.
A big world model and a tiny controller
Start with a problem that has shaped reinforcement learning for years: credit assignment. An agent takes thousands of actions and, much later, gets a single number telling it how well the whole episode went. Which actions deserve the credit? Backpropagation, the workhorse that trains huge networks elsewhere, needs a clean differentiable path from a loss back to every weight. A sparse, delayed, often non-differentiable reward gives it almost nothing to chew on. So in practice people who train policies by reinforcement learning keep the networks small, because a small network is all the weak reward signal can manage to steer.
The paper's move is to refuse the tradeoff. It splits the agent so that the part needing a rich, easy, differentiable objective and the part needing the weak reward signal are different networks, trained in different ways.
- V (Vision) compresses the current frame into a short latent vector z. It is a variational autoencoder, trained by ordinary backpropagation just to reconstruct what it sees. No reward involved.
- M (Memory) predicts the next latent from the current one and the action. It is a recurrent network, also trained by backpropagation, just to predict the future. No reward involved.
- C (Controller) looks at what V and M hand it and picks an action. It is a single linear layer with a few hundred parameters, and it is the only part that ever sees the reward.
V and M are the world model. They hold essentially all of the agent's capacity (for the car-racing agent, about 4.8 million parameters between them) and they learn in a self-supervised way, predicting observations, a task that hands backpropagation a dense signal at every pixel and every time step. C holds 867 parameters and learns from reward alone. Because C is tiny, the reward only has to guide a tiny search problem, which means you can throw a gradient-free optimizer at it and not care that the reward is sparse. Most of the learning happens where learning is cheap; only a sliver happens where it is hard.
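To put numbers on that split, here is a quick tally using the car-racing parameter counts quoted in the footnotes of this explainer. It is plain arithmetic, and the variable names are only labels chosen for illustration.

```python
# Parameter counts for the car-racing agent, as quoted in the footnotes.
V_PARAMS = 4_348_547   # V: the variational autoencoder
M_PARAMS = 422_368     # M: the MDN-RNN
C_PARAMS = 867         # C: the linear controller

world_model = V_PARAMS + M_PARAMS
total = world_model + C_PARAMS
controller_share = C_PARAMS / total

print(f"world model: {world_model:,} parameters")
print(f"controller share of all parameters: {controller_share:.4%}")
```

The controller comes out at roughly two hundredths of one percent of the agent, which is the sense in which only a sliver of the learning is driven by reward.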
Three letters, then, in dependency order: V turns a picture into a few numbers, M turns a history of those numbers into a guess about the next ones, and C turns both into an action. Build them one at a time.
V: compress what you see
The environment hands the agent a 64-by-64 color image at every step. That is 12,288 numbers, almost all of them redundant: neighboring pixels agree, the sky is the sky, the road is the road. The job of V is to throw away the redundancy and keep a compact code of what actually matters in the frame.
V is a variational autoencoder, or VAE (we have a separate explainer on the VAE if you want the full derivation). The short version: an encoder squeezes the image down to a small latent vector (32 numbers for the car-racing agent, 64 for the Doom agent), and a decoder tries to rebuild the original image from just those numbers. Train the pair to make the reconstruction match the input and the bottleneck is forced to spend its handful of numbers on whatever is most worth keeping.
The loss has two terms, and the paper minimizes their sum. The first is reconstruction error, the pixel-by-pixel squared difference between the frame x and its rebuild x̂. The second is a regularizer that pulls the distribution of codes toward a standard Gaussian:

$$\mathcal{L}_V = \lVert x - \hat{x} \rVert^2 + D_{\mathrm{KL}}\big(q(z \mid x) \,\|\, \mathcal{N}(0, I)\big)$$
That second term is what makes it a variational autoencoder rather than a plain one, and it earns its keep here for a specific reason. It caps how much information any single code can carry and it keeps the codes packed into a smooth, well-behaved blob around the origin with no strange empty pockets. That matters because of what happens next: M is going to start inventing latent codes that V never produced from a real image. A tidy latent space means those invented codes still land somewhere V's decoder can make sense of, instead of off in some undefined corner. (The paper leans on the capacity limit as the stated reason for the VAE; the robustness to invented codes is a welcome consequence of the same Gaussian prior.)
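As a rough sketch of the two terms, here is the loss for a single frame in plain Python. It assumes a diagonal-Gaussian encoder that outputs a mean and a log-variance per code dimension, which is the usual VAE setup. The function name `vae_loss`, the toy values, and the equal weighting of the two terms are assumptions made for illustration, not details taken from the paper.

```python
import math

def vae_loss(x, x_hat, mu, log_var):
    """Squared reconstruction error plus KL(q(z|x) || N(0, I)).

    x, x_hat: flat lists of pixel values.
    mu, log_var: per-dimension mean and log-variance of the code.
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    # Closed-form KL between a diagonal Gaussian and the standard normal.
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                    for m, lv in zip(mu, log_var))
    return recon + kl, recon, kl

# Toy 4-pixel "frame" and a 2-number code (hypothetical values).
x = [0.0, 0.5, 1.0, 0.5]
x_hat = [0.1, 0.5, 0.9, 0.5]
loss, recon, kl = vae_loss(x, x_hat, mu=[0.3, -0.2], log_var=[0.0, 0.0])

# A code that sits exactly on the prior pays no KL penalty.
_, _, kl_at_prior = vae_loss(x, x, mu=[0.0, 0.0], log_var=[0.0, 0.0])
```

The KL term grows as a code drifts away from the origin or as its variance departs from one, which is what keeps the codes packed into the blob described above.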
(Interactive figure: a frame of the track on the left is encoded into the 32-number code in the middle, then decoded back on the right, showing what survives the squeeze and what does not.)
The reconstruction is visibly lossy, and that is the point. V is not a photograph, it is a thumbnail: enough to know where the road bends and where the car sits, with the texture sanded off. The whole V network is trained for a single pass over ten thousand random-play episodes and then frozen. From here on, the agent never works with pixels again. It works with z.
One honest caveat the paper raises about itself: because V learns with no idea what the task is, it cannot know which details will turn out to matter. It faithfully reproduces irrelevant wall textures in Doom while fumbling the task-critical tiles on the road in car racing. The reconstruction loss counts every pixel the same and never hears about the task, so V keeps whatever is large and repetitive and easy to rebuild, which is not always what the policy needs. We will come back to this.
M: predict what happens next
V compresses space. M compresses time. Its job is to look at the current code z_t, the action a_t the agent is about to take, and a running summary of everything that came before, and predict the next code z_{t+1}. If V is the eye, M is the part that has watched enough of the world to know roughly what tends to happen next.
The running summary is the key object, so give it a name now. M is a recurrent network, specifically an LSTM, and at each step it carries a hidden state h_t. You can think of h_t as M's compressed belief about the situation: not just what the last frame looked like, but where things seem to be heading. Because M is trained to predict the future, h_t ends up holding exactly the information you would need to predict the future. Hold that thought, because it is what makes the controller's job easy later.
Now the part that takes a little care. M does not predict a single next code. It predicts a probability distribution over next codes. The reason is that real environments branch. From the same situation, a monster might fire or it might not; the road might continue straight or a curve might be about to appear. A network forced to output one number would have to average those futures into a mush that is none of them. So M outputs a distribution with room for several distinct possibilities at once.
The tool for that is a mixture density network, or MDN, an idea from Bishop (1994). Instead of one Gaussian, the network outputs a weighted mix of several (five, in this paper), each with its own mean and width. The whole predicted distribution for the next latent is

$$P(z_{t+1} \mid a_t, z_t, h_t) = \sum_{k=1}^{5} \pi_k \, \mathcal{N}\big(z_{t+1};\, \mu_k, \sigma_k^2\big)$$

where the weights π_k say how likely each branch is and sum to one. Stitch this onto the LSTM and you get the MDN-RNN, the same recipe Ha used to make a network draw sketches one stroke at a time. Each branch of the mixture can stand for a genuinely different thing that might happen next, which is exactly what you need to capture a discrete event like "a fireball appears."
That mixture also comes with a single dial that turns out to matter enormously, so meet it now. A temperature τ controls how much randomness you allow when you sample the next latent from the mixture. Following the sketch-drawing work, lowering τ does two things at once: it shrinks every Gaussian (temperature scales each mode's variance by τ, so its width, the standard deviation, scales by √τ) and it sharpens the choice among branches toward the single most likely one. As τ approaches 0 the model collapses onto its favorite branch with no spread, effectively deterministic. At τ = 1 you get the distribution the network actually learned. Above one, the futures spread out and the rare branches get more airtime.
(Interactive figure: the distribution over one latent coordinate as τ varies. The tall mode is the usual case, where nothing much changes; the small amber mode is the rare branch, where a fireball forms. The rare mode dies as you cool the model down.)
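The effect of the dial is easy to reproduce on a toy mixture. The sketch below works on one scalar coordinate with two branches. The mixture values and the helper names `adjust_weights` and `sample_next` are made up for illustration, and the weights are assumed to be strictly positive. Dividing the log-weights by τ before renormalizing is equivalent to the logit scaling used in the sketch-drawing work.

```python
import math
import random

def adjust_weights(weights, tau):
    """Sharpen (tau < 1) or flatten (tau > 1) the mixture weights."""
    logits = [math.log(w) / tau for w in weights]
    top = max(logits)                      # subtract the max for stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(weights, means, stds, tau, rng):
    """Draw one value from a 1-D Gaussian mixture at temperature tau."""
    branch_probs = adjust_weights(weights, tau)
    k = rng.choices(range(len(weights)), weights=branch_probs)[0]
    # Variance scales by tau, so the standard deviation scales by sqrt(tau).
    return rng.gauss(means[k], stds[k] * math.sqrt(tau))

# Hypothetical mixture: a common branch and a rare one.
weights, means, stds = [0.9, 0.1], [0.0, 3.0], [0.3, 0.3]
rng = random.Random(0)
for tau in (0.1, 1.0, 1.3):
    rare = adjust_weights(weights, tau)[1]
    draw = sample_next(weights, means, stds, tau, rng)
    print(f"tau={tau}: rare-branch weight {rare:.6f}, sample {draw:.2f}")
```

At a low temperature the rare branch's weight falls to practically zero, and above one it rises past its learned value of 0.1.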
M is trained on the same ten thousand random episodes as V, for twenty passes, with one simple objective: make the mixture assign high probability to the latent that actually came next. No reward, no planning, just next-frame prediction in the compressed space V built. After training, M is a generative model of how the world moves. That is a stronger thing to own than it sounds, and the rest of the paper is about cashing it in.
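"Assign high probability to the latent that actually came next" becomes a loss once you take the negative log of the mixture density. Here is a minimal sketch for one scalar coordinate, with made-up mixture values and a hypothetical function name `mixture_nll`. A real implementation would work on the whole latent vector and average over time steps and episodes.

```python
import math

def mixture_nll(z_next, weights, means, stds):
    """Negative log-likelihood of a scalar z_next under a 1-D Gaussian mixture."""
    log_terms = []
    for w, mu, sigma in zip(weights, means, stds):
        log_gauss = (-0.5 * ((z_next - mu) / sigma) ** 2
                     - math.log(sigma) - 0.5 * math.log(2 * math.pi))
        log_terms.append(math.log(w) + log_gauss)
    top = max(log_terms)                   # log-sum-exp for numerical stability
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))

weights, means, stds = [0.9, 0.1], [0.0, 3.0], [0.3, 0.3]
nll_common = mixture_nll(0.0, weights, means, stds)  # lands on the common mode
nll_rare = mixture_nll(3.0, weights, means, stds)    # lands on the rare mode
nll_miss = mixture_nll(1.5, weights, means, stds)    # falls between the modes
```

An outcome on the rare branch costs more than one on the common branch, but far less than an outcome neither branch anticipated. A single-Gaussian predictor cannot express that distinction.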
C: the part that acts
After V and M, the controller is almost an anticlimax, which is the design working as intended. C takes the current code z_t and M's hidden state h_t, glues them into one vector, and runs a single linear layer to produce the action. That is the entire policy, and it is the one numbered equation in the paper:

$$a_t = W_c \, [z_t \; h_t] + b_c$$

No hidden layers, no nonlinearity beyond squashing the output into the valid action range. For the car-racing agent, z is 32 numbers and h is 256, so the concatenation is 288 numbers, and mapping that to three continuous controls (steer, gas, brake) takes 288 × 3 weights plus 3 biases, which is 867 parameters. That is the whole brain that touches the reward.
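To make the parameter count concrete, here is a small sketch of such a controller in plain Python. The class name, the layout of the flat parameter list, and the use of tanh to squash every output are assumptions for illustration; the paper's code may bound each control differently.

```python
import math
import random

Z_DIM, H_DIM, ACTION_DIM = 32, 256, 3   # car-racing sizes from the text

class LinearController:
    """a = tanh(W [z, h] + b): one linear layer, no hidden units."""

    def __init__(self, params):
        n_in = Z_DIM + H_DIM
        assert len(params) == self.num_params()
        # The first n_in * ACTION_DIM entries are weights, the rest are biases.
        self.W = [params[i * n_in:(i + 1) * n_in] for i in range(ACTION_DIM)]
        self.b = params[ACTION_DIM * n_in:]

    @staticmethod
    def num_params():
        return (Z_DIM + H_DIM) * ACTION_DIM + ACTION_DIM

    def action(self, z, h):
        x = list(z) + list(h)            # concatenate [z, h]
        return [math.tanh(sum(w * v for w, v in zip(row, x)) + bias)
                for row, bias in zip(self.W, self.b)]

rng = random.Random(0)
params = [rng.gauss(0.0, 0.1) for _ in range(LinearController.num_params())]
controller = LinearController(params)
a = controller.action([0.0] * Z_DIM, [0.0] * H_DIM)
```

Because the whole policy is one flat list of 867 numbers, a search method can treat it as a single point in parameter space.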
The controller is kept this small on purpose, because now you can optimize it with a method that needs no gradients at all. The paper uses CMA-ES, an evolution strategy. The idea is refreshingly blunt: keep a population of candidate controllers drawn from a Gaussian over the 867 numbers, run each one in the environment, see which scored well, and shift the Gaussian toward the winners, also reshaping its covariance so it stretches along the directions that have been paying off. Repeat for many generations. It only ever needs the final score of a rollout, so it does not care that the reward is sparse or that env.step is a non-differentiable black box, and it parallelizes trivially across machines. CMA-ES handles a few thousand parameters comfortably, and 867 sits well inside that.
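The sample, score, and shift cycle can be sketched in a few lines. Note that this is not CMA-ES: it is a much simpler evolution strategy that refits only a per-axis spread to the top candidates, whereas CMA-ES adapts a full covariance matrix and a step size. The function name, the toy score function, and all the settings are hypothetical.

```python
import random
import statistics

def evolve(score, dim, generations=60, pop_size=64, n_elite=16, seed=0):
    """Maximise score(params) with a simple diagonal-Gaussian evolution strategy."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [1.0] * dim
    for _ in range(generations):
        population = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                      for _ in range(pop_size)]
        # Only each candidate's final score is used; no gradients anywhere.
        population.sort(key=score, reverse=True)
        elite = population[:n_elite]
        # Shift the Gaussian toward the winners and refit its per-axis spread.
        mean = [statistics.fmean(col) for col in zip(*elite)]
        std = [statistics.pstdev(col) + 1e-3 for col in zip(*elite)]
    return mean

# Toy stand-in for a rollout: the score peaks when params equal `target`.
target = [0.5, -1.0, 1.0, 0.0, -0.5]

def toy_score(params):
    return -sum((p - t) ** 2 for p, t in zip(params, target))

best = evolve(toy_score, dim=len(target))
```

Swapping `toy_score` for a function that runs a controller through an episode and returns its total reward gives the shape of the training loop described above, with every candidate in a generation free to run on a different machine.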
Here is the loop all three parts make together. Each step, the environment shows a frame; V codes it to z; C reads z and M's state h and emits an action; the action moves the world and also feeds M, which rolls its hidden state forward to be ready for the next step:
```python
def rollout(controller):                # one episode in the real world
    # env, vae and rnn are assumed to be defined elsewhere
    obs = env.reset()
    h = rnn.initial_state()             # M's hidden state: the running memory
    total, done = 0, False
    while not done:
        z = vae.encode(obs)             # V: 64x64x3 frame -> z
        a = controller.action([z, h])   # C: a = W_c [z, h] + b_c
        obs, reward, done = env.step(a)
        h = rnn.forward([a, z, h])      # M: roll the memory one step forward
        total += reward
    return total
```

Now watch what feeding C that hidden state buys. On the car-racing task, a controller that sees only the current code z drives, but badly: it wobbles and clips the sharp corners, scoring 632 ± 251. Give the very same linear controller M's hidden state h as well, and the score jumps to 906 ± 21. Nothing changed except that C can now see what M expects to happen next. The driving steadies because the agent is no longer reacting to the current frame alone; it is acting on a prediction, the same trick as the batter. The information about the future that M packed into h was the thing standing between wobbly and smooth.
That is worth pausing on. CarRacing-v0 counts as solved at an average of 900 over 100 consecutive trials, a bar no published method had cleared. Prior deep-RL agents scored in the 591 to 652 range, and the best entry on the leaderboard managed 838 ± 11. A linear controller reading a learned world model was the first to actually solve it.
Training inside a dream
Everything so far used the real environment to generate the agent's experience. But look again at what M is. It takes the current latent and an action and produces the next latent. That is precisely the signature of an environment: state plus action goes in, next state comes out. M is not just predicting the world; M is a world, a self-contained one that lives entirely in latent space.
So unplug reality. Instead of asking the real game for the next frame, ask M for the next latent: sample from its mixture, feed that back in as if it were a real observation, and let the loop run. The controller cannot tell the difference, because it only ever saw latents to begin with. For the Doom agent the paper also trains M to predict whether the agent dies, so the dreamed environment can even end episodes. Nothing real is in the loop. The agent is dreaming, and it can act inside the dream:
```python
def dream(controller, tau):             # one episode inside M's hallucination
    # rnn and sample_initial_latent are assumed to be defined elsewhere
    z = sample_initial_latent()         # start from a latent, never a pixel
    h = rnn.initial_state()
    total, done = 0, False
    while not done:
        a = controller.action([z, h])   # same C, it cannot tell it is dreaming
        z_next, done = rnn.dream(a, z, h, tau)  # M samples the next z (and death)
        h = rnn.forward([a, z, h])      # update memory with the z that C acted on
        z = z_next
        total += 1                      # Take Cover: +1 for each step survived
    return total
```

The two code blocks are the same loop with the source of the next observation swapped. In the real rollout, it comes from env.step. In the dream, it comes from rnn.dream. Train the controller against the second, then deploy it against the first. On the Doom "Take Cover" task, where the agent has to survive a barrage of fireballs and solving means staying alive past 750 time steps on average, an agent trained entirely in the dream scored around 900 in the dream and then, dropped into the real game, survived for 1092 ± 556 steps. It learned to dodge fireballs it had only ever met in its imagination.
This is not a small curiosity. A learned dream environment runs without a game engine, without rendering pixels, without physics you have to compute. It is just a recurrent network unrolling in latent space, so you can run it on a GPU as fast as the network goes, spin up as many parallel dreams as you like, and never pay the cost of the real simulator while the controller practices. The catch is that the dream is only as honest as M, and that is where the story gets interesting.
Cheating the dream, and why noise fixes it
M is a model, which means M is wrong in places. And a controller trained inside M is under no obligation to be fair about it. It will happily find a policy that scores beautifully against M's flaws and means nothing in reality, the way a speedrunner finds a glitch that walks through a wall the designer never meant to be walkable.
The paper watched exactly this happen. In early Doom dreams the controller discovered an adversarial policy: it learned to move in a way that caused the dreamed monsters to never fire. Even when a fireball started to form, the agent could make it fizzle out, as if it had found a cheat code in its own imagination. Inside the dream the agent was untouchable. Back in the real game, where monsters do not honor the exploit, the same policy was useless.
The fix is the temperature dial from the memory section. Crank up τ and you inject extra uncertainty into M's predictions: fireballs appear less predictably, the dream behaves more erratically, and the precise sequence of events the exploit depended on stops being reliable. You cannot game a world that keeps surprising you. A noisier dream is a harder dream, and a harder dream is an honest one.
But uncertainty cuts both ways, and the paper's temperature table lays the tradeoff out plainly. For each measured setting of τ it reports two numbers: the score the controller gets in the dream, and the score the very same controller gets back in the real game.
Read the two ends against each other. At τ = 0.10 the dream is nearly deterministic, the monsters mode-collapse into never shooting, and the agent racks up a near-perfect 2086 ± 140 by exploiting a world with no danger in it. Deployed for real, that same agent scores 193 ± 58, worse than flailing at random (a random policy survives for 210 ± 108). The dream score was a fantasy in the literal sense. At τ = 1.15 the dream is honestly hard, the controller can no longer cheat, its dream score is a modest 918 ± 546, and that is the version that transfers best, surviving 1092 ± 556 steps in the real game. Push to τ = 1.30 and the dream gets so chaotic there is little left to learn, and both scores sag. A dream you can game teaches the wrong lesson, and a dose of noise keeps the agent from studying for the wrong test. Too little noise and it learns exploits, too much and there is nothing left to learn, so temperature is a knob you tune.
There is a quiet reason the mixture density network was the right choice for M, and it shows up right here. A plain deterministic predictor would have one future, the easiest possible world to exploit. Because M outputs a real distribution with separate modes, you can dial its uncertainty up with temperature and you can represent genuinely branching events like a fireball that may or may not appear. The multimodality is not decoration; it is what lets you make the dream both realistic and tamper-resistant.
What it actually does, and where it breaks
Pull back and the system is almost suspiciously simple. Compress each frame with an unsupervised autoencoder. Predict the next compressed frame with a recurrent mixture model. Bolt a linear controller onto those two and evolve its few hundred weights. Two of the three parts never see a reward, and the one that does is small enough for a gradient-free search to handle. From those pieces you get the first agent to solve CarRacing-v0, and an agent that learns to play Doom inside a hallucination and carries the skill back to the real game.
The paper also sketches where this goes for harder problems, and it is careful to mark these as proposals rather than finished results. For a world too rich to learn from random play, it suggests an iterative loop: act in the real environment, collect what you saw, improve M on the new data, train C inside the improved M, and repeat. To push the agent toward the parts of the world M understands worst, it suggests rewarding the agent for surprising M, by flipping the sign of M's own prediction loss so that high prediction error becomes something to seek out. That is a rough cousin of Schmidhuber's long line of work on artificial curiosity, though seeking raw prediction error is not quite the same as seeking learning progress, and the paper is upfront that one round was all its simple tasks needed. Whether the loop scales is left open.
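The proposed loop is simple enough to write down as control flow. The sketch below takes its three stages as callables, and everything about their signatures is an assumption made for illustration: the paper describes the procedure in prose and does not fix an interface. The stub functions exist only to show the order of calls.

```python
def iterative_training(collect, train_model, train_controller, n_rounds):
    """Alternate between real rollouts, refitting M, and training C inside M."""
    dataset, model, controller = [], None, None
    for _ in range(n_rounds):
        # Act in the real environment (controller is None on the first round,
        # standing in for random play) and keep what was seen.
        dataset.extend(collect(controller))
        model = train_model(model, dataset)               # improve M on the data
        controller = train_controller(controller, model)  # train C inside M
    return model, controller

def curiosity_reward(prediction_loss):
    """Flip the sign of M's objective: M minimises this loss, the agent seeks it."""
    return prediction_loss

# Stub stages (hypothetical) that only record the order of calls.
log = []

def collect(controller):
    log.append("collect")
    return [("observation", "action")]

def train_model(model, dataset):
    log.append("train M")
    return {"episodes_seen": len(dataset)}

def train_controller(controller, model):
    log.append("train C")
    return {"trained_inside": model["episodes_seen"]}

model, controller = iterative_training(collect, train_model, train_controller, n_rounds=3)
```

With a single round this reduces to the recipe used for both tasks in the paper: collect once, train the world model, then train the controller.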
The limits are stated plainly, and they follow from the design. Because V is trained with no notion of the task, it spends its bottleneck on whatever looks prominent, not on whatever matters, so a feature the policy needs can get compressed away while a useless wall texture is preserved in detail. Because M is a finite recurrent network, it can only hold so much of a world, and over long iterative training it is prone to catastrophic forgetting, losing old skills as it picks up new ones. And the dream is always a model: useful exactly to the degree that M is faithful, exploitable exactly where M is not, which is why the temperature knob had to exist at all.
What survives all of that is the shape of the idea, and it has aged well. Learn a compact, predictive model of your world from cheap unsupervised data. Put nearly all of your capacity there, where a dense objective makes learning easy, and keep the reward-driven part small. Then use the model not just to inform decisions but to generate the experience you learn from, so that practice costs a forward pass instead of a trip through reality. An agent that can predict its world can rehearse inside it. That is the sentence the paper makes concrete, and it is why the line of work it belongs to keeps coming back.
Questions you might still have
Why train the controller with evolution instead of backpropagation?
Reward is sparse, delayed, and flows through a non-differentiable environment, so backprop has no clean path to follow. By keeping the controller tiny (867 parameters), a gradient-free optimizer like CMA-ES can search it directly using only each rollout’s final score. The heavy, differentiable learning is left to V and M, which predict observations and never touch the reward.
How can an agent learn to dodge fireballs it never really saw?
M is trained to predict the next latent, so it doubles as an environment: state and action in, next state out. Sampling from M generates a rollout entirely in latent space, fireballs included. The controller trains against those dreamed rollouts and transfers back, because at deployment it reads the same latents, now produced by the real game instead of M.
If the dream is just a model, why does a noisier dream transfer better?
A near-deterministic dream has exploitable flaws, and the controller will happily learn a policy that games them (making dream-monsters never fire) and means nothing in reality. Raising the temperature injects uncertainty that breaks those exploits, so the agent has to learn a policy that works under genuine unpredictability. Too much noise, though, and the dream becomes too chaotic to learn from: best transfer is at τ = 1.15.
Why a VAE for V, and not a plain autoencoder?
The KL term caps how much information each code can carry and keeps the codes packed into a smooth Gaussian blob with no empty pockets. That tidiness pays off when M starts inventing latents that V never produced from a real frame: they still land somewhere the decoder can interpret, instead of in undefined space.
Footnotes & further reading
- The paper: Ha & Schmidhuber, World Models (Google Brain / NNAISENSE, NeurIPS 2018). The authors' interactive version, with playable demos, lives at worldmodels.github.io.
- The V model is a variational autoencoder: Kingma & Welling, Auto-Encoding Variational Bayes (we have a separate explainer). The reconstruction-plus-KL loss is minimized; the KL pulls the codes toward the standard-normal prior N(0, I).
- The M model is a mixture density network on top of an LSTM: Bishop, Mixture Density Networks (1994); Graves, Generating Sequences With RNNs (2013); and the temperature trick as used in Ha & Eck, A Neural Representation of Sketch Drawings (SketchRNN). The LSTM itself is Hochreiter & Schmidhuber (1997).
- The controller is evolved with CMA-ES: Hansen, The CMA Evolution Strategy: A Tutorial. Population 64, each candidate averaged over 16 rollouts; car racing was solved after about 1800 generations.
- The lineage of learning and planning with recurrent world models runs back through Schmidhuber, On Learning to Think (2015) and his 1990–1991 work on RNN controller-model systems, which the paper builds on directly.
- Numbers quoted here are from the paper: car-racing scores 632 ± 251 (V only), 906 ± 21 (full model); the Doom temperature table (τ from 0.10 to 1.30); and parameter counts 4,348,547 (V), 422,368 (M), 867 (car-racing C). The Doom controller is slightly larger at 1,088 because it is also fed the LSTM's cell state.