
World Models

An agent that learns to act inside its own dream.

Split an agent into a large model that learns how its world looks and moves, and a tiny controller that decides what to do. The world model can be trained without any reward, and once it can predict the future, the controller can be trained entirely inside a hallucination and then dropped back into reality.

Explaining the paper: World Models · Ha, Schmidhuber · Google Brain · NNAISENSE · NeurIPS 2018 · arXiv:1803.10122

An agent can practice a task thousands of times without ever touching the real environment, by rehearsing inside a model it builds itself.

A batter facing a 100 mph fastball has less time to react than it takes for the visual signal to travel from the eye to the conscious brain. They swing anyway, and they connect, because they are not reacting to the ball. They are reacting to a prediction of where the ball will be. We all do this constantly. The brain carries a compressed running model of the world and acts on what that model expects to happen next, not on the raw flood of the senses, which arrives a beat too late to be useful.

World Models, by David Ha and Jürgen Schmidhuber, asks what an artificial agent built on the same principle would look like, and then builds the smallest version that works. The agent has three parts that mirror the cognitive story: something that sees and compresses each frame, something that remembers and predicts what comes next, and something that decides. The first two together are the "world model." The third is a controller so small you could fit its parameters on an index card.

That division drives the rest of the design, and it produces a result that sounds unlikely the first time you hear it. Because the world model learns to predict the future, you can unplug the real environment and let the model generate the agent's experience instead. The agent trains inside this dreamed-up environment, never seeing a real pixel, and the policy it learns there transfers back to the actual game. To make all of that land, we need to build the three parts in order, see how they snap together into a loop, and then watch what happens when you cut reality out of that loop.

A big world model and a tiny controller

Start with a problem that has shaped reinforcement learning for years: credit assignment. An agent takes thousands of actions and, much later, gets a single number telling it how well the whole episode went. Which actions deserve the credit? Backpropagation, the workhorse that trains huge networks elsewhere, needs a clean differentiable path from a loss back to every weight. A sparse, delayed, often non-differentiable reward gives it almost nothing to chew on. So in practice people who train policies by reinforcement learning keep the networks small, because a small network is all the weak reward signal can manage to steer.

The paper refuses that tradeoff. It splits the agent so that the part needing a rich, easy, differentiable objective and the part needing the weak reward signal are different networks, trained in different ways.

V (Vision) compresses the current frame into a short latent vector $z$ . It is a variational autoencoder, trained by ordinary backpropagation just to reconstruct what it sees. No reward involved.
M (Memory) predicts the next latent $z$ from the current one and the action. It is a recurrent network, also trained by backpropagation, just to predict the future. No reward involved.
C (Controller) looks at what V and M hand it and picks an action. It is a single linear layer with a few hundred parameters, and it is the only part that ever sees the reward.

V and M are the world model. They hold essentially all of the agent's capacity (for the car-racing agent, about 4.8 million parameters between them) and they learn in a self-supervised way, predicting observations, a task that hands backpropagation a dense signal at every pixel and every time step. C holds 867 parameters and learns from reward alone. Everything turns on keeping C tiny. A gradient-free search keeps a population of candidate parameter vectors and breeds the ones that score best, which is comfortable across 867 numbers and hopeless across a million. Shrinking the reward-driven part is therefore what makes evolution a practical optimizer, and because the search needs only each candidate's final score, the sparse reward stops being a problem.
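To make the lopsided split concrete, here is a small arithmetic check using the car-racing parameter counts the paper reports (4,348,547 for V, 422,368 for M, 867 for C). The function name is ours, chosen for illustration.

```python
# Parameter counts the paper reports for the car-racing agent.
PARAMS = {"V": 4_348_547, "M": 422_368, "C": 867}

def capacity_shares(params):
    """Return each component's fraction of the agent's total parameters."""
    total = sum(params.values())
    return {name: count / total for name, count in params.items()}

shares = capacity_shares(PARAMS)
world_model_params = PARAMS["V"] + PARAMS["M"]

print(f"world model (V + M): {world_model_params:,} parameters")
for name, share in shares.items():
    print(f"{name}: {share:.4%} of the agent")
```

The controller comes out at roughly two hundredths of one percent of the agent, which is the sense in which the reward only ever steers a sliver.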

Three letters, then, in dependency order: V turns a picture into a few numbers, M turns a history of those numbers into a guess about the next ones, and C turns both into an action. Build them one at a time.

V: compress what you see

The environment hands the agent a 64-by-64 color image at every step. That is 12,288 numbers, almost all of them redundant: neighboring pixels agree, the sky is the sky, the road is the road. The job of V is to throw away the redundancy and keep a compact code of what actually matters in the frame.

V is a variational autoencoder, or VAE (we have a separate explainer on the VAE if you want the full derivation). The short version: an encoder squeezes the image down to a small latent vector $z$ (32 numbers for the car-racing agent, 64 for the Doom agent), and a decoder tries to rebuild the original image from just those numbers. Train the pair to make the reconstruction match the input and the bottleneck is forced to spend its handful of numbers on whatever is most worth keeping.

The loss has two terms, and the paper minimizes their sum. The first is reconstruction error, the pixel-by-pixel squared difference between the frame and its rebuild. The second is a regularizer that pulls the distribution of codes toward a standard Gaussian:

$$\mathcal{L}_V = \underbrace{\lVert \mathbf{x} - \hat{\mathbf{x}} \rVert_2^2}_{\text{reconstruction}} \; + \; \underbrace{D_{\mathrm{KL}}\!\big(q(z \mid \mathbf{x}) \,\Vert\, \mathcal{N}(\mathbf{0}, \mathbf{I})\big)}_{\text{keep the code space tidy}}$$

That second term is what makes it a variational autoencoder rather than a plain one, and it does two things here. It caps how much information any single code can carry and it keeps the codes packed into a smooth, well-behaved blob around the origin with no strange empty pockets. The cap stops the encoder from burning all 32 of its numbers on pixel trivia, so the budget goes to structure that the task might actually use. That matters because of what happens next: M is going to start inventing latent codes that V never produced from a real image. A tidy latent space means those invented codes still land somewhere V's decoder can make sense of, instead of off in some undefined corner. (The paper leans on the capacity limit as the stated reason for the VAE; the robustness to invented codes is a welcome consequence of the same Gaussian prior.)
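The two terms are easy to compute by hand for a toy case. The sketch below assumes the encoder outputs a diagonal Gaussian, described by a mean `mu` and a log-variance `log_var` per latent dimension; that is the usual VAE setup, and under it the KL term has a closed form. The function and variable names are ours, and the three-pixel "frame" is purely illustrative.

```python
import math

def vae_loss(x, x_hat, mu, log_var):
    """Squared reconstruction error plus KL(q(z|x) || N(0, I)).

    Assumes q(z|x) is a diagonal Gaussian with mean `mu` and
    log-variance `log_var`, for which the KL term has a closed form.
    """
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    kl = 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))
    return reconstruction, kl, reconstruction + kl

# Illustrative values: a 3-pixel "frame" and a 2-number code.
x = [0.2, 0.8, 0.5]
x_hat = [0.25, 0.7, 0.5]

recon, kl, total = vae_loss(x, x_hat, mu=[0.0, 0.0], log_var=[0.0, 0.0])
print(recon, kl, total)   # the KL term is zero when the code matches the prior

_, kl_shifted, _ = vae_loss(x, x_hat, mu=[1.0, 0.0], log_var=[0.0, 0.0])
print(kl_shifted)         # moving a code away from the origin costs KL
```

A code sitting exactly on the prior pays no KL penalty. Every bit of information the encoder adds by moving the code or narrowing its variance is charged for, which is the capacity cap described above.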

Drag across the figure to bend the track. The frame on the left is encoded into the 32-number code in the middle, then decoded back on the right. Watch what survives the squeeze and what does not:

Figure 1 · compress and rebuild

A frame is encoded into a 32-number latent z, then decoded back. The road and the car survive the bottleneck; the lane dashes and fine texture do not. V keeps the gist of the scene and spends none of its budget on detail you would not miss.

The reconstruction is visibly lossy, and that is the point. V is not a photograph, it is a thumbnail: enough to know where the road bends and where the car sits, with the texture sanded off. The whole V network is trained for a single pass over ten thousand random-play episodes and then frozen. From here on, the agent never works with pixels again. It works with $z$ .

The paper raises a caveat about itself: because V learns with no idea what the task is, it cannot know which details will turn out to matter. It faithfully reproduces irrelevant wall textures in Doom while fumbling the task-critical tiles on the road in car racing. The reconstruction loss counts every pixel the same and carries no information about the task, so V keeps whatever is large and repetitive and easy to rebuild, which is not always what the policy needs. We will come back to this.

M: predict what happens next

V compresses space. M compresses time. Its job is to look at the current code $z_t$ , the action $a_t$ the agent is about to take, and a running summary of everything that came before, and predict the next code $z_{t+1}$ . If V is the eye, M is the part trained on enough of the world to predict roughly what tends to happen next.

The running summary is the key object, so give it a name now. M is a recurrent network, specifically an LSTM (long short-term memory, a recurrent cell built to retain information across many steps), and at each step it carries a hidden state $h_t$ . You can think of $h_t$ as M's compressed belief about the situation: not just what the last frame looked like, but where things seem to be heading. Because M is trained to predict the future, $h_t$ ends up holding exactly the information you would need to predict the future. Hold that thought, because it is what makes the controller's job easy later.

The next point takes a little care. M does not predict a single next code. It predicts a probability distribution over next codes, because real environments branch. From the same situation, a monster might fire or it might not; the road might continue straight or a curve might be about to appear. A network forced to output one number would have to average those futures into a mush that is none of them. So M outputs a distribution with room for several distinct possibilities at once. The consequence lands on the controller: futures really do branch, a single predicted mean would be a physically impossible average frame, and a controller trained to act against that average would be blindsided by the branches that actually happen. Keeping the branches distinct, as the figure below draws them, is what lets the controller prepare for each one.

A Doom monster that fires on roughly half of all frames makes the "impossible average" concrete. The true next frame is one of two things: a fireball has appeared, or it has not. A predictor forced to emit one frame minimizes its error by splitting the difference, so it outputs a faint half-fireball, a smeared blob at 50% opacity that the game can never actually render. No real moment of the game looks like that, so the controller has trained against a situation that does not exist. When a real frame arrives it is always one extreme or the other, full fireball or none, and the policy tuned on the ghostly in-between has prepared for neither: it neither dodges hard nor relaxes, because the average frame sits halfway between dodging and relaxing. A mixture avoids the blur. It keeps the two outcomes as two separate modes with weights near 0.5 each, so a sampled dream draws a coherent frame, either a sharp fireball or a clean miss, and over many dreamed rollouts the controller faces full fireballs about half the time and learns to dodge the thing it will actually meet, not the smear that averaging would have handed it.
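The "impossible average" can be reproduced with one number standing in for a frame. In this toy, which is our own construction and not the paper's setup, 1.0 means a fireball appeared and 0.0 means it did not, each about half the time. The single prediction that minimizes squared error is the mean, a value that never occurs, while sampling from a two-mode distribution only ever returns real outcomes.

```python
import random

random.seed(0)

# Toy stand-in for one pixel of the next frame: 1.0 = fireball, 0.0 = none.
outcomes = [1.0 if random.random() < 0.5 else 0.0 for _ in range(10_000)]

def mse(prediction, targets):
    """Mean squared error of one fixed prediction against all targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)

# The single prediction that minimizes squared error is the mean.
mean_prediction = sum(outcomes) / len(outcomes)

# A two-mode "mixture" here just keeps both outcomes with their weights.
weights = {1.0: mean_prediction, 0.0: 1.0 - mean_prediction}
dream_samples = random.choices(list(weights), weights=list(weights.values()), k=10)

print(f"mean prediction: {mean_prediction:.2f}")
print(f"dream samples:   {dream_samples}")
```

The mean beats both real outcomes on squared error and yet matches none of the ten thousand frames. That is the trap a single-output predictor falls into.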

Figure 2 · why one frame is a lie

A schematic toy of one branching event. The world can go two ways: branch A (a fireball forms) or branch B (none). A model that predicts one frame averages them into a ghost at opacity P(A), a blur the game can never render. The mixture refuses the blur: it keeps A and B as two real modes, and the slider only shifts their weights. Drag P(branch A): the mean stays one impossible frame the whole way; the mixture stays two real frames. (Toy of one event; the real M is a five-component mixture density over the latent $z$.)

The tool for that is a mixture density network, or MDN, an idea from Bishop (1994). Instead of one Gaussian, the network outputs a weighted mix of several (five, in this paper), each with its own mean and width. The whole predicted distribution for the next latent is

$$P(z_{t+1} \mid a_t, z_t, h_t) \;=\; \sum_{k=1}^{5} \pi_k \,\mathcal{N}\!\big(z_{t+1}; \, \mu_k, \, \sigma_k^2\big)$$

where the weights $\pi_k$ say how likely each branch is and sum to one. Stitch this onto the LSTM and you get the MDN-RNN, the same recipe Ha used to make a network draw sketches one stroke at a time. Each branch of the mixture can stand for a distinctly different thing that might happen next, which is exactly what you need to capture a discrete event like "a fireball appears."
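For a single latent coordinate, the mixture formula above takes only a few lines to evaluate. The sketch below is a minimal one-dimensional version with made-up weights, means, and widths. In the real model the LSTM produces those numbers at every step, and the negative log-likelihood shown here is the standard way to score such a prediction against what actually happened.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, pis, mus, sigmas):
    """Weighted sum of Gaussian densities; pis must sum to one."""
    return sum(p * gaussian_pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas))

def negative_log_likelihood(x, pis, mus, sigmas):
    """Small when the mixture puts mass on the value that actually occurred."""
    return -math.log(mixture_pdf(x, pis, mus, sigmas))

# Illustrative numbers only: a common "nothing changes" mode plus rarer ones.
pis = [0.70, 0.15, 0.05, 0.05, 0.05]
mus = [0.0, 2.0, -1.0, 1.0, 3.0]
sigmas = [0.3, 0.3, 0.5, 0.5, 0.5]

print(negative_log_likelihood(0.0, pis, mus, sigmas))   # expected outcome: low loss
print(negative_log_likelihood(5.0, pis, mus, sigmas))   # surprising outcome: high loss
```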

That mixture also comes with a single dial, so meet it now. A temperature $\tau$ controls how much randomness you allow when you sample the next latent from the mixture. Following the sketch-drawing work, lowering $\tau$ does two things at once: it shrinks every Gaussian (temperature scales each mode's variance by $\tau$ , so its width, the standard deviation, scales by $\sqrt{\tau}$ ) and it sharpens the choice among branches toward the single most likely one. At $\tau \to 0$ the model collapses onto its favorite branch with no spread, effectively deterministic. At $\tau = 1$ you get the distribution the network actually learned. Above one, the futures spread out and the rare branches get more airtime.
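Here is one way the temperature dial could be written, following the convention just described: divide the log-weights by $\tau$ and renormalize, and multiply each standard deviation by $\sqrt{\tau}$. The function names and the two-mode example are ours, and the sketch assumes every mixture weight is strictly positive so the logarithm is defined.

```python
import math
import random

def apply_temperature(pis, sigmas, tau):
    """Reshape a mixture for sampling: sharpen weights, scale widths by sqrt(tau)."""
    scaled = [math.log(p) / tau for p in pis]      # assumes every p > 0
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]     # subtract the max for stability
    total = sum(exps)
    new_pis = [e / total for e in exps]
    new_sigmas = [s * math.sqrt(tau) for s in sigmas]
    return new_pis, new_sigmas

def sample_next_latent(pis, mus, sigmas, tau, rng=random):
    """Pick a mode by its tempered weight, then draw from that mode's Gaussian."""
    new_pis, new_sigmas = apply_temperature(pis, sigmas, tau)
    k = rng.choices(range(len(new_pis)), weights=new_pis, k=1)[0]
    return rng.gauss(mus[k], new_sigmas[k])

# Illustrative two-mode mixture: a common mode and a rare "fireball" mode.
pis, mus, sigmas = [0.8, 0.2], [0.0, 2.0], [0.3, 0.3]
for tau in (0.1, 1.0, 1.15):
    tempered, _ = apply_temperature(pis, sigmas, tau)
    print(tau, [round(p, 4) for p in tempered])
```

At a low temperature the rare mode's weight is driven almost to zero. Just above one it gets slightly more than its learned share.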

Slide $\tau$ and watch the distribution over one latent coordinate. The tall mode is the usual case, where nothing much changes; the small amber mode is the rare branch, where a fireball forms. The rare mode shrinks toward zero weight as you cool the model down:

Figure 3 · the next latent is a distribution

M predicts the next latent as a mixture of five Gaussians, not a single point. Lower the temperature $\tau$ and every mode narrows while the weight piles onto the likeliest one, so the rare "fireball" branch and its probability vanish. This collapse is the problem the temperature section below has to solve.

M is trained on the same ten thousand random episodes as V, for twenty passes, with one simple objective: make the mixture assign high probability to the latent that actually came next. No reward, no planning, just next-frame prediction in the compressed space V built. After training, M is a generative model of how the world moves, and the rest of the paper is about cashing it in.

C: the part that acts

After V and M, the controller is almost an anticlimax, which is the design working as intended. C takes the current code $z_t$ and M's hidden state $h_t$ , glues them into one vector, and runs a single linear layer to produce the action. That is the whole policy, and it is the one numbered equation in the paper:

$$a_t = W_c\,[\,z_t \; h_t\,] + b_c \tag{1}$$

No hidden layers, no nonlinearity beyond squashing the output into the valid action range. For the car-racing agent, $z_t$ is 32 numbers and $h_t$ is 256, so the concatenation is 288 numbers, and mapping that to three continuous controls (steer, gas, brake) takes 288 × 3 weights plus 3 biases, which is 867 parameters. That is the whole brain that touches the reward.
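The parameter count is easy to verify in code. The sketch below builds a linear controller with the car-racing sizes from the text and counts its numbers. The class name, the random initialization, and the use of `tanh` as the output squash are our assumptions for illustration, not details taken from the paper.

```python
import math
import random

Z_DIM, H_DIM, ACTION_DIM = 32, 256, 3   # car-racing sizes from the text

class LinearController:
    """a_t = W_c [z_t h_t] + b_c, followed by a tanh squash (an assumption here)."""

    def __init__(self, rng):
        in_dim = Z_DIM + H_DIM
        self.W = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)]
                  for _ in range(ACTION_DIM)]
        self.b = [0.0] * ACTION_DIM

    def num_params(self):
        return sum(len(row) for row in self.W) + len(self.b)

    def action(self, z, h):
        features = list(z) + list(h)             # concatenate [z_t, h_t]
        raw = [sum(w * f for w, f in zip(row, features)) + b
               for row, b in zip(self.W, self.b)]
        return [math.tanh(r) for r in raw]       # squash into [-1, 1]

rng = random.Random(0)
controller = LinearController(rng)
z = [rng.gauss(0.0, 1.0) for _ in range(Z_DIM)]
h = [rng.gauss(0.0, 1.0) for _ in range(H_DIM)]
print(controller.num_params(), controller.action(z, h))
```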

The controller is kept this small on purpose, because now you can optimize it with a method that needs no gradients at all. The paper uses CMA-ES (covariance-matrix adaptation evolution strategy), an evolution strategy. The idea is refreshingly blunt: keep a population of candidate controllers drawn from a Gaussian over the 867 numbers, run each one in the environment, see which scored well, and shift the Gaussian toward the winners, also reshaping its covariance so it stretches along the directions that have been paying off. Repeat for many generations. It only ever needs the final score of a rollout, so it does not care that the reward is sparse or that env.step is a non-differentiable black box, and it parallelizes trivially across machines. CMA-ES handles a few thousand parameters comfortably, and 867 sits well inside that.
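The sample, score, and shift cycle can be shown with a much simpler evolution strategy. The sketch below is not CMA-ES: it keeps a fixed isotropic Gaussian with a hand-tuned decaying step size and never adapts a covariance matrix. It shares the one property that matters here, which is that it uses only each candidate's scalar score. The toy objective stands in for a rollout's total reward, and every name in it is ours.

```python
import random

def simple_es(fitness, dim, population=64, elite=16, generations=60,
              sigma=0.5, seed=0):
    """Gaussian evolution strategy with an isotropic, decaying step size.

    Not CMA-ES: there is no covariance adaptation, only a mean that moves
    toward the best-scoring candidates each generation.
    """
    rng = random.Random(seed)
    mean = [0.0] * dim
    for _ in range(generations):
        candidates = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
                      for _ in range(population)]
        # Only the scalar score of each candidate is used: no gradients.
        candidates.sort(key=fitness, reverse=True)
        winners = candidates[:elite]
        mean = [sum(w[i] for w in winners) / elite for i in range(dim)]
        sigma *= 0.95   # hand-tuned decay, an assumption of this sketch
    return mean

# Toy black-box score standing in for a rollout's total reward.
TARGET = [1.0, -2.0, 0.5]

def toy_score(params):
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))

best = simple_es(toy_score, dim=len(TARGET))
print([round(b, 2) for b in best])
```

CMA-ES adds the covariance update on top of this loop, which lets the search stretch along directions that keep paying off instead of exploring equally in all of them.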

Together the three parts form a loop. Each step, the environment shows a frame; V codes it to $z$ ; C reads $z$ and M's state $h$ and emits an action; the action moves the world and also feeds M, which rolls its hidden state forward to be ready for the next step:

Figure 4 · the agent loop

The loop, with capacity bars drawn to scale (car-racing counts). V sees, M predicts, C acts. V holds 4.35M parameters and M holds 422K; C's 867 are a sliver you can barely see. Almost the entire agent is the world model.

def rollout(controller):              # one episode in the real world
    obs = env.reset()
    h = rnn.initial_state()           # M's hidden state: the running memory
    total, done = 0, False
    while not done:
        z = vae.encode(obs)           # V:  64x64x3 frame  ->  z
        a = controller.action([z, h]) # C:  a = W_c [z, h] + b_c
        obs, reward, done = env.step(a)
        h = rnn.forward([a, z, h])    # M:  roll the memory one step forward
        total += reward
    return total

Feeding C that hidden state $h$ buys something concrete. On the car-racing task, a controller that sees only the current code $z$ drives, but badly: it wobbles and clips the sharp corners, scoring $632 \pm 251$ . Give the very same linear controller M's hidden state $h$ as well, and the score jumps to $906 \pm 21$ . Nothing changed except that C can now see what M expects to happen next. The driving steadies because the agent is no longer reacting to the current frame alone; it is acting on a prediction, the same as the batter.

That $906 \pm 21$ clears the bar for solving the task. CarRacing-v0 counts as solved at an average of 900 over 100 consecutive trials, a bar no published method had cleared. The strongest prior deep-RL agents scored in the 591 to 652 range (DQN trailed at 343), and the best entry on the leaderboard managed $838 \pm 11$ . A linear controller reading a learned world model was the first to actually solve it.

Training inside a dream

Everything so far used the real environment to generate the agent's experience. M, though, takes the current latent and an action and produces the next latent. That is precisely the signature of an environment: state plus action goes in, next state comes out. M is not just predicting the world; M is a world, a self-contained one that lives entirely in latent space.

So unplug reality. Instead of asking the real game for the next frame, ask M for the next latent: sample $z_{t+1}$ from its mixture, feed that back in as if it were a real observation, and let the loop run. The controller cannot tell the difference, because it only ever saw latents to begin with. For the Doom agent the paper also trains M to predict whether the agent dies, so the dreamed environment can even end episodes. Nothing real is in the loop. The agent is dreaming, and it can act inside the dream:

Figure 5 · the dream loop

With the real environment cut out, M generates the next latent itself and the agent lives inside the hallucination: sample $\hat z$, act, feed the action back to M, repeat. Here the dreamed rollout is a corridor of fireballs the agent learns to dodge. The pixels are decoded only so we can watch; the agent trains purely on latents.

def dream(controller, tau):           # one episode inside M's hallucination
    z = sample_initial_latent()       # start from a latent, never a pixel
    h = rnn.initial_state()
    total, done = 0, False
    while not done:
        a = controller.action([z, h]) # same C, it cannot tell it is dreaming
        z_next, done = rnn.dream(a, z, h, tau)  # M samples the next z (and death)
        h = rnn.forward([a, z, h])    # roll memory forward with the current z, as in rollout
        z = z_next                    # the sampled latent becomes the next observation
        total += 1                    # Take Cover: +1 for each step survived
    return total

The two code blocks are the same loop with the source of experience swapped. In the real rollout, the next observation comes from env.step. In the dream, the next latent comes from rnn.dream. Train the controller against the second, then deploy it against the first. On the Doom "Take Cover" task, where the agent has to survive a barrage of fireballs and solving means staying alive past 750 time steps on average, an agent trained entirely in the dream scored around 900 in the dream and then, dropped into the real game, survived for $1092 \pm 556$ steps. It learned to dodge fireballs it had only ever met in its imagination.
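A runnable toy makes the swap visible. Everything in the sketch below is hypothetical: a one-number "latent" that drifts off a track, a hand-written `real_step`, a deliberately slightly wrong `dream_step` standing in for a learned M, and a fixed steering policy. It shows only the structural point, that one loop can run against either source of experience. It is not the paper's environment or model.

```python
import random

def real_step(z, a, rng):
    """Hypothetical 'real' dynamics: drift, plus the agent's push, plus noise."""
    z_next = z + 0.05 + 0.1 * a + rng.gauss(0.0, 0.05)
    return z_next, abs(z_next) > 1.0          # episode ends off the track

def dream_step(z, a, rng, tau=1.15):
    """Hypothetical learned model of the same dynamics, slightly wrong,
    with its noise width scaled by sqrt(tau)."""
    z_next = z + 0.04 + 0.11 * a + rng.gauss(0.0, 0.05 * tau ** 0.5)
    return z_next, abs(z_next) > 1.0

def run_episode(policy, step, rng, max_steps=500):
    """The shared loop: only `step` differs between reality and the dream."""
    z, survived = 0.0, 0
    for _ in range(max_steps):
        a = policy(z)
        z, done = step(z, a, rng)
        survived += 1
        if done:
            break
    return survived

def steer_to_center(z):
    return max(-1.0, min(1.0, -5.0 * z))      # clipped proportional policy

rng = random.Random(0)
dream_score = run_episode(steer_to_center, dream_step, rng)
real_score = run_episode(steer_to_center, real_step, rng)
print(dream_score, real_score)
```

Because the policy only ever reads `z`, nothing in `run_episode` needs to know which step function it was handed. The controller in the paper is in the same position.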

A learned dream environment runs without a game engine, without rendering pixels, without physics you have to compute. It is just a recurrent network unrolling in latent space, so you can run it on a GPU as fast as the network goes, spin up as many parallel dreams as you like, and never pay the cost of the real simulator while the controller practices. But the dream is only as honest as M, and its limits start to show.

Cheating the dream, and why noise fixes it

M is a model, which means M is wrong in places. And a controller trained inside M is under no obligation to be fair about it. It will happily find a policy that scores beautifully against M's flaws and means nothing in reality, the way a speedrunner finds a glitch that walks through a wall the designer never meant to be walkable. The incentive is built in: when the controller trains in the dream its entire reward flows through M, so it treats whatever M shows as ground truth and optimizes for whatever scores well there. If some flaw in M is easier to exploit than the real task is to solve, the search finds it first, and nothing in the loop ever tells the controller it is cheating.

The paper watched exactly this happen. In early Doom dreams the controller discovered an adversarial policy: it learned to move in a way that caused the dreamed monsters to never fire. Even when a fireball started to form, the agent could make it fizzle out, as if it had found a cheat code in its own imagination. Inside the dream the agent was untouchable. Back in the real game, where monsters do not honor the exploit, the same policy was useless.

Raising the temperature closes the exploit. Crank $\tau$ up and you inject extra uncertainty into M's predictions: fireballs appear less predictably, the dream behaves more erratically, and the precise sequence of events the exploit depended on stops being reliable.

But uncertainty cuts both ways, and the paper's temperature table lays the tradeoff out plainly. Step the dial through the measured settings and compare two numbers at each: the score the controller gets in the dream, and the score the very same controller gets back in the real game:

Figure 6 · the exploitation tradeoff

Take Cover scores against temperature $\tau$ (the paper's numbers). At low $\tau$ the dream looks solved (~2086) while the real game score collapses to 193, below a random policy. Add uncertainty and the curves cross: real transfer peaks at $\tau = 1.15$ with $1092$, then a harder dream starts to hurt again.

The two ends of the table point in opposite directions. At $\tau = 0.1$ the dream is nearly deterministic, the monsters mode-collapse into never shooting, and the agent racks up a near-perfect $2086$ by exploiting a world with no danger in it. Deployed for real, that same agent scores $193$ , worse than flailing at random (a random policy survives for $210$ ). The dream score was a fantasy in the literal sense. At $\tau = 1.15$ the dream is honestly hard, the controller can no longer cheat, its dream score is a modest $918$ , and that is the version that transfers best, surviving $1092$ steps in the real game. Push $\tau$ to $1.30$ and the dream gets so chaotic there is little left to learn, and both scores sag. A dream you can game teaches the wrong lesson, and a dose of noise keeps the agent from studying for the wrong test.

Return to the mixture density network for a moment. A plain deterministic predictor would have one future, the easiest possible world to exploit. Because M outputs a real distribution with separate modes, you can dial its uncertainty up with temperature and you can represent genuinely branching events like a fireball that may or may not appear. The multimodality is what lets you make the dream both realistic and tamper-resistant.

What it solves, and where it breaks

For a system that does this much, there is almost suspiciously little to it: an unsupervised autoencoder compressing each frame, a recurrent mixture model predicting the next compressed frame, and a linear controller evolved over those two. Two of the three parts never see a reward, and the one that does is small enough for a gradient-free search to handle. From those pieces comes the first agent to solve CarRacing-v0, and one that learns to play Doom inside a hallucination and carries the skill back to the real game.

The paper also sketches where this goes for harder problems, and it is careful to mark these as proposals rather than finished results. For a world too rich to learn from random play, it suggests an iterative loop: act in the real environment, collect what you saw, improve M on the new data, train C inside the improved M, and repeat. To push the agent toward the parts of the world M understands worst, it suggests rewarding the agent for surprising M, by flipping the sign of M's own prediction loss so that high prediction error becomes something to seek out. That is a rough cousin of Schmidhuber's long line of work on artificial curiosity, though seeking raw prediction error is not quite the same as seeking learning progress, and the paper is upfront that one round was all its simple tasks needed. Whether the loop scales is left open.

The limits follow straight from the design. Because V is trained with no notion of the task, it spends its bottleneck on whatever looks prominent, not on whatever matters, so a feature the policy needs can get compressed away while a useless wall texture is preserved in detail. The root cause is the objective: V is graded only on pixel reconstruction, and every pixel is weighted equally in that loss, so the budget flows to whatever is visually large and repeated rather than to whatever the task needs, and nothing in V's training marks the road edge as mattering more than the wall texture. Because M is a finite recurrent network, it can only hold so much of a world, and over long iterative training it is prone to catastrophic forgetting, losing old skills as it picks up new ones. And the dream is always a model: useful exactly to the degree that M is faithful, exploitable exactly where M is not, which is why the temperature knob had to exist at all.

What survives all of that is the shape of the idea, and it has aged well. Learn a compact, predictive model of your world from cheap unsupervised data, put nearly all of your capacity there where a dense objective makes learning easy, keep the reward-driven part small, and then use the model not only to inform decisions but to generate the experience you learn from, so practice costs a forward pass instead of a trip through reality. An agent that can predict its world can rehearse inside it. That is the sentence the paper makes concrete, and it is why the line of work it belongs to keeps coming back.

Provenance · Verified against primary literature

VAE (Kingma & Welling, 2013). The V model: a Gaussian-latent autoencoder, reconstruction plus a KL pull toward the prior.

MDN (Bishop, 1994) / Graves (2013). The M model: a mixture-density RNN; temperature reshapes the mixture when sampling.

CMA-ES (Hansen). Gradient-free evolution of the 867-parameter linear controller.

LSTM (Hochreiter & Schmidhuber, 1997). The recurrent memory whose hidden state h carries the predicted future.

Correction. The temperature τ follows the SketchRNN convention the paper adopts: it scales each Gaussian’s standard deviation by √τ (variance by τ) and sharpens the mixture weights toward the likeliest mode. Many summaries mention only the variance, and √τ (not τ) on the standard deviation is the easy thing to get wrong. At τ = 0.1 this collapse is what makes the dreamed monsters stop firing.

Questions you might still have

Why train the controller with evolution instead of backpropagation?
Reward is sparse, delayed, and flows through a non-differentiable environment, so backprop has no clean path to follow. Because the controller is tiny (867 parameters), a gradient-free optimizer like CMA-ES can search it directly using only each rollout’s final score. The heavy, differentiable learning is left to V and M, which predict observations and never touch the reward.

How can an agent learn to dodge fireballs it never really saw?
M is trained to predict the next latent, so it doubles as an environment: state and action in, next state out. Sampling from M generates a rollout entirely in latent space, fireballs included. The controller trains against those dreamed rollouts and transfers back, because at deployment it reads the same latents, now produced by the real game instead of M.

If the dream is just a model, why does a noisier dream transfer better?
A near-deterministic dream has exploitable flaws, and the controller will happily learn a policy that games them (making dream-monsters never fire) and means nothing in reality. Raising the temperature injects uncertainty that breaks those exploits, so the agent has to learn a policy that works under genuine unpredictability. Too much noise, though, and the dream becomes too chaotic to learn from: best transfer is at τ = 1.15.

Why a VAE for V, and not a plain autoencoder?
The KL term caps how much information each code can carry and keeps the codes packed into a smooth Gaussian blob with no empty pockets. That tidiness pays off when M starts inventing latents that V never produced from a real frame: they still land somewhere the decoder can interpret, instead of in undefined space.

Footnotes & further reading

The paper: Ha & Schmidhuber, World Models (Google Brain / NNAISENSE, NeurIPS 2018). The authors' interactive version, with playable demos, lives at worldmodels.github.io.
The V model is a variational autoencoder: Kingma & Welling, Auto-Encoding Variational Bayes (we have a separate explainer). The reconstruction-plus-KL loss is minimized; the KL pulls the codes toward the standard-normal prior $\mathcal{N}(\mathbf{0},\mathbf{I})$ .
The M model is a mixture density network on top of an LSTM: Bishop, Mixture Density Networks (1994); Graves, Generating Sequences With RNNs (2013); and the temperature trick as used in Ha & Eck, A Neural Representation of Sketch Drawings (SketchRNN). The LSTM itself is Hochreiter & Schmidhuber (1997).
The controller is evolved with CMA-ES: Hansen, The CMA Evolution Strategy: A Tutorial. Population 64, each candidate averaged over 16 rollouts; car racing was solved after about 1800 generations.
The lineage of learning and planning with recurrent world models runs back through Schmidhuber, On Learning to Think (2015) and his 1990–1991 work on RNN controller-model systems, which the paper builds on directly.
Numbers quoted here are from the paper: car-racing scores $632 \pm 251$ (V only), $906 \pm 21$ (full model); the Doom temperature table ( $\tau$ from 0.10 to 1.30); and parameter counts 4,348,547 (V), 422,368 (M), 867 (car-racing C). The Doom controller is slightly larger at 1,088 because it is also fed the LSTM's cell state.
