Generative Adversarial Networks
Two networks play a forgery game until the fakes pass.
One network paints fakes, the other plays art critic, and the only feedback the forger ever gets is the critic's opinion. Push that game to its conclusion and the fakes become samples from the real data distribution. The whole thing trains with plain backpropagation and no likelihood anywhere in sight.
Explaining the paper: Generative Adversarial Nets
How do you train a network to draw faces when you can never write down what makes a face likely?
Generative modeling hit a wall for years. Say you want a model that produces new photographs of faces. The honest way to fit such a model is maximum likelihood: write down a probability density over images, then nudge its parameters until real faces score high under it. The trouble is the density. For any model rich enough to capture faces, computing that density means evaluating an integral over every possible configuration of the model's hidden variables, and that integral is intractable. The methods of the day fought this with Markov chains and approximate inference, machinery that was slow, fragile, and hard to scale.
Generative Adversarial Nets, from Ian Goodfellow and colleagues at Montréal in 2014, sidesteps the density entirely. The pitch: do not estimate any probabilities. Instead, set up a game between two networks and let the game do the estimating for you. A generator tries to manufacture samples that pass for real. A discriminator tries to catch the fakes. Train them against each other and, at the game's equilibrium, the generator is producing exactly the real data distribution and the discriminator cannot do better than a coin flip. No Markov chains, no inference network, no explicit likelihood. Just backpropagation and a forward pass to sample.
The argument that this works is short, a calculus exercise: what the two networks are, why the best possible discriminator is a ratio of densities, what quantity the generator is secretly driving down when it fights that discriminator, and why the obvious training loss had to be quietly swapped for a different one. Throughout, the claims the 2014 paper actually proves stay separate from the field's later, sharper understanding of why GANs are so hard to train.
Two networks, one forgery game
The paper's own analogy is a team of counterfeiters against the police. The counterfeiters print fake currency and try to spend it without getting caught. The police try to tell the counterfeits from the genuine bills. Each side pressures the other to improve, and the equilibrium is reached when the fakes are indistinguishable from real money, at which point the police can only guess.
Make that precise. The generator $G$ is a network that takes a random vector $z$ drawn from a simple, fixed prior $p_z(z)$ (the paper uses a uniform distribution; a Gaussian is the common modern choice) and maps it to a sample $G(z)$ in data space. The discriminator $D$ takes an input $x$ and returns a single number $D(x)$, its estimate of the probability that $x$ is a real data point rather than one of the generator's fakes.
They are wired into a single objective, one that $D$ wants to make large and $G$ wants to make small:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$
Read it one term at a time. The first term rewards $D$ for assigning a high probability to real data. The second rewards $D$ for assigning a low probability to fakes, since $\log(1 - D(G(z)))$ is large there. So the discriminator wants $V$ as large as it can get. The generator only appears inside the second term, and it wants the opposite: it wants $D(G(z))$ close to 1, so that $\log(1 - D(G(z)))$ heads toward $-\infty$ and $V$ shrinks. One number, pulled in two directions. That is the entire setup.
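To make the objective concrete, here is a minimal Python sketch that estimates the two expectations by sampling. The toy data distribution (a Gaussian centered at 4), the untrained identity generator, and the hand-picked critics are all illustrative assumptions, not the paper's setup.

```python
import math
import random

def value_function(D, G, real_samples, noise_samples):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    real_term = sum(math.log(D(x)) for x in real_samples) / len(real_samples)
    fake_term = sum(math.log(1.0 - D(G(z))) for z in noise_samples) / len(noise_samples)
    return real_term + fake_term

rng = random.Random(0)
real = [rng.gauss(4.0, 1.0) for _ in range(1000)]      # assumed toy data
noise = [rng.uniform(0.0, 1.0) for _ in range(1000)]   # uniform prior on z

G = lambda z: z                                        # untrained toy generator
clueless = lambda x: 0.5                               # critic that always shrugs
# hand-picked critic: a sigmoid with its decision threshold at x = 2
sharp = lambda x: 1.0 / (1.0 + math.exp(-(4.0 * x - 8.0)))

v_clueless = value_function(clueless, G, real, noise)
v_sharp = value_function(sharp, G, real, noise)
print(v_clueless, v_sharp)
```

A critic that answers 0.5 everywhere scores exactly $2 \log 0.5 = -\log 4$ no matter what the samples are, while a critic that separates the two sets pushes the value up toward 0. That is the "discriminator wants it large" half of the game.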
A name for this kind of objective: a two-player minimax game. The outer $\min_G$ and inner $\max_D$ say the generator is choosing its move anticipating the discriminator's best response. The 2014 paper stops there and frames the solution as the global minimum of a criterion we are about to derive. The vocabulary you will hear today, that the solution is a Nash equilibrium and a saddle point of $V$, came later (it is correct, and this page uses it, but those words are not in the original paper). The distinction matters more than it sounds, and it is the root of why GANs are temperamental, a thread that comes back when we watch training collapse.
Notice what is absent. Nowhere does anything evaluate $p_g(x)$, the probability the generator assigns to an image. The generator is never asked "how likely is this face under you?" It is only ever asked "can you fool the critic?" That is the trick that dodges the intractable integral. The price, which we will pay in full at the end, is that the model has no likelihood to report.
The generator turns noise into data
Before the game, get a feel for the generator on its own, because it is doing something subtle. It is a deterministic function. Feed it a fixed $z$ and you always get the same $x = G(z)$. The randomness in the output comes entirely from the randomness you feed in. So $G$ takes a simple distribution, a uniform blob of noise, and reshapes it into something complicated, the distribution of faces. The resulting $p_g$ is the pushforward of the prior $p_z$ through $G$: whatever distribution you get by drawing $z \sim p_z$ and applying $G$. The paper's own word for it is that $G$ "implicitly defines" $p_g$. It is never written down.
How does a smooth map turn flat noise into a lumpy distribution? By stretching some regions and squeezing others. Where $G$ is steep, a small interval of $z$ gets spread across a wide interval of $x$, so the output probability there is thin. Where $G$ is nearly flat, a wide interval of $z$ gets crushed into a narrow interval of $x$, so probability piles up. The paper's Figure 1 puts it exactly: $G$ contracts in regions of high density and expands in regions of low density. In one dimension this is the change-of-variables rule (the same Jacobian bookkeeping a normalizing flow uses, except a GAN never has to compute it):
$$p_g(x) = p_z(z) \left| \frac{dG}{dz} \right|^{-1}, \qquad x = G(z)$$
Drag the slider below. At the start $G$ is the plain identity map, so uniform noise produces a uniform output and the threads stay parallel. As it "trains," $G$ bends into the shape that maps uniform onto the two-bump target: evenly spaced noise samples bunch together under the modes, where the curve goes flat and density should be high, and spread apart across the empty middle, where the curve is steep and density should be low.
That is the whole job of the generator: find a warp of noise whose pushforward is the data. The question is how it could ever learn the right warp without anyone telling it what the data density is. The answer is the discriminator.
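The stretch-and-squeeze picture can be checked numerically. The sketch below assumes a toy generator $G(z) = z^2$ on uniform noise (chosen only because its inverse and slope are easy to write down; it is not from the paper) and compares a histogram of its samples against the change-of-variables formula $p_z(z) / |G'(z)|$.

```python
import math
import random

def G(z):
    return z * z              # assumed toy generator: flat near z = 0, steeper toward z = 1

def dG_dz(z):
    return 2.0 * z

def pushforward_density(x):
    """p_g(x) = p_z(z) / |G'(z)| at z = G^{-1}(x), with p_z = 1 on (0, 1)."""
    z = math.sqrt(x)          # inverse of G on (0, 1)
    return 1.0 / abs(dG_dz(z))

rng = random.Random(0)
samples = [G(rng.random()) for _ in range(200_000)]

def empirical_density(lo, hi):
    """Fraction of samples falling in [lo, hi), divided by the bin width."""
    count = sum(1 for x in samples if lo <= x < hi)
    return count / len(samples) / (hi - lo)

for lo, hi in [(0.04, 0.05), (0.49, 0.51), (0.90, 0.92)]:
    mid = 0.5 * (lo + hi)
    print(f"x~{mid:.3f}  formula={pushforward_density(mid):.3f}  "
          f"sampled={empirical_density(lo, hi):.3f}")
```

Density is highest near $x = 0$, where this $G$ is flattest, and thins out toward $x = 1$, where it is steepest. A GAN only ever does the sampling half of this script; the formula half is what it gets to skip.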
The discriminator is a density ratio
Freeze the generator for a moment, so $p_g$ is some fixed distribution, and ask: what is the best possible discriminator? This has a clean answer, and it is the hinge the whole paper turns on.
Write the value function as an integral over data space. Real points are drawn with density $p_{\text{data}}(x)$ and fakes with density $p_g(x)$, so
$$V(D, G) = \int_x \Big[\, p_{\text{data}}(x) \log D(x) + p_g(x) \log\big(1 - D(x)\big) \Big]\, dx$$
(The expectation over $z$ turned into an integral over $x$ because $p_g$ is defined as the pushforward; no Jacobian goes missing.) Now look at the integrand at a single point $x$. With $a = p_{\text{data}}(x)$ and $b = p_g(x)$ held fixed, we are choosing the number $y = D(x)$ to maximize $a \log y + b \log(1 - y)$. That is a one-variable calculus problem. Before doing the calculus, guess the shape of the answer: at this one point the discriminator is just choosing one number, the probability it reports that $x$ is real, and the best possible choice can only depend on how much real density and how much fake density sit there, so expect a fraction built from $a$ and $b$. Set the derivative to zero, $a/y - b/(1 - y) = 0$, and out drops
$$D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)} \tag{2}$$
and the second derivative is negative, so it is a genuine maximum. Because each point can be optimized on its own, this pointwise solution is the optimal discriminator everywhere at once. Read it as what it is: at any point, the best critic reports the fraction of the probability there that comes from real data. Where only real data lives, $D^* = 1$. Where only fakes live, $D^* = 0$. Where the two distributions are equally dense, $D^* = 1/2$, a shrug.
Ground that at one point. Pick a location where real data is four times as dense as the generator's output, so of the probability sitting there, 80% comes from $p_{\text{data}}$ and 20% from $p_g$. The optimal discriminator at that point reports the real fraction directly, $D^* = 4/(4 + 1) = 0.8$: shown a sample landing there, the best critic is right to call it real four times out of five. Now let the generator improve until it matches the data, so at that same point both densities are equal, a 50/50 split. The fraction becomes $1/(1 + 1) = 0.5$ and the critic is forced to a coin flip, not because it got worse but because real and fake are now genuinely indistinguishable there. The discriminator's output reads off the local mix; driving it to one half everywhere is the generator erasing the mix.
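The one-variable maximization is small enough to brute-force. This sketch (function names are illustrative) searches a grid of candidate outputs $y$ for the one that maximizes $a \log y + b \log(1 - y)$ and compares it with the closed form $a / (a + b)$, using the 4-to-1 and 1-to-1 mixes from the worked example plus one fake-heavy point.

```python
import math

def pointwise_objective(y, a, b):
    """Integrand of V at one point: a*log(y) + b*log(1 - y)."""
    return a * math.log(y) + b * math.log(1.0 - y)

def best_response_by_search(a, b, steps=9999):
    """Brute-force the y in (0, 1) that maximizes the pointwise objective."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=lambda y: pointwise_objective(y, a, b))

def optimal_discriminator(a, b):
    """Closed form from setting the derivative to zero: a / (a + b)."""
    return a / (a + b)

# (real density, fake density) at a single point x
for a, b in [(4.0, 1.0), (1.0, 1.0), (0.2, 1.8)]:
    print(a, b, optimal_discriminator(a, b), best_response_by_search(a, b))
```

The search and the formula agree to the resolution of the grid, which is a quick sanity check that the stationary point really is the maximum.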
This is a density ratio in disguise. Rearranged, it is the exact likelihood ratio between real and fake:
$$\frac{D^*(x)}{1 - D^*(x)} = \frac{p_{\text{data}}(x)}{p_g(x)}$$
The discriminator never learns either density, but the two of them in ratio fall out of it for free. (It equals the Bayesian posterior "probability this is real" only because the game shows real and fake in a one-to-one mix, which is baked into eq (1).)
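A small check of that rearrangement, under the assumption of two unit-variance Gaussians as stand-ins for the data and generator densities (toy choices, so the true ratio is available for comparison):

```python
import math

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

p_data = lambda x: normal_pdf(x, 0.0, 1.0)   # assumed toy data density
p_g = lambda x: normal_pdf(x, 1.0, 1.0)      # assumed toy generator density

def d_star(x):
    """Optimal critic: fraction of the local density that is real."""
    return p_data(x) / (p_data(x) + p_g(x))

def ratio_from_critic(d):
    """Recover p_data / p_g from the critic's output alone."""
    return d / (1.0 - d)

for x in [-1.0, 0.5, 2.0]:
    d = d_star(x)
    print(x, d, ratio_from_critic(d), p_data(x) / p_g(x))
```

With a trained network in place of `d_star`, the recovered ratio is only as good as the critic, which is why this identity is a statement about the optimal discriminator and not about any particular checkpoint.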
The figure below is the picture to hold onto. The amber curve is a fixed data distribution. The teal curve is the generator's, and the slider walks it from far off toward sitting exactly on top of the data. The violet curve is $D^*$, computed from eq (2). Watch what happens to it. While the two distributions are separated, $D^*$ swings confidently between 1 and 0, telling real from fake with ease. As the generator closes the gap, the violet curve sags. When $p_g = p_{\text{data}}$, it is pinned flat at $1/2$: the perfect classifier has been reduced to a coin flip, because there is genuinely nothing left to discriminate.
So the discriminator, at its best, is a meter for how far apart the two distributions are. When it is helpless, they match. That is suggestive. If the generator could somehow read that meter and act to drive it to $1/2$ everywhere, it would be driving $p_g$ onto $p_{\text{data}}$. The next step makes that intuition exact.
What the generator really minimizes
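Substitute the optimal discriminator back into the value function. The generator is now playing against a perfect critic, so its remaining objective is a function of $G$ alone, which the paper calls the virtual training criterion $C(G)$:
$$C(G) = \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\!\left[\log \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)}\right]$$
This looks like a mess of log-ratios, but it is a famous quantity wearing a disguise. Add and subtract a $\log 2$ inside each expectation (so each denominator becomes the average distribution $(p_{\text{data}} + p_g)/2$) and the whole thing reorganizes into two Kullback-Leibler divergences plus a constant:
$$C(G) = -\log 4 + \mathrm{KL}\!\left(p_{\text{data}} \,\Big\|\, \frac{p_{\text{data}} + p_g}{2}\right) + \mathrm{KL}\!\left(p_g \,\Big\|\, \frac{p_{\text{data}} + p_g}{2}\right)$$
A KL divergence $\mathrm{KL}(p \,\|\, q)$ measures how far $p$ is from $q$, and it is zero exactly when they are equal and positive otherwise. The two KL terms here, each comparing a distribution to the average of the two, are precisely the definition of the Jensen-Shannon divergence, a symmetric, well-behaved distance between distributions:
$$C(G) = -\log 4 + 2 \cdot \mathrm{JSD}\!\left(p_{\text{data}} \,\|\, p_g\right)$$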
Now the game has a one-line meaning. Against a perfect discriminator, the generator is minimizing the Jensen-Shannon divergence between its samples and the real data. The Jensen-Shannon divergence is never negative and is zero only when the two distributions are identical, so $C(G)$ bottoms out at its constant floor, $-\log 4 \approx -1.386$ (all logs here are natural), and it reaches that floor at exactly one place: $p_g = p_{\text{data}}$. At that point the optimal discriminator is $1/2$ everywhere, which is what the flattened violet curve in the last figure was showing. The forgery game, played to completion, recovers the data distribution. That is the central theorem.
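The regrouping into KL terms is easier to trust written out. Here is the step for the real-data expectation, with $m = (p_{\text{data}} + p_g)/2$ as shorthand for the average distribution (the symbol $m$ is introduced here for convenience and is not the paper's notation):

```latex
\begin{aligned}
\mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}\right]
  &= \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}(x)}{2\, m(x)}\right] \\
  &= -\log 2 + \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}(x)}{m(x)}\right] \\
  &= -\log 2 + \mathrm{KL}\!\left(p_{\text{data}} \,\|\, m\right)
\end{aligned}
```

The generator-side expectation gives $-\log 2 + \mathrm{KL}(p_g \,\|\, m)$ by the same steps. Adding the two and using $\mathrm{JSD}(p \,\|\, q) = \tfrac{1}{2}\mathrm{KL}(p \,\|\, m) + \tfrac{1}{2}\mathrm{KL}(q \,\|\, m)$ yields $-\log 4 + 2 \cdot \mathrm{JSD}(p_{\text{data}} \,\|\, p_g)$.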
Two pieces of that are worth slowing down on, because they are where the intuition lives. A divergence is a number that measures how far one distribution sits from another: it is zero exactly when the two distributions are identical and grows as they pull apart, the way a distance does (it need not be symmetric, which is why "divergence" and not "distance"). The Jensen-Shannon divergence is the particular one that falls out of the algebra here. It is the symmetric, bounded cousin of KL: symmetric because swapping $p_{\text{data}}$ and $p_g$ leaves it unchanged, bounded because it never exceeds $\log 2$, so the whole objective $C(G)$ is trapped between $-\log 4$ at the bottom and $0$ at the top. Minimizing it pushes the two distributions together, and there is nowhere lower to go than zero, which is reached only when they coincide.
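Both the identity and the bounds can be verified on small discrete distributions. The sketch below plugs the optimal critic into the value function directly and compares the result with $-\log 4 + 2 \cdot \mathrm{JSD}$; the three example distributions are arbitrary illustrations.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL of each distribution to their mean."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def criterion(p_data, p_g):
    """C(G): the value function with D* = p_data / (p_data + p_g) plugged in."""
    total = 0.0
    for a, b in zip(p_data, p_g):
        if a > 0:
            total += a * math.log(a / (a + b))
        if b > 0:
            total += b * math.log(b / (a + b))
    return total

examples = [
    ("match",    [0.5, 0.3, 0.2, 0.0], [0.5, 0.3, 0.2, 0.0]),
    ("overlap",  [0.5, 0.3, 0.2, 0.0], [0.1, 0.2, 0.3, 0.4]),
    ("disjoint", [0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]),
]
for name, p, q in examples:
    via_jsd = -math.log(4.0) + 2.0 * jsd(p, q)
    print(f"{name:8s} C(G)={criterion(p, q):+.4f}  -log4+2*JSD={via_jsd:+.4f}")
```

The matching case sits on the floor at $-\log 4$, the disjoint case hits the ceiling at $0$ (where the divergence equals $\log 2$), and everything else lands in between.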
The coin-flip discriminator is the observable tell that this has happened. Plug $p_g = p_{\text{data}}$ into the optimal discriminator and the two densities in numerator and denominator are equal, so $D^*$ is forced to $1/2$ at every point. The best possible critic, given infinite capacity and a perfectly trained classifier, can do no better than answer "real" with probability one half on everything it is shown, a coin flip. That is not the critic failing; it is the critic reporting, correctly, that there is no longer any difference between real and generated samples to find. A critic stuck at one half everywhere is the signal that the generator has matched the data. (The flat-one-half reading is the balanced-game case the value function in eq (1) builds in, real and fake shown one-to-one; that even mix is what makes $1/2$ the meeting point.)
The figure makes the landscape literal. Slide the generator's distribution across the data and watch $C(G)$ trace out a bowl. The floor is $-\log 4$, touched only when the two distributions coincide. The height of the ball above that floor is exactly $2 \cdot \mathrm{JSD}(p_{\text{data}} \,\|\, p_g)$, the only part of the objective the generator can still push down.
Everything in that derivation lives in the space of distributions, and that is the difference between the paper's theory and what you can actually run. The proof assumes $D$ can be made perfectly optimal at every step and that $p_g$ can be moved freely. In that idealized, infinite-capacity setting, the paper proves the objective is convex in $p_g$ with a single global optimum and that $p_g$ converges to that optimum. The moment you replace "move $p_g$ freely" with "take a gradient step on the weights of an MLP," the convexity is gone, the guarantees evaporate, and, in the paper's own words, the network "introduces multiple critical points in parameter space." The clean story is true in function space and only hoped for in practice.
The loop, and the loss they really use
How do you actually play the game? You cannot solve the inner $\max_D$ to completion at every step, since that would be its own full optimization, so you alternate. Take $k$ gradient steps to improve the discriminator, then one step to improve the generator, and repeat. As long as the generator changes slowly, the discriminator stays near its optimum and the gradient the generator receives is close to the one our theory described. The paper uses $k = 1$, the cheapest option, and momentum. The figure below plays the smallest version of that game: watch what the update pattern does around a saddle.
The discriminator step is exactly what eq (1) says: push $D$ up on real data and down on fakes. The generator step is where a wrinkle hides. Eq (1) tells the generator to minimize $\log(1 - D(G(z)))$. Early in training that loss barely works, for a concrete reason. When the generator is bad, the discriminator spots its fakes with high confidence, so $D(G(z))$ sits near 0. Right there, the loss surface is flat, so the gradient handed back to the generator is tiny. The generator is losing badly with no signal about how to do better. The loss has saturated.
The fix in the paper is a small change with a large effect. Instead of having the generator minimize $\log(1 - D(G(z)))$, have it maximize $\log D(G(z))$:
$$\max_G \; \mathbb{E}_{z \sim p_z(z)}\big[\log D(G(z))\big]$$
Both losses want the same thing, a fake that $D$ calls real, and both share the same equilibrium. The difference is the gradient when the generator is losing. The figure plots the generator's cost under each version against $D(G(z))$. On the left, where bad early samples live, the original minimax cost (amber) is nearly flat, while the swapped cost (teal) is steep. The ratio of their gradients is $(1 - D)/D$, which blows up as $D \to 0$: exactly when the generator most needs a push, the new loss gives it one many times larger.
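The saturation is easiest to see in numbers. The sketch below assumes the critic ends in a sigmoid, so $D = \sigma(a)$ for a pre-sigmoid logit $a$ (the usual parameterization, but an assumption here), and measures each cost's slope with respect to that logit by finite differences.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_cost(a):
    """Original minimax generator cost: log(1 - D), with D = sigmoid(a)."""
    return math.log(1.0 - sigmoid(a))

def non_saturating_cost(a):
    """Swapped generator cost: -log D, with D = sigmoid(a)."""
    return -math.log(sigmoid(a))

def slope(f, a, eps=1e-5):
    """Central finite difference of a cost with respect to the logit a."""
    return (f(a + eps) - f(a - eps)) / (2.0 * eps)

for d in [0.5, 0.1, 0.01, 0.001]:
    a = math.log(d / (1.0 - d))            # logit that produces D = d
    g_sat = slope(saturating_cost, a)
    g_ns = slope(non_saturating_cost, a)
    print(f"D={d:<6} sat={g_sat:+.4f}  non-sat={g_ns:+.4f}  "
          f"ratio={g_ns / g_sat:.1f}  (1-D)/D={(1.0 - d) / d:.1f}")
```

In this parameterization the saturating slope is $-D$ and the non-saturating slope is $-(1 - D)$: the first vanishes as the critic grows confident, the second approaches a constant push, and their ratio is the $(1 - D)/D$ factor.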
This swap is the version everyone trains, and it comes with an asterisk worth stating plainly. The tidy "GANs minimize the Jensen-Shannon divergence" result was derived for the original minimax loss against a perfect discriminator. The non-saturating loss has the same fixed point but a different gradient, and later analysis (Arjovsky and Bottou, 2017) showed that gradient corresponds to a different, reverse-KL-flavored objective, not a clean descent on the Jensen-Shannon divergence. So the theory you carry from the last section is the right intuition for where the game ends, but it is not a literal description of the loss in the training loop. Hold both pictures.
The whole step, in code, is short:
# one GAN training step (Algorithm 1, with k = 1), PyTorch-style
# assumes torch is imported, G and D are modules (D ends in a sigmoid),
# opt_d / opt_g are their optimizers, and sample_real / sample_noise are your own helpers
x = sample_real(batch)                  # real data, x ~ p_data
z = sample_noise(batch)                 # latent prior, z ~ p_z
# 1. discriminator: learn to tell real from fake (ascend V by descending -V)
fake = G(z).detach()                    # block the generator's gradient
loss_d = -(torch.log(D(x)) + torch.log(1 - D(fake))).mean()
opt_d.zero_grad(); loss_d.backward(); opt_d.step()
# 2. generator: learn to fool D (descend the NON-saturating loss)
z = sample_noise(batch)
loss_g = -torch.log(D(G(z))).mean()     # i.e. maximize log D(G(z))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()   # gradient flows D -> G; only G's weights are updated
Two details to read off it. The detach() on the fake during the discriminator step stops the generator from being updated there; we only want $D$ to learn from that line. And in the generator step the gradient flows backward through the frozen discriminator into $G$: the discriminator is the channel through which the data's influence reaches the generator. The generator never sees a real image. It only ever sees the direction the critic says would make its fake more convincing.
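For a version that runs without any deep-learning library, here is a deliberately tiny one-dimensional sketch with hand-written gradients. Everything about it is an assumption made for illustration: the data is a unit-variance Gaussian, the generator is just a learnable shift $G(z) = z + b$, the critic is a logistic regression $D(x) = \sigma(wx + c)$, and the learning rate, batch size, and step count are arbitrary.

```python
import math
import random

def sigmoid(s):
    # numerically stable logistic function
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def train_toy_gan(data_mean=4.0, steps=2000, batch=32, lr=0.05, seed=0):
    rng = random.Random(seed)
    b = 0.0            # generator parameter: G(z) = z + b, z ~ N(0, 1)
    w, c = 0.0, 0.0    # critic parameters: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        # 1. critic step: descend -[log D(x) + log(1 - D(fake))]
        real = [rng.gauss(data_mean, 1.0) for _ in range(batch)]
        fake = [rng.gauss(0.0, 1.0) + b for _ in range(batch)]  # constants here, like detach()
        grad_w = grad_c = 0.0
        for x in real:
            err = -(1.0 - sigmoid(w * x + c))   # d(-log D)/d(logit)
            grad_w += err * x
            grad_c += err
        for x in fake:
            err = sigmoid(w * x + c)            # d(-log(1 - D))/d(logit)
            grad_w += err * x
            grad_c += err
        w -= lr * grad_w / batch
        c -= lr * grad_c / batch
        # 2. generator step: descend the non-saturating loss -log D(G(z))
        fake = [rng.gauss(0.0, 1.0) + b for _ in range(batch)]
        # chain rule through the critic: d(-log D)/d(logit) = -(1 - D), d(logit)/db = w
        grad_b = sum(-(1.0 - sigmoid(w * x + c)) * w for x in fake) / batch
        b -= lr * grad_b
    return b, w, c

b, w, c = train_toy_gan()
print(f"generator offset b = {b:.2f} (data mean is 4.0), critic slope w = {w:.2f}")
```

The generator's only access to the data is the factor `w` in `grad_b`: the critic's slope tells it which way to shift. In this toy the offset should drift toward the data mean while the critic's slope shrinks, though the exact numbers depend on the seed and settings, and a model this simple says nothing about how a real image GAN behaves.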
Watching it learn, and collapse
When it works, training looks like the generator slowly spreading out to cover the data. Picture the real distribution as a handful of separated clusters, the modes. A healthy generator starts as a shapeless blob and, batch by batch, fans out until its samples land on every cluster in the right proportion. The left regime in the figure below is that happy path.
But the same game has a failure mode built into its incentives, and the paper named it before anyone had a cure. Look again at the generator's job: produce samples the discriminator calls real. Nothing in that sentence demands variety. If the generator finds one output that reliably fools the current discriminator, it can profit by producing that one output over and over, mapping many different $z$ values to nearly the same $x$. Contrast this with a supervised loss, which grades every output against its own target, so every mode of the data pulls on the model. Here the only grade is $D$'s opinion of the samples $G$ actually produces. $D$ is never shown the modes $G$ skipped, so nothing in the game charges $G$ for skipping them. Mode coverage is no one's job. The paper calls this the "Helvetica scenario" (the modern name is mode collapse): $G$ collapses too many values of $z$ onto too few values of $x$ to have the diversity needed to model $p_{\text{data}}$. Each batch can look perfectly real and yet the generator only ever visits one corner of the data. Toggle the figure to the collapse regime to see it: the samples cluster on a single mode and hop between modes instead of covering them.
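The missing incentive shows up even in a toy calculation. The sketch below assumes a fixed, hand-written critic that rates anything near either of two data modes as probably real (a hypothetical stand-in, not a trained network) and scores two generators with the non-saturating loss: one that visits both modes and one that has collapsed onto a single mode.

```python
import math

def critic(x):
    """Assumed fixed critic: confident near either data mode (x = -2 or x = +2)."""
    nearest = min(abs(x - 2.0), abs(x + 2.0))
    return 0.05 + 0.9 * math.exp(-nearest ** 2)

def generator_loss(samples):
    """Non-saturating generator loss: mean of -log D(G(z))."""
    return sum(-math.log(critic(x)) for x in samples) / len(samples)

covering = [-2.0, 2.0, -2.0, 2.0]    # generator output that visits both modes
collapsed = [2.0, 2.0, 2.0, 2.0]     # generator output stuck on one mode
print(generator_loss(covering), generator_loss(collapsed))
```

Against this critic the two generators receive the same loss, so the gradient gives the collapsed one no reason to spread out. Any pressure toward coverage has to come later, from the critic adapting to the lopsided samples, which is what produces the hopping between modes.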
Mode collapse is one of three distinct ways GAN training goes wrong, and they are worth keeping separate. It is here that the "saddle point" framing earns its keep. A minimax solution is not the bottom of a valley, where every direction is uphill; it is a saddle, a minimum along the generator's axes and a maximum along the discriminator's. Simultaneous gradient steps on a saddle do not have to converge. They can orbit it forever, the way descending one player while ascending the other traces a circle around the center of $V(x, y) = xy$. That orbiting is the second failure, oscillation. The third is the vanishing gradient from the last section: let the discriminator get too good, too fast, and it saturates the generator's loss and starves it of signal. No single trick fixes all three, which is why a decade of follow-up work went into taming this game.
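The orbit is easy to reproduce. This sketch assumes the standard toy value function $V(x, y) = xy$, a common illustration of saddle dynamics and not something taken from the 2014 paper, with one player descending on $x$ while the other ascends on $y$; the step size and starting point are arbitrary.

```python
import math

def simultaneous_steps(x, y, lr, n):
    """Gradient descent on x and gradient ascent on y for V(x, y) = x * y,
    with both players updating from the same point at each step."""
    radii = [math.hypot(x, y)]
    for _ in range(n):
        grad_x, grad_y = y, x                      # dV/dx = y, dV/dy = x
        x, y = x - lr * grad_x, y + lr * grad_y
        radii.append(math.hypot(x, y))             # distance from the saddle at (0, 0)
    return radii

radii = simultaneous_steps(x=1.0, y=0.0, lr=0.1, n=200)
print(f"start radius {radii[0]:.3f}, end radius {radii[-1]:.3f}")
```

In continuous time the trajectory is an exact circle around the saddle. With a finite step size each update multiplies the distance from the saddle by $\sqrt{1 + \text{lr}^2}$, so the discrete iterates never settle and in fact spiral slowly outward.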
Does it actually work?
In 2014, with no agreed way to score a likelihood-free model, the paper fit a Gaussian Parzen window to the generator's samples, a kernel density estimate: drop a small Gaussian on each generated sample, sum them into a makeshift density, and report the log-likelihood of the real test set under that. The numbers are in Table 1.
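As a rough picture of what that evaluation computes, here is a one-dimensional sketch of a Gaussian Parzen-window log-likelihood. The sample values and bandwidths are made up for illustration; the paper worked on images and chose its bandwidth by cross-validation on a validation set.

```python
import math

def parzen_log_likelihood(test_points, generated, sigma):
    """Mean log-density of test_points under a Gaussian kernel density
    estimate built from `generated` samples (1-D for simplicity)."""
    log_norm = -math.log(len(generated)) - math.log(sigma * math.sqrt(2.0 * math.pi))
    total = 0.0
    for x in test_points:
        exponents = [-0.5 * ((x - g) / sigma) ** 2 for g in generated]
        top = max(exponents)                       # log-sum-exp for numerical stability
        total += top + math.log(sum(math.exp(e - top) for e in exponents)) + log_norm
    return total / len(test_points)

generated = [-2.1, -1.9, 1.8, 2.2]     # assumed generator samples
held_out = [-2.0, 2.0, 0.0]            # assumed held-out real data
for sigma in [0.05, 0.3, 2.0]:
    print(sigma, parzen_log_likelihood(held_out, generated, sigma))
```

The score swings widely with `sigma` even though the generated samples never change, which hints at why a number like this is a fragile way to rank models.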
On MNIST the adversarial net posts the best score, 225 against the next-best 214. On the Toronto Face Database it comes second, 2057 behind the Stacked Contractive Autoencoder's 2110, and the spread across all four models is small relative to the error bars. This is "competitive," not "dominant." Nor is the metric to be trusted: a Parzen-window estimate is a poor stand-in for a likelihood the GAN cannot provide, and later work (Theis, van den Oord, and Bethge, 2015) showed these estimates can rank models in the wrong order and that, in high dimensions, a model can score well while producing nonsense or score badly while producing excellent images. What the paper's experiments really establish is that the samples looked good and were demonstrably not memorized copies of the training set (the paper shows each sample beside its nearest training neighbor), and that the field would need years to invent evaluations worth believing. The qualitative figures, blurry by today's standards, were the real evidence.
What was missing, and what came after
The paper is candid about its own limits, and the list reads like a map of the next decade of research. Two disadvantages are called out directly. First, there is no explicit representation of $p_g(x)$: you can sample from a GAN, but you cannot ask it how probable a given image is, which rules out anything that needs a likelihood. Second, $D$ and $G$ must be kept in careful balance through training, the synchronization problem that produces mode collapse and instability.
Set against those costs are real advantages, and they are what made GANs matter. No Markov chains are ever needed, for training or for sampling. Generating a sample is a single forward pass, fast and exact, with no chain to mix. Only backpropagation is used to get gradients, so the model rides the same hardware and tooling as everything else in deep learning. And because the generator is shaped only by gradients passed back through the discriminator and never fit to pixels directly, it can produce very sharp, even degenerate distributions, where methods that rely on a blurring Markov chain cannot.
What followed turned the demonstration into a field. DCGAN (2015) found the convolutional architecture and training recipe that made image GANs stable enough to be useful. Wasserstein GAN (2017) traded the implicit Jensen-Shannon objective for the Earth-Mover distance, whose gradient does not vanish when the distributions barely overlap, directly attacking the saturation problem. Progressive growing and StyleGAN (2018-2019) pushed the same game to photoreal faces at high resolution. By around 2021 a different family, diffusion models, overtook GANs on many image-generation benchmarks by trading the unstable two-player game for a stable regression objective. GANs did not vanish (they remain fast and competitive in places), but the throne moved.
Step back and the contribution is one idea, stated four ways. You can fit a generative model without ever writing down its density. You do it by pitting it against a classifier whose best move is to report a density ratio. Fighting that classifier turns out to minimize a real divergence to the data. And the whole arrangement trains with nothing but backpropagation. Every modern adversarial method, and a good deal of the instability folklore around them, is a footnote to those four sentences.
Questions you might still have
If the generator never sees a real image, how does it learn?
It only ever receives gradients passed back through the discriminator. The discriminator looks at real data; the generator is told how to nudge its samples so the discriminator rates them higher. The data reaches the generator secondhand, as a direction, never as a target to copy.
Does GAN training really minimize the Jensen-Shannon divergence?
Only in the idealized case: the original minimax loss, with a perfectly optimal discriminator at every step. The non-saturating loss everyone actually uses shares the same fixed point but follows a different gradient (closer to a reverse KL), which is part of why GANs chase modes and can drop them.
Why are GANs so notoriously unstable?
The solution is a saddle point of the value function, not the bottom of a valley, so simultaneous gradient steps can orbit it instead of settling. Separately, if the discriminator gets too good its gradient to the generator vanishes, and the incentives also permit mode collapse. Three distinct failure modes, no single fix.
If a GAN has no likelihood, how do you judge it?
Mostly not by numbers. The paper’s Parzen-window scores are a weak proxy, and the field later judged GANs by sample quality (metrics like FID) and by what they enabled. A GAN can sample beautifully and still be unable to tell you how probable any particular image is.
Footnotes & further reading
- The paper: Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio, Generative Adversarial Nets (Université de Montréal, NeurIPS 2014). Original code.
- The zero-sum / Nash / saddle-point framing, the orbit, and the "not motivated by a theoretical concern" note on the non-saturating loss: Goodfellow, NIPS 2016 Tutorial: Generative Adversarial Networks.
- Why the non-saturating loss is reverse-KL-like and the source of the instability: Arjovsky & Bottou, Towards Principled Methods for Training GANs (2017), and Arjovsky, Chintala & Bottou, Wasserstein GAN.
- When GAN training provably orbits rather than converges (the manifold / not-absolutely-continuous case): Mescheder, Geiger & Nowozin, Which Training Methods for GANs do actually Converge? (ICML 2018).
- Why the Parzen-window numbers should not be trusted: Theis, van den Oord & Bethge, A note on the evaluation of generative models (2015).
- The lineage: DCGAN (2015), Conditional GANs (2014), StyleGAN (2018), and Diffusion Models Beat GANs on Image Synthesis (2021).