
Generative Adversarial Networks

Two networks play a forgery game until the fakes pass.

One network paints fakes, the other plays art critic, and the only feedback the forger ever gets is the critic's opinion. Push that game to its conclusion and the fakes become samples from the real data distribution. It trains with plain backpropagation and no likelihood anywhere in sight.

Explaining the paper: Generative Adversarial Nets · Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio · Université de Montréal · NeurIPS 2014 · arXiv:1406.2661

How do you train a network to draw faces when you can never write down what makes a face likely?

Generative modeling hit a wall for years. Say you want a model that produces new photographs of faces. The standard way to fit such a model is maximum likelihood: write down a probability density $p_g(\mathbf{x})$ over images, then nudge its parameters until real faces score high under it. The density is the hard part. For any model rich enough to capture faces, computing $p_g(\mathbf{x})$ means integrating over every possible configuration of the model's hidden variables, and that integral is intractable. The methods of the day fought this with Markov chains and approximate inference, machinery that was slow, fragile, and hard to scale.

Generative Adversarial Nets, from Ian Goodfellow and colleagues at Montréal in 2014, sidesteps the density entirely. The idea: do not estimate any probabilities. Instead, set up a game between two networks and let the game do the estimating for you. A generator tries to manufacture samples that pass for real. A discriminator tries to catch the fakes. Train them against each other and, at the game's equilibrium, the generator is producing exactly the real data distribution and the discriminator cannot do better than a coin flip. No Markov chains, no inference network, no explicit likelihood. Just backpropagation and a forward pass to sample.

The argument that this works is short, a calculus exercise: what the two networks are, why the best possible discriminator is a ratio of densities, what quantity the generator is driving down when it fights that discriminator, and why the obvious training loss had to be swapped for a different one. Throughout, the claims the 2014 paper actually proves stay separate from the field's later, sharper understanding of why GANs are so hard to train.

Two networks, one forgery game

The paper's own analogy is a team of counterfeiters against the police. The counterfeiters print fake currency and try to spend it without getting caught. The police try to tell the counterfeits from the genuine bills. Each side pressures the other to improve, and the equilibrium is reached when the fakes are indistinguishable from real money, at which point the police can only guess.

Here is the precise version. The generator $G$ is a network that takes a random vector $\mathbf{z}$ drawn from a simple, fixed prior $p_{\mathbf{z}}$ (the paper uses a uniform distribution; a Gaussian is the common modern choice) and maps it to a sample $\mathbf{x} = G(\mathbf{z})$ in data space. The discriminator $D$ takes an $\mathbf{x}$ and returns a single number $D(\mathbf{x}) \in (0,1)$ , its estimate of the probability that $\mathbf{x}$ is a real data point rather than one of the generator's fakes.

They are wired into a single objective, one that $D$ wants to make large and $G$ wants to make small:

\min_G \max_D \; V(D,G) = \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}\big[\log D(\mathbf{x})\big] + \mathbb{E}_{\mathbf{z}\sim p_{\mathbf{z}}}\big[\log\big(1 - D(G(\mathbf{z}))\big)\big]

(1)

Each term carries its own meaning. The first term rewards $D$ for assigning a high probability $D(\mathbf{x})$ to real data. The second rewards $D$ for assigning a low probability $D(G(\mathbf{z}))$ to fakes, since $1 - D$ is large there. So the discriminator wants $V$ as large as it can get. The generator only appears inside the second term, and it wants the opposite: it wants $D(G(\mathbf{z}))$ close to 1, so that $\log(1 - D(G(\mathbf{z})))$ heads toward $-\infty$ and $V$ shrinks.
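As a rough illustration of how the two terms of eq (1) move, here is a minimal sketch that estimates $V$ from batches of discriminator outputs. The function name `value_fn` and the three hand-picked batches are assumptions made up for this example; they do not come from the paper.

```python
import math

def value_fn(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from eq (1).

    d_real: discriminator outputs D(x) on a batch of real samples
    d_fake: discriminator outputs D(G(z)) on a batch of generated samples
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical discriminator outputs, chosen by hand for illustration.
confident = value_fn(d_real=[0.99, 0.98, 0.99], d_fake=[0.01, 0.02, 0.01])
coin_flip = value_fn(d_real=[0.5, 0.5, 0.5], d_fake=[0.5, 0.5, 0.5])
fooled = value_fn(d_real=[0.5, 0.5, 0.5], d_fake=[0.99, 0.98, 0.99])

print(confident)   # slightly below 0: the discriminator is winning
print(coin_flip)   # -log 4, about -1.386
print(fooled)      # well below -log 4: the fakes are being rated real
```

The ordering of the three values is the point: the discriminator pulls $V$ up toward 0, and a generator that fools it drags $V$ down.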

A name for this kind of objective: a two-player minimax game. The outer $\min_G$ and inner $\max_D$ say the generator is choosing its move anticipating the discriminator's best response. The 2014 paper stops there and frames the solution as the global minimum of a criterion we are about to derive. The vocabulary you will hear today, that the solution is a Nash equilibrium and a saddle point of $V$ , came later (it is correct, and this page uses it, but those words are not in the original paper). The distinction between a global minimum and a saddle point is the root of why GANs are temperamental, which the collapse and oscillation behavior makes concrete.

Nowhere does anything evaluate $p_g(\mathbf{x})$ , the probability the generator assigns to an image. The generator is never asked "how likely is this face under you?" It is only ever asked "can you fool the critic?" That question is how the game dodges the intractable integral. The price, which we will pay in full at the end, is that the model has no likelihood to report.

The generator turns noise into data

Before the game, the generator on its own is already doing something subtle. It is a deterministic function. Feed it a fixed $\mathbf{z}$ and you always get the same $\mathbf{x}$ . The randomness in the output comes entirely from the randomness you feed in. So $G$ takes a simple distribution, a uniform blob of noise, and reshapes it into something complicated, the distribution of faces. The resulting $p_g$ is the pushforward of the prior through $G$ : whatever distribution you get by drawing $\mathbf{z}$ and applying $G$ . The paper's own word for it is that $G$ "implicitly defines" $p_g$ . It is never written down.

How does a smooth map turn flat noise into a lumpy distribution? By stretching some regions and squeezing others. Where $G$ is steep, a small interval of $\mathbf{z}$ gets spread across a wide interval of $\mathbf{x}$ , so the output probability there is thin. Where $G$ is nearly flat, a wide interval of $\mathbf{z}$ gets crushed into a narrow interval of $\mathbf{x}$ , so probability piles up. The paper's Figure 1 puts it exactly: $G$ contracts in regions of high density and expands in regions of low density. In one dimension, for a monotone $G$ , this is the change-of-variables rule. It is the same Jacobian bookkeeping that a normalizing flow, a generative model built from invertible maps, uses to track this density change exactly, except that a GAN never has to compute it:

p_g(x) = \frac{p_{\mathbf{z}}(z)}{|G'(z)|}, \qquad x = G(z)
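The rule can be checked numerically. The sketch below assumes a hypothetical monotone generator $G(z) = z^2$ on $(0,1)$ with a uniform prior (neither is from the paper), pushes evenly spaced latent samples through it, and compares the resulting histogram density with $p_{\mathbf{z}}(z)/|G'(z)|$.

```python
import math

def G(z):
    """A hypothetical monotone generator on (0, 1): flat near 0, steep near 1."""
    return z * z

def G_prime(z):
    return 2.0 * z

def p_g(x):
    """Change of variables for uniform p_z = 1 on (0, 1): p_g(x) = p_z(z) / |G'(z)|."""
    z = math.sqrt(x)          # invert x = G(z)
    return 1.0 / abs(G_prime(z))

# Evenly spaced latent samples, standing in for draws from the uniform prior.
N = 100_000
xs = [G((i + 0.5) / N) for i in range(N)]

def empirical_density(lo, hi):
    """Fraction of pushed-forward samples in [lo, hi), divided by the bin width."""
    count = sum(1 for x in xs if lo <= x < hi)
    return count / (N * (hi - lo))

for lo, hi in [(0.01, 0.02), (0.25, 0.36), (0.81, 0.90)]:
    mid = 0.5 * (lo + hi)
    print(f"x in [{lo}, {hi}): empirical {empirical_density(lo, hi):.3f}, formula {p_g(mid):.3f}")
```

Density is highest near $x = 0$, where this $G$ is flat, and lowest near $x = 1$, where it is steep, matching the contract-and-expand picture.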

Drag the slider below. At the start $G$ is the flat identity, so uniform noise produces a uniform output and the threads stay parallel. As it "trains," $G$ bends into the shape that maps uniform $\mathbf{z}$ onto the two-bump target: evenly spaced noise samples bunch together under the modes, where the curve goes flat and density should be high, and spread apart across the empty middle, where the curve is steep and density should be low.

Figure 1 · noise pushed into shape

The generator is one deterministic warp $x = G(z)$. Evenly spaced latent samples $z$ (bottom) ride up to where they land in $x$ (top). Train $G$ and the even ticks squeeze together under the two modes of $p_{\text{data}}$ (contract, density piles up) and stretch apart in the gap (expand, density thins), until the teal $p_g$ matches the amber target. The density is never written down; it is whatever falls out of pushing noise through $G$.

That is the generator's entire job: find a warp of noise whose pushforward is the data. The question is how it could ever learn the right warp without anyone telling it what the data density is. A second network, the discriminator, supplies the missing signal.

The discriminator is a density ratio

With the generator frozen, so $p_g$ is some fixed distribution, the question is what the best possible discriminator can be. This has a precise answer, and the rest of the paper turns on it.

Write the value function as an integral over data space. Real points are drawn with density $p_{\text{data}}(x)$ and fakes with density $p_g(x)$ , so

V(D,G) = \int_x \Big[\, p_{\text{data}}(x)\,\log D(x) + p_g(x)\,\log\big(1 - D(x)\big) \,\Big]\,dx

(3)

(The expectation over $\mathbf{z}$ turned into an integral over $x$ because $p_g$ is defined as the pushforward; no Jacobian goes missing.) The integrand at a single point is what matters here. With $a = p_{\text{data}}(x)$ and $b = p_g(x)$ held fixed, we are choosing the number $D(x)$ to maximize $a\log D + b\log(1-D)$ . That is a one-variable calculus problem. The shape of the answer is predictable before doing the calculus: at this one point the discriminator chooses a single number, the probability it reports that $x$ is real, and the best possible choice can only depend on how much real density and how much fake density sit there, so the answer is a fraction built from $a$ and $b$ . Setting the derivative to zero, $a/D - b/(1-D) = 0$ , gives

D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}

(2)

and the second derivative is negative, so it is a genuine maximum. Because each point $x$ can be optimized on its own, this pointwise solution is the optimal discriminator everywhere at once. At any point, the best critic reports the fraction of the probability there that comes from real data. Where only real data lives, $D^* = 1$ . Where only fakes live, $D^* = 0$ . Where the two distributions are equally dense, $D^* = \tfrac12$ , a shrug.

A single point makes this concrete. Pick a location $x$ where real data is four times as dense as the generator's output, so of the probability sitting there, 80% comes from $p_{\text{data}}$ and 20% from $p_g$ . The optimal discriminator at that point reports the real fraction directly, $D^*(x) = 0.8/(0.8 + 0.2) = 0.8$ : shown a sample landing there, the best critic calls it real four times out of five, matching the true proportion. Now let the generator improve until it matches the data, so at that same point both densities are equal, a 50/50 split. The fraction becomes $0.5/(0.5 + 0.5) = 0.5$ and the critic is forced to a coin flip, not because it got worse but because real and fake are now indistinguishable there. The discriminator's output reads off the local mix; driving it to one half everywhere is the generator erasing the mix.
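The closed form can be sanity-checked by brute force. This is a minimal sketch, with function names invented for the example: it searches a grid of candidate values for $D(x)$ and compares the maximizer of $a\log D + b\log(1-D)$ with $a/(a+b)$.

```python
import math

def pointwise_objective(d, a, b):
    """The integrand of eq (3) at one point: a*log(D) + b*log(1 - D)."""
    return a * math.log(d) + b * math.log(1.0 - d)

def best_d_by_search(a, b, steps=1000):
    """Brute-force the maximizing D on a grid strictly inside (0, 1)."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda d: pointwise_objective(d, a, b))

def d_star(a, b):
    """Closed form from eq (2)."""
    return a / (a + b)

# (a, b) = (p_data(x), p_g(x)) at a single point; the pairs are illustrative.
for a, b in [(0.8, 0.2), (0.5, 0.5), (0.1, 0.3)]:
    print(a, b, best_d_by_search(a, b), d_star(a, b))
```

The first pair is the 80/20 example from the text, and the second is the matched case where the critic is forced to one half.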

This is a density ratio. Rearranged, it is the exact likelihood ratio between real and fake:

\frac{D^*(x)}{1 - D^*(x)} = \frac{p_{\text{data}}(x)}{p_g(x)}

The discriminator never learns either density, yet their ratio is exactly what its output encodes. (It equals the Bayesian posterior "probability this is real" only because the game shows real and fake in a one-to-one mix, which is baked into eq 1.)
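To see the ratio fall out of the discriminator's output, the sketch below uses two hypothetical one-dimensional Gaussians as stand-ins for $p_{\text{data}}$ and $p_g$ (an assumption for illustration only), computes $D^*$ from them, and recovers the likelihood ratio from the odds $D^*/(1-D^*)$.

```python
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical 1-D densities standing in for p_data and p_g.
def p_data(x):
    return gaussian_pdf(x, mean=0.0, std=1.0)

def p_gen(x):
    return gaussian_pdf(x, mean=1.0, std=1.5)

def d_star(x):
    """Optimal discriminator from eq (2)."""
    return p_data(x) / (p_data(x) + p_gen(x))

for x in [-2.0, 0.0, 1.0, 3.0]:
    d = d_star(x)
    odds = d / (1.0 - d)                 # what the critic's output encodes
    ratio = p_data(x) / p_gen(x)         # the true likelihood ratio
    print(f"x={x:+.1f}  D*={d:.4f}  odds={odds:.4f}  ratio={ratio:.4f}")
```

In a real GAN neither density is available; only the trained discriminator's output is, which is why the odds are the useful quantity.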

The figure below is the picture to hold onto. The amber curve is a fixed data distribution. The teal curve is the generator's, and the slider walks it from far off toward sitting exactly on top of the data. The violet curve is $D^*$ computed from eq (2). While the two distributions are separated, $D^*$ takes values near 1 and near 0, telling real from fake with ease. As the generator closes the gap, the violet curve sags. When $p_g = p_{\text{data}}$ , it is pinned flat at $\tfrac12$ : the perfect classifier has been reduced to a coin flip, because there is genuinely nothing left to discriminate.

Figure 2 · the optimal discriminator

For a frozen generator, the best discriminator is the density ratio $D^* = p_{\text{data}}/(p_{\text{data}} + p_g)$ (violet). It rides toward 1 where real data dominates and toward 0 where the fakes do. Drive $p_g$ onto $p_{\text{data}}$ and $D^*$ collapses onto the dashed $\tfrac12$ line. The readout is the game's value $C(G)$, which bottoms out at $-\log 4$ exactly when the two distributions coincide.

So the discriminator, at its best, is a meter for how far apart the two distributions are. When it is helpless, they match. If the generator could somehow read that meter and act to drive it to $\tfrac12$ everywhere, it would be driving $p_g$ onto $p_{\text{data}}$ . Substituting the optimal discriminator back in makes that intuition exact.

What the generator really minimizes

Substitute the optimal discriminator $D^*$ back into the value function. The generator is now playing against a perfect critic, so its remaining objective is a function of $G$ alone, which the paper calls the virtual training criterion $C(G)$ :

C(G) = \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}\!\Big[\log \tfrac{p_{\text{data}}}{p_{\text{data}}+p_g}\Big] + \mathbb{E}_{\mathbf{x}\sim p_g}\!\Big[\log \tfrac{p_g}{p_{\text{data}}+p_g}\Big]

(4)

This looks like a mess of log-ratios, but it is a famous quantity wearing a disguise. Add and subtract a $\log 2$ inside each expectation (so each denominator becomes the average distribution $M = (p_{\text{data}} + p_g)/2$ ) and the expression reorganizes into two Kullback-Leibler divergences plus a constant:

C(G) = -\log 4 + \mathrm{KL}\!\Big(p_{\text{data}}\,\big\|\,\tfrac{p_{\text{data}}+p_g}{2}\Big) + \mathrm{KL}\!\Big(p_g\,\big\|\,\tfrac{p_{\text{data}}+p_g}{2}\Big)

(5)
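For readers who want the intermediate step, the move from eq (4) to eq (5) can be written out term by term. This is a sketch of the standard manipulation, using only the identity $\log\frac{p}{p+q} = \log\frac{p}{(p+q)/2} - \log 2$:

```latex
\begin{aligned}
\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}\!\Big[\log \tfrac{p_{\text{data}}}{p_{\text{data}}+p_g}\Big]
  &= \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}\!\Big[\log \tfrac{p_{\text{data}}}{(p_{\text{data}}+p_g)/2}\Big] - \log 2
   = \mathrm{KL}\!\Big(p_{\text{data}}\,\big\|\,\tfrac{p_{\text{data}}+p_g}{2}\Big) - \log 2, \\
\mathbb{E}_{\mathbf{x}\sim p_g}\!\Big[\log \tfrac{p_g}{p_{\text{data}}+p_g}\Big]
  &= \mathbb{E}_{\mathbf{x}\sim p_g}\!\Big[\log \tfrac{p_g}{(p_{\text{data}}+p_g)/2}\Big] - \log 2
   = \mathrm{KL}\!\Big(p_g\,\big\|\,\tfrac{p_{\text{data}}+p_g}{2}\Big) - \log 2 .
\end{aligned}
```

Adding the two lines gives the two KL terms plus $-2\log 2 = -\log 4$.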

A KL divergence $\mathrm{KL}(P\|Q)$ measures how far $P$ is from $Q$ , and it is zero exactly when they are equal and positive otherwise. The two KL terms here, each comparing a distribution to the average of the two, are precisely the definition of the Jensen-Shannon divergence, a symmetric, well-behaved distance between distributions:

C(G) = -\log 4 + 2\cdot \mathrm{JSD}\big(p_{\text{data}}\,\|\,p_g\big)

(6)

Now the game has a one-line meaning. Against a perfect discriminator, the generator is minimizing the Jensen-Shannon divergence between its samples and the real data. The Jensen-Shannon divergence is never negative and is zero only when the two distributions are identical, so $C(G)$ bottoms out at its constant floor, $-\log 4 \approx -1.386$ (all logs here are natural), and it reaches that floor at exactly one place: $p_g = p_{\text{data}}$ . At that point the optimal discriminator is $\tfrac12$ everywhere, which is what the flattened violet curve in Figure 2 shows.

Two pieces of that carry the intuition. A divergence is a number that measures how far one distribution sits from another: it is zero exactly when the two distributions are identical and grows as they pull apart, the way a distance does (it need not be symmetric, which is why "divergence" and not "distance"). The Jensen-Shannon divergence is the particular one that falls out of the algebra here. It is the symmetric, bounded cousin of KL: symmetric because swapping $p_{\text{data}}$ and $p_g$ leaves it unchanged, bounded because it never exceeds $\log 2$ , so the objective $C(G) = -\log 4 + 2\,\mathrm{JSD}$ is trapped between $-\log 4$ at the bottom and $0$ at the top. Minimizing it pushes the two distributions together, and the divergence has nowhere lower to go than zero, which is reached only when they coincide.
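The identity and its bounds are easy to verify on small discrete distributions. The sketch below uses four-outcome distributions invented for the example. It evaluates eq (4) directly and compares it with $-\log 4 + 2\,\mathrm{JSD}$.

```python
import math

def kl(p, q):
    """KL(P || Q) for discrete distributions; terms with p_i = 0 contribute 0.

    Assumes q_i > 0 wherever p_i > 0, which holds when Q is the mixture below.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture M = (P + Q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def c_of_g(p_data, p_g):
    """Eq (4) evaluated directly on discrete distributions."""
    total = 0.0
    for a, b in zip(p_data, p_g):
        if a > 0.0:
            total += a * math.log(a / (a + b))
        if b > 0.0:
            total += b * math.log(b / (a + b))
    return total

p_data = [0.5, 0.5, 0.0, 0.0]
cases = {
    "matched": [0.5, 0.5, 0.0, 0.0],
    "partial overlap": [0.0, 0.5, 0.5, 0.0],
    "disjoint": [0.0, 0.0, 0.5, 0.5],
}
for name, p_g in cases.items():
    direct = c_of_g(p_data, p_g)
    via_jsd = -math.log(4.0) + 2.0 * jsd(p_data, p_g)
    print(f"{name:16s} C(G) = {direct:+.4f}   -log4 + 2*JSD = {via_jsd:+.4f}")
```

The matched case sits on the floor at $-\log 4$, and the disjoint case sits at the ceiling of $0$, where the divergence has reached $\log 2$.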

A coin-flip discriminator marks the point where the generator has matched the data. Plug $p_g = p_{\text{data}}$ into the optimal discriminator $D^* = p_{\text{data}}/(p_{\text{data}} + p_g)$ and the two densities in numerator and denominator are equal, so $D^*$ is forced to $\tfrac12$ at every point. The best possible critic, given infinite capacity and a perfectly trained classifier, can do no better than answer "real" with probability one half on everything it is shown, a coin flip. That is not the critic failing; it is the critic reporting, correctly, that there is no longer any difference between real and generated samples to find. When the critic is stuck at one half everywhere, the generator has matched the data. (The flat-one-half reading is the balanced-game case the value function in eq (1) builds in, real and fake shown one-to-one, and that even one-to-one mix sets the meeting point at $\tfrac12$ .)

The figure makes the landscape literal. Slide the generator's distribution across the data and watch $C(G)$ trace out a bowl. The floor is $-\log 4$ , touched only when the two distributions coincide. The height of the ball above that floor is exactly $2\cdot\mathrm{JSD}$ , the only part of the objective the generator can still push down.

Figure 3 · the global optimum

With the optimal discriminator plugged in, the generator's objective is $C(G) = -\log 4 + 2\,\mathrm{JSD}$, a bowl over the space of distributions. The floor is $-\log 4 \approx -1.386$, reached only at $p_g = p_{\text{data}}$. The gap from the ball down to the floor is the divergence still left to remove. (The upper reaches, where the curve flattens toward 0, are a derived corollary of $\mathrm{JSD} \le \log 2$, not something the paper states.)

Everything in that derivation lives in the space of distributions, and that is the difference between the paper's theory and what you can actually run. The proof assumes $D$ can be made perfectly optimal at every step and that $p_g$ can be moved freely. In that idealized, infinite-capacity setting, the paper proves the objective is convex in $p_g$ with a single global optimum and that $p_g$ converges to $p_{\text{data}}$ . The moment you replace "move $p_g$ freely" with "take a gradient step on the weights of an MLP (a plain feed-forward multi-layer perceptron)," the convexity is gone, the guarantees evaporate, and, in the paper's own words, the network "introduces multiple critical points in parameter space." The clean story is true in function space and only hoped for in practice.

The loop, and the loss they really use

How do you actually play the game? You cannot solve the inner $\max_D$ to completion at every step (that would be its own full optimization), so you alternate. Take $k$ gradient steps to improve the discriminator, then one step to improve the generator, and repeat. As long as the generator changes slowly, the discriminator stays near its optimum and the gradient the generator receives is close to the one our theory described. The paper uses $k = 1$ , the cheapest option, and momentum. The figure below plays the smallest version of that game; watch what the update pattern does around a saddle.

Figure 4 · orbiting the saddle

Controls: critic steps $k$ (shown at 1) and step size $\eta$ (shown at 0.12), over 240 steps.

A toy two-parameter game, not a GAN: $x$ descends and $y$ ascends $V(x,y) = xy - 0.03\,y^2$, where the tiny $y^2$ term gives the critic a finite optimum to track. From the same start, simultaneous steps spiral away from the equilibrium; alternating steps with $k=1$ circle it for a long time, and more critic steps per round keep $y$ near its best response and tighten the spiral. On the pure bilinear game $xy$ no choice of $k$ converges: alternating steps orbit forever, so the taming needs the critic to have an optimum to stay near.
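The toy game is small enough to simulate in a few lines. The sketch below is one possible reading of it, not the figure's actual code: the starting point $(1, 1)$ and the function name `simulate` are assumptions, while $\eta = 0.12$, 240 steps, and the $0.03\,y^2$ term follow the figure.

```python
import math

def simulate(mode, k=1, eta=0.12, c=0.03, steps=240, start=(1.0, 1.0)):
    """Gradient play on V(x, y) = x*y - c*y**2; x descends V, y ascends V.

    mode="simultaneous": both players step from the same old point.
    mode="alternating":  y takes k ascent steps, then x takes one descent step.
    Returns the final distance from the equilibrium at (0, 0).
    """
    x, y = start
    for _ in range(steps):
        if mode == "simultaneous":
            grad_x = y                    # dV/dx
            grad_y = x - 2.0 * c * y      # dV/dy
            x, y = x - eta * grad_x, y + eta * grad_y
        else:
            for _ in range(k):
                y += eta * (x - 2.0 * c * y)
            x -= eta * y                  # uses the freshly updated y
    return math.hypot(x, y)

start_radius = math.hypot(1.0, 1.0)
print("start              ", start_radius)
print("simultaneous       ", simulate("simultaneous"))
print("alternating, k=1   ", simulate("alternating", k=1))
print("alternating, k=5   ", simulate("alternating", k=5))
print("pure bilinear, k=1 ", simulate("alternating", k=1, c=0.0))
```

Comparing each final radius with the starting radius shows the three behaviors: growth under simultaneous steps, slow shrinkage under alternating steps that speeds up with larger $k$, and a radius that stays close to where it began on the pure bilinear game.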

The discriminator step is exactly what eq (1) says: push $D(\mathbf{x})$ up on real data and $D(G(\mathbf{z}))$ down on fakes. The generator step contains a subtlety. Eq (1) tells the generator to minimize $\log(1 - D(G(\mathbf{z})))$ . Early in training that loss barely works, for a concrete reason. When the generator is bad, the discriminator spots its fakes with high confidence, so $D(G(\mathbf{z}))$ sits near 0. Right there, the loss surface is flat: for a discriminator that ends in a sigmoid, $D = \sigma(a)$ , the slope of $\log(1 - D)$ with respect to the pre-sigmoid output $a$ is $-D$ , which vanishes as $D \to 0$ . So the gradient handed back to the generator is tiny. The generator is losing badly with no signal about how to do better. The loss has saturated.

The paper makes a small change with a large effect. Instead of having the generator minimize $\log(1 - D(G(\mathbf{z})))$ , have it maximize $\log D(G(\mathbf{z}))$ :

\text{minimize } \log\big(1 - D(G(\mathbf{z}))\big) \quad\longrightarrow\quad \text{maximize } \log D(G(\mathbf{z}))

Both losses reward the same thing, a fake that $D$ calls real, and both share the same equilibrium. The two losses differ in the gradient they supply when the generator is losing. The figure plots the generator's cost under each version against $D(G(\mathbf{z}))$ . On the left, where bad early samples live, the original minimax cost (amber) is nearly flat, while the swapped cost (teal) is steep. The ratio of their gradients is $(1-D)/D$ , which blows up as $D \to 0$ : exactly where the generator's gradient is weakest under the old loss, the new loss provides one many times larger.

Figure 5 · the non-saturating swap

The generator's cost versus $D(G(z))$, the critic's verdict on a fake. Early on $D(G(z)) \approx 0$ (far left): the minimax cost $\log(1 - D)$ is flat there, a vanishing gradient, while the non-saturating cost $-\log D$ is near-vertical, a strong gradient. Drag the marker to compare slopes. Same fixed point, very different push when the generator is losing.
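The difference in push can be measured directly. The sketch below assumes a discriminator whose output is a sigmoid of a logit $a$ (a common choice, not something the text fixes) and uses finite differences in place of an autograd library to compare the two generator costs as $D(G(z))$ shrinks.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_cost(a):
    """Minimax generator cost log(1 - D), with D = sigmoid(a)."""
    return math.log(1.0 - sigmoid(a))

def non_saturating_cost(a):
    """Swapped generator cost -log(D), with D = sigmoid(a)."""
    return -math.log(sigmoid(a))

def slope(f, a, h=1e-5):
    """Central finite difference, used here instead of an autograd library."""
    return (f(a + h) - f(a - h)) / (2.0 * h)

def logit(d):
    return math.log(d / (1.0 - d))

for d in [0.5, 0.1, 0.01, 0.001]:
    a = logit(d)
    g_sat = slope(saturating_cost, a)        # analytically -D
    g_ns = slope(non_saturating_cost, a)     # analytically -(1 - D)
    print(f"D={d:<6} saturating {g_sat:+.5f}  non-saturating {g_ns:+.5f}  ratio {g_ns / g_sat:8.1f}")
```

Both slopes have the same sign, so both costs push the logit the same way; the ratio column tracks $(1-D)/D$ and grows without bound as the critic becomes more confident.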

This swap is the version everyone trains, and it comes with an asterisk that must be stated plainly. The tidy "GANs minimize the Jensen-Shannon divergence" result was derived for the original minimax loss against a perfect discriminator. The non-saturating loss has the same fixed point but a different gradient, and later analysis (Arjovsky and Bottou, 2017) showed that this gradient corresponds to a different objective with the flavor of the reverse KL divergence, $\mathrm{KL}(p_g\,\|\,p_{\text{data}})$ rather than $\mathrm{KL}(p_{\text{data}}\,\|\,p_g)$ . That direction rewards piling all mass on a few modes, and it is not a pure descent on the Jensen-Shannon divergence. So the Jensen-Shannon result is the right intuition for where the game ends, but it is not a literal description of the loss in the training loop.

The full training step, in code, is short:

# one GAN training step (Algorithm 1, with k = 1), PyTorch-style
# assumes: import torch; G and D are nn.Modules, D ends in a sigmoid,
# and opt_d / opt_g are optimizers over D's / G's parameters only
x = sample_real(batch)             # real data,  x ~ p_data
z = sample_noise(batch)            # latent prior, z ~ p_z

# 1. discriminator: learn to tell real from fake (ascend)
opt_d.zero_grad()                  # also clears D-grads left by the last generator step
fake = G(z).detach()               # block the generator's gradient
loss_d = -(torch.log(D(x)) + torch.log(1 - D(fake))).mean()
loss_d.backward(); opt_d.step()

# 2. generator: learn to fool D (descend the NON-saturating loss)
opt_g.zero_grad()
z = sample_noise(batch)
loss_g = -torch.log(D(G(z))).mean()   # i.e. maximize log D(G(z))
loss_g.backward(); opt_g.step()    # gradient flows D -> G; opt_g updates only G

Two details to read off it. The detach() on the fake during the discriminator step stops the generator from being updated there; we only want $D$ to learn from that line. And in the generator step the gradient flows backward through the frozen discriminator into $G$ : the discriminator is the channel through which the data's influence reaches the generator. The generator never sees a real image. It only ever sees the direction the critic says would make its fake more convincing.

Watching it learn, and collapse

When it works, training looks like the generator slowly spreading out to cover the data. Take the real distribution to be a handful of separated clusters, the modes. A healthy generator starts as a shapeless blob and, batch by batch, fans out until its samples land on every cluster in the right proportion. The left regime in the figure below is that happy path.

But the same game has a failure mode built into its incentives, and the paper named it before anyone had a cure. The generator's job is to produce samples the discriminator calls real. Nothing in that sentence demands variety. If the generator finds one output that reliably fools the current discriminator, it can profit by producing that one output over and over, mapping many different $\mathbf{z}$ values to nearly the same $\mathbf{x}$ . A supervised loss differs here: every input carries its own target, so the model is penalized whenever it misses any mode of the data. Here the only grade is $D$ 's opinion of the samples $G$ actually produces. The generator's loss is never evaluated on the modes $G$ skipped, so nothing in the game charges $G$ for skipping them. The paper calls this the "Helvetica scenario" (the modern name is mode collapse): $G$ collapses too many values of $\mathbf{z}$ onto too few values of $\mathbf{x}$ to have the diversity needed to model $p_{\text{data}}$ . Each batch can look perfectly real and yet the generator only ever visits one corner of the data. Toggle the figure to the collapse regime to see it: the samples cluster on a single mode and hop between modes instead of covering them.

Figure 6 · coverage, and collapse

Real data is eight modes on a ring; teal dots are the generator's samples. Healthy: they fan out from a blob to cover all eight. Mode collapse (the Helvetica scenario): they pile onto one mode and hop, fooling the critic batch by batch while throwing away the diversity of $p_{\text{data}}$. An illustration of the two regimes, not a live GAN.
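One simple way to put a number on the two regimes is to count how many modes receive at least one sample. The sketch below is an illustrative diagnostic, not a metric from the paper: the ring radius, the tolerance, and both sample sets are made up by hand rather than drawn from a trained generator.

```python
import math

def ring_modes(n_modes=8, radius=2.0):
    """Centers of n_modes clusters evenly spaced on a circle."""
    return [(radius * math.cos(2.0 * math.pi * k / n_modes),
             radius * math.sin(2.0 * math.pi * k / n_modes)) for k in range(n_modes)]

def modes_covered(samples, modes, tol=0.3):
    """Count modes that have at least one sample within distance tol."""
    covered = 0
    for mx, my in modes:
        if any(math.hypot(sx - mx, sy - my) <= tol for sx, sy in samples):
            covered += 1
    return covered

modes = ring_modes()

# Hand-built stand-ins for generator output (not produced by a trained GAN).
healthy = [(mx + 0.05, my - 0.05) for mx, my in modes for _ in range(4)]
collapsed = [(modes[0][0] + 0.01 * i, modes[0][1]) for i in range(32)]

print("healthy covers  ", modes_covered(healthy, modes), "of", len(modes))
print("collapsed covers", modes_covered(collapsed, modes), "of", len(modes))
```

Both sample sets contain 32 points that each sit close to a real mode, so a per-sample realism check would pass either one; only the coverage count separates them.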

Mode collapse is one of three distinct ways GAN training goes wrong, and they need to be kept separate. Here the "saddle point" framing does its real work. A minimax solution is not the bottom of a valley, where every direction is uphill; it is a saddle, a minimum along the generator's axes and a maximum along the discriminator's. Simultaneous gradient steps on a saddle do not have to converge. They can orbit it forever, the way descending one player while ascending the other traces a circle around the center of $V(x,y) = xy$ . That orbiting is the second failure, oscillation. The third is the vanishing gradient described above: let the discriminator get too good, too fast, and it saturates the generator's loss and starves it of signal. No single trick fixes all three, which is why a decade of follow-up work went into taming this game.

Does it actually work?

In 2014, with no agreed way to score a likelihood-free model, the paper fit a Gaussian Parzen window to the generator's samples, a kernel density estimate: drop a small Gaussian on each generated sample, sum them into a makeshift density, and report the log-likelihood of the real test set under that. The numbers are in Table 1.

Figure 7 · the paper's numbers

Parzen-window log-likelihood (higher is better) for four models on two datasets, with error bars. The adversarial net tops MNIST, but on TFD the Stacked CAE leads and the gap to GAN is within noise. The paper makes "no claim that these samples are better." The metric itself is weak.

On MNIST the adversarial net posts the best score, 225 against the next-best 214. On the Toronto Face Database it comes second, 2057 behind the Stacked Contractive Autoencoder's 2110, and that gap is within the error bars (the leader does separate from the weaker DBN and Deep GSN). This is "competitive," not "dominant." Nor is the metric to be trusted: a Parzen-window estimate is a poor stand-in for a likelihood the GAN cannot provide, and later work (Theis, van den Oord, and Bethge, 2015) showed these estimates can rank models in the wrong order and that, in high dimensions, a model can score well while producing nonsense or score badly while producing excellent images. What the paper's experiments really establish is that the samples looked good and were demonstrably not memorized copies of the training set (the paper shows each sample beside its nearest training neighbor), and that the field would need years to invent evaluations it could trust. The qualitative figures, blurry by today's standards, were the real evidence.
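For concreteness, the Parzen-window score can be sketched in one dimension. This is a minimal illustration under stated assumptions: the sample values and the function name are invented, the data are scalars rather than image vectors, and the bandwidth `sigma` is simply swept by hand (the paper chose it by cross-validation on a validation set).

```python
import math

def parzen_log_likelihood(test_points, generated, sigma):
    """Mean log-density of test_points under a Gaussian KDE built on generated samples (1-D)."""
    log_norm = -math.log(len(generated)) - 0.5 * math.log(2.0 * math.pi * sigma ** 2)
    total = 0.0
    for x in test_points:
        exponents = [-0.5 * ((x - s) / sigma) ** 2 for s in generated]
        top = max(exponents)                      # log-sum-exp for numerical stability
        total += top + math.log(sum(math.exp(e - top) for e in exponents)) + log_norm
    return total / len(test_points)

# Hypothetical 1-D samples; the paper applies the same idea to image vectors.
generated = [-1.1, -0.9, 0.0, 0.9, 1.1]
test_points = [-1.0, 0.1, 1.0]
for sigma in [0.05, 0.2, 1.0]:
    print(sigma, parzen_log_likelihood(test_points, generated, sigma))
```

The score moves with `sigma` even though the samples never change, which hints at why the estimate is a fragile basis for ranking models.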

What was missing, and what came after

The paper lists its own limits, and the list reads like a map of the next decade of research. Two disadvantages are called out directly. First, there is no explicit representation of $p_g(\mathbf{x})$ : you can sample from a GAN, but you cannot ask it how probable a given image is, which rules out anything that needs a likelihood. Second, $D$ and $G$ must be kept in careful balance through training, the synchronization problem that produces mode collapse and instability.

Set against those costs are real advantages, and they are what made GANs matter. No Markov chains are ever needed, for training or for sampling. Generating a sample is a single forward pass, fast and exact, with no chain to mix. Only backpropagation is used to get gradients, so the model rides the same hardware and tooling as everything else in deep learning. And because the generator is shaped only by gradients passed back through the discriminator and never fit to pixels directly, it can produce very sharp, even degenerate distributions, where methods that rely on a blurring Markov chain cannot.

What followed turned the demonstration into a field. DCGAN (2015) found the convolutional architecture and training recipe that made image GANs stable enough to be useful. Wasserstein GAN (2017) traded the implicit Jensen-Shannon objective for the Earth-Mover distance, whose gradient does not vanish when the distributions barely overlap, directly attacking the saturation problem. Progressive growing and StyleGAN (2018-2019) pushed the same game to photoreal faces at high resolution. By around 2021 a different family, diffusion models, overtook GANs on many image-generation benchmarks by trading the unstable two-player game for a stable regression objective. GANs did not vanish (they remain fast and competitive in places), but the throne moved.

What the paper showed is that you can fit a generative model without ever writing down its density. You pit it against a classifier whose best move is to report a density ratio, and fighting that classifier amounts to minimizing a real divergence to the data, all of it trained with nothing but backpropagation. The fakes were blurry and the metric was weak, but every modern adversarial method, and a good deal of the instability folklore around them, traces back to that one arrangement.

Provenance · Verified against primary literature

Goodfellow et al. (2014): The minimax value function, optimal D, and the −log 4 / Jensen-Shannon result (eqs 1–6).

Goodfellow (2016 tutorial): The zero-sum / Nash-equilibrium / saddle-point framing and the V = xy orbit.

Arjovsky & Bottou (2017): The non-saturating loss is reverse-KL-like, not a literal Jensen-Shannon descent.

Theis et al. (2015): Parzen-window log-likelihood is a weak, often misleading proxy for sample quality.

Correction: The 2014 paper never says "Nash equilibrium," "saddle point," or "mode collapse." It says "minimax game" and "the Helvetica scenario." We use the modern terms for intuition and attribute them here. And "GANs minimize the Jensen-Shannon divergence" holds only for the original saturating loss at an optimal discriminator, not for the non-saturating loss everyone actually trains.

Questions you might still have

If the generator never sees a real image, how does it learn?
It only ever receives gradients passed back through the discriminator. The discriminator looks at real data; the generator is told how to nudge its samples so the discriminator rates them higher. The data reaches the generator secondhand, as a direction, never as a target to copy.

Does GAN training really minimize the Jensen-Shannon divergence?
Only in the idealized case: the original minimax loss, with a perfectly optimal discriminator at every step. The non-saturating loss everyone actually uses shares the same fixed point but follows a different gradient (closer to a reverse KL), which is part of why GANs chase modes and can drop them.

Why are GANs so notoriously unstable?
The solution is a saddle point of the value function, not the bottom of a valley, so simultaneous gradient steps can orbit it instead of settling. Separately, if the discriminator gets too good its gradient to the generator vanishes, and the incentives also permit mode collapse. Three distinct failure modes, no single fix.

If a GAN has no likelihood, how do you judge it?
Mostly not by numbers. The paper’s Parzen-window scores are a weak proxy, and the field later judged GANs by sample quality (metrics like FID, the Fréchet Inception Distance, which compares the statistics of generated and real images in a feature space) and by what they enabled. A GAN can sample beautifully and still be unable to tell you how probable any particular image is.

Footnotes & further reading

The paper: Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio, Generative Adversarial Nets (Université de Montréal, NeurIPS 2014). Original code.
The zero-sum / Nash / saddle-point framing, the $V(x,y)=xy$ orbit, and the "not motivated by a theoretical concern" note on the non-saturating loss: Goodfellow, NIPS 2016 Tutorial: Generative Adversarial Networks.
Why the non-saturating loss is reverse-KL-like and the source of the instability: Arjovsky & Bottou, Towards Principled Methods for Training GANs (2017), and Arjovsky, Chintala & Bottou, Wasserstein GAN.
When GAN training provably orbits rather than converges (the manifold / not-absolutely-continuous case): Mescheder, Geiger & Nowozin, Which Training Methods for GANs do actually Converge? (ICML 2018).
Why the Parzen-window numbers should not be trusted: Theis, van den Oord & Bethge, A note on the evaluation of generative models (2015).
The lineage: DCGAN (2015), Conditional GANs (2014), StyleGAN (2018), and Diffusion Models Beat GANs on Image Synthesis (2021).
