Verified · arXiv:1406.2661 · 32 min
Generative models · Theory

Generative Adversarial Networks

Two networks play a forgery game until the fakes pass.

One network paints fakes, the other plays art critic, and the only feedback the forger ever gets is the critic's opinion. Push that game to its conclusion and the fakes become samples from the real data distribution. The whole thing trains with plain backpropagation and no likelihood anywhere in sight.

Explaining the paper: Generative Adversarial Nets · Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio · Université de Montréal · NeurIPS 2014 · arXiv:1406.2661

How do you train a network to draw faces when you can never write down what makes a face likely?

Generative modeling hit a wall for years. Say you want a model that produces new photographs of faces. The honest way to fit such a model is maximum likelihood: write down a probability density $p_g(\mathbf{x})$ over images, then nudge its parameters until real faces score high under it. The trouble is the density. For any model rich enough to capture faces, computing $p_g(\mathbf{x})$ means an integral over every possible configuration of the model's hidden variables, and that integral is intractable. The methods of the day fought this with Markov chains and approximate inference, machinery that was slow, fragile, and hard to scale.

Generative Adversarial Nets, from Ian Goodfellow and colleagues at Montréal in 2014, sidesteps the density entirely. The pitch: do not estimate any probabilities. Instead, set up a game between two networks and let the game do the estimating for you. A generator tries to manufacture samples that pass for real. A discriminator tries to catch the fakes. Train them against each other and, at the game's equilibrium, the generator is producing exactly the real data distribution and the discriminator cannot do better than a coin flip. No Markov chains, no inference network, no explicit likelihood. Just backpropagation and a forward pass to sample.

The argument that this works is short, a calculus exercise: what the two networks are, why the best possible discriminator is a ratio of densities, what quantity the generator is secretly driving down when it fights that discriminator, and why the obvious training loss had to be quietly swapped for a different one. Throughout, the claims the 2014 paper actually proves stay separate from the field's later, sharper understanding of why GANs are so hard to train.

Two networks, one forgery game

The paper's own analogy is a team of counterfeiters against the police. The counterfeiters print fake currency and try to spend it without getting caught. The police try to tell the counterfeits from the genuine bills. Each side pressures the other to improve, and the equilibrium is reached when the fakes are indistinguishable from real money, at which point the police can only guess.

Make that precise. The generator $G$ is a network that takes a random vector $\mathbf{z}$ drawn from a simple, fixed prior $p_{\mathbf{z}}$ (the paper uses a uniform distribution; a Gaussian is the common modern choice) and maps it to a sample $\mathbf{x} = G(\mathbf{z})$ in data space. The discriminator $D$ takes an $\mathbf{x}$ and returns a single number $D(\mathbf{x}) \in (0,1)$, its estimate of the probability that $\mathbf{x}$ is a real data point rather than one of the generator's fakes.

They are wired into a single objective, one that $D$ wants to make large and $G$ wants to make small:

$$\min_G \max_D \; V(D,G) = \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}\big[\log D(\mathbf{x})\big] + \mathbb{E}_{\mathbf{z}\sim p_{\mathbf{z}}}\big[\log\big(1 - D(G(\mathbf{z}))\big)\big] \tag{1}$$

Read it one term at a time. The first term rewards $D$ for assigning a high probability $D(\mathbf{x})$ to real data. The second rewards $D$ for assigning a low probability $D(G(\mathbf{z}))$ to fakes, since $1 - D$ is large there. So the discriminator wants $V$ as large as it can get. The generator only appears inside the second term, and it wants the opposite: it wants $D(G(\mathbf{z}))$ close to 1, so that $\log(1 - D(G(\mathbf{z})))$ heads toward $-\infty$ and $V$ shrinks. One number, pulled in two directions. That is the entire setup.
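To make the tug-of-war concrete, here is a minimal sketch that estimates the value function from batches of discriminator outputs. The function name `value_fn` and the numbers fed to it are made up for illustration; nothing here comes from the paper's code.

```python
import math

def value_fn(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs.

    d_real: D(x) on a batch of real samples, each in (0, 1)
    d_fake: D(G(z)) on a batch of generated samples, each in (0, 1)
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical discriminator outputs, invented for this example.
confident_d = value_fn(d_real=[0.9, 0.95, 0.9], d_fake=[0.1, 0.05, 0.1])
fooled_d = value_fn(d_real=[0.9, 0.95, 0.9], d_fake=[0.9, 0.95, 0.9])
coin_flip = value_fn(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

print(confident_d)  # near 0: the discriminator is winning
print(fooled_d)     # strongly negative: the generator is winning
print(coin_flip)    # equals -log 4
```

The same number moves up when the critic is right and down when the fakes pass, which is all the objective encodes.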

A name for this kind of objective: a two-player minimax game. The outer $\min_G$ and inner $\max_D$ say the generator is choosing its move anticipating the discriminator's best response. The 2014 paper stops there and frames the solution as the global minimum of a criterion we are about to derive. The vocabulary you will hear today, that the solution is a Nash equilibrium and a saddle point of $V$, came later (it is correct, and this page uses it, but those words are not in the original paper). The distinction matters more than it sounds, and it is the root of why GANs are temperamental, a thread that comes back when we watch training collapse.

Notice what is absent. Nowhere does anything evaluate $p_g(\mathbf{x})$, the probability the generator assigns to an image. The generator is never asked "how likely is this face under you?" It is only ever asked "can you fool the critic?" That is the trick that dodges the intractable integral. The price, which we will pay in full at the end, is that the model has no likelihood to report.

The generator turns noise into data

Before the game, get a feel for the generator on its own, because it is doing something subtle. It is a deterministic function. Feed it a fixed $\mathbf{z}$ and you always get the same $\mathbf{x}$. The randomness in the output comes entirely from the randomness you feed in. So $G$ takes a simple distribution, a uniform blob of noise, and reshapes it into something complicated, the distribution of faces. The resulting $p_g$ is the pushforward of the prior through $G$: whatever distribution you get by drawing $\mathbf{z}$ and applying $G$. The paper's own word for it is that $G$ "implicitly defines" $p_g$. It is never written down.

How does a smooth map turn flat noise into a lumpy distribution? By stretching some regions and squeezing others. Where $G$ is steep, a small interval of $\mathbf{z}$ gets spread across a wide interval of $\mathbf{x}$, so the output probability there is thin. Where $G$ is nearly flat, a wide interval of $\mathbf{z}$ gets crushed into a narrow interval of $\mathbf{x}$, so probability piles up. The paper's Figure 1 puts it exactly: $G$ contracts in regions of high density and expands in regions of low density. In one dimension, for a monotone $G$, this is the change-of-variables rule (the same Jacobian bookkeeping a normalizing flow uses, except a GAN never has to compute it):

$$p_g(x) = \frac{p_{\mathbf{z}}(z)}{|G'(z)|}, \qquad x = G(z)$$
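The rule can be checked numerically on a toy warp. The sketch below assumes a hypothetical generator $G(z) = z^2$ on a uniform prior over $[0, 1]$ (chosen only because its density has a closed form); it pushes an even grid of latent points through $G$ and compares the mass that lands in an interval with the mass the formula predicts.

```python
import math

def G(z):
    # Hypothetical generator for illustration: a monotone warp of [0, 1].
    return z * z

def p_g(x):
    # Change of variables with a uniform prior p_z = 1 on [0, 1]:
    # z = sqrt(x), G'(z) = 2z, so p_g(x) = 1 / (2 * sqrt(x)).
    return 1.0 / (2.0 * math.sqrt(x))

def empirical_mass(lo, hi, n=100_000):
    # Push an even grid of latent points through G; count the fraction landing in [lo, hi).
    hits = sum(1 for i in range(n) if lo <= G((i + 0.5) / n) < hi)
    return hits / n

def predicted_mass(lo, hi, n=10_000):
    # Midpoint-rule integral of the change-of-variables density over [lo, hi).
    width = (hi - lo) / n
    return sum(p_g(lo + (i + 0.5) * width) for i in range(n)) * width

flat_region = (0.01, 0.11)   # G is nearly flat here, so density piles up
steep_region = (0.80, 0.90)  # G is steep here, so density is thin

for lo, hi in (flat_region, steep_region):
    print((lo, hi), empirical_mass(lo, hi), predicted_mass(lo, hi))
```

Both intervals have the same width in $x$, yet the one under the flat part of the warp collects several times more mass.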

Drag the slider below. At the start $G$ is the straight-line identity, so uniform noise produces a uniform output and the threads stay parallel. As it "trains," $G$ bends into the shape that maps uniform $\mathbf{z}$ onto the two-bump target: evenly spaced noise samples bunch together under the modes, where the curve goes flat and density should be high, and spread apart across the empty middle, where the curve is steep and density should be low.

Figure 1 · noise pushed into shape
The generator is one deterministic warp $x = G(z)$. Evenly spaced latent samples $z$ (bottom) ride up to where they land in $x$ (top). Train $G$ and the even ticks squeeze together under the two modes of $p_{\text{data}}$ (contract, density piles up) and stretch apart in the gap (expand, density thins), until the teal $p_g$ matches the amber target. The density is never written down; it is whatever falls out of pushing noise through $G$.

That is the whole job of the generator: find a warp of noise whose pushforward is the data. The question is how it could ever learn the right warp without anyone telling it what the data density is. The answer is the discriminator.

The discriminator is a density ratio

Freeze the generator for a moment, so $p_g$ is some fixed distribution, and ask: what is the best possible discriminator? This has a clean answer, and it is the hinge the whole paper turns on.

Write the value function as an integral over data space. Real points are drawn with density $p_{\text{data}}(x)$ and fakes with density $p_g(x)$, so

$$V(D,G) = \int_x \Big[\, p_{\text{data}}(x)\,\log D(x) + p_g(x)\,\log\big(1 - D(x)\big) \,\Big]\,dx \tag{3}$$

(The expectation over $\mathbf{z}$ turned into an integral over $x$ because $p_g$ is defined as the pushforward; no Jacobian goes missing.) Now look at the integrand at a single point $x$. With $a = p_{\text{data}}(x)$ and $b = p_g(x)$ held fixed, we are choosing the number $D(x)$ to maximize $a\log D + b\log(1-D)$. That is a one-variable calculus problem. Before doing the calculus, guess the shape of the answer: at this one point the discriminator is just choosing one number, the probability it reports that $x$ is real, and the best possible choice can only depend on how much real density and how much fake density sit there, so expect a fraction built from $a$ and $b$. Set the derivative to zero, $a/D - b/(1-D) = 0$, and out drops

$$D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)} \tag{2}$$

and the second derivative is negative, so it is a genuine maximum. Because each point $x$ can be optimized on its own, this pointwise solution is the optimal discriminator everywhere at once. Read it as what it is: at any point, the best critic reports the fraction of the probability there that comes from real data. Where only real data lives, $D^* = 1$. Where only fakes live, $D^* = 0$. Where the two distributions are equally dense, $D^* = \tfrac12$, a shrug.

Ground that at one point. Pick a location $x$ where real data is four times as dense as the generator's output, so of the probability sitting there, 80% comes from $p_{\text{data}}$ and 20% from $p_g$. The optimal discriminator at that point reports the real fraction directly, $D^*(x) = 0.8/(0.8 + 0.2) = 0.8$: shown a sample landing there, the best critic is right to call it real four times out of five. Now let the generator improve until it matches the data, so at that same point both densities are equal, a 50/50 split. The fraction becomes $0.5/(0.5 + 0.5) = 0.5$ and the critic is forced to a coin flip, not because it got worse but because real and fake are now genuinely indistinguishable there. The discriminator's output reads off the local mix; driving it to one half everywhere is the generator erasing the mix.
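The same two cases can be checked without any calculus. This is a small sketch, with hypothetical helper names, that brute-forces the best value of $D$ at a single point and compares it with the closed form.

```python
import math

def d_star(p_data, p_g):
    # Closed-form optimum from eq (2): the real fraction of the local density.
    return p_data / (p_data + p_g)

def pointwise_objective(d, a, b):
    # The integrand of eq (3) at one point, with a = p_data(x) and b = p_g(x).
    return a * math.log(d) + b * math.log(1.0 - d)

def grid_argmax(a, b, steps=9_999):
    # Brute-force search over D in (0, 1), as a check on the calculus.
    candidates = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(candidates, key=lambda d: pointwise_objective(d, a, b))

print(d_star(0.8, 0.2), grid_argmax(0.8, 0.2))  # the 80/20 point
print(d_star(0.5, 0.5), grid_argmax(0.5, 0.5))  # the 50/50 point
```

The grid search and the formula agree to within the grid spacing, which is what the zero-derivative argument predicts.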

This is a density ratio in disguise. Rearranged, it is the exact likelihood ratio between real and fake:

$$\frac{D^*(x)}{1 - D^*(x)} = \frac{p_{\text{data}}(x)}{p_g(x)}$$

The discriminator never learns either density, but the two of them in ratio fall out of it for free. (It equals the Bayesian posterior "probability this is real" only because the game shows real and fake in a one-to-one mix, which is baked into eq 1.)
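As a sketch of the ratio falling out for free, assume two hypothetical 1-D Gaussian densities (known here only so the answer can be verified). The odds reported by the optimal discriminator reproduce the density ratio at every test point.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical 1-D setup: both densities are known only so the result can be checked.
def p_data(x):
    return gaussian_pdf(x, mean=0.0, std=1.0)

def p_g(x):
    return gaussian_pdf(x, mean=1.5, std=1.0)

def optimal_discriminator(x):
    return p_data(x) / (p_data(x) + p_g(x))

def ratio_from_discriminator(d):
    # Odds of "real" under D* equal the density ratio p_data / p_g.
    return d / (1.0 - d)

for x in (-1.0, 0.75, 2.0):
    d = optimal_discriminator(x)
    print(x, d, ratio_from_discriminator(d), p_data(x) / p_g(x))
```

At the midpoint between the two means, $x = 0.75$, the densities are equal, so the critic reports one half and the recovered ratio is 1.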

The figure below is the picture to hold onto. The amber curve is a fixed data distribution. The teal curve is the generator's, and the slider walks it from far off toward sitting exactly on top of the data. The violet curve is $D^*$ computed from eq (2). Watch what happens to it. While the two distributions are separated, $D^*$ swings confidently between 1 and 0, telling real from fake with ease. As the generator closes the gap, the violet curve sags. When $p_g = p_{\text{data}}$, it is pinned flat at $\tfrac12$: the perfect classifier has been reduced to a coin flip, because there is genuinely nothing left to discriminate.

Figure 2 · the optimal discriminator
For a frozen generator, the best discriminator is the density ratio $D^* = p_{\text{data}}/(p_{\text{data}} + p_g)$ (violet). It rides toward 1 where real data dominates and toward 0 where the fakes do. Drive $p_g$ onto $p_{\text{data}}$ and $D^*$ collapses onto the dashed $\tfrac12$ line. The readout is the game's value $C(G)$, which bottoms out at $-\log 4$ exactly when the two distributions coincide.

So the discriminator, at its best, is a meter for how far apart the two distributions are. When it is helpless, they match. That is suggestive. If the generator could somehow read that meter and act to drive it to $\tfrac12$ everywhere, it would be driving $p_g$ onto $p_{\text{data}}$. The next step makes that intuition exact.

What the generator really minimizes

Substitute the optimal discriminator $D^*$ back into the value function. The generator is now playing against a perfect critic, so its remaining objective is a function of $G$ alone, which the paper calls the virtual training criterion $C(G)$:

$$C(G) = \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}\!\Big[\log \tfrac{p_{\text{data}}}{p_{\text{data}}+p_g}\Big] + \mathbb{E}_{\mathbf{x}\sim p_g}\!\Big[\log \tfrac{p_g}{p_{\text{data}}+p_g}\Big] \tag{4}$$

This looks like a mess of log-ratios, but it is a famous quantity wearing a disguise. Add and subtract a $\log 2$ inside each expectation (so each denominator becomes the average distribution $M = (p_{\text{data}} + p_g)/2$) and the whole thing reorganizes into two Kullback-Leibler divergences plus a constant:

$$C(G) = -\log 4 + \mathrm{KL}\!\Big(p_{\text{data}}\,\big\|\,\tfrac{p_{\text{data}}+p_g}{2}\Big) + \mathrm{KL}\!\Big(p_g\,\big\|\,\tfrac{p_{\text{data}}+p_g}{2}\Big) \tag{5}$$
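For readers who want the step from eq (4) to eq (5) spelled out, here is one way to write it, using $M$ for the average distribution and the identity $p_{\text{data}} + p_g = 2M$. This is a sketch of the algebra, not a quotation of the paper's proof.

```latex
\begin{aligned}
C(G)
  &= \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}\!\Big[\log \frac{p_{\text{data}}}{2M}\Big]
   + \mathbb{E}_{\mathbf{x}\sim p_g}\!\Big[\log \frac{p_g}{2M}\Big] \\
  &= \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}\!\Big[\log \frac{p_{\text{data}}}{M}\Big] - \log 2
   + \mathbb{E}_{\mathbf{x}\sim p_g}\!\Big[\log \frac{p_g}{M}\Big] - \log 2 \\
  &= -\log 4 + \mathrm{KL}\big(p_{\text{data}}\,\|\,M\big) + \mathrm{KL}\big(p_g\,\|\,M\big)
\end{aligned}
```

Each $-\log 2$ comes out of its expectation because it is a constant, and the two together give the $-\log 4$.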

A KL divergence $\mathrm{KL}(P\|Q)$ measures how far $P$ is from $Q$, and it is zero exactly when they are equal and positive otherwise. The two KL terms here, each comparing a distribution to the average of the two, are precisely the definition of the Jensen-Shannon divergence, a symmetric, well-behaved distance between distributions:

$$C(G) = -\log 4 + 2\cdot \mathrm{JSD}\big(p_{\text{data}}\,\|\,p_g\big) \tag{6}$$

Now the game has a one-line meaning. Against a perfect discriminator, the generator is minimizing the Jensen-Shannon divergence between its samples and the real data. The Jensen-Shannon divergence is never negative and is zero only when the two distributions are identical, so $C(G)$ bottoms out at its constant floor, $-\log 4 \approx -1.386$ (all logs here are natural), and it reaches that floor at exactly one place: $p_g = p_{\text{data}}$. At that point the optimal discriminator is $\tfrac12$ everywhere, which is what the flattened violet curve in the last figure was showing. The forgery game, played to completion, recovers the data distribution. That is the central theorem.

Two pieces of that are worth slowing down on, because they are where the intuition lives. A divergence is a number that measures how far one distribution sits from another: it is zero exactly when the two distributions are identical and grows as they pull apart, the way a distance does (it need not be symmetric, which is why "divergence" and not "distance"). The Jensen-Shannon divergence is the particular one that falls out of the algebra here. It is the symmetric, bounded cousin of KL: symmetric because swapping $p_{\text{data}}$ and $p_g$ leaves it unchanged, bounded because it never exceeds $\log 2$, so the whole objective $C(G) = -\log 4 + 2\,\mathrm{JSD}$ is trapped between $-\log 4$ at the bottom and $0$ at the top. Minimizing it pushes the two distributions together, and the divergence has nowhere lower to go than zero, which is reached only when they coincide.
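The identity and its two bounds are easy to verify on small discrete distributions. In this sketch the distributions are invented for illustration, and `criterion` evaluates eq (4) directly while `jsd` is computed from its definition.

```python
import math

def kl(p, q):
    # KL(p || q) for discrete distributions; terms with p_i = 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def criterion(p_data, p_g):
    # C(G) from eq (4): the value function with the optimal discriminator plugged in.
    total = 0.0
    for a, b in zip(p_data, p_g):
        if a > 0:
            total += a * math.log(a / (a + b))
        if b > 0:
            total += b * math.log(b / (a + b))
    return total

p_data = [0.1, 0.4, 0.4, 0.1]
cases = {
    "partial overlap": [0.4, 0.3, 0.2, 0.1],
    "matched": list(p_data),
}
for name, p_g in cases.items():
    print(name, criterion(p_data, p_g), -math.log(4) + 2 * jsd(p_data, p_g))

# Disjoint supports hit the ceiling: JSD = log 2, so C(G) = 0.
disjoint = criterion([0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5])
print(disjoint)
```

The two printed columns agree in every case, the matched case sits at $-\log 4$, and the disjoint case sits at 0.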

The coin-flip discriminator is the observable tell that this has happened. Plug $p_g = p_{\text{data}}$ into the optimal discriminator $D^* = p_{\text{data}}/(p_{\text{data}} + p_g)$ and the two densities in numerator and denominator are equal, so $D^*$ is forced to $\tfrac12$ at every point. The best possible critic, given infinite capacity and a perfectly trained classifier, can do no better than answer "real" with probability one half on everything it is shown, a coin flip. That is not the critic failing; it is the critic reporting, correctly, that there is no longer any difference between real and generated samples to find. A critic stuck at one half everywhere is the signal that the generator has matched the data. (The flat-one-half reading is the balanced-game case the value function in eq (1) builds in, real and fake shown one-to-one; that even mix is what makes $\tfrac12$ the meeting point.)

The figure makes the landscape literal. Slide the generator's distribution across the data and watch $C(G)$ trace out a bowl. The floor is $-\log 4$, touched only when the two distributions coincide. The height of the ball above that floor is exactly $2\cdot\mathrm{JSD}$, the only part of the objective the generator can still push down.

Figure 3 · the global optimum
With the optimal discriminator plugged in, the generator's objective is $C(G) = -\log 4 + 2\,\mathrm{JSD}$, a bowl over the space of distributions. The floor is $-\log 4 \approx -1.386$, reached only at $p_g = p_{\text{data}}$. The gap from the ball down to the floor is the divergence still left to remove. (The upper reaches, where the curve flattens toward 0, are a derived corollary of $\mathrm{JSD} \le \log 2$, not something the paper states.)

Everything in that derivation lives in the space of distributions, and that is the difference between the paper's theory and what you can actually run. The proof assumes $D$ can be made perfectly optimal at every step and that $p_g$ can be moved freely. In that idealized, infinite-capacity setting, the paper proves the criterion is convex in $p_g$ with a single global optimum, and that $p_g$ converges to that optimum, $p_{\text{data}}$. The moment you replace "move $p_g$ freely" with "take a gradient step on the weights of an MLP," the convexity is gone, the guarantees evaporate, and, in the paper's own words, the network "introduces multiple critical points in parameter space." The clean story is true in function space and only hoped for in practice.

The loop, and the loss they really use

How do you actually play the game? You cannot solve the inner $\max_D$ to completion at every step, since that would be its own full optimization, so you alternate. Take $k$ gradient steps to improve the discriminator, then one step to improve the generator, and repeat. As long as the generator changes slowly, the discriminator stays near its optimum and the gradient the generator receives is close to the one our theory described. The paper uses $k = 1$, the cheapest option, and momentum. The figure below plays the smallest version of that game; watch what the update pattern does around a saddle.

Figure 4 · orbiting the saddle
Settings shown: $k = 1$, $\eta = 0.12$, 240 steps.
A toy two-parameter game, not a GAN: $x$ descends and $y$ ascends $V(x,y) = xy - 0.03\,y^2$; the tiny $y^2$ term gives the critic a finite optimum to track. From the same start, simultaneous steps spiral away from the equilibrium; alternating steps with $k=1$ circle it for a long time, and more critic steps per round keep $y$ near its best response and tighten the spiral. On the pure bilinear game $xy$, no choice of $k$ spirals in: alternating steps orbit forever, so the taming needs the critic to have an optimum to stay near.
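The toy game is small enough to simulate directly. The sketch below uses the value function from the caption and the step size 0.12 and 240 steps from the figure readout; the starting point `(1.0, 1.0)` and the function names are assumptions made for this example.

```python
import math

C = 0.03  # curvature of the critic's term in V(x, y) = x*y - C*y**2

def grad_x(x, y):
    return y                # dV/dx

def grad_y(x, y):
    return x - 2 * C * y    # dV/dy

def simultaneous(x, y, lr, steps):
    # Both players step from the same (x, y): x descends V, y ascends V.
    for _ in range(steps):
        x, y = x - lr * grad_x(x, y), y + lr * grad_y(x, y)
    return x, y

def alternating(x, y, lr, steps, k=1):
    # k critic (y) ascent steps, then one generator (x) descent step on the updated y.
    for _ in range(steps):
        for _ in range(k):
            y = y + lr * grad_y(x, y)
        x = x - lr * grad_x(x, y)
    return x, y

start = (1.0, 1.0)  # assumed starting point
results = {
    "simultaneous": simultaneous(*start, lr=0.12, steps=240),
    "alternating k=1": alternating(*start, lr=0.12, steps=240, k=1),
    "alternating k=5": alternating(*start, lr=0.12, steps=240, k=5),
}
for name, (x, y) in results.items():
    print(name, math.hypot(x, y))  # distance from the equilibrium at (0, 0)
```

Under these settings the simultaneous run ends farther from the origin than it started, the alternating run with $k = 1$ ends somewhat closer, and $k = 5$ ends closer still, matching the ordering the caption describes.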

The discriminator step is exactly what eq (1) says: push $D(\mathbf{x})$ up on real data and $D(G(\mathbf{z}))$ down on fakes. The generator step is where a wrinkle hides. Eq (1) tells the generator to minimize $\log(1 - D(G(\mathbf{z})))$. Early in training that loss barely works, for a concrete reason. When the generator is bad, the discriminator spots its fakes with high confidence, so $D(G(\mathbf{z}))$ sits near 0. Right there, the loss surface is flat, so the gradient handed back to the generator is tiny. The generator is losing badly with no signal about how to do better. The loss has saturated.

The fix in the paper is a small change with a large effect. Instead of having the generator minimize $\log(1 - D(G(\mathbf{z})))$, have it maximize $\log D(G(\mathbf{z}))$:

$$\text{minimize } \log\big(1 - D(G(\mathbf{z}))\big) \quad\longrightarrow\quad \text{maximize } \log D(G(\mathbf{z}))$$

Both losses want the same thing, a fake that $D$ calls real, and both share the same equilibrium. The difference is the gradient when the generator is losing. The figure plots the generator's cost under each version against $D(G(\mathbf{z}))$. On the left, where bad early samples live, the original minimax cost (amber) is nearly flat, while the swapped cost (teal) is steep. The ratio of their gradients is $(1-D)/D$, which blows up as $D \to 0$: exactly when the generator most needs a push, the new loss gives it one many times larger.

Figure 5 · the non-saturating swap
The generator's cost versus $D(G(z))$, the critic's verdict on a fake. Early on $D(G(z)) \approx 0$ (far left): the minimax cost $\log(1 - D)$ is flat there, a vanishing gradient, while the non-saturating cost $-\log D$ is near-vertical, a strong gradient. Drag the marker to compare slopes. Same fixed point, very different push when the generator is losing.
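The slope comparison can be reproduced with finite differences. This sketch assumes the discriminator's output is a sigmoid of a logit `a` (the usual construction, but an assumption here) and measures each cost's slope with respect to that logit.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_cost(a):
    # Original minimax generator cost, log(1 - D), with D = sigmoid(a).
    return math.log(1.0 - sigmoid(a))

def non_saturating_cost(a):
    # Swapped generator cost, -log D.
    return -math.log(sigmoid(a))

def slope(f, a, h=1e-6):
    # Central finite difference with respect to the discriminator's logit a.
    return (f(a + h) - f(a - h)) / (2 * h)

def logit(d):
    return math.log(d / (1.0 - d))

for d in (0.01, 0.10, 0.50, 0.90):
    a = logit(d)  # the logit at which D(G(z)) equals d
    g_sat = slope(saturating_cost, a)
    g_ns = slope(non_saturating_cost, a)
    print(d, g_sat, g_ns, g_ns / g_sat, (1.0 - d) / d)
```

The last two columns agree: the measured ratio of slopes tracks $(1-D)/D$, which is large on the left of the figure and shrinks below 1 once the fakes start to pass.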

This swap is the version everyone trains, and it comes with an asterisk worth stating plainly. The tidy "GANs minimize the Jensen-Shannon divergence" result was derived for the original minimax loss against a perfect discriminator. The non-saturating loss has the same fixed point but a different gradient, and later analysis (Arjovsky and Bottou, 2017) showed that gradient corresponds to a different, reverse-KL-flavored objective, not a clean descent on the Jensen-Shannon divergence. So the theory you carry from the last section is the right intuition for where the game ends, but it is not a literal description of the loss in the training loop. Hold both pictures.

The whole step, in code, is short:

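# one GAN training step (Algorithm 1, with k = 1)
# Assumes G, D, opt_d, opt_g, sample_real, sample_noise and batch are defined
# elsewhere, and that D ends in a sigmoid so its output lies in (0, 1).
import torch

x = sample_real(batch)             # real data,  x ~ p_data
z = sample_noise(batch)            # latent prior, z ~ p_z

# 1. discriminator: learn to tell real from fake (ascend V by descending -V)
opt_d.zero_grad()                  # clear gradients left over from the generator step
fake = G(z).detach()               # block the generator's gradient
loss_d = -(torch.log(D(x)) + torch.log(1 - D(fake))).mean()
loss_d.backward(); opt_d.step()

# 2. generator: learn to fool D (descend the NON-saturating loss)
opt_g.zero_grad()                  # clear the generator's old gradients
z = sample_noise(batch)
loss_g = -torch.log(D(G(z))).mean()  # i.e. maximize log D(G(z))
loss_g.backward(); opt_g.step()      # gradient flows D -> G; only G's weights are stepped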

Two details to read off it. The detach() on the fake during the discriminator step stops the generator from being updated there; we only want $D$ to learn from that line. And in the generator step the gradient flows backward through the frozen discriminator into $G$: the discriminator is the channel through which the data's influence reaches the generator. The generator never sees a real image. It only ever sees the direction the critic says would make its fake more convincing.
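To see that "direction from the critic" without a deep-learning library, here is a hand-rolled sketch with a hypothetical affine generator and logistic discriminator. It backpropagates the non-saturating loss through the fixed critic by hand and checks the result against finite differences.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical tiny models, chosen so the chain rule fits on one line each.
def G(z, a, b):
    return a * z + b             # generator: affine warp of the noise

def D(x, w, c):
    return sigmoid(w * x + c)    # discriminator: logistic critic

def generator_loss(z, a, b, w, c):
    return -math.log(D(G(z, a, b), w, c))   # non-saturating loss

def generator_grads(z, a, b, w, c):
    # Backprop by hand: dL/dx comes from the critic, then flows into G's parameters.
    d = D(G(z, a, b), w, c)
    dloss_dx = -(1.0 - d) * w        # the critic's "direction" at the fake sample
    return dloss_dx * z, dloss_dx    # (dL/da, dL/db), since dx/da = z and dx/db = 1

z, a, b, w, c = 0.7, 0.5, -1.0, 2.0, 0.3   # arbitrary illustrative values
grad_a, grad_b = generator_grads(z, a, b, w, c)

h = 1e-6
num_a = (generator_loss(z, a + h, b, w, c) - generator_loss(z, a - h, b, w, c)) / (2 * h)
num_b = (generator_loss(z, a, b + h, w, c) - generator_loss(z, a, b - h, w, c)) / (2 * h)
print(grad_a, num_a)
print(grad_b, num_b)
```

Every term in the generator's gradient carries the factor `dloss_dx`, which depends only on the critic's weights and its verdict on the fake; no real sample appears anywhere in the computation.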

Watching it learn, and collapse

When it works, training looks like the generator slowly spreading out to cover the data. Picture the real distribution as a handful of separated clusters, the modes. A healthy generator starts as a shapeless blob and, batch by batch, fans out until its samples land on every cluster in the right proportion. The left regime in the figure below is that happy path.

But the same game has a failure mode built into its incentives, and the paper named it before anyone had a cure. Look again at the generator's job: produce samples the discriminator calls real. Nothing in that sentence demands variety. If the generator finds one output that reliably fools the current discriminator, it can profit by producing that one output over and over, mapping many different $\mathbf{z}$ values to nearly the same $\mathbf{x}$. Contrast this with a supervised loss, which grades every output against its own target, so every mode of the data pulls on the model. Here the only grade is $D$'s opinion of the samples $G$ actually produces; the generator's loss is never evaluated at the modes it skipped, so nothing in the game charges $G$ for skipping them. Mode coverage is no one's job. The paper calls this the "Helvetica scenario" (the modern name is mode collapse): $G$ collapses too many values of $\mathbf{z}$ onto too few values of $\mathbf{x}$ to have the diversity needed to model $p_{\text{data}}$. Each batch can look perfectly real and yet the generator only ever visits one corner of the data. Toggle the figure to the collapse regime to see it: the samples cluster on a single mode and hop between modes instead of covering them.

Figure 6 · coverage, and collapse
Real data is eight modes on a ring; teal dots are the generator's samples. Healthy: they fan out from a blob to cover all eight. Mode collapse (the Helvetica scenario): they pile onto one mode and hop, fooling the critic batch by batch while throwing away the diversity of $p_{\text{data}}$. An illustration of the two regimes, not a live GAN.
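One simple way to put a number on the two regimes is to count how many modes receive at least one sample. The sketch below is a hypothetical diagnostic for the eight-mode ring, not something from the paper; the ring radius, the tolerance, and both sample sets are stand-ins invented for illustration.

```python
import math

NUM_MODES = 8
RADIUS = 2.0  # assumed ring radius, for illustration

def mode_centers():
    return [(RADIUS * math.cos(2 * math.pi * k / NUM_MODES),
             RADIUS * math.sin(2 * math.pi * k / NUM_MODES))
            for k in range(NUM_MODES)]

def nearest_mode(point, centers):
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))

def coverage(samples, centers, tol=0.5):
    # Count the modes that have at least one sample within `tol` of their center.
    hit = set()
    for p in samples:
        k = nearest_mode(p, centers)
        if math.dist(p, centers[k]) <= tol:
            hit.add(k)
    return len(hit)

centers = mode_centers()
# Stand-in sample sets (not the output of a trained GAN).
healthy = [(cx + 0.05, cy - 0.05) for cx, cy in centers for _ in range(4)]
collapsed = [(centers[3][0] + 0.01 * i, centers[3][1]) for i in range(32)]

print(coverage(healthy, centers))    # 8
print(coverage(collapsed, centers))  # 1
```

Both sample sets contain 32 points that each sit close to a real mode, so a per-sample realism check would pass them equally; only the count across modes separates them.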

Mode collapse is one of three distinct ways GAN training goes wrong, and they are worth keeping separate. It is here that the "saddle point" framing earns its keep. A minimax solution is not the bottom of a valley, where every direction is uphill; it is a saddle, a minimum along the generator's axes and a maximum along the discriminator's. Simultaneous gradient steps on a saddle do not have to converge. They can orbit it forever, the way descending one player while ascending the other traces a circle around the center of $V(x,y) = xy$. That orbiting is the second failure, oscillation. The third is the vanishing gradient from the last section: let the discriminator get too good, too fast, and it saturates the generator's loss and starves it of signal. No single trick fixes all three, which is why a decade of follow-up work went into taming this game.

Does it actually work?

In 2014, with no agreed way to score a likelihood-free model, the paper fit a Gaussian Parzen window to the generator's samples, a kernel density estimate: drop a small Gaussian on each generated sample, sum them into a makeshift density, and report the log-likelihood of the real test set under that. The numbers are in Table 1.
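The recipe is short enough to sketch in one dimension. The function below is a minimal illustration of a Gaussian Parzen-window log-likelihood, not the paper's evaluation code; the sample values and bandwidths are made up, and in the paper the bandwidth $\sigma$ was chosen by cross-validation on a validation set.

```python
import math

def log_sum_exp(values):
    # Numerically stable log(sum(exp(v))) for a list of floats.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def parzen_log_likelihood(test_points, generated, sigma):
    """Mean log-density of test_points under a 1-D Gaussian KDE on generated samples."""
    log_norm = -math.log(len(generated)) - 0.5 * math.log(2 * math.pi * sigma ** 2)
    total = 0.0
    for x in test_points:
        exponents = [-0.5 * ((x - g) / sigma) ** 2 for g in generated]
        total += log_norm + log_sum_exp(exponents)
    return total / len(test_points)

# Made-up 1-D numbers for illustration only.
generated = [-1.1, -0.9, 1.0, 1.2]
test_set = [-1.0, 1.1, 0.0]
for sigma in (0.05, 0.3, 2.0):
    print(sigma, parzen_log_likelihood(test_set, generated, sigma))
```

The score swings widely with the bandwidth alone, with the samples held fixed, which hints at why the estimate is a fragile proxy.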

Figure 7 · the paper's numbers
Parzen-window log-likelihood (higher is better) for four models on two datasets, with error bars. The adversarial net tops MNIST, but on TFD the Stacked CAE leads and the gap to GAN is within noise. The paper makes "no claim that these samples are better." The metric itself is weak, which is the more important lesson.

On MNIST the adversarial net posts the best score, 225 against the next-best 214. On the Toronto Face Database it comes second, 2057 behind the Stacked Contractive Autoencoder's 2110, and the spread across all four models is small relative to the error bars. This is "competitive," not "dominant." Nor is the metric to be trusted: a Parzen-window estimate is a poor stand-in for a likelihood the GAN cannot provide, and later work (Theis, van den Oord, and Bethge, 2015) showed these estimates can rank models in the wrong order and that, in high dimensions, a model can score well while producing nonsense or score badly while producing excellent images. What the 1406.2661 experiments really establish is that the samples looked good and were demonstrably not memorized copies of the training set (the paper shows each sample beside its nearest training neighbor), and that the field would need years to invent evaluations worth believing. The qualitative figures, blurry by today's standards, were the real evidence.

What was missing, and what came after

The paper is candid about its own limits, and the list reads like a map of the next decade of research. Two disadvantages are called out directly. First, there is no explicit representation of $p_g(\mathbf{x})$: you can sample from a GAN, but you cannot ask it how probable a given image is, which rules out anything that needs a likelihood. Second, $D$ and $G$ must be kept in careful balance through training, the synchronization problem that produces mode collapse and instability.

Set against those costs are real advantages, and they are what made GANs matter. No Markov chains are ever needed, for training or for sampling. Generating a sample is a single forward pass, fast and exact, with no chain to mix. Only backpropagation is used to get gradients, so the model rides the same hardware and tooling as everything else in deep learning. And because the generator is shaped only by gradients passed back through the discriminator and never fit to pixels directly, it can produce very sharp, even degenerate distributions, where methods that rely on a blurring Markov chain cannot.

What followed turned the demonstration into a field. DCGAN (2015) found the convolutional architecture and training recipe that made image GANs stable enough to be useful. Wasserstein GAN (2017) traded the implicit Jensen-Shannon objective for the Earth-Mover distance, whose gradient does not vanish when the distributions barely overlap, directly attacking the saturation problem. Progressive growing and StyleGAN (2018-2019) pushed the same game to photoreal faces at high resolution. By around 2021 a different family, diffusion models, overtook GANs on many image-generation benchmarks by trading the unstable two-player game for a stable regression objective. GANs did not vanish (they remain fast and competitive in places), but the throne moved.

Step back and the contribution is one idea, stated four ways. You can fit a generative model without ever writing down its density. You do it by pitting it against a classifier whose best move is to report a density ratio. Fighting that classifier turns out to minimize a real divergence to the data. And the whole arrangement trains with nothing but backpropagation. Every modern adversarial method, and a good deal of the instability folklore around them, is a footnote to those four sentences.

Provenance · Verified against primary literature
Goodfellow et al. (2014) · The minimax value function, optimal $D$, and the $-\log 4$ / Jensen-Shannon result (eqs 1–6).
Goodfellow (2016 tutorial) · The zero-sum / Nash-equilibrium / saddle-point framing and the $V = xy$ orbit.
Arjovsky & Bottou (2017) · The non-saturating loss is reverse-KL-like, not a literal Jensen-Shannon descent.
Theis et al. (2015) · Parzen-window log-likelihood is a weak, often misleading proxy for sample quality.
Correction · The 2014 paper never says "Nash equilibrium," "saddle point," or "mode collapse." It says "minimax game" and "the Helvetica scenario." We use the modern terms for intuition and attribute them here. And "GANs minimize the Jensen-Shannon divergence" holds only for the original saturating loss at an optimal discriminator, not for the non-saturating loss everyone actually trains.

Questions you might still have

If the generator never sees a real image, how does it learn?
It only ever receives gradients passed back through the discriminator. The discriminator looks at real data; the generator is told how to nudge its samples so the discriminator rates them higher. The data reaches the generator secondhand, as a direction, never as a target to copy.

Does GAN training really minimize the Jensen-Shannon divergence?
Only in the idealized case: the original minimax loss, with a perfectly optimal discriminator at every step. The non-saturating loss everyone actually uses shares the same fixed point but follows a different gradient (closer to a reverse KL), which is part of why GANs chase modes and can drop them.

Why are GANs so notoriously unstable?
The solution is a saddle point of the value function, not the bottom of a valley, so simultaneous gradient steps can orbit it instead of settling. Separately, if the discriminator gets too good its gradient to the generator vanishes, and the incentives also permit mode collapse. Three distinct failure modes, no single fix.

If a GAN has no likelihood, how do you judge it?
Mostly not by numbers. The paper’s Parzen-window scores are a weak proxy, and the field later judged GANs by sample quality (metrics like FID) and by what they enabled. A GAN can sample beautifully and still be unable to tell you how probable any particular image is.

Footnotes & further reading

  1. The paper: Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio, Generative Adversarial Nets (Université de Montréal, NeurIPS 2014). Original code.
  2. The zero-sum / Nash / saddle-point framing, the $V(x,y)=xy$ orbit, and the "not motivated by a theoretical concern" note on the non-saturating loss: Goodfellow, NIPS 2016 Tutorial: Generative Adversarial Networks.
  3. Why the non-saturating loss is reverse-KL-like and the source of the instability: Arjovsky & Bottou, Towards Principled Methods for Training GANs (2017), and Arjovsky, Chintala & Bottou, Wasserstein GAN.
  4. When GAN training provably orbits rather than converges (the manifold / not-absolutely-continuous case): Mescheder, Geiger & Nowozin, Which Training Methods for GANs do actually Converge? (ICML 2018).
  5. Why the Parzen-window numbers should not be trusted: Theis, van den Oord & Bethge, A note on the evaluation of generative models (2015).
  6. The lineage: DCGAN (2015), Conditional GANs (2014), StyleGAN (2018), and Diffusion Models Beat GANs on Image Synthesis (2021).