Flow Matching for Generative Modeling
Teach the model one arrow at a time.
A generative model is a machine that turns noise into data. Flow Matching trains one by regressing the velocity of a fixed noise-to-data path, with no diffusion process to reason about and no differential equation to solve while you train. Build a short tower of ideas and the whole method falls out, plus a straighter path that samples faster.
What if training a generative model meant nothing fancier than showing it which way to step, over and over?
Every modern image generator is, underneath, a way of solving one problem: you have a pile of samples from some distribution you cannot write down (faces, say, or photographs of dogs) and you want a machine that produces fresh ones. The trick that has dominated the last few years is to learn the journey from easy to hard. Start from pure Gaussian noise, which you can sample trivially, and learn a transformation that carries that noise onto the data. Diffusion models do this by setting up a stochastic process that slowly destroys data into noise, then learning to run it backward. It works beautifully, and it is also a lot of machinery: a noising process, a score to estimate, a schedule to tune, and a reverse-time stochastic differential equation to reason about.
Flow Matching, from Meta AI, throws most of that machinery out. The pitch fits in a sentence. Pick a fixed path that carries noise to data. Write down the velocity that travels along it. Train a network to copy that velocity by plain least-squares regression. There is no diffusion process, no score, no stochastic reversal, and no differential equation running inside the training loop. At generation time you take the learned velocity and follow it from noise to data with any off-the-shelf ODE solver.
The catch, and the reason the paper is clever rather than obvious, is that the velocity you actually want to copy is an average over the entire dataset and is hopeless to compute. The bulk of this post is the argument for why you can ignore that and regress a one-example stand-in instead, and get the same answer. We build it in order: what a flow is, the objective you wish you could write, why it is intractable, and the trick that makes it tractable. Then the payoff, a path borrowed from optimal transport that runs in straight lines and samples faster than diffusion.
A flow is a velocity field you follow
Picture every point in your space (every possible image, flattened to a vector in $\mathbb{R}^d$) and imagine attaching a little arrow to each one. The arrows can change over time. That time-varying field of arrows is a vector field, written $v_t(x)$: at time $t$ and location $x$ it tells you which way, and how fast, to move. Drop a particle into the field and let it ride: its position obeys an ordinary differential equation,
$$\frac{d}{dt}\phi_t(x) = v_t(\phi_t(x)), \qquad \phi_0(x) = x, \tag{1}$$
where $\phi_t(x)$ is where the particle that started at $x$ has drifted to by time $t$. The map $\phi_t$ is the flow. Run it on a whole cloud of starting points and the cloud gets stretched and reshaped. If we start the cloud as pure noise and the flow reshapes it into the data distribution by time $t = 1$, we have a generative model. Model the field with a neural network $v_\theta(x, t)$ and this is a Continuous Normalizing Flow (CNF).
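To make "drop a particle into the field and let it ride" concrete, here is a minimal sketch of following a velocity field with fixed-step Euler integration. The field `toy_field` (every arrow points at the origin) and the helper name `euler_flow` are assumptions chosen for illustration, not anything from the paper; the field is picked because its exact flow, $\phi_t(x) = x e^{-t}$, is known and can be compared against.

```python
import numpy as np

def euler_flow(v, x0, steps=1000):
    """Approximate the flow phi_1(x0) of dx/dt = v(x, t) with fixed-step Euler."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt                # left endpoint of the k-th step
        x = x + dt * v(x, t)      # move a short distance along the local arrow
    return x

def toy_field(x, t):
    """Illustrative field (an assumption): every arrow points at the origin."""
    return -x

# A small cloud of starting points, one per row.
cloud = np.array([[1.0, 0.0], [0.0, -2.0], [3.0, 4.0]])
moved = euler_flow(toy_field, cloud)

# For this field the exact flow at t = 1 is x0 * exp(-1).
exact = cloud * np.exp(-1.0)
```

The whole cloud is pushed through the same field at once, which is the sense in which a flow reshapes a distribution rather than a single point.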
One convention to fix now, because it trips people coming from diffusion. Here $t = 0$ is noise and $t = 1$ is data. Generation always runs forward in time, from $t = 0$ to $t = 1$. Diffusion papers usually go the other way (data at $t = 0$, noise at $t = T$, then reverse), because they have a separate noising process to undo. Flow Matching has no such process to reverse, so it just points the clock the natural way.
Below is the picture to keep in your head for the rest of the post. A field of arrows, and noise particles riding it from a blob in the middle out onto the data. The whole game is learning that field.
CNFs are old and elegant, and almost nobody used them at scale. The reason is training. The classic way to fit a CNF is maximum likelihood, which needs you to solve the ODE (1) numerically on every training step to know where your samples went and how the density changed. Solving an ODE means many sequential evaluations of the network, and doing that inside every gradient step is painfully slow. People wanted a way to train the field directly, without simulating it. That is the door Flow Matching opens.
The objective you wish you could use
Suppose a kind oracle handed you the right velocity field $u_t(x)$: the field that, when you follow it, carries noise exactly onto the data. Then training is the easiest thing in the world. Just regress your network $v_\theta(x, t)$ onto it with squared error:
$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x \sim p_t(x)} \left\| v_\theta(x, t) - u_t(x) \right\|^2. \tag{2}$$
Read it straight: pick a random time $t$, pick a point $x$ from the cloud as it looks at that time (that is the $x \sim p_t(x)$ under the expectation), and push the network's arrow toward the oracle's arrow. Drive this loss to zero and your $v_\theta$ equals $u_t$ everywhere the cloud visits, so your flow generates the data. No ODE solve, no likelihood, no diffusion. This is the Flow Matching objective, and it is the cleanest training loss in generative modeling.
There is exactly one problem, and it is fatal as stated. We do not have the oracle. We have no idea what $u_t$ is. There are infinitely many fields whose flow lands on the data, and even if we fixed a particular target path we want the cloud to follow, the field that generates it is an integral over the whole unknown data distribution. We cannot sample from $p_t$ and we cannot evaluate $u_t$. The dream objective (2) is uncomputable. The rest of the method is one long, satisfying maneuver to make it computable without changing its answer.
One example at a time
The escape begins by shrinking the problem. Forget the whole dataset for a moment and look at a single data point $x_1$. For that one point, define a simple path: a cloud that starts as the standard Gaussian at $t = 0$ and contracts onto a tight little blob centered at $x_1$ by $t = 1$. Because it is conditioned on one example, call it a conditional probability path, $p_t(x \mid x_1)$. The paper takes it to be Gaussian at every time:
$$p_t(x \mid x_1) = \mathcal{N}\big(x \mid \mu_t(x_1),\, \sigma_t(x_1)^2 I\big), \tag{3}$$
with a mean $\mu_t(x_1)$ and a scalar standard deviation $\sigma_t(x_1)$ that we get to design. The only requirements are the two endpoints. At $t = 0$ every conditional path must be the same standard noise, so $\mu_0(x_1) = 0$ and $\sigma_0(x_1) = 1$. At $t = 1$ it must concentrate on its data point, so $\mu_1(x_1) = x_1$ and $\sigma_1(x_1) = \sigma_{\min}$, a width small enough that the blob is essentially the point $x_1$. Everything in between is free.
This is a thing we can actually touch. Sampling from $p_t(x \mid x_1)$ is one line, $x = \mu_t(x_1) + \sigma_t(x_1)\,\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$, and the velocity that generates it (we derive it shortly) is a tidy closed form. Below is one such path. Slide the time and watch the noise cloud march in and shrink onto a single $x_1$. Toggle the schedule to compare the optimal-transport path with the diffusion one, which take the same endpoints by different routes.
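The "one line" of sampling can be sketched as follows. This is an illustration only: it assumes the straight-line schedule ($\mu_t = t x_1$, $\sigma_t = 1 - (1 - \sigma_{\min}) t$) that the post introduces later, and the names `ot_schedule` and `sample_conditional` are hypothetical. The noise `eps` is passed in explicitly so the two endpoints can be checked exactly.

```python
import numpy as np

def ot_schedule(x1, t, sigma_min=1e-4):
    """Mean and width of the straight-line conditional path at time t (assumed schedule)."""
    mu = t * x1
    sigma = 1.0 - (1.0 - sigma_min) * t
    return mu, sigma

def sample_conditional(x1, t, eps, sigma_min=1e-4):
    """Draw x ~ p_t(x | x1) by shifting and scaling standard normal noise eps."""
    mu, sigma = ot_schedule(x1, t, sigma_min)
    return mu + sigma * eps

x1 = np.array([2.0, -1.0, 0.5])                   # one "data point" (made up)
eps = np.random.default_rng(0).standard_normal(3)

start = sample_conditional(x1, 0.0, eps)          # t = 0: pure noise
end = sample_conditional(x1, 1.0, eps)            # t = 1: tight blob around x1
```

At $t = 0$ the sample is the noise itself, and at $t = 1$ it sits within $\sigma_{\min}\,\varepsilon$ of the data point, which is the pair of endpoint conditions stated above.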
One conditional path is not a generative model. It only knows about one data point. But a dataset is a pile of data points, and we are about to add the piles up.
Average the easy fields into the hard one
Here is the bridge from one example back to the whole distribution. The conditional paths all start at the same noise and each ends at its own data point. If we mix them, weighting each by how likely its data point is under the data distribution $q(x_1)$, we recover a path over the whole dataset. That mixture is the marginal probability path:
$$p_t(x) = \int p_t(x \mid x_1)\, q(x_1)\, dx_1. \tag{4}$$
Check the endpoints. At $t = 0$ every conditional path is the same standard Gaussian, so the mixture is too: $p_0(x) = \mathcal{N}(x \mid 0, I)$. At $t = 1$ each conditional path is a spike at its $x_1$, so the mixture is a spike at every data point in proportion to how common it is, which is the data distribution itself, $p_1(x) \approx q(x)$. So this mixture path is exactly the noise-to-data journey we wanted, assembled out of per-example pieces.
Now the part that is not obvious. Each conditional path has its own generating velocity $u_t(x \mid x_1)$. What single velocity field generates the mixture path? It would be lovely if it were just the average of the conditional velocities, and it almost is. You have to weight each conditional velocity by the posterior probability that the point $x$ came from data point $x_1$, which by Bayes is $p_t(x \mid x_1)\, q(x_1) / p_t(x)$:
$$u_t(x) = \int u_t(x \mid x_1)\, \frac{p_t(x \mid x_1)\, q(x_1)}{p_t(x)}\, dx_1. \tag{5}$$
This is the paper's first key result, and it is a clean one: the marginal velocity (5) generates the marginal path (4). The proof is two lines of the continuity equation, the conservation law that says a velocity $u_t$ generates a path $p_t$ exactly when $\partial_t p_t + \nabla \cdot (p_t u_t) = 0$. Differentiate the mixture (4) in time, swap the conditional velocity in using the fact that each $u_t(\cdot \mid x_1)$ generates its own $p_t(\cdot \mid x_1)$, pull the divergence outside the integral, and out pops the continuity equation for $p_t$ and $u_t$. The weighted average of the simple fields is the field that drives the mixture.
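The continuity-equation argument just described can be written out in four lines. This is a sketch that assumes enough regularity to exchange the time derivative and the divergence with the integral over $x_1$; here $q$ is the data distribution, $p_t(x \mid x_1)$ and $u_t(x \mid x_1)$ are the conditional path and its velocity, and $p_t$, $u_t$ are the marginal path (4) and marginal velocity (5).

```latex
\begin{aligned}
\partial_t p_t(x)
  &= \int \partial_t p_t(x \mid x_1)\, q(x_1)\, dx_1
     && \text{differentiate (4)} \\
  &= -\int \nabla \cdot \big( u_t(x \mid x_1)\, p_t(x \mid x_1) \big)\, q(x_1)\, dx_1
     && \text{conditional continuity equation} \\
  &= -\nabla \cdot \left( \int u_t(x \mid x_1)\, p_t(x \mid x_1)\, q(x_1)\, dx_1 \right)
     && \text{divergence acts on } x \text{ only} \\
  &= -\nabla \cdot \big( u_t(x)\, p_t(x) \big)
     && \text{definition (5), multiplied by } p_t(x).
\end{aligned}
```

The last line is the continuity equation for the pair $(p_t, u_t)$, which is the statement that $u_t$ generates $p_t$.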
So the intractable oracle from the dream objective is not so mysterious after all. It is a posterior-weighted average of conditional velocities we can each write down. Drag the probe point below: each amber arrow is one data point's conditional push (fatter when that point is the more likely explanation of where the probe is), and the teal arrow is the average they produce, which is $u_t(x)$ at that spot.
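For a dataset small enough to enumerate, the posterior-weighted average can be computed directly. The sketch below assumes the straight-line conditional path introduced later in the post ($\mu_t = t x_1$, $\sigma_t = 1 - (1 - \sigma_{\min}) t$, conditional velocity $(x_1 - (1 - \sigma_{\min}) x) / \sigma_t$); the two-point dataset and the name `marginal_velocity` are made up for illustration.

```python
import numpy as np

def marginal_velocity(x, t, data, weights, sigma_min=0.0):
    """Posterior-weighted average of conditional velocities at point x, time t < 1."""
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)            # [k, d], one data point per row
    sigma = 1.0 - (1.0 - sigma_min) * t
    mu = t * data                                   # conditional means, [k, d]
    # log of q(x1) * N(x | mu, sigma^2 I); the Gaussian normaliser is shared and dropped
    log_w = np.log(weights) - np.sum((x - mu) ** 2, axis=1) / (2.0 * sigma ** 2)
    post = np.exp(log_w - log_w.max())
    post /= post.sum()                              # posterior over data points given x
    cond = (data - (1.0 - sigma_min) * x) / sigma   # each data point's conditional push
    return post @ cond, post

data = [[-1.0, 0.0], [1.0, 0.0]]                    # toy dataset (assumption)
weights = [0.5, 0.5]

u_mid, post_mid = marginal_velocity([0.0, 0.0], 0.5, data, weights)
u_off, post_off = marginal_velocity([0.3, 0.0], 0.5, data, weights)
```

Halfway between the two data points the pushes cancel; move the probe toward one of them and that point's posterior weight grows, so the averaged arrow tilts its way.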
We are closer, but not done. The formula (5) still has the data distribution $q$ and the marginal density $p_t$ inside it, both unknown, both integrals over the whole dataset. We still cannot compute $u_t(x)$ to regress against. The last step is the one that feels like a magic trick.
The trick: regress the easy target, get the hard one for free
Instead of regressing against the intractable marginal velocity, regress against the per-example conditional velocity. Replace the oracle target $u_t(x)$ in the dream objective (2) with $u_t(x \mid x_1)$, and sample $x$ from that one example's conditional path. This is the Conditional Flow Matching (CFM) objective:
$$\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_1 \sim q(x_1),\; x \sim p_t(x \mid x_1)} \left\| v_\theta(x, t) - u_t(x \mid x_1) \right\|^2. \tag{6}$$
Every piece of this is computable. Draw a real data point $x_1$ from your dataset, draw a time $t$, draw a sample $x$ on that example's conditional path, and regress the network onto the closed-form conditional velocity. No marginal, no integral, no $p_t(x)$. Just one example, one time, one arrow.
And the claim that makes it all work: the CFM objective (6) and the Flow Matching objective (2) have identical gradients in $\theta$. Minimizing the easy one is exactly minimizing the hard one. Your network converges to the marginal velocity even though you only ever showed it conditional velocities.
Why is that true? It is not a coincidence, it is a property of squared error. Expand both losses with $\|a - b\|^2 = \|a\|^2 - 2\langle a, b \rangle + \|b\|^2$, where $a = v_\theta(x, t)$ and $b$ is the target. The $\|b\|^2$ term has no $\theta$ in it, so it drops out of the gradient. The $\|a\|^2$ term is identical in both losses once you note that averaging over the marginal $p_t(x)$ is the same as averaging over $q(x_1)$ and then $p_t(x \mid x_1)$, since one is the mixture of the other. The only term that could differ is the cross term $\langle a, b \rangle$, and plugging the definition of the marginal velocity (5) into it shows the marginal cross term and the conditional cross term integrate to the same thing. The two losses differ by a constant that does not depend on $\theta$, so
$$\nabla_\theta \mathcal{L}_{\mathrm{FM}}(\theta) = \nabla_\theta \mathcal{L}_{\mathrm{CFM}}(\theta). \tag{7}$$
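The cross-term step is the only one that uses the definition (5), so it is worth writing out. A sketch at a fixed time $t$, assuming the integrals can be exchanged, with $v_\theta$ the network, $u_t(x)$ the marginal velocity and $u_t(x \mid x_1)$ the conditional one:

```latex
\begin{aligned}
\mathbb{E}_{p_t(x)} \big\langle v_\theta(x, t),\, u_t(x) \big\rangle
  &= \int \Big\langle v_\theta(x, t),\,
       \int u_t(x \mid x_1)\, \frac{p_t(x \mid x_1)\, q(x_1)}{p_t(x)}\, dx_1 \Big\rangle\, p_t(x)\, dx \\
  &= \iint \big\langle v_\theta(x, t),\, u_t(x \mid x_1) \big\rangle\,
       p_t(x \mid x_1)\, q(x_1)\, dx_1\, dx \\
  &= \mathbb{E}_{q(x_1),\, p_t(x \mid x_1)} \big\langle v_\theta(x, t),\, u_t(x \mid x_1) \big\rangle .
\end{aligned}
```

The factor $p_t(x)$ in the denominator of (5) cancels against the density being averaged over, which is why the unknown marginal never has to be evaluated.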
There is a one-sentence way to feel it without the algebra. Least-squares regression always converges to the conditional mean of its target. The target in CFM is the conditional velocity $u_t(x \mid x_1)$ for a random $x_1$, so the network converges to $\mathbb{E}[\,u_t(x \mid x_1) \mid x\,]$, the average conditional velocity over all the data points that could have produced this $x$. That average is the marginal velocity (5). You are training against single arrows and landing on their average, because that is what regression does. This is the same move that lets denoising score matching train a score without ever knowing the true score, generalized from scores to velocities.
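The "regression lands on the average" claim can be checked numerically on a toy problem. The sketch below is a Monte Carlo illustration under stated assumptions, not the paper's experiment: a made-up one-dimensional dataset with two points, the straight-line path with $\sigma_{\min} = 0$, a fixed time, and a narrow bin around a probe location standing in for "the network's prediction at $x$". Within the bin, the constant that minimizes squared error is the sample mean of the per-example targets.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 200_000, 0.5
data = np.array([-1.0, 1.0])            # toy 1-D dataset (assumption)

x1 = rng.choice(data, size=n)           # x1 ~ q, both points equally likely
x0 = rng.standard_normal(n)             # x0 ~ N(0, 1)
xt = (1 - t) * x0 + t * x1              # sample on each example's conditional path
target = x1 - x0                        # per-example conditional velocity

# Least-squares fit of a constant inside a narrow bin around the probe = bin mean.
probe, half_width = 0.3, 0.025
in_bin = np.abs(xt - probe) < half_width
regressed = target[in_bin].mean()

# Closed-form marginal velocity at the probe: posterior-weighted conditional velocities.
sigma = 1 - t
post = np.exp(-(probe - t * data) ** 2 / (2 * sigma ** 2))
post /= post.sum()
marginal = np.sum(post * (data - probe) / sigma)
```

Every individual target in the bin is close to one of two very different values, yet their mean agrees with `marginal` up to sampling noise.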
That is the entire conceptual payload. Everything from here is choosing a good conditional path so the arrows are easy to learn and cheap to follow. But first, make the loop concrete.
A worked step
Take the optimal-transport path we are about to define, with $\sigma_{\min} = 0$ for cleanliness. The conditional path is the straight interpolation from a noise sample $x_0$ to the data point $x_1$, and its velocity is the constant $x_1 - x_0$. One training step on a batch of $n$ images, each a $d$-vector:
# one Conditional Flow Matching step, OT path (sigma_min = 0)
x1 = sample_data() # a real data point [n, d]
x0 = randn_like(x1) # a noise sample [n, d]
t = rand(n, 1) # times ~ U[0,1] [n, 1]
xt = (1 - t) * x0 + t * x1 # point on the straight path
target = x1 - x0 # the constant velocity (the answer)
loss = mse(v(xt, t), target) # regress the arrow
loss.backward() # no ODE solve, no diffusion
Concretely, for CIFAR-10, where each image flattens to $d = 3 \times 32 \times 32 = 3072$ numbers, a batch $x_1$ has shape $[n, 3072]$, $x_0$ is the same shape of fresh Gaussian noise, $t$ has shape $[n, 1]$, the interpolation point $x_t$ has shape $[n, 3072]$, and the regression target is also $[n, 3072]$. The network (a U-Net) takes the noised image and the time and predicts a velocity. The loss is one mean-squared error. No ODE was solved, no noise schedule was consulted, no score was estimated. It is the simplest training loop in the family.
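The step above can be made runnable without committing to a deep-learning framework. The following NumPy sketch builds one batch and evaluates the loss for an arbitrary callable `v(x, t)`; the helper names `cfm_batch` and `cfm_loss`, the random stand-in "images", and the zero-output placeholder model are all assumptions for illustration. The gradient step itself is left to whichever autodiff library holds the real network.

```python
import numpy as np

def cfm_batch(x1, rng):
    """Build (xt, t, target) for one CFM step on the straight path, sigma_min = 0."""
    n = x1.shape[0]
    x0 = rng.standard_normal(x1.shape)   # fresh Gaussian noise, same shape as the data
    t = rng.random((n, 1))               # one time per example, uniform on [0, 1)
    xt = (1 - t) * x0 + t * x1           # point on the straight noise-to-data line
    target = x1 - x0                     # constant velocity along that line
    return xt, t, target

def cfm_loss(v, x1, rng):
    """Mean-squared error between the model's arrow and the per-example target."""
    xt, t, target = cfm_batch(x1, rng)
    return np.mean((v(xt, t) - target) ** 2)

rng = np.random.default_rng(0)
x1 = rng.standard_normal((8, 3072)) + 2.0        # stand-in for 8 flattened images
xt, t, target = cfm_batch(x1, rng)

zero_model = lambda x, t: np.zeros_like(x)       # placeholder model, predicts no motion
loss = cfm_loss(zero_model, x1, rng)
```

A quick sanity check on the construction: starting from `xt` and following `target` for the remaining time `1 - t` lands exactly on `x1`.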
To generate, you do the one thing CFM never did during training: actually solve the ODE, forward from noise to data.
# generation: solve the ODE forward, t = 0 -> 1
x = randn(n, d) # start from pure noise (p0)
for k in range(steps): # fixed-step Euler; any off-the-shelf ODE solver works
    t = k / steps # left endpoint of step k: 0, 1/steps, ..., 1 - 1/steps
    x = x + (1 / steps) * v(x, t) # Euler step along the learned field
# x is now approximately a sample from the data distribution (p1)
A formula for the target arrow
The worked step quietly used a closed form for the conditional velocity. Here is where it comes from, and it covers every Gaussian path at once. Take the Gaussian conditional path (3) and build the obvious flow that produces it: start with a standard normal sample $x$ and stretch it by the standard deviation, then shift it by the mean,
$$\psi_t(x) = \sigma_t(x_1)\, x + \mu_t(x_1). \tag{8}$$
When $x \sim \mathcal{N}(0, I)$, the output $\psi_t(x)$ is distributed exactly as $\mathcal{N}(\mu_t(x_1), \sigma_t(x_1)^2 I)$, so this affine map pushes the noise onto the conditional path. Its velocity is forced: differentiate the affine map in time, apply the flow definition (1), and invert the map to express the result at a point $x$ on the path. The unique velocity that generates this path is
$$u_t(x \mid x_1) = \frac{\sigma_t'(x_1)}{\sigma_t(x_1)}\, \big(x - \mu_t(x_1)\big) + \mu_t'(x_1), \tag{9}$$
where the primes are time derivatives. This one formula is the workhorse. Hand it any mean schedule $\mu_t$ and width schedule $\sigma_t$ with the right endpoints and it spits out the exact velocity to regress. Two choices recover the diffusion world.
The variance-exploding path keeps the mean at the data point and grows the noise: $\mu_t(x_1) = x_1$, $\sigma_t(x_1) = \sigma_{1-t}$ for an increasing schedule $\sigma_t$. Plug into (9) and the mean term vanishes, leaving $u_t(x \mid x_1) = -\frac{\sigma_{1-t}'}{\sigma_{1-t}}\,(x - x_1)$. The variance-preserving path, the one behind DDPM, scales the mean down by $\alpha_{1-t}$ and the variance up to compensate, with $\mu_t(x_1) = \alpha_{1-t}\, x_1$ and $\sigma_t(x_1) = \sqrt{1 - \alpha_{1-t}^2}$. Plug those in and (9) reproduces the velocity of the probability-flow ODE that score-based diffusion already uses. So diffusion paths are not a rival to Flow Matching. They are points inside its menu, reachable by a particular pair of schedules $(\mu_t, \sigma_t)$.
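The velocity formula (9) is easy to sanity-check numerically: along the affine flow $\psi_t(x_0) = \sigma_t x_0 + \mu_t$, the formula evaluated at $\psi_t(x_0)$ should equal the time derivative of $\psi_t(x_0)$. The sketch below does this for a variance-exploding-style path. The geometric width schedule and its constants `s_min`, `s_max` are hypothetical choices for illustration, not the paper's settings, and `conditional_velocity` is an illustrative name.

```python
import numpy as np

def conditional_velocity(x, t, mu, sigma, dmu, dsigma):
    """Equation (9): sigma'/sigma * (x - mu) + mu', for schedules passed as callables."""
    return dsigma(t) / sigma(t) * (x - mu(t)) + dmu(t)

x1 = np.array([2.0, -1.0])        # data point (made up)
x0 = np.array([0.5, 0.3])         # noise sample (made up)
s_min, s_max = 0.01, 5.0          # hypothetical schedule constants

# Variance-exploding style: mean pinned at x1, width shrinking as t -> 1.
mu = lambda t: x1
dmu = lambda t: np.zeros_like(x1)
sigma = lambda t: s_min * (s_max / s_min) ** (1.0 - t)
dsigma = lambda t: -np.log(s_max / s_min) * sigma(t)

def psi(t):
    """Affine flow carrying the noise sample x0 along the conditional path."""
    return sigma(t) * x0 + mu(t)

t, h = 0.37, 1e-6
u = conditional_velocity(psi(t), t, mu, sigma, dmu, dsigma)
finite_diff = (psi(t + h) - psi(t - h)) / (2 * h)   # numerical d/dt of the flow
```

Because the mean is constant here, the `dmu` term contributes nothing and the velocity is a pure rescaling of the offset from `x1`, matching the variance-exploding case described above.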
That reframing is worth a beat. The whole apparatus of forward and reverse stochastic processes was one way to arrive at a Gaussian path. Flow Matching skips the process and writes the path down. Even when you choose the diffusion path, training it with the CFM velocity loss (instead of score matching) turns out more stable in the paper's experiments. And once you see the schedules as free parameters, you can ask for a path no diffusion produces.
The straight-line path
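What is the simplest possible pair of schedules? Move the mean and the width in straight lines:
$$\mu_t(x_1) = t\, x_1, \qquad \sigma_t(x_1) = 1 - (1 - \sigma_{\min})\, t. \tag{10}$$
At $t = 0$ this is the standard Gaussian ($\mu_0 = 0$, $\sigma_0 = 1$); at $t = 1$ it is the blob at $x_1$ ($\mu_1 = x_1$, $\sigma_1 = \sigma_{\min}$). Both endpoints satisfied, by the most boring interpolation there is. Feed (10) into the velocity formula (9) and the time-derivatives are constants, giving
$$u_t(x \mid x_1) = \frac{x_1 - (1 - \sigma_{\min})\, x}{1 - (1 - \sigma_{\min})\, t}, \tag{11}$$
and the flow that carries a noise sample $x_0$ along it is the straight line $\psi_t(x_0) = \big(1 - (1 - \sigma_{\min})\, t\big)\, x_0 + t\, x_1$. Substitute into the CFM loss and the regression target collapses to a single constant vector,
$$\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\; q(x_1),\; p(x_0)} \left\| v_\theta\big(\psi_t(x_0), t\big) - \big(x_1 - (1 - \sigma_{\min})\, x_0\big) \right\|^2. \tag{12}$$
The target does not depend on $t$ at all. It is the arrow that points from the noise sample straight at the data sample, and the particle travels that arrow at constant speed. (With $\sigma_{\min} = 0$ it is just $x_1 - x_0$, the form the released libraries use.)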
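For readers who want the substitution spelled out, here is the algebra behind the straight-line velocity and its constant target, a worked sketch using the general Gaussian velocity formula (9) with the linear schedules $\mu_t = t x_1$ and $\sigma_t = 1 - (1 - \sigma_{\min}) t$, so that $\mu_t' = x_1$ and $\sigma_t' = -(1 - \sigma_{\min})$.

```latex
\begin{aligned}
u_t(x \mid x_1)
  &= \frac{\sigma_t'}{\sigma_t}\,(x - \mu_t) + \mu_t'
   = \frac{-(1 - \sigma_{\min})\,(x - t x_1)}{1 - (1 - \sigma_{\min})\, t} + x_1 \\
  &= \frac{-(1 - \sigma_{\min})\, x + (1 - \sigma_{\min})\, t\, x_1
           + x_1 - (1 - \sigma_{\min})\, t\, x_1}{1 - (1 - \sigma_{\min})\, t}
   = \frac{x_1 - (1 - \sigma_{\min})\, x}{1 - (1 - \sigma_{\min})\, t}. \\[6pt]
\text{At } x = \psi_t(x_0) &= \big(1 - (1 - \sigma_{\min})\, t\big)\, x_0 + t\, x_1 : \\
x_1 - (1 - \sigma_{\min})\, \psi_t(x_0)
  &= \big(1 - (1 - \sigma_{\min})\, t\big)\, \big(x_1 - (1 - \sigma_{\min})\, x_0\big), \\
\text{so}\quad u_t\big(\psi_t(x_0) \mid x_1\big)
  &= x_1 - (1 - \sigma_{\min})\, x_0 .
\end{aligned}
```

The time-dependent denominator cancels exactly along the trajectory, which is why the regression target is a single constant vector per noise-data pair.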
This is not an arbitrary choice. That straight-line flow is the optimal-transport displacement map between the two Gaussians: of all the ways to morph the prior into the data blob, this is the one that moves mass the shortest total distance, in straight lines, at constant speed. Optimal transport is the mathematics of moving a pile of sand to a new shape with the least total carrying, and McCann showed that between two Gaussians the answer is exactly this linear interpolation. The diffusion path, by contrast, takes a curved detour: it barely moves the sample early, then rushes it toward the data near the end, and its trajectories can swing past the target and double back.
Watch both. Each line is one noise sample being carried to the same $x_1$. The optimal-transport flow draws straight, evenly-paced lines. Toggle to diffusion and the same endpoints get connected by curves that bow out and overshoot.
A caution the paper is careful to state, and so will we. The conditional flow is optimal transport, but the marginal flow (the average over all data points) is not, in general, an optimal-transport map. Straight conditional lines do not guarantee straight marginal trajectories. Still, the marginal field stays relatively simple, and that simplicity is what pays off at sampling time.
Why straight is cheap to sample
Generation means solving the ODE (1) from $t = 0$ to $t = 1$, and a solver works by taking discrete steps. The cost is counted in function evaluations (NFE), one network call per step. Fewer steps means cheaper sampling, and how few you can get away with depends entirely on how curved the trajectory is.
A straight, constant-speed path is the easy case. A solver that assumes straight lines between steps tracks it almost perfectly even with a handful of steps. A curved path makes the same coarse solver cut corners: it steps along a chord, misses the bend, and drifts off the true trajectory, so you have to add steps to keep up. That is the practical edge of the optimal-transport path. The paper measures it directly: matching the same numerical error, Flow Matching with the OT path needs roughly 60% of the function evaluations the diffusion path needs.
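The effect of curvature on a coarse solver shows up even on a single conditional path. The sketch below runs fixed-step Euler on two one-dimensional conditional velocity fields with the same endpoints: the straight-line one with $\sigma_{\min} = 0$, and a curved one built from a trigonometric schedule ($\mu_t = \sin(\pi t / 2)\, x_1$, $\sigma_t = \cos(\pi t / 2)$). The trigonometric schedule is a stand-in chosen for illustration, not the paper's diffusion path, and this is a single-trajectory toy rather than the marginal-field comparison the paper reports.

```python
import numpy as np

def euler_endpoint(velocity, x0, steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to 1 with `steps` Euler steps."""
    x, dt = float(x0), 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity(x, k * dt)   # one function evaluation per step
    return x

x0, x1 = 1.0, 2.0                          # noise sample and data point (made up)

def ot_velocity(x, t):
    """Straight-line conditional velocity with sigma_min = 0."""
    return (x1 - x) / (1.0 - t)

def curved_velocity(x, t):
    """Gaussian-path velocity for the assumed trigonometric schedule."""
    a = 0.5 * np.pi * t
    mu, sigma = np.sin(a) * x1, np.cos(a)
    dmu, dsigma = 0.5 * np.pi * np.cos(a) * x1, -0.5 * np.pi * np.sin(a)
    return dsigma / sigma * (x - mu) + dmu

ot_err = abs(euler_endpoint(ot_velocity, x0, 2) - x1)
curved_err = {n: abs(euler_endpoint(curved_velocity, x0, n) - x1) for n in (2, 8, 32)}
```

On the straight path two Euler steps already land on the data point, because the velocity along the trajectory never changes. On the curved path the same budget misses, and the error only shrinks as steps (and therefore function evaluations) are added.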
See it below. We integrate the marginal field with $N$ Euler steps and check what fraction of particles land on the data. The optimal-transport field lands them with very few steps. Toggle to the curved diffusion-like field and the same step budget leaves particles stranded between clusters until you crank $N$ up.
So what does it actually do
The headline is that a loss this plain is competitive with, and often better than, the carefully-engineered diffusion training it replaces, on the same architecture. The authors take one U-Net and train it three ways: standard score matching with a diffusion path, Flow Matching with that same diffusion path, and Flow Matching with the optimal-transport path. On CIFAR-10 and ImageNet at 32, 64, and 128 the OT version wins across the board.
On ImageNet 32 the OT model reaches FID 5.02 against 5.68 for score matching and 6.99 for DDPM, while needing 122 function evaluations to score matching's 178 and DDPM's 262. (FID, the Fréchet Inception Distance, scores sample quality; lower is better. NFE counts the network calls an adaptive solver needs.) On CIFAR-10 it posts FID 6.35 at 2.99 bits per dimension. On ImageNet 64 it is FID 14.45 at 138 NFE. At 128 it reaches FID 20.9, the first CNF trained at that resolution. Better samples, better likelihoods, and fewer steps to produce them, all from the same network and the plainest possible loss.
Three things travel out of this paper. The CFM trick (regress per-example targets, converge to their average) is the engine, and it works for any path, not just Gaussians. The reframing of diffusion as one Gaussian path among many removed the mystique from a lot of generative modeling and let people design paths on purpose. And the straight optimal-transport path, with its constant-velocity target $x_1 - x_0$, became the default recipe almost everywhere downstream. When you read that a model was trained by "regressing on the interpolation between noise and data," that is this paper, with $\sigma_{\min}$ set to zero. Stable Diffusion 3 and the current generation of large flow models are built on it.
The limits are honest. The conditional flow being optimal transport does not make the learned marginal flow optimal transport, so the trajectories are simpler but not perfectly straight, and very few sampling steps still cost you some quality. The Gaussian-path family, broad as it is, is still a family; richer paths (non-isotropic, non-Gaussian, on curved spaces) are left as future work the framework invites. None of that has slowed it down. Flow Matching took the most elegant generative model nobody could train and made it the easiest one to train, by noticing you never needed the hard target in the first place. You just needed to show the network one arrow at a time.
Questions you might still have
If we never compute the true marginal velocity, how can regressing the conditional one be right?
The two losses differ only by a constant that does not depend on θ, so their gradients are identical (Theorem 2). The minimizer of the easy per-example loss is the conditional expectation of the conditional targets, which is exactly the marginal velocity. You train against single-example arrows and land on the average without ever forming it.
What is σ_min, and why not just set it to zero?
It is the width of the little Gaussian sitting on each data point at t = 1. A tiny σ_min keeps p₁(x | x₁) a real density (needed for likelihoods), so the paper keeps the (1−σ_min) factor. The released libraries usually take σ_min = 0, which makes the OT target the clean straight-line velocity x₁ − x₀. We flag the difference in Provenance.
Is Flow Matching just diffusion with extra steps?
No. Diffusion is one choice of path inside the framework (the variance-preserving Gaussian path is a special case). Flow Matching lets you pick the path directly, and the optimal-transport path is a different, straighter one that diffusion never produces. Same objective shape, a strictly larger menu.
Does the network predict the data, the noise, or the velocity?
The velocity. With the OT path the regression target is the constant vector x₁ − (1−σ_min)x₀, which points from the noise sample straight at the data sample. Diffusion methods instead regress the score or the noise; those are reparameterizations of the same information, but the velocity is what you integrate at sampling time.
Footnotes & further reading
- The paper: Lipman, Chen, Ben-Hamu, Nickel, Le, Flow Matching for Generative Modeling (Meta AI / Weizmann, ICLR 2023). Code, and the authors' Flow Matching Guide and Code.
- Continuous Normalizing Flows and the Neural ODE: Chen, Rubanova, Bettencourt, Duvenaud, Neural Ordinary Differential Equations.
- The conditional-vs-marginal regression trick FM generalizes from scores to velocities: Vincent, A Connection Between Score Matching and Denoising Autoencoders (2011).
- Diffusion as a score-based SDE with a probability-flow ODE, the family the Gaussian paths recover: Song et al., Score-Based Generative Modeling through SDEs, and Ho, Jain, Abbeel, DDPM.
- The optimal-transport displacement interpolation between two Gaussians: McCann, A Convexity Principle for Interacting Gases (1997).
- The concurrent σ_min = 0 special case: Liu, Gong, Liu, Flow Straight and Fast (Rectified Flow), and Albergo & Vanden-Eijnden, stochastic interpolants.