Verified · arXiv:2210.02747 · 26 min
Diffusion · Generative models

Flow Matching for Generative Modeling

Teach the model one arrow at a time.

A generative model is a machine that turns noise into data. Flow Matching trains one by regressing the velocity of a fixed noise-to-data path, with no diffusion process to reason about and no differential equation to solve while you train. Build a short tower of ideas and the whole method falls out, plus a straighter path that samples faster.

Explaining the paper: Flow Matching for Generative Modeling · Lipman, Chen, Ben-Hamu, Nickel, Le · Meta AI (FAIR) · ICLR 2023 · arXiv:2210.02747

What if training a generative model meant nothing fancier than showing it which way to step, over and over?

Every modern image generator is, underneath, a way of solving one problem: you have a pile of samples from some distribution you cannot write down (faces, say, or photographs of dogs) and you want a machine that produces fresh ones. The trick that has dominated the last few years is to learn the journey from easy to hard. Start from pure Gaussian noise, which you can sample trivially, and learn a transformation that carries that noise onto the data. Diffusion models do this by setting up a stochastic process that slowly destroys data into noise, then learning to run it backward. It works beautifully, and it is also a lot of machinery: a noising process, a score to estimate, a schedule to tune, and a reverse-time stochastic differential equation to reason about.

Flow Matching, from Meta AI, throws most of that machinery out. The pitch fits in a sentence. Pick a fixed path that carries noise to data. Write down the velocity that travels along it. Train a network to copy that velocity by plain least-squares regression. There is no diffusion process, no score, no stochastic reversal, and no differential equation running inside the training loop. At generation time you take the learned velocity and follow it from noise to data with any off-the-shelf ODE solver.

The catch, and the reason the paper is clever rather than obvious, is that the velocity you actually want to copy is an average over the entire dataset and is hopeless to compute. The bulk of this post is the argument for why you can ignore that and regress a one-example stand-in instead, and get the same answer. We build it in order: what a flow is, the objective you wish you could write, why it is intractable, and the trick that makes it tractable. Then the payoff, a path borrowed from optimal transport that runs in straight lines and samples faster than diffusion.

A flow is a velocity field you follow

Picture every point in your space (every possible image, flattened to a vector in $\mathbb{R}^d$) and imagine attaching a little arrow to each one. The arrows can change over time. That time-varying field of arrows is a vector field, written $v_t(x)$: at time $t$ and location $x$ it tells you which way, and how fast, to move. Drop a particle into the field and let it ride: its position obeys an ordinary differential equation,

$$\frac{d}{dt}\phi_t(x) = v_t\big(\phi_t(x)\big), \qquad \phi_0(x) = x \tag{1}$$

where $\phi_t(x)$ is where the particle that started at $x$ has drifted to by time $t$. The map $\phi_t$ is the flow. Run it on a whole cloud of starting points and the cloud gets stretched and reshaped. If we start the cloud as pure noise $p_0 = \mathcal{N}(0, I)$ and the flow reshapes it into the data distribution $p_1$ by time $t = 1$, we have a generative model. Model the field $v_t$ with a neural network and this is a Continuous Normalizing Flow (CNF).
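To make equation (1) concrete, here is a minimal sketch that follows a velocity field with fixed-step Euler integration. The field `toy_field` (every arrow points at the origin) is an assumption chosen only because its flow has the closed form $\phi_t(x) = x\,e^{-t}$ to compare against; it is not a learned generative field.

```python
import numpy as np

def euler_flow(v, x0, steps=1000, t_end=1.0):
    """Approximate the flow phi_t(x0) of dx/dt = v(x, t) with fixed-step Euler."""
    x = np.array(x0, dtype=float)
    dt = t_end / steps
    for k in range(steps):
        x = x + dt * v(x, k * dt)   # follow the arrow at the current point and time
    return x

# Toy field (an assumption for illustration): every arrow points at the origin.
def toy_field(x, t):
    return -x

x_start = np.array([[2.0, -1.0], [0.5, 3.0]])   # two starting points in R^2
x_end = euler_flow(toy_field, x_start)          # where they have drifted to by t = 1
exact = x_start * np.exp(-1.0)                  # closed-form flow for this field
max_err = np.max(np.abs(x_end - exact))
```

Swapping `toy_field` for a trained network is all that separates this loop from a sampler; the integration itself does not change.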

One convention to fix now, because it trips people coming from diffusion. Here $t = 0$ is noise and $t = 1$ is data. Generation always runs forward in time, from $0$ to $1$. Diffusion papers usually go the other way (data at $t=0$, noise at $t=1$, then reverse), because they have a separate noising process to undo. Flow Matching has no such process to reverse, so it just points the clock the natural way.

Below is the picture to keep in your head for the rest of the post. A field of arrows, and noise particles riding it from a blob in the middle out onto the data. The whole game is learning that field.

Figure 1 · a flow carries noise to data (noise p₀ → data p₁ along v_t)
A time-dependent velocity field v_t. Drop particles in as Gaussian noise at t=0 and let them ride the arrows; by t=1 they have landed on the four data clusters. A generative model is just this field.

CNFs are old and elegant, and almost nobody used them at scale. The reason is training. The classic way to fit a CNF is maximum likelihood, which needs you to solve the ODE (1) numerically on every training step to know where your samples went and how the density changed. Solving an ODE means many sequential evaluations of the network, and doing that inside every gradient step is painfully slow. People wanted a way to train the field directly, without simulating it. That is the door Flow Matching opens.

The objective you wish you could use

Suppose a kind oracle handed you the right velocity field: the field $u_t(x)$ that, when you follow it, carries noise exactly onto the data. Then training is the easiest thing in the world. Just regress your network $v_t$ onto it with squared error:

$$\mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t,\,p_t(x)}\big\| v_t(x) - u_t(x) \big\|^2, \qquad t \sim \mathcal{U}[0,1] \tag{2}$$

Read it straight: pick a random time $t$, pick a point $x$ from the cloud as it looks at that time (that is the $p_t(x)$ under the expectation), and push the network's arrow toward the oracle's arrow. Drive this loss to zero and your $v_t$ equals $u_t$ everywhere the cloud visits, so your flow generates the data. No ODE solve, no likelihood, no diffusion. This is the Flow Matching objective, and it is the cleanest training loss in generative modeling.

There is exactly one problem, and it is fatal as stated. We do not have the oracle. We have no idea what $u_t(x)$ is. There are infinitely many fields whose flow lands on the data, and even if we fixed a particular target path $p_t$ we want the cloud to follow, the field that generates it is an integral over the whole unknown data distribution. We cannot sample from $p_t(x)$ and we cannot evaluate $u_t(x)$. The dream objective (2) is uncomputable. The rest of the method is one long, satisfying maneuver to make it computable without changing its answer.

One example at a time

The escape begins by shrinking the problem. Forget the whole dataset for a moment and look at a single data point $x_1$. For that one point, define a simple path: a cloud that starts as the standard Gaussian at $t = 0$ and contracts onto a tight little blob centered at $x_1$ by $t = 1$. Because it is conditioned on one example, call it a conditional probability path, $p_t(x \mid x_1)$. The paper takes it to be Gaussian at every time:

$$p_t(x \mid x_1) = \mathcal{N}\big(x \,\big|\, \mu_t(x_1),\; \sigma_t(x_1)^2 I\big) \tag{3}$$

with a mean $\mu_t(x_1)$ and a scalar standard deviation $\sigma_t(x_1)$ that we get to design. The only requirements are the two endpoints. At $t = 0$ every conditional path must be the same standard noise, so $\mu_0(x_1) = 0$ and $\sigma_0(x_1) = 1$. At $t = 1$ it must concentrate on its data point, so $\mu_1(x_1) = x_1$ and $\sigma_1(x_1) = \sigma_{\min}$, a width small enough that the blob is essentially the point $x_1$. Everything in between is free.

This is a thing we can actually touch. Sampling from $p_t(x \mid x_1)$ is one line, $x = \mu_t(x_1) + \sigma_t(x_1)\,\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$, and the velocity that generates it (we derive it shortly) is a tidy closed form. Below is one such path. Slide the time and watch the noise cloud march in and shrink onto a single $x_1$. Toggle the schedule to compare the optimal-transport path with the diffusion one, which reach the same endpoints by different routes.

Figure 2 · one conditional path
A single conditional path p_t(x | x₁), drawn as a cloud of samples μ_t + σ_t·ε. It starts as the standard Gaussian at t=0 and contracts onto one data point x₁ at t=1. The OT and diffusion variants use different schedules for μ_t and σ_t between the same endpoints.
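The one-line sampler for a conditional path can be sketched as follows. The linear schedule in `linear_schedule` is one choice that satisfies both endpoint conditions, and the value of `SIGMA_MIN` is an assumption; the text only asks that it be small.

```python
import numpy as np

SIGMA_MIN = 1e-2  # assumed value; any small positive width would do

def linear_schedule(t, x1):
    """Mean t*x1 and width 1 - (1 - sigma_min)*t: noise at t=0, a blob on x1 at t=1."""
    return t * x1, 1.0 - (1.0 - SIGMA_MIN) * t

def sample_conditional(t, x1, n, rng):
    """Draw n samples x = mu_t(x1) + sigma_t(x1) * eps with eps ~ N(0, I)."""
    mu, sigma = linear_schedule(t, x1)
    eps = rng.standard_normal((n, x1.shape[0]))
    return mu + sigma * eps

rng = np.random.default_rng(0)
x1 = np.array([3.0, -2.0])                       # one made-up data point
start = sample_conditional(0.0, x1, 50_000, rng) # should look like N(0, I)
end = sample_conditional(1.0, x1, 50_000, rng)   # should hug x1 with width sigma_min
```

Comparing `start.mean(axis=0)` and `end.mean(axis=0)` shows the cloud moving from the origin onto `x1` while its spread collapses from 1 to `SIGMA_MIN`.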

One conditional path is not a generative model. It only knows about one data point. But a dataset is a pile of data points, and we are about to add the piles up.

Average the easy fields into the hard one

Here is the bridge from one example back to the whole distribution. The conditional paths all start at the same noise and each ends at its own data point. If we mix them, weighting each by how likely its data point is under the data distribution $q(x_1)$, we recover a path over the whole dataset. That mixture is the marginal probability path:

$$p_t(x) = \int p_t(x \mid x_1)\, q(x_1)\, dx_1 \tag{4}$$

Check the endpoints. At $t = 0$ every conditional path is the same standard Gaussian, so the mixture is too: $p_0 = \mathcal{N}(0, I)$. At $t = 1$ each conditional path is a spike at its $x_1$, so the mixture is a spike at every data point in proportion to how common it is, which is the data distribution itself, $p_1 \approx q$. So this mixture path is exactly the noise-to-data journey we wanted, assembled out of per-example pieces.

Now the part that is not obvious. Each conditional path has its own generating velocity $u_t(x \mid x_1)$. What single velocity field generates the mixture path? It would be lovely if it were just the average of the conditional velocities, and it almost is. You have to weight each conditional velocity by the posterior probability that the point $x$ came from data point $x_1$, which by Bayes is $p_t(x \mid x_1)\,q(x_1)/p_t(x)$:

$$u_t(x) = \int u_t(x \mid x_1)\,\frac{p_t(x \mid x_1)\, q(x_1)}{p_t(x)}\, dx_1 \tag{5}$$

This is the paper's first key result, and it is a clean one: the marginal velocity (5) generates the marginal path (4). The proof is two lines of the continuity equation, the conservation law that says a velocity $v_t$ generates a path $p_t$ exactly when $\partial_t p_t + \nabla\!\cdot(p_t v_t) = 0$. Differentiate the mixture (4) in time, swap the conditional velocity in using the fact that each $u_t(\cdot \mid x_1)$ generates its own $p_t(\cdot \mid x_1)$, pull the divergence outside the integral, and out pops the continuity equation for $u_t$ and $p_t$. The weighted average of the simple fields is the field that drives the mixture.

So the intractable oracle $u_t$ from the dream objective is not so mysterious after all. It is a posterior-weighted average of conditional velocities we can each write down. Drag the probe point below: each amber arrow is one data point's conditional push (fatter when that point is the more likely explanation of where the probe is), and the teal arrow is the average they produce, which is $u_t$ at that spot.

Figure 3 · the marginal field is an average
Each amber arrow is one data point's conditional velocity u_t(x | x₁) at the probe, weighted by the posterior that the probe came from it (the point swells with its weight). The teal arrow is their weighted average: the marginal velocity u_t(x) the oracle would have handed us.
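For a finite toy dataset the posterior-weighted average (5) can be computed exactly, which is what the figure visualizes. The sketch below assumes a uniform $q$ over four made-up points and borrows the straight-line Gaussian path (mean $t\,x_1$, width $1-(1-\sigma_{\min})t$) together with its closed-form conditional velocity, which the text derives later; `SIGMA_MIN` and the data are illustrative assumptions.

```python
import numpy as np

SIGMA_MIN = 1e-2  # assumed small width at t = 1

def marginal_velocity(x, t, data):
    """Posterior-weighted average (5) for a uniform q over the rows of `data`."""
    sigma = 1.0 - (1.0 - SIGMA_MIN) * t
    # log p_t(x | x1), up to a constant shared by every x1
    sq_dist = np.sum((x - t * data) ** 2, axis=1)
    log_w = -0.5 * sq_dist / sigma**2
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                                      # posterior over data points
    u_cond = (data - (1.0 - SIGMA_MIN) * x) / sigma   # u_t(x | x1), one row per x1
    return w @ u_cond, w

data = np.array([[2.0, 2.0], [2.0, -2.0], [-2.0, 2.0], [-2.0, -2.0]])
u_early, w_early = marginal_velocity(np.array([0.5, 0.0]), 0.0, data)
u_late, w_late = marginal_velocity(np.array([1.8, 1.8]), 0.9, data)
```

At $t = 0$ every data point is an equally good explanation of the probe, so `w_early` is uniform; late in the flow one point dominates and the marginal arrow is essentially that point's conditional arrow.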

We are closer, but not done. The formula (5) still has the data distribution $q(x_1)$ and the marginal density $p_t(x)$ inside it, both unknown, both integrals over the whole dataset. We still cannot compute $u_t(x)$ to regress against. The last step is the one that feels like a magic trick.

The trick: regress the easy target, get the hard one for free

Instead of regressing against the intractable marginal velocity, regress against the per-example conditional velocity. Replace the oracle target $u_t(x)$ in the dream objective (2) with $u_t(x \mid x_1)$, and sample $x$ from that one example's conditional path. This is the Conditional Flow Matching (CFM) objective:

$$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t,\,q(x_1),\,p_t(x \mid x_1)}\big\| v_t(x) - u_t(x \mid x_1) \big\|^2 \tag{6}$$

Every piece of this is computable. Draw a real data point $x_1$ from your dataset, draw a time $t$, draw a sample $x$ on that example's conditional path, and regress the network onto the closed-form conditional velocity. No marginal, no $q(x_1)$ integral, no $p_t(x)$. Just one example, one time, one arrow.

And the claim that makes it all work: the CFM objective (6) and the Flow Matching objective (2) have identical gradients in $\theta$. Minimizing the easy one is exactly minimizing the hard one. Your network converges to the marginal velocity $u_t$ even though you only ever showed it conditional velocities.

Why is that true? It is not a coincidence, it is a property of squared error. Expand both losses with $\|v - u\|^2 = \|v\|^2 - 2\langle v, u\rangle + \|u\|^2$. The $\|u\|^2$ term has no $\theta$ in it, so it drops out of the gradient. The $\|v\|^2$ term is identical in both losses once you note that averaging over the marginal $p_t(x)$ is the same as averaging over $q(x_1)$ and then $p_t(x \mid x_1)$, since one is the mixture of the other. The only term that could differ is the cross term $\langle v, u\rangle$, and plugging the definition of the marginal velocity (5) into it shows the marginal cross term and the conditional cross term integrate to the same thing. The two losses differ by a constant that does not depend on $\theta$, so

$$\nabla_\theta\, \mathcal{L}_{\text{FM}}(\theta) = \nabla_\theta\, \mathcal{L}_{\text{CFM}}(\theta) \tag{7}$$
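For readers who want the cross-term step written out, it is a single substitution of (5) followed by cancelling $p_t(x)$. This is a sketch at fixed $t$ that assumes the integrals can be exchanged, as the paper's regularity conditions allow:

```latex
\begin{aligned}
\mathbb{E}_{p_t(x)}\big\langle v_t(x),\, u_t(x) \big\rangle
&= \int p_t(x)\,\Big\langle v_t(x),\ \int u_t(x \mid x_1)\,
   \frac{p_t(x \mid x_1)\, q(x_1)}{p_t(x)}\, dx_1 \Big\rangle\, dx \\
&= \iint \big\langle v_t(x),\, u_t(x \mid x_1) \big\rangle\,
   p_t(x \mid x_1)\, q(x_1)\, dx_1\, dx \\
&= \mathbb{E}_{q(x_1),\, p_t(x \mid x_1)}\big\langle v_t(x),\, u_t(x \mid x_1) \big\rangle .
\end{aligned}
```

The left side is the cross term of the Flow Matching loss (2) and the right side is the cross term of the CFM loss (6), so the only $\theta$-dependent terms of the two losses agree.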

There is a one-sentence way to feel it without the algebra. Least-squares regression always converges to the conditional mean of its target. The target in CFM is the conditional velocity $u_t(x \mid x_1)$ for a random $x_1$, so the network converges to $\mathbb{E}[u_t(x \mid x_1) \mid x]$, the average conditional velocity over all the data points that could have produced this $x$. That average is the marginal velocity (5). You are training against single arrows and landing on their average, because that is what regression does. This is the same move that lets denoising score matching train a score without ever knowing the true score, generalized from scores to velocities.
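The regression-to-the-mean argument can be checked numerically on a toy problem. The sketch below assumes a 1-D dataset of two points with uniform $q$, the straight-line path with $\sigma_{\min} = 0$ (so $x = (1-t)x_0 + t x_1$ and the per-example target is $x_1 - x_0$), and a fixed time; it averages the single-example targets that land near a probe and compares the result with the posterior-weighted formula (5). The binning stands in for a network and is only an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([-1.0, 1.0])     # toy 1-D dataset, uniform q (assumption)
t, probe, half_width = 0.5, 0.3, 0.05
n = 400_000

# Draw CFM training triples at a fixed time t on the sigma_min = 0 straight path.
x1 = rng.choice(data, size=n)
x0 = rng.standard_normal(n)
xt = (1.0 - t) * x0 + t * x1
target = x1 - x0                 # the single-example arrow the loss regresses on

# The best least-squares prediction near the probe is the mean target there.
near = np.abs(xt - probe) < half_width
regressed = target[near].mean()

# Marginal velocity (5) at the probe: posterior-weighted conditional velocities.
sigma = 1.0 - t
w = np.exp(-0.5 * ((probe - t * data) / sigma) ** 2)
w /= w.sum()
marginal = np.sum(w * (data - probe) / sigma)
```

Up to Monte Carlo noise and the finite bin width, `regressed` and `marginal` agree, even though no step of the sampling ever formed the posterior weights.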

That is the entire conceptual payload. Everything from here is choosing a good conditional path so the arrows are easy to learn and cheap to follow. But first, make the loop concrete.

A worked step

Take the optimal-transport path we are about to define, with $\sigma_{\min} = 0$ for cleanliness. The conditional path is the straight interpolation $x_t = (1-t)\,x_0 + t\,x_1$ from a noise sample $x_0$ to the data point $x_1$, and its velocity is the constant $x_1 - x_0$. One training step on a batch of $n$ images, each a $d$-vector:

# one Conditional Flow Matching step, OT path (sigma_min = 0)
x1 = sample_data()                 # a real data point     [n, d]
x0 = randn_like(x1)                # a noise sample        [n, d]
t  = rand(n, 1)                    # times ~ U[0,1]        [n, 1]
xt = (1 - t) * x0 + t * x1         # point on the straight path
target = x1 - x0                   # the constant velocity (the answer)
loss = mse(v(xt, t), target)       # regress the arrow
loss.backward()                    # no ODE solve, no diffusion

Concretely, for CIFAR-10 at $32\times32\times3$ a batch might be $x_1$ of shape $[256, 3072]$, $x_0$ the same shape of fresh Gaussian noise, $t$ of shape $[256, 1]$, the interpolation point $x_t$ of shape $[256, 3072]$, and the regression target $x_1 - x_0$ also $[256, 3072]$. The network $v$ (a U-Net) takes the noised image and the time and predicts a $[256, 3072]$ velocity. The loss is one mean-squared error. No ODE was solved, no noise schedule was consulted, no score was estimated. It is the simplest training loop in the family.
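The shapes in that step can be checked with plain NumPy broadcasting. This sketch uses random arrays as a stand-in for a CIFAR-10 batch and a hypothetical `v` that predicts zero everywhere in place of the U-Net, so it exercises the data flow and the loss, not actual learning.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 32 * 32 * 3

def v(x, t):
    """Stand-in for the network (hypothetical): predicts zero velocity everywhere."""
    return np.zeros_like(x)

x1 = rng.standard_normal((n, d))   # stand-in for a batch of flattened images [n, d]
x0 = rng.standard_normal((n, d))   # noise sample                             [n, d]
t = rng.uniform(size=(n, 1))       # one time per example                     [n, 1]
xt = (1.0 - t) * x0 + t * x1       # [n, 1] broadcasts against [n, d]
target = x1 - x0                   # constant conditional velocity            [n, d]
loss = np.mean((v(xt, t) - target) ** 2)   # mean-squared error over all entries
```

With a real model, `v` would be the network's forward pass and the last line would feed an optimizer; everything above it stays the same.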

To generate, you do the one thing CFM never did during training: actually solve the ODE, forward from noise to data.

# generation: solve the ODE forward, t = 0 -> 1
x = randn(n, d)                    # start from pure noise (p0)
dt = 1 / steps
for k in range(steps):             # fixed-step Euler; any off-the-shelf ODE solver works
    t = k * dt                     # left endpoint of each step, so t stays below 1
    x = x + dt * v(x, t)           # Euler step along the learned field
# x is now (approximately) a sample from the data distribution (p1)

A formula for the target arrow

The worked step quietly used a closed form for the conditional velocity. Here is where it comes from, and it covers every Gaussian path at once. Take the Gaussian conditional path (3) and build the obvious flow that produces it: start with a standard normal sample and stretch it by the standard deviation, then shift it by the mean,

$$\psi_t(x) = \sigma_t(x_1)\,x + \mu_t(x_1) \tag{8}$$

When $x \sim \mathcal{N}(0, I)$, the output $\psi_t(x)$ is exactly $\mathcal{N}(\mu_t, \sigma_t^2 I)$, so this affine map pushes the noise onto the conditional path. Its velocity is forced: differentiate $\psi_t$ in time, apply the flow definition (1), and undo the affine map to express the result at the current position. The unique velocity that generates this path is

$$u_t(x \mid x_1) = \frac{\sigma_t'(x_1)}{\sigma_t(x_1)}\big(x - \mu_t(x_1)\big) + \mu_t'(x_1) \tag{9}$$

where the primes are time derivatives. This one formula is the workhorse. Hand it any mean schedule $\mu_t$ and width schedule $\sigma_t$ with the right endpoints and it spits out the exact velocity to regress. Two choices recover the diffusion world.
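Spelled out, the derivation of (9) is three short steps. Here $y$ is a name introduced only for this sketch, denoting the particle's current position $\psi_t(x)$:

```latex
\begin{aligned}
\frac{d}{dt}\psi_t(x) &= \sigma_t'(x_1)\,x + \mu_t'(x_1)
  && \text{differentiate (8) in time} \\
\frac{d}{dt}\psi_t(x) &= u_t\big(\psi_t(x) \mid x_1\big)
  && \text{flow definition (1)} \\
x &= \frac{y - \mu_t(x_1)}{\sigma_t(x_1)}
  && \text{invert (8), writing } y = \psi_t(x) \\
u_t(y \mid x_1) &= \frac{\sigma_t'(x_1)}{\sigma_t(x_1)}\big(y - \mu_t(x_1)\big) + \mu_t'(x_1)
  && \text{substitute}
\end{aligned}
```

Renaming $y$ back to $x$ gives (9). The inversion needs $\sigma_t(x_1) > 0$, which holds for every $t$ as long as $\sigma_{\min} > 0$.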

The variance-exploding path keeps the mean at the data point and puts all the time dependence in the noise level: $\mu_t(x_1) = x_1$, $\sigma_t(x_1) = \sigma_{1-t}$ for an increasing schedule $\sigma_s$, so the noise shrinks as $t$ runs toward $1$. Plug into (9) and the mean term vanishes, leaving $u_t(x \mid x_1) = -\frac{\sigma_{1-t}'}{\sigma_{1-t}}\,(x - x_1)$. The variance-preserving path, the one behind DDPM, scales the mean down by $\alpha_{1-t}$ and the variance up to compensate, $\mu_t(x_1) = \alpha_{1-t}\,x_1$ and $\sigma_t(x_1) = \sqrt{1 - \alpha_{1-t}^2}$, with $\alpha_t = e^{-\frac12 T(t)}$ and $T(t) = \int_0^t \beta(s)\,ds$. Plug those in and (9) reproduces the velocity of the probability-flow ODE that score-based diffusion already uses. So diffusion paths are not a rival to Flow Matching. They are points inside its menu, reachable by a particular pair of schedules $(\mu_t, \sigma_t)$.
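Formula (9) is short enough to code directly and test against a finite difference of the flow (8). The sketch below does this for the variance-exploding path; the geometric noise schedule and the constants `S_MIN` and `S_MAX` are assumptions made for illustration, since the text only asks for an increasing schedule.

```python
import numpy as np

def conditional_velocity(x, mu, dmu, sigma, dsigma):
    """Formula (9): u_t(x | x1) = (sigma'/sigma) * (x - mu) + mu'."""
    return dsigma / sigma * (x - mu) + dmu

# Assumed geometric noise schedule sigma(s) = S_MIN * (S_MAX / S_MIN) ** s.
S_MIN, S_MAX = 0.01, 50.0
LOG_RATIO = np.log(S_MAX / S_MIN)

def ve_path(t, x1):
    """Variance-exploding path: mu_t = x1, sigma_t = sigma(1 - t)."""
    sigma = S_MIN * (S_MAX / S_MIN) ** (1.0 - t)
    dsigma = -LOG_RATIO * sigma          # d/dt of sigma(1 - t), by the chain rule
    return x1, np.zeros_like(x1), sigma, dsigma

def psi(t, x0, x1):
    """Affine flow (8) for the VE path."""
    mu, _, sigma, _ = ve_path(t, x1)
    return sigma * x0 + mu

x0, x1, t, h = np.array([0.3, -1.2]), np.array([2.0, 1.0]), 0.4, 1e-6
mu, dmu, sigma, dsigma = ve_path(t, x1)
u = conditional_velocity(psi(t, x0, x1), mu, dmu, sigma, dsigma)
finite_diff = (psi(t + h, x0, x1) - psi(t - h, x0, x1)) / (2 * h)
closed_form = -LOG_RATIO * (psi(t, x0, x1) - x1)   # the VE expression in the text
```

The three quantities `u`, `finite_diff`, and `closed_form` agree, which confirms both that (9) is the time derivative of the flow and that it reduces to the variance-exploding expression for this schedule.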

That reframing is worth a beat. The whole apparatus of forward and reverse stochastic processes was one way to arrive at a Gaussian path. Flow Matching skips the process and writes the path down. Even when you choose the diffusion path, training it with the CFM velocity loss (instead of score matching) turns out more stable in the paper's experiments. And once you see the schedules as free parameters, you can ask for a path no diffusion produces.

The straight-line path

What is the simplest possible pair of schedules? Move the mean and the width in straight lines:

$$\mu_t(x_1) = t\,x_1, \qquad \sigma_t(x_1) = 1 - (1 - \sigma_{\min})\,t \tag{10}$$

At $t = 0$ this is the standard Gaussian ($\mu = 0,\ \sigma = 1$); at $t = 1$ it is the blob at $x_1$ ($\mu = x_1,\ \sigma = \sigma_{\min}$). Both endpoints satisfied, by the most boring interpolation there is. Feed (10) into the velocity formula (9) and the time-derivatives are constants, giving

$$u_t(x \mid x_1) = \frac{x_1 - (1 - \sigma_{\min})\,x}{1 - (1 - \sigma_{\min})\,t} \tag{11}$$

and the flow that carries a noise sample $x_0$ along it is the straight line $\psi_t(x_0) = \big(1 - (1 - \sigma_{\min})t\big)x_0 + t\,x_1$. Substitute into the CFM loss and the regression target collapses to a single constant vector,

$$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t,\,q(x_1),\,p(x_0)}\Big\| v_t\big(\psi_t(x_0)\big) - \big(x_1 - (1 - \sigma_{\min})\,x_0\big) \Big\|^2 \tag{12}$$

The target $x_1 - (1-\sigma_{\min})x_0$ does not depend on $t$ at all. It is the arrow that points from the noise sample straight at the data sample, and the particle travels that arrow at constant speed. (With $\sigma_{\min} = 0$ it is just $x_1 - x_0$, the form the released libraries use.)
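The collapse of (11) to a constant along the path is easy to confirm numerically. This sketch evaluates the conditional velocity at points on the straight-line flow for several times and compares each with the target in (12); the value of `SIGMA_MIN` and the random endpoints are assumptions for illustration.

```python
import numpy as np

SIGMA_MIN = 1e-2   # assumed small value

def psi(t, x0, x1):
    """Straight-line conditional flow for the schedule (10)."""
    return (1.0 - (1.0 - SIGMA_MIN) * t) * x0 + t * x1

def u_cond(x, t, x1):
    """Conditional velocity (11)."""
    return (x1 - (1.0 - SIGMA_MIN) * x) / (1.0 - (1.0 - SIGMA_MIN) * t)

rng = np.random.default_rng(0)
x0, x1 = rng.standard_normal(4), rng.standard_normal(4)
target = x1 - (1.0 - SIGMA_MIN) * x0            # the regression target in (12)
times = (0.0, 0.25, 0.5, 0.9, 1.0)
along_path = [u_cond(psi(t, x0, x1), t, x1) for t in times]
is_constant = all(np.allclose(u, target) for u in along_path)
```

Off the path the field (11) does vary with $x$ and $t$; it is only along each particle's own trajectory that the arrow stays fixed.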

This is not an arbitrary choice. That straight-line flow $\psi_t$ is the optimal-transport displacement map between the two Gaussians: of all the ways to morph the prior into the data blob, this is the one that moves mass the shortest total distance, in straight lines, at constant speed. Optimal transport is the mathematics of moving a pile of sand to a new shape with the least total carrying, and McCann showed that between two Gaussians the answer is exactly this linear interpolation. The diffusion path, by contrast, takes a curved detour: it barely moves the sample early, then rushes it toward the data near the end, and its trajectories can swing past the target and double back.

Watch both. Each line is one noise sample being carried to the same $x_1$. The optimal-transport flow draws straight, evenly-paced lines. Toggle to diffusion and the same endpoints get connected by curves that bow out and overshoot.

Figure 4 · straight vs curved trajectories (ψ_t = (1−(1−σ_min)t)·x₀ + t·x₁)
Conditional trajectories carrying noise samples to one data point x₁. The OT flow moves in straight, constant-speed lines. The diffusion flow joins the same endpoints by curves that bow and can overshoot before settling.

A caution the paper is careful to state, and so will we. The conditional flow is optimal transport, but the marginal flow (the average over all data points) is not, in general, an optimal-transport map. Straight conditional lines do not guarantee straight marginal trajectories. Still, the marginal field stays relatively simple, and that simplicity is what pays off at sampling time.

Why straight is cheap to sample

Generation means solving the ODE (1) from $t = 0$ to $t = 1$, and a solver works by taking discrete steps. The cost is counted in function evaluations (NFE), one network call per step. Fewer steps means cheaper sampling, and how few you can get away with depends entirely on how curved the trajectory is.

A straight, constant-speed path is the easy case. A solver that assumes straight lines between steps tracks it almost perfectly even with a handful of steps. A curved path makes the same coarse solver cut corners: it steps along a chord, misses the bend, and drifts off the true trajectory, so you have to add steps to keep up. That is the practical edge of the optimal-transport path. The paper measures it directly: matching the same numerical error, Flow Matching with the OT path needs roughly 60% of the function evaluations the diffusion path needs.

See it below. We integrate the marginal field with $N$ Euler steps and check what fraction of particles land on the data. The optimal-transport field lands them with very few steps. Toggle to the curved diffusion-like field and the same step budget leaves particles stranded between clusters until you crank $N$ up.

Figure 5 · steps you can afford
Euler-integrating the field with N steps. The straight OT field lands its particles on the data with few steps. The curved diffusion field, given the same N, cuts corners and strands particles, so you need many more steps for the same quality.
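A small experiment in the spirit of the figure can be sketched with the exact marginal field of a toy dataset instead of a trained network. Everything here is an assumption for illustration: four made-up 2-D data points with uniform $q$, the straight-line path with $\sigma_{\min} = 0$, and "mean distance to the nearest data point" as the quality measure. It covers only the optimal-transport field, not the diffusion comparison.

```python
import numpy as np

def marginal_velocity(x, t, data):
    """Exact marginal field for a uniform q over `data`, sigma_min = 0 straight path."""
    sigma = 1.0 - t
    sq = np.sum((x[:, None, :] - t * data[None, :, :]) ** 2, axis=2)  # [n, m]
    log_w = -0.5 * sq / sigma**2
    w = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)            # posterior over data points
    x1_hat = w @ data                            # posterior mean of x1, [n, d]
    return (x1_hat - x) / sigma

def sample(data, n, steps, rng):
    """Euler-integrate the field from noise at t = 0 up to t = 1."""
    x = rng.standard_normal((n, data.shape[1]))
    dt = 1.0 / steps
    for k in range(steps):                       # t = k * dt stays below 1
        x = x + dt * marginal_velocity(x, k * dt, data)
    return x

def mean_miss(x, data):
    """Average distance from each sample to its nearest data point."""
    dists = np.linalg.norm(x[:, None, :] - data[None, :, :], axis=2)
    return dists.min(axis=1).mean()

data = np.array([[2.0, 2.0], [2.0, -2.0], [-2.0, 2.0], [-2.0, -2.0]])
miss = {n_steps: mean_miss(sample(data, 2000, n_steps, np.random.default_rng(0)), data)
        for n_steps in (1, 2, 8, 32)}
```

With a single step every particle jumps to the dataset mean, because at $t = 0$ the posterior over data points is uniform; a handful of steps is already enough for the particles to commit to a cluster, which is the sense in which a simple field is cheap to follow.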

So what does it actually do

The headline is that a loss this plain is competitive with, and often better than, the carefully-engineered diffusion training it replaces, on the same architecture. The authors take one U-Net and train it three ways: standard score matching with a diffusion path, Flow Matching with that same diffusion path, and Flow Matching with the optimal-transport path. On CIFAR-10 and ImageNet at 32, 64, and 128 the OT version wins across the board.

On ImageNet 32 the OT model reaches FID 5.02 against 5.68 for score matching and 6.99 for DDPM, while needing 122 function evaluations to score matching's 178 and DDPM's 262. (FID, the Fréchet Inception Distance, scores sample quality; lower is better. NFE counts the network calls an adaptive solver needs.) On CIFAR-10 it posts FID 6.35 at 2.99 bits per dimension. On ImageNet 64 it is FID 14.45 at 138 NFE. At 128 it reaches FID 20.9, the first CNF trained at that resolution. Better samples, better likelihoods, and fewer steps to produce them, all from the same network and the plainest possible loss.

Three things travel out of this paper. The CFM trick (regress per-example targets, converge to their average) is the engine, and it works for any path, not just Gaussians. The reframing of diffusion as one Gaussian path among many removed the mystique from a lot of generative modeling and let people design paths on purpose. And the straight optimal-transport path, with its constant-velocity target $x_1 - x_0$, became the default recipe almost everywhere downstream. When you read that a model was trained by "regressing $x_1 - x_0$ on the interpolation between noise and data," that is this paper, with $\sigma_{\min}$ set to zero. Stable Diffusion 3 and the current generation of large flow models are built on it.

The limits are honest. The conditional flow being optimal transport does not make the learned marginal flow optimal transport, so the trajectories are simpler but not perfectly straight, and very few sampling steps still cost you some quality. The Gaussian-path family, broad as it is, is still a family; richer paths (non-isotropic, non-Gaussian, on curved spaces) are left as future work the framework invites. None of that has slowed it down. Flow Matching took the most elegant generative model nobody could train and made it the easiest one to train, by noticing you never needed the hard target in the first place. You just needed to show the network one arrow at a time.

Provenance · Verified against primary literature
Flow Matching (2022): Lipman, Chen, Ben-Hamu, Nickel, Le: the FM and CFM objectives, the equal-gradients theorem, the Gaussian-path VF formula, and the OT path.
flow_matching (code): Official library (facebookresearch/flow_matching). CondOTScheduler: α_t = t, σ_t = 1−t; the affine path x_t = α_t x₁ + σ_t x₀ and target velocity α̇_t x₁ + σ̇_t x₀.
McCann (1997): The displacement interpolation between two Gaussians, which makes the OT conditional flow a straight line.
Neural ODEs / CNF (2018): Chen et al.: modeling a flow as the solution of an ODE driven by a learned velocity field.
Denoising score matching: Vincent (2011): the conditional-vs-marginal trick FM generalizes from scores to velocities.
Correction: The paper keeps a small σ_min > 0, so the OT regression target is x₁ − (1−σ_min)x₀. The official library (and most downstream code) set σ_min = 0, making the target the clean straight line x₁ − x₀; that σ_min = 0 case is exactly Rectified Flow (Liu et al., concurrent). We teach the σ_min form and note the simplification.

Questions you might still have

If we never compute the true marginal velocity, how can regressing the conditional one be right?
The two losses differ only by a constant that does not depend on θ, so their gradients are identical (Theorem 2). The minimizer of the easy per-example loss is the conditional expectation of the conditional targets, which is exactly the marginal velocity. You train against single-example arrows and land on the average without ever forming it.

What is σ_min, and why not just set it to zero?
It is the width of the little Gaussian sitting on each data point at t = 1. A tiny σ_min keeps p₁(x | x₁) a real density (needed for likelihoods), so the paper keeps the (1−σ_min) factor. The released libraries usually take σ_min = 0, which makes the OT target the clean straight-line velocity x₁ − x₀. We flag the difference in Provenance.

Is Flow Matching just diffusion with extra steps?
No. Diffusion is one choice of path inside the framework (the variance-preserving Gaussian path is a special case). Flow Matching lets you pick the path directly, and the optimal-transport path is a different, straighter one that diffusion never produces. Same objective shape, a strictly larger menu.

Does the network predict the data, the noise, or the velocity?
The velocity. With the OT path the regression target is the constant vector x₁ − (1−σ_min)x₀, which points from the noise sample straight at the data sample. Diffusion methods instead regress the score or the noise; those are reparameterizations of the same information, but the velocity is what you integrate at sampling time.

Footnotes & further reading

  1. The paper: Lipman, Chen, Ben-Hamu, Nickel, Le, Flow Matching for Generative Modeling (Meta AI / Weizmann, ICLR 2023). Code, and the authors' Flow Matching Guide and Code.
  2. Continuous Normalizing Flows and the Neural ODE: Chen, Rubanova, Bettencourt, Duvenaud, Neural Ordinary Differential Equations.
  3. The conditional-vs-marginal regression trick FM generalizes from scores to velocities: Vincent, A Connection Between Score Matching and Denoising Autoencoders (2011).
  4. Diffusion as a score-based SDE with a probability-flow ODE, the family the Gaussian paths recover: Song et al., Score-Based Generative Modeling through SDEs, and Ho, Jain, Abbeel, DDPM.
  5. The optimal-transport displacement interpolation between two Gaussians: McCann, A Convexity Principle for Interacting Gases (1997).
  6. The concurrent σ_min = 0 special case: Liu, Gong, Liu, Flow Straight and Fast (Rectified Flow), and Albergo & Vanden-Eijnden, stochastic interpolants.