
Flow Matching for Generative Modeling

Teach the model one arrow at a time.

A generative model is a machine that turns noise into data. Flow Matching trains one by regressing the velocity of a fixed noise-to-data path, with no diffusion process to reason about and no differential equation to solve while you train. A few ideas give you the entire method, plus a straighter path that samples faster.

Explaining the paper: Flow Matching for Generative Modeling. Lipman, Chen, Ben-Hamu, Nickel, Le · Meta AI (FAIR) · ICLR 2023 · arXiv:2210.02747

Training a generative model can be nothing fancier than showing it which way to step, over and over.

Every modern image generator is, underneath, a way of solving one problem: you have a pile of samples from some distribution you cannot write down (faces, say, or photographs of dogs) and you want a machine that produces fresh ones. The approach that has dominated the last few years is to learn a transformation from easy to hard: start from pure Gaussian noise, which you can sample trivially, and learn a map that carries that noise onto the data. Diffusion models do this by setting up a stochastic process that slowly destroys data into noise, then learning to run it backward. It works beautifully, and it is also a lot of machinery: a noising process, a score to estimate (the gradient of the noised data's log-density, the quantity diffusion learns), a schedule to tune, and a reverse-time stochastic differential equation to reason about.

Flow Matching, from Meta AI, throws most of that machinery out. The idea is short. Pick a fixed path that carries noise to data. Write down the velocity that travels along it. Train a network to copy that velocity by plain least-squares regression. There is no diffusion process, no score, no stochastic reversal, and no differential equation running inside the training loop. At generation time you take the learned velocity and follow it from noise to data with any off-the-shelf ODE solver.

The difficulty, and the reason the paper is clever rather than obvious, is that the velocity you actually want to copy is an average over the entire dataset and is hopeless to compute. The bulk of this post is the argument for why you can ignore that and regress a one-example stand-in instead, and get the same answer. A few ideas carry it: what a flow is, the objective you wish you could write, why it is intractable, and the step that makes it tractable. Then the result, a path borrowed from optimal transport that is straight and samples faster than diffusion.

A flow is a velocity field you follow

Every point in your space (every possible image, flattened to a vector in $\mathbb{R}^d$ ) has a little arrow attached to it. The arrows can change over time. That time-varying field of arrows is a vector field, written $v_t(x)$ : at time $t$ and location $x$ it tells you which way, and how fast, to move. Drop a particle into the field and let it ride: its position obeys an ordinary differential equation,

\frac{d}{dt}\phi_t(x) = v_t\big(\phi_t(x)\big), \qquad \phi_0(x) = x

(1)

where $\phi_t(x)$ is where the particle that started at $x$ has drifted to by time $t$ . The map $\phi_t$ is the flow. Run it on a whole cloud of starting points and the cloud gets stretched and reshaped. If we start the cloud as pure noise $p_0 = \mathcal{N}(0, I)$ and the flow reshapes it into the data distribution $p_1$ by time $t = 1$ , we have a generative model. Model the field $v_t$ with a neural network and this is a Continuous Normalizing Flow (CNF).
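To make equation (1) concrete, here is a minimal sketch that follows a hand-picked field with fixed-step Euler integration. The field $v_t(x) = -x$ is an assumption chosen only because its flow, $\phi_t(x) = x\,e^{-t}$, is known in closed form and gives something to compare against; the name `euler_flow` is illustrative and not from the paper.

```python
import numpy as np

def euler_flow(v, x, steps=1000):
    """Approximate phi_1(x) by Euler-integrating dx/dt = v(x, t) from t=0 to t=1."""
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt              # left endpoint of the current step
        x = x + dt * v(x, t)    # move a short distance along the arrow
    return x

# Toy field (illustrative, not from the paper): every arrow points at the origin.
v = lambda x, t: -x

x_start = np.array([[2.0, -1.0], [0.5, 3.0]])   # two particles in R^2
x_end = euler_flow(v, x_start)
exact = x_start * np.exp(-1.0)                  # closed-form flow at t = 1
print(np.max(np.abs(x_end - exact)))            # discretisation error of the solver
```

Swapping the toy `v` for a trained network is all that separates this loop from a sampler.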

A convention to fix now, because it trips people coming from diffusion: here $t = 0$ is noise and $t = 1$ is data. Generation always runs forward in time, from $0$ to $1$ . Diffusion papers usually go the other way (data at $t=0$ , noise at $t=1$ , then reverse), because they have a separate noising process to undo. Flow Matching has no such process to reverse, so it points the clock the natural way.

Below is the picture that anchors the rest of the post. A field of arrows, and noise particles riding it from a blob in the middle out onto the data.

Figure 1 · a flow carries noise to data

noise p₀ → data p₁ along v_t

A time-dependent velocity field v_t. Drop particles in as Gaussian noise at t=0 and let them ride the arrows; by t=1 they have landed on the four data clusters. A generative model is just this field.

CNFs are old and elegant, and almost nobody used them at scale. Training was the bottleneck. The classic way to fit a CNF is maximum likelihood, which needs you to solve the ODE (1) numerically on every training step to know where your samples went and how the density changed. Solving an ODE means many sequential evaluations of the network, and doing that inside every gradient step is painfully slow. People wanted a way to train the field directly, without simulating it.

The objective you wish you could use

Suppose a kind oracle handed you the right velocity field: the field $u_t(x)$ that, when you follow it, carries noise exactly onto the data. Then training is the easiest thing in the world. Regress your network $v_t$ onto it with squared error:

\mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t,\,p_t(x)}\big\| v_t(x) - u_t(x) \big\|^2, \qquad t \sim \mathcal{U}[0,1]

(2)

In words: pick a random time $t$ , pick a point $x$ from the cloud as it looks at that time (that is the $p_t(x)$ under the expectation), and push the network's arrow toward the oracle's arrow. Drive this loss to zero and your $v_t$ equals $u_t$ everywhere the cloud visits, so your flow generates the data. No ODE solve, no likelihood, no diffusion. This is the Flow Matching objective, and it is an unusually simple training loss.

There is one problem, and as stated it is fatal. We do not have the oracle. We have no idea what $u_t(x)$ is. There are infinitely many fields whose flow lands on the data, and even if we fixed a particular target path $p_t$ we want the cloud to follow, the field that generates it is an integral over the entire unknown data distribution. We cannot sample from $p_t(x)$ and we cannot evaluate $u_t(x)$ . The dream objective (2) is uncomputable. The rest of the method is a sequence of steps that makes it computable without changing its answer.

One example at a time

The escape begins by shrinking the problem. Set the dataset aside for a moment and look at a single data point $x_1$ . For that one point, define a simple path: a cloud that starts as the standard Gaussian at $t = 0$ and contracts onto a tight little blob centered at $x_1$ by $t = 1$ . Because it is conditioned on one example, call it a conditional probability path, $p_t(x \mid x_1)$ . The paper takes it to be Gaussian at every time:

p_t(x \mid x_1) = \mathcal{N}\big(x \,\big|\, \mu_t(x_1),\; \sigma_t(x_1)^2 I\big)

(3)

with a mean $\mu_t(x_1)$ and a scalar standard deviation $\sigma_t(x_1)$ that we get to design. The only requirements are the two endpoints. At $t = 0$ every conditional path must be the same standard noise, so $\mu_0(x_1) = 0$ and $\sigma_0(x_1) = 1$ . At $t = 1$ it must concentrate on its data point, so $\mu_1(x_1) = x_1$ and $\sigma_1(x_1) = \sigma_{\min}$ , a width small enough that the blob is essentially the point $x_1$ . Everything in between is free.

This is something we can actually compute. Sampling from $p_t(x \mid x_1)$ is one line, $x = \mu_t(x_1) + \sigma_t(x_1)\,\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$ , and the velocity that generates it (derived below) is a closed form. Below is one such path. Slide the time and watch the noise cloud march in and shrink onto a single $x_1$ . Toggle the schedule to compare the optimal-transport path with the diffusion one; both reach the same endpoints by different routes.

Figure 2 · one conditional path


A single conditional path p_t(x | x₁), drawn as a cloud of samples μ_t + σ_t·ε. It starts as the standard Gaussian at t=0 and contracts onto one data point x₁ at t=1. Slide t; toggle OT vs diffusion to see the two schedules for μ_t and σ_t.
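The one-line sampler is easy to try out. The sketch below assumes the linear schedule that the post later writes down as equation (10); the data point, the value of `sigma_min`, and the function names are made up for illustration.

```python
import numpy as np

def ot_schedule(t, x1, sigma_min=1e-4):
    """Mean and width of the linear schedule: mu_t = t*x1, sigma_t = 1 - (1 - sigma_min)*t."""
    return t * x1, 1.0 - (1.0 - sigma_min) * t

def sample_conditional(x1, t, n, rng, sigma_min=1e-4):
    """Draw n samples x = mu_t + sigma_t * eps from p_t(x | x1)."""
    mu, sigma = ot_schedule(t, x1, sigma_min)
    eps = rng.standard_normal((n, x1.shape[0]))
    return mu + sigma * eps

rng = np.random.default_rng(0)
x1 = np.array([3.0, -2.0])                      # one (made-up) data point
for t in (0.0, 0.5, 1.0):
    x = sample_conditional(x1, t, n=50_000, rng=rng)
    # Empirical mean and spread of the cloud at this time.
    print(t, x.mean(axis=0).round(2), x.std(axis=0).round(2))
```

The printed mean should drift from the origin toward `x1` while the spread shrinks from about 1 toward `sigma_min`.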

One conditional path is not a generative model. It covers only one data point. But a dataset is a pile of data points, and we are about to add them up.

Average the easy fields into the hard one

The bridge from one example back to the full distribution is a mixture. The conditional paths all start at the same noise and each ends at its own data point. If we mix them, weighting each by how likely its data point is under the data distribution $q(x_1)$ , we recover a path over the entire dataset. That mixture is the marginal probability path:

p_t(x) = \int p_t(x \mid x_1)\, q(x_1)\, dx_1

(4)

Check the endpoints. At $t = 0$ every conditional path is the same standard Gaussian, so the mixture is too: $p_0 = \mathcal{N}(0, I)$ . At $t = 1$ each conditional path is a spike at its $x_1$ , so the mixture is a spike at every data point in proportion to how common it is, which is the data distribution itself, $p_1 \approx q$ . So this mixture path is exactly the noise-to-data path we wanted, assembled out of per-example pieces.
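The endpoint check can be run numerically for a tiny empirical dataset, where the integral in (4) becomes a plain average over data points. This is a sketch under assumptions: a made-up 1-D dataset, the linear schedule from equation (10), and a deliberately large `sigma_min` so the final spikes are visible on a grid.

```python
import numpy as np

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def marginal_pdf(x, t, data, sigma_min=0.05):
    """p_t(x) for an empirical q: the average of the conditional Gaussians (eq. 4)."""
    mu = t * data[:, None]                     # [n_data, 1] conditional means
    sigma = 1.0 - (1.0 - sigma_min) * t        # shared width at time t
    return normal_pdf(x[None, :], mu, sigma).mean(axis=0)

data = np.array([-2.0, 0.5, 2.0])              # toy 1-D dataset (made up)
grid = np.linspace(-6.0, 6.0, 2401)
dx = grid[1] - grid[0]
for t in (0.0, 0.5, 1.0):
    p = marginal_pdf(grid, t, data)
    # Total mass on the grid, and the location of the highest peak.
    print(t, round(float(p.sum() * dx), 4), float(grid[p.argmax()]))
```

At `t = 0` the mixture is the standard Gaussian regardless of the data; at `t = 1` it is a set of narrow bumps, one per data point.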

The next part is less obvious. Each conditional path has its own generating velocity $u_t(x \mid x_1)$ . What single velocity field generates the mixture path? It would be lovely if it were just the average of the conditional velocities, and it almost is. You have to weight each conditional velocity by the posterior probability that the point $x$ came from data point $x_1$ , which by Bayes is $p_t(x \mid x_1)q(x_1)/p_t(x)$ :

u_t(x) = \int u_t(x \mid x_1)\,\frac{p_t(x \mid x_1)\, q(x_1)}{p_t(x)}\, dx_1

(5)

The weight measures how plausibly this $x$ came from the path aimed at this $x_1$ : each data point contributes to the velocity at $x$ , but a data point whose conditional path rarely reaches $x$ contributes almost nothing. So the marginal field at $x$ is a weighted average of the conditional velocities of the $x_1$ that could plausibly have produced it.
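For a finite dataset the integral in (5) is a softmax-weighted sum, which makes it easy to compute by hand. The sketch below assumes an empirical $q$ that puts equal mass on each data point (so $q(x_1)$ cancels in the weights) and uses the linear schedule and conditional velocity that the post defines later as equations (10) and (11). The dataset, probe, and function name are illustrative.

```python
import numpy as np

def marginal_velocity(x, t, data, sigma_min=0.0):
    """u_t(x) of eq. (5) for an empirical q, using the linear schedule of eq. (10)."""
    sigma = 1.0 - (1.0 - sigma_min) * t
    # Log of p_t(x | x1) up to a constant shared by every data point.
    log_w = -np.sum((x - t * data) ** 2, axis=1) / (2.0 * sigma**2)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                                # posterior over data points
    # Conditional velocities, eq. (11), one row per data point.
    u_cond = (data - (1.0 - sigma_min) * x) / sigma
    return w @ u_cond, w

data = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0]])   # toy dataset (made up)
probe = np.array([1.0, 0.2])
u, w = marginal_velocity(probe, t=0.5, data=data)
print("posterior weights:", w.round(3))
print("marginal velocity:", u.round(3))
```

This only works because the toy dataset is small enough to enumerate. With `sigma_min = 0` the function needs `t < 1`, since the width reaches zero at the endpoint.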

This is the paper's first key result: the marginal velocity (5) generates the marginal path (4). The proof is two lines of the continuity equation, the conservation law that says a velocity $v_t$ generates a path $p_t$ exactly when $\partial_t p_t + \nabla\!\cdot(p_t v_t) = 0$ . Differentiate the mixture (4) in time, swap the conditional velocity in using the fact that each $u_t(\cdot \mid x_1)$ generates its own $p_t(\cdot \mid x_1)$ , pull the divergence outside the integral, and the result is the continuity equation for $u_t$ and $p_t$ .
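Written out, the argument is the following chain. It is a sketch that assumes enough regularity to swap derivatives with the integral, which the paper states as part of its assumptions.

```latex
\begin{aligned}
\partial_t p_t(x)
  &= \int \partial_t p_t(x \mid x_1)\, q(x_1)\, dx_1 \\
  &= -\int \nabla\!\cdot\!\big(u_t(x \mid x_1)\, p_t(x \mid x_1)\big)\, q(x_1)\, dx_1 \\
  &= -\nabla\!\cdot\!\left(\int u_t(x \mid x_1)\, p_t(x \mid x_1)\, q(x_1)\, dx_1\right) \\
  &= -\nabla\!\cdot\!\big(u_t(x)\, p_t(x)\big)
\end{aligned}
```

The first line differentiates (4), the second applies the continuity equation to each conditional pair, and the last uses (5) multiplied through by $p_t(x)$.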

So the intractable oracle $u_t$ from the dream objective is not so mysterious after all. It is a posterior-weighted average of conditional velocities we can each write down. Drag the probe point below: each amber arrow is one data point's conditional push (fatter when that point is the more likely explanation of where the probe is), and the teal arrow is the average they produce, which is $u_t$ at that spot.

Figure 3 · the marginal field is an average


Drag the probe. Each amber arrow is one data point's conditional velocity u_t(x | x₁), weighted by the posterior that the probe came from it (the point swells with its weight). The teal arrow is their weighted average: the marginal velocity u_t(x) the oracle would have handed us.

We are closer, but not done. The formula (5) still has the data distribution $q(x_1)$ and the marginal density $p_t(x)$ inside it, both unknown, both integrals over the entire dataset. We still cannot compute $u_t(x)$ to regress against. The last step removes both unknowns at once.

Regress the easy target, land on the hard one

Instead of regressing against the intractable marginal velocity, regress against the per-example conditional velocity. Replace the oracle target $u_t(x)$ in the dream objective (2) with $u_t(x \mid x_1)$ , and sample $x$ from that one example's conditional path. This is the Conditional Flow Matching (CFM) objective:

\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t,\,q(x_1),\,p_t(x \mid x_1)}\big\| v_t(x) - u_t(x \mid x_1) \big\|^2

(6)

Every piece of this is computable. Draw a real data point $x_1$ from your dataset, draw a time $t$ , draw a sample $x$ on that example's conditional path, and regress the network onto the closed-form conditional velocity. No marginal, no $q(x_1)$ integral, no $p_t(x)$ .

And the claim that makes it all work: the CFM objective (6) and the Flow Matching objective (2) have identical gradients in $\theta$. Minimizing the easy one is exactly minimizing the hard one. Identical gradients everywhere means the two losses share every stationary point, and gradient descent on the cheap conditional loss follows, in expectation, the same path as gradient descent on the impossible marginal loss. (A minibatch estimate of the conditional gradient is noisier, since each sample sees a single arrow instead of the average, but it is unbiased.) Given enough capacity, your network converges to the marginal velocity $u_t$ even though you only ever showed it conditional velocities.

Why is that true? It is not a coincidence, it is a property of squared error. Expand both losses with $\|v - u\|^2 = \|v\|^2 - 2\langle v, u\rangle + \|u\|^2$ . The $\|u\|^2$ term has no $\theta$ in it, so it drops out of the gradient. The $\|v\|^2$ term is identical in both losses once you note that averaging over the marginal $p_t(x)$ is the same as averaging over $q(x_1)$ and then $p_t(x \mid x_1)$ , since one is the mixture of the other. The only term that could differ is the cross term $\langle v, u\rangle$ , and plugging the definition of the marginal velocity (5) into it shows the marginal cross term and the conditional cross term integrate to the same thing. The two losses differ by a constant that does not depend on $\theta$ , so

\nabla_\theta\, \mathcal{L}_{\text{FM}}(\theta) = \nabla_\theta\, \mathcal{L}_{\text{CFM}}(\theta)

(7)
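For readers who want the cross-term step spelled out, here is the calculation, a sketch that assumes the integrals can be exchanged freely:

```latex
\begin{aligned}
\mathbb{E}_{p_t(x)}\big\langle v_t(x),\, u_t(x)\big\rangle
  &= \int p_t(x)\,\Big\langle v_t(x),\, \int u_t(x \mid x_1)\,\frac{p_t(x \mid x_1)\,q(x_1)}{p_t(x)}\,dx_1 \Big\rangle\, dx \\
  &= \iint \big\langle v_t(x),\, u_t(x \mid x_1)\big\rangle\, p_t(x \mid x_1)\, q(x_1)\, dx_1\, dx \\
  &= \mathbb{E}_{q(x_1),\,p_t(x \mid x_1)}\big\langle v_t(x),\, u_t(x \mid x_1)\big\rangle
\end{aligned}
```

The $p_t(x)$ outside cancels the $p_t(x)$ in the denominator of (5), which is the whole trick.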

There is a one-sentence argument that skips the algebra. The minimizer of a least-squares regression is the conditional mean of its target given the input. The target in CFM is the conditional velocity $u_t(x \mid x_1)$ for a random $x_1$, so the minimizer is $\mathbb{E}[u_t(x \mid x_1) \mid x]$, the average conditional velocity over all the data points that could have produced this $x$ at time $t$. That average is the marginal velocity (5). You are training against single arrows and landing on their average, because that is what regression does. This is the same move that lets denoising score matching train a score without ever knowing the true score, generalized from scores to velocities.
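The regression argument can be checked by brute force on a toy problem. The sketch below assumes a made-up 1-D dataset with two equally likely points at $\pm 1$, the straight path with $\sigma_{\min} = 0$, and a crude stand-in for regression: averaging the single-example targets of all samples that land near one probe location. It then compares that average with the marginal velocity computed from (5).

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, probe = 200_000, 0.5, 0.5

x1 = rng.choice([-1.0, 1.0], size=n)        # toy 1-D dataset: two points, equally likely
x0 = rng.standard_normal(n)                 # noise samples
xt = (1 - t) * x0 + t * x1                  # points on the straight conditional paths
target = x1 - x0                            # the single-example arrows CFM regresses on

# "Regression" at one location: average the targets of samples landing near the probe.
near = np.abs(xt - probe) < 0.05
mc_mean = target[near].mean()

# Marginal velocity from eq. (5): posterior-weighted conditional velocities.
data = np.array([-1.0, 1.0])
log_w = -((probe - t * data) ** 2) / (2 * (1 - t) ** 2)
w = np.exp(log_w - log_w.max())
w /= w.sum()
u_marginal = np.sum(w * (data - probe) / (1 - t))

print(mc_mean, u_marginal)                  # the two agree up to Monte Carlo noise
```

No sample ever saw the marginal velocity, yet the local average of the individual arrows lands on it.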

Everything from here is choosing a good conditional path so the arrows are easy to learn and cheap to follow. But first, make the loop concrete.

A worked step

The optimal-transport path we are about to define, with $\sigma_{\min} = 0$ for cleanliness, gives the simplest case. The conditional path is the straight interpolation $x_t = (1-t)\,x_0 + t\,x_1$ from a noise sample $x_0$ to the data point $x_1$ , and its velocity is the constant $x_1 - x_0$ . One training step on a batch of $n$ images, each a $d$ -vector:

# one Conditional Flow Matching step, OT path (sigma_min = 0)
x1 = sample_data()                 # a real data point     [n, d]
x0 = randn_like(x1)                # a noise sample        [n, d]
t  = rand(n, 1)                    # times ~ U[0,1]        [n, 1]
xt = (1 - t) * x0 + t * x1         # point on the straight path
target = x1 - x0                   # the constant velocity (the answer)
loss = mse(v(xt, t), target)       # regress the arrow
loss.backward()                    # no ODE solve, no diffusion

Concretely, for CIFAR-10 at $32\times32\times3$ a batch might be $x_1$ of shape $[256, 3072]$ , $x_0$ the same shape of fresh Gaussian noise, $t$ of shape $[256, 1]$ , the interpolation point $x_t$ of shape $[256, 3072]$ , and the regression target $x_1 - x_0$ also $[256, 3072]$ . The network $v$ (a U-Net) takes the noised image and the time and predicts a $[256, 3072]$ velocity. The loss is one mean-squared error. No ODE was solved, no noise schedule was consulted, no score was estimated.
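The shape bookkeeping can be checked without a network. This is a NumPy sketch of the data side of one step only; `cfm_batch` is a hypothetical helper name, and the uniform random array merely stands in for a batch of flattened CIFAR-10 images.

```python
import numpy as np

def cfm_batch(x1, rng):
    """Build (x_t, t, target) for one CFM step on the sigma_min = 0 straight path."""
    n = x1.shape[0]
    x0 = rng.standard_normal(x1.shape)      # fresh Gaussian noise, same shape as the data
    t = rng.uniform(size=(n, 1))            # one time per example, broadcast over pixels
    xt = (1 - t) * x0 + t * x1              # point on the straight path
    target = x1 - x0                        # constant velocity along that path
    return xt, t, target

rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, size=(256, 3 * 32 * 32))   # stand-in for a CIFAR-10 batch
xt, t, target = cfm_batch(x1, rng)
print(xt.shape, t.shape, target.shape)
```

A useful sanity check on any implementation: following the target for the remaining time, `xt + (1 - t) * target`, must land exactly on `x1`.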

To generate, you do the one thing CFM never did during training: actually solve the ODE, forward from noise to data.

# generation: solve the ODE forward, t = 0 -> 1
x = randn(n, d)                    # start from pure noise (p0)
dt = 1 / steps                     # fixed step size
for i in range(steps):             # any off-the-shelf ODE solver
    t = i * dt                     # left endpoint: 0, dt, ..., 1 - dt
    x = x + dt * v(x, t)           # Euler step along the learned field
# x is now an approximate sample from the data distribution (p1)
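Here is a runnable version of that sampling loop. There is no trained network in this sketch: as a stand-in (an assumption made for illustration), `v` is the closed-form conditional field for a single made-up data point `c`, with `s` playing the role of $\sigma_{\min}$. The function name `sample` is illustrative.

```python
import numpy as np

def sample(v, n, d, steps, rng):
    """Fixed-step Euler solve of dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data)."""
    x = rng.standard_normal((n, d))         # start from pure noise (p0)
    x_start = x.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * v(x, i * dt)           # evaluate the field at the left endpoint
    return x_start, x

# Stand-in for a trained network (assumption): the exact field for ONE data point c,
# u_t(x | c) = (c - (1 - s) x) / (1 - (1 - s) t), with s playing the role of sigma_min.
c, s = np.array([3.0, -2.0]), 0.1
v = lambda x, t: (c - (1 - s) * x) / (1 - (1 - s) * t)

x_start, x_end = sample(v, n=1000, d=2, steps=10, rng=np.random.default_rng(0))
print(x_end.mean(axis=0), x_end.std(axis=0))
```

Because every trajectory of this particular field is a straight, constant-speed line, Euler reproduces the exact endpoint `s * x_start + c` with any number of steps. A learned marginal field is not that forgiving.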

A formula for the target arrow

The worked step leaned on a closed form for the conditional velocity. It comes from one construction that covers every Gaussian path at once. Take the Gaussian conditional path (3) and build the obvious flow that produces it: start with a standard normal sample and stretch it by the standard deviation, then shift it by the mean,

\psi_t(x) = \sigma_t(x_1)\,x + \mu_t(x_1)

(8)

When $x \sim \mathcal{N}(0, I)$, the output $\psi_t(x)$ is exactly $\mathcal{N}(\mu_t, \sigma_t^2 I)$, so this affine map pushes the noise onto the conditional path. Its velocity is forced. By the flow definition (1), the velocity at the point $y = \psi_t(x)$ must equal $\frac{d}{dt}\psi_t(x) = \sigma_t'\,x + \mu_t'$; inverting the affine map to write $x = (y - \mu_t)/\sigma_t$ and renaming $y$ back to $x$ gives the unique velocity that defines this flow:

u_t(x \mid x_1) = \frac{\sigma_t'(x_1)}{\sigma_t(x_1)}\big(x - \mu_t(x_1)\big) + \mu_t'(x_1)

(9)

where the primes are time derivatives. Hand this formula any mean schedule $\mu_t$ and width schedule $\sigma_t$ with the right endpoints and it gives the exact velocity to regress. Two choices recover the diffusion world.

The variance-exploding path keeps the mean at the data point and grows the noise: $\mu_t(x_1) = x_1$, $\sigma_t(x_1) = \sigma_{1-t}$ for an increasing schedule $\sigma_t$. Plug into (9) and the $\mu_t'$ term vanishes, leaving $u_t(x \mid x_1) = -\frac{\sigma_{1-t}'}{\sigma_{1-t}}\,(x - x_1)$. The variance-preserving path, the one behind DDPM (denoising diffusion probabilistic models), scales the mean down and the variance up to compensate: $\mu_t(x_1) = \alpha_{1-t}\,x_1$ and $\sigma_t(x_1) = \sqrt{1 - \alpha_{1-t}^2}$, with $\alpha_t = e^{-\frac12 T(t)}$ and $T(t) = \int_0^t \beta(s)\,ds$. Plug those in and (9) reproduces the velocity of the probability-flow ODE that score-based diffusion already uses (the deterministic counterpart of a diffusion process, which reproduces its densities without the noise). So diffusion paths are not a rival to Flow Matching. They are points inside its menu, reachable by a particular pair of schedules $(\mu_t, \sigma_t)$.

That reframing changes how you see diffusion. The apparatus of forward and reverse stochastic processes was one way to arrive at a Gaussian path. Flow Matching skips the process and writes the path down. Even when you choose the diffusion path, training the network with the CFM velocity loss (instead of score matching) is more stable in the paper's experiments. And once you see the schedules as free parameters, you can ask for a path no diffusion produces.
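To see the variance-preserving path as just another pair of schedules, the sketch below plugs $\mu_t = \alpha_{1-t}\,x_1$ and $\sigma_t = \sqrt{1 - \alpha_{1-t}^2}$ into formula (9) and checks the result against a finite-difference time derivative of the affine flow (8). The linear $\beta$ schedule and its endpoint values are assumptions (values commonly used for VP diffusion), and the sample points are made up.

```python
import numpy as np

# Linear beta schedule; the endpoint values are an assumption, not taken from this post.
BETA_MIN, BETA_MAX = 0.1, 20.0

def beta(s):
    return BETA_MIN + s * (BETA_MAX - BETA_MIN)

def alpha(s):
    """alpha_s = exp(-T(s) / 2), with T(s) the integral of beta from 0 to s."""
    return np.exp(-0.5 * (BETA_MIN * s + 0.5 * (BETA_MAX - BETA_MIN) * s**2))

def vp_path(t):
    """Mean coefficient and width of the VP conditional path at flow time t."""
    a = alpha(1.0 - t)                      # flow time runs opposite to diffusion time
    return a, np.sqrt(1.0 - a**2)

def vp_velocity(x, t, x1):
    """Eq. (9) for the VP path, with the schedule derivatives written out by hand."""
    a, sigma = vp_path(t)
    d_a = 0.5 * beta(1.0 - t) * a           # d/dt alpha(1 - t)
    d_sigma = -a * d_a / sigma              # d/dt sqrt(1 - alpha(1 - t)^2)
    return d_sigma / sigma * (x - a * x1) + d_a * x1

def psi(x0, t, x1):
    """Affine flow of eq. (8) for this schedule."""
    a, sigma = vp_path(t)
    return sigma * x0 + a * x1

x0, x1, t, h = np.array([0.4, 1.3]), np.array([3.0, -2.0]), 0.6, 1e-5
lhs = (psi(x0, t + h, x1) - psi(x0, t - h, x1)) / (2 * h)   # d/dt psi_t(x0)
rhs = vp_velocity(psi(x0, t, x1), t, x1)                    # u_t(psi_t(x0) | x1)
print(lhs, rhs)
```

One detail the schedule makes visible: at $t = 0$ the mean coefficient is small but not exactly zero, so this path starts only approximately at the standard Gaussian.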

The straight-line path

What is the simplest possible pair of schedules? Move the mean and the width in straight lines:

\mu_t(x_1) = t\,x_1, \qquad \sigma_t(x_1) = 1 - (1 - \sigma_{\min})\,t

(10)

At $t = 0$ this is the standard Gaussian ( $\mu = 0,\ \sigma = 1$ ); at $t = 1$ it is the blob at $x_1$ ( $\mu = x_1,\ \sigma = \sigma_{\min}$ ). Both endpoints satisfied, by the most boring interpolation there is. Feed (10) into the velocity formula (9) and the time-derivatives are constants, giving

u_t(x \mid x_1) = \frac{x_1 - (1 - \sigma_{\min})\,x}{1 - (1 - \sigma_{\min})\,t}

(11)

and the flow that carries a noise sample $x_0$ along it is the straight line $\psi_t(x_0) = \big(1 - (1 - \sigma_{\min})t\big)x_0 + t\,x_1$ . Substitute into the CFM loss and the regression target collapses to a single constant vector,

\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t,\,q(x_1),\,p(x_0)}\Big\| v_t\big(\psi_t(x_0)\big) - \big(x_1 - (1 - \sigma_{\min})\,x_0\big) \Big\|^2

(12)

The target $x_1 - (1-\sigma_{\min})x_0$ does not depend on $t$ at all. It is the arrow that points from the noise sample straight at the data sample, and the particle travels that arrow at constant speed. (With $\sigma_{\min} = 0$ it reduces to $x_1 - x_0$ , the form the released libraries use.)
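A quick numerical check of that claim: evaluate the conditional velocity (11) at the point where the flow has carried $x_0$ by time $t$, and compare it with the target in (12). The specific vectors and the value of `sigma_min` below are made up for illustration.

```python
import numpy as np

sigma_min = 0.01
x0 = np.array([0.4, 1.3])                   # a noise sample (made up)
x1 = np.array([3.0, -2.0])                  # a data point (made up)

def psi(t):
    """Straight-line flow of eq. (10): where x0 sits at time t."""
    return (1 - (1 - sigma_min) * t) * x0 + t * x1

def u(x, t):
    """Conditional velocity of eq. (11)."""
    return (x1 - (1 - sigma_min) * x) / (1 - (1 - sigma_min) * t)

target = x1 - (1 - sigma_min) * x0          # regression target in eq. (12)
for t in np.linspace(0.0, 1.0, 5):
    # Velocity seen by the particle at time t, next to the constant target.
    print(round(float(t), 2), u(psi(t), t), target)
```

The field $u_t(x \mid x_1)$ itself varies with $x$ and $t$; it is only along each particle's own trajectory that it stays constant.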

The $\sigma_{\min} = 0$ case is the version almost everyone actually runs. The blob at $x_1$ shrinks to the exact point, the interpolation becomes the clean straight line $x_t = (1-t)\,x_0 + t\,x_1$, and the target is the plain difference $x_1 - x_0$. This is exactly the Rectified Flow recipe of Liu et al., concurrent work that arrived at the same straight-line target from a different direction. The paper itself keeps a small positive $\sigma_{\min}$ as the general form, because a point mass at $x_1$ is not a density and a width of zero would break the likelihood computations the framework also supports; the $(1 - \sigma_{\min})$ factor is the price of keeping $p_t(x \mid x_1)$ a proper Gaussian at the endpoint. For training samples the difference is numerically tiny, which is why the libraries drop it.

That straight-line flow $\psi_t$ is the optimal-transport displacement map between the two Gaussians: of all the ways to morph the prior into the data blob, this is the one that moves mass the shortest total distance, in straight lines, at constant speed. Optimal transport is the mathematics of moving a pile of sand to a new shape with the least total carrying, and McCann showed that between two Gaussians the answer is exactly this linear interpolation. The diffusion path, by contrast, takes a curved detour: it moves the sample little early on and most of the distance near the end, and its trajectories can curve past the target before returning.

The figure below shows both. Each line is one noise sample being carried to the same $x_1$ . The optimal-transport flow draws straight, evenly-paced lines. Toggle to diffusion and the same endpoints get connected by curves that bow outward and pass beyond the target.

Figure 4 · straight vs curved trajectories

ψ_t(x₀) = (1 − (1 − σ_min)t)·x₀ + t·x₁

Conditional trajectories carrying noise samples to one data point x₁. The OT flow moves in straight, constant-speed lines. Toggle to the diffusion flow and the same endpoints are joined by curves that bow and can overshoot before settling.

A caution the paper is careful to state, and so will we. The conditional flow is optimal transport, but the marginal flow (the average over all data points) is not, in general, an optimal-transport map. Straight conditional lines do not guarantee straight marginal trajectories. Still, the marginal field stays relatively simple, and that simplicity lowers the cost at sampling time.
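The caution is easy to reproduce on a toy problem. The sketch below assumes a made-up dataset of two points, the straight conditional path with $\sigma_{\min} = 0$, and the exact marginal field from (5) in place of a trained network. It follows one particle with a fine Euler solve and records its path.

```python
import numpy as np

data = np.array([[1.0, 1.0], [1.0, -1.0]])  # toy dataset: two points (made up)

def marginal_velocity(x, t):
    """Posterior-weighted average of straight-line velocities (sigma_min = 0)."""
    log_w = -np.sum((x - t * data) ** 2, axis=1) / (2 * (1 - t) ** 2)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return w @ (data - x) / (1 - t)

steps = 1000
dt = 1.0 / steps
x = np.array([0.0, 0.5])                    # one noise sample, above the symmetry axis
path = [x.copy()]
for i in range(steps):
    x = x + dt * marginal_velocity(x, i * dt)
    path.append(x.copy())
path = np.array(path)

print("start:", path[0], "end:", path[-1])
print("lowest y on the way:", path[:, 1].min())
```

The particle starts at height 0.5 and ends at height 1, but it first dips toward the dataset mean before committing to the upper point, so its trajectory bends even though every conditional path is a straight line.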

Why straight is cheap to sample

Generation means solving the ODE (1) from $t = 0$ to $t = 1$ , and a solver works by taking discrete steps. The cost is counted in function evaluations (NFE), one network call per step. Fewer steps means cheaper sampling, and how few you can get away with depends entirely on how curved the trajectory is.

A straight, constant-speed path is the easy case. A solver that assumes straight lines between steps tracks it almost perfectly even with a handful of steps. On a curved path the same coarse solver follows the tangent at the start of each step, so it misses every bend and drifts off the true trajectory, and you have to add steps to keep up. That is the practical edge of the optimal-transport path. The paper measures it directly: at matched numerical error, Flow Matching with the OT path needs roughly 60% of the function evaluations the diffusion path needs.
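The effect shows up even on a single conditional path. The sketch below Euler-integrates the velocity formula (9) for two schedules and measures the distance to the exact endpoint $\psi_1(x_0)$. It is not the paper's experiment: the curved schedule is a trigonometric stand-in chosen for illustration (an assumption, not the diffusion schedule), the points are made up, and the comparison is on conditional rather than marginal fields.

```python
import numpy as np

def euler_error(mu, sigma, d_mu, d_sigma, x0, steps):
    """Euler-integrate eq. (9) and return the distance to the exact endpoint psi_1(x0)."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * (d_sigma(t) / sigma(t) * (x - mu(t)) + d_mu(t))
    exact = sigma(1.0) * x0 + mu(1.0)
    return float(np.linalg.norm(x - exact))

x0, x1 = np.array([0.4, 1.3]), np.array([3.0, -2.0])   # made-up noise and data points
half_pi = np.pi / 2

# Straight path: mu_t = t x1, sigma_t = 1 - t.
straight = dict(mu=lambda t: t * x1, sigma=lambda t: 1 - t,
                d_mu=lambda t: x1, d_sigma=lambda t: -1.0)
# Curved stand-in (assumption, not the paper's diffusion schedule):
# mu_t = sin(pi t / 2) x1, sigma_t = cos(pi t / 2).
curved = dict(mu=lambda t: np.sin(half_pi * t) * x1, sigma=lambda t: np.cos(half_pi * t),
              d_mu=lambda t: half_pi * np.cos(half_pi * t) * x1,
              d_sigma=lambda t: -half_pi * np.sin(half_pi * t))

for steps in (2, 8, 32):
    # Step count, then endpoint error on the straight and the curved path.
    print(steps, euler_error(x0=x0, steps=steps, **straight),
          euler_error(x0=x0, steps=steps, **curved))
```

On the straight path the error sits at floating-point noise for every step count, while on the curved path it shrinks only as steps are added.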

The figure below shows this. We integrate the marginal field with $N$ Euler steps and check what fraction of particles land on the data. The optimal-transport field lands them with very few steps. Toggle to the curved diffusion-like field and the same step budget leaves particles stranded between clusters until you crank $N$ up.

Figure 5 · steps you can afford


Euler-integrating the field with N steps. The straight OT field lands its particles on the data with few steps. Toggle to the curved diffusion field and the same N cuts corners and strands particles, so you need many more steps for the same quality. At N = 1 even the OT field fails: at t = 0 every mode pulls equally, so a single Euler step dumps every particle on the mixture mean. Straight conditional paths make the marginal field cheap to integrate, not one-step.
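The one-step failure can be reproduced in a few lines. This sketch uses a made-up dataset of four points and the exact marginal field from (5) with $\sigma_{\min} = 0$ as a stand-in for a trained network; the function names are illustrative.

```python
import numpy as np

data = np.array([[2.0, 2.0], [-2.0, 2.0], [-2.0, -2.0], [2.0, -2.0]])  # four toy clusters

def marginal_velocity(x, t):
    """Exact marginal field for the empirical dataset, sigma_min = 0 (one particle x)."""
    log_w = -np.sum((x - t * data) ** 2, axis=1) / (2 * (1 - t) ** 2)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return w @ (data - x) / (1 - t)

def euler_sample(x0, steps):
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        x = x + dt * marginal_velocity(x, i * dt)
    return x

rng = np.random.default_rng(0)
noise = rng.standard_normal((5, 2))
one_step = np.array([euler_sample(x0, steps=1) for x0 in noise])
many_steps = np.array([euler_sample(x0, steps=200) for x0 in noise])
print(one_step)      # every row is the dataset mean
print(many_steps)    # each row sits on one of the four data points
```

At $t = 0$ all four conditional Gaussians coincide, so the posterior weights are uniform and the field is exactly `data.mean(axis=0) - x`; a single step of size 1 therefore sends every particle to the mean.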

Better samples, fewer steps

The headline is that a loss this plain is competitive with, and often better than, the carefully-engineered diffusion training it replaces, on the same architecture. The authors take one U-Net and train it three ways: standard score matching with a diffusion path, Flow Matching with that same diffusion path, and Flow Matching with the optimal-transport path. On CIFAR-10 and ImageNet at 32, 64, and 128 the OT version wins across the board.

On ImageNet 32 the OT model reaches FID 5.02 against 5.68 for score matching and 6.99 for DDPM, while needing 122 function evaluations to score matching's 178 and DDPM's 262. (FID, the Fréchet Inception Distance, scores sample quality; lower is better. NFE counts the network calls an adaptive solver needs.) On CIFAR-10 it posts FID 6.35 at 2.99 bits per dimension (a likelihood score, the cost, in bits, of compressing each pixel under the model; lower is better). On ImageNet 64 it is FID 14.45 at 138 NFE. At 128 it reaches FID 20.9, the first CNF trained at that resolution. Better samples, better likelihoods, and fewer steps to produce them, all from the same network and the plainest possible loss.

This paper had a lasting effect in three areas. The CFM result (regress per-example targets, converge to their average) is the engine, and it works for any path, not just Gaussians. The reframing of diffusion as one Gaussian path among many removed the mystique from a lot of generative modeling and let people design paths on purpose. And the straight optimal-transport path, with its constant-velocity target $x_1 - x_0$ , became the default recipe almost everywhere downstream. When you read that a model was trained by "regressing $x_1 - x_0$ on the interpolation between noise and data," that is this paper, with $\sigma_{\min}$ set to zero. Stable Diffusion 3 and the current generation of large flow models are built on it.

The limits are clear. The conditional flow being optimal transport does not make the learned marginal flow optimal transport, so the trajectories are simpler but not perfectly straight, and very few sampling steps still cost you some quality. The Gaussian-path family, broad as it is, is still a family; richer paths (non-isotropic, non-Gaussian, on curved spaces) are left as future work the framework invites. None of that has slowed it down. Flow Matching took the most elegant generative model nobody could train and made it the easiest one to train, by noticing you never needed the hard target in the first place. You just needed to show the network one arrow at a time.

Provenance: verified against primary literature

Flow Matching (2022). Lipman, Chen, Ben-Hamu, Nickel, Le: the FM and CFM objectives, the equal-gradients theorem, the Gaussian-path VF formula, and the OT path.

flow_matching (code). Official library (facebookresearch/flow_matching). CondOTScheduler: α_t = t, σ_t = 1−t; the affine path x_t = α_t x₁ + σ_t x₀ and target velocity α̇_t x₁ + σ̇_t x₀.

McCann (1997). The displacement interpolation between two Gaussians, which makes the OT conditional flow a straight line.

Neural ODEs / CNF (2018). Chen et al.: modeling a flow as the solution of an ODE driven by a learned velocity field.

Denoising score matching. Vincent (2011): the conditional-vs-marginal trick FM generalizes from scores to velocities.

Correction. The paper keeps a small σ_min > 0, so the OT regression target is x₁ − (1−σ_min)x₀. The official library (and most downstream code) set σ_min = 0, making the target the clean straight line x₁ − x₀; that σ_min = 0 case is exactly Rectified Flow (Liu et al., concurrent). We teach the σ_min form and note the simplification.

Questions you might still have

If we never compute the true marginal velocity, how can regressing the conditional one be right?
The two losses differ only by a constant that does not depend on θ, so their gradients are identical (Theorem 2). The minimizer of the easy per-example loss is the conditional expectation of the conditional targets, which is exactly the marginal velocity. You train against single-example arrows and land on the average without ever forming it.

What is σ_min, and why not just set it to zero?
It is the width of the little Gaussian sitting on each data point at t = 1. A tiny σ_min keeps p₁(x | x₁) a real density (needed for likelihoods), so the paper keeps the (1−σ_min) factor. The released libraries usually take σ_min = 0, which makes the OT target the clean straight-line velocity x₁ − x₀. We flag the difference in Provenance.

Is Flow Matching just diffusion with extra steps?
No. Diffusion is one choice of path inside the framework (the variance-preserving Gaussian path is a special case). Flow Matching lets you pick the path directly, and the optimal-transport path is a different, straighter one that diffusion never produces. Same objective shape, a strictly larger menu.

Does the network predict the data, the noise, or the velocity?
The velocity. With the OT path the regression target is the constant vector x₁ − (1−σ_min)x₀, which points from the noise sample straight at the data sample. Diffusion methods instead regress the score or the noise; those are reparameterizations of the same information, but you integrate the velocity at sampling time.

Footnotes & further reading

The paper: Lipman, Chen, Ben-Hamu, Nickel, Le, Flow Matching for Generative Modeling (Meta AI / Weizmann, ICLR 2023). See also the official code and the authors' Flow Matching Guide and Code.
Continuous Normalizing Flows and the Neural ODE: Chen, Rubanova, Bettencourt, Duvenaud, Neural Ordinary Differential Equations.
The conditional-vs-marginal regression trick FM generalizes from scores to velocities: Vincent, A Connection Between Score Matching and Denoising Autoencoders (2011).
Diffusion as a score-based SDE with a probability-flow ODE, the family the Gaussian paths recover: Song et al., Score-Based Generative Modeling through SDEs, and Ho, Jain, Abbeel, DDPM.
The optimal-transport displacement interpolation between two Gaussians: McCann, A Convexity Principle for Interacting Gases (1997).
The concurrent σ_min = 0 special case: Liu, Gong, Liu, Flow Straight and Fast (Rectified Flow), and Albergo & Vanden-Eijnden, stochastic interpolants.
