Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency
Distill a diffusion model down to a few steps, and keep both its sharpness and its variety.
Continuous-time consistency models distill cleanly but blur fine detail. Score distillation sharpens but collapses to a few outputs. rCM runs both at once, and scales the combination to 14-billion-parameter image and video models.
A state-of-the-art video diffusion model spends dozens of slow denoising steps, each one a full pass through a 14-billion-parameter network, to make five seconds of video. rCM does it in one to four steps, 15 to 50 times faster, and unlike the previous best fast model it does not keep producing the same handful of clips.
Diffusion is slow, so we distill it
A diffusion model generates by reversing noise. It starts from a featureless Gaussian blob and walks, one small step at a time, back onto the thin sheet where real images and videos live. The quality is the best we have. The speed is not. Each step is a full forward pass through the network, and you need many of them, often a few dozen, because each step only solves a little slice of an ordinary differential equation and the solver's error grows if you take bigger steps. Worse, most models run every step twice, once with the text prompt and once without, and combine the two: that is classifier-free guidance, and it doubles the cost. So a single video can cost a hundred network passes.
Distillation is the standard cure: train a fast student network to reproduce, in a handful of steps, what a slow teacher produces in many. The number that matters is NFE, the count of network forward passes (function evaluations) it takes to make one sample. The teacher's NFE is dozens times two; the dream is a student whose NFE is one to four. That is a 15 to 50 times speedup, and it is what rCM delivers.
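The NFE bookkeeping above is simple enough to write down. This is a minimal sketch: the 50-step teacher is a hypothetical figure chosen only so that the total matches the "hundred network passes" mentioned earlier, and `nfe` is an illustrative helper, not anything from the paper.

```python
def nfe(steps: int, cfg: bool) -> int:
    """Network forward passes for one sample; guidance doubles every step."""
    return steps * (2 if cfg else 1)


# Hypothetical teacher: 50 solver steps, each run with and without the prompt.
teacher_nfe = nfe(steps=50, cfg=True)

# Few-step student: one pass per step, no guidance doubling.
speedups = {n: teacher_nfe / nfe(steps=n, cfg=False) for n in (1, 2, 4)}

for n, ratio in speedups.items():
    print(f"{n}-step student: {ratio:.0f}x fewer network passes")
```

With a different teacher step count the ratios shift, which is why the speedup is quoted as a range rather than a single number.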
The trouble is that the fast students people knew how to train each gave something up. To see why, and to see what rCM does about it, you only need a few ideas: that distillation methods split into two opposite camps by the kind of divergence they minimize; what a consistency model is and how it samples in a few steps; how the continuous-time version (sCM) works and why it was hard to scale; and why a small, carefully chosen dose of the other camp's objective repairs it. Those four ideas are the method.
Two opposite ways to distill
Every distillation method is trying to make the student's distribution of outputs match the teacher's. The deep split is in which way you measure the mismatch, because the two natural choices behave like opposite personalities. A divergence is a number that is zero when two distributions agree and grows as they differ; the asymmetry is in which distribution you average over.
A forward divergence is averaged over the data: it penalizes the student wherever real data exists but the student assigns little probability. To avoid that penalty everywhere, the student spreads out to cover every mode of the data. This is mode-covering. It keeps variety, but a student of limited capacity, forced to cover everything, smears probability into the empty gaps between modes, and that smear shows up as blur. A reverse divergence is averaged over the student's own samples: it penalizes the student for putting mass where the teacher has little. The safe move is to retreat onto one mode and nail it. This is mode-seeking. It produces sharp, clean samples, but it tends to collapse onto a few of them, abandoning the rest.
The dichotomy is concrete on a two-mode target. Moving from the forward end to the reverse end, a single model distribution goes from broad-and-covering-both to narrow-and-locked-on-one: at the forward end both modes stay covered (diverse); at the reverse end one is abandoned (collapse).
The blur on the forward side is not a law of nature. A single Gaussian cannot be both sharp and two-humped, so forced to cover both modes it sits in the middle and smears. A high-capacity model has no such limit: it can put sharp, separate mass on every mode at once, borrowing the strength of each divergence without its weakness. That loophole, a student that is both sharp and diverse, is the opening at the center of the paper.
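The single-Gaussian case can be checked numerically. The sketch below is an illustration under assumed toy numbers (modes at -3 and +3, width 0.5), not an experiment from the paper: it scores one wide Gaussian and one narrow Gaussian against a two-mode target under both divergences, using a plain Riemann sum.

```python
import math


def log_normal(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def log_target(x):
    """Two equal, well-separated modes at -3 and +3 (toy values)."""
    a = log_normal(x, -3.0, 0.5) + math.log(0.5)
    b = log_normal(x, 3.0, 0.5) + math.log(0.5)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def log_wide(x):
    """One Gaussian stretched over both modes (moment-matched variance)."""
    return log_normal(x, 0.0, math.sqrt(9.25))


def log_narrow(x):
    """One Gaussian locked onto the right-hand mode."""
    return log_normal(x, 3.0, 0.5)


def kl(log_a, log_b, lo=-15.0, hi=15.0, n=6001):
    """Riemann-sum estimate of KL(a || b), averaged over a."""
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * dx
        la = log_a(x)
        total += math.exp(la) * (la - log_b(x)) * dx
    return total


# Forward: averaged over the target. Reverse: averaged over the model.
forward = {"wide": kl(log_target, log_wide), "narrow": kl(log_target, log_narrow)}
reverse = {"wide": kl(log_wide, log_target), "narrow": kl(log_narrow, log_target)}
print("forward KL:", forward)
print("reverse KL:", reverse)
```

The forward divergence prefers the wide, smeared fit and the reverse divergence prefers the narrow, single-mode fit, which is the asymmetry described above for a model too simple to be both sharp and two-humped.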
The map onto real methods is direct. Consistency models, including the continuous-time sCM this paper scales up, are trained on offline data (real samples, or the teacher's trajectory) and behave like the forward, mode-covering side: diverse, but prone to softness. Score distillation methods (DMD, and its relatives) and GANs are trained on the student's own freshly generated samples and behave like the reverse, mode-seeking side: sharp, but prone to collapse. Two caveats to keep precise: sCM is forward-type, not literally forward KL, and a GAN is mode-seeking in practice, not because it minimizes a reverse KL by definition. The labels describe the behavior, not an exact identity.
Consistency models: one jump to the data
Start with the slow teacher. Its sampling path is an ordinary differential equation, the probability-flow ODE, a smooth curve that carries a noisy point down through falling noise levels until it lands on clean data. Solving it accurately takes many small steps.
A consistency model learns a shortcut. Its network is a consistency function $f_\theta$ that takes any point $x_t$ on that curve, at any noise level $t$, and jumps straight to where the curve ends, the clean data point $x_0$:

$$f_\theta(x_t, t) = x_0, \qquad f_\theta(x_0, 0) = x_0.$$
The second equation is the boundary condition: a point that is already clean (noise level zero) maps to itself, since there is nothing to undo. The network is wired to satisfy it by construction, through an EDM-style preconditioning that blends the input and the network output with noise-dependent weights, the blend tuned so the identity holds exactly at the clean end. The name consistency means every point on a given trajectory must jump to the same endpoint: stand anywhere along one path and you land at the one trailhead. You train that by drawing two nearby points on the same trajectory and forcing their predictions to match.
One jump from pure noise is rarely perfect, so a consistency model samples in a few steps by alternating two moves: jump (apply the consistency map to get a clean guess) and renoise (add fresh Gaussian noise back, to a smaller level than before). Each pass commits to less noise and fixes more detail. This is not the same input fed through repeatedly; the renoising injects new randomness each time, and the refinement comes from the learned map. As the step count goes from one to four, the landings tighten onto both data clusters, so diversity survives the speedup.
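The jump-and-renoise loop can be tried on a toy problem where the consistency function is known in closed form. This is a sketch under stated assumptions: the data is a single 1-D Gaussian (so the noise-to-data map is affine), the noising rule is $x_t = \cos(t)\,x_0 + \sin(t)\,\epsilon$, and the four-step time list is hypothetical. `consistency_fn` stands in for a trained network.

```python
import math
import random

MU, S = 2.0, 0.5  # toy data distribution N(MU, S^2), chosen for illustration


def consistency_fn(x, t):
    """Exact jump to clean data for Gaussian data noised as cos(t)*x0 + sin(t)*eps."""
    sigma_t = math.sqrt((S * math.cos(t)) ** 2 + math.sin(t) ** 2)
    return MU + S * (x - MU * math.cos(t)) / sigma_t


def sample(times, rng):
    """Alternate JUMP and RENOISE down a decreasing list of times."""
    z = rng.gauss(0.0, 1.0)                  # pure noise at t = pi/2
    x0_hat = consistency_fn(z, times[0])     # first jump
    for t in times[1:]:
        z = math.cos(t) * x0_hat + math.sin(t) * rng.gauss(0.0, 1.0)  # renoise
        x0_hat = consistency_fn(z, t)                                  # jump again
    return x0_hat


rng = random.Random(0)
times = [math.pi / 2, 1.1, 0.7, 0.3]         # hypothetical 4-step schedule
draws = [sample(times, rng) for _ in range(20000)]
mean = sum(draws) / len(draws)
std = math.sqrt(sum((d - mean) ** 2 for d in draws) / len(draws))
print(f"sample mean {mean:.3f}, sample std {std:.3f}")
```

Because the toy map is exact, every step already lands on the data distribution; with a learned map the later steps are where residual error gets corrected. The boundary condition is visible too: at `t = 0` the function returns its input unchanged.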
A small bookkeeping note that pays off later. This paper uses the TrigFlow schedule from sCM, where the noisy point at time $t$ is

$$x_t = \cos(t)\,x_0 + \sin(t)\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \quad t \in [0, \pi/2], \tag{2}$$

so $t = 0$ is clean data (all signal, $\cos 0 = 1$) and $t = \pi/2$ is pure noise (all noise). This runs opposite to the flow-matching convention, where time zero is the noise end, so it is worth fixing in mind before reading any sign: here, small $t$ means nearly clean. The teacher might have been trained on a different schedule, which is fine. Because the score, the noise, the clean-data prediction and the velocity are four interconvertible views of the same quantity (like Celsius and Fahrenheit, one affine map apart), the paper wraps any teacher into TrigFlow coordinates by matching signal-to-noise ratios (the ratio of the sine and cosine weights in equation (2), the one quantity any two schedules share), with no retraining.
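One way the wrapping can work is sketched below. It is an illustration under assumptions, not the paper's code: the teacher is taken to be a flow-matching model on the path $x_s = s\,x_0 + (1-s)\,\epsilon$ (time zero is noise) that predicts the velocity $x_0 - \epsilon$, and both function names are hypothetical.

```python
import math


def fm_time_from_trig(t):
    """Flow-matching time s with the same signal-to-noise ratio as TrigFlow time t."""
    return math.cos(t) / (math.cos(t) + math.sin(t))


def wrapped_teacher_velocity(fm_velocity, x_t, t):
    """TrigFlow velocity dx_t/dt obtained from a flow-matching velocity model."""
    scale = math.cos(t) + math.sin(t)
    s = fm_time_from_trig(t)
    x_s = x_t / scale                   # same point, rescaled onto the teacher's path
    v = fm_velocity(x_s, s)             # teacher's view: x0 - eps
    x0 = x_s + (1.0 - s) * v            # clean-data view
    eps = x_s - s * v                   # noise view
    return math.cos(t) * eps - math.sin(t) * x0


# Check against an oracle teacher that knows the true x0 and eps.
x0_true, eps_true, t = 1.5, -0.7, 0.9
x_t = math.cos(t) * x0_true + math.sin(t) * eps_true
got = wrapped_teacher_velocity(lambda x_s, s: x0_true - eps_true, x_t, t)
expected = math.cos(t) * eps_true - math.sin(t) * x0_true
print(got, expected)
```

The only schedule-specific pieces are the time map and the rescaling; everything after that is the affine conversion between velocity, clean data and noise.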
sCM: the continuous-time version
Discrete consistency training compares two points a fixed gap apart on the trajectory. That gap is a nuisance: it introduces its own discretization error and forces a hand-tuned schedule for how the gap should shrink during training. sCM (Lu and Song, 2024), the continuous-time consistency model this paper scales up, takes the gap to zero. The two-point comparison becomes a derivative: instead of asking "do these two nearby points agree," it asks "how does my prediction change as I slide along the teacher's trajectory," and drives that rate of change to what the teacher dictates. The two-point loss turns into a loss on the tangent of the network along the path.
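That tangent is a total derivative with two pieces: how the prediction moves because the point moves along the trajectory, and how it moves because time itself advances.

$$\frac{d}{dt} f_\theta(x_t, t) = \nabla_{x_t} f_\theta(x_t, t)\,\frac{dx_t}{dt} + \partial_t f_\theta(x_t, t)$$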
The first term is a Jacobian (all the partial derivatives of the network output) times the teacher's velocity vector. Computing a full Jacobian would be ruinous, but you never need it: you only need the Jacobian times one specific vector. That is a Jacobian-vector product, or JVP, and forward-mode automatic differentiation computes it in a single extra forward sweep, by pushing the velocity vector through the network alongside the activations. Think of it as reading a speedometer directly instead of timing two positions. One caveat the analogy hides: this tangent mixes the spatial term and the explicit time term, so it is a directional derivative in the joint space of point and time, not a single rate.
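Forward-mode differentiation is easy to see in miniature with dual numbers, where every value carries its tangent along with it. The sketch below is a toy, not the paper's kernel: `Dual`, `dsin`, `dcos` and `net` are illustrative names, and `net` is a stand-in for a real network. In a framework the same operation is exposed as a JVP primitive.

```python
import math


class Dual:
    """A value paired with its tangent (directional derivative)."""

    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        return Dual(self.val * other.val, self.tan * other.val + self.val * other.tan)

    __rmul__ = __mul__


def dsin(d):
    return Dual(math.sin(d.val), math.cos(d.val) * d.tan)


def dcos(d):
    return Dual(math.cos(d.val), -math.sin(d.val) * d.tan)


def net(x, t):
    """Toy stand-in for a network output f(x, t)."""
    return dsin(x) * dcos(t) + x * t


def tangent(x, t, velocity):
    """d/dt f(x_t, t) in one sweep: seed x with dx/dt and t with dt/dt = 1."""
    out = net(Dual(x, velocity), Dual(t, 1.0))
    return out.val, out.tan


def finite_difference(x, t, velocity, h=1e-6):
    """The 'time two positions' alternative: two evaluations, only approximate."""
    ahead = net(Dual(x + h * velocity), Dual(t + h)).val
    behind = net(Dual(x - h * velocity), Dual(t - h)).val
    return (ahead - behind) / (2 * h)


value, jvp = tangent(0.4, 0.9, -1.3)
print(jvp, finite_difference(0.4, 0.9, -1.3))
```

The seed `(velocity, 1.0)` is what makes this the joint directional derivative in point and time: both the spatial term and the explicit time term arrive in the same tangent.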
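With TrigFlow, the network is exactly a velocity predictor $F_\theta$ (the consistency function is $f_\theta(x_t, t) = \cos(t)\,x_t - \sin(t)\,F_\theta(x_t, t)$), and the consistency target unrolls into a clean form that will matter in a moment:

$$F_\theta(x_t, t) \;\leftarrow\; F_{\theta^-}(x_t, t) + w(t)\left[\cos(t)\left(\frac{dx_t}{dt} - F_{\theta^-}(x_t, t)\right) - \sin(t)\left(x_t + \frac{dF_{\theta^-}(x_t, t)}{dt}\right)\right] \tag{4}$$

Here $dx_t/dt$ is the teacher's velocity, $w(t)$ is a time weight, and $\theta^-$ is the student's own weights with the gradient switched off (a stop-gradient copy), so the network is chasing a target built from itself. The first term is supervision from the teacher, weighted by $\cos(t)$. The second is the network's own tangent feeding back into its target, weighted by $\sin(t)$, and it is exactly the JVP piece. Hold onto the two weights; they explain why the method later breaks.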
One more sCM detail, because it is a small, satisfying piece of verification. sCM's loss normalizes that tangent so that the size of each example's loss is roughly constant, which removes the need for a separate learned weight per noise level. Calling the tangent $g$ and the small stabilizing constant $c$, the paper states the normalized residual squares to

$$\left\| F_\theta - F_{\theta^-} - \frac{g}{\sqrt{\|g\|_2^2 + c}} \right\|_2^2 = \frac{\|g\|_2^2}{\|g\|_2^2 + c}$$

and it does, but only with the square root shown here. Because $F_\theta$ and $F_{\theta^-}$ share the same weights, their difference is zero in value, so the residual is just the normalized tangent, and its squared length is $\|g\|_2^2 / (\|g\|_2^2 + c)$, which sits near one whenever the tangent is large. The printed paper divides by $\|g\|_2^2 + c$ with no root, which would square the denominator and drag the value toward zero instead. We checked it by hand and in code at several dimensions; it reads as a typo in the printed math, and the conclusion (loss near one, no learned weight needed) holds either way. It is also a slightly different operation from the look-alike normalizers in the original sCM and in MeanFlow, which divide by the first power of the norm and weight a squared loss respectively; the three agree only when the tangent is large.
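The check is short enough to reproduce. This sketch compares the two denominators on random tangents at a few dimensions; the constant `c = 0.1` is taken from the pseudocode later in this explainer, and `normalized_loss` is an illustrative helper.

```python
import math
import random


def normalized_loss(g, c=0.1, root=True):
    """Squared length of g divided by sqrt(||g||^2 + c), or by (||g||^2 + c) if root=False."""
    sq = sum(v * v for v in g)
    denom = math.sqrt(sq + c) if root else (sq + c)
    return sum((v / denom) ** 2 for v in g)


rng = random.Random(0)
for dim in (4, 64, 1024):
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    with_root = normalized_loss(g, root=True)
    without_root = normalized_loss(g, root=False)
    print(f"dim={dim:5d}  with root: {with_root:.4f}  without root: {without_root:.4f}")
```

With the root the value stays just under one as the tangent grows; without it the value shrinks roughly like one over the squared norm.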
The infrastructure wall
The JVP is cheap in theory and hostile in practice, and clearing that is a real contribution of this paper, not a footnote. Forward-mode autodiff did not compose with the stack that large models are actually trained on: BF16 precision, FlashAttention, and the sharding and sequence-splitting that spread one model across many GPUs. So sCM had only ever been shown on models up to about 1.5 billion parameters. The authors write a custom FlashAttention-2 kernel that carries the tangent through attention alongside the ordinary forward pass, and they make it work under fully-sharded data parallelism (each GPU holds only a slice of the weights, so the JVP is computed inside each layer) and under context parallelism (each GPU holds only part of the sequence, so the tangents are split the same way the keys, queries and values already are). That plumbing is what lets continuous-time consistency run at 10-billion-plus parameters and on video for the first time.
Why it breaks at scale
Once sCM runs on big text-to-image and text-to-video models, a problem surfaces that small benchmarks hid. The samples are sharp, but in the hard cases (small text, fine texture, objects holding their shape across video frames) they distort: smeared lettering, geometry that drifts and interpenetrates from frame to frame. Scaling the model up does not fix it. The cause is error accumulation, and equation (4) shows the mechanism.
A consistency model is effectively learning the entire integral of the teacher's velocity from the data end out to time $t$, so any error made at small $t$ rides along and compounds as $t$ grows. Now look at the two weights in (4). The teacher-supervision term carries weight $\cos(t)$, and the network's own self-feedback term, the fragile JVP piece, carries weight $\sin(t)$. Near the data end the teacher term dominates and the student tracks it closely. As $t$ climbs toward $\pi/2$, the ratio $\cos(t)/\sin(t)$ falls to zero: the teacher's steadying signal fades out, and the learning is left in the hands of the self-feedback term, which is numerically delicate in BF16. Errors seeded near the data end get amplified, with no teacher left to pull them back. Across time, the teacher weight hands off to the JVP weight, and the accumulated error climbs once the JVP wins.
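The two weights fall straight out of differentiating the TrigFlow consistency function. The sketch below assumes the sCM parameterization $f = \cos(t)\,x_t - \sin(t)\,F(x_t, t)$ with unit data scale and a toy closed-form `F` in place of a network; it checks the split of the tangent into a teacher term and a self-feedback term against a finite difference, then tabulates the hand-off.

```python
import math

X0, EPS = 0.8, -0.5  # toy clean point and noise draw (hypothetical values)


def x_at(t):
    return math.cos(t) * X0 + math.sin(t) * EPS


def v_at(t):
    """Velocity dx_t/dt along the trajectory (the teacher's role in this toy)."""
    return -math.sin(t) * X0 + math.cos(t) * EPS


def F(x, t):
    """Toy stand-in for the velocity network."""
    return math.sin(x) + 0.5 * t * x


def dF_dt(x, t, v):
    """What a JVP would return: (dF/dx) * v + dF/dt."""
    return (math.cos(x) + 0.5 * t) * v + 0.5 * x


def f_along_path(t):
    """Consistency function evaluated on the trajectory."""
    return math.cos(t) * x_at(t) - math.sin(t) * F(x_at(t), t)


def tangent_terms(t):
    x, v = x_at(t), v_at(t)
    teacher_term = math.cos(t) * (v - F(x, t))
    feedback_term = -math.sin(t) * (x + dF_dt(x, t, v))
    return teacher_term, feedback_term


t, h = 0.9, 1e-6
numeric = (f_along_path(t + h) - f_along_path(t - h)) / (2 * h)
print("finite difference:", numeric, " sum of terms:", sum(tangent_terms(t)))

ratios = []
for probe in (0.2, math.pi / 4, 1.2, 1.5):
    ratios.append(math.cos(probe) / math.sin(probe))
    print(f"t={probe:.3f}  teacher weight={math.cos(probe):.3f}  "
          f"JVP weight={math.sin(probe):.3f}  ratio={ratios[-1]:.3f}")
```

The ratio crosses one at $t = \pi/4$; past that point the self-feedback weight is the larger of the two.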
There is a second, deeper reason the samples are soft, and it is the divergence story from before. sCM lives on the forward, mode-covering side. It is trained only against offline data, and its consistency check is local: it links predictions at infinitesimally close times. A relay where each runner only checks the handoff with the runner one inch ahead can have every handoff look clean while the team drifts off course over the full distance. Local consistency cannot see that full-distance drift. What sCM is missing is a signal that scores the student's finished sample, the full skip from noise to data, against the teacher.
rCM: a long-skip regularizer
That missing signal is precisely what the reverse, mode-seeking camp provides. rCM adds a small amount of score distillation, in the form of DMD (distribution matching distillation), as a regularizer. The two pieces partner naturally: sCM works on offline data along the trajectory, DMD works on the student's own generated samples, so they cover the forward and reverse data paths respectively.
Score distillation minimizes a reverse divergence, the same mode-seeking measure from before, between student and teacher. But computing it needs the student's own score, the gradient of its log-density, and for a one-to-few-step generator that score is intractable. So the method trains a second network, a fake score, as a small diffusion model fit online to the student's current outputs. It plays the role the critic plays in a GAN: retrained every step on what the generator is producing now, it tracks a moving target.
With the teacher's score and this fake score in hand, the generator's update moves the student in the direction of the teacher score minus the fake score: toward where the teacher places more density than the student currently does, and away from the student's own ruts. (The gradient is written the other way around, fake minus teacher, because a descent step flips the sign; it is the same motion.) DMD packages that nudge into a plain regression target, so the practical loss is an MSE between the student's sample and a stop-gradient target built from the score difference. The full objective adds the two losses, with one weight:

$$\mathcal{L}_{\text{rCM}}(\theta) = \mathcal{L}_{\text{sCM}}(\theta) + \lambda\,\mathcal{L}_{\text{DMD}}(\theta)$$

The weight $\lambda$ is tiny on purpose. sCM already gets the broad structure right and stays diverse; DMD only has to supply the long-range quality signal that fixes the accumulated drift. At $\lambda = 0.01$ that is exactly what happens, and the mode-seeking term is too small to drag the student into collapse. The name the paper gives this, a long-skip regularizer, is literal: the consistency loss acts on the short, local steps, and this term covers the long skip from noise all the way to the finished sample.
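The regression-target trick can be verified in a few lines. This sketch uses made-up vectors for the two score outputs, omits the normalization that the practical loss applies to the gap, and uses a factor of one half so that the gradient equals the gap exactly; all names are illustrative.

```python
def dmd_loss(x, target):
    """Half squared error against a fixed (stop-gradient) target."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, target))


def grad_wrt_x(x, target, h=1e-6):
    """Finite-difference gradient in x only; the target is held constant."""
    grads = []
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        grads.append((dmd_loss(up, target) - dmd_loss(down, target)) / (2 * h))
    return grads


x = [0.3, -1.2, 0.7]                # student sample (toy values)
fake_out = [0.5, 0.1, -0.4]         # hypothetical fake-score output
teacher_out = [0.2, 0.6, -0.1]      # hypothetical teacher output
gap = [f - t for f, t in zip(fake_out, teacher_out)]

# Build the target once from the current sample, then freeze it.
target = [xi - gi for xi, gi in zip(x, gap)]
grad = grad_wrt_x(x, target)
print("gradient:", grad)
print("gap     :", gap)
```

The gradient with respect to the sample is the fake-minus-teacher gap, so a descent step moves the sample along teacher minus fake, which is the motion described above.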
To compute the DMD loss you need samples from the student, which a consistency model produces by the same denoise-and-renoise rollout used at inference. rCM draws the number of rollout steps randomly and only backpropagates the DMD loss through the final jump, and it walks the timesteps down a random, monotonically decreasing schedule rather than DMD2's fixed schedule, which only ever trains the student at a handful of fixed times and leaves the rest of the range unpracticed. None of this needs a GAN loss, a multi-stage recipe, low-rank adapters, or a hyperparameter search; the student and the fake score are both initialized from the teacher and tuned in full.
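A random rollout schedule of this kind might be drawn as follows. The uniform distribution over intermediate times is an assumption made for illustration; the paper's actual sampling distribution is not stated here, and `random_rollout_times` is a hypothetical helper.

```python
import math
import random


def random_rollout_times(n_max, rng):
    """Pick a step count in 1..n_max and a decreasing list of times from pi/2."""
    n_steps = rng.randint(1, n_max)
    # Assumption: intermediate times are uniform on (0, pi/2), then sorted downward.
    inner = sorted((rng.uniform(0.0, math.pi / 2) for _ in range(n_steps - 1)),
                   reverse=True)
    return [math.pi / 2] + inner


rng = random.Random(0)
schedules = [random_rollout_times(4, rng) for _ in range(1000)]
for times in schedules[:3]:
    print([round(t, 3) for t in times])
```

Over many training steps the student is exercised at times spread across the whole range, instead of at the same few fixed values.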
What one training step looks like
Make it concrete. Take the Wan2.1 14-billion-parameter video teacher, a flow-matching transformer. The student starts as a copy of it. One training step draws a clean latent $x_0$ and a TrigFlow time $t$, say one past $\pi/4$, where $\sin(t)$ exceeds $\cos(t)$, so the self-feedback term already outweighs the teacher term. It noises the latent to $x_t$, runs the student once forward plus one JVP sweep to get the prediction and its tangent, and forms the sCM loss. Separately it rolls out the student from pure noise for a random one to four steps to get a generated sample, renoises that sample to a fresh random time, queries both the fake score and the teacher there, and forms the DMD loss. The two are added with weight $\lambda = 0.01$, and a gradient step updates the student through one combined backward pass. Then the fake score takes its own step, fitting itself to the student's latest sample. The architecture is unchanged from the teacher: the same layers, the same parameter count, the same attention.
# one rCM step: update student θ, then fake-score φ
x0 = sample_data() # clean latent (real or teacher-made)
t = sample_time(p_G) # TrigFlow time in [0, π/2]
eps = randn_like(x0)
xt = cos(t)*x0 + sin(t)*eps # noise it
# sCM term (forward divergence) on this offline point.
# F and sg(F) are equal in value; only F carries gradient.
F, tangent = student_with_jvp(xt, t, teacher) # 2 forward sweeps
g = cos(t) * tangent # w(t) = cos(t)
L_sCM = sq_norm(F - sg(F) - g / sqrt(sq_norm(g) + 0.1))
# DMD term (reverse divergence) on the student's OWN sample.
x0_hat = rollout(student, N=randint(1, N_max)) # denoise+renoise
s, eps2 = sample_time(p_D), randn_like(x0_hat)
xs = cos(s)*x0_hat + sin(s)*eps2 # renoise to time s
gap = fake(xs, s) - teacher(xs, s) # score difference
target = sg(x0_hat - gap / mean(abs(x0_hat - teacher(xs, s))))
L_DMD = sq_norm(x0_hat - target) # last step only
(L_sCM + 0.01 * L_DMD).backward(); opt_theta.step()
# the fake score chases the student, like a GAN critic
v_tgt = cos(s)*eps2 - sin(s)*x0_hat # velocity target
sq_norm(fake(xs, s) - v_tgt).backward(); opt_phi.step()

Generation afterward is the plain rollout, with classifier-free guidance already folded into the student during training, so each step is a single forward pass and the student's NFE equals the step count, with no doubling:
# generation: N steps, alternating jump and renoise
z = randn_like(x0) # pure noise at t = π/2
t = pi / 2
for k in range(N): # N = 1..4
    x0_hat = student(z, t) # JUMP straight to a clean guess
    if k == N - 1: break
    t = next_time(k, N) # a smaller time
    z = cos(t)*x0_hat + sin(t)*randn_like(z) # RENOISE back up
return x0_hat # the sample

The one knob: λ
Everything rests on a single weight, and it behaves the way the divergence picture predicts. Larger $\lambda$ leans on the mode-seeking DMD term: quality climbs, diversity falls. Smaller $\lambda$ leans on mode-covering sCM: diversity stays, quality softens. The paper sweeps $\lambda$ and lands on $\lambda = 0.01$ as the sweet spot, the smallest weight that already buys good quality while keeping diversity high. Raising $\lambda$ slides the operating point along the trade-off, away from the diverse-but-soft pure-sCM end and toward the sharp-but-collapsed DMD end.
A single small weight works because the two losses are not fighting over one target: they run on different data, the offline trajectory for sCM and the student's own generations for DMD, so a flexible student takes diversity from one and sharpness from the other instead of averaging them into mush.
What it does
rCM is validated on two production-scale model families: Cosmos-Predict2 for text-to-image (0.6B, 2B, 14B parameters) and Wan2.1 for text-to-video (1.3B, 14B), covering videos up to five seconds. The image models are scored on GenEval, which checks compositional prompts (object counting, spatial relations, attribute binding) rather than the raw image statistics that the older FID score measures and that misses things like text rendering. The video models are scored on VBench, which rates motion quality and semantic alignment.
On GenEval, the 14B image model reaches an overall in four steps, a hair under the teacher's at dozens of steps, and even the one-step score is . On VBench, the 14B video student scores at four steps and at two, above the 480p teacher's , at a throughput of 4.5 to 8.3 frames per second against the teacher's 0.18. The paper is careful here: beating the teacher on a benchmark does not make the student strictly better, especially in diversity and physical consistency. It means a few-step student can hold quality, not that distillation rediscovered the teacher.
The comparison that matters is against DMD2, the previous state of the art for large-scale distillation, which is a pure mode-seeking method propped up with a GAN loss. rCM holds level with or slightly beats DMD2 on the quality scores, then separates on diversity: DMD2's outputs collapse, with objects converging to similar positions and orientations across seeds, while rCM keeps the spread of the teacher. A mode-seeking method alone buys quality by giving up variety; pairing it with a mode-covering one, in the right small amount, keeps both.
The method has real limits. rCM still needs a strong teacher to distill from; it does not train a fast model from scratch. It also leans entirely on its JVP infrastructure: without that custom kernel, continuous-time consistency does not reach this scale at all. And the distilled student trades away some of the teacher's range for its speed, just less of it than the alternatives. What rCM shows is that the two camps of distillation, long treated as competitors, are better read as two halves of one objective: a forward divergence on offline data to stay diverse, a reverse divergence on your own samples to stay sharp, and a single small weight to hold the balance.
Questions you might still have
Why does adding just 1% of the score-distillation loss fix quality without killing diversity?
sCM already gets the layout right and stays diverse; what it lacks is a signal computed on the full noise-to-data path. A small dose of that signal (λ = 0.01) is enough to sharpen detail, because it only has to correct the accumulated drift, not retrain the model. Push λ higher and the mode-seeking score term starts to dominate, and diversity collapses the way it does for a pure score-distillation method.
If consistency and score distillation pull in opposite ways, why don’t they cancel?
They act on different data. The consistency loss is computed on offline points (real or teacher data with fresh noise); the score-distillation loss is computed on the student’s own generated samples. One holds the model to the full data distribution, the other sharpens its own outputs. They are complementary jobs on different inputs, not two forces on the same target.
What actually made continuous-time consistency hard to scale before this paper?
The continuous-time objective needs a Jacobian-vector product: a directional derivative of the network, computed by forward-mode autodiff in one extra forward sweep. That operation did not compose with the standard large-model stack (BF16 precision, FlashAttention, sharded and sequence-parallel training). The paper builds a FlashAttention-2 JVP kernel and makes it work under those parallelisms, which is what lets the method run at 10-billion-plus parameters and on video.
Is the distilled model strictly better than the teacher?
No. On the benchmark scores it matches or even edges past the teacher in a few steps, but the paper is explicit that this does not make it strictly superior, especially in diversity and physical consistency. A few-step student trades some of the teacher’s range for speed; rCM’s point is that it trades away much less of it than the previous fast methods did.
Footnotes & further reading
- The paper: Zheng, Wang, Ma, Chen, et al., Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency (Tsinghua / NVIDIA, 2025). Project page.
- Continuous-time consistency and the TrigFlow schedule: Lu and Song, Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models (sCM). The discrete original: Song et al., Consistency Models, and the pure stop-gradient variant, Song and Dhariwal, Improved Techniques for Training Consistency Models.
- Score distillation: Yin et al., One-step Diffusion with Distribution Matching Distillation (DMD) and Improved DMD (DMD2); the variational view, Wang et al., ProlificDreamer (VSD); and the Fisher-divergence variant, Zhou et al., Score Identity Distillation (SiD).
- Mode-covering versus mode-seeking is the classic asymmetry of forward and reverse KL; Minka's Divergence Measures and Message Passing is the readable reference. Reverse divergence seeks the largest-mass mode, not necessarily the tallest peak.
- The probability-flow ODE uses exactly half the score coefficient of the reverse-time SDE, $\tfrac{1}{2}g(t)^2$ versus $g(t)^2$; the deterministic flow drops the noise term, hence the half. Song et al., Score-Based Generative Modeling through SDEs. The schedule-to-coefficient maps come from the variational-diffusion / DPM-Solver line, not the original score-SDE paper.
- The JVP kernel builds on FlashAttention-2 (Dao, 2023) and on EDM's design space, Karras et al., Elucidating the Design Space of Diffusion-Based Generative Models.