Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
Generate robot actions by denoising.
Imitation learning has long had the same uncomfortable problem: a demonstration set with two equally good routes, and a policy that averages them into one bad one. A diffusion model over action sequences covers both routes without averaging, trains stably, and still runs at control rates.
Explaining the paperDiffusion Policy: Visuomotor Policy Learning via Action DiffusionA T-block sits on a table. Push it into the target. Two collision-free routes around the block work, going left and going right, and averaging them sends the end-effector straight into the block. That failure mode is what the paper sets out to fix.
Behavior cloning is the simplest way to teach a robot from demonstrations: collect pairs (what the cameras saw, what the human did) and train a network to map one to the other. It is supervised regression with a label. The trouble shows up the first time two demonstrations disagree at the same observation, which they almost always do. Humans are stochastic, tasks are multimodal, and the same scene can have several right answers. A regression network minimizes mean squared error, which means it returns the average answer, which means a policy that pushes through the T-block instead of around it.
The field knew this and tried two repairs. One: keep the policy explicit but give it a richer head, a mixture of Gaussians or a categorical over discretized actions. Two: switch to an implicit policy, an energy-based model that scores actions and picks the lowest-energy one. Both worked in narrow settings and broke in others. The mixtures had to pre-commit to a number of modes. The energy models were unstable to train and slow to query.
Diffusion Policy, from a Columbia / Toyota Research / MIT / Stanford collaboration, takes a different route: borrow a generative model that is already in widespread use for images, denoising diffusion, and point it at action sequences instead of pixels. The robot does not regress a single action. It samples a chunk of future actions from a conditional distribution by running a learned denoising process from Gaussian noise down to clean motor commands. Once the distribution is rich enough, the two-route problem becomes a non-problem: both routes are modes of that distribution, each rollout draws one, and the robot picks a side.
The recipe is a handful of pieces, none of them heavy. A diffusion model defined on action vectors. Observation features fed in as conditioning rather than as joint inputs to denoise. A receding-horizon execution loop that buys temporal consistency from sequence prediction without giving up reactivity. A DDIM sampler that runs the inference loop ten times instead of a hundred. And, surprisingly, position control instead of the field-standard velocity control.
The two-route problem
Look at the push-T figure below first. The end-effector starts below a T-shaped block; the goal is to push the T into the dashed target outline at the top. The shortest collision-free routes go left or right around the block. Demonstrations from any reasonable human include both. So at this exact observation the next action is not one vector. It is a distribution over actions, and that distribution has two peaks.
Toggle the three policies. Unimodal regression averages the two modes and parks the action between them, which is the worst place. The energy-based policy, given a small budget of negative samples to estimate its partition function, locks onto one mode and starves the other. Diffusion Policy keeps both peaks in its sample cloud, and on any single rollout it commits to one side cleanly:
Two things to notice. First, "covers both modes" is not the same as "mixes both modes within a single rollout." A policy that smoothly interpolates between left and right at every timestep is exactly the bad behavior. What Diffusion Policy does instead is sample a path: one execution picks left, the next picks right, neither alternates mid-run. Second, the diagnostic in the corner reports mode coverage across rollouts, not within one. A multimodal policy that always rolls left would still average the distribution incorrectly across initial conditions; coverage and per-rollout commitment have to both hold.
A diffusion model, briefly
For the rest of the explainer we need one piece of background: how a denoising diffusion model works. If you have read the DDPM paper this will be familiar; we keep it short.
Pick the data you want to generate, in our case a clean action sequence . Add Gaussian noise to it in small steps until after iterations you are left with pure noise . The forward step has a fixed schedule:
The schedule is a sequence that decreases from (almost no noise) to (the signal is gone). This is the standard DDPM convention; the paper writes it more compactly with one symbol per step, but the mechanics are identical. Noising itself is cheap, deterministic given , and requires no learning.
Generation runs the same chain backwards. Train a network to predict the noise that was added, and at inference start from pure Gaussian noise and remove it one step at a time:
That is the textbook DDPM reverse update; the paper writes it as a single compact equation with symbolic , , and playing the roles of the schedule terms. The learning objective drops out of the same chain and is short to state: ask the network to predict the noise that was added, and penalize the squared error.
One identity is going to matter later. The noise the network learns to predict is, up to a known scale, the negative gradient of the log-density of the noised data,
so each reverse step is gradient ascent on the data log-density (the "score"), nudged by a fresh injection of Gaussian noise. That is the lens the paper uses to compare against energy-based methods later: the noise predictor already estimates the score directly, while an EBM has to recover it through a normalizing constant it cannot integrate.
Denoise actions, not pixels
To make this a policy, two changes. The thing being denoised becomes a future action sequence, not an image. And the denoising is conditioned on the current observation , which the network sees but never has to predict. Concretely the conditional reverse update is:
and the loss adds as an extra argument to the noise predictor:
Two design choices are buried in there and matter at deployment. The denoiser writes only the action sequence; observations enter as context and are never noised or reconstructed. That keeps the model from having to imagine future states (as planning-via-diffusion methods like Janner et al. have to) and lets the vision encoder run exactly once per control step. The second choice is that the action being denoised is a sequence of future actions, not a single one. Diffusion handles high-dimensional outputs gracefully, and a long predicted chunk buys temporal consistency we will lean on in a minute.
What does the reverse process actually look like at iterations? Scrub through the figure below. The trajectory starts as uncorrelated Gaussian noise and refines, step by step, into a smooth curve that gets from a start point around an obstacle to a goal:
Two implications follow. First, the dimension being denoised is the full action sequence; a 16-step path of 2D actions is a 32-vector. Diffusion's scalability to high dimensions, the same property that makes it work for 256×256 images, is what lets the policy commit to a long chunk of future motion rather than one timestep. That sequence-level coherence is what avoids the "jitter between modes" failure of frame-by-frame policies, where consecutive actions sample from different modes and the robot wobbles. Second, watch the trajectory near : it is essentially scribble, no information about routes or goals in it. The structure has to come from somewhere, and it comes from . Every reverse step asks "given what I am seeing, which way does the noise point?", and the noise predictor has learned that conditioning thoroughly.
Predict a chunk, execute part of it
The policy predicts future actions but executes only the first of them, then looks at fresh observations and re-plans. The remaining predicted steps are thrown away. This is receding-horizon control, the same trick that runs model-predictive controllers, and it is doing two jobs at once.
One: it keeps action consistency. Each block of executed steps comes from a single coherent denoised plan, so the robot does not flip between modes within a chunk. Two: it preserves responsiveness. The policy re-observes every steps, so a surprise (a slipped grasp, a moved block) gets seen and replanned soon, not after a full 16-step prediction has played out. The interesting design knob is itself: tiny means re-plan constantly and risk picking a different mode every chunk, large means longer, smoother plans but a sluggish reaction to changes. Drag the slider:
The paper's default , for state-based tasks, with two observation steps as context. That choice is the result of an ablation, not an inevitability. The paper shows the success-rate curve over and it is shaped like a hill: very small values lose to noisy re-planning, very large values lose to staleness. The peak sits well above either extreme. The figure's "jitter at chunk boundaries" effect is exactly what kills frame-by-frame mixture policies like BC-RNN with GMM heads: each step samples from a multimodal head, consecutive steps can land in different modes, and the executed trajectory weaves between routes that should each have been committed to.
One bonus the paper highlights: a long action chunk handles idle frames gracefully. Human teleoperators pause mid-task, producing long stretches of zero-velocity actions in the demonstration set. A frame-by-frame policy trained on those frames learns to imitate the pause: it sees a pause-shaped observation and freezes. A sequence policy looks ahead far enough that the pause is one feature of a longer plan, not the whole plan, and the robot does not get stuck.
Why training stays smooth
Both an EBM and a diffusion model can express multimodal action distributions in theory. In practice training the EBM is rough, and the paper's claim is that diffusion sidesteps the rough part.
An implicit / energy-based policy writes the action density as a normalized Boltzmann distribution,
where is an integral over the entire action space and is intractable. To train, you approximate with negative samples and minimize an InfoNCE-style loss:
The denominator is the Monte-Carlo estimate of , and the quality of that estimate depends entirely on how representative the negatives are. With too few negatives the estimate is noisy and the energy landscape oscillates; with many you spend most of your compute on negatives instead of the real example. The paper points to a known failure mode where the training loss for the energy function looks healthy but the evaluation success rate jumps around chaotically, making checkpoint selection a nightmare.
Diffusion Policy never has to estimate , and the reason is one line of calculus. The score of the EBM density splits into a gradient of the energy and a gradient of ; the second term vanishes because does not depend on , so the score is just the negative energy gradient and the noise predictor learns it directly:
The noise predictor is (up to the schedule scale) the score. There is no integral to estimate, no negative samples to draw, no anywhere in either the training or the inference graph. The objective collapses to ordinary regression of one Gaussian on another (Eq 5), and ordinary regression has the training behavior of ordinary regression. Watch the two regimes side by side, with the EBM's wobble shrinking as grows but never going away:
The practical consequence is the one the paper opens with: you do not have to evaluate every checkpoint on hardware to find a good one. The training loss curve is informative, and a model picked by training loss tracks the model picked by hardware success rate. That is a real cost saving when each rollout takes minutes on a physical robot.
CNN, transformer, schedule, DDIM
The math above is agnostic to the architecture of . The paper tries two and recommends both, with one rule of thumb. Walk through the choices, briefly, because they are where most of the engineering goes.
CNN, with FiLM conditioning. The default backbone is a 1D temporal convolutional network in the style of Janner et al.'s diffuser, taking the noisy action sequence as its input and producing a denoised guess. Observations condition every conv layer through Feature-wise Linear Modulation (FiLM), which means each channel's output is multiplied and shifted by a per-observation scalar before the next layer. This is the same conditioning mechanism DALL-E's decoder and many diffusion models use, and it is light: an MLP from the observation features produces two scalars per channel per layer, and that is the entire observation handoff. The CNN works well out of the box on most tasks but, as 1D CNNs do, prefers low-frequency signals, so it can over-smooth action sequences that need sharp velocity changes.
Transformer, when the CNN over-smooths. The alternative is a minGPT-style transformer decoder. Noisy actions become input tokens, observations become a key/value sequence (via cross-attention), the denoising iteration is a sinusoidal embedding prepended to the action tokens, and causal masking constrains each output token to attend only to itself and earlier actions in the sequence. The transformer reaches better peak performance on state-based tasks with sharp action dynamics, at the cost of being touchier to tune. The paper's recommendation is the same one that has shown up everywhere in the diffusion literature: start with the CNN; switch to the transformer if you need it.
Visual encoder, end-to-end. The vision encoder is a plain ResNet-18 trained from scratch with the policy. Two substitutions matter: global average pooling is replaced by spatial softmax (a soft argmax over feature locations, so the encoder reports where each feature peaked rather than averaging over the image), and BatchNorm is replaced by GroupNorm. The GroupNorm swap is more important than it looks. Diffusion models lean on Exponential Moving Average shadow weights for sample quality, and BatchNorm's running statistics interact badly with EMA (the shadow weights see one statistic, the live forward sees another, the network drifts). GroupNorm computes its statistics per group of channels per sample, with no running estimate, and is invariant to that mismatch.
Noise schedule. The choice of across iterations decides how much noise is added at each step and, by extension, what each step of the reverse process actually has to do. The paper finds the cosine schedule from iDDPM works best for control: noise grows slowly at first, accelerates, then plateaus near pure noise. The linear DDPM schedule wastes too many steps in the near-clean regime where there is little to fix.
DDIM at inference. The biggest deployment-time lever is that training and inference iteration counts can differ. DDIM reuses the same noise predictor inside a deterministic (or near-deterministic) reverse process that can take much larger steps, so a model trained with denoising iterations can be sampled with . On a 3080 the resulting inference latency is about 100ms per control step, well inside the real-time budget. The training-time iteration count still matters: with more iterations the score field is learned at finer noise resolutions and the model fits the data more carefully, but you do not pay for those at inference.
Putting the loop together, one training step and one inference step, on the cosine schedule with DDIM at , look like this:
# one training step (per the paper's Eq 5)
O_t, A_t = sample_demo() # observations, action sequence
k = randint(1, K) # random noise iteration
eps = randn_like(A_t) # fresh Gaussian noise (matched shape)
A_noisy = sqrt(abar[k]) * A_t + sqrt(1 - abar[k]) * eps
eps_hat = eps_theta(O_t, A_noisy, k) # noise-prediction net, conditioned on O_t
loss = mse(eps_hat, eps) # train the noise predictor
loss.backward(); opt.step()# control loop at time t (T_a-step receding horizon, K_inf = 10)
O_t = recent_obs(t, horizon=T_o) # the last T_o observations
A = randn(T_p, action_dim) # start: pure noise, T_p future actions
for k in ddim_schedule(K_train, K_inf): # 100-step train, 10-step inference
eps_hat = eps_theta(O_t, A, k)
A = ddim_step(A, eps_hat, k) # one denoising update of Eq 4
# commit to A[:T_a] and re-plan from O_{t+T_a}One last detail the appendix dwells on but the headline numbers depend on: the action vectors are normalized per dimension to across the training set before noise is added (Eq 1 assumes unit-scale data). And actions represent positions, not velocities; that choice has a story of its own.
Why position control wins
Most behavior cloning work uses velocity commands: the policy outputs how fast and which way the end-effector should move, the low-level controller integrates them into a position. Diffusion Policy does the opposite, outputting absolute target positions, and it works better.
This is genuinely surprising, because velocity control has historically been kinder to policy learning. The argument the paper offers has two parts. First, position-space action distributions are more multimodal than velocity-space ones (the same goal admits several positions to be in next; one velocity to that goal is much more constrained), and Diffusion Policy is the only tested baseline that handles multimodality without collapsing or jittering. Second, predicting a long chunk of positions accumulates less error than predicting a long chunk of velocities, because velocity prediction errors compound over the integration that turns them into positions, while position errors do not.
The empirical result on Push-T is the cleanest single statement of the gap: switching from velocity to position drags BC-RNN down from 72% to 50%, drags BET down from 45% to 36%, and lifts Diffusion Policy from 85% to 95%. Same task, same data, same network architectures, different action parameterization. Two policies get worse, one gets better:
One small note. The bars show the maximum success rate across training, the same metric the paper's ablations report. Average success rates are uniformly worse for the unstable baselines and uniformly close to maximum for Diffusion Policy, which is the training-stability point in numerical form. If you cared about deployment reliability and not just peak benchmark, the position gain widens further.
What the numbers say
Across 15 tasks from 4 simulation benchmarks (Robomimic, Push-T, the Block-Pushing benchmark from BET, and Franka Kitchen), Diffusion Policy reports an average relative success-rate improvement of 46.9% over the strongest prior baseline per task. That number is the headline of the paper, and it has more texture than it looks: across a diverse spread of manipulation tasks, the gain is not concentrated on a single benchmark. On the harder tasks (Transport, Tool Hang, multi-task Franka Kitchen) the improvement is larger; on simpler ones (Lift, Can) almost every method saturates so the gain is small.
The Push-T result on its own is the cleanest demonstration of the multimodality claim, because Push-T is the simplest task in the benchmark and yet the failure of unimodal baselines is the most obvious. Going from a 67% best baseline (BC-RNN, ph) to 95% with Diffusion Policy means the two-route problem really was a major source of failure, and a policy that handles it removes most of it.
Real-world results follow the simulation pattern. On a physical Push-T setup, Diffusion Policy reaches 95% success on average (the best baseline reaches 70%). On more complex multi-stage tasks (mug flipping, sauce pouring, sauce spreading), the gap widens because the multimodality compounds: each stage of the task has its own modes, and a unimodal policy loses on each one multiplicatively. The bimanual results added in the journal extension push the same idea further, with two-arm manipulation tasks (egg beater, mat unrolling, shirt folding) that any single-mode policy would collapse on.
The limitations the paper itself flags are the right ones. Diffusion Policy inherits the data hunger of imitation learning, so a task with five demonstrations is still a task with five demonstrations. Inference latency, even with DDIM, is in the tens-to-hundreds of milliseconds, fine for manipulation but too slow for some dynamic tasks. And the action chunk has to be long enough to absorb idle frames, but short enough to react to perturbations, which is a hyperparameter you set per task. None of those is an indictment of the framing, and the framing is the contribution: a generative model over action sequences, conditioned on observations, executed with a receding horizon, beats the careful regression and energy-based methods that the field had spent years tuning.
Questions you might still have
Why does this not just average the two routes around the T-block?
A unimodal regression policy would: its loss is minimized by the conditional mean of the demonstrations. Diffusion does not optimize toward a mean. Each rollout draws a different starting noise A^K and walks down its own basin, so one rollout commits to "go left" and another to "go right". Both modes are covered across rollouts, neither is averaged within one.
If observations are baked into the noise predictor, why is inference still fast?
Because O_t is passed in as conditioning, not denoised. The vision encoder runs once per control step; only the action tokens A_t^k cycle through the noise predictor K_inf times. With DDIM at K_inf = 10, that loop costs about 100ms on a 3080, fast enough for real-time control.
Where does the position-control gain actually come from?
Two effects stack. Action sequences are MORE multimodal in position space (the same goal admits more distinct paths than the same velocity does), and Diffusion Policy is the only baseline that can keep both modes in its action distribution. Position control also accumulates less compounding error over a long predicted chunk than velocity control does, which suits sequence prediction.
How does this differ from a vanilla "diffusion model for actions"?
Three practical things. The denoiser predicts only the action sequence, not the joint observation-action trajectory, so it does not have to imagine the future. Visual encoders are trained end-to-end with the policy (GroupNorm, not BatchNorm, to play well with the EMA shadow weights diffusion needs). And inference uses DDIM to spend 10 denoising iterations instead of 100, dropping per-control-step latency to ~100ms.
Footnotes & further reading
- The paper: Chi, Xu, Feng, Cousineau, Du, Burchfiel, Tedrake, Song, Diffusion Policy: Visuomotor Policy Learning via Action Diffusion (RSS 2023; extended in IJRR 2024). Project page and code at diffusion-policy.cs.columbia.edu.
- The diffusion-model backbone: Ho, Jain, Abbeel, Denoising Diffusion Probabilistic Models (DDPM, 2020). The parameterization, simplified MSE loss, and forward/reverse process all come from here.
- Fast inference: Song, Meng, Ermon, Denoising Diffusion Implicit Models (DDIM, 2021). Cosine noise schedule: Nichol & Dhariwal, Improved Denoising Diffusion Probabilistic Models (iDDPM, 2021).
- Score matching and the denoising / score connection: Vincent, A Connection Between Score Matching and Denoising Autoencoders (2011); Song & Ermon, Generative Modeling by Estimating Gradients of the Data Distribution (NCSN, 2019), which is the closer relative to DDPM's annealed-Langevin-style reverse process than Welling & Teh's SGLD.
- The energy-based baseline: Florence et al., Implicit Behavioral Cloning (IBC, CoRL 2021). The categorical / quantized baseline: Shafiullah et al., Behavior Transformers (BET, NeurIPS 2022).
- FiLM conditioning: Perez et al., FiLM: Visual Reasoning with a General Conditioning Layer (AAAI 2018). Receding-horizon control: Mayne & Michalska, "Receding horizon control of nonlinear systems" (1990).
- Planning by diffusion (the joint observation-action approach Diffusion Policy contrasts against): Janner et al., Planning with Diffusion for Flexible Behavior Synthesis (Diffuser, 2022).
How could this explainer be improved? Found an error, or something unclear? I read every message.