VerifiedarXiv:2303.0413726 min
Robot learning · Diffusion

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

Generate robot actions by denoising.

Imitation learning has long had the same uncomfortable problem: a demonstration set with two equally good routes, and a policy that averages them into one bad one. A diffusion model over action sequences covers both routes without averaging, trains stably, and still runs at control rates.

Explaining the paperDiffusion Policy: Visuomotor Policy Learning via Action DiffusionChi, Xu, Feng, Cousineau, Du, Burchfiel, Tedrake, Song · Columbia · TRI · MIT · Stanford · RSS 2023 / IJRR 2024 · arXiv:2303.04137

A T-block sits on a table. Push it into the target. Two collision-free routes around the block work, going left and going right, and averaging them sends the end-effector straight into the block. That failure mode is what the paper sets out to fix.

Behavior cloning is the simplest way to teach a robot from demonstrations: collect (Ot,at)(\mathbf{O}_t, \mathbf{a}_t) pairs (what the cameras saw, what the human did) and train a network to map one to the other. It is supervised regression with a label. The trouble shows up the first time two demonstrations disagree at the same observation, which they almost always do. Humans are stochastic, tasks are multimodal, and the same scene can have several right answers. A regression network minimizes mean squared error, which means it returns the average answer, which means a policy that pushes through the T-block instead of around it.

The field knew this and tried two repairs. One: keep the policy explicit but give it a richer head, a mixture of Gaussians or a categorical over discretized actions. Two: switch to an implicit policy, an energy-based model that scores actions and picks the lowest-energy one. Both worked in narrow settings and broke in others. The mixtures had to pre-commit to a number of modes. The energy models were unstable to train and slow to query.

Diffusion Policy, from a Columbia / Toyota Research / MIT / Stanford collaboration, takes a different route: borrow a generative model that is already in widespread use for images, denoising diffusion, and point it at action sequences instead of pixels. The robot does not regress a single action. It samples a chunk of future actions from a conditional distribution by running a learned denoising process from Gaussian noise down to clean motor commands. Once the distribution is rich enough, the two-route problem becomes a non-problem: both routes are modes of that distribution, each rollout draws one, and the robot picks a side.

The recipe is a handful of pieces, none of them heavy. A diffusion model defined on action vectors. Observation features fed in as conditioning rather than as joint inputs to denoise. A receding-horizon execution loop that buys temporal consistency from sequence prediction without giving up reactivity. A DDIM sampler that runs the inference loop ten times instead of a hundred. And, surprisingly, position control instead of the field-standard velocity control.

The two-route problem

Look at the push-T figure below first. The end-effector starts below a T-shaped block; the goal is to push the T into the dashed target outline at the top. The shortest collision-free routes go left or right around the block. Demonstrations from any reasonable human include both. So at this exact observation the next action is not one vector. It is a distribution over actions, and that distribution has two peaks.

Toggle the three policies. Unimodal regression averages the two modes and parks the action between them, which is the worst place. The energy-based policy, given a small budget of negative samples to estimate its partition function, locks onto one mode and starves the other. Diffusion Policy keeps both peaks in its sample cloud, and on any single rollout it commits to one side cleanly:

Figure 1 · the two-route problem
Push-T: the end-effector below the T-block has two valid routes to the target. A unimodal regression head averages them into one bad action; an energy-based head locks onto one side and abandons the other; Diffusion Policy keeps both peaks and commits to one per rollout.

Two things to notice. First, "covers both modes" is not the same as "mixes both modes within a single rollout." A policy that smoothly interpolates between left and right at every timestep is exactly the bad behavior. What Diffusion Policy does instead is sample a path: one execution picks left, the next picks right, neither alternates mid-run. Second, the diagnostic in the corner reports mode coverage across rollouts, not within one. A multimodal policy that always rolls left would still average the distribution incorrectly across initial conditions; coverage and per-rollout commitment have to both hold.

A diffusion model, briefly

For the rest of the explainer we need one piece of background: how a denoising diffusion model works. If you have read the DDPM paper this will be familiar; we keep it short.

Pick the data you want to generate, in our case a clean action sequence A0\mathbf{A}^0. Add Gaussian noise to it in small steps until after KK iterations you are left with pure noise AK\mathbf{A}^K. The forward step has a fixed schedule:

Ak=αˉkA0+1αˉkϵ,ϵN(0,I)\mathbf{A}^k = \sqrt{\bar{\alpha}_k}\,\mathbf{A}^0 + \sqrt{1-\bar{\alpha}_k}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})
(1)

The schedule is a sequence αˉk\bar{\alpha}_k that decreases from αˉ01\bar{\alpha}_0 \approx 1 (almost no noise) to αˉK0\bar{\alpha}_K \approx 0 (the signal is gone). This is the standard DDPM convention; the paper writes it more compactly with one symbol α\alpha per step, but the mechanics are identical. Noising itself is cheap, deterministic given ϵ\boldsymbol{\epsilon}, and requires no learning.

Generation runs the same chain backwards. Train a network ϵθ\boldsymbol{\epsilon}_\theta to predict the noise that was added, and at inference start from pure Gaussian noise AK\mathbf{A}^K and remove it one step at a time:

Ak1=1αk ⁣(Ak1αk1αˉkϵθ(Ak,k))+σkz,zN(0,I)\mathbf{A}^{k-1} = \frac{1}{\sqrt{\alpha_k}}\!\left(\mathbf{A}^k - \frac{1-\alpha_k}{\sqrt{1-\bar{\alpha}_k}}\,\boldsymbol{\epsilon}_\theta(\mathbf{A}^k, k)\right) + \sigma_k\,\mathbf{z}, \qquad \mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I})
(2)

That is the textbook DDPM reverse update; the paper writes it as a single compact equation with symbolic α\alpha, γ\gamma, and σ\sigma playing the roles of the schedule terms. The learning objective drops out of the same chain and is short to state: ask the network to predict the noise that was added, and penalize the squared error.

L=EA0,k,ϵ[ϵθ(αˉkA0+1αˉkϵ,k)ϵ2]\mathcal{L} = \mathbb{E}_{\mathbf{A}^0,\,k,\,\boldsymbol{\epsilon}}\Big[\,\big\lVert \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_k}\,\mathbf{A}^0 + \sqrt{1-\bar{\alpha}_k}\,\boldsymbol{\epsilon},\,k) - \boldsymbol{\epsilon} \big\rVert^2 \,\Big]
(3)

One identity is going to matter later. The noise the network learns to predict is, up to a known scale, the negative gradient of the log-density of the noised data,

ϵθ(Ak,k)1αˉkAlogpk(Ak)\boldsymbol{\epsilon}_\theta(\mathbf{A}^k, k) \approx -\sqrt{1-\bar{\alpha}_k}\,\nabla_{\mathbf{A}}\log p_k(\mathbf{A}^k)

so each reverse step is gradient ascent on the data log-density (the "score"), nudged by a fresh injection of Gaussian noise. That is the lens the paper uses to compare against energy-based methods later: the noise predictor already estimates the score directly, while an EBM has to recover it through a normalizing constant it cannot integrate.

Denoise actions, not pixels

To make this a policy, two changes. The thing being denoised becomes a future action sequence, not an image. And the denoising is conditioned on the current observation Ot\mathbf{O}_t, which the network sees but never has to predict. Concretely the conditional reverse update is:

Atk1=1αk ⁣(Atk1αk1αˉkϵθ(Ot,Atk,k))+σkz\mathbf{A}_t^{k-1} = \tfrac{1}{\sqrt{\alpha_k}}\!\left(\mathbf{A}_t^k - \tfrac{1-\alpha_k}{\sqrt{1-\bar{\alpha}_k}}\,\boldsymbol{\epsilon}_\theta(\mathbf{O}_t, \mathbf{A}_t^k, k)\right) + \sigma_k\,\mathbf{z}
(4)

and the loss adds Ot\mathbf{O}_t as an extra argument to the noise predictor:

L=E[ϵθ(Ot,αˉkAt0+1αˉkϵ,k)ϵ2]\mathcal{L} = \mathbb{E}\Big[\,\big\lVert \boldsymbol{\epsilon}_\theta(\mathbf{O}_t,\,\sqrt{\bar{\alpha}_k}\,\mathbf{A}_t^0 + \sqrt{1-\bar{\alpha}_k}\,\boldsymbol{\epsilon},\,k) - \boldsymbol{\epsilon} \big\rVert^2 \,\Big]
(5)

Two design choices are buried in there and matter at deployment. The denoiser writes only the action sequence; observations enter as context and are never noised or reconstructed. That keeps the model from having to imagine future states (as planning-via-diffusion methods like Janner et al. have to) and lets the vision encoder run exactly once per control step. The second choice is that the action being denoised is a sequence of TpT_p future actions, not a single one. Diffusion handles high-dimensional outputs gracefully, and a long predicted chunk buys temporal consistency we will lean on in a minute.

What does the reverse process actually look like at KK iterations? Scrub through the figure below. The trajectory starts as uncorrelated Gaussian noise and refines, step by step, into a smooth curve that gets from a start point around an obstacle to a goal:

Figure 2 · denoising an action sequence
k = 10
A 16-step end-effector path lives in R32\mathbb{R}^{32} (16 steps × 2D). It starts at k=10k=10 as pure Gaussian noise and refines, one denoising iteration at a time, into a clean curve that avoids the obstacle and lands at the goal. The amber dashed line is what the training-data path looked like; the teal is the current denoised guess.

Two implications follow. First, the dimension being denoised is the full action sequence; a 16-step path of 2D actions is a 32-vector. Diffusion's scalability to high dimensions, the same property that makes it work for 256×256 images, is what lets the policy commit to a long chunk of future motion rather than one timestep. That sequence-level coherence is what avoids the "jitter between modes" failure of frame-by-frame policies, where consecutive actions sample from different modes and the robot wobbles. Second, watch the trajectory near k=10k=10: it is essentially scribble, no information about routes or goals in it. The structure has to come from somewhere, and it comes from Ot\mathbf{O}_t. Every reverse step asks "given what I am seeing, which way does the noise point?", and the noise predictor has learned that conditioning thoroughly.

Predict a chunk, execute part of it

The policy predicts TpT_p future actions but executes only the first TaT_a of them, then looks at fresh observations and re-plans. The remaining TpTaT_p - T_a predicted steps are thrown away. This is receding-horizon control, the same trick that runs model-predictive controllers, and it is doing two jobs at once.

One: it keeps action consistency. Each block of TaT_a executed steps comes from a single coherent denoised plan, so the robot does not flip between modes within a chunk. Two: it preserves responsiveness. The policy re-observes everyTaT_a steps, so a surprise (a slipped grasp, a moved block) gets seen and replanned soon, not after a full 16-step prediction has played out. The interesting design knob is TaT_a itself: tiny TaT_a means re-plan constantly and risk picking a different mode every chunk, large TaT_a means longer, smoother plans but a sluggish reaction to changes. Drag the slider:

Figure 3 · receding horizon: T_o, T_p, T_a
T_a = 8
The policy sees the last ToT_o observations (amber), predicts TpT_p actions (faint teal), executes the first TaT_a of them (bright teal), then re-plans from fresh observations. Small TaT_a re-plans often but stitches together a jittery action trajectory; large TaT_aruns long, smooth plans but reacts slowly.

The paper's default Tp=16T_p = 16, Ta=8T_a = 8 for state-based tasks, with two observation steps as context. That choice is the result of an ablation, not an inevitability. The paper shows the success-rate curve over TaT_aand it is shaped like a hill: very small values lose to noisy re-planning, very large values lose to staleness. The peak sits well above either extreme. The figure's "jitter at chunk boundaries" effect is exactly what kills frame-by-frame mixture policies like BC-RNN with GMM heads: each step samples from a multimodal head, consecutive steps can land in different modes, and the executed trajectory weaves between routes that should each have been committed to.

One bonus the paper highlights: a long action chunk handles idle frames gracefully. Human teleoperators pause mid-task, producing long stretches of zero-velocity actions in the demonstration set. A frame-by-frame policy trained on those frames learns to imitate the pause: it sees a pause-shaped observation and freezes. A sequence policy looks ahead far enough that the pause is one feature of a longer plan, not the whole plan, and the robot does not get stuck.

Why training stays smooth

Both an EBM and a diffusion model can express multimodal action distributions in theory. In practice training the EBM is rough, and the paper's claim is that diffusion sidesteps the rough part.

An implicit / energy-based policy writes the action density as a normalized Boltzmann distribution,

pθ(ao)=eEθ(o,a)Z(o,θ)p_\theta(\mathbf{a}\mid\mathbf{o}) = \frac{e^{-E_\theta(\mathbf{o},\mathbf{a})}}{Z(\mathbf{o},\theta)}

where Z(o,θ)Z(\mathbf{o},\theta) is an integral over the entire action space and is intractable. To train, you approximate ZZ with negative samples and minimize an InfoNCE-style loss:

LinfoNCE=logeEθ(o,a)eEθ(o,a)+j=1NnegeEθ(o,a~j)\mathcal{L}_{\text{infoNCE}} = -\log \frac{e^{-E_\theta(\mathbf{o},\mathbf{a})}}{e^{-E_\theta(\mathbf{o},\mathbf{a})} + \sum_{j=1}^{N_{\text{neg}}} e^{-E_\theta(\mathbf{o},\tilde{\mathbf{a}}^j)}}
(6)

The denominator is the Monte-Carlo estimate of ZZ, and the quality of that estimate depends entirely on how representative the negatives a~j\tilde{\mathbf{a}}^j are. With too few negatives the estimate is noisy and the energy landscape oscillates; with many you spend most of your compute on negatives instead of the real example. The paper points to a known failure mode where the training loss for the energy function looks healthy but the evaluation success rate jumps around chaotically, making checkpoint selection a nightmare.

Diffusion Policy never has to estimate ZZ, and the reason is one line of calculus. The score of the EBM density splits into a gradient of the energy and a gradient of logZ\log Z; the second term vanishes because ZZ does not depend on a\mathbf{a}, so the score is just the negative energy gradient and the noise predictor learns it directly:

alogp(ao)=aEθ(a,o)alogZ(o,θ)=0ϵθ(a,o)\nabla_{\mathbf{a}}\log p(\mathbf{a}\mid\mathbf{o}) = -\nabla_{\mathbf{a}}E_\theta(\mathbf{a},\mathbf{o}) - \underbrace{\nabla_{\mathbf{a}}\log Z(\mathbf{o},\theta)}_{=\,0} \approx -\boldsymbol{\epsilon}_\theta(\mathbf{a},\mathbf{o})
(7)

The noise predictor ϵθ\boldsymbol{\epsilon}_\theta is (up to the schedule scale) the score. There is no integral to estimate, no negative samples to draw, no ZZ anywhere in either the training or the inference graph. The objective collapses to ordinary regression of one Gaussian on another (Eq 5), and ordinary regression has the training behavior of ordinary regression. Watch the two regimes side by side, with the EBM's wobble shrinking as NnegN_{\text{neg}} grows but never going away:

Figure 4 · training stability: score sidesteps Z
N_neg = 4
Left: the action-energy landscape. The energy-based policy fits the true energy by sampling NnegN_{\text{neg}} negatives and estimating ZZ from them; the fit wobbles because the estimate is noisy. Diffusion learns the score directly and the fit is smooth at any NnegN_{\text{neg}} (it uses none). Right: the downstream evaluation success rate oscillates for the EBM and climbs smoothly for the diffusion model.

The practical consequence is the one the paper opens with: you do not have to evaluate every checkpoint on hardware to find a good one. The training loss curve is informative, and a model picked by training loss tracks the model picked by hardware success rate. That is a real cost saving when each rollout takes minutes on a physical robot.

CNN, transformer, schedule, DDIM

The math above is agnostic to the architecture of ϵθ\boldsymbol{\epsilon}_\theta. The paper tries two and recommends both, with one rule of thumb. Walk through the choices, briefly, because they are where most of the engineering goes.

CNN, with FiLM conditioning. The default backbone is a 1D temporal convolutional network in the style of Janner et al.'s diffuser, taking the noisy action sequence as its input and producing a denoised guess. Observations Ot\mathbf{O}_t condition every conv layer through Feature-wise Linear Modulation (FiLM), which means each channel's output is multiplied and shifted by a per-observation scalar before the next layer. This is the same conditioning mechanism DALL-E's decoder and many diffusion models use, and it is light: an MLP from the observation features produces two scalars per channel per layer, and that is the entire observation handoff. The CNN works well out of the box on most tasks but, as 1D CNNs do, prefers low-frequency signals, so it can over-smooth action sequences that need sharp velocity changes.

Transformer, when the CNN over-smooths. The alternative is a minGPT-style transformer decoder. Noisy actions become input tokens, observations become a key/value sequence (via cross-attention), the denoising iteration kk is a sinusoidal embedding prepended to the action tokens, and causal masking constrains each output token to attend only to itself and earlier actions in the sequence. The transformer reaches better peak performance on state-based tasks with sharp action dynamics, at the cost of being touchier to tune. The paper's recommendation is the same one that has shown up everywhere in the diffusion literature: start with the CNN; switch to the transformer if you need it.

Visual encoder, end-to-end. The vision encoder is a plain ResNet-18 trained from scratch with the policy. Two substitutions matter: global average pooling is replaced by spatial softmax (a soft argmax over feature locations, so the encoder reports where each feature peaked rather than averaging over the image), and BatchNorm is replaced by GroupNorm. The GroupNorm swap is more important than it looks. Diffusion models lean on Exponential Moving Average shadow weights for sample quality, and BatchNorm's running statistics interact badly with EMA (the shadow weights see one statistic, the live forward sees another, the network drifts). GroupNorm computes its statistics per group of channels per sample, with no running estimate, and is invariant to that mismatch.

Noise schedule. The choice of αˉk\bar{\alpha}_k across iterations decides how much noise is added at each step and, by extension, what each step of the reverse process actually has to do. The paper finds the cosine schedule from iDDPM works best for control: noise grows slowly at first, accelerates, then plateaus near pure noise. The linear DDPM schedule wastes too many steps in the near-clean regime where there is little to fix.

DDIM at inference. The biggest deployment-time lever is that training and inference iteration counts can differ. DDIM reuses the same noise predictor inside a deterministic (or near-deterministic) reverse process that can take much larger steps, so a model trained with K=100K = 100denoising iterations can be sampled with Kinf=10K_{\text{inf}} = 10. On a 3080 the resulting inference latency is about 100ms per control step, well inside the real-time budget. The training-time iteration count still matters: with more iterations the score field is learned at finer noise resolutions and the model fits the data more carefully, but you do not pay for those at inference.

Putting the loop together, one training step and one inference step, on the cosine schedule with DDIM at Kinf=10K_{\text{inf}}=10, look like this:

# one training step (per the paper's Eq 5)
O_t, A_t  = sample_demo()              # observations, action sequence
k         = randint(1, K)              # random noise iteration
eps       = randn_like(A_t)            # fresh Gaussian noise (matched shape)
A_noisy   = sqrt(abar[k]) * A_t + sqrt(1 - abar[k]) * eps
eps_hat   = eps_theta(O_t, A_noisy, k) # noise-prediction net, conditioned on O_t
loss      = mse(eps_hat, eps)          # train the noise predictor
loss.backward(); opt.step()
# control loop at time t (T_a-step receding horizon, K_inf = 10)
O_t  = recent_obs(t, horizon=T_o)      # the last T_o observations
A    = randn(T_p, action_dim)          # start: pure noise, T_p future actions
for k in ddim_schedule(K_train, K_inf): # 100-step train, 10-step inference
    eps_hat = eps_theta(O_t, A, k)
    A       = ddim_step(A, eps_hat, k)  # one denoising update of Eq 4
# commit to A[:T_a] and re-plan from O_{t+T_a}

One last detail the appendix dwells on but the headline numbers depend on: the action vectors are normalized per dimension to [1,1][-1, 1] across the training set before noise is added (Eq 1 assumes unit-scale data). And actions represent positions, not velocities; that choice has a story of its own.

Why position control wins

Most behavior cloning work uses velocity commands: the policy outputs how fast and which way the end-effector should move, the low-level controller integrates them into a position. Diffusion Policy does the opposite, outputting absolute target positions, and it works better.

This is genuinely surprising, because velocity control has historically been kinder to policy learning. The argument the paper offers has two parts. First, position-space action distributions are more multimodal than velocity-space ones (the same goal admits several positions to be in next; one velocity to that goal is much more constrained), and Diffusion Policy is the only tested baseline that handles multimodality without collapsing or jittering. Second, predicting a long chunk of positions accumulates less error than predicting a long chunk of velocities, because velocity prediction errors compound over the integration that turns them into positions, while position errors do not.

The empirical result on Push-T is the cleanest single statement of the gap: switching from velocity to position drags BC-RNN down from 72% to 50%, drags BET down from 45% to 36%, and lifts Diffusion Policy from 85% to 95%. Same task, same data, same network architectures, different action parameterization. Two policies get worse, one gets better:

Figure 5 · position-control synergy
Push-T maximum success rate when the action space is switched from velocity to position. BC-RNN and BET drop. Diffusion Policy improves. Numbers reproduced from the paper's Fig 4 caption; the bracketed number after each position bar is the percentage-point change from velocity to position.

One small note. The bars show the maximum success rate across training, the same metric the paper's ablations report. Average success rates are uniformly worse for the unstable baselines and uniformly close to maximum for Diffusion Policy, which is the training-stability point in numerical form. If you cared about deployment reliability and not just peak benchmark, the position gain widens further.

What the numbers say

Across 15 tasks from 4 simulation benchmarks (Robomimic, Push-T, the Block-Pushing benchmark from BET, and Franka Kitchen), Diffusion Policy reports an average relative success-rate improvement of 46.9% over the strongest prior baseline per task. That number is the headline of the paper, and it has more texture than it looks: across a diverse spread of manipulation tasks, the gain is not concentrated on a single benchmark. On the harder tasks (Transport, Tool Hang, multi-task Franka Kitchen) the improvement is larger; on simpler ones (Lift, Can) almost every method saturates so the gain is small.

The Push-T result on its own is the cleanest demonstration of the multimodality claim, because Push-T is the simplest task in the benchmark and yet the failure of unimodal baselines is the most obvious. Going from a 67% best baseline (BC-RNN, ph) to 95% with Diffusion Policy means the two-route problem really was a major source of failure, and a policy that handles it removes most of it.

Real-world results follow the simulation pattern. On a physical Push-T setup, Diffusion Policy reaches 95% success on average (the best baseline reaches 70%). On more complex multi-stage tasks (mug flipping, sauce pouring, sauce spreading), the gap widens because the multimodality compounds: each stage of the task has its own modes, and a unimodal policy loses on each one multiplicatively. The bimanual results added in the journal extension push the same idea further, with two-arm manipulation tasks (egg beater, mat unrolling, shirt folding) that any single-mode policy would collapse on.

The limitations the paper itself flags are the right ones. Diffusion Policy inherits the data hunger of imitation learning, so a task with five demonstrations is still a task with five demonstrations. Inference latency, even with DDIM, is in the tens-to-hundreds of milliseconds, fine for manipulation but too slow for some dynamic tasks. And the action chunk has to be long enough to absorb idle frames, but short enough to react to perturbations, which is a hyperparameter you set per task. None of those is an indictment of the framing, and the framing is the contribution: a generative model over action sequences, conditioned on observations, executed with a receding horizon, beats the careful regression and energy-based methods that the field had spent years tuning.

Provenance Verified against primary literature
Ho et al. (2020), DDPMThe forward/reverse process, simplified MSE objective, and the noise-prediction parameterization Diffusion Policy reuses.
Song et al. (2021), DDIMDecouples training and inference iteration counts so the same noise predictor can be sampled in 10 steps instead of 100.
Nichol & Dhariwal (2021), iDDPMThe cosine noise schedule the paper picks for control tasks.
Perez et al. (2018), FiLMChannel-wise affine feature modulation: how observations condition the CNN backbone.
Florence et al. (2021), IBCThe energy-based behavior-cloning baseline whose Z-normalization instability motivates the score-based fix.
correctionWe use the textbook DDPM coefficients (1/√α_t, scaled noise) rather than the paper's symbolic α-γ-σ shorthand, and we cite Song & Ermon's NCSN for the score / annealed-Langevin link the paper attributes to Welling & Teh 2011.

Questions you might still have

?

Why does this not just average the two routes around the T-block?
A unimodal regression policy would: its loss is minimized by the conditional mean of the demonstrations. Diffusion does not optimize toward a mean. Each rollout draws a different starting noise A^K and walks down its own basin, so one rollout commits to "go left" and another to "go right". Both modes are covered across rollouts, neither is averaged within one.

?

If observations are baked into the noise predictor, why is inference still fast?
Because O_t is passed in as conditioning, not denoised. The vision encoder runs once per control step; only the action tokens A_t^k cycle through the noise predictor K_inf times. With DDIM at K_inf = 10, that loop costs about 100ms on a 3080, fast enough for real-time control.

?

Where does the position-control gain actually come from?
Two effects stack. Action sequences are MORE multimodal in position space (the same goal admits more distinct paths than the same velocity does), and Diffusion Policy is the only baseline that can keep both modes in its action distribution. Position control also accumulates less compounding error over a long predicted chunk than velocity control does, which suits sequence prediction.

?

How does this differ from a vanilla "diffusion model for actions"?
Three practical things. The denoiser predicts only the action sequence, not the joint observation-action trajectory, so it does not have to imagine the future. Visual encoders are trained end-to-end with the policy (GroupNorm, not BatchNorm, to play well with the EMA shadow weights diffusion needs). And inference uses DDIM to spend 10 denoising iterations instead of 100, dropping per-control-step latency to ~100ms.

Footnotes & further reading

  1. The paper: Chi, Xu, Feng, Cousineau, Du, Burchfiel, Tedrake, Song, Diffusion Policy: Visuomotor Policy Learning via Action Diffusion (RSS 2023; extended in IJRR 2024). Project page and code at diffusion-policy.cs.columbia.edu.
  2. The diffusion-model backbone: Ho, Jain, Abbeel, Denoising Diffusion Probabilistic Models (DDPM, 2020). The ϵθ\boldsymbol{\epsilon}_\theta parameterization, simplified MSE loss, and forward/reverse process all come from here.
  3. Fast inference: Song, Meng, Ermon, Denoising Diffusion Implicit Models (DDIM, 2021). Cosine noise schedule: Nichol & Dhariwal, Improved Denoising Diffusion Probabilistic Models (iDDPM, 2021).
  4. Score matching and the denoising / score connection: Vincent, A Connection Between Score Matching and Denoising Autoencoders (2011); Song & Ermon, Generative Modeling by Estimating Gradients of the Data Distribution (NCSN, 2019), which is the closer relative to DDPM's annealed-Langevin-style reverse process than Welling & Teh's SGLD.
  5. The energy-based baseline: Florence et al., Implicit Behavioral Cloning (IBC, CoRL 2021). The categorical / quantized baseline: Shafiullah et al., Behavior Transformers (BET, NeurIPS 2022).
  6. FiLM conditioning: Perez et al., FiLM: Visual Reasoning with a General Conditioning Layer (AAAI 2018). Receding-horizon control: Mayne & Michalska, "Receding horizon control of nonlinear systems" (1990).
  7. Planning by diffusion (the joint observation-action approach Diffusion Policy contrasts against): Janner et al., Planning with Diffusion for Flexible Behavior Synthesis (Diffuser, 2022).