π₀: A Vision-Language-Action Flow Model for General Robot Control
A robot policy that denoises its next 50 actions.
Start from a pre-trained vision-language model. Bolt a smaller action expert onto its self-attention. Train it to turn Gaussian noise into a chunk of 50 future joint-angle commands. Run ten Euler steps per decision and a 3.3-billion-parameter policy can drive a dexterous robot at 50 Hz.
Most robots that ship with learned policies do one task. pi-0 is a 3.3B-parameter model that does laundry, busses tables, packs eggs, and assembles a cardboard box from flat, on seven different physical robots, with the same weights.
Take an off-the-shelf 3B vision-language model (PaliGemma: a SigLIP image encoder plus a Gemma 2B language model). Add an extra 300M-parameter action expert, a second set of weights inside the same transformer that handles the proprioceptive state and the noisy action tokens. Train it on 903M timesteps of real robot data (laundry, bussing, drawer-packing) plus a slice of the open-source mixture. Use flow matching to turn the action expert into a vector-field network that walks pure Gaussian noise into a 50-step chunk of joint commands in ten Euler steps.
A few ideas explain it all: what an action chunk is and why robots need them; what flow matching does once you flip its convention so τ=0 is noise and τ=1 is data; why pi-0 samples its training timesteps from a beta distribution that puts most weight on the high-noise end; how a single transformer can host two disjoint sets of weights that meet only at self-attention; and how a blockwise causal mask makes the observation prefix cacheable across all ten denoising steps.
Why robot generalists are stuck
In language and vision, the playbook is settled. Pre-train one giant model on the web. Fine-tune it for anything. The pre-trained model carries enough general competence that downstream tasks become a matter of small datasets and good prompts. Robotics has wanted that playbook for years and could not have it, for three reasons. Robot data is scarce (you cannot scrape it). Robots have wildly different bodies (a 6-DoF arm and a 14-DoF dual-arm mobile manipulator are not the same problem). And the output is continuous, high-frequency motor commands at 20 to 50 Hz, not a token.
pi-0 treats scarcity as the only one of those three still unsolved, and treats the body and frequency problems as engineering. Physical Intelligence collected 10,000 hours of teleoperated data across seven robot configurations and 68 task families, then trained a single 3.3B-parameter model on all of it together. They handle the body problem by zero-padding every robot's action vector up to the largest one in the dataset (18 dimensions, enough for two 6-DoF arms, two grippers, a holonomic base, and a vertically actuated torso) and letting the model learn which dimensions are live. The frequency problem they handle with two design choices that together set up the rest of the paper.
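A minimal sketch of the zero-padding idea, in plain Python. The helper name `pad_action`, the choice to pad on the right, and the 7-dimensional example command are assumptions made for illustration; the text above only fixes the target size of 18.

```python
MAX_ACTION_DIM = 18  # largest action vector in the dataset, per the text above


def pad_action(action, target_dim=MAX_ACTION_DIM):
    """Right-pad a robot's action vector with zeros up to target_dim."""
    if len(action) > target_dim:
        raise ValueError(f"action has {len(action)} dims, max is {target_dim}")
    return list(action) + [0.0] * (target_dim - len(action))


# Hypothetical 7-dimensional single-arm command (6 joints + 1 gripper).
single_arm = [0.1, -0.4, 0.3, 0.0, 0.2, -0.1, 1.0]
padded = pad_action(single_arm)

print(len(padded))                 # always MAX_ACTION_DIM
print(padded[:7] == single_arm)    # live dimensions are untouched
print(sum(padded[7:]))             # padded dimensions are all zero
```

Every robot then presents the same tensor shape to the model, and the dead dimensions carry no signal.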
The unit of work is a chunk of actions
A naive policy maps observation to one action: see the world at time t, emit one motor command, repeat. That worked for older Atari-style work but it breaks at 50 Hz. Every step needs a network forward pass, and a 3.3B-parameter model running on an RTX 4090 takes about 73 ms per inference. At 50 Hz the budget per action is 20 ms, and the policy misses every deadline.
ACT (Zhao et al., 2023) settled the workaround pi-0 inherits: predict an entire chunk of $H$ future actions at once, execute them open-loop one per control tick, and call inference again only after many ticks have passed. pi-0 picks $H = 50$ and runs inference every 25 ticks at 50 Hz (every 0.5 s) or every 16 ticks at 20 Hz (every 0.8 s). The control loop never has to wait for the network.
Drag H below to feel the trade. At $H = 1$ the policy has no chunk: every action triggers an inference, and the teal track packs solid (a 73 ms network can never feed a 20 ms control loop). At pi-0's setting the teal track thins out to two inferences per second, and the amber track of executed actions runs uninterrupted between them.
Going purely open-loop is a deliberate departure from ACT. pi-0 tried ACT's temporal-ensembling workaround early on (query every tick, average the overlapping chunks). It hurt performance, so the paper drops it and executes actions open-loop, re-planning every 25 ticks without blending chunks. That puts a stronger commitment on chunk quality: the model has to be right for 0.5 seconds in advance, with no chance to correct between inference calls.
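The timing argument can be checked with a few lines of arithmetic. This is a sketch under the numbers quoted in the text (73 ms per inference, re-planning every 25 ticks at 50 Hz or every 16 ticks at 20 Hz); the function names are hypothetical.

```python
INFERENCE_MS = 73.0  # on-board inference latency quoted in the text


def inference_ticks(control_hz, replan_every, seconds=1.0):
    """Control ticks at which a new chunk is requested during `seconds`."""
    total_ticks = int(round(control_hz * seconds))
    return list(range(0, total_ticks, replan_every))


def inferences_per_second(control_hz, replan_every):
    return control_hz / replan_every


def window_ms(control_hz, replan_every):
    """Wall-clock time between two consecutive inference calls."""
    return replan_every * 1000.0 / control_hz


for hz, every in [(50, 1), (50, 25), (20, 16)]:
    fits = window_ms(hz, every) >= INFERENCE_MS
    print(hz, every, inferences_per_second(hz, every), window_ms(hz, every), fits)
```

Re-planning every tick at 50 Hz leaves a 20 ms window that a 73 ms network cannot meet, while the 25-tick and 16-tick schedules leave 500 ms and 800 ms respectively.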
That choice changes what the network has to model. It is not "the next action given the world." It is "the joint distribution of the next 50 actions given the world." That distribution is multimodal (there is more than one good way to grasp a sock), continuous (no token vocabulary), and high-dimensional ($50 \times 18 = 900$ numbers per decision). Discrete autoregression on motor commands, which is how earlier VLAs like RT-2 and OpenVLA do it, struggles with all three. pi-0 reaches for diffusion's continuous-distribution machinery instead, in its modern flow-matching form.
Flow matching, on actions
Flow matching, introduced by Lipman et al. (2022), is a way to train a generative model by teaching a neural network the velocity field that transports a base distribution into a data distribution along a chosen probability path. It is closely related to diffusion, with one clean difference: instead of learning the score (the gradient of the log density) and integrating a stochastic differential equation, you learn the velocity directly and integrate a plain ordinary differential equation, with no random noise during sampling.
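For pi-0 the data is an action chunk $A_t$ conditioned on an observation $o_t$. The base distribution is standard Gaussian. The probability path between them is the straight line from a noise sample $\epsilon$ to the clean chunk:

$$A_t^\tau = \tau A_t + (1 - \tau)\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I). \tag{1}$$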
Read the convention. In standard diffusion $t = 0$ is clean and $t = T$ is noise. pi-0 (following Lipman) flips that: $\tau = 0$ is pure Gaussian noise, $\tau = 1$ is the clean action chunk. Generation walks $\tau$ from 0 up to 1. It is a small detail and easy to invert by accident, so the figure below labels both ends and crossfades color with $\tau$. Slide it from 0 to 1 and watch the teal noisy chunk collapse onto the amber target curve.
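Why the straight line? On a straight path, the velocity is constant in time: differentiate eq (1) with respect to $\tau$ and the data term contributes $A_t$ while the noise term contributes $-\epsilon$, neither of which depends on $\tau$,

$$u(A_t^\tau \mid A_t) = \frac{d A_t^\tau}{d\tau} = A_t - \epsilon. \tag{2}$$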
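So the network's job is simple. Show it the noisy chunk $A_t^\tau$ and the observation $o_t$, and ask it to predict the vector pointing from $\epsilon$ to $A_t$. Train with plain L2:

$$L^\tau(\theta) = \mathbb{E}_{A_t,\,\epsilon}\,\bigl\lVert v_\theta(A_t^\tau, o_t) - (A_t - \epsilon) \bigr\rVert^2. \tag{3}$$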
Two notes on what eq (3) is and is not. It is the conditional flow matching loss, conditional on a specific clean sample $A_t$: the velocity target $A_t - \epsilon$ is for that one path. Lipman et al. showed that it gives the same gradient (in expectation) as regressing onto the unconditional marginal velocity, which is the quantity you actually need at inference but cannot easily sample. So the conditional loss is the one you train on, and the marginal field is what you get back. And because the conditional paths are the straight optimal-transport lines, the learned field stays close to straight, which is why ten Euler steps suffice at sampling.
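As a sanity check on the path and its velocity target, here is a small plain-Python sketch. It uses a flat list of four floats as a stand-in for the real 50-by-18 chunk, and the helper names `make_training_pair` and `mse` are hypothetical.

```python
import random


def make_training_pair(clean_chunk, tau, rng):
    """Return (noisy_chunk, target_velocity) for one flattened action chunk."""
    noise = [rng.gauss(0.0, 1.0) for _ in clean_chunk]
    # Straight-line path: tau = 0 is pure noise, tau = 1 is the clean chunk.
    noisy = [tau * a + (1.0 - tau) * e for a, e in zip(clean_chunk, noise)]
    # Velocity along that path: clean chunk minus noise, constant in tau.
    target = [a - e for a, e in zip(clean_chunk, noise)]
    return noisy, target


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


rng = random.Random(0)
clean = [0.5, -1.0, 0.25, 2.0]  # toy stand-in for a chunk
tau = 0.3
noisy, target = make_training_pair(clean, tau, rng)

# Following the target velocity for the remaining (1 - tau) lands on the clean chunk.
recovered = [x + (1.0 - tau) * u for x, u in zip(noisy, target)]
max_error = max(abs(r - a) for r, a in zip(recovered, clean))
print(max_error)            # zero up to floating-point rounding
print(mse(target, target))  # a perfect prediction has zero loss
```

The check works for any `tau` in [0, 1), which is the sense in which the target velocity is constant along the path.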
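At inference, generate by integrating eq (2) forward in $\tau$, with the learned $v_\theta$ standing in for the true velocity. Start from $A_t^0 \sim \mathcal{N}(0, I)$, then take Euler steps,

$$A_t^{\tau + \delta} = A_t^\tau + \delta\, v_\theta(A_t^\tau, o_t), \tag{4}$$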
with $\delta = 0.1$, so the loop runs ten times from $\tau = 0$ to $\tau = 1$. The sampler stops there. No stochastic noise injection between steps, no schedule of variances, no classifier-free guidance scale. Compare to vanilla Diffusion Policy (Chi et al., 2023), which integrates a DDPM with ~100 steps: pi-0's ten-step ODE is a ten-fold cut to the per-decision compute and a much cleaner control loop. DDIM can also accelerate DDPM, so the gap is not quite ten-to-one in practice, but the trajectory pi-0 follows is genuinely straighter.
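The sampler described above fits in a short loop. The sketch below is plain Python and replaces the trained network with a hypothetical `oracle_velocity`: the exact marginal field for the degenerate case where the data distribution is a single point, so the toy result can be checked by hand. It is not the learned model.

```python
import random


def euler_sample(velocity_fn, start, num_steps=10):
    """Integrate dx/dtau = velocity_fn(x, tau) from tau = 0 to tau = 1."""
    delta = 1.0 / num_steps
    x = list(start)
    for k in range(num_steps):
        tau = k * delta  # the field is evaluated at 0, delta, ..., 1 - delta
        v = velocity_fn(x, tau)
        x = [xi + delta * vi for xi, vi in zip(x, v)]
    return x


target = [0.5, -1.0, 0.25, 2.0]  # toy "clean chunk"


def oracle_velocity(x, tau):
    # If every straight path must end at `target`, the velocity at (x, tau)
    # is the remaining displacement divided by the remaining time.
    return [(a - xi) / (1.0 - tau) for a, xi in zip(target, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(oracle_velocity, noise)
sample_error = max(abs(s - a) for s, a in zip(sample, target))
print(sample_error)  # tiny: Euler is exact when the field is a straight line
```

With a perfectly straight field the step count barely matters, which is the intuition behind getting away with ten steps; a learned field is only approximately straight, so the real sampler is approximate too.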
Sample low τ, not the middle
One subtlety: how should you pick $\tau$ during training? Lipman's original paper uses uniform, $\tau \sim \mathcal{U}(0, 1)$. Stable Diffusion 3 (Esser et al., 2024) argued for a logit-normal that puts most of the mass on the middle of the interval. Their reasoning, restated in this article's convention: at $\tau = 1$ the optimal prediction is the mean of the base distribution (trivially zero), and at $\tau = 0$ it is the mean of the data given the prompt; both endpoints are easy, so concentrate compute on the hard middle.
pi-0 inverts that. The paper's hypothesis (and they explicitly mark it as a hypothesis, not a theorem): predicting $\mathbb{E}[A_t \mid o_t]$, the mean action given the observation, is much harder than predicting the mean image given a text prompt. The observation constrains the action distribution far more tightly than a text prompt constrains an image. So learning the mean action is the hard end, which sits at $\tau = 0$ (high noise). They emphasize that end and cut off the very-low-noise tail entirely.
Concretely the density is

$$p(\tau) = \mathrm{Beta}\!\left(\frac{s - \tau}{s};\; 1.5,\; 1\right), \qquad s = 0.999,$$

and it is zero for $\tau > s$. A Beta(1.5, 1) density is $\propto x^{0.5}$ on $[0, 1]$, which rises monotonically toward $x = 1$. Substituting $x = (s - \tau)/s$ flips the axis: high $x$ maps to low $\tau$. So $p(\tau)$ rises monotonically toward small $\tau$, peaks at zero, and is truncated at $\tau = s$. The cutoff is calibrated so that the Euler step size $\delta$ always exceeds $1 - s$, which means a single ten-step run still reaches $\tau = 1$ without ever sampling a training point from that low-noise region during fitting.
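One way to draw from this schedule is to sample the Beta variable and invert the substitution. The sketch below uses only the Python standard library; the function name `sample_tau` and the sample size are illustrative choices.

```python
import random

S = 0.999  # cutoff from the text


def sample_tau(rng, s=S):
    """Draw tau whose density follows Beta((s - tau) / s; 1.5, 1) on [0, s]."""
    x = rng.betavariate(1.5, 1.0)  # x in [0, 1], density 1.5 * sqrt(x)
    return s * (1.0 - x)           # invert x = (s - tau) / s


rng = random.Random(0)
taus = [sample_tau(rng) for _ in range(20000)]
mean_tau = sum(taus) / len(taus)
share_below_half = sum(t < 0.5 for t in taus) / len(taus)

print(round(mean_tau, 3))          # analytic mean is s * (1 - 0.6), about 0.40
print(round(share_below_half, 3))  # analytic share is about 0.65
print(min(taus) >= 0.0, max(taus) <= S)
```

Roughly two thirds of the training draws land in the noisier half of the interval, and none land above the cutoff.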
Toggle the schedules below and slide $s$. The pi-0 mass piles at low $\tau$, the SD3 logit-normal piles in the middle, the Lipman default is flat. At $s = 1$ pi-0's cutoff line disappears off the right edge (which is the degenerate case the paper avoids by setting $s = 0.999$).
That this is a hypothesis matters. The paper does not run a controlled ablation against the logit-normal on action prediction, so the schedule is a piece of engineering taste rather than a verified theorem. If you reproduce pi-0 and your numbers come in soft, this is a knob to revisit.
Two weight sets in one transformer
The architecture has to satisfy two requirements at once. First, it has to carry the semantic competence of a pre-trained vision-language model: a robot that can read "fold the shirt" needs an actual language model, not a small classifier head. Second, it has to provide a velocity-field network that can emit a 900-dimensional continuous chunk in ten cheap forward passes, conditioned on the observation. Those two networks have different shapes.
pi-0 compromises by running one transformer with two disjoint sets of weights: at every layer, image and language tokens go through PaliGemma's weights, while state and noisy-action tokens go through the smaller weights of the action expert. The two streams meet only when the layer does self-attention: queries, keys, and values are computed with each expert's own projections, then all tokens attend together. So an action token can read a language token's K and V, but the two never pass through the same MLP.
Hover a token in the figure to highlight its route.
The paper credits this design to Transfusion (Zhou et al., 2024). One nuance, though: the architectures are not identical. Transfusion uses one shared transformer with modality-specific embed and unembed layers; pi-0's routed-MLP design is closer to a mixture-of-transformers (or a two-expert mixture-of-experts where which expert you go to is decided by the token's modality, not by a learned router). The substance is the same (one self-attention, modality-specific compute everywhere else), but the citation is approximate.
Two practical reasons to split the MLPs rather than share them. The action tokens were never seen during PaliGemma pre-training, so pushing them through the pre-trained language weights would spend those weights on out-of-distribution inputs. And the action expert can be smaller and cheaper: width 1024 and mlp_dim 4096, against the backbone's width 2048 and mlp_dim 16384. That shrinks the per-flow-step forward pass, which the model takes ten times per decision, while the expensive backbone runs once per decision and then sits in a cache.
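To get a feel for how much smaller the expert's MLP is, here is a rough parameter count. It assumes a gated MLP with three weight matrices (gate, up, down) and no biases, which is how Gemma-style blocks are usually laid out; treat that factor of three as an assumption. The ratio between the two experts does not depend on it.

```python
def gated_mlp_params(width, mlp_dim):
    """Weights in one gated MLP block: gate, up and down projections, no biases."""
    return 3 * width * mlp_dim


backbone_mlp = gated_mlp_params(width=2048, mlp_dim=16384)
expert_mlp = gated_mlp_params(width=1024, mlp_dim=4096)

print(backbone_mlp)               # per-layer MLP weights, backbone
print(expert_mlp)                 # per-layer MLP weights, action expert
print(backbone_mlp / expert_mlp)  # how many times smaller the expert's MLP is
```

Halving the width and quartering mlp_dim makes each expert MLP eight times smaller than the backbone's, which is where most of the per-step saving comes from.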
The action expert is also where the flow matching timestep enters the model. For each noisy action $a_{t'}^\tau$ the embedding fed to the transformer is

$$W_3 \cdot \mathrm{swish}\bigl(W_2 \cdot \mathrm{concat}(W_1 \cdot a_{t'}^\tau,\; \phi(\tau))\bigr),$$

where $\phi$ is a sinusoidal positional encoding of the scalar timestep $\tau$ (the same mechanism the original Transformer uses for sequence position), and $W_1, W_2, W_3$ are learned projections inside the expert. The same scalar tells every noisy-action token where on the noise-to-clean ladder the input sits, so the network can swing its predictions accordingly.
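A toy version of this embedding, assuming the structure "project the action, concatenate a sinusoidal encoding of the timestep, then apply a two-layer MLP with a swish nonlinearity". The period range, the tiny width, and the random weights are all hypothetical choices for illustration, not values from the paper.

```python
import math
import random


def sinusoidal_encoding(tau, width, min_period=4e-3, max_period=4.0):
    """Encode scalar tau as width/2 sines followed by width/2 cosines."""
    half = width // 2
    sines, cosines = [], []
    for i in range(half):
        frac = i / max(half - 1, 1)
        period = min_period * (max_period / min_period) ** frac  # log-spaced
        angle = 2.0 * math.pi * tau / period
        sines.append(math.sin(angle))
        cosines.append(math.cos(angle))
    return sines + cosines


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def swish(x):
    return [xi / (1.0 + math.exp(-xi)) for xi in x]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def embed_action(action, tau, W1, W2, W3):
    width = len(W1)
    joined = matvec(W1, action) + sinusoidal_encoding(tau, width)  # length 2 * width
    return matvec(W3, swish(matvec(W2, joined)))


rng = random.Random(0)
d, w = 18, 8  # action dimension from the text; toy embedding width
W1, W2, W3 = random_matrix(w, d, rng), random_matrix(w, 2 * w, rng), random_matrix(w, w, rng)
action = [rng.gauss(0.0, 1.0) for _ in range(d)]

emb_noisy = embed_action(action, 0.1, W1, W2, W3)
emb_clean = embed_action(action, 0.9, W1, W2, W3)
print(len(emb_noisy), emb_noisy != emb_clean)
```

The same action produces a different token at different timesteps, which is what lets the network condition its velocity prediction on the noise level.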
Blocks, causality, and the cache
The transformer also has a non-standard attention mask, and that mask is what makes the inference budget in the next section work. pi-0 groups the input into three contiguous blocks: (1) the image and language tokens, (2) the proprioceptive state $q_t$, and (3) the $H$ noisy action tokens $a_t^\tau, \dots, a_{t+H-1}^\tau$. Inside each block, attention is bidirectional. Across blocks it is causal: block 2 can read block 1, block 3 can read both, but block 1 cannot read block 2 or 3.
Two reasons for this layout, both load-bearing. First, the images-and-language block is the one PaliGemma was trained on, with PaliGemma's own attention pattern. By preventing it from attending to the new blocks (the proprioceptive state and the noisy actions) the model keeps that block's input distribution faithful to pre-training, so the backbone's weights are not asked to absorb a sudden distribution shift the first time a state vector shows up.
Second, this mask makes the prefix cacheable. The observation (images, language, state) is fixed for an entire decision. The noisy action chunk changes ten times: once per Euler step. If the prefix never depends on the noisy actions, you can run it through the transformer once, save the K and V tensors for every prefix token, and then for each of the ten Euler steps only recompute K and V for the $H$ action tokens, plugging in the cached prefix. The cost of one decision becomes "one expensive prefix pass plus ten cheap action passes," which is exactly the timing breakdown in the next section.
Drag the query position below. Inside the block the row is bright (full bidirectional attention). Reaching back into an earlier block the row is dim but live. Future blocks are crossed out.
With that mask in place, the pre-trained backbone stays on inputs it has seen before, and the ten-step sampling loop reduces to one big forward pass plus ten small ones.
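The blockwise causal mask is easy to build explicitly. This sketch uses toy block sizes (four image/language tokens, one state token, three action tokens) rather than the real sequence lengths, and the function name is hypothetical.

```python
def blockwise_causal_mask(block_sizes):
    """mask[i][j] is True when query token i may attend to key token j.

    Tokens attend freely inside their own block and to every earlier block,
    never to a later block.
    """
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


# Toy sizes: 4 image/language tokens, 1 state token, 3 noisy-action tokens.
mask = blockwise_causal_mask([4, 1, 3])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The first four rows never look past column 3, so their keys and values cannot depend on the state or the noisy actions. That independence is the property that lets the prefix be computed once and reused.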
Ten Euler steps per decision
The paper's Table I times pi-0 on an NVIDIA RTX 4090. Image encoders run once and cost 14 ms. The observation forward pass (which fills the prefix cache) costs 32 ms. Then ten Euler steps of the action expert run, costing 2.7 ms each for a total of 27 ms. Total on-board: 73 ms per decision. On a mobile robot, where inference happens off-board over Wi-Fi, add 13 ms of network latency for 86 ms total.
Slide N below and you can see the shape. The flow-step bar grows linearly with $N$, but it does not overtake the prefix cost until $N$ reaches 18 ($17 \times 2.7 = 45.9$ ms is still under 46 ms). That asymmetry is what the cache buys: at the paper's $N = 10$, the network spends more time on the prefix (46 ms) than on the flow (27 ms). Cutting $N$ in half would shave only about 13 ms.
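The same budget as arithmetic, using the Table I numbers quoted above. The sketch assumes flow-step cost scales linearly at 2.7 ms per step, which is an extrapolation from the single measured point of ten steps.

```python
IMAGE_ENCODERS_MS = 14.0
OBSERVATION_PASS_MS = 32.0
FLOW_STEP_MS = 2.7   # 27 ms for ten steps, assumed linear in the step count
NETWORK_MS = 13.0    # extra latency when inference runs off-board

PREFIX_MS = IMAGE_ENCODERS_MS + OBSERVATION_PASS_MS


def decision_latency_ms(num_steps=10, off_board=False):
    """Total time for one decision: prefix once, then num_steps flow passes."""
    total = PREFIX_MS + num_steps * FLOW_STEP_MS
    return total + (NETWORK_MS if off_board else 0.0)


def first_step_count_where_flow_dominates():
    """Smallest N at which the flow passes cost more than the prefix."""
    n = 1
    while n * FLOW_STEP_MS <= PREFIX_MS:
        n += 1
    return n


print(round(decision_latency_ms(10), 1))
print(round(decision_latency_ms(10, off_board=True), 1))
print(round(decision_latency_ms(10) - decision_latency_ms(5), 1))
print(first_step_count_where_flow_dominates())
```

Because the prefix is a fixed cost, trimming Euler steps buys little; the larger lever would be the image encoders and the observation pass.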
Two things sit behind those numbers. First, 73 ms for a 3.3B-parameter model is a lot in absolute terms, but it sits well inside the control loop's budget. At 50 Hz a chunk of $H = 50$ actions lasts 1 s; with an inference every 0.5 s, each call has roughly 500 ms before the next one is due, nearly seven times the 73 ms it needs. Second, off-board inference matters here: the mobile robots in the paper run their NVIDIA hardware separately and stream video and commands over Wi-Fi, and the 13 ms latency adder is built into the budget.
The data, and the n^α reweighting
Architecture accounts for only half of pi-0; the rest sits in how the dataset is assembled and sampled. Total pre-training data: 903 million timesteps of teleoperated robot trajectories from Physical Intelligence's own fleet, plus a 9.1% slice of public data drawn from Open-X-Embodiment (specifically the filtered "OXE Magic Soup" subset from OpenVLA), Bridge v2, and DROID. The Physical Intelligence data covers 7 robot configurations (a single-arm UR5e, a dual-arm UR5e, a Franka, a bimanual Trossen ALOHA, bimanual ARX/AgileX, two mobile ALOHA variants, and a mobile holonomic Fibocom) and 68 task families.
That mix sits far from uniform: the big dual-arm and mobile platforms each contribute hundreds of millions of timesteps, and the single-arm Franka contributes a fraction of that. Sample uniformly across timesteps and the biggest platforms drown out the smaller ones. Sample uniformly across platforms and the biggest platforms are starved of the gradient signal they have earned. pi-0 splits the difference with a power-law reweighting: each platform-task combination is sampled in proportion to $n^{\alpha}$, where $n$ is its raw timestep count and $\alpha = 0.43$.
The choice of 0.43 is not cited to any prior recipe. (Similar exponents show up in multilingual LM mixture work and language-weighting power laws in Llama; pi-0 does not attribute its 0.43 to any of them.) Treat it as a pi-0-specific constant, calibrated empirically. Slide $\alpha$ in the figure: at $\alpha = 1$ the sampling share is the raw share; at $\alpha = 0$ every platform is sampled equally; the paper's 0.43 sits closer to equal than to raw.
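The reweighting is one line of arithmetic. In the sketch below the three platform names and their timestep counts are made up purely to show the effect; only the exponent 0.43 comes from the text.

```python
def sampling_shares(counts, alpha=0.43):
    """Normalized sampling probabilities proportional to n ** alpha."""
    weights = {name: n ** alpha for name, n in counts.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical timestep counts, chosen only to illustrate the effect.
counts = {
    "big_platform": 400_000_000,
    "mid_platform": 50_000_000,
    "small_platform": 5_000_000,
}

for alpha in (1.0, 0.43, 0.0):
    shares = sampling_shares(counts, alpha)
    print(alpha, {name: round(share, 3) for name, share in shares.items()})
```

At the intermediate exponent the smallest platform is sampled far more often than its raw share would give it, while the largest still gets the most gradient signal.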
On top of pre-training, pi-0 runs a separate post-training phase. The pre-training mixture should be as diverse as possible, even at the cost of quality, so the model learns to recover from messy real-world states (a dropped sock, a tipped cup). The post-training mixture should be small and high-quality, the gold-standard demonstration of one task done well, so the model learns to perform that task fluently and consistently. The split mirrors the pre-train / post-train split in language models. Some pi-0 tasks use 5 hours of post-training data; the laundry-folding tasks use over 100.
The training step itself fits in a few lines:
# pi-0 training step (one block in the action expert)
o_t = (images, language, q_t) # the observation
A_t = future_actions[t : t+H] # the clean chunk (H = 50)
tau = sample_tau() # Beta((s - tau)/s ; 1.5, 1), s=0.999
eps = randn_like(A_t) # N(0, I), same shape as A_t
A_tau = tau * A_t + (1 - tau) * eps # noisy chunk on path
v = action_expert(A_tau, o_t, tau) # predicted velocity
u = A_t - eps # ground-truth velocity
loss = mse(v, u) # plain L2 on velocity
loss.backward() # grads only to actively-trained weights
opt.step()
And inference:
# pi-0 inference: one decision = 10 Euler steps
encode_images_and_obs(o_t) # pay once: image encoders 14 ms, obs 32 ms
cache_KV_for_prefix(o_t) # the prefix is FROZEN across 10 steps
A = randn(H, action_dim) # A^0 ~ N(0, I)
for k in range(10): # delta = 0.1
tau = k * 0.1
v = action_expert(A, o_t, tau) # only the action tokens recompute K,V
A = A + 0.1 * v # forward Euler (eq 2)
return A # the H = 50 action chunk, executed open-loop
What it actually does
The paper reports three sets of evaluations, each pointed at a different question: how strong the pre-trained model is on its own, how much the language backbone earns its keep, and how well fine-tuning transfers to tasks that were not in the pre-training mixture.
Pre-training, evaluated cold. Without any task-specific fine-tuning, pi-0 is run on five tasks present in pre-training: shirt folding, easy bussing (7 objects), hard bussing (12 objects), grocery bagging (7 items), and pulling toast out of a toaster. Scored against OpenVLA (7B, autoregressive), Octo (93M, diffusion head, no LM), and pi-0's own non-VLM baseline (a 470M model that uses DistilBERT for language and a DiT-style action expert), pi-0 wins by large margins on every task. The "compute parity" version of pi-0 (only 160k training steps, matching the baselines, against 700k for the full model) still beats every baseline. OpenVLA's failure mode is the one the architecture predicts: it has no action chunking, so it cannot keep up with high-frequency dexterous tasks.
Language following. The paper fine-tunes pi-0 on bussing, table setting, and grocery bagging, and compares "flat" prompting (one overall task command like "bag the groceries") against intermediate commands from a human expert and intermediate commands from a high-level VLM planner (analogous to SayCan). With human intermediate commands the model's task progress jumps; with autonomous high-level planning it jumps less, but still substantially. The non-VLM 470M baseline, by contrast, barely benefits from either kind of intermediate command, since its language understanding caps out at the level of DistilBERT-style classification. The 3B PaliGemma backbone is doing real semantic work.
Hard fine-tuning. The paper picks downstream tasks chosen to vary in distance from pre-training: stacking bowls and folding a towel (easy, similar to pre-train), Tupperware-in-microwave (mid, new container shapes), paper towel replacement and Franka items-in-drawer (hard, new objects and new motions). pi-0 fine-tuned beats Diffusion Policy, ACT, OpenVLA, and Octo across the spread, sometimes by 2x on the same dataset. And the multi-stage tasks (laundry folding from a bin, table bussing with novel objects, assembling a cardboard box from flat, packing eggs, packing a to-go box) get pi-0 above 50% success on every task with 5 to 100 hours of post-training data each, tasks that the paper could not solve with any other method.
The pre-training does work. On hard tasks the pre-trained model beats the from-scratch baseline by very large margins; on easy ones the gap is smaller. Pre-training is most useful exactly when you have the least task-specific data.
A few things the paper does not yet claim. It does not show fine-grained scaling laws (no clean ablation of pre-training data size, model size, or post-training data size against final task performance). It does not solve every task: laundry on the mobile robot, in particular, comes in below the static version, indicating that adding a navigation challenge to the task is still a hard generalization. And the recipe has not been shown to transfer cleanly to non-manipulation domains: driving, navigation, and legged locomotion are explicitly left for future work.
What pi-0 demonstrates is that the language-model pre-training playbook works for robotics too, once you commit to a particular stack of engineering moves: action chunking to outrun the high-frequency control loop, flow matching to model continuous multimodal action distributions, and an action expert glued onto a genuine VLM through shared self-attention so the expensive forward passes are paid once per decision and the cheap ones run ten times. One model, one set of weights, seven robots, sixty-eight task families.
Questions you might still have
Why flip the convention so τ = 0 is noise and τ = 1 is data?
pi-0 follows Lipman et al. (2022), whose conditional flow path starts at the noise base and ends at the data. Standard diffusion goes the other way (clean → noise as t grows). The paper does not switch back, so every equation here uses Lipman's τ convention, with τ = 0 at noise and τ = 1 at the clean action chunk.
Why ten Euler steps?
Two reasons. First, the optimal-transport probability path is straight, so the velocity field is almost constant in τ and a low-order integrator works well. Second, ten steps sit comfortably within what the cutoff s = 0.999 allows: the step size δ = 0.1 is far larger than 1 − s = 0.001, so the last network evaluation happens at τ = 0.9 and the model never has to be queried inside the cutoff region it was not trained on. The cutoff constrains the step count from above (up to roughly 1,000 steps), not from below.
Why not just one transformer instead of two routed weight sets?
The action and state tokens are new to PaliGemma: they were not in pre-training. Pushing them through the same MLPs would either degrade the language weights or force them to absorb out-of-distribution inputs. Splitting the MLPs lets the small action expert handle the new tokens with fresh weights, while the big backbone keeps doing what it was pre-trained for. Self-attention is still shared, so the action expert can read the language and image K and V.
Why is the action expert smaller than the backbone?
It runs ten times per decision instead of once. The backbone fills the prefix cache (32 ms) and then sits; the action expert is what every Euler step actually executes (2.7 ms each). A smaller expert is the only way to keep ten flow steps cheap.
Why open-loop chunks instead of temporal ensembling like ACT?
pi-0 tried temporal ensembling early on and found it hurt performance, so the robot executes actions open-loop and re-plans every 25 ticks (0.5 s at 50 Hz) instead of averaging overlapping chunks. That deviates from ACT's recipe in substance: it puts more weight on chunk quality (the model has to be right for 0.5 s ahead) and less on reactive correction inside a chunk.
How is this different from RT-2 or OpenVLA?
RT-2 and OpenVLA both tokenize continuous actions into discrete tokens and decode autoregressively. That works at low frequencies but does not give you a chunk of 50 actions in a single inference call. pi-0 leaves discrete tokens for language and images; actions stay continuous and come out as one chunk per call, drawn from a multimodal distribution by ten Euler steps. The model also keeps the full PaliGemma backbone for its semantic competence, instead of repurposing the language head to emit motor tokens.
Footnotes & further reading
- The paper: Black, Brown, Driess, et al., π₀: A Vision-Language-Action Flow Model for General Robot Control (Physical Intelligence, RSS 2025). Project page with videos.
- Flow matching itself: Lipman, Chen, Ben-Hamu, Nickel, Le, Flow Matching for Generative Modeling (ICLR 2023). The linear / optimal-transport conditional path is their Example II.
- The logit-normal timestep schedule pi-0 deliberately inverts: Esser et al., Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (SD3).
- Action chunking: Zhao, Kumar, Levine, Finn, Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ACT, RSS 2023). pi-0 inherits the H-step chunk but drops temporal ensembling.
- The VLM backbone: Beyer et al., PaliGemma: A versatile 3B VLM for transfer.
- The architecture pi-0 cites for the two-experts-one-attention design: Zhou et al., Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model.
- Diffusion-based action policies in the same family: Diffusion Policy (Chi et al., 2023) and Octo (Octo Model Team, 2024).
- Earlier VLAs: RT-2 (Brohan et al., 2023) and OpenVLA (Kim et al., 2024). Both tokenize actions and autoregress; pi-0's flow-matching head is the continuous counterpoint.
- Open-source data sources pi-0 mixes in: Open-X-Embodiment, Bridge v2, and DROID.
- The high-level VLM-planning analogue: SayCan (Ahn et al., 2022).