RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Encode a robot action as a short string of integers, and a vision-language model can output it.
RT-2 turns each robot action into eight integer bin ids, joins them with spaces, and trains a vision-language model to emit that string. The web knowledge the model already had comes along: it can place an apple on the number 3, move a can to a logo, or pick up a rock to hammer with.
A 55B-parameter vision-language model, trained on web images and text, learns to output robot end-effector deltas as text tokens. It still answers visual questions. It also drives an arm.
For a while, vision-language models have been able to look at a picture of a dog and tell you it is a dog. Robots have been able to pick up a block when you ask them to pick up a block. RT-2 lives in the gap between those two abilities. A pick-up-the-block policy does not know that a strawberry belongs in the fruit bowl, because nobody put strawberries in its training set. A vision-language model knows it. The question the paper takes on is how to let the policy borrow that knowledge without retraining it from scratch.
The earlier answer was a pipeline: the language model planned in words, then handed off to a separate low-level controller that turned each step into motion. That works, except the controller never saw the language model's knowledge during its own training. So everything subtle, the strawberry-belongs-with-fruit kind of reasoning, died at the handoff.
RT-2 collapses the handoff. It takes a large pretrained vision-language model (PaLI-X or PaLM-E, both Google DeepMind's) and trains the same network to also emit robot actions, encoded as text. The action becomes one short response in the model's normal output stream, alongside captions and answers to visual questions. The web knowledge and the motor commands now share parameters, so semantic concepts the model learned from billions of images and sentences can flow into how it moves the arm.
Five ideas, in dependency order, hold the method together: discretize the eight action values into bin ids, map those ids onto existing language tokens, constrain decoding to the action vocabulary at output, co-fine-tune with the original web data so nothing is forgotten, and serve the 55B model over a network from a TPU pod so the robot can run it.
Actions are eight words
Start with the action. The policy controls a 7-DoF mobile manipulator and outputs, at each step, an 8-tuple: a 6-DoF end-effector displacement (three position deltas Δpos, three rotation deltas Δrot), a continuous gripper extension level, and a discrete terminate command that signals successful completion. Eight numbers per timestep, seven continuous and one discrete.
A language model emits tokens, not floats. RT-2 borrows the trick from RT-1: take every continuous dimension and chop its range into 256 uniform bins. A bin index is an integer between 0 and 255. Replace each float by its bin id and the action becomes a sequence of eight integers, which can be written as a short space-separated string. A possible target is 1 128 91 241 5 101 127 217.
That is the entire action representation. Eight integer tokens, formatted as a VQA-style response. The prompt is also VQA-shaped: an image of the scene plus a question of the form Q: what action ... ? A:, and the model answers with the eight integers. Training reduces to standard next-token cross-entropy.
Two questions sit on the binning. First: does 256 bins give enough resolution? For a manipulation task with workspace on the order of a meter, a Δpos bin is a few millimeters wide, which is the same scale as the robot's own positional repeatability, so further resolution is wasted. Second: does discretization cost accuracy? Some, but the policy runs in closed loop at several Hz, so each individual action is a small nudge that the next observation can correct. A coarse step rate that re-plans every 200ms is more forgiving than a single one-shot continuous prediction would be.
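The binning and the resolution argument can be sketched in a few lines of Python. This is a minimal sketch, not the paper's code: the helper names `to_bin` and `from_bin` are hypothetical, and the ±0.5 m range for a position delta is an assumption chosen only to match the "order of a meter" workspace mentioned above.

```python
NUM_BINS = 256  # from the text: 256 uniform bins per continuous dimension


def to_bin(value, low, high, num_bins=NUM_BINS):
    """Map a continuous value in [low, high] to an integer bin id in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)         # out-of-range values saturate
    fraction = (clipped - low) / (high - low)    # position within the range, 0..1
    return min(int(fraction * num_bins), num_bins - 1)


def from_bin(bin_id, low, high, num_bins=NUM_BINS):
    """Return the center of a bin, the value a decoder would hand back to the robot."""
    width = (high - low) / num_bins
    return low + (bin_id + 0.5) * width


# Assumed range for one position delta: a 1 m span (hypothetical, for illustration).
LOW, HIGH = -0.5, 0.5
bin_width_mm = (HIGH - LOW) / NUM_BINS * 1000.0  # about 3.9 mm per bin

b = to_bin(0.0, LOW, HIGH)                       # the midpoint lands in bin 128
round_trip_error = abs(from_bin(b, LOW, HIGH) - 0.0)

# Assemble eight bin ids into the space-separated target string.
action_bins = [1, 128, 91, 241, 5, 101, 127, 217]
target = " ".join(str(i) for i in action_bins)
```

Under these assumptions the worst-case rounding error is half a bin, roughly 2 mm, which is the "small nudge" scale that the next closed-loop step can correct.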
The training pair is now plain text in, plain text out. One image, one instruction, one eight-token answer. The same cross-entropy loss the model already uses for "name three things in this picture" trains it to output 1 128 91 241 5 101 127 217:
# input: image + a VQA-format instruction
prompt = (
"Q: what action should the robot take to "
"[task instruction]? A:"
)
# target: an 8-token string the model learns to emit
# bin1 bin2 bin3 bin4 bin5 bin6 bin7 bin8
# terminate Δpos_x Δpos_y Δpos_z Δrot_x Δrot_y Δrot_z grip
target = "1 128 91 241 5 101 127 217"
# training step: vanilla next-token cross-entropy
loss = next_token_ce(VLM(image, prompt), target)
Where the action ids live in the vocabulary
Calling the action a string is only half a plan: there has to be a token id for every bin. RT-2 uses two different VLM backbones, which tokenize text differently, so it needs two answers.
PaLI-X has the easy case. Its tokenizer already gives every integer from 0 to 999 its own dedicated single token. So the bin ids 0 through 255 are already in the vocabulary as integer tokens. No new entries, no overwrites, just point the training target at the existing integer tokens and the model can emit them.
PaLM-E does not have that property. Numbers like 128 get broken across several subword pieces, which would mean an action token costs the model multiple decode steps and makes constrained decoding awkward. The workaround is symbol tuning, a technique from Wei et al. (2023): pick the 256 least-frequent existing tokens in the vocabulary and overwrite their embeddings to mean the new action ids. Those tokens already carry almost no statistical weight (they are the dregs of the original tokenizer), so reusing their slots costs little, and the co-fine-tuning teaches the model what their new meaning is.
The two backbones end up with differently shaped action vocabularies. PaLI-X reuses a contiguous block of integer tokens near the head of the vocabulary; PaLM-E scatters its 256 reused slots across the tail, wherever the rarest tokens happened to sit.
Symbol tuning works because vocabulary tokens are not equal citizens. The bottom 256 in a SentencePiece tokenizer account for a vanishing fraction of training tokens, so overwriting them barely perturbs the language distribution. The same reasoning underwrites symbol tuning as an in-context-learning method: forcing the model to associate arbitrary labels with arbitrary semantic categories works precisely when the labels are unloaded slots the model can rewrite without losing useful structure elsewhere.
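The two ways of finding token ids for the 256 bins can be sketched as follows. This is an illustrative sketch under stated assumptions, not either model's real tokenizer: `integer_token_map` and `rarest_token_map` are hypothetical helpers, the toy vocabulary and frequency table are invented, and the order in which bins are assigned to the rare tokens is an arbitrary choice made here.

```python
NUM_ACTION_BINS = 256


def integer_token_map(vocab):
    """PaLI-X-style case: each bin id already has its own integer token."""
    return {b: vocab[str(b)] for b in range(NUM_ACTION_BINS)}


def rarest_token_map(token_counts):
    """PaLM-E-style case: reuse the 256 least-frequent token ids as action tokens."""
    # Sort by (count, token id) so that ties break deterministically.
    rarest = sorted(token_counts, key=lambda t: (token_counts[t], t))[:NUM_ACTION_BINS]
    return {b: tok for b, tok in enumerate(rarest)}


# Toy vocabulary (hypothetical): the strings "0".."999" map to token ids 0..999.
toy_vocab = {str(i): i for i in range(1000)}
pali_map = integer_token_map(toy_vocab)

# Toy frequency table (hypothetical): token id t was seen (5000 - t) times,
# so the rarest tokens are the ones with the highest ids.
toy_counts = {t: 5000 - t for t in range(5000)}
palm_map = rarest_token_map(toy_counts)
```

In the first map the bin ids form one contiguous block; in the second they land wherever the rare tokens sit, which in this toy table is the top of the id range.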
Only legal tokens at output
One last detail keeps the action stream clean. The same RT-2 model still answers visual questions in natural language. If you ask it "what is in the bowl" you want it to say "an apple," not 128 91. And when you ask for an action, you want exactly eight integers, no stray articles.
RT-2 enforces that by masking the output vocabulary at decode time. On a robot-action prompt the sampler is restricted to the 256 action ids only; every other token id has its logit driven to negative infinity, so its softmax probability is zero. The model is still free to rank the 256 action ids against each other (that is what carries the policy), but cannot accidentally emit "the." On a vision-language prompt the mask is lifted and the full vocabulary is in play again.
Without the mask, common word tokens dominate the softmax because the model is reading something that looks like the beginning of a sentence; with the mask, all that probability mass collapses onto the action ids, and their relative ranking, which is what the policy actually cares about, becomes the decision.
Two consequences follow. The mask gives a hard guarantee, not a soft preference, so RT-2 cannot emit a syntactically invalid action even when the model is confused. And the ranking inside the band carries the policy: a 0.3 vs 0.2 split between bin 128 and 129 says the model has a fine-grained motor preference, even when both numbers look small in absolute terms.
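The masking step is small enough to write out. The sketch below is a generic masked softmax in plain Python, not the paper's decoder: the six-token vocabulary, the logit values, and the name `masked_softmax` are all made up for illustration.

```python
import math


def masked_softmax(logits, allowed_ids):
    """Softmax over `logits` with every id outside `allowed_ids` forced to probability 0."""
    allowed = set(allowed_ids)
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    peak = max(masked)                            # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of 6 tokens (hypothetical): ids 0-2 are words, ids 3-5 are action ids.
logits = [4.0, 3.5, 3.0, 1.0, 0.6, -2.0]
ACTION_IDS = [3, 4, 5]

free = masked_softmax(logits, range(len(logits)))  # mask lifted: word tokens dominate
forced = masked_softmax(logits, ACTION_IDS)        # mask on: only action ids survive
best_action = max(ACTION_IDS, key=lambda i: forced[i])
```

The mask renormalizes without reordering: the ratio between any two action ids is the same in `free` and `forced`, which is why the ranking inside the band can be read as the policy.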
Co-fine-tune to keep the web
The naive next step would be to fine-tune the VLM on the robot dataset and call it done. The paper says that loses most of the value. Fine-tuning on robot data alone teaches the model to emit good action strings, but it forgets the visual concepts that made the VLM interesting in the first place. The strawberry-belongs-with-fruit reasoning starts to fade, and so does generalization to objects the robot has never seen.
Co-fine-tuning fixes that by mixing the original web data back in. Each training batch holds both robot trajectories and standard vision-language examples (VQA, captioning, the same mixture the underlying PaLI-X or PaLM-E was trained on). The mixing ratio is tilted toward the robot data so a single fine-tuning run can shift the model into a useful action regime: about 50% robot for PaLI-X, about 66% for PaLM-E. The original objective and the new one share weights and a loss surface, so the model holds onto its visual concepts while learning to act.
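One way to picture the mixture is as a batch sampler that draws from two pools. This is a sketch under assumptions: the text gives only the approximate ratios, so fixing an exact robot count per batch, the batch size of 64, and the helper name `mixed_batch` are all choices made here for illustration.

```python
import random


def mixed_batch(robot_examples, web_examples, batch_size, robot_fraction, rng):
    """Draw one batch with about `robot_fraction` robot examples and the rest web examples."""
    n_robot = round(batch_size * robot_fraction)
    batch = [rng.choice(robot_examples) for _ in range(n_robot)]
    batch += [rng.choice(web_examples) for _ in range(batch_size - n_robot)]
    rng.shuffle(batch)  # interleave the two sources within the batch
    return batch


# Placeholder datasets (hypothetical): real examples would be image, prompt, target triples.
robot_data = [("robot", i) for i in range(100)]
web_data = [("web", i) for i in range(100)]

rng = random.Random(0)
pali_x_batch = mixed_batch(robot_data, web_data, batch_size=64, robot_fraction=0.50, rng=rng)
palm_e_batch = mixed_batch(robot_data, web_data, batch_size=64, robot_fraction=0.66, rng=rng)
```

Because both kinds of example are plain text targets, a single cross-entropy loss over the shuffled batch covers both objectives.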
Table 6 isolates each of the three ingredients. Train a 5B PaLI-X from scratch on robot data only and it collapses to 9% on the unseen-task average. Fine-tune the pretrained 5B and the same evaluation jumps to 42%. Co-fine-tune and it edges up to 44%. Move to the 55B model and the gap widens: 52% with fine-tuning, 63% with co-fine-tuning. The paper skips the 55B from-scratch run because the 5B from-scratch number already settled that question.
One sentence on why these numbers compose this way. Pretraining is what gives the model concepts; fine-tuning is what attaches actions to those concepts; co-fine-tuning is what keeps the concepts from being smudged out while the attachment is learned. Each layer earns its share of the final number, which is why removing any of them visibly hurts.
Running a 55B model on a robot
The 55B PaLI-X cannot fit on a robot. That class of model needs a multi-TPU pod to run, so a closed-loop policy that calls it once per step has to call it across the network. RT-2 hosts the model on a Google TPU cloud service and the robot sends each image to it. With that arrangement the 55B variant runs at 1 to 3 Hz of control, and the smaller 5B variant at about 5 Hz. The smallest, the 3B PaLI used for the Language-Table benchmark, also runs at 5 Hz.
What 1-3 Hz means in practice. A pick-and-place on a 7-DoF arm typically takes ten to thirty seconds of motion, so the policy has somewhere between ten and a hundred decisions over an episode. That is enough for reactive correction on slow manipulation, and not enough for anything contact-rich or fast (the failure cases in Appendix G are exactly those: pushing a banana whose center of mass surprises the model, or grabbing a handle that needs precise alignment). The paper flags inference cost as the next bottleneck for that reason.
# closed-loop control over a network call to a TPU pod
while episode_running():
    image = camera.read()
    prompt = render(prompt_template, instruction)
    # forced-choice decoding over the 256 action ids only
    ids = VLM.generate(image, prompt, vocab_mask=ACTION_IDS, n=8)
    # map each of the eight bin ids back to a value for its dimension
    a = [bin_to_value(d, ids[d]) for d in range(8)]
    if a[0] == TERMINATE:  # slot 0 is the terminate flag
        break
    robot.step(a)  # at 1-3 Hz for 55B, 5 Hz for 5B
The cloud-served control loop is a real architectural choice, not just a deployment detail. The model that runs the arm is the same checkpoint that answers a VQA query about the image; the robot is one client of a service that other robots, and other applications, can share. The control rate is determined by network round-trip plus 55B decode time, not by anything the robot itself can speed up.
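The rate arithmetic behind this section is worth writing down. The sketch below assumes one action per network round trip plus one decode, with nothing overlapped; the 0.1 s and 0.4 s latencies are hypothetical numbers picked to land inside the 1 to 3 Hz band, not measurements from the paper.

```python
def control_rate_hz(round_trip_s, decode_s):
    """One action per network round trip plus one model decode."""
    return 1.0 / (round_trip_s + decode_s)


def decisions_per_episode(duration_s, rate_hz):
    """How many actions the policy gets to choose during one episode."""
    return int(duration_s * rate_hz)


# Hypothetical latency split: 0.1 s network round trip + 0.4 s decode gives 2 Hz.
rate = control_rate_hz(round_trip_s=0.1, decode_s=0.4)

# Bounds from the text: 10 to 30 s episodes at 1 to 3 Hz.
fewest = decisions_per_episode(10, 1.0)  # 10 decisions
most = decisions_per_episode(30, 3.0)    # 90 decisions
```

The two bounds are where the "between ten and a hundred decisions" figure comes from.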
New objects, backgrounds, places
With those pieces in place the first thing to measure is plain generalization: how well does the policy work on objects, backgrounds, and environments the robot training set never showed it? The paper evaluates on three axes, each split into easy and hard cases, against four baselines (R3M, VC-1, RT-1, MOO). The results land in Table 4.
On seen tasks every method is roughly competitive: RT-1 hits 92%, RT-2 hits 91 to 93%. The story is on the unseen columns. RT-2 averages 62% across them, against RT-1's 32% and the visual-representation baselines (R3M, VC-1) at 10 to 12%. The headline number, "2x over RT-1 and 6x over the others," is the unseen average comparison.
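The headline multipliers are rounded ratios of the averages just quoted, which a quick check makes explicit. The variable names are only labels for this sketch; the numbers are the ones in the paragraph above.

```python
# Unseen-scenario averages quoted above, in percent.
rt2_unseen = 62
rt1_unseen = 32
visual_baselines_unseen = (10, 12)  # R3M and VC-1 fall in this quoted range

ratio_vs_rt1 = rt2_unseen / rt1_unseen                               # about 1.9
ratio_vs_visual = [rt2_unseen / b for b in visual_baselines_unseen]  # about 6.2 and 5.2
```

So "2x" rounds up from roughly 1.9, and "6x" sits at the top of a 5x to 6x range.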
The gap is widest on two splits. Unseen objects hard (toys the robot never saw) and unseen environments (a visually distinct office desk, not the kitchen the dataset was collected in) are where the VLM-pretrained models pull furthest ahead, because those are the cases where general world knowledge from the web has the most to contribute.
Two notes on reading these. PaLI-X-55B and PaLM-E-12B average to the same 62% but split their wins: the PaLM-E variant is stronger on harder generalization (hard objects, hard backgrounds), the PaLI-X variant is stronger on the easier splits. The paper attributes that to PaLM-E's pre-training mix, which leans more on broad knowledge tasks, against PaLI-X's heavier visual focus. And the baselines that use a pretrained visual backbone but no language pretraining (R3M, VC-1) score in single digits on unseen environments, which is the cleanest evidence that visual representations alone are not what carries the generalization.
Symbols, reasoning, faces
Past generalization, the more interesting claim is that some abilities of the underlying VLM transfer into the policy even though they were never in the robot dataset. The paper bundles these into three category averages from Table 5: symbol understanding (move the apple onto the number 3, push the can onto the heart), reasoning (math, logos, nutrition, multilingual instructions), and person recognition (move the can to the person with glasses, or to Taylor Swift).
None of these tasks were ever in the robot training data. The robot data is the RT-1 mixture: pick, knock, place upright, move near, open/close drawer, place into receptacle, take from receptacle. There is no robot demonstration of "the apple to a number," let alone "the can to Taylor Swift." Whatever knowledge the policy uses to succeed on these has to be coming from the web pretraining, not from the robot data, with the shared weights carrying it into the action tokens.
RT-2-PaLI-X averages 60% across these categories; RT-1 averages 17%. That is the 3x multiplier the paper cites.
One thing to pin down: what "emergent" means here, since the word carries baggage. The paper uses it in the transfer sense: capabilities that emerge in the policy by way of pretraining, not capabilities that would otherwise require new gradient steps. It does not mean the model suddenly acquires new motor skills. The picking-up is still the same picking-up RT-1 learned; what RT-2 adds is the ability to point that picking-up at a strawberry it never saw, in a kitchen it never saw, because someone asked it to.
Chain-of-thought, for a robot
The last variation in the paper plugs in chain-of-thought prompting. In its language-only form, the method inserts intermediate reasoning text between the question and the answer, so the model can think on the page. RT-2 adapts that to the robot context by inserting a "Plan: ..." sentence between the instruction and the action.
Augmenting the data is straightforward: take an existing training trajectory and prepend a natural-language plan describing what the robot is about to do, then the eight action tokens. A few hundred gradient steps on this format is enough to flip the model into the new behavior. At inference, the same model emits the plan first, then conditions on its own plan to emit the action.
Take an example from the paper. With CoT off, the prompt "I am hungry" goes straight to action tokens, which is a leap of inference the model has to make end-to-end. With CoT on, the model first writes "Plan: pick rxbar chocolate." and the action tokens follow from that intermediate fact, which is the same kind of multi-step inference the underlying VLM is good at on text alone.
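The data augmentation amounts to string formatting, which the following sketch makes concrete. The "Plan: ..." prefix comes from the text; the "Action:" label that separates the plan from the eight tokens, and the helper names `format_target` and `split_response`, are assumptions made for this illustration.

```python
def format_target(action_bins, plan=None):
    """Build the text the model is trained to emit, with an optional plan in front."""
    action_text = " ".join(str(b) for b in action_bins)
    if plan is None:
        return action_text
    return f"Plan: {plan}. Action: {action_text}"


def split_response(response):
    """Recover (plan, action_bins) from generated text in the format above."""
    plan = None
    if response.startswith("Plan: "):
        plan_part, response = response.split(" Action: ", 1)
        plan = plan_part[len("Plan: "):].rstrip(".")
    return plan, [int(tok) for tok in response.split()]


bins = [1, 128, 91, 241, 5, 101, 127, 217]
plain = format_target(bins)
with_plan = format_target(bins, plan="pick rxbar chocolate")
```

At inference the plan is generated as free text and only the final eight tokens would be decoded under the action mask, so a parser like `split_response` has to handle both shapes.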
The scope of this result is narrower than it can sound. The paper calls it "initial evidence," not a measured improvement on the main benchmarks. The CoT model is only evaluated qualitatively, on a handful of rollouts, and the paper does not claim a quantitative win over the non-CoT variant. What it does show is that the same VLA model can do both, with roughly a few hundred fine-tuning steps in between, which suggests that the planner-and-controller split of the earlier pipeline approach can be folded into one network.
What it does not do
The paper's own limits section is short. Three things RT-2 does not buy you.
One, no new motor skills. The model's physical repertoire is the RT-1 skill set, end of story. If you want it to wipe with a towel, fold laundry, or insert a peg, you need new robot data. The transfer is semantic.
Two, inference cost. The 55B model runs at 1-3 Hz over a cloud TPU pod. Anything that needs faster control (assembly, human handoff, fast contact) is out of reach until someone quantizes or distills the model down.
Three, the model pool is small. RT-2 needs a pretrained VLM whose weights or fine-tuning API is accessible. In 2023 that meant PaLI-X and PaLM-E, neither of which is open. The recipe ports cleanly to any open VLM in principle, but the published numbers are tied to closed checkpoints.
None of the ingredients was invented in this paper. RT-1 already discretized eight continuous action values into integer bin ids. PaLI-X and PaLM-E already gave a vision-language model that could read an image and emit text. Symbol tuning already showed that the rarest 256 tokens could be repurposed without hurting language. Co-fine-tuning on a mixture is standard practice. Constrained decoding masks have been used in structured generation for years. What RT-2 contributes is the wiring: write the action as text, share one set of weights with the web data, and let the same network that can describe a strawberry also pick one up.
Questions you might still have
If the action is text, why does the policy not drift out of action space?
At decoding time the sampler is masked to the 256 action ids only when the prompt asks for an action. The model still ranks action tokens against each other; the constraint just zeroes the rest.
Why does co-fine-tuning beat plain fine-tuning?
Fine-tuning only on robot data lets the model forget what its web pretraining taught it. Co-fine-tuning keeps the VQA, captioning, and language batches mixed in (50% robot for PaLI-X, 66% for PaLM-E), so the web concepts stay sharp and transfer to the action tokens.
Does the model learn new motions, or just deploy known ones in new contexts?
Only the latter. The paper is explicit: emergent capabilities are semantic (recognizing a strawberry, picking the smallest object). Physical skills are still bounded by the RT-1 dataset of seven manipulation skills. New motions need new robot data.
Why is symbol tuning safe for PaLM-E?
The 256 least-frequent tokens carry almost no signal in the original distribution, so overwriting their embeddings barely hurts language performance. The model is then forced to learn what each repurposed token means from the co-fine-tuning data.
Footnotes & further reading
- The paper: Brohan et al., RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (Google DeepMind, 2023). Project page with rollout videos: robotics-transformer2.github.io.
- Where the 256-bin discretization comes from: Brohan et al., RT-1: Robotics Transformer for Real-World Control at Scale. The same action format, mobile manipulator, and seven-skill dataset RT-2 reuses.
- The PaLI-X backbone (22B ViT + 32B encoder-decoder UL2, ≈55B): Chen et al., PaLI-X: On Scaling Up a Multilingual Vision and Language Model.
- The PaLM-E backbone (decoder-only LLM with a ViT-4B visual projector): Driess et al., PaLM-E: An Embodied Multimodal Language Model.
- Symbol tuning (the "overwrite the 256 rarest tokens" trick): Wei et al., Symbol Tuning Improves In-Context Learning in Language Models.
- Chain-of-thought prompting: Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, which RT-2 adapts by inserting a "Plan: ..." sentence before the action tokens.
- Failure cases the paper flags (pushing dynamics, grasping a handle, dexterous folding) come from Appendix G of the paper itself. A useful sober counterweight to the emergent-capabilities highlight reel.