RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Encode a robot action as a short string of integers, and a vision-language model can output it.

RT-2 turns each robot action into eight integer bin ids, joins them with spaces, and trains a vision-language model to emit that string. The web knowledge the model already had comes along: it can place an apple on the number 3, move a can to a logo, or pick up a rock to hammer with.

Explaining the paper: RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. Brohan et al. · Google DeepMind · 2023 · arXiv:2307.15818

A 55B-parameter vision-language model, trained on web images and text, learns to output robot end-effector deltas (small changes to the position and orientation of the arm's gripper) as text tokens. It still answers visual questions. It also drives an arm.

For a while, vision-language models have been able to look at a picture of a dog and tell you it is a dog. Robots have been able to pick up a block when you ask them to pick up a block. RT-2 addresses the gap between those two abilities. A pickup-the-block policy does not know that a strawberry belongs in the fruit bowl, because nobody put strawberries in its training set. A vision-language model knows it. The question the paper takes on is how to let the policy borrow that knowledge without retraining it from scratch.

The earlier answer was a pipeline: the language model planned in words, then handed off to a separate low-level controller that turned each step into motion. That works, except the controller never saw the language model's knowledge during its own training. So everything subtle, the strawberry-belongs-with-fruit kind of reasoning, was lost at the handoff.

RT-2 collapses the handoff. It takes a large pretrained vision-language model (PaLI-X or PaLM-E, both from Google) and trains the same network to also emit robot actions, encoded as text. The action becomes one short response in the model's normal output stream, alongside captions and answers to visual questions. The web knowledge and the motor commands now share parameters, so semantic concepts the model learned from billions of images and sentences can flow into how it moves the arm.

The method depends on five steps applied in order: discretize the eight action values into bin ids, map those ids onto existing language tokens, constrain decoding to the action vocabulary at output, co-fine-tune with the original web data so nothing is forgotten, and serve the 55B model over a network from a TPU pod (Google's ML-accelerator hardware) so the robot can run it.

Actions are eight words

Start with the action. The policy controls a 7-DoF (seven degrees of freedom, seven independently controllable axes) mobile manipulator and outputs, at each step, an 8-tuple: a 6-DoF end-effector displacement (three position deltas $\Delta\text{pos}_{x,y,z}$, three rotation deltas $\Delta\text{rot}_{x,y,z}$), a continuous gripper extension level, and a discrete terminate command that signals successful completion. Eight numbers per timestep, seven continuous and one discrete.

A language model emits tokens, not floats. RT-2 borrows the method from RT-1: take every continuous dimension and chop its range into 256 uniform bins. A bin index is an integer between 0 and 255. Replace each float by its bin id and the action becomes a sequence of eight integers, which can be written as a short space-separated string. A possible target is 1 128 91 241 5 101 127 217.
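The binning step can be sketched in a few lines. This is a minimal illustration, not the paper's code: the per-dimension ranges `LOW` and `HIGH` are hypothetical placeholders (the real limits depend on the robot), and the helper names `to_bins` and `action_to_string` are made up for this example.

```python
import numpy as np

NUM_BINS = 256

# Hypothetical ranges for the seven continuous values
# (3 position deltas, 3 rotation deltas, gripper). Real limits depend on the robot.
LOW = np.array([-1.0] * 7)
HIGH = np.array([1.0] * 7)


def to_bins(values, low=LOW, high=HIGH, num_bins=NUM_BINS):
    """Map continuous values to integer bin ids in [0, num_bins - 1]."""
    values = np.clip(np.asarray(values, dtype=float), low, high)
    scaled = (values - low) / (high - low)          # now in [0, 1]
    bins = np.floor(scaled * num_bins).astype(int)  # now in [0, num_bins]
    return np.minimum(bins, num_bins - 1)           # the top edge falls in the last bin


def action_to_string(terminate, continuous):
    """Join the terminate flag and seven bin ids into a space-separated string."""
    ids = [int(terminate)] + to_bins(continuous).tolist()
    return " ".join(str(i) for i in ids)


example = action_to_string(0, [0.0, -1.0, 1.0, 0.5, -0.5, 0.25, 0.0])
print(example)  # 0 128 0 255 192 64 160 128
```

With a symmetric range, a zero delta lands in the middle bin (128), which is why mid-range ids show up so often in example action strings.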

The action representation is eight integer tokens, formatted as a VQA-style (visual question answering) response. The prompt is also VQA-shaped: an image of the scene plus a question of the form Q: what action ... ? A:, and the model answers with the eight integers. Training reduces to standard next-token cross-entropy.

$$\text{action} = (\text{term},\ \Delta\text{pos}_x,\ \Delta\text{pos}_y,\ \Delta\text{pos}_z,\ \Delta\text{rot}_x,\ \Delta\text{rot}_y,\ \Delta\text{rot}_z,\ \text{grip}) \;\rightarrow\; \text{8 token ids} \tag{1}$$

Two questions sit on the binning. First: does 256 bins give enough resolution? For a manipulation task with workspace on the order of a meter, a Δpos bin is a few millimeters wide, which is the same scale as the robot's own positional repeatability, so further resolution is wasted. Second: does discretization cost accuracy? Some, but the policy runs in closed loop at roughly 1 to 5 Hz, so each individual action is a small nudge that the next observation can correct. A policy that re-plans every 200 ms to 1 s is more forgiving than a single one-shot continuous prediction would be.
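As a rough check on the resolution claim, assume (purely for illustration) that one position dimension spans a full meter and that a bin id is decoded back to the center of its bin:

```latex
\text{bin width} = \frac{1000\ \text{mm}}{256} \approx 3.9\ \text{mm},
\qquad
\text{worst-case rounding error} = \frac{\text{bin width}}{2} \approx 2.0\ \text{mm}
```

A narrower per-step range would make each bin proportionally finer; the 1 m span is an assumption taken from the workspace scale mentioned above, not a figure from the paper.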

Figure 1 · the action becomes a string (one track per dimension: terminate, Δpos x, Δpos y, Δpos z, Δrot x, Δrot y, Δrot z, gripper)

Each continuous dimension is sliced into 256 uniform bins. Concatenate the eight bin indices with spaces and the policy's output is one short text response, the same shape the model would use to answer any other VQA prompt.

The training pair is now plain text in, plain text out. One image, one instruction, one eight-token answer. The same cross-entropy loss the model already uses for "name three things in this picture" trains it to output 1 128 91 241 5 101 127 217:

# input: image + a VQA-format instruction
prompt = (
    "Q: what action should the robot take to "
    "[task instruction]? A:"
)

# target: an 8-token string the model learns to emit
#   bin1 bin2 bin3 bin4 bin5 bin6 bin7 bin8
#   terminate Δpos_x Δpos_y Δpos_z Δrot_x Δrot_y Δrot_z grip
target = "1 128 91 241 5 101 127 217"

# training step: vanilla next-token cross-entropy
loss = next_token_ce(VLM(image, prompt), target)

Where the action ids live in the vocabulary

Calling the action a string is not enough by itself: there has to be a token id for every bin. RT-2 uses two different VLM (vision-language model) backbones, which tokenize text differently, so it needs two answers.

PaLI-X is the simpler of the two. Its tokenizer already gives every integer from 0 to 999 its own dedicated single token. So the bin ids 0 through 255 are already in the vocabulary as integer tokens. No new entries and no overwrites: point the training target at the existing integer tokens and the model can emit them.

PaLM-E does not have that property. Numbers like 128 get broken across several subword pieces, which would mean an action token costs the model multiple decode steps and makes constrained decoding awkward. RT-2 handles this by picking the 256 least-frequent existing tokens in the vocabulary and overwriting their embeddings to mean the new action ids, a move the paper describes as a form of symbol tuning (Wei et al., 2023). Those tokens already carry almost no statistical weight (they sit at the bottom of the original tokenizer's frequency distribution), so reusing their slots costs little, and the co-fine-tuning teaches the model what their new meaning is.
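The two mappings can be sketched side by side. Everything here is a toy: the vocabulary, the token counts, and the function names `integer_token_map` and `rarest_token_map` are invented for illustration, and the order in which bins are assigned to the rare tokens is an arbitrary assumption.

```python
import numpy as np

NUM_ACTION_IDS = 256


def integer_token_map(token_to_id):
    """PaLI-X-style case: each integer string already has a dedicated token."""
    return [token_to_id[str(b)] for b in range(NUM_ACTION_IDS)]


def rarest_token_map(token_counts):
    """PaLM-E-style case: reuse the least-frequent token ids as action ids."""
    order = np.argsort(token_counts, kind="stable")  # ascending frequency
    return order[:NUM_ACTION_IDS].tolist()           # bin b -> order[b] (arbitrary choice)


# Toy vocabulary in which the strings "0".."999" map to ids 0..999.
toy_vocab = {str(i): i for i in range(1000)}
pali_style = integer_token_map(toy_vocab)

# Toy corpus counts for a 32,000-token vocabulary.
rng = np.random.default_rng(0)
toy_counts = rng.integers(1, 1_000_000, size=32_000)
palm_style = rarest_token_map(toy_counts)
```

In both cases the result is a list of 256 token ids, so the rest of the pipeline (training targets, decoding mask) does not need to know which backbone produced it.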

Which tokens become the action vocabulary differs between the two backbones. PaLI-X reuses a contiguous block of integer tokens near the head of the vocabulary; PaLM-E scatters its 256 reused slots across the tail, wherever the rarest tokens happened to sit:

Figure 2 · the action vocabulary

Teal cells are the 256 action ids. PaLI-X reuses a contiguous block (the integer tokens already in the vocab); PaLM-E overwrites the 256 rarest existing tokens, scattered across the tail. Either way the model emits and reads action ids the same way it handles language.

Symbol tuning works because vocabulary tokens are not equally important. The bottom 256 in a SentencePiece tokenizer (the subword vocabulary PaLM-E uses) account for a vanishing fraction of training tokens, so overwriting them barely perturbs the language distribution. The same reasoning underwrites symbol tuning as an in-context-learning method: forcing the model to associate arbitrary labels with arbitrary semantic categories works precisely when the labels are unloaded slots the model can rewrite without losing useful structure elsewhere.

Only legal tokens at output

Another component governs the output vocabulary. The same RT-2 model still answers visual questions in natural language. If you ask it "what is in the bowl" you want it to say "an apple," not 128 91. And when you ask for an action, you want exactly eight integers, no stray articles.

RT-2 enforces that by masking the output vocabulary at decode time. On a robot-action prompt the sampler is restricted to the 256 action ids only; every other token id has its logit driven to negative infinity, so its softmax probability is zero. The model is still free to rank the 256 action ids against each other (this ranking determines the action), but cannot accidentally emit "the." On a vision-language prompt the mask is lifted and the full vocabulary is in play again.
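A minimal sketch of that mask, using NumPy on a toy logit vector. The vocabulary size and the choice of ids 0 to 255 as the action ids are assumptions for the example; `masked_softmax` is a hypothetical helper, not an API from the paper or any library.

```python
import numpy as np


def masked_softmax(logits, allowed_ids):
    """Softmax over `logits` with every id outside `allowed_ids` set to -inf."""
    masked = np.full_like(logits, -np.inf, dtype=float)
    masked[allowed_ids] = logits[allowed_ids]
    masked -= masked[allowed_ids].max()   # subtract the max for numerical stability
    exp = np.exp(masked)                  # exp(-inf) is exactly 0.0
    return exp / exp.sum()


rng = np.random.default_rng(0)
vocab_size = 1_000                        # toy vocabulary size
logits = rng.normal(size=vocab_size)
action_ids = np.arange(256)               # toy assumption: ids 0..255 are the action ids

probs = masked_softmax(logits, action_ids)
next_token = int(np.argmax(probs))        # greedy pick, always inside the action band
```

The mask removes tokens but does not change the order of the ones that remain: the highest-logit action id before masking is still the highest-probability token after it.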

Figure 3 shows the redistribution. Without the mask, common word tokens dominate the softmax because the model is reading something that looks like the beginning of a sentence; with the mask, all that probability mass concentrates onto the action ids, and their relative ranking, which determines the action, becomes the decision:

Figure 3 · constrained decoding

Top-12 candidate next tokens for a robot-action prompt. With the action mask on, only the 256 action ids carry non-zero probability; the word tokens are zeroed and the action ids rebalance to total 1.0. The model's ranking inside the action band drives the arm.

Two consequences follow from this. The mask gives a hard guarantee, not a soft preference, so RT-2 cannot emit a syntactically invalid action even when the model is confused. And the ranking inside the band determines the action: a 0.3 vs 0.2 split between bin 128 and 129 says the model has a fine-grained motor preference, even when both numbers look small in absolute terms.

Co-fine-tune to keep the web

The naive next step would be to fine-tune the VLM on the robot dataset and call it done. The paper says that loses most of the value. Fine-tuning on robot data alone teaches the model to emit good action strings, but it forgets the visual concepts that made the VLM interesting in the first place. The strawberry-belongs-with-fruit reasoning starts to fade, and so does generalization to objects the robot has never seen.

Co-fine-tuning fixes that by mixing the original web data back in. Each training batch holds both robot trajectories and standard vision-language examples (VQA, captioning, the same mixture the underlying PaLI-X or PaLM-E was trained on). The mixing ratio is tilted toward the robot data so a single fine-tuning run can shift the model into a useful action regime: about 50% robot for PaLI-X, about 66% for PaLM-E. The original objective and the new one share weights and a loss surface, so the model holds onto its visual concepts while learning to act.
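One way to picture the mixing is a sampler that fills a fixed share of each batch from the robot data. This is a sketch under assumptions: the paper's data pipeline is not described at this level here, a fixed per-batch count is only one way to realize a mixing ratio, and `mixed_batch` and the toy datasets are hypothetical.

```python
import random


def mixed_batch(robot_examples, web_examples, batch_size, robot_fraction, rng):
    """Draw one batch with roughly `robot_fraction` of its examples from robot data."""
    n_robot = round(batch_size * robot_fraction)
    batch = rng.choices(robot_examples, k=n_robot)
    batch += rng.choices(web_examples, k=batch_size - n_robot)
    rng.shuffle(batch)                    # interleave the two sources
    return batch


rng = random.Random(0)
robot = [("robot", i) for i in range(100)]   # stand-ins for robot trajectories
web = [("web", i) for i in range(100)]       # stand-ins for VQA / captioning examples

batch = mixed_batch(robot, web, batch_size=64, robot_fraction=0.5, rng=rng)
n_robot_in_batch = sum(1 for source, _ in batch if source == "robot")
```

Because both kinds of example are plain image-plus-text pairs with a text target, the same loss applies to every element of the batch regardless of its source.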

Table 5 isolates each of the three ingredients. Train a 5B PaLI-X from scratch on robot data only and it collapses to 9% on the unseen-task average. Fine-tune the pretrained 5B and the same evaluation jumps to 42%. Co-fine-tune and it edges up to 44%. Move to the 55B model and the gap widens: 52% with fine-tuning, 63% with co-fine-tuning. The paper skips the 55B from-scratch run because the 5B from-scratch number already settled that question.

Figure 4 · size and training ablation

The unseen-task average rises with both size (5B to 55B) and training regime (from-scratch, then fine-tune, then co-fine-tune). Co-fine-tuning beats plain fine-tuning at both scales; the 55B with co-fine-tuning at 63% is the full RT-2-PaLI-X result.

The numbers decompose into three contributions. Pretraining gives the model its concepts, fine-tuning attaches actions to those concepts, and co-fine-tuning preserves the concepts while the attachment is learned. Each layer contributes to the final number, which is why removing any of them visibly hurts.

Running a 55B model on a robot

The 55B PaLI-X cannot fit on a robot. That class of model needs a multi-TPU pod to run, so a closed-loop policy that calls it once per step has to call it across the network. RT-2 hosts the model on a Google TPU cloud service and the robot sends each image to it. With that arrangement the 55B variant runs at 1 to 3 Hz of control, and the smaller 5B variant at about 5 Hz. The smallest, the 3B PaLI used for the Language-Table benchmark (a separate tabletop pushing evaluation), also runs at 5 Hz.

What 1-3 Hz means in practice. A pick-and-place on a 7-DoF arm typically takes ten to thirty seconds of motion, so the policy has somewhere between ten and a hundred decisions over an episode. That is enough for reactive correction on slow manipulation, and not enough for anything contact-rich or fast (the failure cases in Appendix G are exactly those: pushing a banana whose center of mass the model misjudges, or grabbing a handle that needs precise alignment). The paper flags inference cost as the next bottleneck for that reason.
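The "ten to a hundred decisions" range follows directly from the episode length and control rate quoted above. A short sketch of that arithmetic, with a hypothetical helper name:

```python
def decisions_per_episode(duration_s, control_hz):
    """Number of policy calls in an episode of `duration_s` seconds at `control_hz`."""
    return round(duration_s * control_hz)


fewest = decisions_per_episode(duration_s=10, control_hz=1)  # short episode, slowest rate
most = decisions_per_episode(duration_s=30, control_hz=3)    # long episode, fastest rate
print(fewest, most)  # 10 90
```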

# closed-loop control over a network call to a TPU pod
while episode_running():
    image = camera.read()
    prompt = render(prompt_template, instruction)

    # forced-choice decoding over the 256 action ids only
    ids = VLM.generate(image, prompt, vocab_mask=ACTION_IDS, n=8)

    a = [bin_to_value(d, ids[d]) for d in range(8)]
    if a[0] == TERMINATE: break
    robot.step(a)                       # at 1-3 Hz for 55B, 5 Hz for 5B

The model that runs the arm is the same checkpoint that answers a VQA query about the image; the robot is one client of a service that other robots, and other applications, can share. The control rate is determined by network round-trip plus 55B decode time, not by anything the robot itself can speed up.
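The loop above calls a `bin_to_value` helper without defining it. A possible version is sketched below, together with a parser for the eight-integer string. The ranges are hypothetical, decoding to the bin center is an assumption, and the signature differs slightly from the loop's: here `dim` indexes only the seven continuous dimensions, and the terminate id is returned as-is without interpreting it.

```python
NUM_BINS = 256

# Hypothetical ranges for the seven continuous dimensions; real limits depend on the robot.
LOW = [-1.0] * 7
HIGH = [1.0] * 7


def bin_to_value(dim, bin_id):
    """Return the center of bin `bin_id` for continuous dimension `dim` (0..6)."""
    width = (HIGH[dim] - LOW[dim]) / NUM_BINS
    return LOW[dim] + (bin_id + 0.5) * width


def parse_action(text):
    """Split an eight-integer action string into (terminate_id, seven floats)."""
    ids = [int(tok) for tok in text.split()]
    if len(ids) != 8 or not all(0 <= i < NUM_BINS for i in ids):
        raise ValueError(f"expected 8 bin ids in [0, 255], got {text!r}")
    terminate_id = ids[0]
    continuous = [bin_to_value(d, b) for d, b in enumerate(ids[1:])]
    return terminate_id, continuous


terminate_id, values = parse_action("1 128 91 241 5 101 127 217")
```

Validating the string before acting on it is cheap insurance even with constrained decoding in place, since it also catches truncated or corrupted responses from the network call.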

New objects, backgrounds, places

With those pieces in place the first thing to measure is plain generalization: how well does the policy work on objects, backgrounds, and environments the robot training set never showed it? The paper evaluates on three axes, each split into easy and hard cases, against four baselines (R3M, VC-1, RT-1, and MOO, an object-conditioned RT-1 variant). The results land in Table 3.

On seen tasks every method is roughly competitive: RT-1 hits 92%, RT-2 hits 91 to 93%. The methods separate on the unseen columns. RT-2 averages 62% across them, against RT-1's 32% and the visual-representation baselines (R3M, VC-1) at 10 to 12%. The headline number, "2x over RT-1 and 6x over the others," is the unseen average comparison.
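The headline multipliers are rounded ratios of the unseen averages quoted above:

```latex
\frac{62}{32} \approx 1.9 \quad (\text{RT-2 vs. RT-1}),
\qquad
\frac{62}{12} \approx 5.2
\quad\text{and}\quad
\frac{62}{10} = 6.2 \quad (\text{RT-2 vs. the visual-representation baselines})
```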

The gap is widest on unseen objects hard (toys the robot never saw) and unseen environments (a visually distinct office desk, not the kitchen the dataset was collected in). There the VLM-pretrained models pull furthest ahead, because those are the cases where general world knowledge from the web has the most to contribute:

Figure 5 · generalization

Success rate on seven evaluation columns from Table 3. RT-2 (teal) holds at or above RT-1 on seen tasks, and pulls clearly above every baseline on the unseen splits, with the biggest gap on environments and hard objects.

Two notes on reading these. PaLI-X-55B and PaLM-E-12B average to the same 62% but split their wins: the PaLM-E variant is stronger on harder generalization (hard objects, hard backgrounds), while the PaLI-X variant is stronger on the easier splits. The paper attributes that to PaLM-E's pre-training mix, which leans more on broad knowledge tasks, against PaLI-X's heavier visual focus. And the baselines that use a pretrained visual backbone but no language pretraining (R3M, VC-1) score in single digits on unseen environments, which is the cleanest evidence that visual representations alone do not produce the generalization.

Symbols, reasoning, faces

Past generalization, the paper's further claim is that some abilities of the underlying VLM transfer into the policy even though they were never in the robot dataset. The paper bundles these into three category averages from Table 4: symbol understanding (move the apple onto the number 3, push the can onto the heart), reasoning (math, logos, nutrition, multilingual instructions), and person recognition (move the can to the person with glasses, or to Taylor Swift).

None of these tasks were ever in the robot training data. The robot data is the RT-1 mixture: pick, knock, place upright, move near, open/close drawer, place into receptacle, take from receptacle. There is no robot demonstration of "the apple to a number," let alone "the can to Taylor Swift." Whatever knowledge the policy uses to succeed on these has to be coming from the web pretraining, not from the robot data, and the shared weights are what route it into the action output.

RT-2-PaLI-X averages 60% across these categories; RT-1 averages 17%. That is the 3x multiplier the paper cites. Figure 6 breaks the averages down by category:

Figure 6 · emergent capabilities

Emergent-evaluation category averages from Table 4. RT-2 (teal) is 3 to 5 times the RT-1 score on every category except math reasoning, where the smaller PaLM-E variant edges out PaLI-X because its pretraining mixture leans more on math.

The word "emergent" carries baggage. The paper uses it in the transfer sense: capabilities that emerge in the policy by way of pretraining, not capabilities that would otherwise require new gradient steps. It does not mean the model suddenly acquires new motor skills. The picking-up is still the same picking-up RT-1 learned; RT-2 adds the ability to point that picking-up at a strawberry it never saw, in a kitchen it never saw, because someone asked it to.

Chain-of-thought, for a robot

The last variation in the paper plugs in chain-of-thought prompting. In its language-only form, the method inserts intermediate reasoning text between the question and the answer, so the model can think on the page. RT-2 adapts that to the robot context by inserting a "Plan: ..." sentence between the instruction and the action.

Augmenting the data is straightforward: take an existing training trajectory and rewrite its target as a natural-language plan describing what the robot is about to do, followed by the eight action tokens. A few hundred gradient steps on this format is enough to flip the model into the new behavior. At inference, the same model emits the plan first, then conditions on its own plan to emit the action.
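A sketch of what such an augmented target could look like. The exact template is an assumption: only the "Plan: ..." prefix comes from the description above, while the "Action:" label, the punctuation, and the helper name `cot_target` are chosen here for illustration.

```python
def cot_target(plan, action_ids):
    """Build a plan-then-action training target string."""
    if len(action_ids) != 8:
        raise ValueError("expected eight action ids")
    action = " ".join(str(i) for i in action_ids)
    return f"Plan: {plan}. Action: {action}"


target = cot_target("pick rxbar chocolate", [1, 128, 91, 241, 5, 101, 127, 217])
print(target)  # Plan: pick rxbar chocolate. Action: 1 128 91 241 5 101 127 217
```

Since the plan is free text and the action is restricted to bin ids, a decoder using this format would need to switch the vocabulary mask on only once the plan has been emitted.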

Take an example from the paper. With CoT off, the prompt "I am hungry" goes straight to action tokens, which is a leap of inference the model has to make end-to-end. With CoT on, the model first writes "Plan: pick rxbar chocolate." and the action tokens follow from that intermediate fact, which is the same kind of multi-step inference the underlying VLM is good at on text alone:

Figure 7 · chain-of-thought rollout

Verbatim examples from the paper's Figure 7. With chain-of-thought on, the model emits a short natural-language plan before the eight action tokens. The same checkpoint produces both. The plan acts as a bridge between visual reasoning and motor commands.

This result is narrow in scope. The paper calls it "initial evidence," not a measured improvement on the main benchmarks. The CoT model is only evaluated qualitatively, on a handful of rollouts, and the paper does not claim a quantitative win over the non-CoT variant. What it does show is that the same vision-language-action (VLA) model can do both, with roughly a few hundred fine-tuning steps in between, which suggests that the planner-and-controller split used by a separate line of work can be folded into one network instead.

What it does not do

The paper's own limits section is short. RT-2 leaves three things unsolved.

One, no new motor skills. The model's physical repertoire is the RT-1 skill set, and no more. If you want it to wipe with a towel, fold laundry, or insert a peg, you need new robot data.

Two, inference cost. The 55B model runs at 1-3 Hz over a cloud TPU pod. Anything that needs faster control (assembly, human handoff, fast contact) is out of reach until someone quantizes or distills the model down.

Three, the model pool is small. RT-2 needs a pretrained VLM whose weights or fine-tuning API is accessible. In 2023 that meant PaLI-X and PaLM-E, neither of which is open. The recipe ports cleanly to any open VLM in principle, but the published numbers are tied to closed checkpoints.

None of the ingredients was invented in this paper. RT-1 already discretized eight continuous action values into integer bin ids. PaLI-X and PaLM-E already gave a vision-language model that could read an image and emit text. Symbol tuning already showed that a model can be fine-tuned to attach new meanings to arbitrary tokens. Co-fine-tuning on a mixture is standard practice. Constrained decoding masks have been used in structured generation for years. What RT-2 contributes is the wiring: write the action as text, share one set of weights with the web data, and let the same network that can describe a strawberry also pick one up.

Provenance: verified against primary literature

RT-1 (2022): The 256-bin uniform discretization of 6-DoF end-effector deltas, gripper, and terminate that RT-2 reuses verbatim.

PaLI-X (2023): The 22B-ViT plus 32B-encoder-decoder VLM backbone (≈55B total), which RT-2-PaLI-X co-fine-tunes.

PaLM-E (2023): The 12B decoder-only VLM with a ViT-4B visual encoder, which RT-2-PaLM-E co-fine-tunes.

Symbol tuning (Wei 2023): Overwriting the embeddings of the 256 least-frequent vocabulary tokens to mean "action bin n" for PaLM-E.

CoT (Wei 2022): Chain-of-thought prompting that RT-2 adapts by inserting a "Plan: ..." sentence before the action tokens.

Correction: The paper writes "vision-language-action" as if RT-2 invented integrating actions with VLMs end-to-end, but it explicitly builds on RT-1's discretization (and on PaLM-E's multimodal tokens). The assembly is new (text-tokenize the action, co-fine-tune with web data); the architecture and the action math are not.

Questions you might still have

If the action is text, why does the policy not drift out of action space?
At decoding time the sampler is masked to the 256 action ids only when the prompt asks for an action. The model still ranks action tokens against each other; the constraint zeroes the rest.

Why does co-fine-tuning beat plain fine-tuning?
Fine-tuning only on robot data lets the model forget what its web pretraining taught it. Co-fine-tuning keeps the VQA, captioning, and language batches mixed in (50% robot for PaLI-X, 66% for PaLM-E), so the web concepts stay sharp and transfer to the action output.

Does the model learn new motions, or just deploy known ones in new contexts?
Only the latter. The paper is explicit: emergent capabilities are semantic (recognizing a strawberry, picking the smallest object). Physical skills are still bounded by the RT-1 dataset of seven manipulation skills. New motions need new robot data.

Why is symbol tuning safe for PaLM-E?
The 256 least-frequent tokens carry almost no signal in the original distribution, so overwriting their embeddings barely hurts language performance. The model is then forced to learn what each repurposed token means from the co-fine-tuning data.

Footnotes & further reading

The paper: Brohan et al., RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (Google DeepMind, 2023). Project page with rollout videos: robotics-transformer2.github.io.
Where the 256-bin discretization comes from: Brohan et al., RT-1: Robotics Transformer for Real-World Control at Scale. The same action format, mobile manipulator, and seven-skill dataset RT-2 reuses.
The PaLI-X backbone (22B ViT + 32B encoder-decoder UL2, ≈55B): Chen et al., PaLI-X: On Scaling Up a Multilingual Vision and Language Model.
The PaLM-E backbone (decoder-only LLM with a ViT-4B visual projector): Driess et al., PaLM-E: An Embodied Multimodal Language Model.
Symbol tuning (the idea behind overwriting the 256 rarest tokens): Wei et al., Symbol Tuning Improves In-Context Learning in Language Models.
Chain-of-thought prompting: Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, which RT-2 adapts by inserting a "Plan: ..." sentence before the action tokens.
Failure cases the paper flags (pushing dynamics, grasping a handle, dexterous folding) come from Appendix G of the paper itself. A useful sober counterweight to the emergent-capabilities results.
