
OpenVLA: An Open-Source Vision-Language-Action Model

A language model that writes the robot's next move.

Tokenize the seven numbers that control a robot arm, hand them to a 7B model that already speaks images and English, and the same machinery that predicts the next word predicts the next action. The open weights and the recipe matter more than the headline gap.

Explaining the paper: OpenVLA: An Open-Source Vision-Language-Action Model · Kim, Pertsch, Karamcheti, Xiao, Balakrishna, Nair, Rafailov, Foster, Lam, Sanketi, Vuong, Kollar, Burchfiel, Tedrake, Sadigh, Levine, Liang, Finn · Stanford / UC Berkeley / TRI / Google DeepMind / Physical Intelligence / MIT · CoRL 2024 · arXiv:2406.09246

Robotics has lacked the open, fine-tunable backbone that language modeling got from Llama. This paper builds it.

For a year before OpenVLA, the best generalist robot policy was RT-2: a vision-language model fine-tuned on robot demonstrations, with the continuous control signal turned into discrete tokens so the language model could write actions the same way it writes words. RT-2 worked. It generalized to unseen objects. It could be given a new instruction and follow it. And it was closed. The weights were never released, the recipe was sketched but not reproducible end-to-end, and adapting it to a new robot meant going through an API that nobody outside Google could call. Robotics, which is bottlenecked on the cost of new data, was looking at its best model from the outside.

OpenVLA is the open version of that idea, with a few choices that change the numbers. The recipe it inherits from RT-2 is the same in spirit: take a 7B vision-language model, train it on robot trajectories, treat each action as a short string of tokens. What OpenVLA adds is the open Llama 2 backbone, a curated 970,000-trajectory slice of the Open X-Embodiment dataset (versus the 350k subset RT-2-X used), and a visual encoder that fuses two pretrained backbones instead of one. The headline result is that with one-seventh the parameters of RT-2-X (7B versus 55B), it wins by 16.5 percentage points in absolute success rate, averaged across 29 manipulation tasks. The contribution that probably matters more in the long run is the fine-tuning story: with LoRA at rank 32, a single A100 fine-tunes OpenVLA on a new robot setup in 10 to 15 hours while training only 1.4% of the parameters, and the resulting success rate lands within the error bar of full fine-tuning's.

To see how this works, a few pieces carry the story: what a vision-language-action model is and how the three parts of OpenVLA fit together, how continuous robot actions become discrete tokens a language model can predict, what the training loop actually does, the gap to RT-2-X and where it comes from, and the two efficiency results, LoRA and 4-bit quantization, that put a 7B robot policy on a single GPU.

A language model that writes actions

A robot controller, at the lowest useful level, is a function that takes one camera image and one instruction string and returns seven numbers. The seven numbers are the next move, expressed as small changes the end-effector should apply: three for translation along x, y, z, three for rotation in roll, pitch, yaw, and one for the gripper (open versus closed). Call that 7-vector $\mathbf{a}_t \in \mathbb{R}^7$ . A policy $\pi(\mathbf{a}_t \mid \mathbf{o}_t, \ell)$ maps the current observation $\mathbf{o}_t$ (the image) and the instruction $\ell$ (English text) to a distribution over those seven numbers. Run the policy in a loop, integrate the small moves, and the arm goes somewhere.
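The loop described above can be sketched in a few lines. This is a toy illustration, not OpenVLA's controller: `run_policy`, `dummy_policy`, the start pose, and the gripper convention are all hypothetical names and assumptions introduced here.

```python
import numpy as np

def run_policy(policy, get_image, instruction, steps):
    """Closed-loop rollout: query the policy, apply the small move, repeat."""
    pose = np.zeros(6)   # x, y, z, roll, pitch, yaw (hypothetical start pose)
    gripper = 1.0        # assumed convention: 1.0 open, 0.0 closed
    for _ in range(steps):
        action = np.asarray(policy(get_image(), instruction), dtype=float)
        assert action.shape == (7,)
        # Adding Euler-angle deltas is a simplification that is only
        # reasonable for small rotations.
        pose = pose + action[:6]
        gripper = float(action[6])
    return pose, gripper

def dummy_policy(image, instruction):
    # Stand-in for the model: always move 1 cm along x and close the gripper.
    return [0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

final_pose, final_gripper = run_policy(
    dummy_policy, lambda: None, "pick up the cup", steps=10
)
```

Ten steps of a 1 cm move accumulate to roughly 10 cm along x, which is the sense in which the policy only ever has to output small moves.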

Classical imitation learning is bottlenecked by its supervision signal. You collect demonstrations: a person teleoperates the arm through a task and you record the camera and the actions, hundreds of times. Then you train a network to map the camera to the actions. The networks were small, the data was small, and the resulting policies fell apart the moment the scene changed: a new background, a new object, a new way of asking. They had never seen the Internet.

A vision-language model has. CLIP (the original image-text contrastive encoder), SigLIP, and DINOv2 learn to read images from web-scale data: image-text pairs for CLIP and SigLIP, unlabeled images for DINOv2. Llama 2 learns English the same way from text. Glue them together and you have a model that already knows what a coffee cup is, what "to the left of the bowl" means, and that a green apple is different from a red one. The vision-language-action recipe, due to RT-2, asks: what if the policy were a vision-language model fine-tuned to emit actions? The Internet-scale knowledge is already in the weights, and the fine-tuning only has to teach the model what an action is.

For that to work, the actions have to be something the language model can already write. A language model writes tokens, integers picked from a fixed vocabulary, one at a time. So OpenVLA turns each of the seven action numbers into one such integer, discretizing each dimension into 256 bins. Then the action becomes a 7-token string the LM can autoregress alongside the image and the instruction.

The three parts

OpenVLA is three modules stacked left to right (the paper's Figure 1, redrawn below). On the left, a visual encoder turns an image into a sequence of patch features. In the middle, a small projector maps those features into the language model's embedding space so the LM can read them as if they were text tokens. On the right, the language model takes the patches, the instruction, and the partial action sequence, and decodes the next action token.

Figure 1 · OpenVLA pipeline

A 224×224 image is patchified and run through both SigLIP (semantic features) and DINOv2 (spatial features) in parallel. Their per-patch outputs are concatenated channel-wise, projected by a 2-layer MLP into Llama 2 7B's embedding space, then mixed with the tokenized instruction. The model decodes 7 action tokens: Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper.

The visual encoder design materially changes the headline gap to RT-2-X. OpenVLA does not use a CLIP or a SigLIP backbone in isolation. It uses both SigLIP and DINOv2, in parallel, on the same image. Each model produces a per-patch feature vector; the two vectors are concatenated along the channel dimension (so each spatial location gets one wider feature). SigLIP is contrastively trained on image-text pairs and is good at "what is this"; DINOv2 is self-supervised, meaning it learns from images alone with no labels, and is good at "where is what." Image-text contrastive training teaches the encoder to compress out spatial detail that does not change the caption (a cup is a cup whether it is two centimeters left or right), which is exactly the detail a controller needs. Stapling DINOv2 alongside puts that detail back. The recipe is not OpenVLA's own; the fused encoder, the 2-layer MLP projector, and the channel-wise concat are the Prismatic VLM's design (Karamcheti et al., 2024), and OpenVLA chose Prismatic over two other open VLMs, IDEFICS-1 and LLaVA, in early experiments. Prismatic beat LLaVA by about 10 percentage points in absolute success rate on Bridge tasks. That is one factor behind OpenVLA's lead over RT-2-X, which the paper credits to a combination of a larger training set, more careful data cleaning, and the fused encoder.

The projector is a 2-layer MLP that maps the fused per-patch feature into the same dimension Llama 2 uses for its token embeddings. Without it, the visual features live in their own space and the LM cannot read them. With it, the patches become "visual tokens" that the LM sees in exactly the same way it sees text tokens, and the same attention mixes both. This is the standard vision-language-model (VLM) construction LLaVA popularized.
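A shape-level sketch of the fuse-then-project step may help. It uses random arrays in place of real encoder outputs and toy dimensions; the GELU activation, the weight scales, and every name here (`fuse_and_project`, `w1`, `w2`, and so on) are assumptions for illustration, not Prismatic's actual code. It also assumes both encoders emit the same number of patches.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (assumed activation for this sketch)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def fuse_and_project(feat_siglip, feat_dino, w1, b1, w2, b2):
    """Concatenate per-patch features channel-wise, then apply a 2-layer MLP."""
    fused = np.concatenate([feat_siglip, feat_dino], axis=-1)  # [N, d_s + d_d]
    hidden = gelu(fused @ w1 + b1)                             # [N, d_llm]
    return hidden @ w2 + b2                                    # [N, d_llm]

rng = np.random.default_rng(0)
n_patches, d_siglip, d_dino, d_llm = 16, 48, 32, 64   # toy sizes, not the real ones

feat_s = rng.normal(size=(n_patches, d_siglip))   # stand-in for SigLIP output
feat_d = rng.normal(size=(n_patches, d_dino))     # stand-in for DINOv2 output
w1 = rng.normal(size=(d_siglip + d_dino, d_llm)) * 0.02
b1 = np.zeros(d_llm)
w2 = rng.normal(size=(d_llm, d_llm)) * 0.02
b2 = np.zeros(d_llm)

visual_tokens = fuse_and_project(feat_s, feat_d, w1, b1, w2, b2)
```

Each row of `visual_tokens` has the language model's embedding width, so it can sit in the input sequence next to embedded text tokens.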

The language model is Llama 2 7B. The choice is consequential mostly because it is open: the weights are public, the tokenizer is documented, the inference stack is the same one a thousand other projects use. The choice is not, on its own, what makes the policy good; the Bridge ablation showed that swapping in IDEFICS-1 or LLaVA gives a worse model on the same recipe. Llama 2 7B is the size where the pretrained world knowledge is wide enough to recognize Internet objects without bloating inference past the point where a single GPU can serve it at 6 Hz, six camera-to-action cycles per second, the rate a real-time controller runs the policy.

Continuous actions, 256 bins, one Llama token each

A language model writes integers from a fixed vocabulary. Robot actions are floats. To make the two compatible, OpenVLA discretizes each of the seven action dimensions independently into 256 uniform bins, then assigns each bin to a specific Llama token id. The recipe is:

For each dimension, compute the 1st and 99th percentile of that dimension's values across the entire training dataset. Call them $q_1$ and $q_{99}$ .
Divide the interval $[q_1, q_{99}]$ into 256 equal-width bins. Any value outside the interval saturates to the nearest end bin.
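Pick the 256 least-used token ids in the Llama tokenizer (the last 256 of its 32,000) and overwrite their embeddings so that the $k$-th of them means "bin $k$." The same 256 tokens serve all seven dimensions; which dimension a token refers to is given by its position in the 7-token action string. That way an action vector becomes 7 token ids the LM can produce by ordinary next-token prediction.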

Two details matter here. The 1st-to-99th-percentile clip, rather than the min-to-max range RT-1 used, calibrates the bin width to typical motion instead of to the worst data point; one operator who flailed the arm to the limits of the workspace can no longer stretch every bin into uselessness. The least-used-token overwrite exists because the Llama 2 SentencePiece tokenizer only leaves 100 free slots for tokens added during fine-tuning, and 256 will not fit in 100. The tokens it overwrites essentially never appear in Internet text; the pretraining knowledge lost by repurposing them is negligible.
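The three steps above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the released tokenizer: the function names are hypothetical, the demonstration data is synthetic, and the ascending bin-to-id layout (`VOCAB_SIZE - N_BINS + bin`) is one possible mapping chosen here for readability.

```python
import numpy as np

VOCAB_SIZE = 32_000   # Llama 2 vocabulary size, per the text
N_BINS = 256

def fit_bounds(actions):
    """Per-dimension 1st and 99th percentiles over an [N, 7] array."""
    return np.percentile(actions, 1, axis=0), np.percentile(actions, 99, axis=0)

def tokenize_action(action, q1, q99):
    """Map a 7-vector to 7 token ids among the last 256 ids of the vocabulary."""
    clipped = np.clip(action, q1, q99)               # outliers saturate
    frac = (clipped - q1) / (q99 - q1)               # position in [0, 1]
    bins = np.minimum((frac * N_BINS).astype(int), N_BINS - 1)
    return (VOCAB_SIZE - N_BINS) + bins              # assumed id layout

def detokenize_action(token_ids, q1, q99):
    """Invert the mapping by returning each bin's center."""
    bins = token_ids - (VOCAB_SIZE - N_BINS)
    return q1 + (bins + 0.5) / N_BINS * (q99 - q1)

rng = np.random.default_rng(0)
demo_actions = rng.normal(0.0, 0.02, size=(10_000, 7))   # synthetic demonstrations
q1, q99 = fit_bounds(demo_actions)

action = demo_actions[0]
ids = tokenize_action(action, q1, q99)
recovered = detokenize_action(ids, q1, q99)

# Two very different outliers land on the same token in every dimension.
outlier_a = tokenize_action(np.full(7, 10.0), q1, q99)
outlier_b = tokenize_action(np.full(7, 100.0), q1, q99)
```

The round trip loses at most half a bin width per dimension, which is the quantization cost the scheme accepts in exchange for staying inside next-token prediction.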

Figure 2 shows where a value lands once it is discretized. Past the 1st or 99th percentile the value saturates: the same token comes out no matter how much further it goes. Saturating outliers is the deliberate cost of the quantile-based scheme. The alternative, scaling the bins to the min/max, would let a single extreme sample stretch every bin.

Figure 2 · how a continuous action becomes a single token


Pick a control dimension and drag the value. The lit cell is its bin; the dot at the right end of the rail is the corresponding token id, drawn from the last 256 ids of the Llama 2 vocabulary (the rightmost teal sliver). Past the 99th-percentile dashed line the value saturates and feeds the same token regardless of magnitude. That is intentional: outliers should not be allowed to widen every bin.

With the encoding fixed, the rest of the architecture stays as a vision-language model. The image flows through the encoders and the projector and arrives at Llama as a sequence of "visual tokens," one per spatial patch. The instruction flows through the tokenizer and arrives as a few dozen text tokens. The previous action tokens, if any, arrive too. Llama predicts the next action token from all three. At training time, the loss is a vanilla per-token cross-entropy evaluated only on the seven action tokens; the model is not asked to reconstruct the image or the instruction, only to write the right action. In code:

# one training step (per-token cross-entropy on action tokens only)
img, instr, action = sample_batch()        # img 224x224, action in R^7

# fused visual encoder (Prismatic): both backbones, per-patch concat
feat_s = siglip(img)                       # [N_s, d_s]
feat_d = dinov2(img)                       # [N_d, d_d]
feat   = concat_channels(feat_s, feat_d)   # per spatial location
visemb = mlp_projector(feat)               # mapped into Llama embed space

# discretize each of the 7 action dims into one of 256 bins
clipped = clip_to_percentiles(action, p1, p99)
ids     = discretize(clipped, 256)         # 7 ints in [0, 255]
toks    = action_token_ids(ids)            # last-256 ids of Llama vocab

text_ids = tokenize(instr)
seq      = [visemb, embed(text_ids), embed(toks)]  # one long sequence

logits = llama2_7b(seq)
loss   = cross_entropy(logits[-8:-1], toks)  # logits[i] predicts token i+1, so shift by one; only the 7 action tokens
loss.backward(); opt.step()                # vision encoder is NOT frozen
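The alignment between logits and targets in an action-only loss is easy to get wrong, so here is a small NumPy sketch of it. It assumes the usual causal-LM convention that the logits at position t score the token at position t+1; `action_token_loss` is a hypothetical helper, and the inputs are toy arrays rather than real model outputs.

```python
import numpy as np

def action_token_loss(logits, token_ids, n_action=7):
    """Mean cross-entropy over the last n_action tokens of one sequence.

    logits:    [T, V], where logits[t] scores the token at position t + 1
    token_ids: [T], the full input sequence (prefix + action tokens)
    """
    pred = logits[-(n_action + 1):-1]            # positions that predict the actions
    target = token_ids[-n_action:]
    pred = pred - pred.max(axis=-1, keepdims=True)          # stable log-softmax
    log_probs = pred - np.log(np.exp(pred).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(n_action), target].mean()

T, V = 20, 50
rng = np.random.default_rng(0)
tokens = rng.integers(0, V, size=T)

# Uniform predictions give a loss of log(V).
uniform_loss = action_token_loss(np.zeros((T, V)), tokens)

# Confident, correct next-token predictions give a loss near zero.
sharp = np.zeros((T, V))
for t in range(T - 1):
    sharp[t, tokens[t + 1]] = 50.0
sharp_loss = action_token_loss(sharp, tokens)
```

Logits at prefix positions never enter the loss, which is what "evaluated only on the seven action tokens" means in practice.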

How it is trained

OpenVLA starts from a Prismatic-7B VLM that has already been jointly trained on the LLaVA 1.5 image-text mixture (a standard visual-instruction dataset that pairs images with question-answer text), so the visual encoder, projector, and Llama 2 backbone are already speaking the same language at initialization. From there, the full stack is fine-tuned on 970k robot demonstrations curated from Open X-Embodiment, with cross-entropy on the action tokens. Three choices in the recipe are not obvious until they are written down.

Should the vision encoder be frozen or fine-tuned along with the rest? The conventional answer for VLM training is to freeze it: the Prismatic paper, which gave OpenVLA its initialization, found freezing gave better question-answering numbers, because the pretrained encoder already captures what the text task cares about. The OpenVLA authors started there too, and it did not work: the frozen-vision ablation (Table 1) gets 47.0% rollout success versus 69.7% for full fine-tuning. The hypothesis is that image-text contrastive pretraining throws away exactly the spatial precision a controller needs, a few centimeters of offset between a gripper and a handle, and the only way to put it back is to let the encoder learn from the robot demonstrations. The Sandwich and full-fine-tune rows in Table 1 outperform Last-layer-only for the same reason: only the strategies that touch the encoder give it room to learn the controller's spatial geometry.

The model trains for 27 epochs through the 970k-trajectory mixture, not the one or two passes typical of LLM and VLM pretraining. That is a lot of repetition. The authors found that real-world success kept climbing as long as training action-token accuracy did, and that did not plateau until accuracy was past 95%. A standard one-pass schedule would have stopped well short of the policy you can actually deploy. On physical tasks, the bottleneck is not how many trajectories you train on; it is how precisely you fit the trajectories you already have.

The learning rate is held fixed at $2 \times 10^{-5}$ , the same one Prismatic used for VLM fine-tuning, with no warm-up. The compute budget is 21,500 A100-hours: 64 A100s, 14 days, batch size 2048. That is roughly the cost of a small open-source LLM pretraining run, and exactly the cost the open-source community absorbs once and then forks. The inference footprint is 15 GB of GPU memory at bfloat16, which fits on a single consumer card.
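The compute and memory figures can be sanity-checked with back-of-envelope arithmetic. The 7.2B parameter count is the total quoted later in this explainer; the 2 bytes per parameter is simply the width of bfloat16, and the calculation ignores activations and any framework overhead.

```python
gpus, days = 64, 14
a100_hours = gpus * days * 24             # GPU-hours for the full training run

params = 7.2e9                            # total parameters, as quoted in this explainer
bytes_per_param_bf16 = 2                  # bfloat16 is 16 bits
weights_gb = params * bytes_per_param_bf16 / 1e9

print(f"{a100_hours} A100-hours, about {weights_gb:.1f} GB of weights in bfloat16")
```

The weights alone come to a little over 14 GB, which is consistent with an inference footprint in the mid-teens once activations are added.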

What it actually does

The evaluation runs across three suites. The first is the BridgeData V2 WidowX setup: 17 tasks, 170 rollouts per method, designed to test four axes of generalization that a person would distinguish: visual (new backgrounds), motion (unseen positions), physical (unseen object shapes), semantic (categories never demonstrated), plus a language-grounding axis (naming the right object among several). The second is a Google robot with 12 tasks and 60 rollouts, split into in-distribution and out-of-distribution. The third is Franka fine-tuning, where every method is fine-tuned on the same 10-to-150 demonstrations and the comparison is against the data-efficient state of the art, Diffusion Policy (a from-scratch imitation policy that generates actions with a diffusion model and has no language backbone).

On BridgeData V2, OpenVLA averages about 71% success, RT-2-X about 51% — a 20 percentage-point gap with a seventh of the parameters. (The widely-quoted 16.5-point headline is the average across all 29 tasks on both robots; the Bridge-only gap is wider because the Google-robot suite is close.) That gap is not uniform across axes. OpenVLA leads on visual generalization (new backgrounds, distractors), motion (unseen positions), physical (unseen shapes), and language (multiple objects, name the right one). It trails on semantic generalization: the case where the instruction refers to a category the robot never saw in its demonstrations and has to lean on Internet world knowledge. RT-2-X edges OpenVLA there because its PaLI-X backbone is co-fine-tuned with Internet image-text data alongside robot actions, which keeps more of the pretraining knowledge alive. OpenVLA fine-tunes only on robot data and pays a small semantic tax for it.
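The relationship between the per-suite gaps and the headline number can be checked from the rounded figures in this explainer. How the two suites are pooled is an assumption here: the sketch tries weighting by rollouts and by tasks, using the counts given above (17 tasks and 170 rollouts on Bridge, 12 tasks and 60 rollouts on the Google robot). The success rates are the rounded values from the text, so the result is approximate.

```python
suites = {
    "bridge": {"tasks": 17, "rollouts": 170, "openvla": 71.0, "rt2x": 51.0},
    "google": {"tasks": 12, "rollouts": 60, "openvla": 85.0, "rt2x": 78.0},
}

def weighted_gap(weight_key):
    """Average OpenVLA-minus-RT-2-X gap, weighting each suite by weight_key."""
    total = sum(s[weight_key] for s in suites.values())
    return sum(
        s[weight_key] * (s["openvla"] - s["rt2x"]) for s in suites.values()
    ) / total

gap_by_rollouts = weighted_gap("rollouts")
gap_by_tasks = weighted_gap("tasks")
print(f"pooled by rollouts: {gap_by_rollouts:.1f}, by tasks: {gap_by_tasks:.1f}")
```

With these rounded inputs, pooling by rollouts lands close to the quoted 16.5 points, while weighting by tasks gives a smaller gap, so the headline is sensitive to how the average is formed.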

Figure 3 covers all three suites. Bridge shows the widest gap (about 20 points), Google is the closer comparison that confirms the result is not a Bridge-specific artifact (the 16.5-point headline averages the two), and the Franka fine-tune asks the more interesting question: is OpenVLA also the right starting point when you have new data and want to adapt? The per-category view gives the gap on each axis. The semantic column on Bridge is the only one where the bigger model still wins.

Figure 3 · rollout success per generalization axis


Average success rate by category on three evaluation suites. OpenVLA (teal) versus the strongest baseline named in the paper for that suite. On Bridge the gap is about 20 percentage points on average and OpenVLA leads every axis except semantic. On Google the two are comparable (OpenVLA 85 vs 78; the OOD subset ties). On Franka fine-tune, Diffusion Policy wins narrow single-instruction tasks like "Pour Corn into Pot" while OpenVLA wins the diverse multi-instruction tasks and the average.

On the Franka fine-tune, when the task is narrow and dexterous, a single instruction on a single object that has to be done precisely, Diffusion Policy still wins. The likely reason is that Diffusion Policy uses action chunking (it predicts a chunk of future actions and executes the first few open-loop) and temporal smoothing, both of which give it more stable trajectories than OpenVLA's one-step autoregressive output. When the task is diverse, multiple objects, multiple instructions in the same setup, Diffusion Policy collapses because it has no way to ground language. OpenVLA wins those by 20 absolute points on average. The authors are explicit that this is not a clean "OpenVLA always wins" result. Diffusion Policy is the right tool for narrow dexterous tasks; OpenVLA is the right tool for diverse language-grounded ones; OpenVLA alone clears 50% on every task tested.

Where does the Bridge gap actually come from? The authors credit three factors. The training set is larger (970k trajectories versus RT-2-X's 350k). It is more carefully cleaned, with all-zero Bridge actions filtered out instead of left in. And the encoder is fused DINOv2+SigLIP, which adds spatial precision RT-2-X does not have. Fine-tuning only on robot data is the other side of the ledger: it is what costs OpenVLA the semantic axis, because RT-2-X's co-fine-tuning on Internet data keeps more of the pretraining knowledge alive.

Fine-tuning, cheaply

Out-of-the-box performance is half the story; the other half is whether a lab with a different robot can pick this up. Full fine-tuning of OpenVLA on a new Franka task takes 8 A100s for 5 to 15 hours, depending on dataset size. The paper's second contribution is showing this can be cut by roughly 8× without measurable quality loss using LoRA at rank 32.

LoRA inserts a small rank- $r$ update $\Delta W = B A$ into each weight matrix, training only those rank- $r$ factors and leaving the original weights frozen. For OpenVLA at $r=32$ , that comes out to 97.6M trainable parameters, about 1.4% of the 7.2B-parameter total. The trade-off the paper measures (Table 1) is success rate on the Franka-Tabletop tasks versus GPU memory at batch size 16. Plotted, LoRA $r=32$ sits in the top-left corner at 68.2%, statistically indistinguishable from full fine-tuning's 69.7% (the 1.5-point gap is well inside each method's ±7-point standard error across 33 rollouts), at 59.7 GB instead of 163.3 GB, on a single GPU instead of two sharded with FSDP (fully-sharded data parallelism, which splits the model across GPUs when it won't fit on one).

$$W_{\text{fine-tuned}} = W_{\text{pretrained}} + B A, \quad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times d},\ r=32 \tag{LoRA}$$
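A small NumPy sketch shows what the low-rank update does to a single square weight matrix. The sizes are toy values (the text uses $r=32$), the zero initialization of $B$ is a common LoRA convention assumed here, and `lora_forward` is a hypothetical helper rather than any library's API.

```python
import numpy as np

def lora_forward(x, w, a, b):
    """Compute x (W + B A)^T without forming the merged matrix."""
    return x @ w.T + (x @ a.T) @ b.T

rng = np.random.default_rng(0)
d, r = 64, 4                           # toy sizes
w = rng.normal(size=(d, d))            # frozen pretrained weight
a = rng.normal(size=(r, d)) * 0.01     # trainable factor A
b = np.zeros((d, r))                   # trainable factor B, zero at the start
x = rng.normal(size=(3, d))

y_init = lora_forward(x, w, a, b)      # identical to the pretrained layer

b = rng.normal(size=(d, r)) * 0.01     # pretend training has moved B
y_trained = lora_forward(x, w, a, b)
y_merged = x @ (w + b @ a).T           # same result with merged weights

trainable = a.size + b.size            # 2 * d * r
fraction = trainable / (w.size + trainable)
```

Because the update can be merged back into $W$ after training, the adapted model runs at the same inference cost as the original.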

The other strategies in the table show why. "Last layer only" (fine-tune just the final transformer block and the embedding table, freeze the rest) is cheap at 51.4 GB but reaches only 30.3% success, less than half of full fine-tuning's 69.7%. "Frozen vision" (fine-tune the LM but freeze the encoder) does worse than full fine-tuning because, as the authors flagged earlier, the visual encoder needs to learn the controller's spatial geometry. "Sandwich" (fine-tune the encoder, the embedding table, and the last layer) is the runner-up at 62.1% and 64.0 GB; it does better than the frozen-vision row precisely because the encoder is unfrozen. LoRA wins because it lets every weight matrix adapt a little, including the encoder's, without paying for the full backward pass through 7B parameters.

4-bit inference, no rollout hit

Once the model is fine-tuned, the next question is what you need to run it. At bfloat16 (the default), OpenVLA takes 16.8 GB of memory at batch 1 and runs at about 6 Hz on an RTX 4090. That fits on a consumer card. It does not fit comfortably on something smaller. The paper benchmarks two LLM-style quantization recipes, 8-bit and 4-bit, on the same 8 BridgeData V2 tasks.

4-bit quantization cuts the memory by more than half, from 16.8 GB to 7.0 GB, and rollout success comes out at 71.9% versus the bfloat16 71.3%. Within the error bars, those are identical. 8-bit, which you might expect to be safer, comes out at 58.1%, and the explanation is not about precision. On the A5000 used for the rollouts, 8-bit inference runs at 1.2 Hz (versus 3 Hz for 4-bit and 5 Hz for bfloat16) because the 8-bit quantization kernels add more overhead than the memory bandwidth they save. The BridgeData V2 controller expects actions at 5 Hz. At 1.2 Hz, the closed loop is so out of phase that the rollouts fail. The paper confirms this by measuring offline action-token accuracy, which is comparable for all three precisions: the tokens 8-bit emits are correct, the robot just sees them too late.

Figure 4 · success vs GPU memory

Two panels. Fine-tune (Table 1): six strategies plotted as success against fine-tune memory. LoRA r=32 sits in the top-left corner at 68.2% (within the error bar of Full FT's 69.7%) at a bit over a third of the memory (59.7 GB versus 163.3 GB). Inference (Table 2): int4 matches bfloat16 at less than half the memory, while int8 collapses because the GPU only manages 1.2 Hz, slower than the 5 Hz controller.

That lesson does not show up in ordinary LLM serving. For text generation, an extra hundred milliseconds per token is a UX issue. For closed-loop control at 5 Hz, an extra two hundred milliseconds is the difference between grasping the cup and missing it; the system dynamics the policy was trained against require a particular cadence. Quantization for VLAs is a latency question on top of the usual memory and accuracy questions. Of the two quantized settings, 4-bit is the one that works: at 3 Hz it stays close enough to the controller's cadence that rollouts do not degrade, and 8-bit at 1.2 Hz does not.
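The timing argument can be made concrete with a few lines of arithmetic. The rates are the A5000 figures quoted above; `staleness` is a hypothetical helper, and treating one inference call as exactly `1 / rate` seconds is a simplification.

```python
def staleness(policy_hz, controller_hz=5.0):
    """Return (latency in ms, controller ticks that pass per policy output)."""
    latency_ms = 1000.0 / policy_hz
    ticks_per_action = controller_hz / policy_hz
    return latency_ms, ticks_per_action

rates = {"bfloat16": 5.0, "int4": 3.0, "int8": 1.2}   # Hz, as quoted in the text
report = {name: staleness(hz) for name, hz in rates.items()}

for name, (latency_ms, ticks) in report.items():
    print(f"{name}: {latency_ms:.0f} ms per action, {ticks:.2f} controller ticks")
```

At 1.2 Hz more than four controller ticks pass between consecutive actions, so each action is applied to a scene that has already moved on.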

What it cannot do yet

The paper names its own ceiling. OpenVLA takes a single image observation: no history, no proprioception, no second camera. Real robot setups are heterogeneous; the next iteration would have to support all of them, and doing it through a VLA fine-tune likely needs a VLM pretrained on interleaved image-text data so it can read multiple images natively.

Inference is the second ceiling. 6 Hz on a 4090 is fine for tabletop manipulation and useless for the bimanual, dexterous tasks that an ALOHA-class setup runs at 50 Hz. The proposed escape routes are action chunking (predict $k$ actions at once, execute open-loop, which spreads one forward pass over $k$ control cycles and effectively multiplies the policy's control frequency by $k$ ) and speculative decoding (let a smaller model draft tokens, have the big model verify), both of which would have to be ported from LLM serving and re-evaluated for whether the open-loop assumption holds when a robot is touching things.
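The effect of action chunking on control rate is simple to estimate. The sketch below assumes an idealized case in which one forward pass yields $k$ actions at the same latency as one; real chunked decoding would add some cost per extra action. `min_chunk_size` is a hypothetical helper.

```python
import math

def min_chunk_size(policy_hz, target_hz):
    """Smallest k such that k actions per forward pass reach target_hz."""
    return math.ceil(target_hz / policy_hz)

policy_hz, target_hz = 6.0, 50.0       # figures quoted in the text
k = min_chunk_size(policy_hz, target_hz)
effective_hz = policy_hz * k
print(f"chunk size {k} gives about {effective_hz:.0f} Hz")
```

Under that assumption a chunk of 9 actions would be enough to reach 50 Hz from a 6 Hz policy, at the price of executing 8 of every 9 actions open-loop.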

And the policy does not yet hit very-high reliability on the tested tasks. Most successes land in the 70-to-90% band, with few above 90%, which is the band where a method is interesting to robotics researchers and not yet ready to ship behind a customer-facing demo. That is exactly where the open weights, the open codebase, and the LoRA recipe give the next research group a place to start.

OpenVLA reuses existing parts: the fused vision encoder from Prismatic, the action discretization from RT-1 (packaged into a VLM by RT-2), the multi-embodiment data pool from Open X-Embodiment, and LoRA from the LLM-tuning world. What was missing was the assembly into one open system: an open backbone running open weights, with a 970k-trajectory training mix that beats the closed 55B-parameter incumbent, and a fine-tuning recipe that puts adaptation onto one GPU for a day. OpenVLA is that assembly, and it shortens the next paper's starting line.

Provenance: verified against primary literature

RT-2 (Brohan et al., 2023): Coined the "vision-language-action" framing and the next-token recipe: discretize actions and feed them to a VLM. OpenVLA inherits the recipe and replaces the closed PaLI-X with an open Llama 2 stack.

RT-1 (Brohan et al., 2022): Earlier robot transformer that established the 256-bin per-dimension action discretization. Both RT-2 and OpenVLA carry it forward.

Open X-Embodiment (Padalkar et al., 2023): The pooled, multi-embodiment robot dataset. >2M trajectories across 70+ datasets (22 embodiments). OpenVLA curates 970k of these for training.

Prismatic VLM (Karamcheti et al., 2024): The fused SigLIP + DINOv2 visual encoder (channel-wise concat) plus 2-layer MLP projector plus Llama 2 7B that OpenVLA fine-tunes for action prediction.

Llama 2 (Touvron et al., 2023): The 7B language-model backbone. Its tokenizer reserves only 100 "special" slots, too few for 256 action tokens, which forces the least-used-token trick.

LoRA (Hu et al., 2021): Low-rank adaptation. OpenVLA shows r=32 matches full fine-tuning on Franka tasks at 1.4% of the trained parameters.

Octo (Octo Model Team, 2024): The strongest prior open generalist policy. OpenVLA is positioned against it on Bridge, Google, and Franka fine-tuning suites.

Diffusion Policy (Chi et al., 2023): The from-scratch imitation baseline. Wins narrow single-instruction tasks; loses the diverse multi-instruction ones.

Correction: A few things the paper does not invent but inherits, and a few terms that read crisper than they are. The VLA framing and the 256-bin / least-used-token trick are from RT-2 and RT-1, not OpenVLA. The fused DINO+SigLIP encoder, the 2-layer MLP projector, and the channel-wise concat are all the Prismatic recipe; OpenVLA picks Prismatic out of three VLM candidates rather than building it. OpenVLA's own contributions are the open weights, the 970k-trajectory Open-X curation, the fine-tuning recipes (especially LoRA r=32 matching full fine-tuning), and the result that 4-bit quantization preserves rollout success while 8-bit silently does not. The Llama tokenizer has 32,000 entries; the "100 special tokens" number is the count of free slots in the SentencePiece reserve. Numbers like the 16.5-point headline gap are averages across 29 BridgeData V2 and Google robot tasks; per-axis ranking varies (semantic generalization, where RT-2-X leads, is the one exception).

Questions you might still have

Why discretize continuous actions instead of regressing them?
Because the language model is already a categorical next-token predictor and is great at it. A regression head would need a separate loss and a separate training story; mapping each of the 7 dimensions to one of 256 bins keeps everything inside the cross-entropy objective the LM was pretrained on. The cost is the quantization itself: 256 bins per dimension at the 1st-to-99th-percentile scale gives a step of about 0.04 cm for Δx on BridgeData V2, fine for tabletop grasping and coarse for surgical tasks.

Why overwrite the 256 least-used Llama tokens instead of adding new ones?
The Llama 2 SentencePiece tokenizer only reserves 100 "special" slots for tokens added during fine-tuning, which is fewer than the 256 the action discretization needs. Rather than retrain the tokenizer (which would force a partial vocabulary embedding rebuild and ripple through the model), the recipe overwrites the 256 ids that almost never appear in language data. The pretraining cost of losing those ids is essentially zero; the engineering cost is essentially zero. The downside is that text containing one of those rare ids would now be interpreted as a robot action, which never happens in normal text.

Did this paper invent the "vision-language-action" name?
No. RT-2 (Brohan et al., 2023) coined VLA and the recipe: discretize actions, treat them as language, fine-tune a VLM. OpenVLA's contributions are upstream of the recipe (the fused DINO+SigLIP encoder is Prismatic's; the 970k-trajectory curation is its own) and downstream of it (the open weights, the LoRA recipe that matches full fine-tuning, and the 4-bit quantization that does not hurt rollouts). The recipe itself is RT-2's.

If OpenVLA already runs Llama 2 inside, why is it only 7B parameters when RT-2-X is 55B?
RT-2-X is built on PaLI-X 55B (closed). OpenVLA is built on Llama 2 7B (open). The headline result, OpenVLA beating RT-2-X by 16.5% absolute across 29 tasks on both robots (and by 20 points on BridgeData V2 alone) with a seventh of the parameters, comes from three places at once: a larger curated training set (970k vs 350k trajectories), a fused vision encoder that gives spatial reasoning that pure CLIP-style encoders lack, and tighter data cleaning (filtering out all-zero actions in Bridge, for example). The 7B is not where the quality comes from; the data and the encoder are.

Why does 4-bit quantization actually beat 8-bit at rollout success?
The 8-bit kernels are slower than 4-bit on the GPU they evaluated on (A5000): 1.2 Hz vs 3 Hz at inference. The 5 Hz BridgeData V2 controller expects actions roughly that fast, so 1.2 Hz desynchronizes the loop and the recorded rollouts degrade. The paper shows that offline action-token accuracy is comparable for both quantizations; the 58.1% vs 71.9% gap is purely a system-dynamics artifact of being too slow. For closed-loop control, then, inference latency is not just a budget; it changes the task the model is trying to do.

Why is the vision encoder fine-tuned for robotics when freezing it is standard for VLMs?
Prismatic's VLM analysis showed that freezing the encoder helps for image-and-text question answering, and OpenVLA started there too. It did not work. The frozen-encoder ablation in Table 1 gets 47.0% rollout success vs 69.7% for full fine-tuning. The authors hypothesize the pretrained features capture object identity and rough localization, which language tasks reward, but not the fine-grained spatial relations (a few centimeters of offset between a gripper and a handle) that a controller needs. Unfreezing the encoder lets the model push those last details into its features.

Footnotes & further reading

The paper: Kim, Pertsch, Karamcheti et al., OpenVLA: An Open-Source Vision-Language-Action Model (CoRL 2024). Code, weights, and notebooks.
The VLA framing and the 256-bin / least-used-token action tokenization: Brohan et al., RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (2023), building on the same group's RT-1 (2022).
The pooled multi-embodiment robot dataset OpenVLA curates from: Padalkar et al., Open X-Embodiment: Robotic Learning Datasets and RT-X Models (2023).
The fused DINO+SigLIP visual encoder and the 2-layer MLP projector recipe: Karamcheti et al., Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models (2024).
The language-model backbone: Touvron et al., Llama 2 (2023).
The vision encoders: Zhai et al., SigLIP (2023), and Oquab et al., DINOv2 (2023).
Low-rank adaptation: Hu et al., LoRA (2021), and our explainer at /lora/.
The baselines: Octo Model Team, Octo (2024), and Chi et al., Diffusion Policy (2023).
