V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

A robot that learned to plan by watching a million hours of internet video and 62 hours of itself.

Predict masked patches of a video in a learned representation space. Add actions on top. The same model that scored state of the art on action anticipation drives a Franka arm it has never seen, picking up a cup it is shown only as a goal image.

Explaining the paper: V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. Assran, Bardes, Fan, Garrido, et al. (Meta FAIR, Mila) · June 2025 · arXiv:2506.09985

A 1-billion-parameter video encoder, trained without a single label, that you can bolt 300 million parameters of action prediction onto and plan with.

Most of the recent wins in machine learning come from one idea: predict the missing piece. Hide a word and you get a language model. Hide a patch of an image and you get an image encoder. The obvious next step was video, which is mostly missing pieces by definition: a billion pixels of next-frame, all of which you could try to predict. Pixels, though, are mostly noise.

A leaf flutters in the wind. The exact position of every leaf in the next frame is unpredictable, and that unpredictability is irrelevant; what you wanted to know was that there was wind. A pixel-level loss spends most of its capacity on the wrong question. V-JEPA 2, out of Meta's FAIR lab, is a clean-up of an old answer to this: predict the missing piece in a learned representation space, where unpredictable noise has been dropped, instead of in pixels. Then scale it up. The result is a video encoder that, once frozen, can be aligned with a language model for video question answering, or have an action-prediction head bolted on top and be used to plan zero-shot manipulation on a robot arm it has never seen.

The model is called V-JEPA 2, the second video instance of LeCun's joint-embedding predictive architecture (JEPA). The robotics version is V-JEPA 2-AC, where AC stands for action-conditioned. The rest of this page works through why predict-in-features beats predict-in-pixels, what the mask-denoising loss actually is, how four scaling choices add up to a four-point gain, how you attach actions to a frozen feature space, and what planning means when your dynamics live in a latent rather than a pixel buffer.

Why not just predict pixels

Imagine the next frame of a video as a draw from some conditional distribution over images, given everything you have seen so far. Most of that distribution's entropy lives in details you do not care about: the exact configuration of leaves, the precise placement of every blade of grass, the speckle pattern in a shadow. A model trained to minimize pixel error has to spend its capacity learning to render all of that, because predicting the wrong fine grain costs it just as much as missing the actual content of the scene. Latent diffusion has the same problem and gets around it by running in a perceptual VAE's compressed code, which throws away high-frequency noise on purpose.

JEPA pushes the same idea further. Train an encoder $E_\theta$ that maps a frame to a representation. Now ask a predictor to predict the encoded version of the missing piece, not the pixels. A blade of grass in slightly the wrong place is the same vector as a blade in the right place if the encoder has learned to ignore the difference, so the predictor only ever has to fit predictable structure. The encoder and the predictor are trained together, and the loss is in representation space.

This does not buy a way to generate pixels, which is what pixel-based world models do. JEPA has no decoder, so it cannot draw the next frame. It can only tell you what the next frame would look like to its own encoder. That is fine for the downstream tasks the paper cares about: classification, anticipation, and planning by latent distance all consume the encoded version directly, so the fluttering leaves never have to be re-rendered.

Training encoder and predictor together has a well-known failure mode called representation collapse: the encoder can minimize the loss trivially by mapping every input to the same vector, which leaves the predictor nothing to predict. JEPA dodges this with two details that come back in the loss equation. The prediction target comes from a slowly updated copy of the encoder (an exponential moving average of its weights) rather than the live one, and a stop-gradient on that target keeps the loss from pulling the encoder toward the easy constant.

The mask-denoising objective in representation space

One training step looks like this: a video clip is patchified into a sequence of tubelets, three-dimensional patches of size $2\times 16\times 16$ covering two adjacent frames at a time. A subset of those tubelets is dropped, leaving a masked view $x$. The unmasked view $y$ is the full clip. The encoder reads $x$; the predictor takes the encoder's output and fills in embeddings at the masked positions, guided by a learnable mask token $\Delta_y$ that marks where to predict. The supervision comes from running the full clip $y$ through a separate copy of the encoder, an exponential moving average of the live weights, written $E_{\bar\theta}$. The objective minimizes the L1 distance between the predictor's outputs and the EMA encoder's targets, at the masked positions only:

\min_{\theta,\,\phi,\,\Delta_y}\ \big\lVert P_\phi(\Delta_y,\,E_\theta(x))\;-\;\mathrm{sg}\big(E_{\bar\theta}(y)\big)\big\rVert_1

(1)

The terms break down as follows. $E_\theta(x)$ is the encoder run on the masked view, so it returns a sequence of token embeddings with the masked positions empty. $\Delta_y$ is a learned vector that gets inserted at each masked position, telling the predictor "a missing patch was here, predict its embedding." $P_\phi$ is the predictor, a smaller transformer that reads the encoded visible tokens plus the mask tokens and outputs an embedding for each masked position. On the other side, $E_{\bar\theta}(y)$ is the EMA encoder applied to the full, unmasked clip; its outputs at the masked positions are the targets.
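To make the "masked positions only" part of (1) concrete, here is a minimal sketch in plain Python. The function name `masked_l1`, the toy embeddings, and the choice to average rather than sum the absolute errors are all assumptions made for illustration; the paper's implementation operates on batched tensors.

```python
def masked_l1(pred, target, masked_positions):
    """Mean absolute error between predictor outputs and teacher targets,
    evaluated only at the masked token positions (averaging is an assumption)."""
    total, count = 0.0, 0
    for pos in masked_positions:
        for p, t in zip(pred[pos], target[pos]):
            total += abs(p - t)
            count += 1
    return total / count

# Toy example: 4 token positions, 2-dimensional embeddings.
teacher = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5]]    # stands in for E_theta_bar(y)
predicted = [[9.0, 9.0], [0.0, 0.5], [2.0, 1.0], [9.0, 9.0]]  # stands in for P_phi(...)
masked = [1, 2]  # positions 0 and 3 were visible, so they are never scored

loss = masked_l1(predicted, teacher, masked)
print(loss)
```

The wildly wrong values at positions 0 and 3 do not affect the result, because visible positions carry no loss.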

The $\mathrm{sg}(\cdot)$ is the stop-gradient. Gradients of the L1 loss flow through the predictor and the student encoder, but not through the EMA (exponential moving average) teacher: the EMA copy is updated separately, by $\bar\theta \leftarrow (1-\tau)\bar\theta + \tau\theta$ at every step. Combined with the stop-gradient, this prevents the trivial collapse. The student weights cannot lower the loss by altering the targets, because the targets' gradient path is cut, and the teacher only ever drifts toward the student slowly through the EMA average. ^[1]
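The EMA update itself is one line. The sketch below applies it to a flat list of weights; the value of `tau` and the toy weights are assumptions, and in an autodiff framework the stop-gradient would additionally mean computing the teacher's outputs without tracking gradients.

```python
def ema_update(teacher, student, tau):
    """One EMA step: teacher <- (1 - tau) * teacher + tau * student."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(teacher, student)]

teacher = [0.0, 0.0]
student = [1.0, -2.0]   # held fixed here; in training it changes every step
tau = 0.01              # assumed value, chosen only for illustration

for _ in range(100):
    teacher = ema_update(teacher, student, tau)

# After 100 steps the teacher has covered a fraction 1 - 0.99**100 of the gap.
print(teacher)
```

With a small `tau` the teacher trails the student by many steps, which is what keeps the targets stable while the student is being optimized.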

In the figure below the mask ratio is the slider. The amber patches are what the encoder sees, the teal patches are what the predictor has to fill in (in feature space), and the EMA teacher on the bottom is fed the full clip with its outputs at the masked positions used as the targets:

Figure 1 · masked feature prediction

The student encoder E_θ sees only the visible (amber) patches of the masked view x. The predictor P_φ fills in embeddings at the dropped positions, guided by the mask token Δ_y. The EMA teacher E_θ̄ runs on the full clip y; an L1 loss compares predictor outputs to teacher outputs only at the masked positions. Stop-gradient keeps the loss from collapsing.

The L1 norm rather than L2 is a robustness call: representation targets that are slightly off should not blow the loss up the way an L2 would. And the masking is multiblock, meaning the dropped tokens come as contiguous spatiotemporal blobs rather than as a salt-and-pepper sprinkle, which forces the predictor to actually use temporal context to fill in long stretches of motion rather than copy from an immediate neighbor. Both choices are inherited from V-JEPA v1; this paper's contribution is everything done once they are in place.

What makes it work at scale

V-JEPA v1 already trained well at a smaller scale. V-JEPA 2 takes the same objective and scales it across four axes simultaneously: more data, more parameters, longer training, and higher resolution at the very end. Each axis is worth somewhere between half a point and two points of average accuracy across six probe tasks (Something-Something v2, Diving-48, Jester, Kinetics, COIN, ImageNet); together they add up to about four points over a ViT-L/16 baseline at 84.2.

The numbers below are what the paper reports for each ingredient added on top of the baseline, all from §2.2. Toggle them on and off to see the stack climb:

Figure 2 · four ingredients, four points

Toggles: data 2M → 22M · model 300M → 1B · train 90K → 252K iters · res. 256/16 → 384/64

The ViT-L/16 baseline averages 84.2 across six understanding tasks. Adding data (2M → 22M clips), capacity (300M → 1B params via ViT-g), training length (90K → 252K iterations), and a progressive higher-resolution cooldown each contribute roughly one point, summing to 88.2. The cooldown lifts spatial resolution from 256 to 384 and temporal length from 16 to 64 frames during the final 12K iterations only.
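A quick token count shows why the high-resolution setting is expensive. This sketch assumes square frames and the $2\times 16\times 16$ tubelets described earlier; `tubelet_tokens` is a hypothetical helper, not a function from the paper's code.

```python
def tubelet_tokens(frames, height, width, t_size=2, p_size=16):
    """Number of tubelet tokens for a clip cut into t_size x p_size x p_size patches."""
    assert frames % t_size == 0 and height % p_size == 0 and width % p_size == 0
    return (frames // t_size) * (height // p_size) * (width // p_size)

base = tubelet_tokens(16, 256, 256)      # 8 * 16 * 16
cooldown = tubelet_tokens(64, 384, 384)  # 32 * 24 * 24
ratio = cooldown / base

print(base, cooldown, ratio)
```

Nine times as many tokens per clip, before accounting for attention cost growing faster than linearly in sequence length, is the budget that the cooldown confines to the final iterations.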

The least glamorous of the four ingredients is the schedule. V-JEPA 2 uses warmup-constant-decay rather than cosine, with a long constant phase and a brief cosine-shaped cooldown at the end. That shape matters operationally because from a single constant-phase checkpoint you can launch a dozen different cooldowns at different resolutions and clip lengths without retraining the constant phase. Spending the high-resolution budget only in the final cooldown is what the paper calls progressive resolution training, and it cuts the ViT-g run from a naively projected 60 GPU-years to roughly 7 at 384 pixels and 64 frames (an 8.4× reduction). The anticipation, classification, and planning results all use that same encoder. ^[2]
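The schedule shape can be sketched as a function of the step index. The total of 252K iterations and the 12K-iteration cooldown come from the text above; the warmup length, the peak value, and the floor are assumptions, and the cosine-shaped decay follows this page's description rather than a checked implementation.

```python
import math

def wsd_lr(step, total=252_000, warmup=12_000, cooldown=12_000, peak=1.0, floor=0.0):
    """Warmup-constant-decay learning rate (warmup, peak, floor are assumed values)."""
    if step < warmup:
        return peak * step / warmup                 # linear warmup
    decay_start = total - cooldown
    if step < decay_start:
        return peak                                 # long constant phase
    progress = min(1.0, (step - decay_start) / cooldown)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

for s in (0, 6_000, 100_000, 246_000, 252_000):
    print(s, round(wsd_lr(s), 4))
```

Because the learning rate is flat until `decay_start`, any checkpoint saved in the constant phase is a valid starting point for a cooldown run with a different resolution or clip length.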

Scaling at this level is a Meta-only result in practice: ViT-g at 22 million clips on their cluster is not reproducible in a university lab. The paper's contribution to the rest of us is the schedule and the progressive-resolution recipe; both transfer to smaller budgets and both are simple.

Adding actions: V-JEPA 2-AC

So far there are no actions anywhere. V-JEPA 2 is trained on internet video, plain sequences of frames; there is no robot in it and no notion of what someone did between frames. To control a robot you have to predict not just what comes next in a video, but what comes next given that I take this action. The post-training stage adds that on top.

The data is 62 hours of unlabeled robot video from the Droid dataset, a public collection of teleoperated 7-DoF Franka Panda arm trajectories. "Unlabeled" means the paper does not use what task each clip was trying to do, whether the demonstration succeeded, or any reward. What it uses is the raw video plus the seven-dimensional end-effector state at each frame: three numbers for Cartesian position, three for orientation, one for the gripper. The action between consecutive frames is the change in end-effector state, $a_k = s_{k+1} - s_k$, which means the model learns velocity-style control even though nobody ever wrote that down as a label.
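Deriving actions from logged states is a one-liner. The state values below are invented for illustration, and differencing the orientation components naively (ignoring angle wrap-around) is a simplifying assumption of this sketch.

```python
def actions_from_states(states):
    """a_k = s_{k+1} - s_k for consecutive 7-D end-effector states."""
    return [[b - a for a, b in zip(s_k, s_next)]
            for s_k, s_next in zip(states, states[1:])]

# Each state: x, y, z, three orientation numbers, gripper (values are made up).
states = [
    [0.50,  0.00, 0.30, 0.0, 0.0, 0.0, 0.0],
    [0.50, -0.05, 0.30, 0.0, 0.0, 0.0, 0.0],
    [0.50, -0.05, 0.25, 0.0, 0.0, 0.0, 1.0],
]
actions = actions_from_states(states)

for a in actions:
    print(a)
```

Three states yield two actions: a 5 cm move along negative y, then a 5 cm descent with the gripper closing.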

The V-JEPA 2 encoder is frozen; only a new predictor is trained on top. Call this predictor $P_\phi$ again, by analogy with the pretraining one, even though it is a different network: 300M parameters, 24 layers, 16 heads, 1024 hidden dimension, and block-causal attention so each timestep can attend to past actions, past states, and past patches. The state at time $k$ is the encoded frame $z_k = E(x_k)$, a feature map of shape $16\times 16\times 1408$. The predictor consumes the interleaved sequence of actions, proprioceptive states, and feature maps, and outputs a prediction for the next feature map.

Now training has two losses. The first is teacher forcing: at every step $k$, feed the predictor the true history and ask it to predict $z_{k+1}$, with the truth supplied by the frozen encoder. Average over $T$ steps:

\mathcal{L}_{\text{tf}}(\phi) = \frac{1}{T}\sum_{k=1}^{T}\big\lVert\hat z_{k+1} - z_{k+1}\big\rVert_1 = \frac{1}{T}\sum_{k=1}^{T}\big\lVert P_\phi\big((a_t, s_t, E(x_t))_{t\le k}\big) - E(x_{k+1})\big\rVert_1

(2)

The second is a rollout loss. Run the predictor autoregressively for $T$ steps starting from $(s_1, z_1)$, feeding each output back in as the next input, and compare the final feature map to the true one:

\mathcal{L}_{\text{ro}}(\phi) = \big\lVert P_\phi(a_{1:T};\, s_1, z_1) - z_{T+1}\big\rVert_1

(3)

The training objective is the sum:

L(\phi) = \mathcal{L}_{\text{tf}}(\phi) + \mathcal{L}_{\text{ro}}(\phi)

(4)

Why two losses, when (2) already supervises every step? Because of how each one will be used at deployment. The teacher forcing in (2) trains the predictor to do one step accurately given a real previous frame; an MPC (model-predictive control) controller needs that one-step accuracy at the start of every plan. The rollout in (3) trains the predictor to recover from its own mistakes, because at planning time most rollout steps are conditioned on the predictor's own previous outputs, not on real frames. Without (3), one-step accuracy compounds badly. The paper uses $T=2$ for the rollout, so backpropagation only has to chain through one recurrent step; even with that mild rollout, the planning rollouts at inference time can run much longer because the predictor has been exposed to at least one of its own errors during training. In the figure below the rollout length T is interactive, and the drift visibly accumulates with T:

Figure 3 · teacher forcing vs rollout

Top: in teacher forcing each step gets the true encoded state and predicts the next, eq (2). Bottom: in rollout only the first state is real and every subsequent state is the predictor's own previous output, eq (3). The drift in the rollout grows visibly with T because errors compound. Training on the sum of both losses (4) makes long autoregressive plans usable later.
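The difference between the two losses can be sketched on a toy problem. Everything here is hypothetical: states are scalars instead of feature maps, the proprioceptive input is omitted, and `predictor` is a deliberately miscalibrated stand-in for $P_\phi$ whose true dynamics would be `z + a`.

```python
def predictor(z, a, gain=0.9):
    """Hypothetical stand-in for P_phi: undershoots the true step z + a by 10%."""
    return z + gain * a

def teacher_forcing_loss(z_true, actions):
    """Eq. (2): every step starts from the TRUE state z_k."""
    T = len(actions)
    return sum(abs(predictor(z_true[k], actions[k]) - z_true[k + 1])
               for k in range(T)) / T

def rollout_loss(z_true, actions):
    """Eq. (3): only the first state is real; later inputs are predictions."""
    z = z_true[0]
    for a in actions:
        z = predictor(z, a)
    return abs(z - z_true[len(actions)])

actions = [1.0, 1.0, 1.0, 1.0]
z_true = [0.0]
for a in actions:
    z_true.append(z_true[-1] + a)   # ground-truth dynamics

l_tf = teacher_forcing_loss(z_true, actions)
l_ro = rollout_loss(z_true, actions)
total = l_tf + l_ro                 # eq. (4)
print(l_tf, l_ro, total)
```

The per-step error is the same in both cases, but only the rollout loss sees it accumulate across four steps, which is the signal teacher forcing alone never provides.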

The predictor uses 3D-RoPE for the video patch positions but only the temporal component of RoPE for the action and proprio tokens. Actions and end-effector states have no spatial location of their own, so giving them a fake one would just confuse the relative-position read. ^[3] The block-causal attention pattern is the causal mask of language models applied per timestep rather than per token: at time $k$, the patch features attend to all earlier patches and to the action and proprio tokens at the same and earlier times.
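A block-causal mask is easy to write down explicitly. In this sketch each timestep contributes a fixed number of tokens (action, proprio, and patch tokens lumped together), and tokens within the same timestep attend to each other freely; both are assumptions about the layout made for illustration.

```python
def block_causal_mask(num_steps, tokens_per_step):
    """mask[i][j] is True when token i may attend to token j."""
    n = num_steps * tokens_per_step
    return [[(j // tokens_per_step) <= (i // tokens_per_step) for j in range(n)]
            for i in range(n)]

mask = block_causal_mask(num_steps=3, tokens_per_step=2)

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed pattern is a staircase of 2×2 blocks: the first timestep sees only itself, the last sees everything.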

Planning by energy minimization

With a trained dynamics model the planning loop is short. The agent is given a goal image $x_g$, encoded once into $z_g = E(x_g)$. At control step $k$, the current frame is encoded into $z_k$ and the current proprio state $s_k$ is read off the robot. A candidate plan is a sequence of $T$ actions $\hat a_{1:T}$. We score it by how close the final rolled-out feature map ends up to the goal, in L1:

\mathcal{E}(\hat a_{1:T};\, z_k, s_k, z_g) = \big\lVert P_\phi(\hat a_{1:T};\, s_k, z_k) - z_g\big\rVert_1

(5)

The best plan is the one that minimizes that energy:

a^\star_{1:T} = \arg\min_{\hat a_{1:T}}\ \mathcal{E}(\hat a_{1:T};\, z_k, s_k, z_g)

This picture has no reward function, no task label, and no inverse model that asks "what action would get me to the goal" in closed form. Because the world model predicts the consequence of action sequences and the L1 energy scores the result, the controller reduces to an optimizer over those scores. The L1-in-feature-space "energy" replaces a hand-engineered task reward, and the energy minimum implicitly defines what success means: an action sequence that makes the world look the way the goal image does, to the encoder.

There is no closed-form minimizer to write down, because $P_\phi$ is a transformer and the energy in (5) is a non-convex function of the action sequence. So the paper optimizes (5) with the cross-entropy method (CEM), a gradient-free population search. CEM keeps a Gaussian distribution over candidate action sequences, samples a batch from it, scores them with (5), refits the Gaussian to the top-K best, and repeats. About ten iterations of that, with a population of 800 sequences per iteration, is enough to land near the minimum. CEM is old (Rubinstein 1997), simple, and parallelizable; the only thing that needed adapting for V-JEPA 2-AC was that each sampled action is clipped to an L1 ball of radius 0.075, about 13 cm of end-effector displacement, because larger jumps fall outside the training distribution.
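The CEM loop described above fits in a few lines. This is a sketch under stated assumptions: `toy_energy` is a hypothetical stand-in for (5) with its minimum placed at $(0, -0.05)$, actions are 2-D single steps, the elite count of 10 and the initial standard deviation are guesses, and `clip_l1` rescales onto the ball rather than computing an exact projection.

```python
import random

def clip_l1(action, radius=0.075):
    """Rescale an action so its L1 norm does not exceed radius (assumed clipping rule)."""
    norm = sum(abs(x) for x in action)
    if norm <= radius:
        return action
    return [x * radius / norm for x in action]

def toy_energy(action, target=(0.0, -0.05)):
    """Hypothetical stand-in for eq. (5): L1 distance to a known best action."""
    return sum(abs(a - t) for a, t in zip(action, target))

def cem(energy, dim=2, samples=800, elites=10, iters=10, init_std=0.05, seed=0):
    rng = random.Random(seed)
    mean, std = [0.0] * dim, [init_std] * dim
    for _ in range(iters):
        # Sample a population from the current Gaussian and clip each action.
        population = [clip_l1([rng.gauss(m, s) for m, s in zip(mean, std)])
                      for _ in range(samples)]
        population.sort(key=energy)              # lowest energy first
        top = population[:elites]
        # Refit the Gaussian to the elite set.
        mean = [sum(a[d] for a in top) / elites for d in range(dim)]
        std = [max(1e-6, (sum((a[d] - mean[d]) ** 2 for a in top) / elites) ** 0.5)
               for d in range(dim)]
    return mean

best = cem(toy_energy)
print(best)
```

In the real planner each call to `energy` is a forward pass of the predictor, so the 800 candidates per iteration would be scored as one batch rather than one at a time.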

The paper visualizes (5) on a single-action slice for a reaching task: sweep $\Delta x$ and $\Delta y$, hold $\Delta z = 0$, and the energy bowl is smooth and locally convex with a minimum near the ground-truth direction $(0, -0.05)$. CEM exploits that smoothness. Below, advancing the CEM iteration slider shows the sample population (white dots) collapsing onto the elite top-K (teal) as the current Gaussian (teal ellipse) shrinks onto the bowl:

Figure 4 · the energy bowl and CEM

The energy from eq (5), evaluated at one-step actions for a Δy reach: a smooth convex bowl with minimum near the ground-truth action (0, −0.05). White dots are sampled candidate actions; the bright teal dots are the elite top-K used to refit the Gaussian; the teal ellipse is the current 2σ contour. As CEM iterates, the cloud collapses onto the minimum, and we execute its centre as a one-step action before re-planning.

The other planning knob is the horizon $T$. With $T=1$, you optimize one action, execute it, observe, repeat: this is visual servoing in latent space. With $T=4$ the planner rolls the dynamics model out four steps and commits only to the first action; if its fourth step is wrong, the next replan corrects. That is the receding-horizon control loop, classical MPC, with the dynamics swapped for a video model. The figure below shows it from above: a tabletop, a goal mark, the executed path so far in gray, the four-step plan in teal, and a ring on the one action the controller will actually execute before re-encoding and re-planning:

Figure 5 · the receding-horizon loop

Top-down view of the closed-loop MPC. At each step the world model rolls out a 4-step plan toward the goal, only the first action (ringed) is executed, the new frame is re-encoded, and the plan is recomputed. The distance to the goal closes monotonically; the executed path (gray) is short because each replan corrects the drift.
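The receding-horizon loop can be sketched on a toy 2-D world. All of this is hypothetical: the "world model" and the real world are the same function `z + a`, states are positions rather than feature maps, and the planner is plain random shooting swapped in for CEM to keep the example short.

```python
import random

def step(z, a):
    """Hypothetical dynamics, used both as world model and as the real world."""
    return [zi + ai for zi, ai in zip(z, a)]

def energy(z, goal):
    """L1 distance to the goal, in the spirit of eq. (5)."""
    return sum(abs(zi - gi) for zi, gi in zip(z, goal))

def plan(z, goal, horizon, rng, samples=200, max_step=0.05):
    """Random shooting: keep the sampled action sequence with the lowest final energy."""
    best_seq, best_e = None, float("inf")
    for _ in range(samples):
        seq = [[rng.uniform(-max_step, max_step) for _ in z] for _ in range(horizon)]
        z_end = z
        for a in seq:
            z_end = step(z_end, a)
        e = energy(z_end, goal)
        if e < best_e:
            best_seq, best_e = seq, e
    return best_seq

def mpc(z, goal, horizon=4, control_steps=15, seed=0):
    rng = random.Random(seed)
    distances = [energy(z, goal)]
    for _ in range(control_steps):
        seq = plan(z, goal, horizon, rng)
        z = step(z, seq[0])          # execute only the first action, then replan
        distances.append(energy(z, goal))
    return z, distances

final_z, distances = mpc([0.0, 0.0], [0.3, -0.2])
print(distances[0], distances[-1])
```

Discarding all but the first planned action looks wasteful, but it means each executed step is based on a fresh observation rather than on a prediction several steps old.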

CEM with 800 samples and 10 refinement iterations takes about 16 seconds per action on a single RTX 4090, which is what makes a full pick-and-place trajectory feasible (40-or-so actions, about ten minutes). The Cosmos baseline, a latent video diffusion model that has to denoise an entire video for every scored plan, takes four minutes per action. That 15× planning-cost gap is, before accuracy even enters the picture, why JEPA-style latent dynamics beat pixel-generation dynamics for control.
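The arithmetic behind those figures, using only the numbers quoted in the paragraph above (the 40-action trajectory length is the text's own rough estimate):

```python
seconds_per_action = {"V-JEPA 2-AC": 16, "Cosmos": 240}
actions_per_trajectory = 40  # "40-or-so" in the text

minutes = {name: s * actions_per_trajectory / 60
           for name, s in seconds_per_action.items()}
speedup = seconds_per_action["Cosmos"] / seconds_per_action["V-JEPA 2-AC"]

print(minutes, speedup)
```

At the same trajectory length, the diffusion-based planner would need 160 minutes where the latent planner needs under 11.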

Two arms, neither seen in training

Zero-shot means something specific here. V-JEPA 2-AC is trained on Droid, which is a teleoperation dataset collected at other labs, with other arms, in other rooms, under other lighting. The paper deploys the same checkpoint on two Franka arms in Meta's own labs, neither of which appears in Droid, with no additional fine-tuning. The goal is given as a single image; the robot sees only a monocular RGB feed; the controller plans with CEM and (5). Three skills are evaluated: grasp a cup or box, reach with an object held in the gripper, and pick-and-place using three sub-goal images stitched in sequence (gripper near object, object near goal, object at goal). ^[4]

The success rates below are averaged across two labs, with ten trials each. The Octo baseline is a vision-language-action transformer trained on the much larger Open-X Embodiment dataset (1M trajectories) and fine-tuned on Droid with behavior cloning. V-JEPA 2-AC is trained on 23k Droid clips, with no behavior cloning and no language. On the harder skills the gap widens sharply:

Figure 6 · zero-shot manipulation, Octo vs V-JEPA 2-AC

Success rates on three Franka manipulation skills (with the trivial reach included as a sanity row). The bars are averaged across Lab 1 and Lab 2, ten trials each; the cup-vs-box buttons split them out. Octo and V-JEPA 2-AC both solve reach. On grasp, reach-with-object, and pick-and-place, V-JEPA 2-AC wins by roughly 30 to 60 absolute points despite training on roughly 40× less data.

Numbers from the figure (averaging cup and box trials together): V-JEPA 2-AC reaches 45% on grasp (vs 7.5% for Octo), 75% on reach-with-object (vs 42.5%), and 72.5% on pick-and-place (vs 12.5%). All zero-shot, in environments the model has never seen. Pick-and-place is the task that matters for practice, because it requires composition: a sequence of sub-goals must each be hit in turn, and a failure on any of them sinks the trajectory.
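The absolute gaps follow directly from the percentages quoted above:

```python
# (V-JEPA 2-AC, Octo) success rates in percent, cup and box averaged.
success = {
    "grasp": (45.0, 7.5),
    "reach-with-object": (75.0, 42.5),
    "pick-and-place": (72.5, 12.5),
}
gaps = {skill: ours - octo for skill, (ours, octo) in success.items()}

for skill, gap in gaps.items():
    print(f"{skill}: +{gap} points")
```

The smallest gap is on reach-with-object and the largest on pick-and-place, the one skill that chains several sub-goals.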

Aligning the frozen encoder with a language model

The same V-JEPA 2 encoder, frozen, also drives understanding benchmarks. The recipe is the standard one for video LLMs: project encoder outputs into the embedding space of a language model with a small MLP, then train the projector plus the language model on video-instruction pairs while leaving the encoder fixed. With an 8B-class language model behind it, the paper hits a state-of-the-art 84.0 on PerceptionTest, 76.9 on TempCompass, 44.5 on MVP, 36.7 on TemporalBench, and 40.3 on TOMATO. These all live in the band of benchmarks that need temporal reasoning, not just frame-by-frame appearance. ^[5]

V-JEPA 2 was pretrained without any language supervision at all. Conventional wisdom from CLIP-trained encoders is that an encoder that has never seen text cannot be aligned to a language model competitively. V-JEPA 2 says otherwise. The encoder gets language alignment in a separate, cheap, late stage; the language model meets the video encoder for the first time during instruction tuning, and the resulting video-LLM still leads the 8B class on tasks that depend on motion understanding. Language is not required for learning to see, even when the downstream task is to talk about what was seen. That implication runs alongside the planning result and is the second claim of the paper.

On the probe benchmarks the encoder itself is evaluated on, a single frozen V-JEPA 2 ViT-g hits 77.3 on Something-Something v2 (motion understanding) and 39.7 recall-at-5 on Epic-Kitchens-100 (action anticipation), the latter a 44% relative improvement over the previous best task-specific anticipation model. Anticipation is the telling number here: from a clip ending at time $t$, the model has to predict what action will begin at $t + \tau$, with $\tau$ one second into the future. A pretrained representation that has learned to fill in missing pieces of video, by construction, encodes regularities about what tends to come next; the anticipation number indicates that the predictor module learned dynamics rather than texture alone.

Where it still breaks

The paper's own limitations section is clear-eyed, so this one is too.

Camera position is load-bearing. V-JEPA 2-AC takes Cartesian end-effector deltas as actions, but it has no camera calibration. To predict the visual consequence of an action, it has to implicitly infer which way the world's x-axis points in the image. When the robot base is out of frame, that inference is under-determined and the model mis-predicts which way a positive $\Delta x$ will move things on screen. The paper's appendix shows the success rate degrading sharply with camera angle perturbations. In practice the authors hand-tuned the camera position until everything worked, which is an asterisk on "zero-shot."

Long horizons compound. The autoregressive predictor was trained with a two-step rollout. At planning time it runs many more steps, and its representation-space drift grows with the rollout length. Combined with the exponential growth of the search space (a horizon of $T$ actions, each 7-dimensional, is a $7T$-dimensional continuous space for CEM to sample), this caps the practical planning horizon well below what you would want for, say, a multi-stage assembly task without sub-goals. The pick-and-place result relies on three sub-goal images precisely because the single final goal does not yield enough signal to plan all the way through. Pick-and-place from a single end goal, without sub-goals, remains unsolved in this paper.

Goals must be images. The energy in (5) is L1 distance to an encoded goal frame. That requires you to have a goal frame, which is fine in a lab and awkward in the world. Asking a robot to "pour me a coffee" in natural language is not what this model can do; that is left to future work that would marry a language model's goal-encoding to V-JEPA 2-AC's dynamics.

No way to render predictions. V-JEPA 2 has no decoder, so it cannot draw an imagined future frame back into pixels for a human to inspect. For planning by latent distance this is the right architectural call (everything you save by not predicting pixels you can spend on dynamics), but it does mean any sanity-check on the world model has to happen in feature space, which is hard to read. The energy landscape figure above is partly a workaround, and partly evidence that there is no neater way to do it.

The contribution is the pipeline itself. Internet-scale video, plus a feature-space prediction objective, plus a small amount of robot interaction data, plus model-predictive control, plus a single goal image, gives a robot that picks up cups it has never seen in a lab it has never been in. The individual pieces are not new; the assembly is new, and the pick-and-place gap to the next baseline is large enough to count as the paper's contribution.

Provenance: verified against primary literature

JEPA (LeCun, 2022): The joint-embedding-predictive architecture position paper that frames everything here.

I-JEPA (Assran et al., 2023): The masked-feature-prediction objective from images; V-JEPA is its video cousin.

V-JEPA v1 (Bardes et al., 2024): Where the multiblock masking and the EMA-teacher recipe come from.

DINOv2 (Oquab et al., 2023): The cluster-based retrieval recipe used to curate the YT-1B subset.

CEM (Rubinstein, 1997): The gradient-free planning loop V-JEPA 2-AC drives with.

Droid (Khazatsky et al., 2024): 62 hours of Franka teleop video, used unlabeled as the post-training data.

Caveat: The lead author publishes as both “Mido Assran” (informal) and “Mahmoud Assran” (the byline on this paper); we use “Assran et al.” The 77.3 SSv2 number is the ViT-g attentive-probe score, not the linear-probe score, so benchmarks elsewhere quoting V-JEPA 2 at different SSv2 numbers are using different probes.

Questions you might still have

Why does predicting in features beat predicting in pixels for a world model?
Pixel error spends most of its capacity on unpredictable fine grain (the exact configuration of leaves, speckle in shadows) that the model cannot improve at and that does not matter downstream. The encoder is trained to ignore that grain, so a loss in representation space stops penalizing the model for failing to render noise.

How does the EMA teacher prevent representation collapse?
The targets for the loss come from a slowly-updated EMA copy of the encoder, and a stop-gradient cuts the loss's path through those targets. So the student cannot make its task easier by pushing the targets toward a constant; the only way to lower the loss is to make the predictor's outputs match real per-position features the EMA encoder produces from the unmasked clip.

What does the rollout loss buy that teacher forcing does not?
At deployment time the planner runs the predictor autoregressively on its own outputs, not on real frames. Teacher forcing trains the one-step prediction; the rollout loss trains the predictor to recover from its own previous mistakes. Without it, multi-step trajectories drift quickly.

Why use a 7D action space when most robot policies use joint torques?
The 7D action is the change in end-effector state (3 Cartesian + 3 orientation + 1 gripper). It matches the only labels Droid has reliably (end-effector telemetry per frame) and lets V-JEPA 2-AC learn velocity-style control implicitly, with no inverse kinematics in the loop.

Why does Cosmos take 240 seconds per action while V-JEPA 2-AC takes 16?
Cosmos is a latent video diffusion model: scoring one candidate action means denoising an entire video. V-JEPA 2-AC scores a candidate action by one forward pass through a 300M-parameter predictor that outputs a feature map and an L1 distance. Same CEM loop, two very different per-sample costs.

Where does V-JEPA 2 sit in the JEPA family?
Third in line. I-JEPA (Assran et al., 2023) introduced the masked-feature-prediction objective on still images and pinned down the multi-block masking and EMA-teacher recipe. V-JEPA v1 (Bardes et al., 2024) carried the same objective to video by predicting masked spatiotemporal tubes through frozen features. V-JEPA 2 keeps the same loss and recipe, scales pretraining by 10x on data and 3x on parameters, and adds a small action-conditioned predictor on top of the frozen feature space so the encoder doubles as a planner. The video objective, the EMA collapse-prevention, and the multiblock masking are all inherited unchanged; what is new is the four-axis scaling recipe and the action-conditioned post-training. The I-JEPA and V-JEPA v1 explainer pages link from the V-JEPA v1 mentions earlier on this page and from footnote 1.

Footnotes & further reading

1. Stop-gradient plus EMA-teacher is the same anti-collapse trick that BYOL, DINO, and I-JEPA use; LeCun's broader argument for why this beats contrastive losses is in his "A Path Towards Autonomous Machine Intelligence" position paper.
2. The warmup-constant-decay schedule comes from Hägele et al., Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations. The progressive-resolution idea goes back to Touvron et al., Fixing the train-test resolution discrepancy, and was used at scale by DINOv2.
3. The 3D-RoPE generalization of rotary positional embeddings splits the feature dimension into three roughly equal segments and rotates each by the temporal, height, and width offset respectively. The 1D RoPE original is Su et al., RoFormer; the explainer for that paper is here.
4. Octo: Octo Model Team et al., Octo: An Open-Source Generalist Robot Policy. Cosmos: Agarwal et al., Cosmos World Foundation Model Platform for Physical AI. The pick-and-place protocol uses three sub-goal images stitched in sequence, with planning horizon = 1 for each sub-goal, switching at fixed time-step counts (4, 10, 4).
5. Video question-answering benchmarks: PerceptionTest (Pătrăucean et al., 2023), TempCompass (Liu et al., 2024c), MVP (Cai et al., 2024), TemporalBench (Krojer et al., 2024), TOMATO (Shangguan et al., 2024). All sample several short clips from a video and ask multiple-choice questions whose answers require either motion or temporal-order reasoning.
6. The paper itself: Assran, Bardes, Fan, Garrido, et al., V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. Code at facebookresearch/vjepa2. Blog at ai.meta.com/blog/v-jepa-2-world-model-benchmarks.
