Revisiting Feature Prediction for Learning Visual Representations from Video
Learn video by predicting the features of masked spatio-temporal tubes.
No pixels, no labels, no text. A student encoder fills in the missing region in its own latent space, while a slow-moving copy of itself supplies the target.
Train a video encoder by hiding a block of patches, asking the model what those patches' features will look like, and supervising the answer with the model's own slowly-updated copy of itself.
Most ways to learn from video lean on something you do not really have. Supervised learning needs labels and video labels are expensive. Image-text models need captions and most video has none. Pixel-reconstruction models have to copy every speck of texture in the frame, which spends most of the loss budget on detail you would happily throw away. V-JEPA, from Meta's FAIR group, asks how far you can get with just the videos: no labels, no captions, no negative examples, no pixel-level reconstruction. A masked spatio-temporal block is hidden from the encoder, and the encoder is asked to predict the features of that hidden block from the visible rest. The supervision is the encoder's own answer, run through a slow-moving copy of itself.
That setup is the Joint-Embedding Predictive Architecture, or JEPA, a family of self-supervised recipes Yann LeCun proposed and that the same Meta group instantiated for still images in I-JEPA. V-JEPA carries the recipe across to the time axis, and the scaled-up V-JEPA 2 later extends it to robotic action conditioning. The headline numbers from this paper: a ViT-H/16 trained only on 2 million unlabeled videos reaches 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K, all with a frozen backbone. The same encoder handles appearance-heavy tasks (Kinetics, where what an action looks like is usually enough) and motion-heavy tasks (Something-Something-v2, where action labels are deliberately disambiguated by what is moving, not what it looks like).
The argument has a small number of moving parts: the JEPA layout, the choice of features as the target instead of pixels, the collapse-prevention recipe, the spatio-temporal masking that makes the task hard enough to teach a usable encoder, and the evaluation choice that lets a frozen encoder shine on tasks it was never shown.
A learner with no labels and no pixels
Start with what V-JEPA refuses to use. Three big self-supervised toolkits have shaped vision in the last six years, and the paper rules out all three.
Contrastive learning teaches an encoder to map two views of the same scene close together and two views of different scenes far apart. The far-apart ones are negative examples. To work, the negatives have to be plentiful and well-chosen, which on video means either a huge memory bank of clips or carefully scheduled batches. V-JEPA uses no negatives. Nothing in the loss says "keep these clips' representations apart."
Cross-modal alignment, the recipe behind CLIP, pairs each clip with a caption and trains an encoder so the clip-embedding matches the caption-embedding. It works wonderfully and inherits the encoder's knowledge from the language side. V-JEPA uses no text. The model is shown video and only video.
Masked pixel reconstruction, the recipe behind MAE and its video successors, hides patches of the input and trains a decoder to fill them in. The supervision is cheap and dense: every masked pixel is a regression target. V-JEPA hides patches the same way, but throws away the pixel target. The target is the features of those patches, not their RGB values.
What that buys, and what it costs, is the whole subject of the paper. A model trained without labels can be applied to any downstream task without changing weights, the same encoder doing action classification one minute and image classification the next. A model trained without text inherits no linguistic prior, so it must build its visual representations from scratch. And a model trained without pixel targets cannot be evaluated by "does it reconstruct?", so the only way to read out its quality is to attach a small probe on top and see how well that probe does at downstream tasks.
JEPA: regress representations, not images
JEPA is a layout with three pieces, sketched in Figure 2 of the paper. An encoder $E_\theta$ turns either input into a feature vector. A predictor $P_\phi$ maps the feature vector of one input $x$ to the feature vector of a related input $y$, conditioned on a description $z$ of how $x$ and $y$ are related (in V-JEPA that description is just the spatio-temporal positions of the masked tokens, written $\Delta_y$). The loss is regression between predicted feature and actual feature:

$$\min_{\theta,\phi}\; \big\| P_\phi\big(E_\theta(x), \Delta_y\big) - \mathrm{sg}\big(\bar{E}_\theta(y)\big) \big\|_1 \qquad (1)$$

The notation is dense, so unpack it. $x$ is the visible region of the video, $y$ is the hidden region the predictor must guess. $\bar{E}_\theta$ is a copy of the encoder whose weights $\bar{\theta}$ are an exponential moving average of the live encoder's weights, updated by $\bar{\theta} \leftarrow \tau\bar{\theta} + (1-\tau)\theta$ with $\tau \approx 0.998$. The $\mathrm{sg}(\cdot)$ is a stop-gradient, which means the optimizer is allowed to differentiate through everything except the target side of the loss. And $\Delta_y$ is the conditioning $z$, the positional embeddings of the masked tokens, so the predictor knows what positions in the grid it is being asked to fill in.
One sentence summary: the encoder takes the visible context, the predictor maps it into the encoder's embedding of the missing region, and the target is what a slow-moving copy of the same encoder thinks of the full video at those missing positions. The loss is L1. (The paper notes L1 trained more stably than L2 in their setup. The theoretical analysis still works under L1, except the "optimal predictor" is now the conditional median of the target distribution given the context, not the mean.)
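The median-versus-mean remark can be checked numerically. The sketch below is only an illustration with made-up one-dimensional "target features" for a single fixed context (the values and the grid search are assumptions, not anything from the paper): it brute-forces the best constant prediction under L1 and under L2 and compares each with the sample median and mean.

```python
import statistics

# Hypothetical 1-D targets for one fixed context: mostly near 1.0,
# plus one outlier standing in for an unpredictable detail.
targets = [0.9, 1.0, 1.0, 1.1, 9.0]

def l1_loss(pred, ys):
    """Mean absolute error of a constant prediction."""
    return sum(abs(pred - y) for y in ys) / len(ys)

def l2_loss(pred, ys):
    """Mean squared error of a constant prediction."""
    return sum((pred - y) ** 2 for y in ys) / len(ys)

# Brute-force the best constant prediction on a fine grid in [0, 10].
grid = [i / 1000 for i in range(0, 10001)]
best_l1 = min(grid, key=lambda p: l1_loss(p, targets))
best_l2 = min(grid, key=lambda p: l2_loss(p, targets))

print("L1 minimizer:", best_l1, "| median:", statistics.median(targets))
print("L2 minimizer:", best_l2, "| mean:  ", statistics.mean(targets))
```

The L1 minimizer sits on the median and ignores the outlier, while the L2 minimizer is dragged toward it. That is the sense in which an L1 predictor targets the conditional median.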
Notice what is not in eq (1). There is no reconstruction loss in pixel space. There is no contrastive term comparing this clip to other clips. There is no caption. The supervision is entirely the encoder disagreeing with a delayed copy of itself, and the encoder has to learn to predict what its delayed copy will say.
That is suspicious. If the encoder can output a constant, the predictor outputs the same constant, the difference is zero, and the loss is zero forever without learning anything. The recipe rests on stopping that shortcut.
Why predict features instead of pixels
Before the collapse worry, settle the choice of target. V-JEPA's headline finding is that predicting features beats predicting pixels, by a consistent margin in frozen evaluation, on every downstream task they test. Why would that be? A pixel target asks the model to copy everything in the masked patch, including pieces of the patch that cannot be inferred from the visible context: high-frequency texture, sensor noise, the precise position of a speck of dust. The squared-error loss has to spend capacity fitting those unpredictable bits, and capacity spent on noise is capacity not spent on structure.
A feature target asks the model to copy the encoder's representation of the masked patch. The encoder, having been trained on the same loss, has every incentive to drop pixel-level detail that does not help with the prediction task; the EMA teacher then bakes that simplification into the target, so the model regresses toward a representation that has already filtered out the unpredictable noise. Predict the structure, ignore the static.
The paper does this comparison cleanly. Same encoder (ViT-L/16), same VideoMix2M data, same masking, same number of iterations, same evaluation protocol: a frozen backbone with an attentive probe. The only thing that changes is the target. Predicting features gets +5.1 points on K400 (73.7 vs 68.6), holds even on SSv2 (66.2 vs 66.0), and gains +1.5 on IN1K (74.8 vs 73.3). The advantage holds under end-to-end fine-tuning too, though the gap narrows because fine-tuning lets the pixel model recover.
Be careful about what is being claimed. The argument is not that pixel reconstruction is broken; it is that pixel reconstruction is wasteful when your downstream task is recognition rather than synthesis. If you wanted to generate video, predicting pixels is the right loss. If you want a video encoder that downstream tasks can probe, predicting features keeps the capacity focused on the parts of the input that downstream tasks actually care about.
Why this does not collapse
Now back to the constant solution. The encoder outputs the same constant vector $c$ for every input; the predictor outputs $c$ as well; the loss is zero. Two ingredients in eq (1) block that shortcut: a stop-gradient on the target, and a slow exponential moving average that controls how fast the teacher tracks the student.
The first is the stop-gradient. The optimizer never sees the gradient flowing back through the target (the EMA encoder's output $\bar{E}_\theta(y)$), only through the prediction side (the predictor's output $P_\phi(E_\theta(x), \Delta_y)$). So the optimizer cannot pull the target toward the prediction; it can only pull the prediction toward the target. If the target moves, that is because the EMA tracking happens to move it.
The second is the EMA decay rate $\tau$. With $\tau \approx 0.998$ the teacher is glacially slow: each update keeps 99.8% of the old teacher weights, so the teacher averages over hundreds of past student updates (a time constant of about $1/(1-\tau) = 500$ steps), and when the student performs a gradient step, the teacher follows imperceptibly. The argument from Grill et al. (2020), which the paper adapts for V-JEPA's L1 loss, runs as follows. Suppose the predictor is optimal: it returns the exact conditional median of the target distribution given the context. Substitute that optimum back into the loss and the encoder's gradient becomes the gradient of the conditional median absolute deviation, $\mathrm{MAD}(Y \mid E_\theta(x))$, where $Y$ is the target feature. That MAD is small only when the context already determines the target. So the encoder is pushed to make its outputs predictable from context, which is exactly the opposite of collapsing to a constant.
That "suppose the predictor is optimal" clause is doing real work. The hypothesis defended in the paper, which seems to hold empirically, is that the predictor can keep up with a near-stationary teacher faster than the encoder can drift toward a degenerate solution. The EMA acts as a leash. Loosen it (drop $\tau$ too far below 1) and the predictor cannot stay ahead, the encoder is free to head for the constant solution, and collapse happens. Hold it tight ($\tau \approx 0.998$) and the leash takes hold.
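To get a feel for how slow a 0.998 EMA is, here is a minimal sketch on toy weights. It covers only the averaging rule; the stop-gradient and the real networks are left out, and the two-element "weight vectors" are purely illustrative.

```python
TAU = 0.998  # EMA decay quoted in the text

def ema_update(teacher, student, tau=TAU):
    """teacher <- tau * teacher + (1 - tau) * student, element-wise."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]

# Toy weights: the student has jumped to a new value and stays there;
# the teacher starts at the old value and is only allowed to track via EMA.
student = [1.0, -2.0]
teacher = [0.0, 0.0]

history = []
for step in range(2000):
    teacher = ema_update(teacher, student)
    history.append(teacher[0])

# With a fixed student, the remaining gap after n steps is tau**n of the initial gap.
gap_after_500 = student[0] - history[499]
gap_after_2000 = student[0] - history[1999]
print(f"gap after  500 steps: {gap_after_500:.3f}")
print(f"gap after 2000 steps: {gap_after_2000:.3f}")
```

Even after 500 updates the teacher has closed only about two thirds of the gap to a student that is not moving at all, which is why the predictor sees a nearly stationary target.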
The appendix adds one important detail about V-JEPA's targets: they are contextualized. The EMA teacher processes the full clip with no masking, and the prediction targets are the teacher's outputs at the masked positions. The teacher therefore sees neighboring patches when computing the embedding of the patch the student must predict, while the student sees only the unmasked complement. The target is a feature of y in context, not y in isolation, and giving the predictor a target that already encodes how the patch relates to its surroundings is harder to reach by collapsing than a context-free target would be.
The masking that makes the task hard enough
Hiding patches at random is the easy way and the wrong way. Video is so redundant in both space and time that a random scatter of holes leaves visible neighbors a couple of patches away from every hole; the predictor can interpolate without doing much work. The encoder learns little. The paper's ablation in Table 4 puts a number on it: random tubes at 90% drop only reach 46.4 on SSv2 versus 67.4 for the multi-block default. Same encoder, same data, same iterations.
V-JEPA's default is multi-block masking. Sample several spatial rectangles with aspect ratios uniform in (0.75, 1.5), and for each rectangle, drop every token at that spatial location across every frame. The hidden region is a spatio-temporal tube: a hole in the spatial plane that runs all the way through time. Two flavors stacked together: short-range tubes (eight rectangles, ~15% of each frame each) and long-range tubes (two rectangles, ~70% each), and the union of both lands at an average masking ratio of about 90%.
The tube structure forces the encoder out of a temporal shortcut. With per-frame random holes, the encoder can look at neighbor frames to fill a hole. With spatial tubes, the same patch is hidden across every frame, so the neighbor-frame trick is gone; the predictor must extrapolate from the visible surrounding region, the same argument the original MAE paper made about spatial redundancy. The temporal sweep at the top of the figure shows this directly: a masked spatial block stays masked frame after frame.
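The tube construction described above can be sketched in a few lines. This is an illustrative sampler, not the paper's code: the function names, the rounding of block sizes, and the uniform placement are assumptions, and the two flavors are sampled as separate masks with no attempt to reproduce the reported ~90% average.

```python
import math
import random

def sample_block(grid_h, grid_w, scale, rng, aspect_range=(0.75, 1.5)):
    """Sample one spatial rectangle covering roughly `scale` of the token grid."""
    aspect = rng.uniform(*aspect_range)            # height / width
    area = scale * grid_h * grid_w
    h = min(grid_h, max(1, round(math.sqrt(area * aspect))))
    w = min(grid_w, max(1, round(math.sqrt(area / aspect))))
    top = rng.randint(0, grid_h - h)               # randint is inclusive on both ends
    left = rng.randint(0, grid_w - w)
    return top, left, h, w

def sample_tube_mask(grid_t, grid_h, grid_w, num_blocks, scale, rng):
    """Union of `num_blocks` rectangles, repeated at every time step.

    Returns mask[t][i][j] == True where the token is hidden.
    """
    spatial = [[False] * grid_w for _ in range(grid_h)]
    for _ in range(num_blocks):
        top, left, h, w = sample_block(grid_h, grid_w, scale, rng)
        for i in range(top, top + h):
            for j in range(left, left + w):
                spatial[i][j] = True
    # Copy the same spatial hole to every frame: that is the "tube".
    return [[row[:] for row in spatial] for _ in range(grid_t)]

def masked_fraction(mask):
    """Fraction of tokens hidden by the mask."""
    flat = [v for frame in mask for row in frame for v in row]
    return sum(flat) / len(flat)

rng = random.Random(0)
T, H, W = 8, 14, 14  # token grid for a 16-frame, 224x224 clip
short_range = sample_tube_mask(T, H, W, num_blocks=8, scale=0.15, rng=rng)
long_range = sample_tube_mask(T, H, W, num_blocks=2, scale=0.70, rng=rng)

print(f"short-range masked fraction: {masked_fraction(short_range):.2f}")
print(f"long-range masked fraction:  {masked_fraction(long_range):.2f}")
```

Because every frame shares the same spatial mask, a hidden location has no visible copy in any neighboring frame, which is exactly the property that removes the temporal shortcut.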
With the mask in hand, the architecture falls in line. A ViT splits the 16-frame clip into spatio-temporal patches of 16×16 pixels spanning 2 consecutive frames; a 16-frame, 224×224 clip becomes 8×14×14 = 1568 tokens. The x-encoder, a ViT-L/16 or ViT-H/16, processes only the visible tokens (about 10% of the total at the default mask). The predictor is a narrow ViT-style transformer: 12 layers, embedding dimension 384. It takes the encoder's outputs concatenated with learnable mask tokens that carry the positional embedding of each hidden patch, and returns one prediction vector per hidden patch.
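The token arithmetic is easy to verify. The snippet below just redoes the count from the numbers in the paragraph above; the 0.90 mask ratio is the approximate average quoted in the text, so the visible-token count is a rough figure.

```python
frames, height, width = 16, 224, 224   # clip size from the text
tubelet, patch = 2, 16                 # each token spans 2 frames x 16 x 16 pixels

grid = (frames // tubelet, height // patch, width // patch)
num_tokens = grid[0] * grid[1] * grid[2]

mask_ratio = 0.90                      # approximate average masking ratio
num_visible = round(num_tokens * (1 - mask_ratio))

print(grid, num_tokens, num_visible)   # (8, 14, 14) 1568 157
```

So the encoder attends over roughly 157 tokens instead of 1568, which is where much of the training-cost saving comes from.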
On the supervision side, the y-encoder is the EMA copy of the x-encoder. It processes the full unmasked clip and the loss takes its outputs at the hidden positions. So the regression target lives in the encoder's own (slowly drifting) representation space, and the predictor regresses toward those contextualized features under L1:
```python
# one V-JEPA training step (PyTorch-ish pseudocode; helper names are illustrative)
clip = sample_clip()                      # 16 frames, 224x224, 3 channels
blocks = sample_multiblock_masks()        # union of spatial blocks, repeated in t
x_tokens = tokenize(clip)                 # 8x14x14 = 1568 spatio-temporal patches
y_idx = tokens_in(blocks)                 # the masked spatio-temporal positions
x_idx = complement(y_idx)                 # the visible context

# x-encoder: process visible tokens only (masked tokens are dropped)
x_emb = x_encoder(x_tokens[x_idx])

# predictor: concatenate x_emb with learnable mask tokens carrying pos(y_idx)
y_hat = predictor(x_emb, mask_tokens_at(y_idx))

# y-encoder is the EMA copy of x-encoder; stop-grad on its output
with torch.no_grad():
    y_full = y_encoder_ema(x_tokens)      # contextualized targets (full clip)
    y_tgt = y_full[y_idx]

loss = (y_hat - y_tgt).abs().mean()       # L1 regression
opt.zero_grad()
loss.backward()
opt.step()
y_encoder_ema.update_from(x_encoder)      # tau-EMA tracking
```

Two ablations confirm the design knobs are doing real work. First, the masking matters. Comparing four strategies under identical training, the multi-block default beats every alternative on every downstream task, and the random-tube strategy at 90% does the worst by a wide margin. Causal masking (the encoder gets the first 6 or 12 frames; everything after is masked) is in between and never wins, which is the paper's gentle argument against the autoregressive recipe imported from language.
[Figure: masking strategies. Panel shown: multi-block (V-JEPA default). Large contiguous spatio-temporal blocks anywhere in the clip, ~90% masked on average. The encoder must extrapolate across both time and space.]
Second, the readout matters. The V-JEPA objective is unnormalized, so there is no a priori reason the encoder should produce a linearly-separable representation. A linear probe under-reads the encoder's quality. The paper instead uses an attentive probe: one cross-attention layer with a single learnable query token, then a small MLP, then a linear classifier. The probe pools the frozen feature map non-linearly. Across the board the attentive probe adds 16 to 17 points over average pooling, and it lifts every baseline they re-evaluate too, so the comparison stays fair. The attentive probe is not a bigger model sneaking in; it is the right tool for reading an unnormalized representation.
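As a rough picture of what such a probe looks like, here is a minimal PyTorch sketch following the description above: one learnable query, one cross-attention layer, a small MLP, and a linear classifier. The class name, the head count, the MLP width, the residual and LayerNorm placement, and the 64-dimensional random features are all assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """One learnable query cross-attends over frozen tokens, then MLP + linear."""

    def __init__(self, dim, num_classes, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        nn.init.trunc_normal_(self.query, std=0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):
        # tokens: (batch, n_tokens, dim), produced by the frozen backbone
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)        # (batch, 1, dim)
        pooled = pooled + self.mlp(self.norm(pooled))   # residual MLP (assumed)
        return self.head(pooled.squeeze(1))             # (batch, num_classes)

# Stand-in for frozen backbone output: random features, no gradient needed.
features = torch.randn(2, 1568, 64)
probe = AttentiveProbe(dim=64, num_classes=400)
logits = probe(features)
print(logits.shape)
```

Only the probe's parameters would be trained; the backbone that produced `features` stays frozen. Average pooling is the special case where every token gets the same attention weight, which is why a learned query can read out more.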
Numbers: motion, appearance, label scarcity
With the recipe in place, three regimes are worth singling out. Each one says something different about what V-JEPA learned.
Motion-heavy tasks. Something-Something-v2 is the benchmark designed to defeat appearance-based shortcuts: its labels are things like "pushing something from left to right" or "pretending to take something out of something," deliberately decoupled from what is in the frame. V-JEPA H/16 reaches 72.2% on SSv2 with a frozen backbone, more than 21 points ahead of the best image-pretrained models (DINOv2, OpenCLIP, I-JEPA), and 5 to 6 points ahead of every pixel-prediction video model they tested at matched capacity. Static image pretraining, no matter how big, cannot cover the same ground; you have to train on video.
Appearance-heavy tasks. Kinetics-400 is the opposite. Its labels are usually inferable from a single representative frame (an action like "playing trumpet" reveals itself in a still image of a trumpet), so big image models do well by accident. V-JEPA H/16 reaches 81.9% on K400, ahead of every video self-supervised baseline; DINOv2 ViT-g/14 still pulls ahead at 83.4% with the same attentive probe, but only by spending image data the V-JEPA model never saw. On ImageNet-1K itself, V-JEPA H/16 reaches 77.9% with a two-layer attentive probe (77.4 with a one-layer probe), narrowing the gap to image models trained on internet-scale image collections.
Label-scarce tasks. Drop the attentive probe's training labels from 100% to 5% and the difference becomes stark. V-JEPA H/16 loses 12 points on K400 (80.2 → 68.2) and 13.9 on SSv2 (67.9 → 54.0). VideoMAEv2, a ViT-g/14 pixel-prediction baseline, loses 30 points on K400 and 26 on SSv2 over the same drop. The paper's framing: the probe is being asked to read the same features, and the V-JEPA features carry more useful structure per unit of labeled data, so the probe can learn more from less.
One number for cost. V-JEPA H/16 reaches the K400 result above after roughly 90,000 iterations at batch size 2400; the comparable VideoMAEv2 baseline trains for 1,500,000 iterations on the same video budget. The paper measures the wallclock and reports about a 2× speedup over the largest pixel-prediction baselines at matched quality. Feature prediction is also cheaper than pixel prediction, not just better.
What the model has clearly not learned, and the paper says so: anything the videos do not show. V-JEPA does relatively worse on ImageNet than on K400, and the authors flag that the publicly-available video corpus they trained on is much less visually diverse than the image internet. Their explicit ask is more diverse public video data; V-JEPA 2 takes a step in that direction by scaling the corpus by an order of magnitude.
The JEPA family
V-JEPA sits in the middle of a three-paper arc, all from the same Meta group, all reusing the same core recipe.
I-JEPA (Assran et al., 2023) was the image-only version. Same architecture: an x-encoder on the visible patches, a narrow predictor that fills in masked positions, an EMA teacher that supplies the target. The masking was 2D blocks in the image plane. The result was that feature prediction matched or beat the best pixel-MAE results on image classification, with significantly less compute. V-JEPA adds the time axis: spatio-temporal patches, spatio-temporal blocks, and a 16-frame clip in place of a single image.
V-JEPA 2 (Assran et al., 2025) scales the V-JEPA recipe up. The encoder grows to a ViT-g/16, the video corpus grows by roughly an order of magnitude, and a second predictor head is added that is conditioned on robot actions; the result is an encoder that not only does well on the same recognition benchmarks but also supports model-based planning in a robot. The base self-supervised objective is unchanged; V-JEPA 2 inherits this paper's loss, masking, and EMA teacher untouched.
What ties the three together is a claim about what self-supervised vision should optimize. MoCo and SimCLR optimize invariance to hand-crafted augmentations. MAE and VideoMAE optimize pixel reconstruction. CLIP optimizes text alignment. JEPA picks prediction in feature space, argues that the EMA teacher solves the collapse problem the BYOL line started solving in 2020, and argues that the recipe scales. V-JEPA is the strongest published evidence so far that the feature-prediction route works on video.
The arc inverts the autoregressive language-modeling story. There, scaling came first (GPT-3) and clever objectives came later; here the JEPA architecture is doing the heavy lifting up front, and scale arrives in V-JEPA 2. Whether the same families of representation will keep separating, features for understanding and pixels for synthesis, or whether a future model will unify them, is the open question the lineage was set up to ask.
Questions you might still have
How is the EMA teacher different from the student if they share weights?
They share the architecture, not the values. The student is updated by gradient descent. The teacher Ē is updated by Ē ← τ·Ē + (1−τ)·E with τ ≈ 0.998, so it lags many steps behind. That lag matters: a near-stationary target gives the predictor something to converge against, while gradient flow through the teacher is blocked by the stop-gradient.
What does an "attentive probe" actually do, and why does the paper use one instead of a linear probe?
It is one cross-attention layer with a single learnable query token, followed by a small MLP and a linear classifier. It pools the frozen backbone's feature map non-linearly. The paper uses it because the V-JEPA objective is unnormalized, so there is no a priori reason the backbone's features are linearly separable. Adaptive pooling adds +17 points on K400 and +16 on SSv2 over plain average pooling, and it lifts every baseline too.
How does V-JEPA compare to its sibling on still images, I-JEPA?
Same idea, different axis. I-JEPA at /i-jepa/ predicts the features of masked image regions from other image regions. V-JEPA repeats the recipe with spatio-temporal tubes so the same masked region spans every frame, which forces the encoder to model motion as well as appearance. On Something-Something-v2, where motion is the entire signal, V-JEPA beats every image-only baseline (including I-JEPA) by over 21 points.
What changed between V-JEPA and V-JEPA 2?
V-JEPA 2 at /v-jepa-2/ scales the recipe up (a ViT-g/16, an order of magnitude more video) and adds an action-conditioned predictor for robotic planning. The base self-supervised objective in this paper carries through unchanged; V-JEPA 2 inherits it.
Why an L1 loss and not L2?
The paper found L1 more stable in early training. The theoretical analysis sketched in section 3.1 still works: under L1 the optimal predictor is the conditional median rather than the conditional mean, and the encoder's gradient reduces to the gradient of the conditional median absolute deviation. The collapse argument is unchanged.
Footnotes & further reading
- The paper: Bardes, Garrido, Ponce, Chen, Rabbat, LeCun, Assran, Ballas, Revisiting Feature Prediction for Learning Visual Representations from Video (Meta FAIR / Inria, 2024). Code. Blog post.
- The image-only predecessor: Assran et al., Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA). Explained at /i-jepa/.
- The scaled-up successor: Assran et al., V-JEPA 2. Explained at /v-jepa-2/.
- The collapse-prevention argument: Grill et al., Bootstrap Your Own Latent (BYOL) introduced the EMA-teacher recipe; V-JEPA reuses the analysis with an L1 loss.
- The contextualized-target idea (the teacher sees the full unmasked clip, the student only the visible part): Baevski et al., data2vec.
- The pixel-prediction baselines V-JEPA compares against: MAE, VideoMAE, OmniMAE, Hiera, MVD, VideoMAEv2.
- The probe used throughout the evaluation: a one-layer cross-attention pool with a learnable query token, following CAE.
- Yann LeCun's position paper that names the JEPA family: A Path Towards Autonomous Machine Intelligence (2022).