
Revisiting Feature Prediction for Learning Visual Representations from Video

Learn video by predicting the features of masked spatio-temporal tubes.

No pixels, no labels, no text. A student encoder fills in the missing region in its own latent space, while a slow-moving copy of itself supplies the target.

Explaining the paper: Revisiting Feature Prediction for Learning Visual Representations from Video. Bardes, Garrido, Ponce, Chen, Rabbat, LeCun, Assran, Ballas · Meta FAIR · arXiv 2024 · arXiv:2404.08471

Train a video encoder by hiding a block of patches and asking the model what those patches' features will look like. The resulting frozen encoder handles both motion and appearance, from video alone.

Most ways to learn from video lean on something you do not really have. Supervised learning needs labels and video labels are expensive. Image-text models need captions and most video has none. Pixel-reconstruction models need to spend capacity copying every speck of texture in the frame, and most of the loss budget goes to detail you would happily throw away. V-JEPA, from Meta's FAIR group, asks how far you can get with just the videos: no labels, no captions, no negative examples, no pixel-level reconstruction. A masked spatio-temporal block is hidden from the encoder, and the encoder is asked to predict the features of that hidden block from the visible rest. The supervision is the encoder's own answer, run through a delayed copy of itself.

That setup is the Joint-Embedding Predictive Architecture, or JEPA, a family of self-supervised recipes that Yann LeCun proposed and that the same Meta group first instantiated for still images in I-JEPA. V-JEPA carries the recipe across to the time axis, and the scaled-up V-JEPA 2 later extends it to robotic action conditioning. The main numbers from this paper: a ViT-H/16 (a Vision Transformer at the Huge size, 16-pixel patches) trained only on 2 million unlabeled videos reaches 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K, all with a frozen backbone. The encoder's weights are left untouched and only a small probe on top is trained, so the numbers measure the pretrained features themselves. The same encoder handles appearance-heavy tasks (Kinetics, where what an action looks like is usually enough) and motion-heavy tasks (Something-Something-v2, where action labels are deliberately disambiguated by what is moving, not what it looks like).

The method has a few components: the JEPA layout, the choice of features as the target instead of pixels, the collapse-prevention recipe, the spatio-temporal masking that makes the task hard enough to teach a usable encoder, and the evaluation choice that measures a frozen encoder on tasks it was never shown. Each section below covers one.

A learner with no labels and no pixels

V-JEPA is defined as much by what it refuses to use as by what it does. Three big self-supervised toolkits have shaped vision in the last six years, and the paper rules out all three.

Contrastive learning teaches an encoder to map two views of the same scene close together and two views of different scenes far apart. The far-apart ones are negative examples. To work, the negatives have to be plentiful and well-chosen, which on video means either a huge memory bank of clips or carefully scheduled batches. V-JEPA uses no negatives. Nothing in the loss says "keep these clips' representations apart."

Cross-modal alignment, the recipe behind CLIP, pairs each clip with a caption and trains an encoder so the clip embedding matches the caption embedding. It works well, and the visual encoder inherits knowledge from the language side. V-JEPA uses no text. The model is shown video and only video.

Masked pixel reconstruction, the recipe behind MAE and its video successors, hides patches of the input and trains a decoder to fill them in. The supervision is cheap and dense: every masked pixel is a regression target. V-JEPA hides patches the same way, but throws away the pixel target. The target is the features of those patches, not their RGB values.

What that buys, and what it costs, is the central question of the paper. A model trained without labels can be applied to any downstream task without changing weights, the same encoder doing action classification one minute and image classification the next. A model trained without text inherits no linguistic prior, so it must build its visual representations from scratch. And a model trained without pixel targets cannot be evaluated by "does it reconstruct?", so the only way to read out its quality is to attach a small probe on top and see how well that probe does at downstream tasks.

JEPA: regress representations, not images

JEPA is a layout with three pieces, sketched in Figure 2 of the paper. An encoder $E_\theta$ turns either input into a feature vector. A predictor $P_\phi$ maps the feature vector of one input $x$ to the feature vector of a related input $y$, conditioned on a description $z$ of how $x$ and $y$ are related (in V-JEPA that description is just the spatio-temporal positions of the masked tokens). The loss is regression between predicted feature and actual feature:

$$\underset{\theta,\phi}{\text{minimize}} \quad \big\lVert P_\phi\!\left(E_\theta(x),\Delta_y\right) \;-\; \text{sg}\!\left(\bar{E}_\theta(y)\right) \big\rVert_1 \tag{1}$$

The notation is dense. $x$ is the visible region of the video, $y$ is the hidden region the predictor must guess. $\bar{E}_\theta$ is a copy of the encoder whose weights are an exponential moving average of the live encoder's weights, updated by $\bar{E} \leftarrow \tau\bar{E} + (1-\tau)E$ with $\tau \approx 0.998$. The $\text{sg}(\cdot)$ is a stop-gradient, which means the optimizer is allowed to differentiate through everything except the target side of the loss. And $\Delta_y$ is the conditioning $z$, the positional embeddings of the masked tokens, so the predictor receives which positions in the grid it must fill in.
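The EMA update is small enough to sketch directly. The snippet below is a minimal illustration in plain Python that treats the weights as a flat list of floats; `ema_update` and the toy values are hypothetical names chosen for this example, not the paper's code.

```python
def ema_update(teacher, student, tau=0.998):
    """Move each teacher weight a fraction (1 - tau) of the way toward the student."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]

teacher = [0.0, 0.0]
student = [1.0, -2.0]   # assumption: the student has moved here and then stays put

# A single update closes only 0.2% of the gap between teacher and student.
one_step = ema_update(teacher, student)

# Count how many updates the teacher needs to close half of the gap.
steps, current = 0, teacher
while abs(student[0] - current[0]) > 0.5:
    current = ema_update(current, student)
    steps += 1

print(one_step, steps)
```

With $\tau = 0.998$ the teacher needs a few hundred updates to cover half the distance to a student that is standing still, which is the lag the rest of the argument relies on.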

To summarize: the encoder takes the visible context, the predictor maps it into the encoder's embedding of the missing region, and the target is the slow-moving copy of the same encoder's output for the full video at those missing positions. The loss is L1. (The paper notes L1 trained more stably than L2 in their setup. The theoretical analysis still works under L1, except the "optimal predictor" is now the conditional median of the target distribution given the context, not the mean.)
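The median-versus-mean remark can be checked numerically. The sketch below is a toy, not anything from the paper: it brute-forces the single constant that best predicts a handful of scalar targets under each loss. The helper `best_constant` and the target values are made up for illustration.

```python
import statistics

def best_constant(targets, loss):
    """Brute-force the constant c on a fine grid that minimizes the summed loss."""
    grid = [i / 100 for i in range(0, 1001)]   # 0.00, 0.01, ..., 10.00
    return min(grid, key=lambda c: sum(loss(y - c) for y in targets))

targets = [1.0, 2.0, 3.0, 4.0, 10.0]   # toy targets with one outlier

l1_opt = best_constant(targets, abs)               # minimizer of sum |y - c|
l2_opt = best_constant(targets, lambda r: r * r)   # minimizer of sum (y - c)^2

print(l1_opt, statistics.median(targets))
print(l2_opt, statistics.mean(targets))
```

The L1 minimizer lands on the median and ignores how far away the outlier sits, while the L2 minimizer lands on the mean and is pulled toward it.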

Something is missing from eq (1). There is no reconstruction loss in pixel space. There is no contrastive term comparing this clip to other clips. There is no caption. The supervision is entirely the gap between the encoder's prediction and a delayed copy of itself, and the encoder is trained to match the output of that delayed copy.

If the encoder can output a constant, the predictor outputs the same constant, the difference is zero, and the loss is zero forever without learning anything. The recipe rests on stopping that shortcut.
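A toy version of the shortcut makes the problem concrete. Everything in this sketch is hypothetical: the two "encoders" are hand-written functions on short lists of numbers, and the predictor is the identity. It only illustrates that a zero regression loss says nothing about whether the features carry information.

```python
def l1_loss(pred, target):
    """Mean absolute difference between two feature vectors."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def collapsed_encoder(clip):
    """Degenerate solution: ignores its input and returns a constant."""
    return [0.5, 0.5, 0.5]

def informative_encoder(clip):
    """Hypothetical stand-in for features that depend on the input."""
    return [sum(clip) / len(clip), max(clip), min(clip)]

def identity_predictor(features):
    return features

context = [0.1, 0.4, 0.2]   # toy "visible" region
hidden = [0.9, 0.7, 0.8]    # toy "masked" region

collapsed_loss = l1_loss(identity_predictor(collapsed_encoder(context)),
                         collapsed_encoder(hidden))
informative_loss = l1_loss(identity_predictor(informative_encoder(context)),
                           informative_encoder(hidden))

print(collapsed_loss, informative_loss)
```

The collapsed encoder scores a perfect loss while encoding nothing, which is why the loss alone cannot be trusted without the stop-gradient and EMA machinery.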

Why predict features instead of pixels

V-JEPA's headline finding is that predicting features matches or beats predicting pixels in frozen evaluation on every downstream task they test, with clear margins on Kinetics-400 and ImageNet-1K. Why would that be? Under a pixel target, the model must copy everything in the masked patch, including pieces of the patch that cannot be inferred from the visible context: high-frequency texture, sensor noise, the precise position of a speck of dust. The squared-error loss has to spend capacity fitting those unpredictable bits.

Under a feature target, the model copies the encoder's representation of the masked patch. The encoder, trained on the same loss, is pushed to drop pixel-level detail that does not help with the prediction task; the EMA teacher then bakes that simplification into the target, so the model regresses toward a representation that has already filtered out the unpredictable noise.

Figure 1 · feature target vs. pixel target


The patch the model is asked to match. On the left, the feature target is the EMA teacher's embedding of the patch, a smoothed, semantic representation. On the right, the pixel target still carries all of its unpredictable detail. The stacked bar on the right is the loss budget: at the feature end every unit of capacity buys signal; at the pixel end roughly half the capacity goes to fitting noise. Numbers under the figure come from Table 1: at matched training, a ViT-L/16 reaches K400 73.7 with the feature target versus 68.6 with pixels.

The paper does this comparison cleanly. Same encoder (ViT-L/16), same VideoMix2M data, same masking, same number of iterations, same evaluation protocol: a frozen backbone with an attentive probe. The only thing that changes is the target. Predicting features gains +5.1 points on K400 (73.7 vs 68.6), is roughly level on SSv2 (66.2 vs 66.0), and gains +1.5 on IN1K (74.8 vs 73.3). The advantage holds under end-to-end fine-tuning too, though the gap narrows because fine-tuning lets the pixel model recover.

The argument is not that pixel reconstruction is broken; it is that pixel reconstruction is wasteful when your downstream task is recognition rather than synthesis. If you wanted to generate video, predicting pixels is the right loss. If you want a video encoder that downstream tasks can probe, predicting features keeps the capacity focused on the parts of the input that downstream tasks actually care about.

Why this does not collapse

The constant solution is the failure mode to rule out. The encoder outputs $\mathbf{c}$ for every input; the predictor outputs $\mathbf{c}$; the loss is zero. Two ingredients in eq (1) block that shortcut: a stop-gradient on the target, and a slow exponential moving average that controls how fast the teacher tracks the student.

The first is the stop-gradient. The optimizer never sees the gradient flowing back through $\bar{E}_\theta(y)$, only through the prediction side $P_\phi(E_\theta(x),\Delta_y)$. So the optimizer cannot pull the target toward the prediction; it can only pull the prediction toward the target. If the target moves, that is because the EMA tracking happens to move it.

The second is the EMA decay rate $\tau$. With $\tau \approx 0.998$ the teacher $\bar{E}$ updates very slowly: it averages over an effective window of roughly $1/(1-\tau) = 500$ past student updates, so when the student performs a gradient step, the teacher's weights move imperceptibly. The argument from Grill et al. (2020), which the paper adapts for V-JEPA's L1 loss, runs as follows. Suppose the predictor is optimal: it returns the exact conditional median of the target distribution given the context. Substitute that optimum back into the loss and the encoder's gradient becomes the gradient of the conditional median absolute deviation, $\nabla_\theta \mathrm{MAD}(Y \mid E_\theta(x))$. That MAD is small only when the context embedding already determines the target. So training pushes the encoder to capture the information in the context that determines the target, which is the opposite of collapsing to a constant.
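Written out, the substitution step looks roughly like this. It is a sketch in the notation of eq (1), dropping the conditioning $\Delta_y$ for brevity and writing $Y = \text{sg}(\bar{E}_\theta(y))$ for the target; it paraphrases the argument described above and is not a quotation of the paper's derivation.

```latex
% Step 1: under an L1 loss, the optimal predictor is the conditional median.
P^\star\big(E_\theta(x)\big)
  = \operatorname*{arg\,min}_{P} \; \mathbb{E}\,\big\lVert P(E_\theta(x)) - Y \big\rVert_1
  = \operatorname{median}\big(Y \mid E_\theta(x)\big)

% Step 2: plug the optimum back in; what is left is the conditional
% median absolute deviation, and the encoder descends its gradient.
\nabla_\theta \, \mathbb{E}\,\big\lVert P^\star(E_\theta(x)) - Y \big\rVert_1
  = \nabla_\theta \, \mathrm{MAD}\big(Y \mid E_\theta(x)\big)
```

The second line is only as good as the assumption behind the first: it holds when the predictor has actually reached its optimum for the current encoder.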

That argument assumed the predictor is optimal. The hypothesis defended in the paper, which seems to hold empirically, is that the predictor converges to a near-stationary teacher faster than the encoder moves toward a degenerate solution. The EMA acts as a leash. Loosen it (drop $\tau$ too far below 1) and the predictor no longer converges first, leaving the encoder able to reach the constant solution, and collapse happens. Hold it tight ($\tau \approx 0.998$) and the predictor converges first, keeping the encoder away from the constant solution.

Figure 2 · the EMA teacher as a leash


The student encoder (amber) traces a structured loop in feature space; the EMA teacher follows on a delay set by τ. Drag τ down toward 0.5 and the teacher's weights track the student almost exactly; both points then drift toward the origin, the trivial constant representation. Drag τ back up to 0.998 (V-JEPA's setting) and the teacher is near-stationary, the predictor has a stable target to regress toward, and the encoder is forced to encode information about the input.

The appendix adds one important detail about V-JEPA's targets: they are contextualized. The EMA teacher processes the full clip with no masking, and the prediction targets are the teacher's outputs at the masked positions. The teacher therefore sees neighboring patches when computing the embedding of the patch the student must predict, while the student sees only the unmasked complement. The target is a feature of $y$ in context, not $y$ in isolation, and a target that already encodes how the patch relates to its surroundings is harder to match with a collapsed representation than a context-free target would be.

The masking that makes the task hard enough

Hiding patches at random is simpler but trains a weaker encoder. Video is so redundant in both space and time that a random scatter of holes leaves visible neighbors a couple of patches away from every hole; the predictor can interpolate without doing much work. The encoder learns little. The paper's ablation in Table 4 puts a number on it: random tubes at 90% drop only reach 46.4 on SSv2 versus 67.4 for the multi-block default. Same encoder, same data, same iterations.

V-JEPA's default is multi-block masking. Sample several spatial rectangles with aspect ratios uniform in (0.75, 1.5), and for each rectangle, drop every token at that spatial location across every frame. The hidden region is a spatio-temporal tube: a hole in the spatial plane that runs all the way through time. Two flavors are stacked together, short-range tubes (eight rectangles, ~15% of each frame each) and long-range tubes (two rectangles, ~70% each), and the union of both lands at an average masking ratio of about 90%.
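The sampling procedure can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the released implementation: the 14×14×8 token grid assumes a 224×224, 16-frame clip with 16-pixel patches and 2-frame tubelets, the aspect ratio is taken to mean height over width, and the names `sample_block` and `sample_tube_mask` are hypothetical.

```python
import math
import random

GRID = 14   # assumption: 224 / 16 spatial patches per side
T = 8       # assumption: 16 frames grouped into 2-frame tubelets

def sample_block(scale, rng):
    """One rectangle covering about `scale` of a frame, aspect ratio in (0.75, 1.5)."""
    aspect = rng.uniform(0.75, 1.5)              # taken here as height / width
    area = scale * GRID * GRID
    h = max(1, min(GRID, round(math.sqrt(area * aspect))))
    w = max(1, min(GRID, round(math.sqrt(area / aspect))))
    top = rng.randint(0, GRID - h)
    left = rng.randint(0, GRID - w)
    return {(r, c) for r in range(top, top + h) for c in range(left, left + w)}

def sample_tube_mask(num_blocks, scale, rng):
    """Union of spatial blocks, repeated on every time step to form tubes."""
    spatial = set()
    for _ in range(num_blocks):
        spatial |= sample_block(scale, rng)
    return {(t, r, c) for t in range(T) for (r, c) in spatial}

rng = random.Random(0)
short_range = sample_tube_mask(8, 0.15, rng)   # eight small blocks
long_range = sample_tube_mask(2, 0.70, rng)    # two large blocks

total_tokens = T * GRID * GRID
short_ratio = len(short_range) / total_tokens
long_ratio = len(long_range) / total_tokens
print(total_tokens, round(short_ratio, 2), round(long_ratio, 2))
```

Because the spatial set is copied onto every time step, any position hidden in one frame is hidden in all of them, which is the property that removes the neighbor-frame shortcut.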

The tube structure removes a temporal shortcut. With per-frame random holes, the encoder can look at neighbor frames to fill a hole. With spatial tubes, the same patch is hidden across every frame, so the neighbor-frame trick is gone; the predictor must extrapolate from the visible surrounding region. This mirrors the original MAE argument about spatial redundancy: neighboring patches are so similar that a randomly hidden one can be copied from its neighbor, so the task only becomes hard once you hide whole contiguous regions. The temporal sweep at the top of the figure shows this directly: a masked spatial block stays masked frame after frame.

Figure 3 · the multi-block spatio-temporal mask


A six-frame stand-in for a 16-frame V-JEPA clip, tokenized into an 8×8 spatial grid per frame. The amber outlines are spatio-temporal tubes: the same spatial blocks are masked on every frame. Drag the masking ratio: at very low ratios the prediction task is trivial (almost nothing to predict); at 90% (the V-JEPA default) the encoder sees a thin visible context and must extrapolate the rest; at 100% there is no context left and the task is impossible. The default lives near the high end on purpose.

With the mask defined, the architecture is straightforward. A ViT splits the 16-frame clip into spatio-temporal patches of 16×16 pixels spanning 2 consecutive frames; a 16-frame, 224×224 clip becomes 8×14×14 = 1568 tokens. The $x$-encoder, a ViT-L/16 or ViT-H/16, processes only the visible tokens (about 10% of the total at the default mask). The predictor is a narrow ViT-style transformer: 12 layers, embedding dimension 384. It takes the encoder's outputs concatenated with learnable mask tokens that carry the positional embedding of each hidden patch, and returns one prediction vector per hidden patch.
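The token arithmetic is easy to reproduce. The short sketch below just recomputes the counts quoted above; `num_tokens` is a hypothetical helper, and the 90% figure is the approximate average masking ratio rather than an exact per-clip value.

```python
def num_tokens(frames=16, size=224, patch=16, tubelet=2):
    """Number of spatio-temporal tokens for one clip."""
    return (frames // tubelet) * (size // patch) ** 2

total = num_tokens()                   # 8 * 14 * 14
visible = round(total * (1 - 0.90))    # tokens the x-encoder sees at ~90% masking

print(total, visible)
```

Since the encoder only processes the visible tokens, its sequence length at the default mask is roughly a tenth of the full clip's.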

On the supervision side, the $y$-encoder is the EMA copy of the $x$-encoder. It processes the full unmasked clip and the loss takes its outputs at the hidden positions. So the regression target lives in the encoder's own (slowly drifting) representation space, and the predictor regresses toward those contextualized features under L1:

# one V-JEPA training step (PyTorch-ish pseudocode)
clip = sample_clip()                # 16 frames, 224x224, 3 channels
blocks = sample_multiblock_masks()  # union of spatial blocks, repeated in t

x_tokens = tokenize(clip)           # 8x14x14 = 1568 spatio-temporal patches (2 frames each)
y_idx = tokens_in(blocks)           # the masked spatio-temporal positions
x_idx = complement(y_idx)           # the visible context

# x-encoder: process visible tokens only (masked tokens are dropped)
x_emb = x_encoder(x_tokens[x_idx])
# predictor: concatenate x_emb with learnable mask tokens carrying pos(y_idx)
y_hat = predictor(x_emb, mask_tokens_at(y_idx))

# y-encoder is the EMA copy of x-encoder; stop-grad on its output
with no_grad():
    y_full = y_encoder_ema(x_tokens)   # contextualized targets
    y_tgt  = y_full[y_idx]

loss = (y_hat - y_tgt).abs().mean()    # L1 regression
loss.backward()
opt.step()
y_encoder_ema.update_from(x_encoder)   # tau-EMA tracking

Two ablations test the design knobs. First, the masking matters. Comparing four strategies under identical training, the multi-block default beats every alternative on every downstream task, and the random-tube strategy at 90% does the worst by a wide margin. Causal masking (the encoder gets the first 6 or 12 frames; everything after is masked) is in between and never wins. Predicting the future from only the past is the autoregressive recipe that powers language models (GPT-style next-token prediction); this is the paper's gentle argument that it is the wrong fit for video.

Figure 4 · which masking strategy works


multi-block · V-JEPA default. Large continuous spatio-temporal blocks anywhere in the clip, ~90% masked on average. The encoder must extrapolate across both time and space.

Frozen-backbone top-1 from Table 4. Same ViT-L/16, same K710+SSv2 pretraining, same attentive probe. The multi-block default wins on K400, SSv2, and IN1K; random tubes leak too much neighbor information and never recover. Tap a bar to read what the strategy actually does.

Second, the readout matters. The V-JEPA objective is unnormalized, so there is no a priori reason the encoder should produce a linearly-separable representation. A linear probe under-reads the encoder's quality. The paper instead uses an attentive probe: one cross-attention layer with a single learnable query token, then a small MLP, then a linear classifier. The probe pools the frozen feature map non-linearly. Across the board the attentive probe adds 16 to 17 points over average pooling, and it lifts every baseline they re-evaluate too, so the comparison stays fair.
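The pooling step at the heart of such a probe can be sketched in plain Python. This is a deliberately reduced illustration: it keeps only the single-query softmax pooling and omits the learned key/value projections, the MLP, and the classifier that the probe described above also contains. The query values are hand-picked stand-ins for a learned parameter.

```python
import math

def softmax(scores):
    m = max(scores)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attentive_pool(tokens, query):
    """Single-query attention pooling over frozen tokens (no learned projections)."""
    d = len(query)
    scores = [sum(q * x for q, x in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    weights = softmax(scores)
    pooled = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(d)]
    return pooled, weights

def average_pool(tokens):
    n = len(tokens)
    return [sum(tok[i] for tok in tokens) / n for i in range(len(tokens[0]))]

tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # toy frozen feature map
query = [4.0, 0.0]                               # hypothetical "learned" query

pooled, weights = attentive_pool(tokens, query)
avg = average_pool(tokens)
print(pooled, avg)
```

Average pooling weights every token equally, whereas the query can concentrate on the tokens that matter for the task, which is one way to read the gap between the two readouts.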

Numbers: motion, appearance, label scarcity

With the recipe in place, three regimes stand out. Each one says something different about what V-JEPA learned.

Motion-heavy tasks. Something-Something-v2 is the benchmark designed to defeat appearance-based shortcuts: its labels are things like "pushing something from left to right" or "pretending to take something out of something," deliberately decoupled from what is in the frame. V-JEPA H/16 reaches 72.2% on SSv2 with a frozen backbone, more than 21 points ahead of the best image self-supervised models (DINOv2, a strong self-supervised image encoder; OpenCLIP, an open re-implementation of CLIP; and I-JEPA), and 5 to 6 points ahead of every pixel-prediction video model they tested at matched capacity. Static image pretraining, no matter how big, cannot reach the same ground.

Appearance-heavy tasks. Kinetics-400 is the opposite. Its labels are usually inferable from a single representative frame (an action like "playing trumpet" is identifiable from a still image of a trumpet), so big image models do well by accident. V-JEPA H/16 reaches 81.9% on K400, ahead of every video self-supervised baseline; DINOv2 ViT-g/14 still pulls ahead at 83.4% with the same attentive probe, but only by spending image data the V-JEPA model never saw. On ImageNet-1K itself, V-JEPA H/16 reaches 77.9% with a two-layer attentive probe (77.4 with a one-layer probe), narrowing the gap to image models trained on internet-scale image collections.

Label-scarce tasks. Drop the attentive probe's training labels from 100% to 5% and the difference becomes stark. V-JEPA H/16 loses 12 points on K400 (80.2 → 68.2) and 13.9 on SSv2 (67.9 → 54.0). VideoMAEv2, a ViT-g/14 pixel-prediction baseline, loses 30 points on K400 and 26 on SSv2 over the same drop. The paper's framing: the probe is being asked to read the same features, and the V-JEPA features carry more useful structure per unit of labeled data.

Figure 5 · label efficiency

Frozen-backbone top-1 against the fraction of labels available to the attentive probe (log x-axis). V-JEPA H/16 stays the highest and falls the least as labels shrink from 100% to 5%; MVD (Masked Video Distillation), VideoMAE, and VideoMAEv2 (all pixel/distillation video baselines) fall progressively further. The drop at 5% labels is the paper's central label-efficiency number. Drag the probe to read every model's top-1 at any fraction.

On cost, V-JEPA H/16 reaches the K400 result above after roughly 90,000 iterations at batch size 2400; the comparable VideoMAEv2 baseline trains for 1,500,000 iterations on the same video budget. The paper measures the wallclock and reports about a 2× speedup over the largest pixel-prediction baselines at matched quality.

What the model has clearly not learned, and the paper says so: anything the videos do not show. V-JEPA does relatively worse on ImageNet than on K400, and the authors flag that the publicly-available video corpus they trained on is much less visually diverse than the image internet. Their explicit ask is more diverse public video data; V-JEPA 2 takes a step in that direction by scaling the corpus by an order of magnitude.

The JEPA family

V-JEPA sits in the middle of a three-paper arc, all from the same Meta group, all reusing the same core recipe.

I-JEPA (Assran et al., 2023) was the image-only version. Same architecture: an $x$-encoder on the visible patches, a narrow predictor that fills in masked positions, an EMA teacher that supplies the target. The masking was 2D blocks in the image plane. The result was that feature prediction matched or beat the best pixel-MAE results on image classification, with significantly less compute. V-JEPA adds the time axis: spatio-temporal patches, spatio-temporal blocks, and a 16-frame clip in place of a single image.

V-JEPA 2 (Assran et al., 2025) scales the V-JEPA recipe up. The encoder grows to a ViT-g/16, the video corpus grows by roughly an order of magnitude, and a second predictor head is added that is conditioned on robot actions; the result is an encoder that not only does well on the same recognition benchmarks but also supports model-based planning in a robot. The base self-supervised objective is unchanged; V-JEPA 2 inherits this paper's loss, masking, and EMA teacher untouched.

What ties the three together is a claim about what self-supervised vision should optimize. MoCo and SimCLR optimize invariance to hand-crafted augmentations. MAE and VideoMAE optimize pixel reconstruction. CLIP optimizes text alignment. JEPA picks prediction in feature space, argues that the EMA teacher solves the collapse problem the BYOL line started solving in 2020, and argues that the recipe scales. V-JEPA is the strongest published evidence so far that the feature-prediction route works on video.

The arc inverts the autoregressive language-modeling story. There, scaling came first (GPT-3) and clever objectives came later; here the JEPA architecture is doing the heavy lifting up front, and scale arrives in V-JEPA 2. It remains an open question across this lineage whether these two families of representation (features for understanding, pixels for synthesis) keep separating, or whether a future model unifies them.

Provenance: verified against primary literature

I-JEPA (2023): The image-only JEPA. V-JEPA extends its masked-feature-prediction recipe to the time axis.

BYOL (2020) / data2vec (2022): Stop-grad on an EMA teacher prevents representation collapse; V-JEPA reuses the analysis with an L1 loss.

MAE (2021) / VideoMAE (2022): Masked autoencoders set the template of masking at high ratios and predicting the hidden patches; V-JEPA swaps the pixel target for a feature target.

ViT / ViViT (2020-21): Spatio-temporal patches and standard ViT blocks parameterize the encoder and the narrow predictor.

V-JEPA 2 (2025): The scaled-up successor; the techniques here became the base recipe.

Caveat: The paper writes "feature prediction without negatives, text, or pretrained encoders" but the H/16 model is reported only after a 384-resolution finetune; the headline 81.9% K400 / 72.2% SSv2 / 77.9% IN1K numbers come from that variant, not the base 224 model.

Questions you might still have

How is the EMA teacher different from the student if they share weights?
They share the architecture, not the values. The student is updated by gradient descent. The teacher Ē is updated by Ē ← τ·Ē + (1−τ)·E with τ ≈ 0.998, so it lags many steps behind. That lag matters: a near-stationary target gives the predictor something to converge against, while gradient flow through the teacher is blocked by the stop-gradient.

What does an "attentive probe" actually do, and why does the paper use one instead of a linear probe?
It is one cross-attention layer with a single learnable query token, followed by a small MLP and a linear classifier. It pools the frozen backbone's feature map non-linearly. The paper uses it because the V-JEPA objective is unnormalized, so there is no a priori reason the backbone's features are linearly separable. This adaptive pooling adds 17 points on K400 and 16 on SSv2 over plain average pooling, and it lifts every baseline too.

How does V-JEPA compare to its sibling on still images, I-JEPA?
Same idea, different axis. I-JEPA (explained at /i-jepa/) predicts the features of masked image regions from other image regions. V-JEPA repeats the recipe with spatio-temporal tubes so the same masked region spans every frame, which forces the encoder to model motion as well as appearance. On Something-Something-v2, where motion is the entire signal, V-JEPA beats every image-only baseline (including I-JEPA) by over 21 points.

What changed between V-JEPA and V-JEPA 2?
V-JEPA 2 (explained at /v-jepa-2/) scales the recipe up (a ViT-g/16, an order of magnitude more video) and adds an action-conditioned predictor for robotic planning. The base self-supervised objective in this paper carries through unchanged; V-JEPA 2 inherits it.

Why an L1 loss and not L2?
The paper found L1 more stable in early training. The theoretical analysis sketched in section 3.1 still works: under L1 the optimal predictor is the conditional median rather than the conditional mean, and the encoder's gradient reduces to the gradient of the conditional median absolute deviation. The collapse argument is unchanged.

Footnotes & further reading

The paper: Bardes, Garrido, Ponce, Chen, Rabbat, LeCun, Assran, Ballas, Revisiting Feature Prediction for Learning Visual Representations from Video (Meta FAIR / Inria, 2024). Code. Blog post.
The image-only predecessor: Assran et al., Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA). Explained at /i-jepa/.
The scaled-up successor: Assran et al., V-JEPA 2. Explained at /v-jepa-2/.
The collapse-prevention argument: Grill et al., Bootstrap Your Own Latent (BYOL) introduced the EMA-teacher recipe; V-JEPA reuses the analysis with an L1 loss.
The contextualized-target idea (the teacher sees the full unmasked clip, the student sees only the visible context): Baevski et al., data2vec.
The pixel-prediction baselines V-JEPA compares against: MAE, VideoMAE, OmniMAE, Hiera, MVD, VideoMAEv2.
The probe used throughout the evaluation: a one-layer cross-attention pool with a learnable query token, following CAE.
Yann LeCun's position paper that names the JEPA family: A Path Towards Autonomous Machine Intelligence (2022).
