
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

Predict what a masked image block looks like in embedding space, not in pixels.

Most vision pretraining either matches embeddings of two augmented views or reconstructs missing pixels. I-JEPA does neither. It predicts the EMBEDDING of a masked target block from one context block, in a space the network learns on its own, with no hand-crafted data augmentations.

Explaining the paper: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. Assran, Duval, Misra, Bojanowski, Vincent, Rabbat, LeCun, Ballas · Meta AI (FAIR) · CVPR 2023 · arXiv:2301.08243

One context block, four masked target blocks, an L2 distance computed entirely in a learned representation space. Train a ViT-Huge/14 on ImageNet in under 72 hours on 16 A100s.

Self-supervised image pretraining has two well-worn recipes, and both pay for what they get. The first one shows the network two augmented crops of the same image and asks for matching embeddings, so the model comes out invariant to whatever your augmentation pipeline jitters: scale, color, crop position. That invariance is exactly what ImageNet classification wants, which is why methods like DINO, iBOT, and MSN top the linear-probing charts (freeze the pretrained encoder, train only a linear classifier on top, report accuracy, a direct read on how semantic the frozen features already are). The cost is a strong, hand-coded prior baked into every representation, one that does not transfer to tasks where colors and scales matter (counting objects, predicting depth), and that does not generalize readily to other modalities where the same augmentation tricks do not exist.

The second recipe goes the other way. Mask a chunk of the image and ask the network to reconstruct the missing pixels (MAE) or their tokenized versions (BEiT). That dispenses with augmentations and ports directly to language and audio, but the representations come out lower-semantic: a pixel-level loss forces the model to spend capacity on textures and edges. Linear-probe accuracy lags well behind the view-invariance camp.

I-JEPA, from Assran and collaborators at FAIR, sits between the two and looks like a compromise that should not work: keep masking, throw away the pixel target, replace it with a target the model computes for itself. Predict the EMBEDDING of the masked region in a representation space the network learns during training. No augmentations. No pixel reconstruction. It is the first concrete instantiation of LeCun's Joint-Embedding Predictive Architecture proposal applied to images, the architecture named in the paper's title, and it later carries over to video as V-JEPA and V-JEPA 2.

Three ideas explain it: a clean separation between three families of self-supervised architectures, a masking strategy that picks targets large enough to be semantic and a context spread enough to be useful, and an asymmetry between the encoder learning from gradients and a slowly-moving teacher that keeps the target representation from collapsing. With those three pieces, a ViT-Huge trains on ImageNet in under 72 hours on 16 A100s and reaches 79.3% top-1 on the standard linear probe, while also winning on Clevr/Count and Clevr/Dist where invariance-based methods stall.

Two camps, both lose something

The paper turns on the choice of where the loss lives, and draws three architectures side by side to make that choice visible. Click between them and watch what changes.

Figure 1 · three architectures, one slot moves

Three energy-based families. Each scores an $(x, y)$ pair with a single scalar “energy” (here the distance $D$), trained low when $x$ and $y$ belong together and high otherwise; what differs is only where in the pipeline that distance is measured. Joint-Embedding compares $s_x$ and $s_y$, the embeddings of two views, and its loss is minimized when they are close. Generative reconstructs $y$ in pixel space from $x$ and a side input $z$. Joint-Embedding Predictive (I-JEPA) predicts the embedding $\hat{s}_y$, not the pixels, using a predictor conditioned on $z$. The distance node $D$ moves up two slots, into the learned representation space.

The distance node $D$ is the thing that moves across the three pictures. In the Joint-Embedding picture, $D$ sits between two embeddings: it measures how similar the model's internal view of $x$ is to its internal view of $y$ . Nothing in the loss prevents the encoder from mapping everything to a single constant; that constant has $D = 0$ and the loss is done. Joint-Embedding methods spend most of their architectural complexity preventing that representation collapse: contrastive negatives, redundancy-reduction regularizers, clustering with entropy floors, asymmetric stop-grads, or as in BYOL/MoCo, a momentum teacher that only moves slowly.
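To make the collapse failure concrete, here is a toy sketch in plain Python. The names `constant_encoder` and `distance` are illustrative and not from the paper; the point is only that an encoder which ignores its input scores a perfect zero on any pair.

```python
def constant_encoder(image):
    """A collapsed encoder: ignores the input and always returns the same vector."""
    return [1.0, 0.0, 0.0]

def distance(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

# Two unrelated toy "images"; the collapsed encoder never looks at them.
x = [[0.1, 0.9], [0.4, 0.2]]
y = [[0.7, 0.3], [0.5, 0.5]]

collapsed_loss = distance(constant_encoder(x), constant_encoder(y))
print(collapsed_loss)  # 0.0
```

The loss is minimized, and the representation carries no information at all, which is why every joint-embedding method needs some mechanism to rule this solution out.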

In the Generative picture, $D$ sits in pixel space. The decoder reconstructs $y$ , with a side input $z$ that specifies where to reconstruct (which patches were masked, at which positions). Collapse is not a worry, since the loss is computed on pixels and a constant decoder cannot match a real image. What you pay for instead is that a pixel-level loss assigns error to low-level details (textures, edges, JPEG noise), and that capacity is spent on signal the downstream task usually does not need.

The Joint-Embedding Predictive picture keeps the predictor-and-side-input shape of the generative architecture but moves $D$ back up into embedding space. So the predictor produces $\hat{s}_y$ , an embedding of the masked region, and the loss measures how close that is to the embedding the target encoder produces from the actual region. Pixels never enter the loss. Two views of the same image never enter it either. The model is told: given the context, predict the embedding of this specific masked block, and we will tell you what the right embedding is by encoding the block ourselves.

Collapse is back as a worry, since both sides of $D$ are now learned embeddings: a model could map everything to a constant and drive the loss to zero. The structural choice that prevents this is not a regularizer at all: the target encoder does not receive gradients. It is a slow exponential moving average of the context encoder, the same trick BYOL and MoCo used. The student chases a teacher that is itself a delayed copy of the student, and that small asymmetry, with no explicit contrastive term, is enough to keep the representation alive.

Predict the embedding, not the pixel

With the architecture chosen, the rest of the method follows in three pieces: the networks, the targets, and the masking.

One image, three networks. A context encoder $f_\theta$ takes the visible patches of one large context block and produces a sequence of patch embeddings $s_x$ . A target encoder $f_{\bar\theta}$ takes the FULL image and produces one patch-level embedding per patch, written $s_y$ . A predictor $g_\phi$ , a narrow ViT, takes $s_x$ plus a set of mask tokens (one per target patch, carrying that patch's positional embedding) and outputs the predicted embeddings $\hat{s}_y(i)$ for the patches inside the $i$ -th target block $B_i$ :

$$s_y = \{s_{y_1}, \dots, s_{y_N}\}, \qquad \hat{\mathbf{s}}_y(i) = \{\hat{s}_{y_j}\}_{j \in B_i}$$

All three networks are Vision Transformers. The encoders are standard ViTs (B/L/H, patch 14 or 16); the predictor is fixed to an embedding width of 384 and a depth of 6 for ViT-B or 12 for ViT-L/H, a deliberate narrow bottleneck the ablations confirm matters (a wider 1024-channel predictor underperforms the narrow 384 one, an inversion the authors flag without explaining). I-JEPA never uses a [cls] token (the special summary token a ViT usually reads off as the whole-image vector); evaluation instead average-pools the patch embeddings.
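A minimal sketch of the two pieces of plumbing described above: how a mask token is assembled for each target patch, and how patch embeddings are average-pooled for evaluation. Everything here is a stand-in under stated assumptions: `pos_emb` and `mask_vec` are random placeholders for learned parameters, `DIM = 8` is a toy width (the paper's predictor uses 384), and the function names are hypothetical.

```python
import random

random.seed(0)
GRID, DIM = 14, 8          # 14x14 patch grid; DIM is a toy width, not the paper's 384
N = GRID * GRID

# Placeholders for learned parameters (assumption: random values, for illustration only).
pos_emb = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]  # one per patch position
mask_vec = [random.gauss(0, 1) for _ in range(DIM)]                      # shared by all mask tokens

def mask_tokens(target_idx):
    """One token per target patch: shared mask vector plus that patch's positional embedding."""
    return [[m + p for m, p in zip(mask_vec, pos_emb[j])] for j in target_idx]

def average_pool(patch_embeddings):
    """Whole-image vector for evaluation: the mean over patch embeddings (no [cls] token)."""
    n = len(patch_embeddings)
    return [sum(col) / n for col in zip(*patch_embeddings)]

tokens = mask_tokens([0, 1, 14, 15])   # a 2x2 target block in the top-left corner
pooled = average_pool(pos_emb)         # pooling any (N, DIM) sequence yields one DIM-vector
print(len(tokens), len(tokens[0]), len(pooled))  # 4 8 8
```

Because the mask vector is shared, the only thing that distinguishes one mask token from another is its positional embedding, so position is the whole of what the predictor is told about each requested patch.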

Targets are computed at the OUTPUT of the target encoder. The order matters. You encode the WHOLE image with $f_{\bar\theta}$ first, getting one embedding per patch, and THEN you select which patches you want as your target blocks. You do not mask the input and then encode. The paper's Table 11 ablates exactly this choice for ViT-H/16 at 300 epochs: masking the input reaches 56.1% top-1, masking the output of the target encoder reaches 67.3%. Encoding the full image lets the target patches benefit from global context through the encoder's self-attention, so what the predictor is trying to match is a representation that already encodes what the rest of the image looks like. Mask first and you cut that global context off at the input layer, and your supervisory signal becomes a less-informed view of the same patch.
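The difference between the two orders shows up even in a toy. The sketch below uses a made-up `toy_encoder` in which every output is its own patch plus the mean of all input patches, a crude stand-in for the global mixing that self-attention provides. It is not a ViT, and the numbers mean nothing beyond showing that the two orders produce different targets.

```python
def toy_encoder(patches):
    """Crude stand-in for self-attention: each output = its patch + the mean of all inputs."""
    mean = sum(patches) / len(patches)
    return [p + mean for p in patches]

image = [1.0, 2.0, 3.0, 10.0]   # four scalar "patches"
target_idx = [2, 3]

# I-JEPA order: encode the full image, then select the target patches.
full = toy_encoder(image)
encode_then_select = [full[j] for j in target_idx]

# Ablated order: select the target patches first, then encode only those.
select_then_encode = toy_encoder([image[j] for j in target_idx])

print(encode_then_select)   # [7.0, 14.0]
print(select_then_encode)   # [9.5, 16.5]
```

In the first order the targets carry information about patches 0 and 1 through the shared mean; in the second they only know about each other.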

The context is one block, with the targets carved out. Sample a single context block at random scale in $(0.85, 1.0)$ with unit aspect ratio. Sample $M = 4$ target blocks at random scale in $(0.15, 0.20)$ and aspect ratio in $(0.75, 1.5)$ . Then remove every patch that belongs to any target block from the context. The resulting context is informative (covers most of the image), spatially distributed (a single contiguous block, not random patches), and disjoint from every target (so the predictor has no access to the answer in its input).

# the I-JEPA mask sampler (the in-paper Python sketch is similar)
def sample_masks(grid=14, M=4):
    # M target blocks: random scale (0.15, 0.20), aspect (0.75, 1.5)
    targets = [sample_block(scale=(0.15, 0.20),
                            aspect=(0.75, 1.5),
                            grid=grid) for _ in range(M)]
    # one context block: scale (0.85, 1.0), unit aspect
    ctx = sample_block(scale=(0.85, 1.0),
                       aspect=(1.0, 1.0),
                       grid=grid)
    # carve targets out of context so they don't overlap
    ctx_patches = patches(ctx) - union(patches(t) for t in targets)
    return ctx_patches, [patches(t) for t in targets]
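For readers who want to run the sampler, here is a self-contained sketch in plain Python. The `sample_block` helper is an assumption about how a scale and aspect ratio turn into a rectangle on the patch grid; the released implementation may differ in details such as rounding, resampling, or per-batch truncation, so treat the block sizes this produces as illustrative rather than as a reproduction of the paper's statistics.

```python
import math
import random

def sample_block(scale, aspect, grid, rng):
    """Sample one rectangular block; return the set of flat patch indices it covers."""
    area = rng.uniform(*scale) * grid * grid      # block area, in patches
    ratio = rng.uniform(*aspect)                  # height / width
    h = min(grid, max(1, round(math.sqrt(area * ratio))))
    w = min(grid, max(1, round(math.sqrt(area / ratio))))
    top = rng.randint(0, grid - h)                # inclusive bounds keep the block on the grid
    left = rng.randint(0, grid - w)
    return {r * grid + c for r in range(top, top + h) for c in range(left, left + w)}

def sample_masks(grid=14, M=4, rng=None):
    if rng is None:
        rng = random.Random()
    # M target blocks: scale (0.15, 0.20), aspect ratio (0.75, 1.5)
    targets = [sample_block((0.15, 0.20), (0.75, 1.5), grid, rng) for _ in range(M)]
    # one context block: scale (0.85, 1.0), unit aspect ratio
    context = sample_block((0.85, 1.0), (1.0, 1.0), grid, rng)
    context -= set().union(*targets)              # carve every target patch out of the context
    return context, targets

ctx, tgts = sample_masks(rng=random.Random(0))
print(len(ctx), [len(t) for t in tgts])
print(all(ctx.isdisjoint(t) for t in tgts))       # True: no target patch leaks into the context
```

Target blocks are allowed to overlap each other here; only the context is forced to be disjoint from them.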

Drag the sliders to see what the sampler does. Resample to see another draw with the same settings, then crank the number of targets up or shrink the context to see why the paper's choices are not arbitrary. Watch the dashed teal box (the context BEFORE carving) collapse onto the solid teal cells (the context AFTER removing the amber targets):

Figure 2 · the multi-block masking sampler (interactive; sliders for target scale, shown at ≤ 0.20, and context scale, shown at ≥ 0.85)

Context in teal, targets in amber, hidden patches dim. The paper's defaults ($M = 4$, target scale $(0.15, 0.20)$, context scale $(0.85, 1.0)$, plus a target carve-out) produce contexts averaging about 25% of the image's patches. Slide the context floor down and the teal disappears; slide the target scale up and the amber blocks bleed into the context.

The defaults give a context block averaging 25% of the image's patches (Table 6, the "Avg. Ratio" column). That is most of the image gone, and yet that 25% is enough for the predictor to recover the embeddings of four 15%-to-20% target blocks. The predictor can recover the embeddings because the four targets are large enough to be semantic (a 20%-scale block on a 224x224 image is roughly a 100x100 region, the size of a face, an animal's head, a wheel), and the context is spatially distributed (one contiguous chunk, not a sprinkling of random pixels), so the predictor receives a unified partial view of the scene rather than a scattering of disconnected patches.
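The pixel sizes quoted for target blocks follow from one line of arithmetic: a square block covering a fraction `scale` of a square image has side `sqrt(scale)` times the image side. A small sketch, assuming square blocks and a 224-pixel image (`block_side_px` is an illustrative helper name):

```python
import math

def block_side_px(scale, image_px=224):
    """Side length in pixels of a square block covering `scale` of a square image."""
    return math.sqrt(scale) * image_px

for scale in (0.075, 0.15, 0.20):
    print(f"scale {scale:.3f} -> about {block_side_px(scale):.0f} px per side")
# scale 0.075 -> about 61 px per side
# scale 0.150 -> about 87 px per side
# scale 0.200 -> about 100 px per side
```

Non-square blocks (aspect ratio between 0.75 and 1.5) trade height for width at the same area, so these are typical sizes rather than exact ones.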

The L2 loss and how it avoids collapse

The loss is one line. Take the squared L2 distance between predicted and true patch embeddings, sum it over the patches in each target block, and average over the $M$ target blocks:

$$\frac{1}{M}\sum_{i=1}^{M} D\big(\hat{\mathbf{s}}_y(i),\,\mathbf{s}_y(i)\big) = \frac{1}{M}\sum_{i=1}^{M} \sum_{j\in B_i} \|\hat{\mathbf{s}}_{y_j} - \mathbf{s}_{y_j}\|_2^2 \tag{1}$$

Three properties stand out. First, every quantity inside the norm is an embedding, never a pixel. The predictor produces $\hat{\mathbf{s}}_{y_j}$ from the context plus position; the target encoder produces $\mathbf{s}_{y_j}$ from the full image. The distance is squared Euclidean in whatever $d$ -dimensional representation space the network has learned. Second, only some networks receive gradients from this loss. The context encoder $f_\theta$ and the predictor $g_\phi$ are trained by gradient descent on the loss above. The target encoder $f_{\bar\theta}$ is not. Third, the gradient signal would be useless if the target encoder collapsed.
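Equation (1) is short enough to write out directly. The sketch below is a plain-Python transcription on toy numbers; the function name `jepa_loss` and the nested-list layout are assumptions made for readability, and a real implementation would operate on tensors.

```python
def jepa_loss(pred, target, target_blocks):
    """Equation (1): squared L2 summed over patches in each block, averaged over blocks.

    pred[i][k]       predicted embedding for the k-th patch of target block i
    target[j]        target-encoder embedding of patch j (from the full-image encoding)
    target_blocks[i] the patch indices B_i of block i
    """
    total = 0.0
    for block_pred, block_idx in zip(pred, target_blocks):
        for s_hat, j in zip(block_pred, block_idx):
            total += sum((a - b) ** 2 for a, b in zip(s_hat, target[j]))
    return total / len(target_blocks)

# Toy numbers: 4 patches with 2-d embeddings, M = 2 target blocks.
target = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
blocks = [[0, 1], [3]]
pred = [[[0.0, 0.0], [1.0, 1.0]],   # block 0: first patch exact, second off by 1 in one dim
        [[0.0, 1.0]]]               # block 1: off by 1 in one dim
loss_value = jepa_loss(pred, target, blocks)
print(loss_value)  # (0 + 1 + 1) / 2 = 1.0
```

Swapping the inner sum for a mean, as an `mse` helper would, only rescales the loss by a constant when blocks have equal size.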

The slow teacher addresses that third property. After every optimizer step, the target encoder updates as an exponential moving average of the context encoder:

$$\bar\theta \leftarrow m\,\bar\theta + (1 - m)\,\theta, \qquad m = 0.996 \to 1.0 \tag{2}$$

with $m$ starting at 0.996 and rising linearly to 1.0 over the course of training. A momentum of 0.996 means the target encoder's effective time constant is about $1/(1-m) \approx 250$ steps, or a half-life around 170 steps; the teacher updates slowly enough that the student is always chasing a slightly-stale version of itself. That asymmetry breaks the collapse equilibrium without any contrastive term. If the student and target were tied, the predictor could trivially output a constant and the encoders could match it, sending the loss to zero with no learning. With the target lagging behind, a constant collapse is unstable: as soon as the student moves toward the constant, the teacher follows but with delay, so the student is forever chasing a moving target it cannot match by going constant.
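The time constant and half-life quoted above can be checked in a few lines, along with the update rule itself. This is a sketch on a flat list of parameters; `momentum_at` and this list-returning `ema_update` are illustrative helpers, not the paper's code, and the student is held fixed purely to isolate the teacher's lag.

```python
import math

def momentum_at(step, total_steps, m_start=0.996, m_end=1.0):
    """Linear momentum schedule from m_start to m_end over training."""
    return m_start + (m_end - m_start) * step / total_steps

def ema_update(teacher, student, m):
    """Equation (2), applied elementwise to flat parameter lists."""
    return [m * t + (1.0 - m) * s for t, s in zip(teacher, student)]

m = 0.996
time_constant = 1.0 / (1.0 - m)              # about 250 steps
half_life = math.log(0.5) / math.log(m)      # about 173 steps
print(round(time_constant), round(half_life))  # 250 173

# A teacher at 0 chasing a student held fixed at 1 closes half the gap in one half-life.
teacher, student = [0.0], [1.0]
for _ in range(round(half_life)):
    teacher = ema_update(teacher, student, m)
print(round(teacher[0], 2))  # 0.5
```

As the schedule pushes $m$ toward 1.0, `1 - m` shrinks and the teacher's lag grows, so the target becomes steadily more stable late in training.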

Drag the momentum slider and watch what happens at the extremes. Near $m = 0.5$ the teacher tracks the student tick-for-tick and the chase collapses. At $m = 1.0$ the teacher never moves, which is a frozen-target ablation the paper does not run but which other JEPA-style methods have shown to stall (the predictor can match the frozen teacher and stop learning anything beyond that). The paper's $m = 0.996$ is slow enough to break collapse and fast enough that the teacher still encodes the latest signal:

Figure 3 · the slowly moving teacher (interactive; momentum slider, default $m = 0.996$)

A simulated weight trajectory. θ (amber) is the context encoder, updated by gradients and noisy on a step-to-step scale. θ̄ (teal) is the EMA target encoder. Near $m = 0.5$ the teal tracks the amber and the asymmetry vanishes. At $m = 0.996$, the paper's default, the teal is a smooth low-pass version of the amber: slow enough that the student is always chasing, fast enough to encode the latest signal.

The loss runs through both $f_\theta$ (via the context) and $g_\phi$ , so gradient descent updates both at once. The target encoder gets no gradient signal; it is updated only by (2). At initialization, the two encoders share weights, so the predictor starts chasing a target it could plausibly match, and they diverge through training as the student is shaped by gradients while the teacher integrates them with delay.

# one training step (the predictor and context encoder learn;
# the target encoder updates as an EMA of the context encoder)
ctx_idx, tgt_idxs = sample_masks(grid=14, M=4)         # patch indices
s_x   = context_encoder(image[ctx_idx])                # encode visible
s_y_T = target_encoder_no_grad(image)                  # encode FULL image

loss = 0
for tgt in tgt_idxs:
    mask_tokens = posemb[tgt] + shared_mask_vector     # one per target patch
    s_y_hat = predictor(s_x, mask_tokens)              # predict that block
    loss   += mse(s_y_hat, s_y_T[tgt].detach())        # L2 in embedding space
loss /= len(tgt_idxs)
loss.backward()                                        # gradients reach θ and φ only
optimizer.step(); optimizer.zero_grad()                # update ctx encoder + predictor
ema_update(target_encoder, context_encoder, m=0.996)   # θ̄ <- m·θ̄ + (1-m)·θ

Why predicting in pixels is worse

The central ablation is Table 7. Train two ViT-L/16 models with the same masking strategy, the same architecture, the same optimizer, but switch the target from the output of the target encoder (embeddings) to the raw pixels of the target block (the MAE-style choice). Evaluate both with a linear probe on 1% of ImageNet.

Figure 4 · representation target vs pixel target

Toggle the target. With a representation target (top), the predictor competes with a learned target encoder; with a pixel target (bottom), the target encoder is bypassed and the loss compares the predictor's output directly to raw pixels. Same architecture (ViT-L/16) and same masking strategy; the representation target reaches 66.9% top-1 linear-probe, the pixel target 40.7%, despite the pixel-target run getting 300 more epochs.

The representation target reaches 66.9% top-1 after 500 epochs; the pixel target reaches 40.7% after 800. That is a 26-point gap with the pixel-target run getting 60% more training. The interpretation the paper offers, and which matches what other masked-image-modelling work has found, is that pixel-level losses spend the model's capacity on signal a downstream classifier cannot use: JPEG compression artefacts, sensor noise, exact intensity values that a robust feature should be invariant to. A raw-pixel L2 has no mechanism for treating "a slightly different shade of brown" as unimportant, so the model spends capacity predicting it anyway. With a representation target, the L2 is computed AFTER the target encoder has already abstracted those details away, so the predictor only spends capacity on what survives the encoder.

The point is that the L2 distance is the same simple object as in MAE, but because both arguments first pass through a learned encoder, the loss is computed in whatever representation space that encoder has carved out. As the encoder gets better at distinguishing semantic features, the loss is computed in increasingly semantic terms. The pixel-target model never gets that benefit; its loss is anchored to the raw signal forever.

The masking strategy ablation makes a related claim by varying what the predictor has to predict. Table 6 compares the multi-block strategy to three other masking schemes that span the same fraction of the image. Each strategy gives the model 25-40% of the image as context and asks it to predict the rest, differing only in how the predicted region is shaped. The bars below are Table 6 verbatim:

Figure 5 · masking strategy ablation

Click a strategy to see its mask shape and where it lands on 1% ImageNet linear probe (ViT-B/16, 300 epochs). Multi-block predicts four small SEMANTIC blocks from a large complement; block and random predict one large region from a smaller complement; rasterized predicts three quadrants from one. The 54.2% multi-block result is more than 30 points above any alternative.

Two findings live in this table. One: predicting a few medium-sized SEMANTIC blocks (the 15%-20% scale) beats predicting one large block or scattered random patches by a wide margin. Small targets give the predictor too little to ground its guess; very large targets demand reconstructing structure absent from the context. Two: the context shape matters as much as the target shape. The rasterized strategy gives the predictor exactly one quadrant of the image and asks it to predict the other three. That sounds like a natural task and it lands at 15.5%, the worst result in the table. A single quadrant is too narrow a view of the scene; you can read the same point off the context-scale ablation (Table 9): shrink the context scale from 0.85 down to 0.40 and accuracy drops from 54.2% to 31.2%. Context that is both INFORMATIVE (covers most of the scene) and DISTRIBUTED (a single coherent block rather than a quadrant or random patches) lets the predictor recover the targets.

The multi-block masking strategy

The ablation table treats multi-block as one strategy, but the design underneath is four separate knobs, and the appendix walks through each. The target scale (Table 8) sweeps from $(0.075, 0.2)$ up to $(0.2, 0.3)$, peaking sharply at $(0.15, 0.2)$ with 54.2% top-1. Below that range the targets are too small to carry semantic content (a 7%-scale block is roughly a 60x60-pixel region: fine texture, no object). Above it the targets become large enough that the predictor has to extrapolate structure rather than recover it; the model has to predict an unfamiliar half of the image.

The context scale (Table 9) sweeps the other way. At a context floor of 0.40 the model sees less than half the image and accuracy falls to 31.2%. At the paper's 0.85 the context is dense and the predictor has a connected partial view to ground its guess. The middle of the sweep is mild, though: 0.65 already recovers 47.1%, 0.75 reaches 49.3%. The function is monotone but not steep, which says the EXACT size matters less than the shape (one contiguous block) and the disjointness (no overlap with the targets the predictor has to guess).

The number of target blocks (Table 10) is the clearest curve. One target block: 9.0%. Two: 22.0%. Three: 48.5%. Four: 54.2%. The interpretation the paper offers is that each target block is one training signal per image; more targets means more gradient signal per forward pass at almost no extra cost (the predictor is small and the target encoder runs once per image regardless). Past four the paper does not push further, probably because the targets stop being disjoint enough to provide independent signal.

The fourth knob is harder to put on a slider: the predictor depth (Table 12). A 6-layer predictor on ViT-L/16 lands at 64.0% top-1 on 1% ImageNet; a 12-layer predictor reaches 66.9%. The predictor is doing real work, not just averaging. Working from a partial encoded view of the rest of the image, it produces the embedding of a region missing from its input, and depth helps it. The width does not. A 384-channel predictor beats a 1024-channel one (the encoder's own width), 70.7% vs 68.4% on ImageNet-1%, which the authors flag as a width bottleneck that helps. A narrow predictor may force the encoder to do more of the work of representing the visible context, which the downstream evaluation actually probes.

Linear probing, transfer, counting, depth

The results cover linear probing, low-shot ImageNet, transfer, and low-level tasks.

ImageNet linear probe. Freeze the encoder, train a linear head, report top-1. ViT-H/14 at 224x224 reaches 79.3%; ViT-H/16 at 448x448 reaches 81.1%. The closest prior method that also avoids hand-crafted view augmentations is data2vec (which likewise predicts masked-region representations against an EMA teacher) at 77.3% on ViT-L/16, so I-JEPA gains about 2-4 points without changing what the model is allowed to see. The view-invariance methods still lead this benchmark at comparable scale (iBOT on ViT-L/16: 81.0%), but the gap is narrow and the larger I-JEPA H/16-448 matches iBOT despite using no augmentations. For a benchmark designed to reward augmentation-induced invariance, that is the closest an augmentation-free method has come.
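The linear-probe protocol itself is simple enough to sketch end to end. Below, the "frozen encoder outputs" are hand-written 2-d patch embeddings (an assumption, standing in for real ViT features), pooled by averaging as I-JEPA does, and a binary logistic-regression head is the only thing trained. The helper names are illustrative.

```python
import math

def average_pool(patch_embeddings):
    """Mean over patch embeddings: the whole-image feature fed to the probe."""
    n = len(patch_embeddings)
    return [sum(col) / n for col in zip(*patch_embeddings)]

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Binary logistic regression on frozen features; only w and b are trained."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
            b -= lr * (p - y)
    return w, b

def predict(w, b, x):
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)

# Hypothetical frozen-encoder outputs: each "image" is a list of 2-d patch embeddings.
frozen_outputs = [
    [[1.0, 0.2], [0.8, 0.0]],      # class 1
    [[0.9, -0.1], [1.1, 0.3]],     # class 1
    [[-1.0, 0.1], [-0.8, 0.3]],    # class 0
    [[-1.2, 0.0], [-0.9, -0.2]],   # class 0
]
labels = [1, 1, 0, 0]

features = [average_pool(img) for img in frozen_outputs]   # the encoder is never updated
w, b = train_linear_probe(features, labels)
accuracy = sum(predict(w, b, x) == y for x, y in zip(features, labels)) / len(labels)
print(accuracy)  # 1.0
```

Because the head is linear, it can only succeed when the frozen features already separate the classes, which is what makes the probe a read on the pretrained representation rather than on the classifier.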

Low-shot ImageNet. Train on 1% of the labels (about 12-13 images per class) by fine-tuning or linear-probing whichever works better per method. I-JEPA at ViT-H/14 reaches 73.3%; the H/16-448 reaches 77.3%. Both beat MAE on every comparable architecture (MAE H/14 at 1600 epochs: 71.5%) and the H/16-448 even passes the augmentation-using DINO at ViT-B/8 (70.0%) and BYOL at RN200x2 (71.2%). Low-shot shows the pretraining quality; you cannot fine-tune your way out of a bad starting point with 12 examples per class.

Transfer to other classification tasks. CIFAR100, Places205, iNaturalist18. I-JEPA at ViT-H/14 lands at 87.5 / 58.4 / 47.6, all significantly above MAE and data2vec on the same backbone, and on CIFAR100 and Places205 it surpasses DINO at ViT-B/8 despite DINO using augmentations. iNaturalist is the one place augmentation-based methods still win outright (iBOT at ViT-L/16: 57.3%); fine-grained species classification benefits from the color and scale jitter that augmentations train in, and I-JEPA does not learn that prior on its own.

Low-level tasks: counting and depth. Clevr/Count (count the objects in a scene) and Clevr/Dist (estimate distances) are exactly the tasks where invariance hurts: if your representation is invariant to scale, you cannot count, and if it is invariant to color, you cannot estimate depth from texture gradients. I-JEPA at ViT-H/14 reaches 86.7 / 72.4, beating DINO (86.6 / 53.4) and iBOT (85.7 / 62.8) by 19 and 10 points respectively on the distance task. The "less inductive bias, broader applicability" claim pays off here: the same representation that nearly matches augmentation methods on classification beats them by double digits on tasks they were never designed for.

A ViT-H/14 in under 1200 GPU-hours

The other result the paper makes a centerpiece of, and the one that drove most of the discussion at the time, is efficiency. Pretraining a ViT-H/14 with I-JEPA takes under 1200 GPU-hours (about 72 hours on 16 A100s); the same architecture with MAE needs over 10x that. The savings come from three places.

First, no augmentation pipeline. View-invariance methods like DINO and iBOT process two or more crops of every image per step (DINO uses a multi-crop schedule with 8 local crops plus 2 global crops); I-JEPA processes one. That is roughly a 5x reduction in per-step encoder work before any other savings.

Second, the context encoder only processes the VISIBLE patches, not the full image. The masking strategy hides about 75% of the image from the context encoder (the average context covers 25% once the targets are carved out), so the context-encoder forward pass is roughly 4x cheaper than a full ViT forward at the same architecture. The target encoder still runs on the full image, but only once, and with no gradients, so it costs one forward pass and no backward pass.

Third, faster convergence. The paper reports that I-JEPA converges in roughly 5x fewer iterations than MAE for the same downstream accuracy, even though each I-JEPA iteration is about 7% slower than an MAE iteration (the cost of running the target encoder to compute embeddings). The product, 5x fewer iterations at about 1.07x the cost each, is still a net saving of roughly 4.7x from convergence alone.
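The efficiency figures in this section are back-of-envelope arithmetic, reproduced here so the units stay straight. All inputs are the numbers quoted in the text; the variable names are illustrative, and the patch-count ratio ignores the quadratic cost of attention.

```python
gpus, wall_clock_hours = 16, 72
gpu_hours = gpus * wall_clock_hours
print(gpu_hours)  # 1152, consistent with "under 1200 GPU-hours"

context_fraction = 0.25                      # average share of patches the context encoder sees
patch_reduction = 1 / context_fraction
print(patch_reduction)  # 4.0

iteration_ratio = 5.0                        # roughly 5x fewer iterations than MAE
per_iteration_cost = 1.07                    # each iteration about 7% slower
net_speedup = iteration_ratio / per_iteration_cost
print(round(net_speedup, 2))  # 4.67
```

Note the distinction the first line makes explicit: 72 is wall-clock hours on 16 GPUs, while the GPU-hour total is 16 times larger.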

The headline numbers are direct. A ViT-Huge with I-JEPA undercuts a ViT-Small with iBOT on wall-clock GPU hours, while reaching higher downstream accuracy. That is a swap of "huge model trained efficiently" for "small model trained with augmentations," and the efficient huge model wins on most evaluations. Whether that finding holds at the next scale up is a question the paper raises and leaves open. The successor work, V-JEPA on video and V-JEPA 2 on action-conditioned planning, takes the same recipe to larger, more structured inputs and keeps the speedups intact, suggesting the answer is yes.

The design reduces to three moves. Take a joint-embedding loss between two embeddings that should match, replace one side's embedding with a predictor conditioned on position, and use an EMA teacher to keep the targets from collapsing. What comes out is a method that learns semantic features from a single image view, with no hand-coded augmentation pipeline, in a fraction of the compute the augmentation-using and pixel-reconstructing camps spent. That the same recipe later carries to video in V-JEPA and to action-conditioned planning in V-JEPA 2 is a stronger sign than any one ImageNet number.

Provenance: verified against primary literature

LeCun JEPA proposal (2022): The general "joint-embedding predictive" architecture template I-JEPA instantiates.

MAE (He et al., 2021): Masked image modelling with an encoder that only sees visible patches; the recipe I-JEPA reuses, swapping pixel targets for embedding targets.

data2vec (Baevski et al., 2022): The closest prior method: predict the representation of a masked region using a momentum-EMA teacher.

MSN, DINO, iBOT (2021–2022): The view-invariance line I-JEPA competes against without ever using augmentations.

BYOL / MoCo momentum (2020): The EMA target encoder that breaks the collapse equilibrium in joint-embedding setups.

V-JEPA family: V-JEPA extends this recipe to video; V-JEPA 2 scales the video version and adds action conditioning for planning.

Correction: The paper writes (015, 0.2) where it means (0.15, 0.2): a typo in the appendix. The figures and main-paper table use 0.15 to 0.20. (The (0.075, 0.2) entry in the Table 8 target-scale sweep is a real ablation row, not a typo.)

Questions you might still have

Why use a slowly moving teacher rather than a frozen one or one tied to the student?
A teacher tied to the student lets both sides agree on a constant output and call the loss done. A frozen teacher cannot collapse that way, but its targets never improve, so the student can match them and stop learning anything beyond that. A teacher that moves slowly with the student (the EMA at momentum 0.996) gives a target that lags behind yet still encodes the latest signal, so the student can never settle into a single constant output.

What is z in the JEPA diagram, concretely, for I-JEPA?
Position. For each masked target block the predictor receives one mask token per target patch, and each mask token carries the positional embedding of where in the image that patch sits. So z is "predict the embedding at THIS location," and the predictor has to produce a different output for each requested location.

Why do view-augmentation methods still win on some ImageNet benchmarks?
Augmentations encode strong prior knowledge: two crops of the same image are the same object, scale is irrelevant, color is irrelevant. That prior is exactly the invariance ImageNet classification rewards. I-JEPA never sees it. On low-level tasks the augmentation prior backfires (Clevr/Count, Clevr/Dist), and I-JEPA leads. I-JEPA trades a narrower prior for broader transfer across tasks.

How does I-JEPA fit into the JEPA family?
I-JEPA is the original image-only instantiation. V-JEPA extends the same predict-the-embedding recipe to video by sampling spatio-temporal target tubes; V-JEPA 2 scales the video pretraining and adds action conditioning so the world model can plan robot manipulation. The masking strategy and the EMA target keep working as the input gets bigger and more structured.

Footnotes & further reading

The paper: Assran, Duval, Misra, Bojanowski, Vincent, Rabbat, LeCun, Ballas, Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (FAIR, CVPR 2023). Code.
The JEPA framework as Yann LeCun proposed it: A Path Towards Autonomous Machine Intelligence (v0.9.2, 2022).
The closest prior method, predicting representations of masked regions with a momentum-EMA teacher, across modalities: Baevski et al., data2vec (2022).
The pixel-reconstruction line I-JEPA argues against on semantics: MAE (He et al.) and BEiT (Bao et al.).
The EMA-teacher trick I-JEPA borrows from the view-invariance literature: BYOL (Grill et al.) and MoCo v2 (Chen et al.).
The two follow-up JEPA papers in this series: V-JEPA (Bardes et al. 2024, arXiv:2404.08471) lifts the recipe to video, and V-JEPA 2 (Assran et al. 2025, arXiv:2506.09985) scales it and adds action conditioning for robot planning.
The RCDM visualization framework used to decode I-JEPA representations back to pixels in Figures 6-8: Bordes, Balestriero, Vincent, High Fidelity Visualization of What Your Self-Supervised Representation Knows About (TMLR 2022).
