Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
Predict what a masked image block looks like in embedding space, not in pixels.
Most vision pretraining either matches embeddings of two augmented views or reconstructs missing pixels. I-JEPA does neither. It predicts the EMBEDDING of a masked target block from one context block, in a space the network learns on its own, with no hand-crafted data augmentations.
One context block, four masked target blocks, an L2 distance computed entirely in a learned representation space. Train a ViT-Huge/14 on ImageNet in under 72 hours on 16 A100s.
Self-supervised image pretraining has two well-worn recipes, and both pay for what they get. The first one shows the network two augmented crops of the same image and asks for matching embeddings, so the model comes out invariant to whatever your augmentation pipeline jitters: scale, color, crop position. That invariance is exactly what ImageNet classification wants, which is why methods like DINO, iBOT, and MSN top the linear-probing charts. The cost is a strong, hand-coded prior baked into every representation, one that does not transfer to tasks where colors and scales matter (counting objects, predicting depth), and that does not generalize cleanly to other modalities where the same augmentation tricks do not exist.
The second recipe goes the other way. Mask a chunk of the image and ask the network to reconstruct the missing pixels (MAE) or their tokenized versions (BEiT). That dispenses with augmentations and ports cleanly to language and audio, but the representations come out lower-semantic: a pixel-level loss forces the model to spend capacity on textures and edges. Linear-probe accuracy lags well behind the view-invariance camp.
I-JEPA, from Assran and collaborators at FAIR, sits between the two and looks like a compromise that should not work: keep masking, throw away the pixel target, replace it with a target the model computes for itself. Predict the EMBEDDING of the masked region in a representation space the network learns during training. No augmentations. No pixel reconstruction. It is the first concrete instantiation of LeCun's Joint-Embedding Predictive Architecture proposal applied to images, the architecture the paper's name advertises, and it later carries over to video as V-JEPA and V-JEPA 2.
Three ideas explain it: a clean separation between three families of self-supervised architectures, a masking strategy that picks targets large enough to be semantic and a context spread enough to be useful, and an asymmetry between the encoder learning from gradients and a slowly-moving teacher that keeps the target representation from collapsing. With those three pieces, a ViT-H/14 trains on ImageNet in under 72 hours on 16 A100s and reaches 79.3% top-1 on the standard linear probe (81.1% for the ViT-H/16 at 448x448), while also leading on Clevr/Count and Clevr/Dist, where invariance-based methods stall.
Two camps, both lose something
The paper turns on the choice of where the loss lives, and draws three architectures side by side to make that choice visible.
Read this triptych with your eye on the distance node $D$. In the Joint-Embedding picture, $D$ sits between two embeddings: it asks how similar the model's internal view of $x$ is to its internal view of $y$. The trouble is that nothing in the loss forbids the encoder from mapping everything to a single constant; that constant has $D = 0$ and the loss is done. Joint-Embedding methods spend most of their architectural complexity preventing that representation collapse: contrastive negatives, redundancy-reduction regularizers, clustering with entropy floors, asymmetric stop-grads, or, as in BYOL and MoCo, a momentum teacher that moves only slowly. The collapse risk drives most of the design.
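A toy illustration of the collapse problem, under heavy simplification: the two "encoders" below are hypothetical stand-in functions, not real networks, and the distance is a plain squared L2. It only shows that a constant encoder drives the joint-embedding distance to zero for every pair of inputs while carrying no information about them.

```python
def sq_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def informative_encoder(x):
    # Hypothetical stand-in for a real encoder: the output depends on the input.
    return [2.0 * v for v in x]

def collapsed_encoder(x):
    # Degenerate solution: the same embedding for every input.
    return [1.0, 1.0, 1.0]

x = [0.1, 0.5, 0.9]            # one "view"
y = [0.2, 0.4, 1.0]            # a compatible "view" of the same image
unrelated = [5.0, -3.0, 7.0]   # an input that has nothing to do with x

d_informative = sq_l2(informative_encoder(x), informative_encoder(y))
d_collapsed = sq_l2(collapsed_encoder(x), collapsed_encoder(y))
d_collapsed_unrelated = sq_l2(collapsed_encoder(x), collapsed_encoder(unrelated))

print(d_informative)           # small but nonzero
print(d_collapsed)             # 0.0: the loss is "solved"
print(d_collapsed_unrelated)   # 0.0 as well: unrelated inputs look identical
```

The collapsed encoder scores a perfect loss on both pairs, which is why a bare distance between two learned embeddings needs some extra mechanism to stay useful.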
In the Generative picture, $D$ sits in pixel space. The decoder reconstructs $\hat{y}$, with a side input $z$ that says where to reconstruct (which patches were masked, at which positions). Collapse is not a worry, since the loss cares about pixels and a constant decoder cannot match a real image. What you pay for instead is that the loss rewards getting low-level details right (textures, edges, JPEG noise), and that capacity is spent on signal the downstream task usually does not need.
The Joint-Embedding Predictive picture keeps the predictor-and-side-input shape of the generative architecture but moves $D$ back up into embedding space. So the predictor produces $\hat{s}_y$, an embedding of the masked region, and the loss measures how close that is to the embedding the target encoder produces from the actual region. Pixels never enter the loss. Two views of the same image never enter it either. The model is told: given the context, predict the embedding of this specific masked block, and we will tell you what the right embedding is by encoding the block ourselves.
Collapse is back as a worry, since both sides of $D$ are now learned embeddings: a model could map everything to a constant and call it a day. The structural choice that prevents this is not a regularizer at all: the target encoder does not receive gradients. It is a slow exponential moving average of the context encoder, the same trick BYOL and MoCo used. The student chases a teacher that is itself a delayed copy of the student, and that small asymmetry, with no explicit contrastive term, is enough to keep the representation alive.
Predict the embedding, not the pixel
With the architecture chosen, the rest of the method follows in three pieces: the networks, the targets, and the masking.
One image, three networks. A context encoder $f_\theta$ takes the visible patches of one large context block and produces a sequence of patch embeddings $s_x$. A target encoder $f_{\bar\theta}$ takes the FULL image and produces one patch-level embedding per patch, written $s_y = \{s_{y_1}, \dots, s_{y_N}\}$. A predictor $g_\phi$, a narrow ViT, takes $s_x$ plus a set of mask tokens $\{m_j\}_{j \in B_i}$ (one per target patch, carrying that patch's positional embedding) and outputs the predicted embeddings for the patches inside the $i$-th target block $B_i$:
$$\hat{s}_y(i) = \{\hat{s}_{y_j}\}_{j \in B_i} = g_\phi\big(s_x, \{m_j\}_{j \in B_i}\big)$$
All three networks are Vision Transformers. The encoders are standard ViTs (B/L/H, patch 14 or 16); the predictor is fixed to an embedding width of 384 and a depth of 6 for ViT-B or 12 for ViT-L/H, a deliberate narrow bottleneck the ablations confirm matters (a wider 1024-channel predictor underperforms the narrow 384 one, an inversion the authors flag without explaining). I-JEPA never uses a [cls] token; evaluation average-pools the patch embeddings instead.
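A minimal sketch of how the predictor's mask tokens might be assembled, assuming the description above: one vector shared by all mask tokens, plus the positional embedding of each target patch. The table `pos_emb`, the vector `shared_mask_vector`, the helper `make_mask_tokens`, and the toy width of 8 are all hypothetical names and sizes chosen for illustration; in a real model both the table and the shared vector would be learned or fixed by the architecture.

```python
import random

random.seed(0)
GRID = 14        # 14 x 14 patches, e.g. a 224-pixel image with patch size 16
PRED_DIM = 8     # toy width; the predictor described above uses 384

# Hypothetical positional-embedding table: one vector per patch position.
pos_emb = [[random.gauss(0.0, 1.0) for _ in range(PRED_DIM)]
           for _ in range(GRID * GRID)]
# A single vector shared by every mask token.
shared_mask_vector = [random.gauss(0.0, 1.0) for _ in range(PRED_DIM)]

def make_mask_tokens(target_patch_indices):
    """One token per target patch: shared vector + that patch's position."""
    return [[m + p for m, p in zip(shared_mask_vector, pos_emb[j])]
            for j in target_patch_indices]

target_block = [30, 31, 44, 45]      # flat indices of a 2 x 2 block
tokens = make_mask_tokens(target_block)

print(len(tokens), len(tokens[0]))   # 4 tokens, each PRED_DIM wide
```

Because the only thing that differs between two mask tokens is the positional term, position is the sole signal telling the predictor which patch it is being asked about.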
Targets are computed at the OUTPUT of the target encoder. The order matters. You encode the WHOLE image with the target encoder $f_{\bar\theta}$ first, getting one embedding per patch, and THEN you select which patches you want as your target blocks. You do not mask the input and then encode. The paper's Table 11 ablates exactly this choice for ViT-H/16 at 300 epochs: masking the input reaches 56.1% top-1, masking the output of the target encoder reaches 67.3%. Encoding the full image lets the target patches benefit from global context through the encoder's self-attention, so what the predictor is trying to match is a representation that already knows what the rest of the image looks like. Mask first and you cut that global context off at the input layer, and your supervisory signal becomes a less-informed view of the same patch.
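The difference between the two orders can be seen in a deliberately crude sketch. The `toy_encoder` below is a hypothetical stand-in for self-attention (it just adds the mean of whatever patches it is given to each of them) and the "patches" are single numbers; nothing here is the paper's encoder.

```python
def toy_encoder(patches):
    """Crude stand-in for self-attention: every output mixes in the mean
    of all patches the encoder was given."""
    context = sum(patches) / len(patches)
    return [p + context for p in patches]

image = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]   # six scalar "patches"
target_idx = [1, 2]                          # patches chosen as the target block

# I-JEPA order: encode the full image, then select the target patches.
encoded_full = toy_encoder(image)
encode_then_select = [encoded_full[j] for j in target_idx]

# Ablated order: select (mask) first, then encode only those patches.
mask_then_encode = toy_encoder([image[j] for j in target_idx])

print(encode_then_select)   # [13.0, 14.0]: targets reflect the whole image
print(mask_then_encode)     # [4.5, 5.5]: targets only know about each other
```

The same two patches end up with different target values depending on the order, because only the first order lets the rest of the image influence them.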
The context is one block, with the targets carved out. Sample a single context block at random scale in $(0.85, 1.0)$ with unit aspect ratio. Sample $M = 4$ target blocks at random scale in $(0.15, 0.2)$ and aspect ratio in $(0.75, 1.5)$. Then remove every patch that belongs to any target block from the context. The resulting context is informative (drawn from a block spanning most of the image), spatially distributed (a single contiguous block, not random patches), and disjoint from every target (so the predictor cannot just copy the answer from its input). One image, one context, four targets, no overlap.
# the I-JEPA mask sampler (the in-paper Python sketch is similar)
# sample_block and patches are helpers not shown here;
# patches(block) is assumed to return a set of patch indices
def sample_masks(grid=14, M=4):
    # M target blocks: random scale (0.15, 0.20), aspect (0.75, 1.5)
    targets = [sample_block(scale=(0.15, 0.20),
                            aspect=(0.75, 1.5),
                            grid=grid) for _ in range(M)]
    # one context block: scale (0.85, 1.0), unit aspect
    ctx = sample_block(scale=(0.85, 1.0),
                       aspect=(1.0, 1.0),
                       grid=grid)
    # carve targets out of context so they don't overlap
    target_patches = set().union(*(patches(t) for t in targets))
    ctx_patches = patches(ctx) - target_patches
    return ctx_patches, [patches(t) for t in targets]

[Interactive figure: the mask sampler. The dashed teal box is the context block BEFORE carving; the solid teal cells are the context AFTER the amber target blocks are removed.]
The defaults give a context block averaging 25% of the image's patches (Table 6, the "Avg. Ratio" column). That is most of the image gone, and yet that 25% is enough for the predictor to recover the embeddings of four 15%-to-20% target blocks. The reason the predictor can pull this off is that the four targets are large enough to be semantic (a 20%-scale block on a 224x224 image is roughly a 100x100 region, the size of a face, an animal's head, a wheel), and the context is spatially distributed (one contiguous chunk, not a sprinkling of random pixels), so the predictor receives a coherent partial view of the scene rather than a Swiss-cheese pattern.
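For readers who want to run the sampler, here is a self-contained sketch under stated assumptions. The helper `sample_block`, its rounding and uniform-placement rules, and the seeded `rng` argument are illustrative choices of mine, not the paper's implementation, so the printed average context ratio need not match the 25% reported in Table 6.

```python
import math
import random

def sample_block(scale, aspect, grid, rng):
    """Sample one rectangular block; return the set of flat patch indices it covers."""
    area = rng.uniform(*scale) * grid * grid
    ratio = rng.uniform(*aspect)
    h = min(grid, max(1, round(math.sqrt(area * ratio))))
    w = min(grid, max(1, round(math.sqrt(area / ratio))))
    top = rng.randint(0, grid - h)
    left = rng.randint(0, grid - w)
    return {r * grid + c
            for r in range(top, top + h)
            for c in range(left, left + w)}

def sample_masks(grid=14, M=4, rng=None):
    rng = rng or random.Random()
    targets = [sample_block((0.15, 0.20), (0.75, 1.5), grid, rng) for _ in range(M)]
    context = sample_block((0.85, 1.0), (1.0, 1.0), grid, rng)
    # Carve every target patch out of the context so the two never overlap.
    context -= set().union(*targets)
    return context, targets

rng = random.Random(0)
ratios = []
for _ in range(1000):
    ctx, tgts = sample_masks(rng=rng)
    assert all(ctx.isdisjoint(t) for t in tgts)
    ratios.append(len(ctx) / (14 * 14))
print(f"mean context ratio after carving: {sum(ratios) / len(ratios):.2f}")
```

The disjointness check inside the loop is the property that matters for training: no target patch is ever visible to the context encoder.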
The L2 loss and how it avoids collapse
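The loss is one line. Average, over the $M$ target blocks, the squared L2 distance between predicted and true patch embeddings, summed over the patches in each block:
$$\frac{1}{M}\sum_{i=1}^{M} D\big(\hat{s}_y(i), s_y(i)\big) = \frac{1}{M}\sum_{i=1}^{M}\sum_{j \in B_i} \lVert \hat{s}_{y_j} - s_{y_j} \rVert_2^2 \tag{1}$$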
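As a concrete reading of that sentence, here is a small sketch of the loss on plain Python lists. The function names `block_loss` and `ijepa_loss` and the toy sizes are my own; the target embeddings are treated as fixed numbers, which mirrors the fact that no gradient flows into them.

```python
def block_loss(pred_block, target_block):
    """Sum over patches of the squared L2 distance between embeddings."""
    return sum(
        sum((p - t) ** 2 for p, t in zip(pred_vec, tgt_vec))
        for pred_vec, tgt_vec in zip(pred_block, target_block)
    )

def ijepa_loss(pred_blocks, target_blocks):
    """Average the per-block loss over the M target blocks."""
    assert len(pred_blocks) == len(target_blocks)
    total = sum(block_loss(p, t) for p, t in zip(pred_blocks, target_blocks))
    return total / len(pred_blocks)

# Two target blocks, two patches each, 2-dimensional embeddings (toy sizes).
predicted = [[[1.0, 0.0], [0.0, 1.0]],
             [[2.0, 2.0], [0.0, 0.0]]]
targets = [[[1.0, 1.0], [0.0, 1.0]],
           [[2.0, 0.0], [0.0, 2.0]]]

print(ijepa_loss(predicted, targets))   # (1 + 8) / 2 = 4.5
```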
Three things to notice. First, every quantity inside the norm is an embedding, never a pixel. The predictor produces $\hat{s}_{y_j}$ from the context plus position; the target encoder produces $s_{y_j}$ from the full image. The distance is squared Euclidean in whatever-dimensional representation space the network has learned. Second, only some networks receive gradients from this loss. The context encoder and the predictor are trained by gradient descent on the loss above. The target encoder is not. Third, the gradient signal would be useless if the target encoder collapsed.
That last clause is what the slow teacher fixes. After every optimizer step, the target encoder updates as an exponential moving average of the context encoder:
$$\bar\theta \leftarrow m\,\bar\theta + (1 - m)\,\theta \tag{2}$$
with $m$ starting at 0.996 and rising linearly to 1.0 over the course of training. A momentum of 0.996 means the target encoder's effective time constant is about $1/(1-m) = 250$ steps, or a half-life around 170 steps; the teacher updates slowly enough that the student is always chasing a slightly-stale version of itself. That asymmetry is what breaks the collapse equilibrium without any contrastive term. If the student and target were tied, the predictor could trivially output a constant and the encoders could match it, sending the loss to zero with no learning. With the target lagging behind, a constant collapse is unstable: as soon as the student moves toward the constant, the teacher follows but with delay, so the student is forever chasing a moving target it cannot match by going constant. The update rule does the work a contrastive term otherwise would.
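The time-constant arithmetic and the lag it implies can be checked in a few lines. This is a sketch: the function names are mine, the "parameters" are single-element lists, and the linear schedule is written from the description above rather than taken from any released code.

```python
import math

def time_constant(m):
    """Number of steps over which an EMA with momentum m effectively averages."""
    return 1.0 / (1.0 - m)

def half_life(m):
    """Steps until an old teacher value's weight decays to one half."""
    return math.log(0.5) / math.log(m)

def momentum_at(step, total_steps, m_start=0.996, m_end=1.0):
    """Linear schedule from m_start to m_end across training."""
    return m_start + (m_end - m_start) * step / total_steps

def ema_update(teacher, student, m):
    """teacher <- m * teacher + (1 - m) * student, element by element."""
    return [m * t + (1.0 - m) * s for t, s in zip(teacher, student)]

print(round(time_constant(0.996)))   # 250
print(round(half_life(0.996)))       # 173

# The teacher lags: after the student jumps from 0 to 1, the teacher is
# only about halfway there after one half-life of updates.
teacher, student = [0.0], [1.0]
for _ in range(173):
    teacher = ema_update(teacher, student, m=0.996)
print(round(teacher[0], 2))          # 0.5
```

At the end of the schedule the momentum reaches 1.0 and the update leaves the teacher unchanged, so the teacher is effectively frozen only at the very end of training.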
Consider the extremes of the momentum. Near $m = 0$ the teacher tracks the student tick-for-tick and the chase collapses. At $m = 1$ the teacher never moves, which is a frozen-target ablation the paper does not run but which other JEPA-style methods have shown to stall (the predictor can match the frozen teacher and stop learning anything beyond that). The paper's $m = 0.996$ is slow enough to break collapse and fast enough that the teacher still encodes the latest signal.
One detail to pin down: the loss runs through both $\theta$ (via the context encoder) and $\phi$ (the predictor), so gradient descent updates both at once. The target encoder gets no gradient signal; it is updated only by the EMA update (2). At initialization, the two encoders share weights, so the predictor starts chasing a target it could plausibly match, and they diverge through training as the student is shaped by gradients while the teacher integrates them with delay.
# one training step (the predictor and context encoder learn;
# the target encoder updates as an EMA of the context encoder)
ctx_idx, tgt_idxs = sample_masks(grid=14, M=4)   # patch indices
s_x = context_encoder(image[ctx_idx])            # encode visible patches only
s_y_T = target_encoder_no_grad(image)            # encode FULL image, no gradients
loss = 0
for tgt in tgt_idxs:
    mask_tokens = posemb[tgt] + shared_mask_vector   # one per target patch
    s_y_hat = predictor(s_x, mask_tokens)            # predict that block
    loss += mse(s_y_hat, s_y_T[tgt].detach())        # L2 in embedding space
loss /= len(tgt_idxs)
loss.backward()        # gradients for θ (context encoder) and φ (predictor)
optimizer.step()       # apply them; the target encoder is untouched
optimizer.zero_grad()
# m = 0.996 is the starting value of the momentum schedule
ema_update(target_encoder, context_encoder, m=0.996)   # θ̄ <- m·θ̄ + (1-m)·θ

Why predicting in pixels is worse
One ablation in the paper carries most of the conceptual weight, and it is Table 7. Train two ViT-L/16 models with the same masking strategy, the same architecture, the same optimizer, but switch the target from the output of the target encoder (embeddings) to the raw pixels of the target block (the MAE-style choice). Evaluate both with a linear probe on 1% of ImageNet.
The representation target reaches 66.9% top-1 after 500 epochs; the pixel target reaches 40.7% after 800. That is a 26-point gap with the pixel-target run getting 60% more training. The interpretation the paper offers, and which matches what other masked-image-modelling work has found, is that pixel-level losses spend the model's capacity on signal a downstream classifier cannot use: JPEG compression artefacts, sensor noise, exact intensity values that a robust feature should be invariant to. A raw-pixel L2 has no way to tell the model that "a slightly different shade of brown" should not count, so the model spends capacity predicting it anyway. With a representation target, the L2 is computed AFTER the target encoder has already abstracted those details away, so the predictor only spends capacity on what survives the encoder.
That is the conceptual crux of the paper: the L2 distance is the same simple object as in MAE, but because both arguments first pass through a learned encoder, the loss is computed in whatever representation space that encoder has carved out. As the encoder gets better at distinguishing semantic features, the loss inherits that semantics. The pixel-target model never gets that benefit; its loss is anchored to the raw signal forever.
The masking strategy ablation makes a related claim by varying what the predictor has to predict. Table 6 compares the multi-block strategy to three other masking schemes that span the same fraction of the image. Each strategy gives the model 25-40% of the image as context and asks it to predict the rest, differing only in how the predicted region is shaped.
[Figure: bar chart of the top-1 accuracies in Table 6, one bar per masking strategy.]
Two findings live in this table. One: predicting a few medium-sized SEMANTIC blocks (the 15%-20% scale) beats predicting one large block or scattered random patches by a wide margin. Small targets give the predictor too little to ground its guess; very large targets demand reconstructing structure the context never hinted at. Two: the context shape matters as much as the target shape. The rasterized strategy gives the predictor exactly one quadrant of the image and asks it to predict the other three. That sounds like a natural task and it lands at 15.5%, the worst result in the table. A single quadrant is too narrow a view of the scene; you can read the same point off the context-scale ablation (Table 9): shrink the context scale from 0.85 down to 0.40 and accuracy drops from 54.2% to 31.2%. Context that is both INFORMATIVE (covers most of the scene) and DISTRIBUTED (a single coherent block rather than a quadrant or random patches) is what carries the method.
The multi-block masking strategy
One thing the ablation table compresses is that the multi-block design is not one knob but four, and the appendix walks each. The target scale (Table 8) sweeps the range from which target-block scales are drawn, peaking sharply at $(0.15, 0.2)$ with 54.2% top-1. Below that range the targets are too small to carry semantic content (a 7%-scale block is a 60x60 patch, fine texture, no object). Above it the targets become large enough that the predictor has to invent structure rather than recover it; the model essentially has to predict an unfamiliar half of the image.
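The pixel sizes quoted for target blocks follow from a one-line conversion, sketched here for a square block on a 224x224 image. The function name is mine, and real blocks have aspect ratios other than 1, so these are representative side lengths only.

```python
import math

def block_side_pixels(scale, image_size=224):
    """Side of a square block covering `scale` of a square image, in pixels."""
    return math.sqrt(scale) * image_size

for scale in (0.07, 0.15, 0.20):
    print(f"scale {scale:.2f} -> about {block_side_pixels(scale):.0f} px per side")
```

This gives roughly 59, 87, and 100 pixels per side, which matches the 60x60 and 100x100 figures used in the text.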
The context scale (Table 9) sweeps the other way. At a context floor of 0.40 the model sees less than half the image and accuracy falls to 31.2%. At the paper's 0.85 the context is dense and the predictor has a coherent partial view to ground its guess. The middle of the sweep is mild, though: 0.65 already recovers 47.1%, 0.75 reaches 49.3%. The function is monotone but not steep, which says the EXACT size matters less than the shape (one contiguous block) and the disjointness (no overlap with the targets the predictor has to guess).
The number of target blocks (Table 10) is the cleanest curve. One target block: 9.0%. Two: 22.0%. Three: 48.5%. Four: 54.2%. The interpretation the paper offers is that each target block is one training signal per image; more targets means more gradient signal per forward pass at almost no extra cost (the predictor is small and the target encoder runs once per image regardless). Past four the paper does not push further, probably because the targets stop being disjoint enough to provide independent signal.
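The shape of that curve is easier to see as increments. The snippet below only restates the four accuracies quoted above and subtracts neighbours; the dictionary name is my own.

```python
# Top-1 accuracy by number of target blocks, as quoted in the text (Table 10).
top1_by_num_targets = {1: 9.0, 2: 22.0, 3: 48.5, 4: 54.2}

counts = sorted(top1_by_num_targets)
gains = {
    f"{a}->{b}": round(top1_by_num_targets[b] - top1_by_num_targets[a], 1)
    for a, b in zip(counts, counts[1:])
}
print(gains)   # {'1->2': 13.0, '2->3': 26.5, '3->4': 5.7}
```

The largest jump comes from the third block, and the fourth adds noticeably less, which fits the reading that returns diminish as blocks start to overlap.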
The fourth knob is harder to put on a slider: the predictor depth (Table 12). A 6-layer predictor on ViT-L/16 lands at 64.0% top-1 on 1% ImageNet; a 12-layer predictor reaches 66.9%. The predictor is doing real work, not just averaging. It is forecasting the embedding of a region it has never seen from a partial encoded view of the rest of the image, and depth helps it. The width does not. A 384-channel predictor beats a 1024-channel one (the encoder's own width), 70.7% vs 68.4% on ImageNet-1%, which the authors flag as a width bottleneck that helps. One plausible read of that counterintuitive result is that a narrow predictor forces the encoder to do more of the work of representing the visible context, which is what the downstream evaluation actually probes.
Linear probing, transfer, counting, depth
With the design pinned down, the results break into four pieces.
ImageNet linear probe. Freeze the encoder, train a linear head, report top-1. ViT-H/14 at 224x224 reaches 79.3%; ViT-H/16 at 448x448 reaches 81.1%. The closest method that also avoids hand-crafted view augmentations is data2vec at 77.3% on ViT-L/16, so I-JEPA gains about 2-4 points without changing what the model is allowed to see. The view-invariance methods still lead this benchmark (iBOT on ViT-L/16: 81.0%), but the gap is narrow and the larger I-JEPA H/16-448 matches iBOT despite using no augmentations. For a benchmark designed to reward augmentation-induced invariance, that is the closest an augmentation-free method has come.
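The probing protocol (freeze the encoder, average-pool the patch embeddings, fit only a linear head) can be sketched on synthetic data. Everything here is a toy assumption: `frozen_encoder` is a hypothetical stand-in that emits noisy class-dependent patch embeddings, there are two classes instead of a thousand, and the head is trained with the perceptron rule purely to keep the example dependency-free, not because the paper trains its probe that way.

```python
import random

rng = random.Random(0)
DIM, PATCHES = 4, 16

def frozen_encoder(label):
    """Hypothetical frozen encoder output: PATCHES patch embeddings whose
    mean depends on the class (+1 or -1), plus noise."""
    return [[label * 1.0 + rng.gauss(0.0, 1.0) for _ in range(DIM)]
            for _ in range(PATCHES)]

def average_pool(patch_embeddings):
    """Mean over patches: one DIM-sized feature per image, no [cls] token."""
    n = len(patch_embeddings)
    return [sum(col) / n for col in zip(*patch_embeddings)]

labels = [1 if i % 2 == 0 else -1 for i in range(200)]
features = [average_pool(frozen_encoder(y)) for y in labels]

# Linear head trained with the perceptron rule; the encoder is never updated.
w, b = [0.0] * DIM, 0.0
for _ in range(20):
    for x, y in zip(features, labels):
        if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
            w = [wi + y * xi for wi, xi in zip(w, x)]
            b += y

correct = sum(
    (1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1) == y
    for x, y in zip(features, labels)
)
print(f"train accuracy: {correct / len(labels):.2f}")
```

The point of the protocol is that only `w` and `b` change, so the reported accuracy measures what the frozen features already make linearly separable.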
Low-shot ImageNet. Train on 1% of the labels (about 12-13 images per class) by fine-tuning or linear-probing whichever works better per method. I-JEPA at ViT-H/14 reaches 73.3%; the H/16-448 reaches 77.3%. Both beat MAE on every comparable architecture (MAE H/14 at 1600 epochs: 71.5%) and the H/16-448 even passes the augmentation-using DINO at ViT-B/8 (70.0%) and BYOL at RN200x2 (71.2%). Low-shot is where the pretraining quality shows; you cannot fine-tune your way out of a bad starting point with 12 examples per class.
Transfer to other classification tasks. CIFAR100, Places205, iNaturalist18. I-JEPA at ViT-H/14 lands at 87.5 / 58.4 / 47.6, all significantly above MAE and data2vec on the same backbone, and on CIFAR100 and Places205 it surpasses DINO at ViT-B/8 despite DINO using augmentations. iNaturalist is the one place augmentation-based methods still win cleanly (iBOT at ViT-L/16: 57.3%); fine-grained species classification benefits from the color and scale jitter that augmentations train in, and I-JEPA does not learn that prior on its own.
Low-level tasks: counting and depth. Clevr/Count (count the objects in a scene) and Clevr/Dist (estimate distances) are exactly where invariance hurts: if your representation is invariant to scale, you cannot count, and if it is invariant to color, you cannot estimate depth from texture gradients. I-JEPA at ViT-H/14 reaches 86.7 / 72.4, beating DINO (86.6 / 53.4) and iBOT (85.7 / 62.8) by 19 and 10 points respectively on the distance task. This is where the "less inductive bias, broader applicability" claim is doing real work: the same representation that nearly matches augmentation methods on classification edges them on counting and beats them by 10 to 19 points on distance estimation, tasks they were never designed for.
A ViT-H/14 in under 72 hours on 16 A100s
The other result the paper makes a centerpiece of, and the one that drove most of the discussion at the time, is efficiency. Pretraining a ViT-H/14 with I-JEPA takes under 1200 GPU-hours (about 72 hours on 16 A100s); the same architecture with MAE needs over 10x that. The savings come from three places.
First, no augmentation pipeline. View-invariance methods like DINO and iBOT process two or more crops of every image per step (DINO uses a multi-crop schedule with 8 local crops plus 2 global crops); I-JEPA processes one. That is roughly a 5x reduction in per-step encoder work, before any other savings.
Second, the context encoder only processes the VISIBLE patches, not the full image. The masking strategy hides about 75% of the image from the context encoder (the context block starts at 85% to 100% of the image, and carving out the targets leaves about 25% on average), so the context-encoder forward pass is roughly 4x cheaper than a full ViT forward at the same architecture. The target encoder still runs on the full image, but only once per step and with no gradients, so it adds a forward pass but no backward pass.
Third, faster convergence. The paper reports that I-JEPA converges in roughly 5x fewer iterations than MAE for the same downstream accuracy, even though each I-JEPA iteration is about 7% slower than an MAE iteration (the cost of running the target encoder to compute embeddings). The product, fewer iterations times slightly slower iterations, is still a net saving of roughly 4.7x (5 / 1.07) from convergence alone, a large share of the order-of-magnitude gap quoted above.
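The efficiency figures above are simple arithmetic, worked through here. All inputs are the numbers quoted in the text; the variable names are mine, and the token count assumes a 224x224 input with patch size 14 and the 25% average context ratio.

```python
gpus = 16
wall_clock_hours = 72                   # upper bound quoted in the text
gpu_hours = gpus * wall_clock_hours
print(gpu_hours)                        # 1152, i.e. "under 1200 GPU-hours"

iteration_reduction = 5.0               # "roughly 5x fewer iterations" than MAE
per_iteration_slowdown = 1.07           # "about 7% slower" per iteration
net_speedup = iteration_reduction / per_iteration_slowdown
print(round(net_speedup, 2))            # 4.67

# Context-encoder token count for ViT-H/14 at 224 x 224, assuming the
# 25% average context ratio quoted from Table 6.
patches_per_side = 224 // 14
total_patches = patches_per_side ** 2
context_patches = round(0.25 * total_patches)
print(total_patches, context_patches)   # 256 64
```

Note the unit: 72 hours is wall-clock time on 16 GPUs, which is 1152 GPU-hours, not 72.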
The headline numbers are direct. A ViT-Huge with I-JEPA undercuts a ViT-Small with iBOT on wall-clock GPU hours, while reaching higher downstream accuracy. That is a swap of "huge model trained efficiently" for "small model trained with augmentations," and the efficient huge model wins on most evaluations. Whether that finding holds at the next scale up is a question the paper raises and leaves open. The successor work, V-JEPA on video and V-JEPA 2 on action-conditioned planning, takes the same recipe to larger, more structured inputs and keeps the speedups intact, suggesting the answer is yes.
The design reduces to three moves. Take a joint-embedding loss that asks two embeddings to match, replace one side's embedding with a predictor conditioned on position, and use an EMA teacher to keep the targets from collapsing. What comes out is a method that learns semantic features from a single image view, with no hand-coded augmentation pipeline, in a fraction of the compute the augmentation-using and pixel-reconstructing camps spent. That the same recipe later carries to video in V-JEPA and to action-conditioned planning in V-JEPA 2 is a stronger sign than any one ImageNet number.
Questions you might still have
Why does a moving teacher not collapse the same way a frozen one does?
A frozen teacher gives a stale, unchanging target that the student can match by ignoring its input. A teacher that moves with the student (the EMA at momentum 0.996) gives a slowly-changing target that still encodes the latest signal, so the student can never settle into a single constant output and call the loss done.
What is z in the JEPA diagram, concretely, for I-JEPA?
Position. For each masked target block the predictor receives one mask token per target patch, and each mask token carries the positional embedding of where in the image that patch sits. So z is "predict the embedding at THIS location," and the predictor has to produce a different output for each requested location.
Why do view-augmentation methods still win on some ImageNet benchmarks?
Augmentations encode strong prior knowledge: two crops of the same image are the same object, scale is irrelevant, color is irrelevant. That prior is exactly the invariance ImageNet classification rewards. I-JEPA never sees it. On low-level tasks the augmentation prior backfires (Clevr/Count, Clevr/Dist), and I-JEPA leads. I-JEPA trades a narrower prior for broader transfer across tasks.
How does I-JEPA fit into the JEPA family?
I-JEPA is the original image-only instantiation. V-JEPA extends the same predict-the-embedding recipe to video by sampling spatio-temporal target tubes; V-JEPA 2 scales the video pretraining and adds action conditioning so the world model can plan robot manipulation. The masking strategy and the EMA target keep working as the input gets bigger and more structured.
Footnotes & further reading
- The paper: Assran, Duval, Misra, Bojanowski, Vincent, Rabbat, LeCun, Ballas, Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (FAIR, CVPR 2023).
- The JEPA framework as Yann LeCun proposed it: A Path Towards Autonomous Machine Intelligence (v0.9.2, 2022).
- The closest prior method, predicting representations of masked regions with a momentum-EMA teacher, across modalities: Baevski et al., data2vec (2022).
- The pixel-reconstruction line I-JEPA argues against on semantics: MAE (He et al.) and BEiT (Bao et al.).
- The EMA-teacher trick I-JEPA borrows from the view-invariance literature: BYOL (Grill et al.) and MoCo v2 (Chen et al.).
- The two follow-up JEPA papers in this series: V-JEPA (Bardes et al. 2024, arxiv:2404.08471) lifts the recipe to video, and V-JEPA 2 (Assran et al. 2025, arxiv:2506.09985) scales it and adds action conditioning for robot planning.
- The RCDM visualization framework used to decode I-JEPA representations back to pixels in Figures 6-8: Bordes, Balestriero, Vincent, High Fidelity Visualization of What Your Self-Supervised Representation Knows About (TMLR 2022).