Learning Transferable Visual Models From Natural Language Supervision
The caption is the label.
Train an image encoder and a text encoder to agree on which caption goes with which image. Do it on 400 million pairs off the internet. Now the text encoder can write a classifier for any set of words you hand it, and the model labels images it was never trained to label.
What if a vision model could recognize a class it was never trained on, just because you typed its name?
Here is the awkward fact about how we built image classifiers for a decade. You pick a list of categories ahead of time, say the 1,000 in ImageNet, you collect labeled examples of each, and you train a model to output one of those 1,000 numbers. The model is good at exactly those classes and useless for anything else. Want to recognize a new kind of object? Collect a new labeled dataset and train again. The label is a bare integer with no meaning attached, so the model never learns that a "golden retriever" is a kind of dog, or that "a photo taken at dusk" describes lighting. It learns to sort pixels into bins you defined.
CLIP (Contrastive Language-Image Pre-training, from OpenAI) throws that setup out. Instead of a fixed list of integer labels, it learns from the captions people already write next to images on the web. The training task is one sentence: given a batch of images and their captions, figure out which caption goes with which image. No category list, no human annotation, just pairs scraped at scale. The useful part comes at test time. Because the model learned to connect pictures and words, you can describe a brand-new class in plain English and it will recognize it, with no extra training. A CLIP model matches the accuracy of the original ResNet-50 on ImageNet without seeing a single one of ImageNet's 1.28 million labeled training images.
Four ideas carry the whole paper, and we build them in order: why a fixed label set is a cage, how a shared embedding space lets images and text be compared, what the contrastive loss actually optimizes (and the one learned knob that makes it work), and how the text encoder turns into a classifier you write with words.
The label-set bottleneck
Start with what supervised classification costs. To teach a model a concept, you need labeled examples of it. The label carries no information beyond "this is class 7"; the meaning lives only in your head and in the file that maps 7 back to "maltese dog." So the model can only ever predict from the closed set you built it for. Adding a class means relabeling and retraining. This is why a great ImageNet model cannot tell you whether a photo shows a satellite image of farmland: nobody put that class in the list.
Natural language has none of that ceiling. The text "a maltese puppy sitting in a teacup" is its own label, and it ties into every other piece of text the model has read: puppy relates to dog, teacup to cup. A caption supervises richly and for free, since people write captions anyway. The bet CLIP makes is that this cheaper, looser signal, applied at enormous scale, beats the expensive clean one. The dataset is 400 million (image, text) pairs collected from the internet, which the authors call WIT (WebImageText), with roughly the same total word count as the corpus used to train GPT-2.
One thing had to be settled first: how do you train on a caption? The obvious idea is to predict the caption word by word, the way an image-captioning model does. The authors tried it and it was slow. Predicting the exact words is a hard target, because for any image a thousand different sentences would be fine, and the model wastes its effort learning to reproduce the exact phrasing. Swapping that for the easier question, which caption out of this batch matches this image, gave a 4x speedup in how fast the model learned to transfer to ImageNet. You do not need to generate the text. You only need to recognize the right one.
The caption is the label
So the unit of supervision is a pair: one image, one caption that actually accompanied it. Across a batch of $N$ such pairs, the model sees $N$ images and $N$ captions and knows the true matching is image $i$ with caption $i$. Every other pairing, image $i$ with caption $j \neq i$, is a mismatch the model should reject. That is the entire signal. Note what is missing: there is no list of categories, and the same image with a different true caption is a different training example. The model is learning a relationship between two spaces, not a partition of one.
To compare an image with a caption at all, they have to live somewhere comparable. That somewhere is a shared embedding space, and building it is the next step.
Two encoders, one space
CLIP has two networks. An image encoder (a ResNet or a Vision Transformer) turns a picture into a vector. A text encoder (a Transformer over byte-pair tokens) turns a caption into a vector. Each is then linearly projected into one shared space of dimension $d_e$, and crucially the two outputs land in the same space, so an image vector and a caption vector are directly comparable.
Comparable how? By cosine similarity, the cosine of the angle between two vectors. CLIP $L_2$-normalizes every embedding to unit length, which throws away magnitude and keeps only direction. On the unit sphere, the dot product of two vectors is their cosine similarity:
$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = u \cdot v \quad \text{when } \lVert u \rVert = \lVert v \rVert = 1. \tag{1}$$
A value near $1$ means the two vectors point the same way (a good match), near $0$ means they are unrelated, and negative means they point apart. The training goal, stated in this geometry, is simple: make a matched image and caption point the same way, and make mismatched ones point apart. For any image, the caption with the highest cosine similarity is the model's best match.
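To make the geometry concrete, here is a minimal NumPy sketch of normalizing embeddings and reading cosine similarity off a dot product. The vectors are made-up toy numbers, and `l2_normalize` is a helper written for this illustration, not taken from the CLIP code.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

# Toy embeddings (made-up numbers): 2 images and 3 captions in a 4-d space.
images = np.array([[3.0, 0.0, 4.0, 0.0],
                   [0.0, 2.0, 0.0, 0.0]])
captions = np.array([[6.0, 0.0, 8.0, 0.0],    # same direction as image 0
                     [0.0, -1.0, 0.0, 0.0],   # opposite to image 1
                     [0.0, 0.0, 0.0, 5.0]])   # orthogonal to both images

I_e = l2_normalize(images)
T_e = l2_normalize(captions)

# On the unit sphere a plain dot product is the cosine similarity.
sims = I_e @ T_e.T                     # shape [2, 3]
best_caption = sims.argmax(axis=1)     # index of the best-matching caption per image
```

Image 0 and caption 0 differ only in length, so normalization maps them to the same point and their similarity is 1; magnitude has been discarded, as intended.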
One detail worth pinning, because the code and the paper describe it slightly differently. The text vector is read off at the end-of-text token: after the caption runs through the text Transformer, the activations at the final token are layer-normalized and projected into the shared space. The paper calls that token [EOS]; the released code grabs it with text.argmax(dim=-1), which works because the end-of-text token has the largest id in the byte-pair vocabulary, so the argmax lands on its position.
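A small NumPy sketch of that indexing trick follows. The token ids are assumptions for illustration (padding as 0, start-of-text as 49406, end-of-text as 49407, the largest id), the word ids are made up, and the random `hidden` array stands in for the Transformer output. Layer norm and the projection are left out.

```python
import numpy as np

# Assumed token ids for illustration: padding = 0, start-of-text = 49406,
# end-of-text = 49407 (taken to be the largest id in the vocabulary).
SOT, EOT, PAD = 49406, 49407, 0

# Two tokenized captions padded to a context length of 6 (word ids are made up).
tokens = np.array([
    [SOT, 320, 1125, EOT, PAD, PAD],
    [SOT, 320, 2368, 539, 1929, EOT],
])

# Stand-in for the text Transformer's output: one vector per token position.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 6, 8))            # [batch, context, width]

# argmax over token ids finds the end-of-text position in each row,
# because no other id in the row is larger.
eot_pos = tokens.argmax(axis=-1)
# Pick the activation at that position for every caption.
text_features = hidden[np.arange(tokens.shape[0]), eot_pos]   # [batch, width]
```

The trick relies entirely on the end-of-text id being the maximum; with a tokenizer that orders its special tokens differently, you would search for the id explicitly instead.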
The contrastive grid
Now the loss, which is the part that makes the whole thing cheap. Take a batch of $N$ pairs. Embed and normalize all images into $I_e \in \mathbb{R}^{N \times d_e}$ and all captions into $T_e \in \mathbb{R}^{N \times d_e}$. Compute the cosine similarity of every image against every caption. That is an $N \times N$ grid $S = I_e T_e^\top$, with $S_{ij}$ the similarity of image $i$ and caption $j$.
The structure of the answer key is the whole trick. The cells on the diagonal, $S_{ii}$, are the real pairs and should be large. The off-diagonal cells are impostor pairings and should be small. CLIP does not need anyone to label the negatives; every other caption in the batch is automatically a negative for a given image. A batch of $N$ gives you one positive and $N - 1$ negatives per image, for free.
To turn the grid into a loss, read each row as a classification problem. Row $i$ holds image $i$'s similarity to all $N$ captions; the correct answer is caption $i$. So run a softmax along the row (it exponentiates the scores and normalizes them into a probability over the captions) and penalize it with cross-entropy whenever the mass lands anywhere but caption $i$. Do the same down each column (each caption picking its image). Average the two directions, and that is the symmetric loss:
$$\mathcal{L} = \frac{1}{2N} \sum_{i=1}^{N} \left[ -\log \frac{\exp(S_{ii}/\tau)}{\sum_{j=1}^{N} \exp(S_{ij}/\tau)} \; - \; \log \frac{\exp(S_{ii}/\tau)}{\sum_{j=1}^{N} \exp(S_{ji}/\tau)} \right] \tag{2}$$
The first term asks each image to pick its caption out of the $N$; the second asks each caption to pick its image. The $\tau$ is the temperature, which we get to next. The loss is smallest exactly when both the row softmax and the column softmax put all their mass on the diagonal.
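The symmetric loss can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the released training code: the function names are made up, the temperature is fixed at 0.07 for the demo (CLIP learns it), and the toy batches are hand-built.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(I_e, T_e, tau=0.07):
    """I_e, T_e: [N, d] unit-norm embeddings; row i of each is a true pair."""
    S = I_e @ T_e.T                      # [N, N] cosine similarities
    logits = S / tau
    idx = np.arange(S.shape[0])
    # Row softmax: each image picks its caption; take -log p on the diagonal.
    loss_img_to_txt = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Column softmax: each caption picks its image.
    loss_txt_to_img = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_img_to_txt + loss_txt_to_img)

# Perfectly aligned toy batch: 4 orthogonal unit vectors, image i == caption i.
aligned = np.eye(4)
# Mismatched toy batch: captions shifted by one, so every true pair is orthogonal.
shifted = np.roll(aligned, 1, axis=0)

loss_good = symmetric_contrastive_loss(aligned, aligned)   # close to 0
loss_bad = symmetric_contrastive_loss(aligned, shifted)    # large
```

When every similarity in the grid is identical, both softmaxes are uniform and the loss equals $\log N$, which is a handy sanity check for an untrained model.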
This batch-of-negatives objective is not new. It is the multi-class $N$-pair loss from metric learning (Sohn, 2016), popularized for contrastive learning as InfoNCE (Oord et al., 2018), and applied to image-text pairs in medical imaging by ConVIRT (Zhang et al., 2020). CLIP's contribution is to strip it down and scale it. They train from scratch, use only a linear projection into the shared space (no nonlinear projection head), and lean on the sheer size of the dataset so that over-fitting is not a concern.
Temperature: one learned knob
The temperature $\tau$ in (2) controls how sharp the softmax is. Dividing the similarities by a small $\tau$ spreads them far apart before the exponential, so the softmax becomes confident and spiky; a large $\tau$ squashes them together and the distribution flattens toward uniform. Equivalently, CLIP multiplies the similarities by a scale $1/\tau$, which the code stores in log space as logit_scale so it is always positive:
$$\frac{1}{\tau} = \exp(\texttt{logit\_scale}) \tag{3}$$
The surprise is that $\tau$ is not a hyper-parameter you tune. It is a single learned scalar, optimized by gradient descent like any weight. The code initializes it to log(1 / 0.07), so $\tau = 0.07$ at the start and the scale $1/\tau \approx 14.3$. Left unchecked, the model would happily drive the scale toward infinity (an infinitely confident softmax), so CLIP clamps it: $1/\tau$ is not allowed above $100$, which means $\tau \geq 0.01$. The authors found that clamp necessary to keep training stable.
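A quick numerical sketch shows what the scale does to one image's distribution over five captions. The similarity values are made up, and `clamped_scale` is a hypothetical helper that mimics the cap at 100; how and where the released training loop applies the clamp is not shown here.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Made-up cosine similarities of one image against five captions.
sims = np.array([0.31, 0.27, 0.22, 0.18, 0.10])

# Learned parameter stored in log space, initialized to log(1 / 0.07).
logit_scale = np.log(1 / 0.07)

def clamped_scale(logit_scale, max_scale=100.0):
    """exp(logit_scale), with the parameter capped so the scale stays <= max_scale."""
    return np.exp(np.minimum(logit_scale, np.log(max_scale)))

p_init = softmax(sims * clamped_scale(logit_scale))      # scale about 14.3
p_sharp = softmax(sims * clamped_scale(np.log(500.0)))   # asks for 500, gets 100
p_flat = softmax(sims * 1.0)                             # tau = 1, nearly uniform
```

With these numbers the top caption gets roughly a fifth of the mass at $\tau = 1$, about half at the initial scale, and nearly all of it at the clamped maximum, even though the underlying similarities never change.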
One batch, with shapes
Let me make the abstract grid concrete with one training step of a ViT-B/32 CLIP. The shared embedding dimension is $d_e = 512$. The batch is large on purpose: CLIP uses $N = 32{,}768$ pairs at once, because a bigger batch means more negatives per image, which makes the contrastive task harder and the signal richer.
- Encode the batch. Images go through the ViT to $I_e \in \mathbb{R}^{32768 \times 512}$, captions through the text Transformer to $T_e \in \mathbb{R}^{32768 \times 512}$.
- $L_2$-normalize each row, so every one of the 65,536 vectors sits on the unit sphere.
- Build the grid $S = I_e T_e^\top \in \mathbb{R}^{32768 \times 32768}$, then scale by $\exp(\texttt{logit\_scale})$. That is over a billion cosine similarities, with the 32,768 correct pairs on the diagonal and about $1.07 \times 10^9$ impostors off it.
- The label vector is just $[0, 1, \dots, N-1]$: row $i$'s answer is column $i$. Cross-entropy along rows, cross-entropy down columns, average. One scalar loss.
That is the entire forward pass, nine lines in the paper's notation:
# CLIP core, from the paper's Figure 3 (numpy-like)
I_f = image_encoder(I) # [n, d_i] images in a batch
T_f = text_encoder(T) # [n, d_t] their captions
# project into the shared space, then L2-normalize to the unit sphere
I_e = l2_normalize(I_f @ W_i, axis=1) # [n, d_e]
T_e = l2_normalize(T_f @ W_t, axis=1) # [n, d_e]
# every image vs every caption: an n x n grid of cosine sims
logits = (I_e @ T_e.T) * np.exp(t) # t = learned log-temperature
# the right answer for row i (and column i) is i: the diagonal
labels = np.arange(n)
loss_i = cross_entropy(logits, labels, axis=0) # caption -> image
loss_t = cross_entropy(logits, labels, axis=1) # image -> caption
loss = (loss_i + loss_t) / 2
The batch is so large that the similarity computation is sharded across GPUs, each one holding only the slice of the grid its local embeddings need. The temperature in that pseudocode is the same logit_scale from (3). CLIP trained eight models this way (five ResNets and three Vision Transformers) for 32 epochs over the 400M pairs. The largest, RN50x64, took 18 days on 592 V100 GPUs; the best, ViT-L/14, took 12 days on 256 V100s, with a final extra epoch at 336-pixel resolution to make ViT-L/14@336px.
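The sharding idea can be checked on a toy scale. The sketch below is an assumption about how such a split could work, not a description of OpenAI's actual distributed code: a Python loop plays the role of the GPUs, each "worker" scores only its local images and captions against all embeddings, and the summed result is compared with the loss from the full grid.

```python
import numpy as np

def log_softmax_rows(x):
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, num_shards = 8, 16, 4            # toy sizes; CLIP uses n = 32,768
I_e = l2_normalize(rng.normal(size=(n, d)))
T_e = l2_normalize(rng.normal(size=(n, d)))
scale = 1 / 0.07

# Reference: build the full n x n grid in one place.
logits = scale * I_e @ T_e.T
idx = np.arange(n)
full_loss = 0.5 * (-log_softmax_rows(logits)[idx, idx].mean()
                   - log_softmax_rows(logits.T)[idx, idx].mean())

# Sharded: each worker owns n / num_shards pairs and sees all embeddings
# (in a real system these would have to be gathered from the other workers).
per_shard = n // num_shards
total = 0.0
for s in range(num_shards):
    lo, hi = s * per_shard, (s + 1) * per_shard
    local = np.arange(per_shard)
    labels = lo + local                           # global index of each true pair
    img_rows = scale * I_e[lo:hi] @ T_e.T         # local images vs all captions
    txt_rows = scale * T_e[lo:hi] @ I_e.T         # local captions vs all images
    total += -log_softmax_rows(img_rows)[local, labels].sum()
    total += -log_softmax_rows(txt_rows)[local, labels].sum()
sharded_loss = total / (2 * n)
```

Each worker only ever materializes two blocks of shape `[n / num_shards, n]`, never the full grid, yet the two losses agree to floating-point precision.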
The text encoder writes a classifier
Training is done. The model can score how well any image matches any caption. The trick is to notice that this is already a classifier, if you feed it the right captions.
Suppose you want to classify ImageNet, 1,000 classes. Take the 1,000 class names, wrap each in a sentence like "A photo of a {label}.", and run all 1,000 through the frozen text encoder. You get 1,000 unit vectors, one per class. Now embed your image once into its own unit vector. The cosine similarity of the image vector against each of the 1,000 class vectors gives 1,000 scores; the largest one is your prediction. No training, no labeled images, no gradient step. You wrote the classifier by typing class names.
The whole zero-shot recipe is shorter than the training loss:
# zero-shot ImageNet, no training, no labeled images
classes = ["tabby cat", "golden retriever", ...] # 1000 names
prompts = [f"A photo of a {c}." for c in classes] # wrap each
W = l2_normalize(text_encoder(prompts)) # [1000, d_e] classifier
x = l2_normalize(image_encoder(image)) # [d_e] one image
logits = x @ W.T # cosine to each class
pred = classes[logits.argmax()] # nearest wins
The paper has a clean way of reading this. The text encoder is a hypernetwork: it generates the weights of a linear classifier from the words describing the classes. And training CLIP, viewed through this lens, is optimizing a randomly drawn proxy classification task at every step, one with 32,768 classes (the captions in the batch) and a single example each (the matched image). The pre-training task and the zero-shot task are the same shape; the only difference at test time is that you choose the captions.
On ImageNet this zero-shot classifier reaches 76.2% top-1 and 95% top-5, matching a fully supervised ResNet-50 that was trained on all 1.28 million labeled ImageNet images. The earlier best zero-shot attempt (Visual N-Grams) managed 11.5%.
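For readers unfamiliar with the two metrics, here is a small sketch of how top-1 and top-k accuracy are computed from a matrix of class scores. The scores and labels are made up, and `topk_accuracy` is an illustrative helper, not part of any CLIP release.

```python
import numpy as np

def topk_accuracy(logits, labels, k):
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    topk = np.argsort(-logits, axis=1)[:, :k]          # [num_images, k]
    hits = (topk == labels[:, None]).any(axis=1)
    return hits.mean()

# Made-up scores for 4 images over 6 classes (rows: images, columns: classes).
logits = np.array([
    [0.9, 0.1, 0.3, 0.2, 0.0, 0.4],   # true class 0: ranked 1st
    [0.2, 0.5, 0.6, 0.1, 0.0, 0.3],   # true class 1: ranked 2nd
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],   # true class 0: ranked last
    [0.3, 0.2, 0.1, 0.8, 0.0, 0.4],   # true class 3: ranked 1st
])
labels = np.array([0, 1, 0, 3])

top1 = topk_accuracy(logits, labels, k=1)   # 2 of 4 images correct
top2 = topk_accuracy(logits, labels, k=2)   # 3 of 4 images correct
```

Top-5 on ImageNet is the same computation with `k=5` over 1,000 columns.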
Prompts and ensembles
Why "A photo of a {label}." rather than the bare class name? Because the captions CLIP trained on are rarely a single word. They are sentences. Feeding the text encoder a lone word like dog puts it in a part of the space it rarely saw during training. Wrapping it in a short sentence closes that distribution gap, and on ImageNet that change alone is worth about 1.3 points of accuracy.
Wording also disambiguates. A common failure is polysemy: ImageNet has both construction cranes and cranes that fly, and Oxford Pets has "boxer," which the text encoder might read as the dog or the athlete. A prompt that supplies context fixes it. "A photo of a {label}, a type of pet." nudges "boxer" toward the breed; "a satellite photo of a {label}." helps on aerial imagery; putting quotes around text helps on OCR.
The second lever is ensembling. Build several classifiers from different prompts ("A photo of a big {label}.", "A photo of a small {label}.") and average their class vectors in the embedding space, before the softmax. Because you average the vectors and not the predictions, the cost at inference is the same as a single classifier once the averaged vectors are cached. On ImageNet, an ensemble of 80 prompts adds about 3.5% over the single default prompt. Prompt engineering and ensembling together buy nearly 5%, for free, with no change to the trained weights.
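Here is a minimal sketch of building an ensembled zero-shot classifier. Everything model-related is a stand-in: `toy_text_encoder` is a hypothetical function that returns a fixed random vector per string, and the "image" embedding is fabricated to sit near one class. Renormalizing the averaged vector is a choice made here so the final dot product stays a cosine.

```python
import zlib
import numpy as np

D = 32  # toy embedding width

def toy_text_encoder(prompts):
    """HYPOTHETICAL stand-in for CLIP's text encoder: a fixed random vector per string."""
    rows = []
    for p in prompts:
        rng = np.random.default_rng(zlib.crc32(p.encode("utf-8")))
        rows.append(rng.normal(size=D))
    return np.stack(rows)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def build_classifier(classes, templates, text_encoder):
    """One unit vector per class: the renormalized mean of its prompt embeddings."""
    weights = []
    for c in classes:
        emb = l2_normalize(text_encoder([t.format(c) for t in templates]))  # [T, D]
        weights.append(l2_normalize(emb.mean(axis=0)))                      # [D]
    return np.stack(weights)                                                # [C, D]

classes = ["tabby cat", "golden retriever", "boxer"]
templates = ["A photo of a {}.", "A photo of a big {}.", "A photo of a small {}."]
W = build_classifier(classes, templates, toy_text_encoder)

# Pretend image embedding: the "boxer" class vector plus a little noise.
rng = np.random.default_rng(0)
x = l2_normalize(W[2] + 0.05 * rng.normal(size=D))
pred = classes[int((W @ x).argmax())]
```

Note that `W` has one row per class no matter how many templates went into it, which is why an 80-prompt ensemble costs the same at inference as a single prompt once `W` is cached.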
Why it survives a new dataset
The result that made CLIP famous is not the headline ImageNet number. It is what happens when the test images shift. Take a standard ImageNet model and show it ImageNet-style photos and it does well. Show it the same 1,000 classes rendered as sketches (ImageNet Sketch), or artistic renditions (ImageNet-R), or the harder real-world crops of ObjectNet, and it falls apart, because it learned features that work on the exact ImageNet distribution and partly memorized that distribution's quirks.
CLIP, classifying zero-shot, holds up. It never trained on the ImageNet distribution, so it had nothing to overfit to. The paper measures this with effective robustness: plot accuracy on ImageNet against average accuracy on seven natural distribution shifts. A perfectly robust model would sit on the line $y = x$. Standard ImageNet models fall on a line far below it; zero-shot CLIP sits much closer to the ideal. The cleanest comparison is a ResNet-101 and a CLIP model that tie at 76.2% on ImageNet itself, then diverge on every variant of it. Rendered as art (ImageNet-R): ResNet-101 drops to 37.7%, CLIP holds 88.9%. On ObjectNet: 32.6% versus 72.3%. On ImageNet Sketch: 25.2% versus 60.2%. Same ImageNet score, different robustness by 35 to 51 points. Across the full suite, zero-shot CLIP closes up to 75% of the gap between standard ImageNet models and that ideal line.
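The arithmetic behind that comparison, using only the accuracies quoted above, can be laid out as follows. The variable names are illustrative.

```python
# Accuracies (%) quoted above: (ResNet-101, zero-shot CLIP); both score 76.2 on ImageNet.
results = {
    "ImageNet-R": (37.7, 88.9),
    "ObjectNet": (32.6, 72.3),
    "ImageNet Sketch": (25.2, 60.2),
}
imagenet_acc = 76.2

for name, (resnet, clip) in results.items():
    gain = clip - resnet                    # CLIP's lead on the shifted set
    resnet_drop = imagenet_acc - resnet     # how far ResNet-101 falls from its ImageNet score
    clip_drop = imagenet_acc - clip         # negative means CLIP beats its own ImageNet score
    print(f"{name:16s} gain={gain:5.1f}  resnet_drop={resnet_drop:5.1f}  clip_drop={clip_drop:6.1f}")

# CLIP's lead per dataset, rounded to one decimal place.
gains = {name: round(clip - resnet, 1) for name, (resnet, clip) in results.items()}
```

On ImageNet-R the CLIP "drop" is negative: the zero-shot model scores higher on artistic renditions than on ImageNet itself, while ResNet-101 loses more than half its accuracy.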
There is a sharp twist that the authors flag honestly. If you take CLIP and fit a supervised linear classifier on its features using the ImageNet training set, ImageNet accuracy jumps 9.2 points to 85.4%. Average accuracy on the distribution shifts does not improve at all (the paper reports a slight decrease). The gain concentrates right on the ImageNet distribution and does not generalize. Robustness came from not training on the target distribution, and you can train it back out.
So what does it actually do
CLIP shows that a simple objective, predicting which caption goes with which image, scaled to 400 million web pairs, produces a vision model you can steer with words. The zero-shot classifier matches supervised ResNet-50 on ImageNet, stays robust where supervised models crumble, and transfers across more than 30 datasets without ever fitting to them. The text encoder doubling as a classifier generator is the piece that keeps paying off, and it is the direct ancestor of the text-conditioned image generators (DALL-E 2, Stable Diffusion) that lean on a frozen CLIP text embedding.
The limits are real and the paper is candid about them. Zero-shot CLIP is weak on fine-grained classification (telling apart car models, flower species, aircraft variants) and on abstract tasks like counting objects, where it can be near chance. On data truly outside its web-scraped world it generalizes poorly: it manages only 88% on handwritten MNIST digits, worse than logistic regression on raw pixels, because almost nothing like MNIST appears in its training data. It can only ever choose from the class names you give it, so it is not a generative captioner. And it is data-hungry and compute-hungry: the authors estimate roughly a 1,000x increase in compute would be needed for zero-shot CLIP to reach overall state of the art, which is infeasible on current hardware.
Step back and the argument is four moves long. A fixed label set caps what a model can recognize. Natural language is an open, free label that scale makes practical. A contrastive match-the-pair objective learns a shared space cheaply, with a batch supplying its own negatives and a single learned temperature setting the sharpness. And once that space exists, the text encoder writes a classifier from words alone. The caption was the label the whole time. We just had to stop throwing it away.
Questions you might still have
Why is picking the right caption easier than predicting its exact words?
Generating the exact words is a hard, high-entropy task: many different captions could fit one image. CLIP only asks which caption out of the batch goes with this image, a multiple-choice question. The paper measured a 4x jump in training efficiency from making that swap.
The image encoder never sees a class label. How does it classify?
It does not, on its own. At test time the text encoder reads your class names and emits one vector per class. Those vectors are the classifier. The image encoder just produces one image vector, and the nearest class vector wins. The classifier is built from words, not fitted on labeled images.
Why learn the temperature instead of fixing it?
It sets how sharp the softmax is, which interacts with the learning rate and the geometry of the embedding space. Letting the model tune it (as a log-scale multiplier) removes a finicky hyper-parameter. The only guardrail is a clamp: the scale exp(logit_scale) cannot exceed 100, which the authors found necessary to keep training stable.
Does it actually understand the image, or just match strings?
Neither fully. CLIP learns a shared geometry where related images and texts point the same way, which is why prompt wording matters and why it stumbles on tasks far from web text (it gets 88% on MNIST, beaten by logistic regression on raw pixels). It is a strong, broad pattern matcher, not a reasoner.
Footnotes & further reading
- The paper: Radford, Kim, Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark, Krueger, Sutskever, Learning Transferable Visual Models From Natural Language Supervision (OpenAI, ICML 2021). Code and blog post.
- The batch-of-negatives objective: Sohn, Improved Deep Metric Learning with Multi-class N-pair Loss (2016), and Oord, Li, Vinyals, Representation Learning with Contrastive Predictive Coding (InfoNCE, 2018).
- The contrastive image-text precursor CLIP simplifies: Zhang et al., Contrastive Learning of Medical Visual Representations from Paired Images and Text (ConVIRT, 2020).
- The image encoder for the strongest models: Dosovitskiy et al., An Image Is Worth 16x16 Words (the Vision Transformer, 2020).
- A downstream use of the frozen CLIP text embedding: Ramesh et al., Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2, 2022).