Learning Transferable Visual Models From Natural Language Supervision

The caption is the label.

Train an image encoder and a text encoder to agree on which caption goes with which image. Do it on 400 million pairs off the internet. Now the text encoder can write a classifier for any set of words you hand it, and the model labels images it was never trained to label.

Explaining the paper: Learning Transferable Visual Models From Natural Language Supervision. Radford, Kim, Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark, Krueger, Sutskever · OpenAI · ICML 2021 · arXiv:2103.00020

Every new category used to mean collecting and labeling a fresh dataset; CLIP replaces that with a sentence.

For a decade, image classification was built on an awkward premise. You pick a list of categories ahead of time, say the 1,000 in ImageNet, you collect labeled examples of each, and you train a model to output one of those 1,000 numbers. The model is good at exactly those classes and useless for anything else. Want to recognize a new kind of object? Collect a new labeled dataset and train again. The label is a bare integer with no meaning attached, so the model never learns that a "golden retriever" is a kind of dog, or that "a photo taken at dusk" describes lighting. It learns to sort pixels into bins you defined.

CLIP (Contrastive Language-Image Pre-training, from OpenAI) replaces that setup. Instead of a fixed list of integer labels, it learns from the captions people already write next to images on the web. The training task is one sentence: given a batch of images and their captions, figure out which caption goes with which image. No category list, no human annotation, only pairs scraped at scale. The useful part comes at test time. Because the model learned to connect pictures and words, you can describe a brand-new class in plain English and it will recognize it, with no extra training. A CLIP model matches the accuracy of the original ResNet-50 (a standard 50-layer supervised convolutional net, the long-time ImageNet baseline; see /resnet/) on ImageNet without seeing a single one of ImageNet's 1.28 million labeled training images.

A handful of ideas carry the paper: why a fixed label set is limiting, how a shared embedding space lets images and text be compared, what the contrastive loss optimizes (and the one learned knob that makes it work), and how the text encoder turns into a classifier you write with words.

The label-set bottleneck

Supervised classification carries a fixed cost. To teach a model a concept, you need labeled examples of it. The label carries no information beyond "this is class 7"; the meaning lives only in your head and in the file that maps 7 back to "maltese dog." So the model can only ever predict from the closed set you built it for. Adding a class means relabeling and retraining. This is why a great ImageNet model cannot tell you whether a photo shows a satellite image of farmland: nobody put that class in the list.

Natural language has none of that ceiling. The text "a maltese puppy sitting in a teacup" is its own label, and it ties into every other piece of text the model has read: puppy relates to dog, teacup to cup. A caption supervises richly, and people write captions anyway. CLIP wagers that this cheaper, looser signal, applied at enormous scale, beats the expensive clean one. The dataset is 400 million (image, text) pairs collected from the internet, which the authors call WIT (WebImageText), with roughly the same total word count as the corpus used to train GPT-2 (OpenAI's 2019 language model, trained on ~40GB of web text).

First, though, how do you train on a caption? The obvious idea is to predict the caption word by word, the way an image-captioning model does. The authors tried it and it was slow. Predicting the exact words is a hard target, because for any image a thousand different sentences would be fine, and most of the compute goes into reproducing the exact phrasing. Swapping that for the easier question, which caption out of this batch matches this image, gave a 4x speedup in how fast the model learned to transfer to ImageNet.

The caption is the label

So the unit of supervision is a pair: one image, one caption that actually accompanied it. Across a batch of $N$ such pairs there are $N$ images and $N$ captions, and the true matching is image $i$ with caption $i$. Every other pairing, image $i$ with caption $j \ne i$, is a near-miss the model should reject. That is the whole signal. Two things are absent. There is no list of categories, and there is no single fixed label per image: the same image with a different true caption is simply a different training example. The model is learning a relationship between two spaces, not a partition of one.

To compare an image with a caption at all, they have to live somewhere comparable. That somewhere is a shared embedding space.

Two encoders, one space

CLIP has two networks. An image encoder (a ResNet or a Vision Transformer) turns a picture into a vector. A text encoder (a Transformer over byte-pair tokens: BPE, text split into subword pieces from a fixed vocabulary built by merging frequent character pairs) turns a caption into a vector. Each is then linearly projected into one shared space of dimension $d_e$ , and crucially the two outputs land in the same space, so an image vector and a caption vector are directly comparable.

Comparable how? By cosine similarity, the cosine of the angle between two vectors. CLIP $L_2$ -normalizes every embedding to unit length, which throws away magnitude and keeps only direction. On the unit sphere, the dot product of two vectors is their cosine similarity:

$$\hat{\mathbf{u}} = \frac{\mathbf{u}}{\lVert \mathbf{u}\rVert}, \quad \hat{\mathbf{v}} = \frac{\mathbf{v}}{\lVert \mathbf{v}\rVert} \;\Longrightarrow\; \cos(\theta) = \hat{\mathbf{u}}\cdot\hat{\mathbf{v}} \in [-1, 1] \tag{1}$$

A value near $1$ means the two vectors point the same way (a good match), near $0$ means they are unrelated, and negative means they point apart. The training goal, stated in this geometry, is simple: make a matched image and caption point the same way, and make mismatched ones point apart. Below, images (amber) and captions (teal) sit as unit vectors in one space. Pick an image and watch the lines fanning out from it glow brightest toward its matching caption: the brightest one is the highest cosine similarity, the model's best match.

Figure 1 · one shared space

Each image and each caption is a unit vector in the same space. The selected image's spokes glow by cosine similarity to each caption; its true match points the same way (cosine near 1). Training is just "rotate matched pairs together, push mismatched pairs apart."
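Equation (1) can be sketched in a few lines. This is a minimal illustration in plain Python, assuming short lists stand in for embedding vectors; the helper names `l2_normalize` and `cosine` and the toy vectors are ours, not from the paper or the released code.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length, discarding its magnitude."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity: the dot product of the two unit vectors."""
    u_hat, v_hat = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u_hat, v_hat))

# Toy 3-d "embeddings", made up for illustration.
image = [2.0, 0.0, 0.0]
matching_caption = [5.0, 0.0, 0.0]    # same direction, different length
unrelated_caption = [0.0, 3.0, 0.0]   # orthogonal to the image
opposite_caption = [-1.0, 0.0, 0.0]   # points the other way

sim_match = cosine(image, matching_caption)
sim_unrelated = cosine(image, unrelated_caption)
sim_opposite = cosine(image, opposite_caption)
```

The matching caption scores 1 even though its raw length differs from the image vector's, which is the point of normalizing first: only direction counts.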

Here the code and the paper differ on a small point. The text vector is read off at the end-of-text token: after the caption runs through the text Transformer, the activations at the final token are layer-normalized and projected into the shared space. The paper calls that token [EOS]; the released code grabs it with text.argmax(dim=-1), which works because the end-of-text token has the largest id in the byte-pair vocabulary, so the argmax lands on its position.
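The argmax trick is easy to see on a padded token sequence. The sketch below is plain Python, not the released PyTorch code; it assumes the start-of-text and end-of-text ids are 49406 and 49407 with zero padding to a context length of 77, and the word-piece ids in the caption are made up.

```python
# Assumed special ids: start-of-text 49406, end-of-text 49407 (the largest
# id in the vocabulary), padding 0. The ids in between are made up.
SOT, EOT, PAD = 49406, 49407, 0
CONTEXT_LENGTH = 77

def pad(tokens, length=CONTEXT_LENGTH):
    """Right-pad a token list with PAD up to the context length."""
    return tokens + [PAD] * (length - len(tokens))

def eot_position(token_ids):
    """Index of the largest id: a pure-Python analogue of text.argmax(dim=-1)."""
    return max(range(len(token_ids)), key=lambda i: token_ids[i])

caption = pad([SOT, 320, 1125, 539, 320, 1929, EOT])
pos = eot_position(caption)   # position whose activations become the text vector
```

Because every ordinary token and the padding have smaller ids than the end-of-text token, the argmax over ids finds its position without storing caption lengths separately.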

The contrastive grid

Now the loss. It is what makes CLIP cheap to train. Take a batch of $N$ pairs. Embed and normalize all $N$ images into $\mathbf{I}_1,\dots,\mathbf{I}_N$ and all $N$ captions into $\mathbf{T}_1,\dots,\mathbf{T}_N$ . A batch of $N$ pairs holds far more supervision than $N$ positive examples, because every mismatched image-caption pairing inside it is an unlabeled wrong answer: the $N$ matched pairings are right, the other $N^2 - N$ are wrong, and the training signal is to tell them apart. Compute the cosine similarity of every image against every caption. That is an $N\times N$ grid $S$ with $S_{ij} = \mathbf{I}_i\cdot\mathbf{T}_j$ .

The structure of the answer key explains why the loss works. The $N$ cells on the diagonal, $S_{ii}$ , are the real pairs and should be large. The $N^2 - N$ off-diagonal cells are impostor pairings and should be small. CLIP does not need anyone to label the negatives; every other caption in the batch is automatically a negative for a given image. A batch of $N$ gives you one positive and $N-1$ negatives per image without any labeling.

To turn the grid into a loss, read each row as a classification problem. Row $i$ holds image $i$ 's similarity to all $N$ captions; the correct answer is caption $i$ . So run a softmax along the row (it exponentiates the scores and normalizes them into a probability over the $N$ captions) and penalize it with cross-entropy whenever the mass lands anywhere but caption $i$ . Do the same down each column (each caption picking its image). Average the two directions, and that is the symmetric loss:

$$\mathcal{L} = \tfrac{1}{2}\big(\mathcal{L}_{\text{img}\to\text{txt}} + \mathcal{L}_{\text{txt}\to\text{img}}\big), \qquad \mathcal{L}_{\text{img}\to\text{txt}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\,S_{ii}/\tau}}{\sum_{j=1}^{N} e^{\,S_{ij}/\tau}} \tag{2}$$

The first term asks each image to pick its caption out of the $N$ ; the second asks each caption to pick its image. The $\tau$ is the temperature. Drag the toggle below to see the grid as raw similarities, then as the row softmax and the column softmax. The loss is smallest exactly when both softmaxes put all their mass on the amber diagonal.

Figure 2 · the contrastive grid

Every image against every caption: an N×N grid of cosine similarities. The diagonal is the N real pairs (push up); the N²−N off-diagonal cells are false pairs (push down). Switch to the row / column softmax to see what the symmetric cross-entropy is steering onto the diagonal. The dog/cat cells stay high against each other, as confusable concepts do.
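Equation (2) translates almost directly into code. The following is a minimal sketch in plain Python that takes a precomputed similarity grid and a temperature; the function names and the two 3×3 example grids are hypothetical, chosen only to show that a strong diagonal gives a small loss and an uninformative grid gives $\log N$.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def symmetric_contrastive_loss(S, tau):
    """Equation (2): mean of image->text and text->image cross-entropy.

    S[i][j] is the cosine similarity of image i and caption j. The correct
    answer for row i and for column i is index i (the diagonal).
    """
    n = len(S)
    rows = [[s / tau for s in row] for row in S]
    cols = [[rows[i][j] for i in range(n)] for j in range(n)]  # transpose
    loss_img_to_txt = -sum(log_softmax(rows[i])[i] for i in range(n)) / n
    loss_txt_to_img = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_img_to_txt + loss_txt_to_img)

# Hypothetical grids: one with a strong diagonal, one with no structure.
aligned = [[0.9, 0.1, 0.0],
           [0.2, 0.8, 0.1],
           [0.0, 0.1, 0.9]]
uniform = [[0.3, 0.3, 0.3],
           [0.3, 0.3, 0.3],
           [0.3, 0.3, 0.3]]

loss_aligned = symmetric_contrastive_loss(aligned, tau=0.07)
loss_uniform = symmetric_contrastive_loss(uniform, tau=0.07)
```

When every cell is equal, each softmax is uniform over the $N$ candidates and the loss is exactly $\log N$, the score of a model that has learned nothing about which caption goes with which image.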

This batch-of-negatives objective is not new. It is the multi-class $N$-pair loss from metric learning (Sohn, 2016), popularized for contrastive learning as InfoNCE (Oord et al., 2018), and applied to image-text pairs in medical imaging by ConVIRT (Zhang et al., 2020). CLIP's contribution is to strip it down and scale it. The authors train from scratch, use only a linear projection into the shared space (no nonlinear projection head), and lean on the sheer size of the dataset so that over-fitting is not a concern.

Temperature: one learned knob

The temperature $\tau$ in (2) controls how sharp the softmax is. Dividing the similarities by a small $\tau$ spreads them far apart before the exponential, so the softmax becomes confident and spiky; a large $\tau$ squashes them together and the distribution flattens toward uniform. Equivalently, CLIP multiplies the similarities by a scale $1/\tau$ , which the code stores in log space as logit_scale so it is always positive:

$$\text{logits} = (\hat{\mathbf{I}}\,\hat{\mathbf{T}}^{\top})\cdot \exp(\texttt{logit\_scale}), \qquad \exp(\texttt{logit\_scale}) = \frac{1}{\tau} \tag{3}$$

Note that $\tau$ is not a hyper-parameter you tune. It is a single learned scalar, optimized by gradient descent like any weight. The code initializes it to log(1 / 0.07), so at the start $\tau = 0.07$ and the scale $\exp(\texttt{logit\_scale}) = 1/0.07 \approx 14.3$ . Left unchecked, gradient descent pushes the scale toward infinity (an infinitely confident softmax), so CLIP clamps it: $\exp(\texttt{logit\_scale})$ is not allowed above $100$ , which means $\tau \ge 0.01$ . The authors found that clamp necessary to keep training stable.

Drag the temperature slider and watch one image's probabilities over five captions. The slider hits a floor at $\tau = 0.01$ , the clamp CLIP enforces:

Figure 3 · the temperature knob

One image's softmax over five captions, as τ varies. Small τ sharpens the distribution onto the best match; large τ flattens it toward uniform. CLIP learns τ but clamps the scale exp(logit_scale) at 100, so τ cannot drop below 0.01.
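The effect of the temperature, and of the clamp, can be reproduced numerically. This is a sketch in plain Python under stated assumptions: the five similarity values are made up, and `clamped_scale` is a hypothetical helper that illustrates the cap on $\exp(\texttt{logit\_scale})$ described above, not the authors' training code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

MAX_SCALE = 100.0  # the clamp described above: exp(logit_scale) <= 100

def clamped_scale(logit_scale):
    """Return exp(logit_scale), capped at MAX_SCALE (so tau >= 0.01)."""
    return min(math.exp(logit_scale), MAX_SCALE)

# Made-up cosine similarities of one image against five captions.
sims = [0.30, 0.25, 0.10, 0.05, 0.00]

init_scale = clamped_scale(math.log(1 / 0.07))   # about 14.3, i.e. tau = 0.07
runaway_scale = clamped_scale(10.0)              # exp(10) is far above 100, so capped

p_flat = softmax([s * 1.0 for s in sims])             # tau = 1: close to uniform
p_init = softmax([s * init_scale for s in sims])      # tau = 0.07
p_sharp = softmax([s * runaway_scale for s in sims])  # tau = 0.01: spiky
```

The same five similarities give the top caption a progressively larger share of the probability as the scale grows, which is why an unclamped scale would drift toward an infinitely confident softmax.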

One batch, with shapes

Let me make the abstract grid concrete with one training step of a ViT-B/32 CLIP (ViT = Vision Transformer; "B" = the base size, "/32" = 32x32-pixel image patches). The shared embedding dimension is $d_e = 512$ . The batch is large on purpose: CLIP uses $N = 32{,}768$ pairs at once, because a bigger batch means more negatives per image, which makes the contrastive task harder and the signal richer. (A batch this large only fits because the similarity grid is sharded across GPUs, as the code note below explains.)

First shrink the batch to four pairs so the grid fits on a page. Say the images are a dog, a cat, a beach, and a bicycle, each with its own true caption, and after encoding and normalizing you compute the cosine similarity of every image against every caption. That gives you a $4\times 4$ grid where the diagonal, image $i$ against its own caption $i$ , should be the largest in its row. Plausible numbers, with the true match on the diagonal:

dog image vs [dog, cat, beach, bicycle] captions: $0.90,\ 0.55,\ 0.08,\ 0.20$
cat image vs the same four: $0.50,\ 0.88,\ 0.10,\ 0.15$
beach image: $0.05,\ 0.07,\ 0.92,\ 0.12$
bicycle image: $0.18,\ 0.14,\ 0.16,\ 0.85$

The diagonal entries (0.90, 0.88, 0.92, 0.85) are high because each image matches its own caption; the off-diagonal entries are lower, with the dog/cat cells (0.55 and 0.50) the highest off-diagonal pair because those two concepts are genuinely confusable. To turn the dog row into a loss, scale it by the temperature and run a softmax across the four captions. With the scale at its initial value of $1/0.07 \approx 14.3$, the row $[0.90, 0.55, 0.08, 0.20]$ becomes logits $[12.9, 7.9, 1.1, 2.9]$, and the softmax of those puts roughly 0.99 of its mass on the dog caption and almost nothing on the other three. The cross-entropy loss for that row is then about $-\log 0.99 \approx 0.01$, small because the model already matched the right caption; if instead the cat caption had scored highest, the dog-caption probability would be tiny and $-\log$ of it large, which is the penalty that pushes the matched pair up and the impostors down. The same softmax-and-cross-entropy runs down each column, so every caption also has to pick its own image. Now scale this from four pairs to the real batch:

Encode the batch. Images go through the ViT to $\mathbf{I} \in \mathbb{R}^{32768 \times 512}$ , captions through the text Transformer to $\mathbf{T} \in \mathbb{R}^{32768 \times 512}$ .
$L_2$ -normalize each row, so every one of the 65,536 vectors sits on the unit sphere.
Build the grid $S = \hat{\mathbf{I}}\,\hat{\mathbf{T}}^{\top} \in \mathbb{R}^{32768 \times 32768}$ , then scale by $\exp(\texttt{logit\_scale})$ . That is over a billion cosine similarities, with $32{,}768$ correct pairs on the diagonal and about $1.07\times 10^{9}$ false pairs off it.
The label vector is $[0, 1, 2, \dots, 32767]$ : row $i$ 's answer is column $i$ . Cross-entropy along rows, cross-entropy down columns, average. One scalar loss.

That is the entire forward pass, nine lines in the paper's notation:

# CLIP core, from the paper's Figure 3 (numpy-like)
I_f = image_encoder(I)            # [n, d_i]   images in a batch
T_f = text_encoder(T)             # [n, d_t]   their captions

# project into the shared space, then L2-normalize to the unit sphere
I_e = l2_normalize(I_f @ W_i, axis=1)   # [n, d_e]
T_e = l2_normalize(T_f @ W_t, axis=1)   # [n, d_e]

# every image vs every caption: an n x n grid of cosine sims
logits = (I_e @ T_e.T) * np.exp(t)      # t = learned log of the inverse temperature

# the right answer for row i (and column i) is i: the diagonal
labels = np.arange(n)
loss_i = cross_entropy(logits, labels, axis=0)   # caption -> image
loss_t = cross_entropy(logits, labels, axis=1)   # image -> caption
loss   = (loss_i + loss_t) / 2

The batch is so large that the similarity computation is sharded across GPUs, each one holding only the slice of the grid its local embeddings need. The parameter $t$ in that pseudocode is the logit_scale from (3): the log of the inverse temperature, so $\exp(t) = 1/\tau$. CLIP trained eight models this way (five ResNets and three Vision Transformers) for 32 epochs over the 400M pairs. The largest ResNet, RN50x64, took 18 days on 592 V100 GPUs; the largest Vision Transformer, ViT-L/14, took 12 days on 256 V100s, with one extra epoch at 336-pixel resolution producing ViT-L/14@336px, the best-performing model.
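The four-pair numbers and the full-batch counts from earlier in this section can be checked with a short script. This is a minimal sketch in plain Python; the `softmax` helper is ours, and the scale is assumed to sit at its initial value of 1/0.07.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

scale = 1 / 0.07   # exp(logit_scale) at initialization, about 14.3

# Dog-image row from the four-pair example: [dog, cat, beach, bicycle] captions.
dog_row = [0.90, 0.55, 0.08, 0.20]
dog_logits = [s * scale for s in dog_row]
dog_probs = softmax(dog_logits)
dog_loss = -math.log(dog_probs[0])   # cross-entropy with the dog caption as target

# Full-batch bookkeeping for N = 32,768.
N = 32768
total_cells = N * N            # every image against every caption
false_pairs = total_cells - N  # off-diagonal cells
labels = list(range(N))        # row i's correct column is i
```

The dog caption ends up with a little over 0.99 of the probability mass, and the grid for the real batch has 1,073,741,824 cells, of which all but 32,768 are negatives.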

The text encoder writes a classifier

Training is done. The model can score how well any image matches any caption. That is already a classifier, if you feed it the right captions.

Suppose you want to classify ImageNet, 1,000 classes. Wrap each of the 1,000 class names in a sentence like "A photo of a {label}.", and run all 1,000 through the frozen text encoder. You get 1,000 unit vectors, one per class. Now embed your image once into its own unit vector. The cosine similarity of the image vector against each of the 1,000 class vectors gives 1,000 scores; the largest one is your prediction. No training, no labeled images, no gradient step.

Figure 4 · zero-shot classification

The text encoder turns each class name (wrapped in "A photo of a label.") into a classifier weight vector. The image is embedded once; the class with the highest cosine similarity wins. Swap the image and the argmax moves. The classifier is written in words, not fitted on labels.

The zero-shot recipe is shorter than the training loss:

# zero-shot ImageNet, no training, no labeled images
classes = ["tabby cat", "golden retriever", ...]      # 1000 names
prompts = [f"A photo of a {c}." for c in classes]      # wrap each
W = l2_normalize(text_encoder(prompts))                # [1000, d_e] classifier

x = l2_normalize(image_encoder(image))                 # [d_e] one image
logits = x @ W.T                                       # cosine to each class
pred   = classes[logits.argmax()]                      # nearest wins

The paper has a clean way of reading this. The text encoder is a hypernetwork: it generates the weights of a linear classifier from the words describing the classes. A normal classifier learns its weight vectors by gradient descent on labeled examples; here the text encoder simply writes them, one embedding per class name, so a new task costs one sentence per class instead of a training run. That is all "hypernetwork" means here: a network whose output is another network's weights. And training CLIP, viewed through this lens, is optimizing a randomly drawn proxy classification task at every step, one with 32,768 classes (the captions in the batch) and a single example each (the matched image). The pre-training task and the zero-shot task are the same shape; the only difference at test time is that you choose the captions.
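The "text encoder writes the weights" reading can be made concrete with toy numbers. In this sketch, written in plain Python, the class vectors and the image vector are hand-picked stand-ins for real encoder outputs, and `zero_shot_predict` is a hypothetical helper; only the structure (normalize, dot product, argmax) follows the recipe above.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_predict(image_vec, class_names, class_vecs):
    """Pick the class whose text embedding has the highest cosine similarity.

    class_vecs plays the role of the weight matrix W of a linear classifier:
    one row per class, produced by a text encoder rather than by training.
    """
    x = l2_normalize(image_vec)
    W = [l2_normalize(w) for w in class_vecs]
    logits = [sum(a * b for a, b in zip(x, w)) for w in W]
    best = max(range(len(logits)), key=lambda k: logits[k])
    return class_names[best], logits

# Hypothetical stand-ins for text_encoder("A photo of a {label}.") outputs.
class_names = ["tabby cat", "golden retriever", "bicycle"]
class_vecs = [[0.9, 0.1, 0.0],
              [0.1, 0.9, 0.1],
              [0.0, 0.1, 0.9]]

# Hypothetical stand-in for image_encoder(image) on a dog photo.
image_vec = [0.2, 1.5, 0.1]

pred, logits = zero_shot_predict(image_vec, class_names, class_vecs)

# Adding a class costs one more row of W, with no retraining.
class_names.append("beach")
class_vecs.append([0.5, 0.0, 0.5])
pred_after, _ = zero_shot_predict(image_vec, class_names, class_vecs)
```

Extending the label set is a list append: the new row competes in the same argmax, and nothing about the image side changes.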

On ImageNet this zero-shot classifier reaches 76.2% top-1 accuracy, matching a fully supervised ResNet-50 trained on all 1.28 million labeled ImageNet images. Its top-5 accuracy is 95%, on par with Inception-V4. The earlier best zero-shot attempt (Visual N-Grams) managed 11.5%.

Prompts and ensembles

Why "A photo of a {label}." rather than the bare class name? Because the captions CLIP trained on are rarely a single word. They are sentences. Feeding the text encoder a lone word like dog puts it in a part of the space it rarely saw during training. Wrapping it in a short sentence closes that distribution gap, and on ImageNet it alone is worth about a point of accuracy.

Wording also disambiguates. A common failure is polysemy: ImageNet has both construction cranes and cranes that fly, and Oxford Pets has "boxer," which the text encoder might read as the dog or the athlete. A prompt that supplies context fixes it. "A photo of a {label}, a type of pet." nudges "boxer" toward the breed; "a satellite photo of a {label}." helps on aerial imagery; putting quotes around text helps on OCR (optical character recognition, reading printed or rendered text in an image).

The second lever is ensembling. Build several classifiers from different prompts ("A photo of a big {label}.", "A photo of a small {label}.") and average their class vectors in the embedding space, before the softmax. Because you average the vectors and not the predictions, the cost at inference is the same as a single classifier once the averaged vectors are cached. On ImageNet, an ensemble of 80 prompts adds about 3.5% over the single default prompt. Prompt engineering and ensembling together add nearly 5% with no change to the trained weights.
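Ensembling in embedding space amounts to averaging per-class vectors and renormalizing. The sketch below uses plain Python; `toy_text_encoder` is a hypothetical stand-in with no semantic meaning, used only so the example runs without a model, and normalizing each prompt embedding before averaging is our assumption about a sensible ordering.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_text_encoder(prompt, dim=8):
    """Hypothetical stand-in for a real text encoder: a deterministic
    character-bucket vector with no semantic meaning."""
    vec = [0.0] * dim
    for i, ch in enumerate(prompt):
        vec[i % dim] += ord(ch)
    return vec

def ensemble_class_vector(label, templates, encode=toy_text_encoder):
    """Embed one class under several prompts, average, and renormalize."""
    embeddings = [l2_normalize(encode(t.format(label=label))) for t in templates]
    dim = len(embeddings[0])
    mean = [sum(e[k] for e in embeddings) / len(embeddings) for k in range(dim)]
    return l2_normalize(mean)

templates = [
    "A photo of a {label}.",
    "A photo of a big {label}.",
    "A photo of a small {label}.",
]
classes = ["boxer", "crane"]

# Computed once and cached: one vector per class, however many templates.
W = [ensemble_class_vector(c, templates) for c in classes]
```

However many templates go in, `W` still has one row per class, so classifying an image costs the same as with a single prompt.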

Why it survives a new dataset

The result that made CLIP famous is not the headline ImageNet number. It is what happens when the test images shift. A standard ImageNet model does well on ImageNet-style photos. Show it the same 1,000 classes rendered as sketches (ImageNet Sketch), or artistic renditions (ImageNet-R), or the harder real-world crops of ObjectNet, and it falls apart, because it learned features that work on the exact ImageNet distribution and partly memorized that distribution's quirks.

CLIP, classifying zero-shot, holds up. It never trained on the ImageNet distribution, so it had nothing to overfit to. The paper measures this with effective robustness: plot accuracy on ImageNet against average accuracy on seven natural distribution shifts. A perfectly robust model would sit on the line $y = x$. Standard ImageNet models fall on a line far below it; zero-shot CLIP sits much closer to the ideal. The cleanest comparison is a ResNet-101 and a CLIP model that tie at 76.2% on ImageNet itself, then diverge on every variant of it. Rendered as art (ImageNet-R): ResNet-101 drops to 37.7%, CLIP holds 88.9%. On ObjectNet: 32.6% versus 72.3%. On ImageNet Sketch: 25.2% versus 60.2%. Same ImageNet score, yet CLIP leads by 35 to 51 points on these three. Across the full suite, zero-shot CLIP shrinks the gap between ImageNet accuracy and accuracy under distribution shift by up to 75%.

Figure 5 · the robustness gap

Per dataset, a ResNet-101 against zero-shot CLIP. They tie at 76.2% on ImageNet itself, then split on every variant: 88.9 vs 37.7 on ImageNet-R, 72.3 vs 32.6 on ObjectNet, 77.1 vs 2.7 on ImageNet-A. Same ImageNet score, far apart on every variant. Numbers are from the paper's Figure 13.
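The per-dataset gaps follow from simple subtraction on the figures quoted above. This is a small arithmetic sketch in Python; the dictionary layout and variable names are ours, and it computes plain differences, not the paper's effective-robustness metric.

```python
# Top-1 accuracy (%) quoted above, as (ResNet-101, zero-shot CLIP).
accuracy = {
    "ImageNet":        (76.2, 76.2),
    "ImageNet-R":      (37.7, 88.9),
    "ObjectNet":       (32.6, 72.3),
    "ImageNet Sketch": (25.2, 60.2),
    "ImageNet-A":      (2.7, 77.1),
}

# CLIP's lead over ResNet-101 in percentage points, per dataset.
gap = {name: round(clip - resnet, 1) for name, (resnet, clip) in accuracy.items()}

# How far each model falls from its own ImageNet score on each dataset.
base_resnet, base_clip = accuracy["ImageNet"]
drop_resnet = {n: round(base_resnet - r, 1) for n, (r, _) in accuracy.items()}
drop_clip = {n: round(base_clip - c, 1) for n, (_, c) in accuracy.items()}
```

A negative entry in `drop_clip` means CLIP scores higher on that shifted set than on ImageNet itself, as it does on ImageNet-R.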

The authors flag one more result directly. If you take CLIP and fit a supervised linear classifier on its features using the ImageNet training set, ImageNet accuracy jumps 9.2 points to 85.4%. Average accuracy on the distribution shifts does not improve; the paper reports that it slightly decreases. The gain concentrates right on the ImageNet distribution and does not generalize. Robustness came from not training on the target distribution, and you can train it back out.

What it can do, and what it cannot

CLIP shows that a simple objective, predicting which caption goes with which image, scaled to 400 million web pairs, produces a vision model you can steer with words. The zero-shot classifier matches supervised ResNet-50 on ImageNet, stays robust where supervised models crumble, and transfers across more than 30 datasets without ever fitting to them. Later work kept reusing the trained encoders: CLIP is the direct ancestor of the text-conditioned image generators (DALL-E 2, Stable Diffusion) that lean on a frozen CLIP text embedding.

The paper names the limits plainly. Zero-shot CLIP is weak on fine-grained classification (telling apart car models, flower species, aircraft variants) and on abstract tasks like counting objects. On data truly outside its web-scraped world it generalizes poorly: it manages only 88% on handwritten MNIST digits, worse than logistic regression on raw pixels, because almost nothing like MNIST appears in its training data. It can only ever choose from the class names you give it, so it is not a generative captioner. And it is data-hungry and compute-hungry: the authors estimate roughly a 1,000x increase in compute would be needed for zero-shot CLIP to reach overall state of the art, which is infeasible on current hardware.

The paper's contribution is to treat the caption as the label. Trace it back through the paper. A fixed label set caps what a model can recognize. Natural language is an open, free label that scale makes practical. A contrastive match-the-pair objective learns a shared space cheaply, with a batch supplying its own negatives and a single learned temperature setting the sharpness. Once that space exists, the text encoder writes a classifier from words alone.

Provenance · Verified against primary literature

CLIP (2021) · Radford et al.: the contrastive objective, the symmetric loss (Fig. 3 pseudocode), the learned temperature, and the zero-shot transfer results.

model.py (code) · Official OpenAI repo. logit_scale = log(1/0.07), the L2 normalization, and logits = logit_scale.exp() * image_features @ text_features.t().

N-pair / InfoNCE · Sohn (2016) and Oord et al. (2018): the batch-of-negatives contrastive objective CLIP adopts.

ConVIRT (2020) · Zhang et al.: the contrastive image-text objective in medical imaging that CLIP simplifies and scales.

ViT (2020) · Dosovitskiy et al.: the image encoder for CLIP’s strongest models.

Caveat · The paper’s Fig. 3 pseudocode applies a separate learned projection (W_i, W_t) and then L2-normalizes; the released model.py folds that projection into each encoder and normalizes the encoder output directly. Same math, different code layout. The text feature is taken at the end-of-text token: the paper says [EOS], the code uses text.argmax(dim=-1), which finds it because the EOT token has the largest id in the BPE vocab.

Questions you might still have

Why is matching a whole caption easier than predicting its exact words?
Generating the exact words is a hard, high-entropy task: a million captions could fit one image. CLIP only asks which caption out of the batch goes with this image, a multiple-choice question. The paper measured a 4x jump in training efficiency from making that swap.

The image encoder never sees a class label. How does it classify?
It does not, on its own. At test time the text encoder reads your class names and emits one vector per class. Those vectors are the classifier. The image encoder just produces one image vector, and the nearest class vector wins. The classifier is built from words, not fitted on labeled images.

Why learn the temperature instead of fixing it?
It sets how sharp the softmax is, which interacts with the learning rate and the geometry of the embedding space. Letting the model tune it (as a log-scale multiplier) removes a finicky hyper-parameter. The only guardrail is a clamp: the scale exp(logit_scale) cannot exceed 100, which the authors found necessary to keep training stable.

Does it actually understand the image, or just match strings?
Neither fully. CLIP learns a shared geometry where related images and texts point the same way, which is why prompt wording matters and why it stumbles on tasks far from web text (it gets 88% on MNIST, beaten by logistic regression on raw pixels). It is a strong, broad pattern matcher, not a reasoner.

Footnotes & further reading

The paper: Radford, Kim, Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark, Krueger, Sutskever, Learning Transferable Visual Models From Natural Language Supervision (OpenAI, ICML 2021). Code and blog post.
The batch-of-negatives objective: Sohn, Improved Deep Metric Learning with Multi-class N-pair Loss (2016), and Oord, Li, Vinyals, Representation Learning with Contrastive Predictive Coding (InfoNCE, 2018).
The contrastive image-text precursor CLIP simplifies: Zhang et al., Contrastive Learning of Medical Visual Representations from Paired Images and Text (ConVIRT, 2020).
The image encoder for the strongest models: Dosovitskiy et al., An Image Is Worth 16x16 Words (the Vision Transformer, 2020).
A downstream use of the frozen CLIP text embedding: Ramesh et al., Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2, 2022).
