An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
An image is just a sequence of patches.
Cut a picture into a grid of 16×16-pixel patches, treat each patch as a word, and feed the sequence to the same Transformer that runs language models. Change nothing else. With enough pre-training data, it beats the convolutional networks that ruled vision for a decade.
What if you fed an image to a plain language Transformer, with none of the machinery vision spent a decade building?
By 2020 the Transformer had quietly won natural language processing. The recipe was settled: pre-train one big attention-based model on a mountain of text, then fine-tune it on whatever task you cared about. Vision had not gone that way. Image models were still convolutional networks, the descendants of the same idea that won ImageNet in 2012, and attempts to bring attention into vision mostly bolted it onto a convolutional backbone or replaced a few pieces while keeping the overall shape.
This paper makes the blunt bet. Do the most obvious thing and see if it works: chop the image into patches, line the patches up as a sequence, and run a standard Transformer encoder over them, the kind shipped for language, with as few changes as possible. What is surprising is how well it does. Given enough data to pre-train on, it wins. The authors put the finding in one line: large scale training trumps inductive bias.
That sentence needs unpacking, because it hides the catch. Trained on a normal-sized image dataset, this patch-and-Transformer model is mediocre, a few points behind a convolutional network of similar size. The convolutional priors it threw away were doing real work. Only when you pre-train on hundreds of millions of images does the order flip and the Transformer pull ahead. The rest of this piece builds that story: how an image becomes a sequence, what exactly ViT discards, and why scale is what pays it back.
Vision skipped the Transformer
Start with why this was an open question at all. A Transformer's core operation is self-attention: every element of a sequence looks at every other element and pulls in a weighted blend of them. That is wonderful for flexibility and terrible for cost. Comparing every element to every other is quadratic, so attention over a sequence of length $N$ costs on the order of $N^2$.
For a sentence, $N$ is a few hundred words and the quadratic cost is fine. For an image, the natural sequence is the pixels, and that is a disaster. A modest 224×224 image has about 50,000 pixels, so attention over pixels would compare roughly 2.5 billion pairs in every layer. No accelerator wants that.
So before ViT, applying Transformers to images meant dodging the wall. People restricted attention to local neighborhoods, or approximated it with sparse and axial patterns, or used it only on top of a convolutional network that had already shrunk the spatial size. Several of these worked on paper but needed custom kernels that did not run efficiently on real hardware. In large-scale image recognition, classic ResNet-style convolutional networks were still the state of the art.
ViT's trick for the problem is simple: make $N$ small by not using pixels. Group the pixels into patches, and let each patch, not each pixel, be one element of the sequence.
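The arithmetic behind that wall is quick to check. Here is a minimal sketch, assuming a square 224×224 input and 16×16-pixel patches as in the text; the helper name `attention_pairs` is made up for illustration, and it counts pairs only, not real FLOPs.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs full self-attention scores in one layer."""
    return seq_len * seq_len

side, patch = 224, 16
n_pixels = side * side                  # one token per pixel
n_patches = (side // patch) ** 2        # one token per 16x16 patch

pixel_pairs = attention_pairs(n_pixels)
patch_pairs = attention_pairs(n_patches)

print(n_pixels, pixel_pairs)            # 50176 2517630976
print(n_patches, patch_pairs)           # 196 38416
print(pixel_pairs // patch_pairs)       # 65536, i.e. (16 * 16) ** 2
```

Each patch swallows 256 pixels, so the sequence shrinks by 256 and the pair count by 256 squared.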
An image as a sequence of patches
Start with the image. Cut it into a grid of fixed-size square patches, say 16×16 pixels each. A 224×224 image becomes a 14×14 grid of patches, which is 196 of them. Flatten each patch's pixels into a single vector. A 16×16 RGB patch holds $16 \times 16 \times 3 = 768$ numbers, so each patch is now a 768-length vector. The image has become a short sequence of vectors, the same kind of object a language Transformer eats.
Written out, the reshape is just bookkeeping:

$$x \in \mathbb{R}^{H \times W \times C} \;\longrightarrow\; x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}, \qquad N = \frac{HW}{P^2}$$

$(H, W)$ is the image resolution, $C$ the color channels, $(P, P)$ the patch size, and $N$ the number of patches, which is also the sequence length the Transformer sees. The count $N$ is the dial that matters. Halve the patch size and you quadruple the number of patches, and the attention cost grows like $N^2$ on top of that. This is why a model's name carries its patch size: ViT-L/16 is the Large variant with 16×16 patches, and ViT-L/32 is the same body with coarser 32×32 patches, a quarter as many tokens and far cheaper.
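The reshape can be sketched in a few lines of plain Python. This is an illustration, not the paper's code: the image is a nested list indexed `[row][col][channel]`, the function name `patchify` is made up, and the left-to-right, top-to-bottom scan order is an assumption.

```python
def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patches.

    Returns N rows of length patch * patch * C, scanning the grid
    left to right, then top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    flat.extend(image[row][col])   # append the C channel values
            patches.append(flat)
    return patches

# Toy image whose "pixels" record their own coordinates: [row, col, 0].
image = [[[r, c, 0] for c in range(224)] for r in range(224)]

patches16 = patchify(image, 16)
patches32 = patchify(image, 32)
print(len(patches16), len(patches16[0]))   # 196 768
print(len(patches32), len(patches32[0]))   # 49 3072
```

Doubling the patch side from 16 to 32 cuts the token count by four, from 196 to 49, while each flattened patch grows fourfold.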
The patches are still raw pixels, so the first learned step is a single linear projection. Multiply each flattened patch by a trainable matrix $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ to map it into the model's working width $D$ (the constant vector size the Transformer carries through every layer, 768 for ViT-Base). The output is the patch embedding: one $D$-dimensional token per patch.
Two pieces still need adding before this is a real input sequence.
A class token. Borrowing directly from BERT, ViT prepends one extra learnable vector to the front of the sequence, the [class] token. It is not tied to any patch. Its job is to be a place to accumulate a summary: after the whole encoder runs, the final state of this one token is read off as the image representation and sent to the classifier. The patches do the seeing; the class token does the reporting.
Position embeddings. Self-attention has no built-in sense of order. It treats its input as a set, so shuffling the patches would just shuffle the outputs the same way, carrying no information about where anything sits (the technical word is permutation-equivariant). That is fatal for an image, where position is everything. ViT fixes it the standard way: add a learnable position embedding to each token, a vector that says "you are slot $i$." Notably, these are plain 1D embeddings, one per slot, with no knowledge that the slots form a 2D grid. We will come back to what they learn, because it is one of the nicer surprises in the paper.
Putting the three pieces together gives the first equation of the paper, the input to the encoder:

$$z_0 = [\,x_{\text{class}};\; x_p^1 E;\; x_p^2 E;\; \cdots;\; x_p^N E\,] + E_{pos} \tag{1}$$

with the projection $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ and the position table $E_{pos} \in \mathbb{R}^{(N+1) \times D}$. The $N + 1$ is worth pausing on: the class token makes the sequence one longer than the patch count, so 196 patches become 197 tokens at the encoder's input.
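To see the shapes line up, here is a small sketch of the encoder input in plain Python. The sizes are toy values, the random initialization scales are assumptions, and `embed` and `build_input` are names invented for this example; a trained ViT would learn the projection, the class token, and the position table.

```python
import random

def embed(patch, E):
    """Project one flattened patch (length P*P*C) to width D with matrix E."""
    return [sum(x * row[j] for x, row in zip(patch, E)) for j in range(len(E[0]))]

def build_input(patches, E, cls_token, pos_emb):
    """Prepend the [class] token to the embedded patches, then add positions."""
    tokens = [cls_token] + [embed(p, E) for p in patches]
    return [[t + e for t, e in zip(tok, pos)] for tok, pos in zip(tokens, pos_emb)]

random.seed(0)
N, patch_dim, D = 4, 12, 8        # toy sizes: a 2x2 grid of 2x2 RGB patches
patches = [[random.random() for _ in range(patch_dim)] for _ in range(N)]
E = [[random.gauss(0, 0.02) for _ in range(D)] for _ in range(patch_dim)]
cls_token = [0.0] * D             # zero here only to keep the example easy to check
pos_emb = [[random.gauss(0, 0.02) for _ in range(D)] for _ in range(N + 1)]

z0 = build_input(patches, E, cls_token, pos_emb)
print(len(z0), len(z0[0]))        # 5 8: N + 1 tokens, each of width D
```

Four patches go in and five tokens come out, the same off-by-one that turns 196 patches into 197 tokens.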
That is the entire vision-specific part of ViT. From here on there is nothing about images at all. The sequence goes into a Transformer that does not know or care that its tokens came from a picture.
A standard Transformer, unchanged
The point of pride in the paper is how little happens next. The encoder is the 2017 Transformer, the same one behind every language model, used almost out of the box. It is a stack of identical blocks, and each block is two operations: multi-head self-attention (MSA) and a small per-token MLP, each wrapped in a LayerNorm and a residual connection.

$$z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \qquad \ell = 1, \ldots, L \tag{2}$$

$$z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, \qquad \ell = 1, \ldots, L \tag{3}$$

$$y = \mathrm{LN}(z_L^0) \tag{4}$$

Read (2) and (3) as "normalize, do the work, add the input back." The residual add (the $+\,z_{\ell-1}$ in (2) and the $+\,z'_\ell$ in (3)) is the same identity shortcut that lets deep networks train at all. Then (4) takes the final state of the class token, $z_L^0$, normalizes it, and calls it the image representation.
There is one quiet deviation hiding in (2) and (3), worth flagging because the paper says it follows the original Transformer "as closely as possible" and then does not, in this one spot. Notice the LayerNorm sits inside the residual branch, before MSA and MLP. The 2017 Transformer put it after the residual add instead. This is the difference between pre-norm and post-norm, and it is not cosmetic: pre-norm is what makes deep Transformers train stably without delicate warmup tricks, which is exactly why ViT adopts it (following Wang et al. and Baevski & Auli, both cited). So the block shape is the original; the norm placement is the modern fix. Put the two together and you can see why a stack this deep trains at all: the residual adds hand the gradient a straight path back through every block, so the signal that tunes the earliest layers never has to survive multiplication through all the layers above, and normalizing before each sublayer holds every layer's input at a stable scale, so the activations neither explode nor fade on the way down.
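The difference in norm placement is easy to see in code. This is a simplified sketch, not the paper's implementation: it handles a single token vector, the LayerNorm has no learnable scale or shift, and `silent` is a made-up stand-in for a sublayer that outputs nothing.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm(x, sublayer):
    """ViT placement: x + f(LN(x)). The skip path is never normalized."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_norm(x, sublayer):
    """2017 placement: LN(x + f(x)). The skip path passes through LN."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def silent(x):
    """Stand-in sublayer that contributes nothing."""
    return [0.0] * len(x)

x = [1.0, 2.0, 4.0, 8.0]
print(pre_norm(x, silent))     # [1.0, 2.0, 4.0, 8.0]: the input passes through untouched
print(post_norm(x, silent))    # rescaled, even though the sublayer added nothing
```

With pre-norm the skip connection is a true identity, which is the straight path the gradient follows back through the stack. With post-norm every block rescales the signal on the way.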
Inside MSA is the one operation everything rests on, scaled dot-product attention. Each token is projected into three vectors: a query $q$, a key $k$, and a value $v$. A token's query is compared against every token's key by a dot product; those scores are scaled and softmaxed into weights; the output is the weighted sum of values.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{D_h}}\right) V$$

Read at the level of the whole sequence, $Q$, $K$, and $V$ are stacked over all $N + 1$ tokens, so the attention matrix is $(N+1) \times (N+1)$: every token scored against every token, the class token included. The division by $\sqrt{D_h}$ keeps the dot products from growing with dimension and saturating the softmax; $D_h$ is the width of one head. "Multi-head" means running $k$ of these attention operations in parallel (here $k$ is the paper's letter for the head count, not a key vector), each with its own projections, then concatenating and mixing their outputs. To keep the compute fixed as you add heads, each head is made narrower: $D_h = D / k$. This all matches the original Transformer exactly, with ViT's $k$ heads playing the role of the original's $h$ heads and $D$ its model width $d_{\text{model}}$.
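Scaled dot-product attention for one head fits in a short sketch. This version takes already-projected queries, keys, and values as nested lists, so the learned projections and the multi-head concatenation are left out; the toy numbers are arbitrary.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(D_h)) V for one head; rows are tokens."""
    d_h = len(Q[0])
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_h) for k in K])
        for q in Q
    ]
    out = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return out, weights

# Three toy tokens with head width D_h = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

out, weights = attention(Q, K, V)
print(len(weights), len(weights[0]))             # 3 3: one weight per token pair
print([round(sum(row), 6) for row in weights])   # [1.0, 1.0, 1.0]
```

For ViT-Base the head width works out to 768 / 12 = 64, the same 64 the original Transformer used.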
The other half of each block, the MLP, is the plain feed-forward network from the original Transformer: two linear layers with a GELU between them, applied to every token on its own. It widens each token to four times the model width and back, so ViT-Base's 768 expands to 3072 in the middle.
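The per-token MLP is small enough to write out. A minimal sketch, assuming the erf form of GELU (implementations sometimes use a tanh approximation instead) and toy weights chosen only so the shapes are visible; `linear` and `mlp` are illustrative names.

```python
import math

def gelu(x):
    """GELU in its erf form: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, W, b):
    """y = x W + b for one token; W has len(x) rows and len(b) columns."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) + b[j] for j in range(len(b))]

def mlp(x, W1, b1, W2, b2):
    """Widen the token, apply GELU elementwise, project back to the model width."""
    hidden = [gelu(h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

D = 2                          # toy model width; the hidden width is 4 * D
W1 = [[0.1 * (i + j) for j in range(4 * D)] for i in range(D)]
W2 = [[0.1 * (i - j) for j in range(D)] for i in range(4 * D)]
y = mlp([1.0, -1.0], W1, [0.0] * (4 * D), W2, [0.0] * D)

print(len(y))                  # 2: back at the model width
print(gelu(0.0))               # 0.0
```

The function sees one token at a time, so it is applied identically at every position and never mixes information between tokens.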
Stack $L$ of these blocks and you have the model. The paper uses three sizes: Base and Large are lifted straight from BERT, and Huge is a larger one added on top:
- ViT-Base: 12 layers, width 768, 12 heads, 86M parameters.
- ViT-Large: 24 layers, width 1024, 16 heads, 307M parameters.
- ViT-Huge: 32 layers, width 1280, 16 heads, 632M parameters.
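Those parameter counts can be roughly reproduced from the layer count and width alone. The sketch below is a back-of-the-envelope count under stated assumptions: a 224×224 RGB input, a bias on every linear layer, a 4× MLP, and no classifier head. The function name is made up for illustration.

```python
def vit_backbone_params(layers, width, patch, image=224, channels=3):
    """Rough parameter count for a ViT encoder, excluding the classifier head."""
    tokens = (image // patch) ** 2 + 1                  # patches plus [class]
    attention = 4 * (width * width + width)             # Q, K, V, output projections
    mlp = 2 * (width * 4 * width) + 4 * width + width   # two linear layers, 4x hidden
    norms = 2 * (2 * width)                             # two LayerNorms, scale and shift
    block = attention + mlp + norms
    patch_embed = patch * patch * channels * width + width
    extras = width + tokens * width + 2 * width         # [class], positions, final LN
    return layers * block + patch_embed + extras

for name, layers, width, patch in [("ViT-B/16", 12, 768, 16),
                                   ("ViT-L/16", 24, 1024, 16),
                                   ("ViT-H/14", 32, 1280, 14)]:
    millions = vit_backbone_params(layers, width, patch) / 1e6
    print(f"{name}: about {millions:.0f}M")
```

The totals land within about 2% of the figures listed above. Nearly all of it is the roughly 12 × width² weights per block, so the count scales with depth times width squared.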
The whole forward pass, patches to logits, fits in a handful of lines:
```python
# ViT forward pass (pseudocode): one image -> class logits
patches = unfold(image, P, P)         # [N, P*P*C], N = H*W / P^2
tokens = patches @ E                  # linear patch embedding -> [N, D]
z = concat([cls, tokens]) + pos_emb   # prepend [class], add positions
for block in encoder:                 # L identical Transformer blocks
    z = z + MSA(LN(z))                # eq (2): attention, pre-norm
    z = z + MLP(LN(z))                # eq (3): MLP, pre-norm
y = LN(z[0])                          # eq (4): the [class] token's state
logits = head(y)                      # MLP at pre-train, linear at fine-tune
```

Nothing in that listing is vision-specific past the first two lines. The encoder is interchangeable with a language model's. That interchangeability is the entire pitch, and it is also what lets ViT inherit every scaling trick the NLP world had already worked out.
What ViT gives up: locality
Trading convolution for attention is not free, and the paper is honest about the bill. A convolutional network carries three strong assumptions about images, baked into every layer. It is worth naming them, because ViT discards almost all three.
Locality: a convolutional filter only ever looks at a small neighborhood, so early layers are forced to build features from nearby pixels. Two-dimensional structure: the filter slides over a grid, so the notion that pixels have up, down, left, and right neighbors is built in. Translation equivariance: because the same filter is applied at every position, an object shifted across the image produces the same features, just shifted. (Equivariance is the precise word here. The looser "a cat is a cat wherever it appears" invariance comes later, from pooling, not from the convolution itself.) These three are good guesses about how images work, and they are why a convolutional network can learn from a few thousand images.
ViT keeps almost none of it. Self-attention is global from the very first layer: every patch can read every other patch in one step, with no preference for nearby ones. And it is permutation-equivariant, treating the patches as an unordered set, which is the whole reason position embeddings had to be bolted on. The only place locality survives is the per-token MLP, which processes each token on its own. The model is handed a far weaker set of assumptions and told to learn the rest. Cash that out concretely: shuffle the input patches and self-attention shuffles its outputs the same way and computes nothing differently, because nothing in the architecture knows the patches came from a 2D grid. A convolution knows that 2D structure by construction, from the way one filter slides over neighbors; ViT has to spend data learning the same spatial structure.
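The shuffle claim can be checked directly. In this sketch the attention is stripped to its core: one head, identity projections so that queries, keys, and values are the tokens themselves, and no position embeddings. Those simplifications are assumptions made to keep the example short.

```python
import math

def self_attention(X):
    """Single-head self-attention with Q = K = V = X (identity projections)."""
    scale = math.sqrt(len(X[0]))
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in X]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j] for e, v in zip(exps, X))
                    for j in range(len(X[0]))])
    return out

tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [-1.0, 0.5]]
order = [2, 0, 3, 1]                                  # an arbitrary shuffle

shuffled_then_attended = self_attention([tokens[i] for i in order])
attended_then_shuffled = [self_attention(tokens)[i] for i in order]

same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_a, row_b in zip(shuffled_then_attended, attended_then_shuffled)
    for a, b in zip(row_a, row_b)
)
print(same)   # True
```

Shuffling before attention and shuffling after give the same rows, so the layer on its own cannot tell where a patch came from.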
Make the contrast concrete. Pick a query patch and compare its reach under attention against its reach under a 3×3 convolution. Attention touches the whole image at once; the convolution sees nine patches. A convolutional network only reaches across the image by stacking many layers, each one growing the receptive field a little. ViT can do it immediately.
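How slowly that reach grows is simple arithmetic. The sketch below assumes the plainest case: stride-1 convolutions with an odd kernel and no pooling or dilation. Real networks use strides and pooling to grow faster, so treat these numbers as the idealized picture only.

```python
def receptive_field(layers, kernel=3):
    """Side length seen by one output after `layers` stride-1 convolutions."""
    return 1 + layers * (kernel - 1)

def layers_to_span(distance, kernel=3):
    """Fewest stride-1 layers before an output sees an input `distance` cells away."""
    reach_per_layer = (kernel - 1) // 2
    return -(-distance // reach_per_layer)      # ceiling division

print(receptive_field(1))     # 3: a single 3x3 layer sees a 3x3 window
print(layers_to_span(13))     # 13: one edge to the other along a 14-wide grid
```

On a 14×14 grid, a 3×3 stack under these assumptions needs 13 layers before one corner can see the far edge. Attention covers the same distance in one.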
One careful note, because it is easy to overstate. ViT has much less image-specific inductive bias, not none. Two pieces of 2D knowledge are still injected by hand: cutting the image into a 2D grid of patches at the start, and, at fine-tuning time, 2D-interpolating the position embeddings when the input resolution changes. Everything else about spatial structure, which patch neighbors which and what that means, the model has to learn from scratch. The empirical question the rest of the paper answers is what it takes to learn it.
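The interpolation step can be sketched as an ordinary 2D resize of the position table. This is an illustration under assumptions the text does not pin down: bilinear interpolation with aligned corners, applied to the patch slots only, with the [class] slot carried over separately. The function name is hypothetical.

```python
def resize_position_grid(grid, new_size):
    """Bilinearly resize an S x S grid of embedding vectors to new_size x new_size.

    Corners are aligned, so the four corner embeddings are kept exactly.
    """
    if new_size < 2:
        raise ValueError("new_size must be at least 2")
    old = len(grid)
    dim = len(grid[0][0])
    out = []
    for r in range(new_size):
        y = r * (old - 1) / (new_size - 1)       # source row coordinate
        y0 = min(int(y), old - 1)
        y1 = min(y0 + 1, old - 1)
        fy = y - y0
        row = []
        for c in range(new_size):
            x = c * (old - 1) / (new_size - 1)   # source column coordinate
            x0 = min(int(x), old - 1)
            x1 = min(x0 + 1, old - 1)
            fx = x - x0
            row.append([
                (1 - fy) * ((1 - fx) * grid[y0][x0][d] + fx * grid[y0][x1][d])
                + fy * ((1 - fx) * grid[y1][x0][d] + fx * grid[y1][x1][d])
                for d in range(dim)
            ])
        out.append(row)
    return out

# Toy 14x14 table whose two-number "embedding" at (r, c) is just [r, c].
old_grid = [[[float(r), float(c)] for c in range(14)] for r in range(14)]
new_grid = resize_position_grid(old_grid, 24)    # e.g. 384 / 16 = 24 patches per side

print(len(new_grid), len(new_grid[0]))           # 24 24
print(new_grid[0][0], new_grid[23][23])          # [0.0, 0.0] [13.0, 13.0]
```

The resize only makes sense because the 196 slots are reshaped back into a 14×14 grid first, which is exactly the hand-injected 2D knowledge the paragraph above describes.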
Scale beats inductive bias
The finding, stated as the experiments found it: pre-train ViT on a mid-sized dataset like ImageNet (1.3M images) without heavy regularization, and it lands a few percentage points below a comparable ResNet. This is the expected outcome and the paper says so plainly. The convolutional priors ViT abandoned were earning their keep, and with limited data the model could not learn good replacements.
Then grow the pre-training set and the order reverses. The paper pre-trains on three datasets of increasing size: ImageNet (1.3M images), ImageNet-21k (14M), and the in-house JFT-300M (303M). On the smallest, convolutional ResNets win. As the data grows, ViT catches up and then overtakes.
At the top of the range the numbers are striking. The best model, ViT-H/14 pre-trained on JFT-300M and then fine-tuned, reaches 88.55% on ImageNet, 90.72% on the cleaned-up ImageNet-ReaL labels, 94.55% on CIFAR-100, and 77.63% on the 19-task VTAB suite. It matches or beats the strongest convolutional baselines of the day, the Big Transfer ResNets and Noisy Student EfficientNet.
One number here is the single most misquoted fact about this paper, so state it carefully. That 88.55% is not a from-scratch ImageNet result. It is what you get after pre-training on JFT-300M, a proprietary dataset of 303 million images, and then fine-tuning on ImageNet at high resolution. Take JFT away and train on ImageNet alone, and large ViT does worse than smaller ViT and worse than a comparable ResNet. The headline is real, and it is a story about scale.
The twist that makes it more than a curiosity is the cost. ViT also wins on compute. Pre-training is measured in TPUv3-core-days (the number of TPU v3 cores used, times the days they ran). ViT-H/14 took about 2.5k core-days. The ResNet baseline BiT-L took 9.9k, and Noisy Student took 12.3k. ViT reached a better result for roughly a quarter of the compute. The more modest ViT-L/16 pre-trained on the public ImageNet-21k could be trained on a single cloud TPUv3 with 8 cores in about 30 days, which is within reach of a normal lab.
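The compute comparison is worth a quick check. The core-day figures below are the ones quoted in the paragraph above; nothing else is assumed.

```python
core_days = {"ViT-H/14": 2_500, "BiT-L": 9_900, "Noisy Student": 12_300}

for name, cost in core_days.items():
    ratio = core_days["ViT-H/14"] / cost
    print(f"{name}: {cost} core-days, ViT-H/14 uses {ratio:.0%} of that")

# A single 8-core TPUv3 running for about 30 days:
single_device_core_days = 8 * 30
print(single_device_core_days)   # 240
```

So "roughly a quarter" holds against BiT-L and is closer to a fifth against Noisy Student, while the single-device ViT-L/16 run comes to about 240 core-days.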
Why does the order flip with scale? The paper offers a reading, and it is a hypothesis rather than a theorem, so take it as one. A convolutional network's priors are a head start that is also a ceiling. Locality and translation equivariance are roughly right, and when data is scarce that head start is decisive. But they are not exactly right, and given enough examples a model that learns its own structure from the data can fit images better than one constrained by hand-built assumptions. Past some data scale, the constraint costs more than it saves. The mechanism is one fact seen from two sides: when examples are scarce the hand-built assumptions point the model straight at plausible features and spare it from learning them the hard way, and when examples are abundant those same assumptions become guardrails that forbid structure the data actually contains, so a model free to learn its own can follow the data right past them. Notably, ViT shows no sign of saturating in the range the authors tried, which is part of why this paper kicked off a scaling race in vision.
Put the trade in plain terms, because it is the crux of the whole paper. A convolution hard-codes locality: a filter only ever looks at a small neighborhood, so the model is told from the start that nearby pixels matter and distant ones do not, until many layers stack up. When data is scarce that is a head start: the model does not have to spend examples discovering that a cat's ear sits next to its head rather than across the image. But the same hard-coding is a ceiling, because it forbids any pattern that is not local, and some real patterns are not. Self-attention carries almost none of that prior (much less, not none, since cutting the image into a patch grid is still a 2D assumption). The cost is that it has to learn 2D structure from data, which is why it needs scale, and which is exactly what the position-embedding result in the next section shows it doing. The payoff is that once it has learned that structure it is not boxed in by a small kernel: from the very first layer it can relate any patch to any other, an object to its distant reflection or its shadow on the far wall, a relationship a small convolution filter cannot express at all and a deep convolutional stack can only build up slowly. So the bias is not simply good or bad. It is a loan against data, cheap when examples are few and expensive when they are many.
Picture the geometry of that freedom. Mark an object patch in one corner of the grid and a related reflection patch in the opposite corner, and make the object patch the query. Under a convolution, the 3×3 ring around the query is the whole world the kernel sees, and the partner patch sits outside it. Under ViT, the query can read any patch in one step, and the model is free to put weight on the distant one.
What the patches learned
Given all that freedom, what does ViT actually do with it? The paper opens the trained model up, and two of its measurements are worth seeing, because they show ViT rediscovering structure nobody handed it.
The first is attention distance. For a given head, average how far across the image, in pixels, it attends. This is the Transformer's analogue of a convolutional network's receptive field: small distance means the head is acting locally, large distance means it is reaching across the whole picture. Plotting it by layer reveals a clear pattern.
In the lowest layers, the heads split in two. Some already attend across the whole image, using the global reach attention gives them. Others stay tightly local, behaving like the early layers of a convolutional network. As you go deeper, the local heads climb until almost every head is global. So ViT does end up building a local-to-global hierarchy, much like a CNN, but it learns the hierarchy instead of being wired for it, and it can still go global from the first layer. (In hybrid models, where a ResNet stem runs before the Transformer, those local low-layer heads largely vanish, because the convolutions already did the local work.) One way to read the split: different patches need context at different scales, so the model staffs both jobs at once, some heads reading their immediate neighbors while others scan the whole image. That is a reading of the measurement, though, not something the paper proves the heads are doing.
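The attention-distance measurement itself is a weighted average. Here is a sketch for one head on one image, assuming the [class] token has been dropped so that every row of `weights` covers patches only, and measuring distance between patch grid positions scaled to pixels. The two example heads are hand-built extremes, not taken from a trained model.

```python
import math

def mean_attention_distance(weights, grid, patch):
    """Average pixel distance between each query patch and what it attends to.

    weights[i][j] is the attention from patch i to patch j (rows sum to 1),
    over a grid x grid layout of square patches, each `patch` pixels wide.
    """
    n = grid * grid
    total = 0.0
    for i in range(n):
        qr, qc = divmod(i, grid)
        for j in range(n):
            kr, kc = divmod(j, grid)
            total += weights[i][j] * patch * math.hypot(qr - kr, qc - kc)
    return total / n

grid, patch = 4, 16
n = grid * grid
local_head = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
global_head = [[1.0 / n] * n for _ in range(n)]

local_distance = mean_attention_distance(local_head, grid, patch)
global_distance = mean_attention_distance(global_head, grid, patch)
print(local_distance)                      # 0.0: each patch attends only to itself
print(global_distance > local_distance)    # True: uniform attention reaches further
```

A head's score on this scale is what the layer-by-layer plot tracks: low for heads acting like a small convolution, high for heads reading across the image.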
The second measurement is the nicer surprise. Recall the position embeddings are plain 1D vectors, one per slot, with no built-in knowledge that the slots form a 14×14 grid. So ask: after training, how similar is each patch's position vector to every other's?
The learned vectors recover the grid. A patch's position embedding is most similar to its near neighbors, and patches in the same row or column stay similar too. Nothing in the architecture told the model that 2D layout exists; it inferred it from the data. This also explains a small negative result the paper reports: hand-designed 2D-aware position embeddings gave no real improvement, because the plain 1D ones already learn the 2D structure on their own. (One correction worth keeping straight: these are learned embeddings, not the fixed sinusoidal encodings of the original Transformer. A sinusoidal-looking pattern sometimes emerges in the learned vectors for larger grids, but it is a result, not an input.)
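The similarity map is just a cosine similarity between one slot's vector and all the others. The sketch below shows the computation on a hand-built stand-in table, a row code followed by a column code. It is not the learned table from the paper, only a toy with row and column structure built in so the pattern is easy to read.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def similarity_map(pos_emb, grid, row, col):
    """Cosine similarity of the embedding at (row, col) to every grid slot."""
    query = pos_emb[row * grid + col]
    return [[cosine(query, pos_emb[r * grid + c]) for c in range(grid)]
            for r in range(grid)]

# Stand-in table: one-hot row code followed by one-hot column code.
grid = 4
pos_emb = [
    [1.0 if k == r else 0.0 for k in range(grid)]
    + [1.0 if k == c else 0.0 for k in range(grid)]
    for r in range(grid) for c in range(grid)
]

sim = similarity_map(pos_emb, grid, row=1, col=2)
for line in sim:
    print([round(s, 2) for s in line])
```

In this toy the query slot scores highest, its row and column score in between, and everything else scores zero. The paper's point is that the trained 1D table shows the same row-and-column structure without anyone building it in.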
What it actually does
Step back and there is strikingly little to it. Cut an image into patches, embed them, prepend a class token, add positions, run an unmodified Transformer encoder, read off the class token. The vision ingredient is one slicing step at the front. Almost everything that makes it work is borrowed from language, and the win is scale.
The paper also runs a preliminary self-supervised experiment, and it is worth flagging as preliminary. Copy BERT's masked-language idea over to images: mask out some patches and train the model to predict them. This masked-patch prediction lifts ViT-B/16 to 79.9% on ImageNet, a 2% gain over training from scratch, but still 4% behind supervised pre-training. The authors leave the bigger question, contrastive and self-supervised pre-training at scale, to future work. (That thread is where a lot of the next few years of vision went.)
The limits are real and the paper does not hide them. ViT needs scale to shine, and on small data the convolutional inductive bias still wins. The class-token-versus-pooling choice is finicky enough to need its own learning-rate tuning. And the strongest results lean on JFT-300M, a dataset most people cannot touch.
What this paper changed is hard to overstate even so. "An image is a sequence of patches" became the default way to put vision into a Transformer. Follow-up work showed you could train ViT without a private 300M-image dataset by leaning on distillation (DeiT), and the patch-tokenization idea spread into detection, segmentation, image-text models like CLIP, and the vision front-ends of today's multimodal systems. The bet was to change as little as possible and let scale carry it. It carried.
Questions you might still have
If 88.55% beats the best CNN, why do people say ViT needs huge data?
Because that number comes after pre-training on JFT-300M (303M images), then fine-tuning. On ImageNet alone, large ViT does worse than a smaller ViT and worse than a comparable ResNet. The win is conditional on scale, not a from-scratch result.
Why split into patches instead of feeding raw pixels?
Self-attention compares every element to every other, so its cost grows like N². A 224×224 image is about 50,000 pixels, so pixel-level attention is roughly 2.5 billion pairs per layer. Grouping pixels into 16×16 patches drops N to 196 and makes the Transformer affordable.
Does ViT have no inductive bias, or just less?
Less, not none. Two image-specific biases remain: cutting the image into a 2D grid of patches, and 2D-interpolating the position embeddings when the resolution changes. Everything else about which patch relates to which is learned from data.
If the position embeddings are 1D, how does ViT know the 2D layout?
It learns it. Nothing tells the model that one patch sits directly below another, yet after training the learned position vectors of nearby patches, and of patches in the same row or column, become the most similar. The 2D map is recovered from data.
Is the class token necessary?
No. Average-pooling the patch tokens works about as well, but only after retuning the learning rate; out of the box it does much worse. ViT keeps the class token mainly to stay close to the standard Transformer, not for a measured accuracy gain.
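The two readouts differ by one line. In this sketch `z` is the encoder output as a list of token vectors with the [class] token in row 0; keeping that row in the sequence for the pooled variant is a simplification made for the example, and both function names are made up.

```python
def readout_class_token(z):
    """Use the final state of the [class] token (row 0) as the image vector."""
    return z[0]

def readout_average_pool(z):
    """Average the patch tokens (rows 1..N), ignoring the [class] token."""
    patches = z[1:]
    return [sum(col) / len(patches) for col in zip(*patches)]

# Toy encoder output: one [class] row followed by three patch rows, width 2.
z = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(readout_class_token(z))    # [9.0, 9.0]
print(readout_average_pool(z))   # [3.0, 4.0]
```

Either vector can feed the same classifier head; what changes is which tokens carry the summary.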
Footnotes & further reading
- The paper: Dosovitskiy, Beyer, Kolesnikov, et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Google Research, ICLR 2021). Code.
- The encoder, attention, and the head split come straight from Vaswani et al., Attention Is All You Need (2017). The pre-norm placement ViT actually uses is from Wang et al., Learning Deep Transformer Models for Machine Translation (2019).
- The [class] token is borrowed from Devlin et al., BERT (2019).
- The closest prior model, 2×2 patches plus full self-attention: Cordonnier, Loukas & Jaggi, On the Relationship between Self-Attention and Convolutional Layers (ICLR 2020).
- The convolutional baselines ViT is measured against: Kolesnikov et al., Big Transfer (BiT) (2020), and Xie et al., Self-training with Noisy Student (2020).
- Training ViT without a private 300M-image dataset, via distillation: Touvron et al., Training data-efficient image transformers (DeiT) (2021).