An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

An image is just a sequence of patches.

Cut a picture into a grid of 16×16 patches, treat each patch as a word, and feed the sequence to the same Transformer that runs language models. Change nothing else. With enough pre-training data, it beats the convolutional networks that ruled vision for a decade.

Explaining the paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Dosovitskiy, Beyer, Kolesnikov, et al. · Google Research · ICLR 2021 · arXiv:2010.11929

On a mid-sized dataset it trails a plain convolutional network; feed it hundreds of millions of images and it pulls ahead.

By 2020 the Transformer had won natural language processing outright. The recipe was settled: pre-train one big attention-based model on a mountain of text, then fine-tune it on whatever task you cared about. Vision had not gone that way. Image models were still convolutional networks, the descendants of the same idea that won ImageNet in 2012, and attempts to bring attention into vision mostly bolted it onto a convolutional backbone or replaced a few pieces while keeping the overall shape.

This paper does the most obvious thing and sees if it works: chop the image into patches, line the patches up as a sequence, and run a standard Transformer encoder over them, the kind shipped for language, with as few changes as possible. It does well. Given enough data to pre-train on, it wins. The authors put the finding in one line: large scale training trumps inductive bias.

That sentence needs unpacking, because it comes with a condition. Trained on a normal-sized image dataset, this patch-and-Transformer model is mediocre, a few points behind a convolutional network of similar size. The convolutional priors it threw away were doing real work. Only when you pre-train on hundreds of millions of images does the order flip and the Transformer pull ahead. The rest of this piece builds that story: how an image becomes a sequence, what exactly ViT discards, and why scale is what pays it back.

Vision skipped the Transformer

It helps to see why this was an open question at all. A Transformer's core operation is self-attention: every element of a sequence looks at every other element and pulls in a weighted blend of them. That is wonderful for flexibility and terrible for cost. Comparing every element to every other is quadratic, so attention over a sequence of length $N$ costs on the order of $N^2$.

For a sentence, $N$ is a few hundred words and the quadratic cost is fine. For an image, the natural sequence is the pixels, and that is a disaster. A modest 224×224 image has about 50,000 pixels, so attention over pixels would compare roughly 2.5 billion pairs in every layer. No accelerator wants that.

So before ViT, applying Transformers to images meant dodging the $N^2$ wall. People restricted attention to local neighborhoods, or approximated it with sparse and axial patterns, or used it only on top of a convolutional network that had already shrunk the spatial size. Several of these worked on paper but needed custom kernels that did not run efficiently on real hardware. In large-scale image recognition, classic ResNet-style convolutional networks were still the state of the art.

ViT handles the $N^2$ problem in one move: make $N$ small by not using pixels. Group the pixels into patches, and let each patch, not each pixel, be one element of the sequence.
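To put numbers on that move, here is a small arithmetic sketch. The helper name `attention_pairs` is made up for illustration; it only counts tokens for a square image cut into square tiles (a tile size of 1 meaning raw pixels) and squares that count to get the number of token pairs one attention layer scores.

```python
def attention_pairs(side: int, patch: int) -> tuple[int, int]:
    """Return (tokens, token pairs) for a side x side image cut into patch x patch tiles."""
    if side % patch != 0:
        raise ValueError("patch size must divide the image side")
    tokens = (side // patch) ** 2
    return tokens, tokens * tokens

pixel_tokens, pixel_pairs = attention_pairs(224, 1)    # one token per pixel
patch_tokens, patch_pairs = attention_pairs(224, 16)   # one token per 16x16 patch

print(pixel_tokens, pixel_pairs)   # 50176 2517630976
print(patch_tokens, patch_pairs)   # 196 38416
print(pixel_pairs // patch_pairs)  # 65536
```

Going from pixels to 16×16 patches cuts the token count by 256 and the pair count by 256², which is the whole trick.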

An image as a sequence of patches

Begin with the image. It is cut into a grid of fixed-size square patches, say 16×16 pixels each. A 224×224 image becomes a 14×14 grid of patches, which is 196 of them. Flatten each patch's pixels into a single vector. A 16×16 RGB patch holds $16\times 16\times 3 = 768$ numbers, so each patch is now a 768-length vector. The image has become a short sequence of vectors, the same kind of object a language Transformer eats.

Written out, the reshape is just bookkeeping:

$$\mathbf{x}\in\mathbb{R}^{H\times W\times C}\ \longrightarrow\ \mathbf{x}_p\in\mathbb{R}^{N\times(P^2\cdot C)},\qquad N=\frac{HW}{P^2}$$

$(H,W)$ is the image resolution, $C$ the color channels, $(P,P)$ the patch size, and $N$ the number of patches, which is also the sequence length the Transformer sees. The count $N=HW/P^2$ is the dial that matters. Halve the patch size and you quadruple the number of patches, and the attention cost grows like $N^2$ on top of that. This is why a model's name carries its patch size: ViT-L/16 is the Large variant with 16×16 patches, and ViT-L/32 is the same body with coarser 32×32 patches, a quarter as many tokens and far cheaper.
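The reshape can be sketched in NumPy as follows. This is one possible way to do the bookkeeping, not the paper's code: the function name `patchify` and the row-major ordering of patches are assumptions made for the example.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Reshape an (H, W, C) image into (N, patch*patch*C) flattened patches, row-major."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("patch size must divide both image dimensions")
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)          # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)    # (N, P^2 * C)

# A stand-in image whose values are just a running index.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches = patchify(image, 16)
print(patches.shape)               # (196, 768)
print(patchify(image, 32).shape)   # (49, 3072)
```

Doubling the patch side from 16 to 32 cuts the sequence from 196 tokens to 49, the quarter mentioned above, while each flattened patch grows fourfold.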

Figure 1 · the cost of the patch-size dial

The same 224×224 image at six patch sizes. Each P×P patch becomes one token, so $N=HW/P^2$ patches plus the prepended [class] token make an $(N+1)$-long sequence: at P=16 that is 196 patches, 197 tokens. Self-attention cost grows like $N^2$, drawn on a log bar so the explosion stays on screen. Halve P and you quadruple N and lift the attention cost about sixteenfold.

The patches are still raw pixels, so the first learned step is a single linear projection. Multiply each flattened patch by a trainable matrix $\mathbf{E}$ to map it into the model's working width $D$ (the constant vector size the Transformer carries through every layer, 768 for ViT-Base). The output is the patch embedding: one $D$-dimensional token per patch.

Two pieces still need adding before this is a real input sequence.

A class token. Borrowing directly from BERT, ViT prepends one extra learnable vector to the front of the sequence, the [class] token. It is not tied to any patch. Its job is to be a place to accumulate a summary: after the whole encoder runs, the final state of this one token is read off as the image representation $\mathbf{y}$ and sent to the classifier.

Position embeddings. Self-attention has no built-in sense of order. It treats its input as a set, so shuffling the patches would just shuffle the outputs the same way, carrying no information about where anything sits (the technical word is permutation-equivariant). That is fatal for an image, where position is everything. ViT fixes it the standard way: add a learnable position embedding to each token, a vector that says "you are slot $i$." Notably, these are plain 1D embeddings, one per slot, with no knowledge that the slots form a 2D grid. We will come back to what they learn, because it is one of the nicer surprises in the paper.

Putting the three pieces together gives the first equation of the paper, the input to the encoder:

$$\mathbf{z}_0=[\,\mathbf{x}_{\text{class}};\ \mathbf{x}_p^1\mathbf{E};\ \mathbf{x}_p^2\mathbf{E};\ \cdots;\ \mathbf{x}_p^N\mathbf{E}\,]+\mathbf{E}_{pos} \tag{1}$$

with the projection $\mathbf{E}\in\mathbb{R}^{(P^2\cdot C)\times D}$ and the position table $\mathbf{E}_{pos}\in\mathbb{R}^{(N+1)\times D}$. Note the $N+1$: the class token makes the sequence one longer than the patch count, so 196 patches become 197 tokens at the encoder's input.
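Equation (1) is short enough to write out directly. The sketch below uses random stand-in data and a 0.02 initialization scale, both assumptions for illustration; the variable names simply mirror the symbols in the equation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, C, D = 196, 16, 3, 768                  # ViT-Base/16 on a 224x224 RGB image

x_p = rng.normal(size=(N, P * P * C))         # flattened patches (stand-in data)
E = rng.normal(size=(P * P * C, D)) * 0.02    # patch projection (assumed init scale)
x_class = np.zeros((1, D))                    # learnable [class] token, zero here
E_pos = rng.normal(size=(N + 1, D)) * 0.02    # learnable 1D position table

# eq. (1): prepend the class token, project the patches, add positions
z0 = np.concatenate([x_class, x_p @ E], axis=0) + E_pos
print(z0.shape)   # (197, 768)
```

Row 0 of `z0` is the class token plus its own position embedding; rows 1 to 196 are the patches in order.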

Figure 2 · image to token sequence

The image is cut into a grid of patches, each patch is flattened and linearly projected to a token, a teal [class] token is prepended, and position embeddings are added. Smaller patches reconstruct the image better but produce more tokens, and attention cost grows like N². The 16×16 setting gives the title its words.

That is the entire vision-specific part of ViT. From here on there is nothing about images at all. The sequence $\mathbf{z}_0$ goes into a Transformer that does not know or care that its tokens came from a picture.

A standard Transformer, unchanged

The point of pride in the paper is how little happens next. The encoder is the 2017 Transformer, the same one behind every language model, used almost out of the box. It is a stack of $L$ identical blocks, and each block is two operations: multi-head self-attention (MSA) and a small per-token MLP, each wrapped in a LayerNorm and a residual connection.

$$\mathbf{z}'_\ell=\operatorname{MSA}(\operatorname{LN}(\mathbf{z}_{\ell-1}))+\mathbf{z}_{\ell-1} \tag{2}$$

$$\mathbf{z}_\ell=\operatorname{MLP}(\operatorname{LN}(\mathbf{z}'_\ell))+\mathbf{z}'_\ell \tag{3}$$

$$\mathbf{y}=\operatorname{LN}(\mathbf{z}_L^0) \tag{4}$$

Equations (2) and (3) each say: normalize, do the work, add the input back. The residual add (the $+\,\mathbf{z}_{\ell-1}$) is the same identity shortcut that lets deep networks train at all. Then (4) takes the final state of the class token, $\mathbf{z}_L^0$, normalizes it, and calls it the image representation.

There is one deviation in (2) and (3), worth flagging because the paper says it follows the original Transformer "as closely as possible" and then does not, in this one spot. Notice the LayerNorm sits inside the residual branch, before MSA and MLP. The 2017 Transformer put it after the residual add instead. This is the difference between pre-norm and post-norm: pre-norm is what makes deep Transformers train stably without delicate warmup tricks, which is exactly why ViT adopts it (following Wang et al. and Baevski & Auli, both cited). So the block shape is the original; the norm placement is the modern fix. Put the two together and you can see why a stack this deep trains at all: the residual adds give the gradient a direct path back through every block, so the signal that tunes the earliest layers never has to survive multiplication through all the layers above, and normalizing before each sublayer holds every layer's input at a stable scale, so the activations neither explode nor fade on the way down.
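A toy comparison makes the placement difference visible. In the sketch below, `layer_norm` omits the learned scale and shift, and `silent` is a made-up sublayer that outputs zeros; both are simplifications chosen so the residual path can be inspected in isolation.

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    """Normalize each token (row) to zero mean and unit variance; no learned scale/shift."""
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)

def pre_norm(z, sublayer):
    return z + sublayer(layer_norm(z))      # ViT placement, eqs. (2)-(3)

def post_norm(z, sublayer):
    return layer_norm(z + sublayer(z))      # 2017 Transformer placement

rng = np.random.default_rng(0)
z = rng.normal(size=(197, 768)) * 5.0
silent = lambda t: np.zeros_like(t)         # a sublayer that contributes nothing

print(np.allclose(pre_norm(z, silent), z))    # True: the input passes through untouched
print(np.allclose(post_norm(z, silent), z))   # False: the norm rescales the residual path
```

With pre-norm the shortcut is a pure identity, so stacking many blocks never forces the signal through a normalization it cannot bypass.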

Inside MSA is the one operation everything rests on, scaled dot-product attention. Each token is projected into three vectors: a query $\mathbf{q}$, a key $\mathbf{k}$, and a value $\mathbf{v}$. A token's query is compared against every token's key by a dot product; those scores are scaled and softmaxed into weights; the output is the weighted sum of values.

$$\mathbf{q},\mathbf{k},\mathbf{v}=\mathbf{z}\mathbf{U}_{qkv},\qquad A=\operatorname{softmax}\!\big(\mathbf{q}\mathbf{k}^{\top}/\sqrt{D_h}\big),\qquad \operatorname{SA}(\mathbf{z})=A\,\mathbf{v} \tag{5-7}$$

At the level of the whole sequence, $\mathbf{q},\mathbf{k},\mathbf{v}$ are stacked over all the tokens, so the attention matrix $A$ is $(N{+}1)\times(N{+}1)$: every token scored against every token, the class token included. The division by $\sqrt{D_h}$ keeps the dot products from growing with dimension and saturating the softmax; $D_h$ is the width of one head. "Multi-head" means running $k$ of these attention operations in parallel, each with its own projections, then concatenating and mixing their outputs. To keep the compute fixed as you add heads, each head is made narrower: $D_h=D/k$. This all matches the original Transformer exactly; only the letters differ, with ViT's $k$ standing for the head count the original calls $h$ and $D$ for the model width it calls $d_{\text{model}}$.
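Multi-head self-attention can be sketched in a few lines of NumPy. This version stacks all heads' query, key, and value projections into one $D\times 3D$ matrix, a common implementation shortcut rather than the paper's per-head notation; the names `msa`, `U_qkv`, and `U_msa` and the random weights are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def msa(z, U_qkv, U_msa, k):
    """Multi-head self-attention over z of shape (tokens, D) with k heads."""
    n, D = z.shape
    D_h = D // k                                            # width of one head
    qkv = (z @ U_qkv).reshape(n, 3, k, D_h)                 # split into q, k, v per head
    q, key, v = (qkv[:, i].transpose(1, 0, 2) for i in range(3))   # each (k, n, D_h)
    A = softmax(q @ key.transpose(0, 2, 1) / np.sqrt(D_h))  # (k, n, n) attention weights
    heads = A @ v                                           # (k, n, D_h) weighted values
    out = heads.transpose(1, 0, 2).reshape(n, D)            # concatenate the heads
    return out @ U_msa, A                                   # mix heads with a final projection

rng = np.random.default_rng(0)
D, k = 768, 12
z = rng.normal(size=(197, D))
U_qkv = rng.normal(size=(D, 3 * D)) * 0.02
U_msa = rng.normal(size=(D, D)) * 0.02
out, A = msa(z, U_qkv, U_msa, k)
print(out.shape, A.shape)   # (197, 768) (12, 197, 197)
```

Each of the 12 heads gets its own 197×197 attention matrix, and every row of each matrix sums to one.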

The other half of each block, the MLP, is the position-wise feed-forward network from the original Transformer with its ReLU swapped for a GELU (a smooth ReLU-like activation): two linear layers with the nonlinearity between them, applied to every token on its own. It widens each token to four times the model width and back, so ViT-Base's 768 expands to 3072 in the middle.

Stack $L$ of these blocks and you have the model. The paper uses three sizes: Base and Large are lifted straight from BERT, and Huge is a larger one added on top:

ViT-Base: 12 layers, width $D=768$, 12 heads, 86M parameters.
ViT-Large: 24 layers, width 1024, 16 heads, 307M parameters.
ViT-Huge: 32 layers, width 1280, 16 heads, 632M parameters.
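As a sanity check on those parameter counts, here is a rough tally built from the pieces described above. The function `vit_params` is a hypothetical helper, and it assumes biases on every linear layer, a 4× MLP hidden width, 224×224 inputs, and no classification head. It lands close to, but not exactly on, the quoted figures, presumably because of differences in what gets counted.

```python
def vit_params(layers: int, width: int, patch: int, image: int = 224, channels: int = 3) -> int:
    """Rough parameter count for a ViT encoder, without a classification head."""
    tokens = (image // patch) ** 2 + 1                     # patches plus [class]
    embed = patch * patch * channels * width + width       # E and its bias
    extras = tokens * width + width                        # E_pos and the [class] token
    attention = 4 * width * width + 4 * width              # q, k, v, and output projections
    mlp = 2 * width * (4 * width) + 4 * width + width      # two linear layers, 4x hidden
    norms = 2 * 2 * width                                  # two LayerNorms per block
    final_norm = 2 * width
    return embed + extras + layers * (attention + mlp + norms) + final_norm

for name, layers, width, patch in [("ViT-B/16", 12, 768, 16),
                                   ("ViT-L/16", 24, 1024, 16),
                                   ("ViT-H/14", 32, 1280, 14)]:
    print(name, round(vit_params(layers, width, patch) / 1e6, 1), "M")
# ViT-B/16 85.8 M
# ViT-L/16 303.3 M
# ViT-H/14 630.8 M
```

Nearly all of the total sits in the blocks, at roughly $12D^2$ parameters each, which is why width matters more than patch size for model size.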

The whole forward pass, patches to logits, fits in a handful of lines:

# ViT forward pass: one image -> class logits
patches = unfold(image, P, P)        # [N, P*P*C],  N = H*W / P^2
tokens  = patches @ E                 # linear patch embedding -> [N, D]
z = concat([cls, tokens]) + pos_emb   # prepend [class], add positions
for block in encoder:                 # L identical Transformer blocks
    z = z + MSA(LN(z))                # eq (2): attention, pre-norm
    z = z + MLP(LN(z))                # eq (3): MLP, pre-norm
y = LN(z[0])                          # eq (4): the [class] token's state
logits = head(y)                      # MLP at pre-train, linear at fine-tune

Nothing in that listing is vision-specific past the first two lines. The encoder is interchangeable with a language model's. ViT was built to be interchangeable like this, and the interchangeability is also what lets it inherit every scaling trick the NLP world had already worked out.
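For readers who want to run the listing, here is a minimal NumPy sketch of the same forward pass at toy scale. It is an illustration under stated assumptions, not the authors' implementation: random untrained weights, no biases, no learned LayerNorm parameters, the tanh approximation of GELU, and a linear head. All function and dictionary names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(z, eps=1e-6):
    return (z - z.mean(-1, keepdims=True)) / np.sqrt(z.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def gelu(x):  # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def init(*shape):
    return rng.normal(size=shape) * 0.02

def make_block(D, k):
    return {"k": k, "qkv": init(D, 3 * D), "out": init(D, D),
            "fc1": init(D, 4 * D), "fc2": init(4 * D, D)}

def msa(z, blk):
    n, D = z.shape
    k = blk["k"]
    qkv = (z @ blk["qkv"]).reshape(n, 3, k, D // k)
    q, key, v = (qkv[:, i].transpose(1, 0, 2) for i in range(3))   # each (k, n, D/k)
    A = softmax(q @ key.transpose(0, 2, 1) / np.sqrt(D // k))
    return (A @ v).transpose(1, 0, 2).reshape(n, D) @ blk["out"]

def mlp(z, blk):
    return gelu(z @ blk["fc1"]) @ blk["fc2"]

def vit_forward(image, params):
    P = params["P"]
    H, W, C = image.shape
    patches = (image.reshape(H // P, P, W // P, P, C)
                    .transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C))
    z = np.concatenate([params["cls"], patches @ params["E"]]) + params["pos"]
    for blk in params["blocks"]:
        z = z + msa(layer_norm(z), blk)     # eq. (2)
        z = z + mlp(layer_norm(z), blk)     # eq. (3)
    y = layer_norm(z[0])                    # eq. (4): the [class] token's state
    return y @ params["head"]

# Toy configuration: 32x32 RGB image, 8x8 patches, width 64, 2 blocks, 10 classes.
P, D, L, k, classes = 8, 64, 2, 4, 10
N = (32 // P) ** 2
params = {"P": P, "E": init(P * P * 3, D), "cls": init(1, D), "pos": init(N + 1, D),
          "blocks": [make_block(D, k) for _ in range(L)], "head": init(D, classes)}
logits = vit_forward(rng.normal(size=(32, 32, 3)), params)
print(logits.shape)   # (10,)
```

Only the `patches` line inside `vit_forward` knows the input was an image; everything after it would work on any sequence of vectors.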

What ViT gives up: locality

Trading convolution for attention is not free, and the paper spells out the bill. A convolutional network carries three strong assumptions about images, baked into every layer, and ViT discards all three almost entirely.

Locality: a convolutional filter only ever looks at a small neighborhood, so early layers are forced to build features from nearby pixels. Two-dimensional structure: the filter slides over a grid, so the notion that pixels have up, down, left, and right neighbors is built in. Translation equivariance: because the same filter is applied at every position, an object shifted across the image produces the same features, just shifted. (Equivariance is the precise word here. The looser "a cat is a cat wherever it appears" invariance comes later, from pooling, not from the convolution itself.) These three are good guesses about how images work, and they are why a convolutional network can learn from a few thousand images.

ViT keeps almost none of it. Self-attention is global from the very first layer: every patch can read every other patch in one step, with no preference for nearby ones. And it is permutation-equivariant, treating the patches as an unordered set, which is exactly why position embeddings had to be bolted on. The only place locality survives is the per-token MLP, which processes each token on its own. The model is handed a far weaker set of assumptions and told to learn the rest. Concretely, shuffling the input patches shuffles self-attention's outputs the same way and computes nothing differently, because nothing in the architecture knows the patches came from a 2D grid. A convolution knows that 2D structure by construction, from the way one filter slides over neighbors; ViT has to spend data learning the same spatial structure.
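The shuffling claim is easy to check numerically. The sketch below uses a single attention head with random weights and a random stand-in position table, all assumptions for illustration: without position embeddings a shuffled input gives the same outputs in shuffled order, and once slot-tied position embeddings are added that stops being true.

```python
import numpy as np

def self_attention(z, U_q, U_k, U_v):
    """Single-head scaled dot-product self-attention over the rows of z."""
    q, k, v = z @ U_q, z @ U_k, z @ U_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)
    return A @ v

rng = np.random.default_rng(0)
n, D = 196, 32
z = rng.normal(size=(n, D))                        # patch tokens, no position embeddings
U_q, U_k, U_v = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
E_pos = rng.normal(size=(n, D))                    # stand-in position table

perm = rng.permutation(n)
out = self_attention(z, U_q, U_k, U_v)
out_shuffled = self_attention(z[perm], U_q, U_k, U_v)
print(np.allclose(out_shuffled, out[perm]))        # True: outputs are just reordered

with_pos = self_attention(z + E_pos, U_q, U_k, U_v)
with_pos_shuffled = self_attention(z[perm] + E_pos, U_q, U_k, U_v)
print(np.allclose(with_pos_shuffled, with_pos[perm]))   # False: slots now matter
```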

The figure makes the contrast concrete. Pick a query patch and compare its reach under attention against its reach under a 3×3 convolution. Attention touches the whole image at once; the convolution sees nine patches. A convolutional network only reaches across the image by stacking many layers, each one growing the receptive field a little. ViT can do it immediately.

Figure 3 · global attention vs local convolution

Self-attention reads every one of the 49 patches in a single layer, with no locality prior. A 3×3 convolution reads only its neighborhood; a CNN must stack many layers to reach as far. This is the "much less inductive bias" the paper trades away.

ViT has much less image-specific inductive bias, not none. Two pieces of 2D knowledge are still injected by hand: cutting the image into a 2D grid of patches at the start, and, at fine-tuning time, 2D-interpolating the position embeddings when the input resolution changes. Everything else about spatial structure, which patch neighbors which and what that means, the model has to learn from scratch. The empirical question the rest of the paper answers is what it takes to learn it.
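The second of those hand-injected steps can be sketched as follows. The text only says the position embeddings are 2D-interpolated; the bilinear scheme, the corner-aligned sampling, and the 384-pixel fine-tuning size below are assumptions chosen for illustration, and `resize_pos_embed` is a hypothetical helper name.

```python
import numpy as np

def resize_pos_embed(E_pos, old_grid, new_grid):
    """Bilinearly resample the patch rows of an (old_grid**2 + 1, D) position table."""
    cls_pos, patch_pos = E_pos[:1], E_pos[1:]
    D = patch_pos.shape[1]
    grid = patch_pos.reshape(old_grid, old_grid, D)        # restore the 2D layout
    coords = np.linspace(0.0, old_grid - 1.0, new_grid)    # new sample points, in old units
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    t = coords - lo                                        # fractional offsets in [0, 1)
    # Interpolate along rows, then along columns.
    rows = grid[lo] * (1 - t)[:, None, None] + grid[hi] * t[:, None, None]
    full = rows[:, lo] * (1 - t)[None, :, None] + rows[:, hi] * t[None, :, None]
    # The [class] slot has no spatial position, so it is carried over unchanged.
    return np.concatenate([cls_pos, full.reshape(new_grid * new_grid, D)], axis=0)

rng = np.random.default_rng(0)
E_pos = rng.normal(size=(14 * 14 + 1, 768))       # table learned at 224x224, P=16
E_pos_384 = resize_pos_embed(E_pos, 14, 24)       # 384 / 16 = 24 patches per side
print(E_pos_384.shape)   # (577, 768)
```

The 1D table has to be folded back into a 14×14 grid before it can be resampled, which is exactly where 2D knowledge enters by hand.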

Scale beats inductive bias

The finding, stated as the experiments found it: pre-train ViT on a mid-sized dataset like ImageNet (1.3M images) without heavy regularization, and it lands a few percentage points below a comparable ResNet. This is the expected outcome and the paper says so plainly. The convolutional priors ViT abandoned were earning their keep, and with limited data the model could not learn good replacements.

Then grow the pre-training set and the order reverses. The paper pre-trains on three datasets of increasing size: ImageNet (1.3M images), ImageNet-21k (14M), and the in-house JFT-300M (303M). On the smallest, convolutional ResNets win. As the data grows, ViT catches up and then overtakes.

Figure 4 · the data crossover

ImageNet top-1 after fine-tuning, against pre-training set size on a log axis. ViT-L/16 starts below the band of BiT ResNets on small data and crosses above it as the pre-training set grows to JFT-300M. The shape and the anchor points follow the paper's Figure 3 and Table 2; the curve between them is a read of the comparison, not extra measurements.

At the top of the range the numbers are striking. The best model, ViT-H/14 pre-trained on JFT-300M and then fine-tuned, reaches 88.55% on ImageNet, 90.72% on the cleaned-up ImageNet-ReaL labels, 94.55% on CIFAR-100, and 77.63% on the 19-task VTAB suite (a transfer benchmark of 19 varied vision tasks, designed to test how broadly a representation transfers). It matches or beats the strongest convolutional baselines of the day, the Big Transfer ResNets and Noisy Student EfficientNet.

That 88.55% is not a from-scratch ImageNet result. It is what you get after pre-training on JFT-300M, a proprietary dataset of 303 million images, and then fine-tuning on ImageNet at high resolution. Take JFT away and train on ImageNet alone, and large ViT does worse than smaller ViT and worse than a comparable ResNet.

The twist that makes it more than a curiosity is the cost. ViT also wins on compute. Pre-training is measured in TPUv3-core-days (the number of TPU v3 cores used, times the days they ran). ViT-H/14 took about 2.5k core-days. The ResNet baseline BiT-L took 9.9k, and Noisy Student took 12.3k. ViT reached a better result for roughly a quarter of the compute. The more modest ViT-L/16 pre-trained on the public ImageNet-21k could be trained on a single cloud TPUv3 with 8 cores in about 30 days, which is within reach of a normal lab.
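The compute comparison is simple arithmetic on the core-day figures quoted above, worked through here as a quick check. The dictionary just restates those numbers.

```python
core_days = {"ViT-H/14": 2.5e3, "BiT-L": 9.9e3, "Noisy Student": 12.3e3}

for name in ("BiT-L", "Noisy Student"):
    ratio = core_days["ViT-H/14"] / core_days[name]
    print(f"ViT-H/14 vs {name}: {ratio:.2f} of the compute")
# ViT-H/14 vs BiT-L: 0.25 of the compute
# ViT-H/14 vs Noisy Student: 0.20 of the compute

vit_l16_core_days = 8 * 30     # 8 TPUv3 cores for about 30 days
print(vit_l16_core_days)       # 240
```

So the ImageNet-21k run of ViT-L/16 is on the order of a tenth of the ViT-H/14 budget.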

Why does the order flip with scale? The paper offers a reading, and it is a hypothesis rather than a theorem, so take it as one. A convolutional network's priors are a head start that is also a ceiling. Locality and translation equivariance are roughly right, and when data is scarce that advantage is decisive. But they are not exactly right, and given enough examples a model that learns its own structure from the data can fit images better than one constrained by hand-built assumptions. Past some data scale, the constraint costs more than it saves. The mechanism is one fact seen from two sides: when examples are scarce the hand-built assumptions point the model straight at plausible features and spare it from learning them the hard way, and when examples are abundant those same assumptions become guardrails that forbid structure the data actually contains, so a model free to learn its own can follow the data right past them. Notably, ViT shows no sign of saturating in the range the authors tried, which is part of why this paper kicked off a scaling race in vision.

A convolution hard-codes locality: a filter only ever looks at a small neighborhood, so the model is told from the start that nearby pixels matter and distant ones do not, until many layers stack up. When data is scarce that is an advantage: the model does not have to spend examples discovering that a cat's ear sits next to its head rather than across the image. But the same hard-coding is a ceiling, because it forbids any pattern that is not local, and some real patterns are not. Self-attention carries almost none of that prior (much less, not none, since cutting the image into a patch grid is still a 2D assumption). The cost is that it has to learn 2D structure from data, which is why it needs scale, and which is exactly what the position-embedding result below shows it doing. Once it has learned that structure it is not boxed in by a small kernel: from the very first layer it can relate any patch to any other, an object to its distant reflection, a relationship a small convolution filter cannot express at all and a deep convolutional stack can only build up slowly.

The next figure draws the geometric picture of that freedom. It marks an object patch in one corner of the grid and a related reflection patch in the opposite corner, with one patch chosen as the query. In the conv frame the 3×3 ring is the whole world the kernel sees and the partner patch sits outside it. In the ViT frame the query can read any patch in one step, and the model is free to put weight on the distant one.

Figure 5 · what the freedom allows

A conv 3×3 frame set against a ViT frame for the same query patch. The conv ring can never reach the far related patch; the attention frame can. This is an illustrative schematic: the real ViT learns where to attend from data, and the next figure measures that. ViT keeps the 2D patch grid as a prior, so the trade is much less inductive bias, not none.

What the patches learned

Given all that freedom, what does ViT actually do with it? The paper opens the trained model up, and two of its measurements stand out, because they show ViT rediscovering structure nobody handed it.

The first is attention distance. For a given head, average how far across the image, in pixels, it attends. This is the Transformer's analogue of a convolutional network's receptive field: small distance means the head is acting locally, large distance means it is reaching across the whole picture. Plotting it by layer reveals a clear pattern.
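One way to compute such a distance for a single head is sketched below. It assumes the attention matrix covers patch tokens only (the class token dropped), measures distance between patch centers, and handles one attention matrix rather than an average over many images; `mean_attention_distance` is a name invented for the example.

```python
import numpy as np

def mean_attention_distance(A, grid, patch):
    """Attention-weighted mean pixel distance for one head.

    A is (grid*grid, grid*grid): row i holds patch i's attention over all patches.
    """
    rows, cols = np.divmod(np.arange(grid * grid), grid)
    centers = (np.stack([rows, cols], axis=1) + 0.5) * patch   # patch centers in pixels
    diff = centers[:, None, :] - centers[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))                        # (N, N) pairwise distances
    return float((A * dist).sum(axis=1).mean())                # average over query patches

grid, patch = 14, 16
N = grid * grid
local = np.eye(N)                       # every patch attends only to itself
uniform = np.full((N, N), 1.0 / N)      # every patch attends equally everywhere
print(mean_attention_distance(local, grid, patch))     # 0.0
print(mean_attention_distance(uniform, grid, patch))   # much larger: a fully global head
```

The two extremes bracket what a real head can do: a trained head lands somewhere between the purely local and the uniformly global value.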

Figure 6 · attention distance grows with depth

Each dot is one head's mean attention distance at a layer; the line is the per-layer mean. In the lowest layers the heads split: some already span the whole image, others stay tight and local. With depth, the local heads climb until nearly every head is global. ViT learns a local-to-global hierarchy, but it is free to go global immediately.

In the lowest layers, the heads split in two. Some already attend across the whole image, using the global reach attention gives them. Others stay tightly local, behaving like the early layers of a convolutional network. As you go deeper, the local heads climb until almost every head is global. So ViT does end up building a local-to-global hierarchy, much like a CNN, but it learns the hierarchy instead of being wired for it, and it can still go global from the first layer. (In hybrid models, where a ResNet stem runs before the Transformer, those local low-layer heads largely vanish, because the convolutions already did the local work.) The split may reflect that different patches need context at different scales, so the model staffs both jobs at once, some heads reading their immediate neighbors while others scan the whole image. That is a reading of the measurement, though, not something the paper proves the heads are doing.

The second measurement is the nicer surprise. The position embeddings are plain 1D vectors, one per slot, with no built-in knowledge that the slots form a 14×14 grid. So ask: after training, how similar is each patch's position vector to every other's?

Figure 7 · 1D embeddings learn the 2D grid

For a selected patch, the cosine similarity of its learned position embedding to every other patch's. The selected patch and its near neighbors are brightest, with a fainter band along the same row and column. Although the embeddings are 1D and learned from scratch, they recover the 2D image layout. This is a model of the paper's Figure 7.

The learned vectors recover the grid. A patch's position embedding is most similar to its near neighbors, and patches in the same row or column stay similar too. Nothing in the architecture told the model that 2D layout exists; it inferred it from the data. This also explains a small negative result the paper reports: hand-designed 2D-aware position embeddings gave no real improvement, because the plain 1D ones already learn the 2D structure on their own. (These are learned embeddings, not the fixed sinusoidal encodings of the original Transformer. A sinusoidal-looking pattern sometimes emerges in the learned vectors for larger grids, but it is a result, not an input.)
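The similarity map itself is a one-line computation once the table is in hand. The sketch below uses a random, untrained stand-in table, so it shows only the mechanics and none of the learned 2D structure; `position_similarity` is a hypothetical helper name.

```python
import numpy as np

def position_similarity(E_pos, grid, row, col):
    """Cosine similarity between one patch's position embedding and all others, as a grid."""
    patch_pos = E_pos[1:]                                     # drop the [class] slot
    unit = patch_pos / np.linalg.norm(patch_pos, axis=1, keepdims=True)
    sims = unit @ unit[row * grid + col]                      # (grid*grid,)
    return sims.reshape(grid, grid)

rng = np.random.default_rng(0)
E_pos = rng.normal(size=(14 * 14 + 1, 768))   # untrained stand-in, so no 2D structure yet
sim_map = position_similarity(E_pos, grid=14, row=3, col=5)
print(sim_map.shape)                      # (14, 14)
print(round(float(sim_map[3, 5]), 6))     # 1.0
```

On a trained table the bright region would spread from the selected cell to its neighbors and along its row and column; on this random one only the selected cell stands out.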

The numbers, and what they hinge on

For all that, there is strikingly little to the model. Cut an image into patches, embed them, prepend a class token, add positions, run an unmodified Transformer encoder, read off the class token. The vision ingredient is one slicing step at the front. Almost everything that makes it work is borrowed from language, and what makes it win is scale.

The paper also runs a self-supervised experiment, and labels it preliminary itself. Copy BERT's masked-language idea over to images: mask out some patches and train the model to predict them. This masked-patch prediction lifts ViT-B/16 to 79.9% on ImageNet, a gain of 2 percentage points over training from scratch, but still 4 points behind supervised pre-training. The authors leave the bigger question, contrastive and self-supervised pre-training at scale, to future work. (A lot of the next few years of vision research followed that thread.)

The limits track the same story. ViT needs scale to shine, and on small data the convolutional inductive bias still wins. The class-token-versus-pooling choice is finicky enough to need its own learning-rate tuning. And the strongest results lean on JFT-300M, a dataset most people cannot touch.

The paper's influence ran well beyond the model itself, even so. "An image is a sequence of patches" became the default way to put vision into a Transformer. Follow-up work showed you could train ViT without a private 300M-image dataset by leaning on distillation (DeiT), and the patch-tokenization idea spread into detection, segmentation, image-text models like CLIP (which learns a shared embedding for images and their captions), and the vision front-ends of today's multimodal systems.

Provenance · Verified against primary literature

Transformer (2017): The encoder, scaled dot-product attention, and the D_h = D/k head split, reused almost verbatim.

BERT (2019): The prepended [class] token whose final state is the classification representation.

Cordonnier et al. (2020): The closest prior model: 2×2 patches plus full self-attention, which ViT scales up.

BiT / Noisy Student: The large ResNet and EfficientNet baselines ViT is measured against.

Correction: Three things the popular telling gets wrong. (1) The 88.55% ImageNet number is ViT-H/14 pre-trained on JFT-300M and then fine-tuned, not trained from scratch; on ImageNet alone, large ViT loses to ResNets. (2) ViT follows the 2017 Transformer but switches its post-norm for pre-norm (LayerNorm inside the residual branch). (3) Its position embeddings are learned 1D vectors, not the original Transformer’s fixed sinusoidal ones.

Questions you might still have

If 88.55% beats the best CNN, why do people say ViT needs huge data?
Because that number comes after pre-training on JFT-300M (303M images), then fine-tuning. On ImageNet alone, large ViT does worse than a smaller ViT and worse than a comparable ResNet. The win is conditional on scale, not a from-scratch result.

Why split into patches instead of feeding raw pixels?
Self-attention compares every element to every other, so its cost grows like N². A 224×224 image is about 50,000 pixels, so pixel-level attention is roughly 2.5 billion pairs per layer. Grouping pixels into 16×16 patches drops N to 196 and makes the Transformer affordable.

Does ViT have no inductive bias, or just less?
Less, not none. Two image-specific biases remain: cutting the image into a 2D grid of patches, and 2D-interpolating the position embeddings when the resolution changes. Everything else about which patch relates to which is learned from data.

If the position embeddings are 1D, how does ViT know the 2D layout?
It learns it. Nothing tells the model that one patch sits directly below another, yet after training the learned position vectors of nearby patches, and of patches in the same row or column, become the most similar. The 2D map is recovered from data.

Is the class token necessary?
No. Average-pooling the patch tokens works about as well, but only after retuning the learning rate; out of the box it does much worse. ViT keeps the class token mainly to stay close to the standard Transformer, not for a measured accuracy gain.

Footnotes & further reading

The paper: Dosovitskiy, Beyer, Kolesnikov, et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Google Research, ICLR 2021), arXiv:2010.11929.
The encoder, attention, and the $D_h=D/k$ head split come straight from Vaswani et al., Attention Is All You Need (2017). The pre-norm placement ViT actually uses is from Wang et al., Learning Deep Transformer Models for Machine Translation (2019).
The [class] token is borrowed from Devlin et al., BERT (2019).
The closest prior model, 2×2 patches plus full self-attention: Cordonnier, Loukas & Jaggi, On the Relationship between Self-Attention and Convolutional Layers (ICLR 2020).
The convolutional baselines ViT is measured against: Kolesnikov et al., Big Transfer (BiT) (2020), and Xie et al., Self-training with Noisy Student (2020).
Training ViT without a private 300M-image dataset, via distillation: Touvron et al., Training data-efficient image transformers (DeiT) (2021).