
Flamingo: a Visual Language Model for Few-Shot Learning

Connect a frozen vision model to a frozen language model, and it learns new tasks from a few examples.

A pretrained vision model and a pretrained language model stay fixed. A small set of new layers lets the language model read the images, and once those new layers are trained on web data, the model takes on a new image or video task from a few examples in its prompt, with no further training.

Explaining the paper: Flamingo: a Visual Language Model for Few-Shot Learning. Alayrac, Donahue, Luc, Miech, et al. · DeepMind · NeurIPS 2022 · arXiv:2204.14198

Show it two captioned photos and then a third photo, and it writes the third caption. Show it a chart and a question, and it answers. The weights never change: the task is defined entirely by the prompt.

A large language model like GPT-3 can pick up a new task from a handful of examples typed into its prompt, with no extra training. You show it a few questions and their answers, add a new question, and it continues the pattern. GPT-3 called this in-context few-shot learning, and it was the thing that made scale feel like a change of kind rather than degree. Vision had no equivalent. The dominant recipe was still to collect thousands of labeled images for each new task and fine-tune a model on them. Contrastive models like CLIP loosened that by scoring how well an image matches a piece of text, which handles classification with no fine-tuning, but a similarity score cannot write a caption or answer a question about a picture. It has no way to produce language at all.

Flamingo, from DeepMind, closes that gap. It is a visual language model: feed it text with images and videos mixed in, and it produces free-form text. Once trained, it takes on a new image or video task the GPT-3 way, by reading a few (image, answer) examples in its prompt. Across 16 image and video benchmarks it set a new few-shot state of the art with as few as four examples per task, and on six of those benchmarks a single set of weights, prompted with 32 examples and never fine-tuned, beat models that had been fine-tuned on hundreds of thousands of task-specific examples, on the order of a thousand times more data.

The design reuses two things the field already has and adds a small piece to join them. A pretrained vision model supplies the eyes; a pretrained language model supplies the words and the reasoning; both stay frozen. Between them sit a few new trainable layers that let the language model read the images. A few ideas make that work: freezing the two experts and training only the bridge, a resampler that turns any image or video into a fixed handful of tokens, a gated cross-attention layer that begins as a no-op so it never disturbs the frozen model, and a masking rule that lets one prompt hold many interleaved images. Each is straightforward on its own.

Prompted, not fine-tuned

Start with what using Flamingo looks like, because it shapes everything else. To adapt it to a task you do not touch the weights. You assemble a prompt: a few worked examples, each an image followed by the text you want, then the new image you actually care about. The model reads the pattern and continues it.

Say you want captions in a fixed style. You feed a photo of a chinchilla and the text This is a chinchilla., a photo of a shiba and This is a shiba., then a photo of a flamingo and let the model finish, and it writes This is a flamingo. The examples fix the format; they do not change a single weight. This is the same in-context few-shot learning GPT-3 showed for text, now with pictures sitting next to the answers. Drag the number of shots in the figure below. Each shot adds one (image, caption) pair to the prompt; nothing about the model changes and no gradient step is taken.
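A minimal sketch of how such a prompt might be assembled as one interleaved sequence. The `<image>` placeholder and the `<EOC>` end-of-chunk marker follow the markers this article describes for the training data; the function name, the file names, and the exact string layout are illustrative assumptions, not the paper's tokenizer format.

```python
from typing import List, Tuple

IMAGE = "<image>"  # placeholder marking where an image's visual tokens attach
EOC = "<EOC>"      # end-of-chunk marker; the exact spelling is an assumption


def build_prompt(support: List[Tuple[str, str]], query_image: str,
                 stem: str = "") -> Tuple[str, List[str]]:
    """Return (text, images): the interleaved text and the images in order."""
    parts, images = [], []
    for image, answer in support:
        parts.append(f"{IMAGE}{answer}{EOC}")  # one shot: image, then its text
        images.append(image)
    parts.append(f"{IMAGE}{stem}")  # the query image; the model continues here
    images.append(query_image)
    return "".join(parts), images


text, images = build_prompt(
    [("chinchilla.jpg", "This is a chinchilla."),
     ("shiba.jpg", "This is a shiba.")],
    query_image="flamingo.jpg",
    stem="This is",
)
print(text)
print(images)
```

Adding a shot only appends one more pair to `support`; nothing in the sketch touches model weights, which is the point of the prompting interface.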

Figure 1 · few-shot prompting

A prompt is a few image → caption support pairs, then a query image the model must continue. Add shots and the prompt grows; the weights stay frozen. Flamingo trains on sequences of at most 5 images, yet the same prompt keeps working out to 32.

Flamingo's zero-shot is not GPT-3's literal no-examples prompt: it supplies two text-only examples with no images, enough to convey the answer format, since without a format cue the model tends to ramble. That makes it really zero image-shots, a slightly counterintuitive name. The upper end is generous too: trained on sequences holding at most five images, Flamingo still improves when you give it 32 at test time, a robustness that comes directly from the masking rule a few sections below.

Freeze two experts, train the bridge

Both halves of Flamingo are borrowed and left untouched. The language half is a frozen Chinchilla model, DeepMind's compute-optimal LM; the vision half is a frozen image encoder covered in the next section. Freezing means their weights are never updated during Flamingo training: gradients may pass through the frozen LM layers to reach the new ones, but no optimizer step changes the frozen weights. Only the new layers wedged between them learn.

The obvious alternative is to fine-tune the language model so it adapts to images directly. The authors tried it and it hurt: unfreezing the pretrained LM dropped the overall benchmark score from 70.7 to 62.7, an 8.0-point fall they name as catastrophic forgetting, the model overwriting its language knowledge while it learns the new visual objective. Starting the LM from scratch instead of borrowing a pretrained one was worse, 57.8, a 12.9-point drop. Freezing keeps the expensive, hard-won knowledge of a pretrained model intact while a small bridge learns to feed it a new kind of input, with no weight update ever touching the LM. (One caveat on those figures: the overall score is a 0-to-100 average across five held-out benchmarks, and the drops are absolute points off 70.7, not relative percentages.) The picture to hold is writing notes in the margins of a finished textbook rather than rewriting the chapters: the original text stays as it was, and the margins add what you need.

How big is the bridge? Switch model size in the figure and read the parameter split.

Figure 2 · frozen vs trainable

A frozen vision encoder and language model do the heavy lifting; only the Perceiver Resampler and gated cross-attention layers train. Switch size: the trainable slice falls from nearly half the parameters at 3B to about an eighth at 80B, because the gated layers are inserted less often as the frozen LM grows.

Notice two things. First, the trainable bridge is never the bulk of the model, and its share only shrinks as the frozen LM scales up. Second, the name misleads: Flamingo-80B does not contain an 80-billion-parameter language model. The frozen Chinchilla inside it has 70 billion parameters; the 80B is the total once you add the roughly 10 billion trainable parameters of the bridge. The three models wrap frozen Chinchilla LMs of 1.4, 7, and 70 billion parameters, and are named Flamingo-3B, -9B, and -80B for their totals.
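The arithmetic behind the name can be checked in a few lines. This sketch uses only figures quoted in this article (the 435-million-parameter vision encoder is introduced in the next section); because the bridge is given as "roughly 10 billion", the results are approximate.

```python
# Parameter counts in billions, as quoted in the article.
frozen_lm = 70.0        # frozen Chinchilla language model
frozen_vision = 0.435   # frozen NFNet-F6 vision encoder
bridge = 10.0           # trainable resampler + gated cross-attention (rounded)

total = frozen_lm + frozen_vision + bridge
trainable_share = bridge / total

print(f"total ≈ {total:.1f}B")                      # total ≈ 80.4B
print(f"trainable share ≈ {trainable_share:.1%}")   # trainable share ≈ 12.4%
```

A share of about 12% is the "about an eighth" of the 80B model that trains.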

A frozen, contrastive vision encoder

The vision half has to output features that already mean something in language terms, so that a frozen copy can usefully feed a language model. Flamingo gets them the way CLIP does. Take a large batch of (image, caption) pairs, push the images through an image encoder and the captions through a text encoder into one shared space, and train so that each image scores highest against its own caption and each caption highest against its own image, a symmetric two-term contrastive loss. The image features come out aligned to language without a single hand-drawn class label.
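A minimal sketch of that symmetric loss on a toy batch, in plain Python. The temperature value, the 2-D embeddings, and the choice to average the two terms (summing them only rescales the loss) are illustrative assumptions; a real encoder would produce the embeddings and learn the temperature.

```python
import math
from typing import List


def cross_entropy_rows(logits: List[List[float]]) -> float:
    """Mean cross-entropy where row i's correct column is i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(img: List[List[float]], txt: List[List[float]],
                     temperature: float = 0.1) -> float:
    """Symmetric two-term loss over a batch of L2-normalized embeddings."""
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
           for i in img]                      # image-to-text logits
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image logits
    return 0.5 * (cross_entropy_rows(sim) + cross_entropy_rows(sim_t))


# Toy batch of three pairs with 2-D unit embeddings (made-up numbers).
images = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
matched = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
shuffled = [[0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]]

print(contrastive_loss(images, matched))   # small: each pair scores highest
print(contrastive_loss(images, shuffled))  # large: captions are misassigned
```

The only supervision is which caption came with which image, which is why no class labels are needed.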

Despite the shared loss, this is not OpenAI's CLIP. Flamingo trains its own vision encoder, a Normalizer-Free ResNet (NFNet-F6, a convolutional network that drops batch normalization in favor of weight standardization and gradient clipping), from scratch on that same contrastive loss over the authors' own image-text data. OpenAI's CLIP ViT-L/14 appears only as a baseline in the ablations, where the in-house NFNet-F6 beats it by 5.8 points. Once pretrained, this 435-million-parameter encoder is frozen for the rest of Flamingo's life.

The encoder does not hand over one pooled vector per image. It keeps the final convolutional feature map, a 2D grid of features, and flattens it into a sequence, so spatial layout survives as a set of tokens. Video is folded in the plain way: sample frames at one per second, encode each as an image, add a learned embedding marking which frame a feature came from, and flatten. (The authors add only these temporal markers, not spatial ones, since a convolutional encoder already carries spatial position in its channels.) The next piece has to bring that stream of visual features, however large, down to a fixed size.

The Perceiver Resampler: a fixed-width bottleneck

The visual stream is a problem for the language model. A high-resolution image is a few hundred features; a video is that many again for every frame. If the language model had to cross-attend to all of them, the cost of reading vision would swing with resolution and length, and a long video could dominate the computation.

The Perceiver Resampler fixes the width. It carries a small, fixed set of 64 learned vectors, called latent queries, that do not depend on the input. These 64 queries cross-attend to whatever visual features arrived (in cross-attention, one set of tokens reads a separate set: here the 64 latents read the visual features), each query pulling in a learned summary, and the module emits exactly 64 tokens every time, per image or video. The idea comes from the Perceiver: let a small learned set of latents attend to a big input so downstream cost stops tracking input size. Flamingo's twist is one detail: it lets the 64 latents also attend to themselves by concatenating their own keys and values with the visual ones, which the authors found helped a little.

So a single image and a thirty-frame video both leave the resampler as 64 tokens. The language model's cost of looking at one visual input is now a constant, set once at 64 and independent of pixels or frames. In the ablations a resampler beat both a plain MLP and a plain Transformer on the same parameter budget, while also running faster. Drag the input size in the figure: the left cloud of features swells with every added frame, and the 64 output tokens do not move.
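The fixed-width property can be seen in a stripped-down sketch. It is a single attention step with one head, no learned projections and no feed-forward block, and random vectors stand in for the learned latents and the encoder's features; the dimension of 8 and the 144 features per frame are illustrative sizes. The actual module stacks several such layers with learned weights.

```python
import math
import random
from typing import List

Vector = List[float]


def attend(queries: List[Vector], keys: List[Vector],
           values: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention: one output row per query."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, values)) / z
                    for j in range(d)])
    return out


def resample(visual: List[Vector], latents: List[Vector]) -> List[Vector]:
    """One resampler step: latents read the visual features and themselves."""
    keys_values = visual + latents  # concatenate visual and latent keys/values
    update = attend(latents, keys_values, keys_values)
    return [[a + b for a, b in zip(lat, upd)]
            for lat, upd in zip(latents, update)]


rng = random.Random(0)
D, N_LATENTS, FEATURES_PER_FRAME = 8, 64, 144


def rand(n: int) -> List[Vector]:
    return [[rng.gauss(0, 1) for _ in range(D)] for _ in range(n)]


latents = rand(N_LATENTS)             # stand-in for the learned latent queries
image = rand(1 * FEATURES_PER_FRAME)  # one frame
video = rand(8 * FEATURES_PER_FRAME)  # eight frames

print(len(resample(image, latents)), len(resample(video, latents)))  # 64 64
```

The output length is set by the number of queries, never by the number of keys, so the input can grow without changing what the language model has to read.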

Figure 3 · the resampler bottleneck

A variable cloud of visual features (bigger with every video frame) is read by 64 fixed latent queries that emit exactly 64 visual tokens, every time. That fixed 64 keeps the language model's vision cross-attention cost bounded.

That fixed 64 is per visual input, not for the entire prompt. Each image in an interleaved sequence gets its own 64-token block. Keeping the per-image count small and constant is also what makes it affordable to pack 32 images into one prompt at test time.

Gated cross-attention: reading vision into the frozen LM

Now the 64 visual tokens have to reach the frozen language model. Flamingo inserts new layers between the existing, frozen layers of the LM, each a cross-attention followed by a small feed-forward block. Cross-attention is ordinary attention with the two streams split: the queries come from the language tokens and the keys and values come from the visual tokens, so each language position reads the picture and pulls the relevant part into its own representation, the same soft-search idea attention was born with. Because there is one query per language token, the output keeps the language sequence exactly as long as it was; the image is read into it, not appended to it.

Dropping brand-new, randomly initialized layers into the middle of a carefully pretrained model would wreck it. At the first step those layers emit noise, and the frozen model, tuned to expect its own clean activations, would produce garbage, and training could diverge. Flamingo prevents this by making each new layer start as a no-op. It multiplies the layer's output by $\tanh(\alpha)$, where $\alpha$ is a single learnable number set to zero. Since $\tanh(\alpha)$ is zero when $\alpha$ is zero, the new layer contributes nothing at initialization and the conditioned model reproduces the frozen language model exactly. Training then raises each $\alpha$ above zero only as far as the visual signal earns its place.

# a gated xattn-dense block, inserted between two frozen LM layers
# x:   language hidden states, shape [n_text, d]
# vis: the image's 64 visual tokens, shape [64, d]
# alpha_xattn, alpha_ff: learnable scalars, both initialized to 0
def gated_block(x, vis, alpha_xattn, alpha_ff):
    x = x + tanh(alpha_xattn) * cross_attn(q=x, k=vis, v=vis)
    x = x + tanh(alpha_ff)    * feed_forward(x)
    return x            # at init, tanh(0)=0, so x passes through unchanged

Drag $\alpha$ in the figure. At the far left, $\alpha = 0$, the gate is shut, the output equals the input, and the block is the identity. As $\alpha$ grows the gate opens and a share of the visual correction mixes into the language model's running hidden state, the residual stream in the figure below.

Figure 4 · the tanh gate

A frozen-LM residual stream carries a hidden state. A new cross-attention block reads the visual tokens and produces a correction that passes a gate of openness $\tanh(\alpha)$. At $\alpha=0$ the gate is shut and the block is the identity: the frozen model is untouched. Open it and vision flows in.

Two details fill the picture out. Each inserted block actually holds two of these gates, one on the cross-attention and one on the feed-forward, each with its own $\alpha$. And $\tanh$ does more than switch on: it bounds the gate to the range $(-1,1)$, so even wide open a gate cannot amplify its layer without limit, and it can even go slightly negative if that helps. The gate matters in the ablations: remove it and add the layers plainly, and the score falls 4.2 points and training turns unstable. A zero-initialized scalar on a residual branch is not unique to Flamingo (ReZero and SkipInit use the same trick, and Flamingo credits earlier tanh-gating work for multimodal fusion), but the exact-zero start paired with the $\tanh$ bound is its particular recipe.
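A small runnable check of the two properties just described, identity at $\alpha = 0$ and a bounded gate. The `noisy_branch` function is a hypothetical stand-in for the output of an untrained cross-attention or feed-forward layer, made deliberately large; the vectors are made-up numbers.

```python
import math
from typing import Callable, List

Vector = List[float]


def gated_residual(x: Vector, branch: Callable[[Vector], Vector],
                   alpha: float) -> Vector:
    """Compute x + tanh(alpha) * branch(x), elementwise."""
    gate = math.tanh(alpha)
    return [xi + gate * bi for xi, bi in zip(x, branch(x))]


def noisy_branch(x: Vector) -> Vector:
    """Stand-in for an untrained layer's output (hypothetical values)."""
    return [100.0 for _ in x]


hidden = [0.5, -1.0, 2.0]

# At initialization the gate is exactly zero, so the block is the identity.
print(gated_residual(hidden, noisy_branch, alpha=0.0) == hidden)  # True

# However large alpha grows, the gate itself stays within [-1, 1].
for alpha in (0.6, 3.0, 50.0, -0.2):
    print(alpha, math.tanh(alpha))
```

Because the gate multiplies the branch rather than the residual stream, the frozen model's own activations pass through unscaled at every value of $\alpha$.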

How often these layers are inserted is a compute knob. Flamingo-3B puts one before every LM layer; the 9B model before every fourth; the 80B model before every seventh. Inserting them less often costs little: at every fourth layer the score drops only 1.9 points while training runs 66% faster, which is why the largest model, where compute bites hardest, gets the sparsest spacing.

Weaving many images into one prompt

A Flamingo prompt is one interleaved sequence: text with images and videos (each already turned into its 64 tokens) dropped in at their positions. The model is autoregressive over the text, predicting each token from everything before it. Writing $y$ for the text tokens and $x$ for the images and videos:

$$p(y \mid x) = \prod_{\ell=1}^{L} p\big(y_\ell \mid y_{<\ell},\, x_{\le \ell}\big) \tag{1}$$

Each text token $y_\ell$ is predicted from the earlier text $y_{<\ell}$ and the images that came before it, $x_{\le\ell}$. To pin down which images those are, the paper defines a function $\phi$ that maps each text position to the index of the last image preceding it (or 0 if none has appeared yet), and $x_{\le\ell}$ is every image up to that one. For the sequence img1 t1 t2 img2 t3, $\phi$ sends $t_1,t_2$ to image 1 and $t_3$ to image 2.
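The definition of $\phi$ is short enough to write out directly. In this sketch the `("img", name)` / `("txt", name)` token encoding is an illustrative assumption; only the mapping rule comes from the text.

```python
from typing import List, Tuple

# Each token is ("img", name) or ("txt", name); this encoding is illustrative.
Token = Tuple[str, str]


def phi(sequence: List[Token]) -> List[Tuple[str, int]]:
    """Map each text token to the index of the last image before it (0 if none)."""
    last_image = 0
    mapping = []
    for kind, name in sequence:
        if kind == "img":
            last_image += 1  # images are indexed 1, 2, ... in order of appearance
        else:
            mapping.append((name, last_image))
    return mapping


seq = [("img", "img1"), ("txt", "t1"), ("txt", "t2"),
       ("img", "img2"), ("txt", "t3")]
print(phi(seq))  # [('t1', 1), ('t2', 1), ('t3', 2)]
```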

The likelihood conditions each token on all preceding images, but the cross-attention layers do something narrower: a text token attends directly to only one image, the single most recent one before it. Earlier images are not dropped from the model. Their influence reaches the token through the frozen LM's self-attention over the text, which has already folded in what those earlier images contributed to nearby tokens. The two views agree; the mask factorizes the full conditioning rather than contradicting it.

Why restrict the direct attention to one image when you could let every token see every prior image? Two reasons. It scored better, 7.2 points of overall score over attend-to-all in the ablation. And it makes the model blind to how many images the sequence holds: each text token always faces exactly one image, so a model trained on sequences of at most five images runs unchanged on prompts with 32. The jump from 5 shots at training to 32 at test, promised earlier, is a direct consequence of this rule. Step through the tokens in the figure and toggle the two masks.
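The two masking variants can be compared side by side. This sketch builds one mask row per text token and, for readability, one column per image, where each column stands for that image's whole block of 64 visual tokens; the list-of-strings encoding of the sequence is an illustrative assumption.

```python
from typing import List


def cross_attention_mask(kinds: List[str],
                         single_image: bool = True) -> List[List[int]]:
    """One row per text token, one column per image (1 = may attend)."""
    n_images = kinds.count("img")
    rows, seen = [], 0
    for kind in kinds:
        if kind == "img":
            seen += 1
            continue
        if single_image:
            # Only the most recent image before this text token.
            row = [1 if j == seen - 1 else 0 for j in range(n_images)]
        else:
            # Ablated variant: every image that came before this token.
            row = [1 if j < seen else 0 for j in range(n_images)]
        rows.append(row)
    return rows


kinds = ["img", "txt", "txt", "img", "txt", "img", "txt"]
for row in cross_attention_mask(kinds, single_image=True):
    print(row)
print()
for row in cross_attention_mask(kinds, single_image=False):
    print(row)
```

Under the single-image rule every row has at most one lit column regardless of how many images the sequence holds, which is the property that lets a model trained on five images run on 32.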

Figure 5 · per-image masking

An interleaved sequence of images and text. Under the single-image rule each text token's row in the mask lights exactly one column, the last image before it ($\phi$). Toggle to attend-to-all to see the ablated variant. Cross-image links travel through the LM's self-attention.

Training on the open web, without labels

None of this needs data labeled for machine learning. Flamingo trains on three kinds of web scrape at once. The first and most important is MultiModal MassiveWeb (M3W): the text and images of about 43 million web pages, kept in their original interleaved order, with an <image> marker at each image's position and an end-of-chunk token closing each block. From each page, Flamingo samples a 256-token window and keeps up to the first five images. The other two are more familiar: image-text pairs (1.8 billion from ALIGN, plus 312 million higher-quality ones the authors collected) and 27 million short video-text pairs, each reshaped to look like the interleaved format.
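A rough sketch of the M3W sampling step described above: take a 256-token window from a page and keep at most the first five images in it. The page here is a made-up token list, and dropping the markers of any extra images is an assumption about how the surplus is handled.

```python
import random
from typing import List, Tuple

IMAGE = "<image>"


def sample_window(tokens: List[str], rng: random.Random,
                  window: int = 256, max_images: int = 5) -> Tuple[List[str], int]:
    """Sample a contiguous window and keep at most `max_images` image markers."""
    start = rng.randrange(max(1, len(tokens) - window + 1))
    kept, n_images = [], 0
    for tok in tokens[start:start + window]:
        if tok == IMAGE:
            if n_images == max_images:
                continue  # assumption: images past the fifth are dropped
            n_images += 1
        kept.append(tok)
    return kept, n_images


# A made-up page: 1,000 tokens with an image marker every 30 tokens.
page = [IMAGE if i % 30 == 0 else f"w{i}" for i in range(1000)]
window, n_images = sample_window(page, random.Random(0))
print(len(window), n_images)
```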

Flamingo trains on the plain next-token negative log-likelihood of the text given the visual conditioning, summed over the datasets with a weight on each:

$$\sum_{m=1}^{M} \lambda_m \cdot \mathbb{E}_{(x,y)\sim \mathcal{D}_m}\!\left[\, -\sum_{\ell=1}^{L} \log p\big(y_\ell \mid y_{<\ell},\, x_{\le\ell}\big) \right] \tag{2}$$

$\mathcal{D}_m$ is the $m$-th dataset and $\lambda_m$ its weight. Tuning those weights mattered more than the raw dataset sizes: the interleaved M3W set, small at 43 million pages, carries the largest weight (1.0), while the 1.8-billion-pair ALIGN set gets 0.2 and the video set 0.03. And the datasets are combined by accumulation, not rotation: every gradient step sees a batch drawn from all of them at once, which beat cycling through one dataset at a time by a wide margin, 70.7 against 62.9. The next section settles how much the mixture matters, and which parts carry the load.
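Equation (2) evaluated on toy numbers, as a sketch. The weights are the three quoted above; the token probabilities are made up, standing in for the probability a model assigned to each true token, and the expectation is approximated by a mean over each dataset's batch.

```python
import math
from typing import Dict, List

# Per-dataset weights quoted in the text (only these three are used here).
WEIGHTS: Dict[str, float] = {"M3W": 1.0, "ALIGN": 0.2, "video": 0.03}


def nll(token_probs: List[float]) -> float:
    """Negative log-likelihood of one text sequence from its token probabilities."""
    return -sum(math.log(p) for p in token_probs)


def mixture_loss(batches: Dict[str, List[List[float]]]) -> float:
    """Weighted sum over datasets of the mean per-sequence NLL."""
    total = 0.0
    for name, sequences in batches.items():
        mean_nll = sum(nll(seq) for seq in sequences) / len(sequences)
        total += WEIGHTS[name] * mean_nll
    return total


# Made-up values of p(y_l | y_<l, x_<=l) for the true tokens in each batch.
batches = {
    "M3W": [[0.5, 0.25], [0.5, 0.5]],
    "ALIGN": [[0.25, 0.25]],
    "video": [[0.5]],
}
print(mixture_loss(batches))
```

Accumulation corresponds to computing this one weighted sum over batches from every dataset before a single parameter update, rather than updating after each dataset in turn.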

What matters most, and what it buys

The ablations rank the design decisions by how much each is worth, all measured as the overall score of Flamingo-3B at four shots. Select rows in the figure to read what each one changes.

Figure 6 · what each choice is worth

Overall score (Flamingo-3B, 4 shots) as each design choice is removed or swapped. The full model scores 70.7; the amber shortfall marks what each ablation costs. The training data and freezing the LM dwarf the architectural knobs.

The bars sort the decisions by what each costs, and data outranks architecture. Removing the interleaved M3W data costs 17.3 points, training the LM from scratch 12.9, dropping the paired image-text data 9.8. The architectural pieces cost far less: the tanh gate 4.2 points, and the spacing of the gated layers under 2. A reader could reasonably have guessed the clever new layers were where the performance lived. They help, but the data and the decision to freeze help more.
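The ranking can be rebuilt from the numbers quoted throughout this article. The sketch below only collects and sorts them; the round-robin cost of 7.8 is derived here as 70.7 minus the quoted 62.9, and the labels are shorthand.

```python
BASELINE = 70.7  # Flamingo-3B, 4 shots, overall score

# Points lost by each ablation, as quoted in this article.
COSTS = {
    "no interleaved M3W data": 17.3,
    "LM trained from scratch": 12.9,
    "no paired image-text data": 9.8,
    "LM unfrozen (fine-tuned)": 8.0,
    "round-robin instead of accumulation": 7.8,  # derived: 70.7 - 62.9
    "attend to all previous images": 7.2,
    "OpenAI CLIP ViT-L/14 vision encoder": 5.8,
    "no tanh gate": 4.2,
    "gated block every 4th LM layer": 1.9,
}

for name, cost in sorted(COSTS.items(), key=lambda kv: -kv[1]):
    print(f"{name:<38} {BASELINE - cost:5.1f}  (-{cost:.1f})")
```

Read top to bottom, the first five rows are all about data or about what happens to the language model; the purely architectural choices fill the bottom of the table.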

On the benchmarks themselves, a single Flamingo model, prompted with 32 examples and never fine-tuned, beat the previous fine-tuned state of the art on six of sixteen tasks, even though those fine-tuned models used on the order of a thousand times more task-specific data. When the authors did allow fine-tuning, spending a larger annotation budget, Flamingo set a new state of the art on five more benchmarks: VQAv2, VATEX, VizWiz, MSRVTTQA, and HatefulMemes.

Flamingo inherits its language model's weaknesses: it can hallucinate, it generalizes poorly to text far longer than it trained on, and it is sample-hungry. Its classification accuracy trails dedicated contrastive models, which optimize directly for image-text matching, of which classification is a special case. And in-context learning, while it needs only a few dozen examples and no tuning, is sensitive to which examples you pick and scales poorly, in both cost and accuracy, once you push past that low-data regime.

The larger point Flamingo made is that you do not have to train a multimodal model end to end to get one. Freeze two pretrained experts, a vision model and a language model, and join them with a thin trainable bridge: a resampler to fix the width, a gated cross-attention to feed the language model without disturbing it, and a masking rule to weave many images into one prompt. Prompted with a few examples, the result then learns new vision tasks the way a large language model learns new text tasks. Much of what followed in multimodal modeling is a variation on that recipe.

Provenance: verified against primary literature

CLIP (Radford et al. 2021): The two-term contrastive loss used to pretrain Flamingo’s own in-house NFNet-F6 vision encoder.

Chinchilla (Hoffmann et al. 2022): The frozen 1.4B / 7B / 70B language models Flamingo builds on.

NFNet (Brock et al. 2021): The normalizer-free convolutional vision backbone, the F6 variant.

Perceiver (Jaegle et al. 2021): Learned latent queries cross-attending to a variable input; the Resampler adapts it.

GPT-3 (Brown et al. 2020): In-context few-shot learning, carried here to multimodal prompts.

ReZero / tanh-gating: Zero-initialized residual scalars: identity at initialization.

Correction: The paper's own count of tasks where few-shot Flamingo beats the fine-tuned state of the art is six (the intro, the body, the conclusion, and the Figure 2 caption). The Table 1 caption slips and says seven; six is the correct number.

Questions you might still have

Is Flamingo’s vision encoder just OpenAI’s CLIP?
No. Flamingo trains its own NFNet-F6 encoder from scratch on the CLIP contrastive loss, using its own image-text data, then freezes it. OpenAI’s CLIP ViT-L/14 appears only as an ablation baseline, which the in-house encoder beats by 5.8 points.

Does "Flamingo-80B" mean an 80-billion-parameter language model?
No. The frozen Chinchilla language model inside it has 70 billion parameters. The 80B is the total after adding roughly 10 billion trainable bridge parameters. The three sizes wrap frozen 1.4B, 7B, and 70B language models.

If a text token cross-attends to only one image, how does it use earlier images?
The single-image rule governs only the direct cross-attention. Earlier images still reach a token through the frozen language model’s self-attention over the text, which has already absorbed what those images contributed nearby. The likelihood conditions on all prior images; the mask factorizes that conditioning.

What does "zero-shot" mean for Flamingo?
Not literally no examples. A zero-shot prompt carries two text-only examples (no images) to convey the answer format, since without a cue the model tends to ramble. It is really zero image-shots, not GPT-3’s instruction-only setting.

Why not just fine-tune the language model on the images?
It causes catastrophic forgetting. Unfreezing the pretrained LM drops the overall score by 8.0 points; training one from scratch drops it by 12.9. Freezing keeps the language knowledge intact while the bridge learns to inject vision.

Footnotes & further reading

The paper: Alayrac, Donahue, Luc, Miech et al., Flamingo: a Visual Language Model for Few-Shot Learning (DeepMind, NeurIPS 2022).
The contrastive loss used to pretrain the vision encoder: Radford et al., Learning Transferable Visual Models From Natural Language Supervision (CLIP). Our explainer: CLIP.
The frozen language models: Hoffmann et al., Training Compute-Optimal Large Language Models (Chinchilla). Our explainer: Chinchilla.
The vision backbone: Brock, De, Smith, Simonyan, High-Performance Large-Scale Image Recognition Without Normalization (NFNet; introduces adaptive gradient clipping and the F0–F6 family).
The resampler's ancestor: Jaegle et al., Perceiver: General Perception with Iterative Attention.
In-context few-shot learning: Brown et al., Language Models are Few-Shot Learners (GPT-3). Our explainer: GPT-3.
Zero-initialized residual scalars (identity at init): Bachlechner et al., ReZero is All You Need; the tanh-gating for multimodal fusion Flamingo cites is Hendricks et al. (2021).
