Segment Anything

One promptable model that cuts out any object.

Segment Anything tried to give image segmentation a foundation-model moment: pre-train one model on a single general task, then point it at anything. What made it work was letting the model build its own dataset, a billion masks deep.

Explaining the paper: Segment Anything. Kirillov, Mintun, Ravi, Mao, Rolland, Gustafson, Xiao, Whitehead, Berg, Dollár, Girshick · Meta AI (FAIR) · ICCV 2023 · arXiv:2304.02643

Language had the whole web to learn from; segmentation had nothing like it.

Segmentation is the task of tracing the exact outline of a thing, pixel by pixel, separating it from everything else in the frame. It is one of the oldest problems in computer vision, and for most of the last decade it was solved the same tedious way: pick a dataset, fix a list of object classes, train a fresh model, and start over for the next dataset. Each model knew only the classes it was shown and only the kind of images it was trained on.

Language modeling had already escaped that loop. You pre-train one large model on a single, dumb objective (predict the next token), and then you steer it to new jobs by prompting it, no retraining required. The same model summarizes, translates, and writes code, because the pre-training task was general enough to teach it something reusable. Segment Anything, out of Meta's FAIR lab in 2023, asks the obvious question for vision: could segmentation work like that?

Segment Anything answers with a system, not a single network. It has three parts that were designed together: a task general enough to pre-train on and to repurpose by prompting, a model (SAM, the Segment Anything Model) fast enough to use interactively, and a data engine that produced a dataset of $1.1$ billion masks. That dataset, SA-1B, is the paper's main contribution, because the hard part was never the architecture. It was that the web is full of free text and free image-caption pairs, and almost entirely empty of pixel-perfect object outlines. The project used the model to label the data that trained the model.

The pieces, in order: what a foundation model means for pixels, the promptable task and its ambiguity, how the model is split so it runs in a browser, what a mask actually is and how it is scored, the loop that produced a billion of them, and finally what SAM can and cannot do once it is trained.

A foundation model for pixels

A foundation model is one you pre-train once on a broad task and then adapt to many narrower ones. In language the recipe is next-token prediction plus prompting. In vision the closest examples align images and text: CLIP learns an image encoder and a text encoder that land in the same space, and once trained you can name a new category in words and have it recognized, with no labeled examples of that category. That last move has a name the paper uses precisely: zero-shot transfer, evaluating a model on data and tasks it never saw in training.

Segmentation sits below that text-and-image layer. The job is not to name an object, it is to draw it. And there is no web-scale pile of drawings to learn from. So the SAM authors set themselves a sharper goal than "a good segmenter": a promptable model, pre-trained on a task broad enough that prompting alone carries it to new image distributions and new downstream problems. Get the task right and the same frozen model should handle jobs nobody trained it for, the way a language model answers a question it never saw.

The promptable task

A prompt, for SAM, is anything that points at what you want segmented: one or more foreground or background points, a rough box, a coarse mask, or even a line of text. The promptable segmentation task is then a single sentence: given any prompt, return a valid mask.

The word "valid" exists to handle ambiguity. Consider a single point clicked on someone's shirt. Did you mean the shirt, or the person wearing it, or the small logo on the pocket? A single point cannot say. The task does not demand the model read your mind. It demands that the output be a reasonable mask for at least one sensible interpretation. This is the same forgiveness we extend to a language model: asked something vague, it should still answer something coherent rather than freeze.

That definition was chosen because it does double duty. It is a pre-training objective: simulate a stream of prompts for each training mask and score the model's guesses against the ground truth. And it is a method for downstream tasks: once the model answers any prompt well, you solve a new problem by engineering the right prompt rather than training a new head. SAM borrows the prompt-simulation idea from interactive segmentation, where a user clicks to refine a mask, but bends the goal. Interactive segmenters aim to be right eventually, after enough clicks. SAM has to be valid immediately, for any prompt, even a single ambiguous one, because its own data engine will later lean on that property to label images with no human in the loop.

One point, three masks

Ambiguity is the central modeling problem here, not a corner case, and the naive fix makes it worse. If the model emits one mask and the training signal is the average over every valid interpretation, it learns to predict the average mask: a blurry compromise that is the pocket, the shirt, and the person smeared together, and a good outline of none of them.

SAM does not assume there is one answer. For a single prompt it predicts three masks at once. Three is not arbitrary. When a point is ambiguous, the candidate objects are almost always nested: a subpart inside a part inside a whole (the pocket, the shirt, the person), and that nesting is rarely more than three deep. So three outputs cover the usual ambiguity without the model ever having to guess which level you meant.

Training has to resolve one thing. You only have one ground-truth mask per prompt, so which of the three predictions should it supervise? SAM computes the loss for all three and back-propagates only through the best match:

$$\mathcal{L}_{\text{mask}} = \min_{i\,\in\,\{1,2,3\}}\Big[\,20\,\mathrm{FL}(m_i, g) + \mathrm{DL}(m_i, g)\,\Big] \tag{1}$$

Here $m_i$ are the three predicted masks, $g$ the ground truth, and $\mathrm{FL}$, $\mathrm{DL}$ the focal and dice losses, defined further down. The $\min$ selects which head is trained. Whichever prediction happens to be closest to this particular ground truth gets all the gradient, and the other two get none, so they specialize on the interpretations they currently fit best. This is an old technique called multiple-choice learning: let an ensemble of outputs cover the plausible answers, and grade only the one that was right. The counterfactual shows why the $\min$ matters: hand the gradient to all three heads on every prompt and each gets pulled toward the same compromise, three copies of the blur from before; starving all but the closest head leaves each free to drift toward the interpretation it is already best at. Over many prompts the three slots settle into a rough whole / part / subpart division of labor.

If the model returns three masks, which does it hand back when an application wants just one? SAM adds a tiny head that, for each mask, predicts its own quality as an estimated IoU (the overlap it thinks it would score against the true object), and ranks the three by that number. It is a learned self-estimate trained by regression, not a calibrated probability and not a guarantee. A high estimated IoU means the model predicts a high IoU for that mask, nothing more. It can be confidently wrong on an unfamiliar image.

Drag the green point and slide through the three masks. One ambiguous prompt, three valid answers, each with the model's own confidence:

Figure 1 · one prompt, three valid masks

A single foreground point on the pocket is ambiguous. SAM returns three nested masks (subpart, part, whole), each ranked by its own estimated IoU. Because the point sits inside all three, all three are valid at once. The confidence values are illustrative; the mechanism is the point.

There is a fourth output token too, used only when you give the model more than one prompt. With several points or a box the ambiguity mostly evaporates, so SAM switches to a single clean prediction instead of three near-duplicates. The three-mask machinery is there for the hard case of one lonely point.

Heavy once, cheap forever

The task says "answer any prompt." The product goal adds "and do it while the user clicks around," which means roughly $50$ milliseconds per prompt, in a browser. That single constraint shapes the architecture, because the obvious design (run a big network end to end for every click) is far too slow. A heavy vision transformer takes a sizeable fraction of a second per image. You cannot pay that on every click. Fifty milliseconds is interactive-latency territory, short enough that the mask seems to arrive with the click rather than after it; holding that budget is the design pressure that pushes the heavy work out of the click loop, and the amortization in Figure 2 below shows the result.

SAM splits the model along the line between "depends only on the image" and "depends on the prompt." A heavyweight image encoder runs once per image and turns it into a reusable embedding. Then a lightweight prompt encoder and mask decoder run per prompt, reusing that embedding. The expensive part is paid once and amortized over every prompt you ask.

Image encoder. An MAE-pretrained ViT-H/16, adapted to a $1024 \times 1024$ input. It outputs a $64 \times 64$ grid of $256$-dimensional feature vectors (the image downscaled $16\times$). Heavy, and run exactly once.
Prompt encoder. Points and boxes become $256$-dimensional vectors: a positional encoding of the location summed with a learned embedding for the prompt's type (foreground point, box corner, and so on). A coarse mask prompt is fed through a few convolutions and added straight onto the image embedding. Text uses a frozen CLIP text encoder.
Mask decoder. Two small transformer blocks with attention running both ways, prompt tokens (alongside a learned output token, introduced below) reading the image and the image reading those tokens, so each informs the other. Then the embedding is upscaled and turned into a mask. It uses under $1\%$ of the image encoder's compute, which is why it fits in the $\sim 50$ ms budget.

The first prompt on a fresh image is expensive, because it alone pays for the encoder. The second prompt is cheap, and the tenth is cheap, because they all reuse the same embedding. The average cost per prompt falls toward the $50$ ms floor. Drag the number of prompts and watch it drop:

Figure 2 · amortizing the heavy encoder

The image encoder runs once and produces a reusable embedding; each prompt after that is a $\sim 50$ ms decode. So the amortized cost per prompt falls toward the $50$ ms floor the more prompts you ask of one image. The encoder cost shown is illustrative; the $50$ ms decode is from the paper.
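The amortization is plain arithmetic, and a few lines of Python make it concrete. This is a minimal sketch: the function name is made up for the example, and the 150 ms encoder cost is an assumed, illustrative value. Only the 50 ms decode figure comes from the paper.

```python
def amortized_ms_per_prompt(num_prompts: int,
                            encoder_ms: float = 150.0,  # assumed, illustrative
                            decode_ms: float = 50.0) -> float:
    """Average cost per prompt when one image embedding is reused."""
    if num_prompts < 1:
        raise ValueError("need at least one prompt")
    # The encoder is paid once; every prompt pays one decode.
    total_ms = encoder_ms + num_prompts * decode_ms
    return total_ms / num_prompts


for k in (1, 2, 8, 100):
    print(k, amortized_ms_per_prompt(k))
```

Under these assumed numbers the per-prompt average starts at the full encoder-plus-decode cost and approaches the decode cost as the prompt count grows, but never drops below it.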

In code the split is plain to see. Encode once, then loop over prompts:

# encode the image ONCE (heavy ViT-H, ~0.15s on a GPU)
img_embed = image_encoder(image)        # 1024x1024 -> 64x64 x 256

# then every prompt is a ~50ms decode that reuses img_embed
for prompt in prompts:                  # a point, box, mask, ...
    tokens = prompt_encoder(prompt)     # -> 256-dim embeddings
    masks, iou = mask_decoder(img_embed, tokens)  # 3 masks + scores
    mask = masks[argmax(iou)]           # rank by estimated IoU

This is also why SAM is, strictly, not real-time end to end. The $50$ ms figure is the prompt encoder plus mask decoder. The image encoder behind it is heavy, and is paid up front, once, before you start clicking.

What lives inside that $50$ ms? Roughly, a prompt embedding, two transformer blocks of cross-attention, an upsample to a full-resolution mask, and an $\widehat{\mathrm{IoU}}$ head that scores each output mask. Slide $K$ below to see the wall time: the first prompt eats the encoder, every prompt after it is $50$ ms.

Figure 3 · where the 50 ms goes

Top row: the first prompt on a fresh image pays for the image encoder once, then a thin $\sim 50$ ms slice for prompt encoder + decoder + mask + $\widehat{\mathrm{IoU}}$ head. Bottom row: $K$ subsequent prompts on the same embedding, $50$ ms each. Only the $50$ ms total is paper-stated; the sub-segments inside it are illustrative (the paper does not publish per-block decoder latencies), and the encoder bar is drawn long enough to dominate without claiming a measured ratio. The top row's time axis is compressed so the $50$ ms slice stays legible.

What a mask actually is

We have been saying "mask" loosely. Concretely it is a probability per pixel: how likely each location is to be part of the object. The decoder produces it in a neat way: it carries a learned output token through the attention blocks, projects it to a vector $v$, and then computes, at every location $(x,y)$, the dot product between $v$ and that location's image feature $e_{x,y}$:

$$M(x,y) = \sigma\big(\langle v,\, e_{x,y}\rangle\big) \tag{2}$$

The mask is a similarity heatmap. The output token learns "what this object looks like in feature space," and the mask is high wherever the image feature matches the token, squashed through a sigmoid $\sigma$ into a probability. It is a tiny linear classifier whose weights are produced on the fly from the prompt, which is why the paper calls it a dynamic classifier. The decoder writes a brand-new classifier for every prompt, and the mask is that classifier evaluated at every pixel at once. Rotate $v$ below and watch the same equation produce the mask:

Figure 4 · the mask is a dot product

Equation (2) on a toy grid. Left, each cell's feature vector (the object region's features point one way, the background's roughly the other, with noise). Right, the mask: teal intensity is $\sigma(\langle v, e\rangle)$ per cell, the bright contour is the $0.5$ threshold. Align $v$ with the object's feature direction (the amber notch) and the mask snaps onto the object; misalign it and the mask is mush. Toy 2D features; the real model uses 256-dim features and the decoder's attention, not a slider, picks $v$. Same mechanism.
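Equation (2) is short enough to write out directly. The sketch below is a toy NumPy version under stated assumptions: the features are hand-made 2D vectors rather than SAM's 256-dimensional embedding, `v` is chosen by hand instead of being produced by the decoder, and `dot_product_mask` is a name invented for this example.

```python
import numpy as np


def dot_product_mask(v: np.ndarray, features: np.ndarray) -> np.ndarray:
    """Equation (2): sigmoid of <v, e_xy> at every location.

    v has shape (C,), features has shape (H, W, C); returns (H, W) probabilities.
    """
    logits = features @ v
    return 1.0 / (1.0 + np.exp(-logits))


# Toy features: background cells point along -x, a 2x2 object points along +x.
features = np.full((4, 4, 2), [-1.0, 0.0])
features[1:3, 1:3] = [1.0, 0.0]

v = np.array([4.0, 0.0])            # token aligned with the object direction
mask = dot_product_mask(v, features)
binary = mask > 0.5                  # hard mask at the 0.5 threshold
print(binary.astype(int))
```

With `v` aligned to the object direction, only the four object cells pass the threshold. Swapping in a `v` orthogonal to both directions, such as `[0.0, 4.0]`, gives a flat 0.5 everywhere, which is the "mush" case from the figure.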

To train any of this you need a way to score a predicted mask against the truth. The standard yardstick is Intersection over Union: the area where prediction and truth overlap, divided by the area they cover together.

$$\mathrm{IoU}(A,B) = \frac{|A \cap B|}{|A \cup B|} \tag{3}$$

It runs from $0$ (no overlap) to $1$ (perfect). Averaged over many examples it is reported as mIoU, the standard segmentation score. Drag the predicted mask onto the ground truth and watch the number climb:

Figure 5 · IoU, the yardstick

IoU is the overlap divided by the union of the ground truth and the prediction. Drag the teal mask to change it. The dashed mark sits at $0.90$, the level that 94% of SA-1B's automatic masks exceed when checked against a human correction.
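For binary masks, Equation (3) is a pair of pixel counts. The following is a minimal NumPy sketch; treating two empty masks as a perfect match is a convention assumed here, not something the paper specifies.

```python
import numpy as np


def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over Union of two binary masks of the same shape."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree
    return float(np.logical_and(a, b).sum() / union)


gt = np.zeros((10, 10), dtype=bool)
gt[2:8, 2:8] = True                 # 6x6 ground-truth square
pred = np.zeros((10, 10), dtype=bool)
pred[3:9, 3:9] = True               # same square shifted by one pixel
print(iou(gt, pred))                # 25 overlapping pixels over a union of 47
```

A one-pixel diagonal shift of a 6x6 square already costs nearly half the score, which is why IoU is a demanding yardstick for small objects.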

IoU is the metric, but it does not make a smooth training loss on its own. SAM supervises masks with a fixed blend of two losses, in a $20{:}1$ ratio of focal to dice, a recipe it inherits from DETR. The two losses fix different problems. Focal loss is cross-entropy with a knob that turns down the easy pixels:

$$\mathrm{FL}(p_t) = -\,\alpha_t\,(1 - p_t)^{\gamma}\,\log p_t \tag{4}$$

where $p_t$ is the predicted probability of the correct label for a pixel. The factor $(1-p_t)^{\gamma}$ does the down-weighting: when a pixel is already classified confidently ($p_t$ near $1$) that factor collapses to near zero, so the loss contributes little for that pixel, and training attention stays on the hard pixels along the object's edge. (The separate $\alpha_t$ just balances foreground against background; it is not what down-weights the easy cases.) Dice loss comes from the Dice overlap coefficient and pushes the predicted region to coincide with the true one:

$$\mathrm{Dice}(A,B) = \frac{2\,|A \cap B|}{|A| + |B|}, \qquad \mathrm{DL} = 1 - \mathrm{Dice} \tag{5}$$

Focal works pixel by pixel and is good at boundaries; dice works on the region as a whole and is good when the object is small and the background dwarfs it. Putting them together, with the min-over-three trick from before, gives the full training step:

# one training step on an (image, ground-truth mask) pair
img_embed = image_encoder(image)
prompt    = sample_prompt(gt)           # a point, or a noisy box
tokens    = prompt_encoder(prompt)
masks, iou_pred = mask_decoder(img_embed, tokens)   # 3 candidates

# 20:1 focal+dice per candidate; supervise only the BEST match
loss_i    = [20 * focal(m, gt) + dice(m, gt) for m in masks]
mask_loss = min(loss_i)                 # multiple-choice / hindsight
iou_loss  = mse(iou_pred, iou(masks, gt))   # the rank head, scale 1
(mask_loss + iou_loss).backward()
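The helpers `focal` and `dice` in the step above are left undefined, so here is one way they could look in plain NumPy, together with the min-over-three selection from Equation (1). Treat it as a sketch under assumptions: `alpha=0.25` and `gamma=2.0` are common focal-loss defaults rather than values stated in this article, the focal term is averaged over pixels, the dice term uses soft (probabilistic) overlap, and NumPy has no autograd, so the code only shows which candidate would receive the gradient.

```python
import numpy as np


def focal_loss(p, g, alpha=0.25, gamma=2.0, eps=1e-7):
    """Equation (4), averaged over pixels. p: probabilities, g: 0/1 truth."""
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(g == 1, p, 1.0 - p)              # prob. of the correct label
    alpha_t = np.where(g == 1, alpha, 1.0 - alpha)  # foreground/background balance
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))


def dice_loss(p, g, eps=1e-7):
    """Equation (5) with soft overlap: 1 - 2|A.B| / (|A| + |B|)."""
    inter = np.sum(p * g)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(p) + np.sum(g) + eps))


def min_over_candidates(masks, g):
    """Equation (1): 20:1 focal+dice per candidate, keep the smallest."""
    losses = [20.0 * focal_loss(m, g) + dice_loss(m, g) for m in masks]
    return int(np.argmin(losses)), losses


# Toy nested candidates; the ground truth is the middle one (the "shirt").
g = np.zeros((8, 8)); g[2:6, 2:6] = 1.0
pocket = np.full((8, 8), 0.05); pocket[3:4, 3:4] = 0.95
shirt = np.full((8, 8), 0.05); shirt[2:6, 2:6] = 0.95
person = np.full((8, 8), 0.05); person[1:8, 1:7] = 0.95

best, losses = min_over_candidates([pocket, shirt, person], g)
print(best, [round(x, 3) for x in losses])
```

In this toy case the selected index is the "shirt" candidate, so only that slot would be supervised for this ground truth; the pocket and person slots are left to fit other examples.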

The training prompt is not a single tidy click. SAM simulates an interactive session of $11$ rounds per mask: an initial point or box, then points sampled from wherever the current prediction is wrong, plus a couple of rounds with no new input so the model learns to polish its own output. That simulation lets the trained model later sit inside the data engine and behave like a tireless annotator.
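The "points sampled from wherever the current prediction is wrong" step can be sketched in a few lines. This is an illustrative NumPy version, not the authors' training code: the function name is hypothetical, sampling is uniform over the error region, and the click is labeled foreground when the model missed an object pixel and background when it spilled outside the object.

```python
import numpy as np


def sample_corrective_point(pred, gt, rng):
    """Pick the next simulated click from the error region.

    pred, gt: boolean masks of the same shape.
    Returns ((x, y), is_foreground), or None when there is nothing to fix.
    """
    error = pred != gt
    ys, xs = np.nonzero(error)
    if len(ys) == 0:
        return None
    i = rng.integers(len(ys))            # uniform over wrong pixels
    y, x = int(ys[i]), int(xs[i])
    is_foreground = bool(gt[y, x])       # missed object pixel -> foreground click
    return (x, y), is_foreground


gt = np.zeros((8, 8), dtype=bool); gt[2:6, 2:6] = True
pred = np.zeros((8, 8), dtype=bool); pred[2:6, 2:4] = True   # under-segmented

rng = np.random.default_rng(0)
point, is_fg = sample_corrective_point(pred, gt, rng)
print(point, is_fg)
```

Because the toy prediction only misses object pixels, every sampled click lands inside the ground truth and is labeled foreground.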

The data engine

This is circular: training SAM needs masks, and making masks cheaply needs SAM. The paper resolves this with a data engine: a loop where the model annotates, people correct, the model retrains on the corrections and gets better, and the next turn it carries more of the load. Three stages, and the human role shrinks at each one.

Assisted-manual. Annotators click foreground and background points in a browser tool powered by SAM, refining with brush and eraser. As the model improved (retrained six times, its encoder scaled from ViT-B up to ViT-H) the time to label a mask fell from $34$ to $14$ seconds and the masks per image rose from $20$ to $44$. This stage produced $4.3$ million masks over $120$k images.
Semi-automatic. To push diversity, SAM first auto-fills the confident, obvious masks, and annotators are asked only to add the objects it missed. That added $5.9$ million more masks (to $10.2$ million total) over $180$k images, with masks per image climbing from $44$ to $72$. The time per hand-drawn mask rose back to about $34$ seconds, because the objects SAM had missed were the harder ones.
Fully automatic. Now the model works alone. It is prompted with a regular $32 \times 32$ grid of points, returns the nested masks for each, keeps only the confident and stable ones (masks that barely change when you nudge the probability threshold up or down), and removes duplicate masks with non-maximal suppression (when several grid points land on the same object they produce near-identical masks; NMS keeps the highest-scoring one and drops the overlapping rest). The stability filter works as a quality test: when nudging the threshold barely moves the boundary, the per-pixel probabilities fall sharply right at the edge, which is what a sure segmentation looks like, while a wobbly mask means the model is guessing where the object ends. Run over all $11$ million images, this produced $1.1$ billion masks.
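Two mechanical pieces of the fully automatic stage, the point grid and the duplicate removal, can be sketched as follows. Several choices here are assumptions made for the illustration: the grid points are placed at cell centers in normalized coordinates, the suppression compares masks by IoU directly, and the `0.7` overlap threshold is arbitrary. The function names are invented for this example.

```python
import numpy as np


def point_grid(n: int = 32) -> np.ndarray:
    """n x n point prompts in normalized [0, 1] coordinates (cell centers)."""
    ticks = (np.arange(n) + 0.5) / n
    xs, ys = np.meshgrid(ticks, ticks)
    return np.stack([xs.ravel(), ys.ravel()], axis=1)    # shape (n*n, 2)


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


def greedy_mask_nms(masks, scores, iou_thresh=0.7):
    """Keep the highest-scoring mask, drop any later mask that overlaps a kept one."""
    order = np.argsort(scores)[::-1]                      # best score first
    keep = []
    for i in order:
        if all(mask_iou(masks[i], masks[j]) <= iou_thresh for j in keep):
            keep.append(int(i))
    return keep


a = np.zeros((8, 8), dtype=bool); a[0:4, 0:4] = True     # object seen from point 1
b = np.zeros((8, 8), dtype=bool); b[0:4, 0:5] = True     # near-duplicate of a
c = np.zeros((8, 8), dtype=bool); c[5:8, 5:8] = True     # a different object
keep = greedy_mask_nms([a, b, c], scores=[0.90, 0.95, 0.80])
print(point_grid().shape, keep)
```

Masks `a` and `b` overlap at IoU 0.8, so only the higher-scoring `b` survives alongside the unrelated mask `c`.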

The stability test extracts a confidence signal with no extra training. A mask is a per-pixel probability, and you turn it into a hard boundary by thresholding at, say, $0.5$ . Look at the probability along a line crossing the object's edge. When the model is sure where the object ends, that probability drops sharply at the boundary: high inside, low outside, with almost no in-between. Slide the threshold from $0.45$ to $0.55$ and the boundary barely moves, because the transition is nearly vertical and any threshold through it lands in the same place. When the model is guessing, the probability instead ramps gently across a fuzzy band, so the same nudge to the threshold slides the boundary a long way. So the question "does the mask change much when I nudge the threshold?" is a stand-in for "is the model confident where this object ends?", and that is why the data engine keeps only the masks that pass it. Note this is a different signal from the estimated-IoU head. Estimated IoU is a separate learned number the model regresses to rank its three masks against each other; stability is a property you read off the probability map directly by perturbing the threshold, no extra head involved. The engine uses both: confident and stable.
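The threshold-nudging test reduces to comparing two binarizations of the same probability map. Below is a minimal sketch assuming the thresholds $0.45$ and $0.55$ from the example above. The mask at the higher threshold is always contained in the one at the lower threshold, so their IoU is just a ratio of pixel counts. The function name and the 1D edge profiles are made up for illustration.

```python
import numpy as np


def stability_score(prob: np.ndarray, threshold: float = 0.5,
                    offset: float = 0.05) -> float:
    """IoU between the masks at (threshold + offset) and (threshold - offset)."""
    high = prob > (threshold + offset)     # stricter mask, subset of `low`
    low = prob > (threshold - offset)      # looser mask
    if low.sum() == 0:
        return 0.0
    return float(high.sum() / low.sum())


# Probability along a line crossing an object edge at x = 0.
x = np.linspace(-10.0, 10.0, 201)
sharp_edge = 1.0 / (1.0 + np.exp(-10.0 * x))   # confident: near-vertical transition
fuzzy_edge = 1.0 / (1.0 + np.exp(-0.1 * x))    # guessing: gentle ramp

sharp = stability_score(sharp_edge)
fuzzy = stability_score(fuzzy_edge)
print(round(sharp, 3), round(fuzzy, 3))
```

The sharp profile scores close to 1 because both thresholds cut the edge at almost the same place, while the gentle ramp loses a large share of its pixels between the two thresholds.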

Step through the stages and watch the work migrate from human to model and the mask count jump by more than two orders of magnitude:

Figure 6 · the data-engine flywheel

Across three stages the effort shifts from human to model, masks per image climb (44, 72, ~100), and the cumulative mask count jumps from 4.3M to 10.2M to 1.1B. The final dataset is 99.1% automatic. Numbers are from the paper; the human/model split is illustrative of who annotates.

The result, SA-1B, is the dataset that carries the project's name: $1.1$ billion masks on $11$ million licensed, privacy-protecting images, with about $100$ masks per image. That is $400\times$ more masks and $11\times$ more images than the largest segmentation dataset that came before it. And almost none were drawn by hand: 99.1% of those masks were generated fully automatically by SAM. People drew on the order of ten million masks across the first two stages, enough to bootstrap a model good enough to draw the rest.

Which raises the fair worry: are machine-made masks any good? The authors checked by handing $500$ images (about $50$k automatic masks) to professionals and asking them to fix every mask. Comparing before and after, 94% of the automatic masks already landed above $0.9$ IoU with the human-corrected version, and 97% above $0.75$. For reference, prior work pegs human-to-human agreement on segmentation at roughly $85$ to $91$ percent IoU (that range is a cited baseline, not SAM's own number). An ablation confirms the point: training on the automatic masks alone costs only about $0.5$ mIoU versus using every mask the engine ever made, so SA-1B ships automatic-only to keep things simple.

Zero-shot transfer

With SAM trained, the test is whether the promptable task actually bought generalization. The paper evaluates five tasks under zero-shot transfer, four of them quite different from the segment-from-a-prompt objective SAM was trained on, each reached by prompt engineering rather than retraining.

The most direct test is the hardest prompt: a single point. On a suite of $23$ diverse datasets, scoring SAM's most-confident mask against the ground truth, SAM beats RITM (a strong interactive segmenter) on $16$ of the $23$ . The ambiguity machinery from earlier applies here. If instead you let an "oracle" pick whichever of SAM's three masks best matches the truth, removing the penalty for guessing the wrong nesting level, SAM wins on all $23$ . The gap between those two views measures the cost of ambiguity, which is why predicting three masks matters. Slide between them:

Figure 7 · single-point, vs RITM across 23 datasets

Each square is one dataset. Scoring SAM's most-confident mask, SAM beats RITM on 16 of 23. Switch to the oracle (keep the best of SAM's three masks against the truth) and SAM wins on all 23. The grid is an abstract tally of the counts, not specific named datasets.
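The two scoring views differ only in which of the three masks gets graded. Here is a minimal sketch with invented numbers: `true_ious` stands for each candidate's IoU against the ground truth and `predicted_ious` for the model's own estimates; neither list is taken from the paper.

```python
import numpy as np


def confident_and_oracle_iou(true_ious, predicted_ious):
    """Score one prompt two ways: trust the model's ranking, or pick in hindsight."""
    true_ious = np.asarray(true_ious, dtype=float)
    predicted_ious = np.asarray(predicted_ious, dtype=float)
    confident = true_ious[np.argmax(predicted_ious)]   # mask the model ranks first
    oracle = true_ious.max()                            # best mask against the truth
    return float(confident), float(oracle)


# Made-up example: the model ranks the subpart first, but the annotation is the part.
confident, oracle = confident_and_oracle_iou(
    true_ious=[0.12, 0.91, 0.40],
    predicted_ious=[0.93, 0.88, 0.90],
)
print(confident, oracle)
```

In this invented case the top-ranked mask is a perfectly valid subpart that simply is not the level the annotator drew, so the confident score is low while the oracle score is high. Averaged over a dataset, that gap is the cost of ambiguity.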

(A human study agreed: shown the masks, annotators rated SAM's between $7$ and $9$ out of $10$, clearly above RITM and above an ablated single-mask SAM, on several datasets where the automatic metric had SAM behind. The metric and the eye do not always agree; this disagreement recurs with the instance-segmentation results.)

The other tasks lean on composition, the real point of a promptable model. SAM is a component you drop into a larger system. Have a box detector for cats? Feed its boxes to SAM as prompts and you have cat instance segmentation, with no segmentation training. That is exactly how the instance-segmentation experiment runs: take a detector's boxes, prompt SAM, get masks. Here the result cuts against the headline. Against ViTDet, a model trained directly on the target dataset, SAM trails on the mask-AP metric, $46.5$ versus $51.0$ on COCO and $44.7$ versus $46.6$ on LVIS. Yet when annotators were asked to rate the masks, they preferred SAM's, which have crisper boundaries. The disagreement has a mundane reading: AP scores pixel overlap against ground-truth annotations that are not themselves perfect, while a rater rewards a clean, coherent boundary, so a mask more precise than the annotation it is graded against loses points for the difference. Toggle the two verdicts:

Figure 8 · instance segmentation: metric vs humans

Prompted with a detector's boxes, SAM trails dataset-trained ViTDet on mask AP (46.5 vs 51.0 on COCO, 44.7 vs 46.6 on LVIS). In the human study, annotators rate SAM's masks higher. The metric and the eye disagree. Human-study heights show direction only, not a scale.
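The composition itself is only a loop: boxes in, one mask per box out. The sketch below keeps both components abstract. `detect_boxes` and `segment_from_box` are hypothetical callables standing in for a real detector and a box-prompted segmenter, and the stand-ins at the bottom are fakes that exist only so the example runs.

```python
from typing import Callable, List, Sequence, Tuple

import numpy as np

Box = Tuple[int, int, int, int]   # x0, y0, x1, y1 in pixels


def boxes_to_instance_masks(
    image: np.ndarray,
    detect_boxes: Callable[[np.ndarray], Sequence[Box]],
    segment_from_box: Callable[[np.ndarray, Box], np.ndarray],
) -> List[np.ndarray]:
    """Instance segmentation by composition: each detected box becomes a prompt."""
    return [segment_from_box(image, box) for box in detect_boxes(image)]


# Fake components, for illustration only.
def fake_detector(image: np.ndarray) -> Sequence[Box]:
    return [(1, 1, 4, 4), (5, 5, 8, 8)]


def fake_segmenter(image: np.ndarray, box: Box) -> np.ndarray:
    x0, y0, x1, y1 = box
    mask = np.zeros(image.shape[:2], dtype=bool)
    mask[y0:y1, x0:x1] = True      # a real segmenter would trace the object instead
    return mask


image = np.zeros((10, 10, 3), dtype=np.uint8)
masks = boxes_to_instance_masks(image, fake_detector, fake_segmenter)
print(len(masks), [int(m.sum()) for m in masks])
```

The point of the shape is that no segmentation training appears anywhere: class labels come from the detector, outlines from the promptable model.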

The remaining tasks fill in the picture. Prompted to find edges (by segmenting from a grid of points and reading off mask boundaries), SAM produces sensible edge maps despite never training on edges, with high recall but low precision (it finds nearly all the true edges, but also flags many extra ones), since it draws real edges the benchmark simply did not annotate. On object-proposal generation (producing a set of class-agnostic candidate regions for a downstream detector to classify, here read off SAM's grid-of-points masks) it is strong on medium, large, rare, and common objects, trailing only on small and frequent ones, where the dataset-trained baseline has memorized the dataset's quirks. And text-to-mask works as a proof of concept through a neat trick: SAM is never trained on text, it is trained with CLIP image embeddings of masked regions, and at inference you feed a CLIP text embedding instead. Because CLIP aligns image and text in one space, "a wheel" in words lands near the wheels it saw in pixels. The authors are clear that this one is exploratory and not robust, and the public release does not ship it.

The image encoder's size is the last knob: going from ViT-B to ViT-H helps a lot, but ViT-H over ViT-L is only a marginal gain, and the authors note that scaling the encoder further did not look fruitful at the time.

What SAM is not

It helps to be exact about what was and was not claimed. What the paper actually claims is narrower than the headline, and more interesting for it.

SAM produces class-agnostic masks. It cuts out objects, it does not name them. Semantic segmentation and panoptic segmentation, which assign a category to every pixel, are not things you can coax out of SAM with a simple prompt, and the paper says outright that it is unclear how to. The labels, when you need them, come from whatever you compose SAM with.

It is also not a universal champion. It trails specialist models like ViTDet on their own metrics, it lags edge-detection methods that have learned a benchmark's particular biases, and dedicated interactive segmenters beat it once you give them many clicks. SAM is built for breadth. It does not top any single leaderboard. It misses fine structures, sometimes hallucinates small disconnected blobs, and its boundaries are less crisp than slow methods that zoom in. And the estimated-IoU score it reports is a self-assessment, not a certificate of quality.

What it got right is the overall design. The task is general enough to pre-train on and to repurpose by prompting. The model is split so the heavy part is paid once and the interactive part is cheap. And the data desert had a way out: use the model to label the data that trains the model, until a billion masks exist that no one drew by hand. Whether SAM itself is a foundation model, the authors leave to how the field uses it. The promptable framing, the open release of SA-1B, and the video successor SAM 2 suggest the answer has mostly been yes.

Provenance: verified against primary literature

MAE (2022): Self-supervised pretraining of the ViT-H image encoder (mask 75% of patches, reconstruct).

ViT (2021): The plain Transformer-on-patches backbone; SAM runs a ViT-H/16 once per image.

CLIP (2021): Aligned image/text space; enables the train-on-image, infer-on-text text prompt.

Focal + Dice: Mask supervision (20:1), a recipe inherited from DETR; focal from Lin, dice from V-Net.

DETR / MaskFormer: The set-prediction, two-way-attention mask-decoder lineage.

Multiple-choice learning: Predict 3 masks, back-prop only the best, to cover ambiguity.

Caveat: SAM's "estimated IoU" is an MSE-regressed self-estimate used to rank the 3 masks. It is not a calibrated probability and not a guarantee of mask quality.

Questions you might still have

Were the 1 billion masks drawn by people?
No. 99.1% of SA-1B was generated fully automatically by SAM. Humans drew on the order of 10M masks across the first two data-engine stages, just enough to bootstrap a model good enough to label the rest.

If SAM is never trained on text, how does it segment from a text prompt?
It trains with CLIP image embeddings of masked regions, and at inference you feed a CLIP text embedding instead. Because CLIP aligns image and text in one space, the swap works. The paper calls it a proof of concept, not robust, and it is not in the public release.

Is SAM state of the art at segmentation?
It is the most general, not the best at every task. It trails specialist ViTDet on instance-segmentation mask AP (though humans rate its masks higher), lags edge-detection methods that learn a benchmark’s biases, and cannot do semantic or panoptic segmentation by prompting. It is a component, not a leaderboard winner.

Why predict three masks instead of one?
A single point is ambiguous (pocket, shirt, or person), and one output forces the model to average those into a blurry compromise. Three masks cover the usual nested interpretations, and training back-props only the best match, so each slot specializes.

Footnotes & further reading

The paper: Kirillov, Mintun, Ravi, Mao, Rolland, Gustafson, Xiao, Whitehead, Berg, Dollár, Girshick, Segment Anything (Meta AI / FAIR, ICCV 2023). Model, dataset, and demo at segment-anything.com.
The image encoder is an MAE-pretrained Vision Transformer (ViT-H/16), with the high-resolution windowed-attention adaptation from ViTDet.
Text prompts route through the text encoder of CLIP; the train-on-image, infer-on-text trick relies on CLIP's shared image-text space.
The mask losses: focal loss from Lin et al. (RetinaNet) and dice loss from Milletari et al. (V-Net). The 20:1 combination and the two-way-attention mask decoder are inherited from DETR and MaskFormer.
Predicting several outputs and supervising only the best is multiple-choice learning: Lee et al., building on Guzman-Rivera et al. (2012). The interactive baseline SAM is measured against is RITM.
