
Segment Anything

One promptable model that cuts out any object.

Segment Anything tried to give image segmentation a foundation-model moment: pre-train one model on a single general task, then point it at anything. The trick that made it work was letting the model build its own dataset, a billion masks deep.

Explaining the paper: Segment Anything · Kirillov, Mintun, Ravi, Mao, Rolland, Gustafson, Xiao, Whitehead, Berg, Lo, Dollár, Girshick · Meta AI (FAIR) · ICCV 2023 · arXiv:2304.02643

What if segmentation had a GPT moment: train one model, then prompt it to cut out anything?

Segmentation is the task of tracing the exact outline of a thing, pixel by pixel, separating it from everything else in the frame. It is one of the oldest problems in computer vision, and for most of the last decade it was solved the same tedious way: pick a dataset, fix a list of object classes, train a fresh model, and start over for the next dataset. Each model knew only the classes it was shown and only the kind of images it was trained on.

Language modeling had already escaped that loop. You pre-train one large model on a single, dumb objective (predict the next token), and then you steer it to new jobs by prompting it, no retraining required. The same model summarizes, translates, and writes code, because the pre-training task was general enough to teach it something reusable. Segment Anything, out of Meta's FAIR lab in 2023, asks the obvious question for vision: could segmentation work like that?

The answer is a system, not a single network. It has three parts that were designed together: a task general enough to pre-train on and to repurpose by prompting, a model (SAM, the Segment Anything Model) fast enough to use interactively, and a data engine that produced a dataset of 1.1 billion masks. That dataset, SA-1B, is the quiet hero of the paper, because the real obstacle was never the architecture. It was that the web is full of free text and free image-caption pairs, and almost entirely empty of pixel-perfect object outlines. SAM's answer was to use the model to label the data that trained the model.

We will build the idea in order: what a foundation model would even mean for pixels, the promptable task and the ambiguity it has to swallow, how the model is split so it runs in a browser, what a mask actually is and how it is scored, the flywheel that made a billion of them, and finally what SAM can and cannot do once it is trained.

A foundation model for pixels

A foundation model is one you pre-train once on a broad task and then adapt to many narrower ones. In language the recipe is next-token prediction plus prompting. In vision the closest examples align images and text: CLIP learns an image encoder and a text encoder that land in the same space, and once trained you can name a new category in words and have it recognized, with no labeled examples of that category. That last move has a name the paper uses precisely: zero-shot transfer, evaluating a model on data and tasks it never saw in training.

Segmentation sits below that text-and-image layer. The job is not to name an object, it is to draw it. And there is no web-scale pile of drawings to learn from. So the SAM authors set themselves a sharper goal than "a good segmenter": a promptable model, pre-trained on a task broad enough that prompting alone carries it to new image distributions and new downstream problems. Get the task right and the same frozen model should handle jobs nobody trained it for, the way a language model answers a question it never saw.

The promptable task

Start with what a prompt is. For SAM a prompt is anything that points at what you want segmented: one or more foreground or background points, a rough box, a coarse mask, or even a line of text. The promptable segmentation task is then a single sentence: given any prompt, return a valid mask.

The word doing the work is "valid," and it exists to handle ambiguity. Click one point on someone's shirt. Did you mean the shirt, or the person wearing it, or the small logo on the pocket? A single point genuinely cannot say. The task does not demand the model read your mind. It demands that the output be a reasonable mask for at least one sensible interpretation. This is the same forgiveness we extend to a language model: asked something vague, it should still answer something coherent rather than freeze.

That definition was chosen because it does double duty. It is a pre-training objective: simulate a stream of prompts for each training mask and score the model's guesses against the ground truth. And it is a method for downstream tasks: once the model answers any prompt well, you solve a new problem by engineering the right prompt rather than training a new head. SAM borrows the prompt-simulation idea from interactive segmentation, where a user clicks to refine a mask, but bends the goal. Interactive segmenters aim to be right eventually, after enough clicks. SAM has to be valid immediately, for any prompt, even a single ambiguous one, because its own data engine will later lean on that property to label images with no human in the loop.

One point, three masks

Ambiguity is not a corner case here, it is the central modeling problem, and the naive fix makes it worse. If the model emits one mask and the training signal is the average over every valid interpretation, it learns to predict the average mask: a blurry compromise that is the pocket, the shirt, and the person smeared together, and a good outline of none of them. Averaging is exactly the wrong thing to do with genuinely distinct answers.

SAM's fix is to stop pretending there is one answer. For a single prompt it predicts three masks at once. Three is not arbitrary. When a point is ambiguous, the candidate objects are almost always nested: a subpart inside a part inside a whole (the pocket, the shirt, the person), and that nesting is rarely more than three deep. So three outputs cover the usual ambiguity without the model ever having to guess which level you meant.

The training trick is this. You only have one ground-truth mask per prompt, so which of the three predictions should it supervise? SAM computes the loss for all three and back-propagates only through the best match:

$$\mathcal{L}_{\text{mask}} = \min_{i\,\in\,\{1,2,3\}}\Big[\,20\,\mathrm{FL}(m_i, g) + \mathrm{DL}(m_i, g)\,\Big] \tag{1}$$

Here $m_i$ are the three predicted masks, $g$ the ground truth, and $\mathrm{FL}$, $\mathrm{DL}$ the focal and dice losses we define below. The $\min$ is the whole idea. Whichever prediction happens to be closest to this particular ground truth gets all the gradient, and the other two are left alone to specialize on the interpretations they are already leaning toward. This is an old technique called multiple-choice learning: let an ensemble of outputs cover the plausible answers, and grade only the one that was right. The counterfactual shows why the $\min$ is load-bearing: hand the gradient to all three heads on every prompt and each gets pulled toward the same compromise, three copies of the blur from before; starving all but the closest head is what lets each drift toward the interpretation it is already best at. Over many prompts the three slots settle into a rough whole / part / subpart division of labor.
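The specialization argument can be checked on a toy problem. The sketch below is not SAM's training code: it swaps masks for single numbers and the focal-plus-dice loss for squared error, and the names `train`, `heads`, and `targets` are made up for illustration. Two output slots see two distinct valid answers in alternation, once with winner-only gradients and once with gradients to every slot.

```python
def train(winner_only, steps=200, lr=0.1):
    """Toy multiple-choice learning with scalar 'masks' and squared error."""
    heads = [0.4, 0.6]        # two output slots, nearly identical at the start
    targets = [0.0, 1.0]      # two distinct valid answers for the same prompt
    for step in range(steps):
        g = targets[step % 2]                 # one ground truth per step
        losses = [(h - g) ** 2 for h in heads]
        if winner_only:
            # min over slots: only the closest slot receives gradient
            update = [losses.index(min(losses))]
        else:
            # counterfactual: every slot receives gradient on every step
            update = range(len(heads))
        for i in update:
            heads[i] -= lr * 2 * (heads[i] - g)   # gradient of (h - g)^2
    return heads


specialized = train(winner_only=True)
blurred = train(winner_only=False)
print("winner-only:", specialized)   # one slot per answer
print("all slots:  ", blurred)       # both slots stuck near the average
```

With winner-only updates each slot settles on one of the two answers; with shared updates both hover around the midpoint, the scalar version of the blurry compromise.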

One loose end. If the model returns three masks, which does it hand back when an application wants just one? SAM adds a tiny head that, for each mask, predicts its own quality as an estimated IoU (the overlap it thinks it would score against the true object), and ranks the three by that number. Worth saying plainly, because it is easy to over-read: this score is a learned self-estimate trained by regression, not a calibrated probability and not a guarantee. A high estimated IoU means the model is optimistic about that mask, nothing more. It can be confidently wrong on an unfamiliar image.

Drag the green point and slide through the three masks. One ambiguous prompt, three valid answers, each with the model's own confidence:

Figure 1 · one prompt, three valid masks
A single foreground point on the pocket is ambiguous. SAM returns three nested masks (subpart, part, whole), each ranked by its own estimated IoU. Because the point sits inside all three, all three are valid at once. The confidence values are illustrative; the mechanism is the point.

There is a fourth output token too, used only when you give the model more than one prompt. With several points or a box the ambiguity mostly evaporates, so SAM switches to a single clean prediction instead of three near-duplicates. The three-mask machinery is there for the hard case of one lonely point.

Heavy once, cheap forever

The task says "answer any prompt." The product goal adds "and do it while the user clicks around," which means roughly 50 milliseconds per prompt, in a browser. That single constraint shapes the whole architecture, because the obvious design (run a big network end to end for every click) is far too slow. A heavy vision transformer takes a sizeable fraction of a second per image. You cannot pay that on every click. Fifty milliseconds is interactive-latency territory, short enough that the mask seems to arrive with the click rather than after it; holding that budget is the design pressure that pushes the heavy work out of the click loop, and the amortization in Figure 2 below is the payoff.

SAM's answer is to split the model along the line between "depends only on the image" and "depends on the prompt." A heavyweight image encoder runs once per image and turns it into a reusable embedding. Then a lightweight prompt encoder and mask decoder run per prompt, reusing that embedding. The expensive part is paid once and amortized over every prompt you ask.

The first prompt on a fresh image is expensive, because it alone pays for the encoder. The second prompt is nearly free, and the tenth is nearly free, because they all reuse the same embedding. The average cost per prompt falls toward the 50 ms floor. Drag the number of prompts and watch it drop:

Figure 2 · amortizing the heavy encoder
The image encoder runs once and produces a reusable embedding; each prompt after that is a ~50 ms decode. So the amortized cost per prompt falls toward the 50 ms floor the more prompts you ask of one image. The encoder cost shown is illustrative; the 50 ms decode is from the paper.
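The arithmetic behind that curve is one line: with an encoder cost E paid once and a decode cost d paid per prompt, K prompts average E/K + d each. A minimal sketch follows. The function name `amortized_ms` and the 150 ms encoder cost are assumptions for illustration; only the 50 ms decode comes from the text.

```python
def amortized_ms(num_prompts, encoder_ms, decode_ms=50.0):
    """Average wall time per prompt when one image embedding is reused."""
    if num_prompts < 1:
        raise ValueError("need at least one prompt")
    return (encoder_ms + num_prompts * decode_ms) / num_prompts


ENCODER_MS = 150.0   # assumed, illustrative encoder cost

for k in (1, 2, 8, 100):
    print(k, amortized_ms(k, ENCODER_MS))
```

The per-prompt average never drops below the decode cost; it only approaches it as the encoder's share is spread thinner.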

In code the split is the whole story. Encode once, then loop over prompts:

# encode the image ONCE (heavy ViT-H, ~0.15s on a GPU)
img_embed = image_encoder(image)        # 1024x1024 -> 64x64 x 256

# then every prompt is a ~50ms decode that reuses img_embed
for prompt in prompts:                  # a point, box, mask, ...
    tokens = prompt_encoder(prompt)     # -> 256-dim embeddings
    masks, iou = mask_decoder(img_embed, tokens)  # 3 masks + scores
    mask = masks[argmax(iou)]           # rank by estimated IoU

This is also why SAM is, strictly, not real-time end to end. The 50 ms figure is the prompt encoder plus mask decoder. The image encoder behind it is heavy, and is paid up front, once, before you start clicking.

What lives inside that 50 ms? Roughly, a prompt embedding, two transformer blocks of cross-attention, an upsample to a full-resolution mask, and an $\widehat{\mathrm{IoU}}$ head that scores each output mask. Slide $K$ below to see the wall time: the first prompt eats the encoder, every prompt after it is 50 ms.

Figure 3 · where the 50 ms goes
Top row: the first prompt on a fresh image pays for the image encoder once, then a thin ~50 ms slice for prompt encoder + decoder + mask + $\widehat{\mathrm{IoU}}$ head. Bottom row: $K$ subsequent prompts on the same embedding, 50 ms each. Only the 50 ms total is paper-stated; the sub-segments inside it are illustrative (the paper does not publish per-block decoder latencies), and the encoder bar is drawn long enough to dominate without claiming a measured ratio. The top row's time axis is compressed so the 50 ms slice stays legible.

What a mask actually is

We have been saying "mask" loosely. Concretely it is a probability per pixel: how likely each location is to be part of the object. The decoder produces it in a neat way. The decoder carries a learned output token through the attention blocks, projects it to a vector $v$, and then computes, at every location $(x,y)$, the dot product between $v$ and that location's image feature $e_{x,y}$:

$$M(x,y) = \sigma\big(\langle v,\, e_{x,y}\rangle\big) \tag{2}$$

The mask is a similarity heatmap. The output token learns "what this object looks like in feature space," and the mask lights up wherever the image agrees, squashed through a sigmoid $\sigma$ into a probability. It is a tiny linear classifier whose weights are produced on the fly from the prompt, which is why the paper calls it a dynamic classifier. Said with the emphasis it deserves: the decoder writes a brand-new classifier for every prompt, and the mask is that classifier evaluated at every pixel at once. Rotate $v$ below and watch the same equation produce the mask:

Figure 4 · the mask is a dot product
Equation (2) on a toy grid. Left, each cell's feature vector (the object region's features point one way, the background's roughly the other, with noise). Right, the mask: teal intensity is $\sigma(\langle v, e\rangle)$ per cell, the bright contour is the 0.5 threshold. Align $v$ with the object's feature direction (the amber notch) and the mask snaps onto the object; misalign it and the mask is mush. Toy 2D features; the real model uses 256-dim features and the decoder's attention, not a slider, picks $v$. Same mechanism.
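The same toy can be written out in a few lines. This is a sketch of equation (2) only, on hand-picked 2D features. The 3x3 grid, the feature values, and the names `decode_mask`, `OBJ`, and `BG` are all hypothetical; nothing here reproduces SAM's decoder.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def decode_mask(v, features):
    """Equation (2): M(x, y) = sigmoid(<v, e_xy>) at every grid location."""
    return [[sigmoid(sum(a * b for a, b in zip(v, e))) for e in row]
            for row in features]


# Hypothetical features: the centre cell is the object (pointing along +x),
# the background cells point roughly along -x.
OBJ, BG = (2.0, 0.5), (-2.0, 0.3)
features = [[BG, BG, BG],
            [BG, OBJ, BG],
            [BG, BG, BG]]

aligned = decode_mask((2.0, 0.0), features)      # v along the object direction
misaligned = decode_mask((0.0, 0.1), features)   # v nearly orthogonal to it

binary = [[p > 0.5 for p in row] for row in aligned]   # threshold at 0.5
for row in binary:
    print(["#" if cell else "." for cell in row])
```

With `v` aligned, only the centre cell clears the 0.5 threshold. With `v` misaligned, every cell sits within a hair of 0.5, which is the "mush" the caption describes.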

To train any of this you need a way to score a predicted mask against the truth. The standard yardstick is Intersection over Union: the area where prediction and truth overlap, divided by the area they cover together.

$$\mathrm{IoU}(A,B) = \frac{|A \cap B|}{|A \cup B|} \tag{3}$$

It runs from 0 (no overlap) to 1 (perfect). Averaged over many examples it is reported as mIoU, the standard segmentation score. Drag the predicted mask onto the ground truth and watch the number climb:

Figure 5 · IoU, the yardstick
IoU is the overlap divided by the union of the ground truth and the prediction. Drag the teal mask to change it. The dashed mark sits at 0.90, the bar that 94% of SA-1B's automatic masks clear when checked against a human correction.
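For binary masks, equation (3) is a pair of counts. A minimal sketch, with masks as nested 0/1 lists; the empty-versus-empty convention in the last line of `iou` is an assumption of this example, not something the paper specifies.

```python
def iou(pred, truth):
    """Intersection over Union for two binary masks given as nested 0/1 lists."""
    inter = union = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0   # assumed: two empty masks agree


truth = [[0, 1, 1, 0],
         [0, 1, 1, 0]]
pred = [[0, 0, 1, 1],
        [0, 0, 1, 1]]

print(iou(pred, truth))    # 2 shared pixels over 6 covered pixels
print(iou(truth, truth))   # identical masks
```

Shifting the prediction one column costs two thirds of the score here, which is why IoU is unforgiving on small objects.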

IoU is the metric, but it does not make a smooth training loss on its own. SAM supervises masks with a fixed blend of two losses, in a 20:1 ratio of focal to dice, a recipe it inherits from DETR. The two losses fix different problems. Focal loss is cross-entropy with a knob that turns down the easy pixels:

$$\mathrm{FL}(p_t) = -\,\alpha_t\,(1 - p_t)^{\gamma}\,\log p_t \tag{4}$$

where $p_t$ is the predicted probability of the correct label for a pixel. The factor $(1-p_t)^{\gamma}$ is the point of it: when a pixel is already classified confidently ($p_t$ near 1) that factor collapses to near zero, so the loss barely notices it, and training attention stays on the hard pixels along the object's edge. (The separate $\alpha_t$ just balances foreground against background; it is not what down-weights the easy cases.) Dice loss comes from the Dice overlap coefficient and pushes the predicted region to coincide with the true one:

$$\mathrm{Dice}(A,B) = \frac{2\,|A \cap B|}{|A| + |B|}, \qquad \mathrm{DL} = 1 - \mathrm{Dice} \tag{5}$$

Focal works pixel by pixel and is good at boundaries; dice works on the region as a whole and is good when the object is small and the background dwarfs it. Putting them together, with the min-over-three trick from before, gives the full training step:

# one training step on an (image, ground-truth mask) pair
img_embed = image_encoder(image)
prompt    = sample_prompt(gt)           # a point, or a noisy box
tokens    = prompt_encoder(prompt)
masks, iou_pred = mask_decoder(img_embed, tokens)   # 3 candidates

# 20:1 focal+dice per candidate; supervise only the BEST match
loss_i    = [20 * focal(m, gt) + dice(m, gt) for m in masks]
mask_loss = min(loss_i)                 # multiple-choice / hindsight
iou_loss  = mse(iou_pred, iou(masks, gt))   # the rank head, scale 1
(mask_loss + iou_loss).backward()
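The `focal` and `dice` helpers called in that step are left undefined. One plausible reading of equations (4) and (5), over flattened per-pixel probabilities, is sketched below. The values `gamma=2.0` and `alpha=0.25` are the common RetinaNet defaults, assumed here because the text does not state SAM's settings. The "soft" dice, which uses probabilities in place of hard sets, is also an assumption of this sketch.

```python
import math


def focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean focal loss, equation (4), over flattened pixels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p_t = p if y == 1 else 1.0 - p               # probability of the correct label
        alpha_t = alpha if y == 1 else 1.0 - alpha   # foreground/background balance
        p_t = min(max(p_t, eps), 1.0)                # keep log() finite
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def dice_loss(probs, labels, eps=1e-7):
    """Soft dice loss, equation (5), with probabilities standing in for sets."""
    inter = sum(p * y for p, y in zip(probs, labels))
    denom = sum(probs) + sum(labels)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def mask_loss(probs, labels):
    """The 20:1 focal-to-dice blend used per candidate mask."""
    return 20.0 * focal_loss(probs, labels) + dice_loss(probs, labels)


labels = [1, 1, 0, 0]
good = [0.9, 0.8, 0.1, 0.2]   # mostly right
bad = [0.4, 0.3, 0.6, 0.7]    # mostly wrong
print(mask_loss(good, labels), mask_loss(bad, labels))
```

The better prediction scores lower on both terms, and a perfect prediction drives both to zero.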

One honest footnote to the loss. The ground-truth prompt is not a single tidy click. SAM simulates an interactive session of 11 rounds per mask: an initial point or box, then points sampled from wherever the current prediction is wrong, plus a couple of rounds with no new input so the model learns to polish its own output. That simulation is exactly what lets the trained model later sit inside the data engine and behave like a tireless annotator.
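The "points sampled from wherever the current prediction is wrong" step can be sketched as follows. The function name `sample_correction_point` is hypothetical, and two details are assumptions of this sketch rather than statements from the text: the point is drawn uniformly from the error region, and it is labeled foreground when the model missed object pixels and background when it painted outside the object.

```python
import random


def sample_correction_point(pred, truth, rng):
    """Pick one pixel where pred and truth disagree.

    Returns ((row, col), label) with label 1 = foreground click,
    0 = background click, or None when the masks already agree.
    """
    errors = [(r, c)
              for r, row in enumerate(truth)
              for c, t in enumerate(row)
              if pred[r][c] != t]
    if not errors:
        return None
    r, c = rng.choice(errors)
    # Missed object pixel -> foreground click; spurious pixel -> background click.
    label = 1 if truth[r][c] == 1 else 0
    return (r, c), label


truth = [[0, 1, 1],
         [0, 1, 1]]
pred = [[1, 1, 0],
        [0, 1, 0]]

point, label = sample_correction_point(pred, truth, random.Random(0))
print(point, label)
```

Feeding the returned click back in as an extra prompt, then repeating, is the simulated annotator loop the paragraph describes.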

The data engine

Now the chicken and egg. To train SAM you need a vast, diverse set of masks. To make masks cheaply you need SAM. The paper's resolution is a data engine: a loop where the model annotates, people correct, the model retrains on the corrections and gets better, and the next turn it carries more of the load. Three stages, and the human role shrinks at each one.

The last of the three stages (assisted-manual, semi-automatic, then fully automatic) has no human in the loop, so the engine needs its own filters: it keeps a mask only if the model's estimated IoU is high and the mask passes a stability test. That stability test is worth dwelling on, because it is a clever way to get a confidence signal for free. Recall that a mask is a per-pixel probability, and you turn it into a hard boundary by thresholding at, say, 0.5. Now picture the probability as you walk a line across the object's edge. When the model is sure where the object ends, that probability falls off a cliff at the boundary: high inside, low outside, with almost no in-between. Slide the threshold from 0.45 to 0.55 and the boundary barely moves, because the cliff is nearly vertical and any cut through it lands in the same place. When the model is guessing, the probability instead ramps gently across a fuzzy band, so the same nudge to the threshold slides the boundary a long way. So the question "does the mask change much when I nudge the threshold?" is a stand-in for "is the model confident where this object ends?", and that is why the data engine keeps only the masks that pass it. Note this is a different signal from the estimated-IoU head. Estimated IoU is a separate learned number the model regresses to rank its three masks against each other; stability is a property you read off the probability map directly by perturbing the threshold, no extra head involved. The engine uses both: confident and stable.
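A minimal sketch of that threshold-nudging test, on a one-dimensional slice of probabilities across an edge. The function name `stability`, the offset `delta=0.05` (matching the 0.45 to 0.55 example), and the two probability profiles are illustrative assumptions; the exact offset and cut-off the engine used are not given here. Because the mask cut at the higher threshold is always contained in the one cut at the lower threshold, their IoU reduces to a ratio of pixel counts.

```python
def stability(probs, threshold=0.5, delta=0.05):
    """IoU between the masks cut at threshold + delta and threshold - delta."""
    tight = sum(1 for p in probs if p > threshold + delta)   # subset of 'loose'
    loose = sum(1 for p in probs if p > threshold - delta)
    return tight / loose if loose else 1.0


# Probability along a line that crosses the object's edge (hypothetical values).
cliff = [0.98, 0.97, 0.96, 0.95, 0.04, 0.03, 0.02, 0.01]   # confident boundary
ramp = [0.90, 0.75, 0.60, 0.52, 0.48, 0.40, 0.25, 0.10]    # fuzzy boundary

print(stability(cliff))   # boundary does not move
print(stability(ramp))    # boundary slides by two pixels
```

A filter built on this would keep the cliff-shaped mask and drop the ramp-shaped one, without consulting the estimated-IoU head at all.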

Step through the stages and watch the work migrate from human to model and the mask count jump three orders of magnitude:

Figure 6 · the data-engine flywheel
Across three stages the effort shifts from human to model, masks per image climb (44, 72, ~100), and the cumulative mask count jumps from 4.3M to 10.2M to 1.1B. The final stage is 99.1% automatic. Numbers are from the paper; the human/model split is illustrative of who does the work.

The result, SA-1B, is the dataset named for the project: 1.1 billion masks on 11 million licensed, privacy-protecting images, with about 100 masks per image. That is 400× more masks and 11× more images than the largest segmentation dataset that came before it. The single most important fact about it, and the one most often garbled: 99.1% of those masks were generated fully automatically by SAM. People drew on the order of ten million masks across the first two stages, just enough to bootstrap a model good enough to draw the rest.

Which raises the fair worry: are machine-made masks any good? The authors checked by handing 500 images (about 50k automatic masks) to professionals and asking them to fix every mask. Comparing before and after, 94% of the automatic masks already landed above 0.9 IoU with the human-corrected version, and 97% above 0.75. For reference, prior work pegs human-to-human agreement on segmentation at roughly 85 to 91 percent IoU (that range is a cited baseline, not SAM's own number). An ablation confirmed the punchline: training on the automatic masks alone costs only about 0.5 mIoU versus using every mask the engine ever made, so SA-1B ships automatic-only to keep things simple.

Zero-shot transfer

With SAM trained, the test is whether the promptable task actually bought generalization. The paper evaluates five tasks under zero-shot transfer, four of them quite different from the segment-from-a-prompt objective SAM was trained on, each reached by prompt engineering rather than retraining.

The most direct test is the hardest prompt: a single point. On a suite of 23 diverse datasets, scoring SAM's most-confident mask against the ground truth, SAM beats RITM (a strong interactive segmenter) on 16 of the 23. Now recall the ambiguity machinery. If instead you let an "oracle" pick whichever of SAM's three masks best matches the truth, removing the penalty for guessing the wrong nesting level, SAM wins on all 23. The gap between those two views is precisely the cost of ambiguity, and it is why predicting three masks matters. Slide between them:

Figure 7 · single-point, vs RITM across 23 datasets
Each square is one dataset. Scoring SAM's most-confident mask, SAM beats RITM on 16 of 23. Switch to the oracle (keep the best of SAM's three masks against the truth) and SAM wins on all 23. The grid is an abstract tally of the counts, not specific named datasets.
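The two scoring views differ only in which of the three candidates gets graded. A small sketch, with masks as sets of pixel coordinates; the candidate shapes, the estimated-IoU values, and the name `confident_and_oracle` are invented for illustration and are not taken from the paper's evaluation code.

```python
def iou(a, b):
    """IoU of two masks given as sets of (row, col) pixels."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def confident_and_oracle(candidates, est_iou, truth):
    """Score the model-ranked mask and the best-in-hindsight mask."""
    true_ious = [iou(m, truth) for m in candidates]
    ranked = max(range(len(candidates)), key=lambda i: est_iou[i])
    return true_ious[ranked], max(true_ious)


# Hypothetical nested candidates: subpart, part, whole.
pocket = {(2, 2)}
shirt = {(r, c) for r in range(1, 4) for c in range(1, 4)}
person = {(r, c) for r in range(0, 5) for c in range(1, 4)}

truth = shirt                 # the annotator meant the shirt
est = [0.80, 0.85, 0.90]      # the model is most optimistic about 'person'

confident, oracle = confident_and_oracle([pocket, shirt, person], est, truth)
print(confident, oracle)
```

Here the model produced the right mask but ranked a different nesting level first, so the confident score is penalized while the oracle score is not. The oracle view can never score lower than the confident one.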

(A human study agreed: shown the masks, annotators rated SAM's between 7 and 9 out of 10, clearly above RITM and above an ablated single-mask SAM, on several datasets where the automatic metric had SAM behind. The metric and the eye do not always agree, a theme that returns below.)

The other tasks lean on composition, the real point of a promptable model. SAM is a component you drop into a larger system. Have a box detector for cats? Feed its boxes to SAM as prompts and you have cat instance segmentation, with no segmentation training. That is exactly how the instance-segmentation experiment runs: take a detector's boxes, prompt SAM, get masks. Here the honest result matters. Against ViTDet, a model trained directly on the target dataset, SAM trails on the mask-AP metric, 46.5 versus 51.0 on COCO and 44.7 versus 46.6 on LVIS. Yet when annotators were asked to rate the masks, they preferred SAM's, which have crisper boundaries. The disagreement has a mundane reading: AP scores pixel overlap against ground-truth annotations that are not themselves perfect, while a rater rewards a clean, coherent boundary, so a mask crisper than the annotation it is graded against loses points for the difference. Toggle the two verdicts:

Figure 8 · instance segmentation: metric vs humans
Prompted with a detector's boxes, SAM trails dataset-trained ViTDet on mask AP (46.5 vs 51.0 on COCO, 44.7 vs 46.6 on LVIS). In the human study, annotators rate SAM's masks higher. The metric and the eye disagree. Human-study heights show direction only, not a scale.

The remaining tasks fill in the picture. Prompted to find edges (by segmenting from a grid of points and reading off mask boundaries), SAM produces sensible edge maps despite never training on edges, with high recall at the cost of precision, since it draws real edges the benchmark simply did not annotate. On object-proposal generation it is strong on medium, large, rare, and common objects, trailing only on small and frequent ones, where the dataset-trained baseline has memorized the dataset's quirks. And text-to-mask works as a proof of concept through a neat trick: SAM is never trained on text, it is trained with CLIP image embeddings of masked regions, and at inference you feed a CLIP text embedding instead. Because CLIP aligns image and text in one space, "a wheel" in words lands near the wheels it saw in pixels. The authors are clear that this one is exploratory and not robust, and the public release does not ship it.

One last knob worth a sentence: the image encoder's size. Going from ViT-B to ViT-H helps a lot, but ViT-H over ViT-L is only a marginal gain, and the authors note that scaling the encoder further did not look fruitful at the time.

What SAM is not

It helps to be exact about what was and was not claimed, because SAM is easy to oversell. The honest framing is narrower than the headline and more interesting for it.

SAM produces class-agnostic masks. It cuts out objects, it does not name them. Semantic segmentation and panoptic segmentation, which assign a category to every pixel, are not things you can coax out of SAM with a simple prompt, and the paper says outright that it is unclear how to. The labels, when you need them, come from whatever you compose SAM with.

It is also not a universal champion. It trails specialist models like ViTDet on their own metrics, it lags edge-detection methods that have learned a benchmark's particular biases, and dedicated interactive segmenters beat it once you give them many clicks. SAM is built for breadth. It does not top any single leaderboard. It misses fine structures, sometimes hallucinates small disconnected blobs, and its boundaries are less crisp than slow methods that zoom in. And the estimated-IoU score it reports is a self-assessment, not a certificate of quality.

What it got right is the shape of the bet. The task is general enough to pre-train on and to repurpose by prompting. The model is split so the heavy part is paid once and the interactive part is nearly free. And the data desert had a way out: use the model to label the data that trains the model, until a billion masks exist that no one drew by hand. Whether SAM itself is a foundation model, the authors leave to how the field uses it. The promptable framing, the open release of SA-1B, and the video successor SAM 2 suggest the answer has mostly been yes.

Provenance · Verified against primary literature
MAE (2022): Self-supervised pretraining of the ViT-H image encoder (mask 75% of patches, reconstruct).
ViT (2021): The plain Transformer-on-patches backbone; SAM runs a ViT-H/16 once per image.
CLIP (2021): Aligned image/text space; enables the train-on-image, infer-on-text text prompt.
Focal + Dice: Mask supervision (20:1), a recipe inherited from DETR; focal from Lin, dice from V-Net.
DETR / MaskFormer: The set-prediction, two-way-attention mask-decoder lineage.
Multiple-choice learning: Predict 3 masks, back-prop only the best, to cover ambiguity.
Correction: SAM’s “estimated IoU” is an MSE-regressed self-estimate used to rank the 3 masks. It is not a calibrated probability and not a guarantee of mask quality.

Questions you might still have

Were the 1 billion masks drawn by people?
No. 99.1% of SA-1B was generated fully automatically by SAM. Humans drew on the order of 10M masks across the first two data-engine stages, just enough to bootstrap a model good enough to label the rest.

If SAM is never trained on text, how does it segment from a text prompt?
It trains with CLIP image embeddings of masked regions, and at inference you feed a CLIP text embedding instead. Because CLIP aligns image and text in one space, the swap works. The paper calls it a proof of concept, not robust, and it is not in the public release.

Is SAM state of the art at segmentation?
It is the most general, not the best at every task. It trails specialist ViTDet on instance-segmentation mask AP (though humans rate its masks higher), lags edge-detection methods that learn a benchmark’s biases, and cannot do semantic or panoptic segmentation by prompting. It is a component, not a leaderboard winner.

Why predict three masks instead of one?
A single point is ambiguous (pocket, shirt, or person), and one output forces the model to average those into a blurry compromise. Three masks cover the usual nested interpretations, and training back-props only the best match, so each slot specializes.

Footnotes & further reading

  1. The paper: Kirillov, Mintun, Ravi, Mao, Rolland, Gustafson, Xiao, Whitehead, Berg, Lo, Dollár, Girshick, Segment Anything (Meta AI / FAIR, ICCV 2023). Model, dataset, and demo at segment-anything.com.
  2. The image encoder is an MAE-pretrained Vision Transformer (ViT-H/16), with the high-resolution windowed-attention adaptation from ViTDet.
  3. Text prompts route through the text encoder of CLIP; the train-on-image, infer-on-text trick relies on CLIP's shared image-text space.
  4. The mask losses: focal loss from Lin et al. (RetinaNet) and dice loss from Milletari et al. (V-Net). The 20:1 combination and the two-way-attention mask decoder are inherited from DETR and MaskFormer.
  5. Predicting several outputs and supervising only the best is multiple-choice learning: Lee et al., building on Guzman-Rivera et al. (2012). The interactive baseline SAM is measured against is RITM.