U-Net: Convolutional Networks for Biomedical Image Segmentation

Pooling discards pixel location; the skip connections preserve a copy of it.

A classifier crushes an image down to a single label. Segmentation needs the opposite: a label for every pixel. U-Net gets there by shrinking an image to read it, then growing it back to full resolution, with shortcuts that pass the lost detail straight across.

Explaining the paper: U-Net: Convolutional Networks for Biomedical Image Segmentation · Ronneberger, Fischer, Brox · University of Freiburg · MICCAI 2015 · arXiv:1505.04597

Thirty labelled microscope images, and a network that has to put a class on every single pixel. That is the whole training set U-Net was built for.

For most of deep learning's short history, a convolutional network answered one question per image: what is this? You feed in a picture and out comes a single label. Cat. The image collapses to one word. That is image classification, and by 2015 convnets were very good at it.

A biologist staring at an electron-microscope image of brain tissue needs a different kind of answer. Not "this image contains cells," but "this pixel is the inside of a cell, that pixel is a membrane, and the hairline between these two cells that are pressed together is the wall that separates them." A class label for every single pixel. That task is called semantic segmentation, and it is harder than it looks for a structural reason.

To recognize what something is, a network needs context. It stacks convolutions and pooling layers, and each $2\times2$ pooling step replaces a small patch of the feature map with a single value. A neuron deeper in the stack therefore sees a wider region of the image and builds a more abstract feature, but it no longer localizes which exact pixel that feature came from. An image classifier is built on that tradeoff, and it is the wrong thing to do if you also need to know where a boundary sits down to the pixel. You gain what at the cost of where. Context and localization pull against each other, and this is the tension the paper is built to resolve.

U-Net, out of Olaf Ronneberger's group at the University of Freiburg, resolves it with a shape you can draw in one stroke. A contracting path downsamples the image the way a classifier would, to gather context. A symmetric expanding path upsamples back to full resolution, to recover location. And a set of skip connections sends the high-resolution detail from the contracting side directly across to the expanding side, so the network never has to reconstruct, from the compressed bottleneck, what it already saw clearly on the way down. Wrap that in a training recipe built for scarce data, and the network trains end to end from a few dozen images. We will build it up one piece at a time.

Two questions that fight each other

Before U-Net, the strongest way to segment a biomedical image was the sliding-window network of Ciresan and colleagues, the method that had won the 2012 version of this very challenge. The idea is direct: to label one pixel, cut out a square patch centered on it and ask a classifier what the center pixel is. Slide that window over every pixel and you get a full segmentation. It works, and it has two problems the paper is blunt about.

The first is speed. You run the network once per pixel, and neighboring patches overlap almost completely, so you recompute nearly the same features millions of times for a single image. The second is the context-localization tension, now made concrete as a single knob: the patch size. A big patch carries more context but needs more pooling, which blurs out exactly where the center pixel's boundary is. A small patch localizes tightly but sees too little around it to know what it is looking at. You are forced to trade the two against each other, and neither setting is good at both. Drag the patch size and feel the trade:

Figure 1 · the patch-size dilemma

Left, a cell-like blob and the patch a sliding-window classifier sees when labelling the marked boundary pixel. Right, two schematic meters, not measurements: context rises with patch size while localization falls, and they cross, so no single size wins both. U-Net's way out, contract for context and expand back for location, is the next section.

On top of that sits the data problem. The breakthrough classifiers of the era trained on a million labelled images. A lab annotating electron-microscopy stacks by hand might produce thirty. So the method also has to be frugal with labels. One architecture and one training recipe handle all three pressures: speed, the context-localization trade, and label scarcity. Start with the architecture.

The U: shrink to read it, grow it back

U-Net descends from the fully convolutional network (FCN), which had shown a few months earlier that you can turn a classifier into a dense predictor by following its pooling-driven downsampling with learned upsampling, and by combining a coarse, deep layer with a finer, shallower one to sharpen the result. U-Net takes that skeleton and makes it symmetric, deeper in the decoder, and trainable from almost no data. (FCN adds the coarse and fine layers together. U-Net concatenates them, a difference that matters later and is bigger than it sounds.)

The network has the shape of the letter it is named for. Down the left arm runs the contracting path. It is an ordinary convnet: at each level, two $3\times3$ convolutions followed by ReLUs, then a $2\times2$ max-pool that halves the height and width. Every time the spatial size halves, the number of feature channels doubles, from 64 at the top to 1024 at the bottom. The image is getting smaller and "thicker": less space, more meaning per location. At the bottom of the U, the bottleneck, a tile that started at $572\times572$ has been reduced to $28\times28$ with 1024 channels. This is the most context-rich, least spatially-precise the data ever gets. After four poolings each surviving grid cell (one location in the shrunken feature map) has been shaped by a large patch of the original tile, and that accumulated reach is exactly what "context" means here; the price of the reach is that the same grid cell no longer encodes where inside that patch anything sat.

That accumulated reach has a name: the receptive field, the patch of the original input that a single neuron has been shaped by. It grows by simple arithmetic as you descend. Each valid $3\times3$ convolution widens it by two steps of the current stride (two pixels at the top level), and each $2\times2$ pool doubles that stride, so every later step reaches twice as far. Step the depth slider from the top to the bottleneck and watch the teal square, the receptive field of one neuron, swell from a 5-pixel sliver to a 140-pixel patch, about a quarter of the width of the $572\times572$ tile, while the feature map on the right shrinks from $568$ to $28$. Deeper buys context and spends location:

Figure 2 · how context grows

The receptive field of one neuron, the patch of the $572^2$ input it has been shaped by, against that level's shrinking feature map. Walked forward from a single pixel by exact conv/pool arithmetic (each valid 3×3 conv adds 2 px, each pool doubles the reach): $5^2$ at the top up to $\sim140^2$ at the bottleneck, while the map falls $568^2 \to 28^2$. More context, coarser map; the skips give the location back.
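The arithmetic behind those numbers fits in a few lines of Python. This is a minimal sketch, assuming the standard receptive-field recurrence (a layer with kernel size k grows the field by (k − 1) times the product of all strides so far); the function name `contracting_path_trace` is ours, not the paper's.

```python
def contracting_path_trace(tile=572, levels=4):
    """Track feature-map size and receptive field down U-Net's left arm.

    Assumes the standard recurrence: a layer with kernel k grows the
    receptive field by (k - 1) * jump, where jump is the product of all
    strides so far. Returns (label, map_size, receptive_field) tuples.
    """
    size, rf, jump = tile, 1, 1
    rows = []
    for level in range(levels + 1):
        for _ in range(2):            # two valid 3x3 convs per level
            size -= 2                 # a valid conv trims 1 px per side
            rf += 2 * jump            # (k - 1) * jump with k = 3
        rows.append((f"level {level}", size, rf))
        if level < levels:            # no pool after the bottleneck
            size //= 2                # 2x2 max-pool, stride 2
            rf += jump                # (k - 1) * jump with k = 2
            jump *= 2                 # later layers step twice as far
    return rows


rows = contracting_path_trace()
for label, size, rf in rows:
    print(f"{label}: map {size}x{size}, receptive field {rf}x{rf}")
# level 0: map 568x568, receptive field 5x5
# level 1: map 280x280, receptive field 14x14
# level 2: map 136x136, receptive field 32x32
# level 3: map 64x64, receptive field 68x68
# level 4: map 28x28, receptive field 140x140
```

Each pool contributes only one stride-step of its own; most of the growth comes from the convolutions that follow it, which now move in doubled steps.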

Up the right arm runs the expanding path, and it undoes the squeeze. At each level an up-convolution doubles the height and width and halves the channels, two $3\times3$ convs refine the result, and the map climbs back toward full resolution. By the top the network is back to a large spatial map, now 64 channels deep, and a final $1\times1$ convolution turns each pixel's 64-number feature vector into a score for each class. Drag the marker down and back up the U and watch the shape of the data change at every step:

Figure 3 · the architecture

The U-Net, walked end to end. Contracting path (left): two valid 3×3 convs then a 2×2 max-pool per level; space halves, channels double, down to a 28²×1024 bottleneck. Expanding path (right): up-conv doubles space and halves channels, then two convs. A copy-and-crop skip bridges each level. The numbers are the paper's Figure-1 example: 572² in, 388² out.

That is the forward pass, end to end, an ordinary feedforward convnet. Here it is in pseudocode. The only thing to notice is that the contracting path saves each level's feature map in a list, and the expanding path pulls them back out in reverse. Those saved maps are the skip connections, and they are the reason the U works at all.

# U-Net forward pass: contract, then expand reusing the skips
skips = []
x = tile                              # one mirror-padded input tile
for ch in [64, 128, 256, 512]:        # contracting path
    x = relu(conv3x3(x, ch)); x = relu(conv3x3(x, ch))   # two valid convs
    skips.append(x)                   # keep this map for later
    x = maxpool2x2(x)                 # halve H,W; channels double next
x = relu(conv3x3(x, 1024)); x = relu(conv3x3(x, 1024))   # bottleneck, 1024 ch
for skip in reversed(skips):          # expanding path
    ch = channels(skip)               # 512, 256, 128, 64
    x = upconv2x2(x, ch)              # double H,W, halve channels
    x = concat(x, crop(skip, x))      # copy-and-crop the encoder map
    x = relu(conv3x3(x, ch)); x = relu(conv3x3(x, ch))
return conv1x1(x, num_classes)        # per-pixel class scores
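To check the shapes the pseudocode above produces, here is a small runnable trace that tracks only (channels, side length) through the same steps. It is a bookkeeping sketch, not a network: `unet_shapes` is a hypothetical helper, no pixels are processed, and the default of two output classes is an assumption taken from the paper's Figure-1 example.

```python
def unet_shapes(tile=572, widths=(64, 128, 256, 512), num_classes=2):
    """Trace (channels, side length) through a U-Net with valid 3x3 convs.

    Returns the bottleneck shape, the (skip size, decoder size) pair at
    each concatenation, and the output shape.
    """
    size, skips = tile, []
    for ch in widths:                            # contracting path
        size -= 4                                # two valid 3x3 convs
        if size % 2:
            raise ValueError(f"odd map ({size}) before a 2x2 pool")
        skips.append((ch, size))
        size //= 2                               # 2x2 max-pool
    ch, size = 2 * widths[-1], size - 4          # bottleneck convs
    bottleneck = (ch, size)
    crops = []
    for skip_ch, skip_size in reversed(skips):   # expanding path
        size *= 2                                # up-conv doubles H, W
        ch //= 2                                 # ...and halves channels
        assert ch == skip_ch                     # concat yields 2 * ch
        crops.append((skip_size, size))          # skip is cropped to fit
        size -= 4                                # two valid convs, back to ch
    return bottleneck, crops, (num_classes, size)


bottleneck, crops, out = unet_shapes()
print(bottleneck)   # (1024, 28)
print(crops)        # [(64, 56), (136, 104), (280, 200), (568, 392)]
print(out)          # (2, 388)
```

The `crops` list shows why the skip is "copy and crop" and not just "copy": every encoder map is larger than the decoder map it joins, and the gap widens toward the top of the U.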

A word on the up-convolution, because the name hides a choice. The paper describes it as "an upsampling of the feature map followed by a $2\times2$ convolution." Many later implementations collapse that into a single strided transposed convolution (also called a deconvolution, a misleading name since it is not the inverse of anything). People use the terms loosely, and the two do the same job, but they are not identical in practice: a strided transposed convolution can stamp a faint checkerboard into the output, which the upsample-then-convolve form avoids. What matters for the architecture is the bookkeeping: double the resolution, halve the channels.
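A one-dimensional toy shows where the checkerboard can come from. This is an illustrative sketch with hand-picked weights, not the paper's layer: both functions are hypothetical helpers, and the input is a constant signal, so any ripple in the output comes from the upsampling operator itself.

```python
def transposed_conv1d_stride2(x, kernel):
    """Stride-2 transposed conv with a 2-tap kernel: each input sample
    stamps its own scaled copy of the kernel into the output."""
    a, b = kernel
    out = []
    for v in x:
        out.extend([a * v, b * v])
    return out


def upsample_then_conv1d(x, kernel):
    """Nearest-neighbour 2x upsampling followed by a valid 2-tap conv."""
    a, b = kernel
    up = [v for v in x for _ in range(2)]      # repeat every sample
    return [a * up[i] + b * up[i + 1] for i in range(len(up) - 1)]


flat = [1.0, 1.0, 1.0, 1.0]     # a perfectly flat feature row
kernel = (0.5, 1.0)             # unequal taps, chosen for illustration

ripple = transposed_conv1d_stride2(flat, kernel)
smooth = upsample_then_conv1d(flat, kernel)
print(ripple)   # [0.5, 1.0, 0.5, 1.0, 0.5, 1.0, 0.5, 1.0]
print(smooth)   # [1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5]
```

In the transposed form, even and odd output positions are produced by different weights, so a flat input comes out striped unless training happens to equalize them. In the upsample-first form every output position sees the same weights, and flat stays flat.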

The skip connections carry the detail across

The U-shape forces a question. By the time information reaches the bottleneck, it has been through four max-pools. A $28\times28$ map cannot, even in principle, say which of the original 572 columns a boundary fell on. The location has been quantized away. So if the decoder had only the bottleneck to work from, the best it could do is upsample a blurry, blocky guess: it would know roughly where the cell is, but it could not trace the cell's outline to the pixel. The fine detail is gone from that path.

But the fine detail is not gone from the network. It was sitting right there in the contracting path's early layers, at full resolution, before the pooling destroyed it. The skip connection hands that lost detail back. At each level of the decoder, U-Net takes the saved feature map from the corresponding level of the contracting path, crops it to size, and concatenates it onto the upsampled decoder map before the next convolutions run. The decoder map brings the deep, wide-context meaning. The skip brings the crisp, high-resolution geometry. The convolutions that follow get to combine both, so the output can be both correctly classified and precisely placed.

U-Net's choice of concatenation over FCN's addition matters most here. Adding the two maps forces them into the same channels and lets the network blend them only in fixed proportion. Concatenating keeps every channel from both, so the following convolutions can learn for themselves how to weigh the precise-but-shallow skip against the meaningful-but-coarse decoder. It costs memory and buys flexibility, and the encoder's full channel count flows into a symmetric decoder that can exploit it.
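One way to see the extra freedom: at a single pixel, a 1×1 convolution over the concatenated channels is a matrix with one block per source, and addition is the special case where both blocks are the identity. A minimal sketch with made-up numbers; `conv1x1` here is a plain matrix-vector product at one pixel, not a library call.

```python
def conv1x1(weights, features):
    """Apply a 1x1 convolution at one pixel: a matrix-vector product."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


decoder = [2.0, -1.0]        # coarse, context-rich features (made up)
skip = [0.5, 3.0]            # sharp, shallow features (made up)

added = [d + s for d, s in zip(decoder, skip)]
concat = decoder + skip      # list concatenation stands in for channel concat

# Weights [I | I] reproduce addition exactly...
identity_blocks = [[1, 0, 1, 0],
                   [0, 1, 0, 1]]
# ...but a learned layer is free to pick any other mix per output channel.
learned_mix = [[1.0, 0.0, 0.2, 0.0],     # mostly decoder, a little skip
               [0.0, 0.1, 0.0, 1.0]]     # mostly skip, a little decoder

print(added)                               # [2.5, 2.0]
print(conv1x1(identity_blocks, concat))    # [2.5, 2.0]
print(conv1x1(learned_mix, concat))        # a different blend
```

Addition is therefore one point in the space of mixes that concatenation makes available; the price is that the following convolution has twice as many input channels to process.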

You can feel what the skip does by taking it away. Below, the dashed amber curve is a cell's true outline. The teal mask is the network's prediction. With the skip turned off, the decoder works from the coarse bottleneck alone and produces a blocky, staircased boundary that misses every lobe. Turn the skip up and the high-resolution detail returns, and the predicted boundary aligns with the true outline:

Figure 4 · why the skips matter

The true cell outline (dashed) versus the predicted mask. With the skip off, the decoder has only the coarse bottleneck and draws a staircased boundary. Drag the skip up and the contracting path's high-resolution features flow back in; the boundary error falls and the mask traces the real outline.

Around that skip sits the engineering that made this shape train well on real microscopy data with few labels: the unpadded convolutions, the tiling, the weighted loss, the deformations.

Valid convolutions, and an image too big to fit

U-Net uses only the valid part of every convolution. A valid (or unpadded) $3\times3$ conv does not invent pixels at the image edge to keep the output the same size; it produces an output that is one pixel smaller on each side. Two convs per level, down the U and back up, and the losses add up: an input tile of $572\times572$ comes out as a $388\times388$ segmentation. The output is smaller than the input by a fixed border, on purpose.

Why accept that? Because the alternative, padding with zeros, fabricates a hard black edge that no real tissue ever had, and the convolutions near the border learn from that fiction. Valid convolutions guarantee the opposite: every pixel in the output was computed only from real image context, never from invented padding. The segmentation map contains exactly the pixels for which the full context was actually available.

This creates a logistics problem. If the output is always smaller than the input, how do you segment an image larger than what fits on the GPU, or label the pixels right at the image's own edge, where there is no surrounding context to feed in? U-Net handles this with the overlap-tile strategy. You predict the image in tiles. To produce one output tile you feed in a larger input tile that includes a border of surrounding context. Where that border runs off the edge of the actual image, the missing context is invented by mirroring: the image is reflected across its own boundary, so the network sees a realistic continuation with the same texture and statistics instead of a black void. Drag the tile to the image's edge and watch the missing context fill in as a reflection:

Figure 5 · overlap-tile with mirroring

The input tile is larger than the output tile it can predict, by the border the valid convolutions consume. Slide the tile across the image; at the edge the input runs off into nothing, so the missing context is filled by mirroring the interior across the boundary. This is how U-Net segments arbitrarily large images seamlessly.
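The mirroring itself is an index trick. Below is a sketch in plain Python for one row of pixels (a 2-D image applies it along both axes). `mirror_pad` is a hypothetical helper, and the 92-pixel border is simply (572 − 388) / 2 from the tile sizes above. It reflects about the edge pixel without repeating it, the behaviour NumPy calls `mode="reflect"`; whether the original implementation repeats the edge pixel is not stated in the text, so treat that choice as an assumption.

```python
def mirror_pad(row, border):
    """Extend a row by `border` pixels on each side by reflecting it
    about its first and last pixels (the edge pixel is not repeated)."""
    n = len(row)
    if border >= n:
        raise ValueError("border must be smaller than the row length")
    left = [row[i] for i in range(border, 0, -1)]
    right = [row[n - 1 - i] for i in range(1, border + 1)]
    return left + list(row) + right


border = (572 - 388) // 2                  # context consumed per side
print(border)                              # 92
print(mirror_pad([1, 2, 3, 4], 2))         # [3, 2, 1, 2, 3, 4, 3, 2]

# A 388-pixel row padded by the border gives exactly one 572-pixel input tile.
padded = mirror_pad(list(range(388)), border)
print(len(padded))                         # 572
```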

All the halving imposes a constraint: the input tile size has to be chosen so that every $2\times2$ max-pool lands on a map with even height and width. A stride-2 pool on a 37-wide map has no partner for the last column, so an implementation has to drop or pad it, and the crop that aligns each skip with the decoder map stops being symmetric. It is a detail, but it is why the example tile is $572$ and not a round number: that size stays even through every level of the descent.
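The constraint is easy to check by brute force. A minimal sketch, assuming the four-level layout described above (two valid 3×3 convs, then a 2×2 pool, at each level); `pools_stay_even` is a hypothetical helper name.

```python
def pools_stay_even(tile, levels=4):
    """True if every 2x2 max-pool in the contracting path sees an even,
    positive side length (two valid 3x3 convs precede each pool)."""
    size = tile
    for _ in range(levels):
        size -= 4                 # two valid 3x3 convs
        if size <= 0 or size % 2:
            return False
        size //= 2                # 2x2 max-pool
    return size - 4 > 0           # the bottleneck convs must leave something


valid = [t for t in range(540, 600) if pools_stay_even(t)]
print(valid)                      # [540, 556, 572, 588]
```

For this depth the admissible sizes come out 16 pixels apart, and 572 is one of them; 570 or 576 would hit an odd map partway down.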

A weighted loss that forces sharp borders

The output is a stack of class scores at every pixel. To turn scores into probabilities, U-Net runs a softmax across the classes at each pixel position $\mathbf{x}$:

$$p_k(\mathbf{x}) = \frac{\exp\!\big(a_k(\mathbf{x})\big)}{\sum_{k'=1}^{K}\exp\!\big(a_{k'}(\mathbf{x})\big)}$$

Here $a_k(\mathbf{x})$ is the network's score for class $k$ at pixel $\mathbf{x}$, and $p_k(\mathbf{x})$ is the resulting probability, close to 1 for the winning class and near 0 for the rest. Training then pushes the probability of the correct class toward 1 at every pixel using a weighted cross-entropy, the paper's energy function:

$$E = -\sum_{\mathbf{x}\in\Omega} w(\mathbf{x})\,\log p_{\ell(\mathbf{x})}(\mathbf{x}) \tag{1}$$

Read it piece by piece. $\ell(\mathbf{x})$ is the true label at pixel $\mathbf{x}$, so $p_{\ell(\mathbf{x})}(\mathbf{x})$ is the probability the network assigned to the right answer there. When that probability is 1 the pixel contributes nothing; when it drops, $-\log$ of it grows, and the loss rises. Sum over every pixel $\mathbf{x}$ in the image domain $\Omega$, with a per-pixel weight $w(\mathbf{x})$ we are about to define, and you have the quantity to minimize.

(A note on the sign, because the paper's printed Eq (1) drops it. As published, the formula reads $E = \sum w\,\log p$ with no leading minus. Since a probability is at most 1, its log is at most 0, so that expression is never positive, and to drive the right-class probability toward 1 you would have to maximize it. The cross-entropy loss you actually minimize carries the minus sign, as in (1) above.)

The weight map $w(\mathbf{x})$ solves a problem specific to cells: telling apart two cells of the same class that are touching. If the network only had to say "cell" or "background," it could merge two touching cells into one blob and pay almost nothing for it, since the thin wall between them is only a sliver of pixels. So U-Net precomputes, for each training image, a weight map that makes those sliver pixels expensive to get wrong:

$$w(\mathbf{x}) = w_c(\mathbf{x}) + w_0\cdot\exp\!\left(-\frac{\big(d_1(\mathbf{x}) + d_2(\mathbf{x})\big)^2}{2\sigma^2}\right) \tag{2}$$

$w_c$ is a baseline weight that balances how often each class appears, so a rare class is not drowned out. The interesting term is the second one. $d_1(\mathbf{x})$ is the distance from pixel $\mathbf{x}$ to the border of the nearest cell, and $d_2(\mathbf{x})$ the distance to the border of the second-nearest. Out in open background, far from everything, both distances are large, the exponential is near 0, and the weight equals the baseline. But in the thin gap between two touching cells, both borders are close, so $d_1 + d_2$ is small, the exponential approaches $e^0 = 1$, and the weight jumps by up to $w_0$. The paper sets $w_0 = 10$ and $\sigma \approx 5$ pixels, so the separating membrane between two cells can carry roughly ten times the weight of an ordinary pixel. Because the falloff is Gaussian in $d_1 + d_2$, the bonus is sharply local: near full strength in gaps a few pixels wide, and almost nothing once the two distances sum to a few multiples of $\sigma$. Drag the two cells together and watch the ridge of weight rise in the gap:

Figure 6 · the separation weight map

Equation (2) made visible. The amber weight is brightest in the thin background between two cells, where both borders are near, so $d_1+d_2$ is small. Close the gap and the ridge climbs toward the peak weight $w_0 = 10$; pull the cells apart and it fades to the baseline. The weight map penalizes errors on these wall pixels, so the network must learn to separate touching cells.
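Plugging a few distances into Equation (2) shows how quickly the bonus dies away. A small sketch: `border_weight` is a hypothetical helper, and the baseline `w_c = 1.0` is an assumption for illustration (in the paper it is computed from class frequencies).

```python
import math


def border_weight(d1, d2, w_c=1.0, w0=10.0, sigma=5.0):
    """Eq. (2): class-balance weight plus a Gaussian bonus that peaks
    where the two nearest cell borders are both close."""
    return w_c + w0 * math.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))


for d1, d2 in [(0, 0), (1, 2), (2, 3), (5, 5), (10, 10)]:
    print(f"d1={d1:2d} d2={d2:2d} -> w = {border_weight(d1, d2):.3f}")
# d1= 0 d2= 0 -> w = 11.000
# d1= 1 d2= 2 -> w = 9.353
# d1= 2 d2= 3 -> w = 7.065
# d1= 5 d2= 5 -> w = 2.353
# d1=10 d2=10 -> w = 1.003
```

With these settings a pixel midway across a 5-pixel gap still carries about seven times the baseline, while one 10 pixels from both borders is back to the baseline for practical purposes.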

Put the two equations together and the training step reads in one breath: run the tile through the U, softmax the scores, take the per-pixel cross-entropy against the labels, multiply by the precomputed weight map, and sum.

# pixelwise softmax + weighted cross-entropy (Eq 1),
# using the precomputed border weight map (Eq 2)
p  = softmax(scores, over=classes)    # p_k(x) at every pixel x
ce = -log(p[label[x], x])             # per-pixel cross-entropy
w  = w_c + w0 * exp(-(d1 + d2)**2 / (2 * sigma**2))   # w0=10
loss = sum(w * ce)                    # summed over the output map
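The same steps run as written on a toy example. This is a plain-Python sketch on three pixels with made-up scores, taking the weights as given; the function names are ours, and a real implementation would vectorize this over the whole output map.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one pixel's class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(scores, labels, weights):
    """Eq. (1): sum over pixels of w(x) * -log p_{l(x)}(x).

    scores:  per-pixel list of K class scores
    labels:  per-pixel true class index
    weights: per-pixel w(x), for example from Eq. (2)
    """
    loss = 0.0
    for a, label, w in zip(scores, labels, weights):
        p = softmax(a)
        loss += w * -math.log(p[label])
    return loss


# Three pixels, two classes (0 = background, 1 = cell); numbers made up.
scores = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
labels = [0, 1, 0]
uniform = [1.0, 1.0, 1.0]
border = [1.0, 1.0, 11.0]      # third pixel sits in a gap between cells

loss_uniform = weighted_cross_entropy(scores, labels, uniform)
loss_border = weighted_cross_entropy(scores, labels, border)
print(round(loss_uniform, 3))  # 0.947
print(round(loss_border, 3))   # 7.878
```

The network is equally unsure about the third pixel in both runs; the weight map is what turns that uncertainty from a minor cost into the dominant term of the loss.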

Thirty images, and the deformations that multiply them

An architecture and a loss are not enough when you have thirty training images. A network this size would memorize them. U-Net's third leg is aggressive data augmentation, and the part that matters most is elastic deformation: warping each training image (and its label map) with a smooth, random distortion, so the network sees a fresh, plausible variant every time. Flips and rotations help too, but they only move a rigid image around. Real tissue is not rigid. It stretches and folds locally, and a network that has only seen rigid copies stays fragile to the warps it will actually meet.

The recipe is small. Lay a coarse $3\times3$ grid of control points over the image and give each one a random displacement drawn from a Gaussian with a standard deviation of 10 pixels. Interpolate those few displacements smoothly across every pixel (the paper uses bicubic interpolation) and you get a flowing, tissue-like warp, not a jagged scramble. Apply the same warp to the image and its segmentation so they stay aligned. Because real tissue deforms this way, the augmented images stay realistic, and the network learns to be invariant to exactly the kind of variation it will meet. Press for a new deformation and feel how one labelled cell becomes an endless supply of them:

Figure 7 · elastic deformation

A coarse 3×3 displacement field (the paper samples it from a Gaussian with a 10 px standard deviation) is interpolated smoothly across the image, warping the grid and the labelled cell together. Each draw is a different believable deformation of the same annotation. This is the augmentation that lets U-Net train from a few dozen images without overfitting.
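The recipe can be sketched in plain Python. Two simplifications are swapped in and should be read as assumptions: the coarse field is interpolated bilinearly here, whereas the paper uses bicubic interpolation, and the warp samples with nearest-neighbour lookup, which suits label maps but is cruder than what one would use for the image itself. The helper names and the seed are hypothetical.

```python
import random


def displacement_field(height, width, sigma=10.0, grid=3, seed=0):
    """Per-pixel (dy, dx) from a coarse grid x grid field of Gaussian
    displacements. Bilinear interpolation here; the paper uses bicubic."""
    rng = random.Random(seed)
    ctrl = [[(rng.gauss(0, sigma), rng.gauss(0, sigma))
             for _ in range(grid)] for _ in range(grid)]
    field = []
    for y in range(height):
        gy = y * (grid - 1) / (height - 1)       # row in grid coordinates
        y0 = min(int(gy), grid - 2)
        ty = gy - y0
        row = []
        for x in range(width):
            gx = x * (grid - 1) / (width - 1)    # column in grid coordinates
            x0 = min(int(gx), grid - 2)
            tx = gx - x0
            row.append(tuple(
                (1 - ty) * ((1 - tx) * ctrl[y0][x0][k]
                            + tx * ctrl[y0][x0 + 1][k])
                + ty * ((1 - tx) * ctrl[y0 + 1][x0][k]
                        + tx * ctrl[y0 + 1][x0 + 1][k])
                for k in range(2)))
        field.append(row)
    return ctrl, field


def warp(image, field):
    """Backward-warp with nearest-neighbour sampling, clamped at the edges.
    Apply the same field to an image and its label map to keep them aligned."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        out_row = []
        for x in range(w):
            dy, dx = field[y][x]
            sy = min(max(int(round(y + dy)), 0), h - 1)
            sx = min(max(int(round(x + dx)), 0), w - 1)
            out_row.append(image[sy][sx])
        out.append(out_row)
    return out


# A toy 8x8 label map with a 4x4 "cell". sigma is scaled down to 1.5 px
# because a 10 px shift would exceed this tiny image.
labels = [[1 if 2 <= r < 6 and 2 <= c < 6 else 0 for c in range(8)]
          for r in range(8)]
ctrl, field = displacement_field(8, 8, sigma=1.5, seed=0)
warped = warp(labels, field)
for row in warped:
    print("".join(str(v) for v in row))
```

Changing `seed` gives a new smooth warp of the same annotation, which is the sense in which thirty images become an open-ended supply.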

Two smaller training choices round it out, and both answer the same constraint: tiny data on a small GPU. First, the batch is a single image. The authors would rather spend their memory on a large input tile, for more context per forward pass, than on a big batch, so they push the batch down to one and lean on a high momentum of $0.99$ to compensate. Mechanically, each step blends the fresh gradient with a fading memory of the ones before it, an exponential moving average that reaches back over roughly the last hundred updates, so the quirks of any single image wash out and only the direction the images agree on actually moves the weights. (The "hundred" is the intuition $1/(1-0.99)=100$, not a number the paper states.) Second, the weights are initialized from a Gaussian with standard deviation $\sqrt{2/N}$, where $N$ is the number of inputs feeding each neuron (for a $3\times3$ conv over 64 channels, $N = 9 \cdot 64 = 576$). That specific $\sqrt{2/N}$ is He initialization, designed so that ReLU layers neither blow up nor fade out as the signal passes through; the factor of 2 is the correction for ReLU zeroing half its inputs.
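Both numbers in that paragraph are quick to reproduce. A small worked sketch; the variable names are ours, and the "share held by the latest 100 steps" is an illustration of the intuition, not a figure from the paper.

```python
import math

# He initialization: std = sqrt(2 / N), N = inputs feeding one neuron.
fan_in = 3 * 3 * 64                    # 3x3 kernel over 64 channels
he_std = math.sqrt(2 / fan_in)
print(fan_in, round(he_std, 4))        # 576 0.0589

# Momentum 0.99: a gradient from k steps ago still carries weight 0.99**k.
momentum = 0.99
horizon = 1 / (1 - momentum)           # total weight of the geometric series
recent_share = 1 - momentum ** 100     # share held by the latest 100 steps
print(round(horizon), round(recent_share, 3))   # 100 0.634
```

So "the last hundred" is a horizon, not a hard window: about 63% of the accumulated weight sits on the most recent hundred gradients, and older ones keep fading.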

Three challenges, one architecture

U-Net was tested on three biomedical segmentation tasks, and the same architecture won all of them. On the ISBI 2012 challenge for segmenting neuronal structures in electron-microscopy stacks, trained on just 30 images, it topped the ranking by warping error, the challenge's primary metric, and beat the prior sliding-window network it was designed to replace. Warping error penalizes splits and merges that change the topology of the segmentation: it cares whether each cell is correctly separated from its neighbors, not just whether pixels are individually right. On the ISBI 2015 cell-tracking challenge it won both light-microscopy categories outright, reaching an intersection-over-union (IoU, the overlap between predicted and true masks) of 92% on PhC-U373 and 77.5% on DIC-HeLa, against second-best scores of 83% and 46%. Toggle between the two challenges:

Figure 8 · the results

Cell tracking (IoU, higher is better): U-Net lifts DIC-HeLa IoU from 46% to 77.5%. EM segmentation (warping error, lower is better): U-Net edges out DIVE-SCI and beats Ciresan's IDSIA net, with human-level marked as the floor. The caveat is right on the chart: U-Net topped the warping-error ranking, but on Rand error alone it was not first.
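For reference, the IoU in that caption is a simple ratio of pixel counts. A minimal sketch on binary masks; `iou` is a hypothetical helper, and the challenge's own evaluation protocol may differ in detail and is not reproduced here.

```python
def iou(pred, truth):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0


pred = [[1, 1, 1, 0]]
truth = [[0, 1, 1, 1]]
print(iou(pred, truth))    # 0.5  (2 shared pixels out of 4 covered)
print(iou(truth, truth))   # 1.0
```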

That last caveat deserves to be plain, since the paper is careful about it. The EM challenge reports three different error metrics, and U-Net was first by warping error, the one the ranking is sorted on. By Rand error (a per-pixel-pair agreement score that, unlike warping error, does not privilege the thin walls between touching cells) a couple of entries scored lower, and they leaned on heavy dataset-specific post-processing to do it. U-Net ranked first on the primary metric with no pre- or post-processing, from a handful of images, which is what made it a clean, general method rather than a tuned pipeline.

And it was fast and cheap to train. Training took about ten hours on a single NVIDIA Titan with 6 GB of memory, and segmenting a $512\times512$ image at inference took under a second. A simple method that matched or beat specialized pipelines from a few dozen labelled images, and ran fast: that is why the architecture spread far past biology.

The contracting path gathers context, the expanding path recovers location, and the skips deliver the high-resolution detail across the gap so the recovery is traced rather than guessed. What made it train on thirty images was the rest: a loss weighted toward the borders between touching cells, and an elastic warp that turns those thirty into an endless supply. The encoder-decoder-with-skips it introduced is now the default backbone for dense prediction everywhere, from medical imaging to the denoising network at the heart of modern diffusion models. The network always extracted fine detail in its early layers. What it kept discarding was where things were, and U-Net kept a copy.

Provenance · Verified against primary literature

FCN (Long et al., 2015) · Fully convolutional nets: replace pooling with upsampling, fuse a coarse and a fine layer. U-Net fuses by concatenation, not summation.

He et al. (2015) · Initialize weights at std √(2/N) so ReLU layers keep ~unit variance.

Simard et al. (2003) · Elastic deformation as augmentation. U-Net cites Dosovitskiy for the value of augmentation, not Simard.

Arganda-Carreras et al. · The ISBI 2012 warping / Rand / pixel error metrics (lower is better).

Correction · The paper prints Eq (1) as E = Σ w·log p, with no minus sign. As written, that quantity is ≤ 0 and would be maximized. The cross-entropy you minimize is E = −Σ w·log p. We teach the correct sign and flag the omission.

Questions you might still have

Why not just keep the resolution high the whole way through, and skip the downsampling?
Then every neuron would only ever see a tiny patch. Downsampling is how a convnet grows its receptive field (the region of the input one neuron can see) cheaply, so it can tell a cell from a membrane using context rather than local texture alone. The U keeps the wide context and recovers the resolution afterwards, instead of paying for full resolution at every layer.

If the skip just copies the encoder features over, why bother with the decoder at all?
The encoder map is high resolution but shallow: it encodes that there is an edge here, but not what the edge belongs to. The decoder carries the deep, large-context meaning back up. Concatenation lets the final convolutions combine both: precise location from the skip, semantic identity from the decoder.

Why are the convolutions unpadded, when same-padding would make the output match the input size?
Unpadded (valid) convolutions never invent edge pixels, so every output pixel is computed from real context only. The cost is a smaller output and the overlap-tile bookkeeping. Most modern U-Nets use same-padding for convenience; the original chose valid convs so every output pixel comes from real image context.

Did U-Net actually win, or just do well?
On the ISBI 2012 EM challenge it topped the ranking by warping error and beat the prior sliding-window net. On Rand error alone it was not first. On the ISBI 2015 cell-tracking datasets it won both categories outright, lifting IoU on DIC-HeLa from the runner-up’s 46% to 77.5%.

Footnotes & further reading

The paper: Ronneberger, Fischer, Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation (University of Freiburg, MICCAI 2015).
The architecture U-Net extends: Long, Shelhamer, Darrell, Fully Convolutional Networks for Semantic Segmentation (CVPR 2015), which fuses a coarse and a fine layer by summation; U-Net concatenates instead.
The prior best method on this challenge: Ciresan et al., Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images (NIPS 2012), the sliding-window network.
The initialization: He, Zhang, Ren, Sun, Delving Deep into Rectifiers (2015), the source of the $\sqrt{2/N}$ rule.
Elastic deformation as augmentation traces to Simard, Steinkraus, Platt, Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis (ICDAR 2003); U-Net cites Dosovitskiy et al. for the value of augmentation, not Simard.
The challenge metrics (warping, Rand, pixel error): Arganda-Carreras et al., Crowdsourcing the Creation of Image Segmentation Algorithms for Connectomics (Frontiers in Neuroanatomy, 2015).
