NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

Fit a small network to a scene's photos, and render it from any new camera angle.

NeRF stores a 3D scene as a continuous function living inside one small network. To render a new view it marches a camera ray through the scene, asks the network for the color and density at points along the way, and adds them up. No mesh, no voxel grid, just the weights.

Explaining the paper: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. Mildenhall, Srinivasan, Tancik, Barron, Ramamoorthi, Ng · UC Berkeley · Google · UC San Diego · ECCV 2020 · arXiv:2003.08934

You have a few dozen photos of an object, shot from different angles. You want a video that flies the camera along a smooth path none of those photos ever took.

This is the view-synthesis problem, and it is older than deep learning. The classic recipe builds an explicit model of the object first, a triangle mesh or a grid of colored voxels, then renders the model from the new camera. Both run into the same wall. A mesh needs the right topology before you can optimize it, and real scenes rarely hand you one. A voxel grid that is detailed enough to look sharp costs memory that grows with the cube of the resolution, so high quality means gigabytes, and rendering a crisp image means sampling that grid ever more finely.

NeRF, from a team at UC Berkeley, Google, and UC San Diego in 2020, throws the explicit model away. It represents the scene as a function: give it a point in space and a direction you are looking from, and it returns the color and the density there. That function is a small neural network, and the method is a way to fit the network so that images rendered from it match the photos you took. Once it is fit, you render a brand-new view by querying the function along every camera ray and compositing the results.

A handful of ideas carry the whole paper: a scene written as a queryable function, the split that makes geometry consistent across views, the classical volume-rendering integral that turns the function into a pixel, a coordinate trick that rescues the network from blurriness, and a sampling scheme that stops it wasting work on empty air. None is heavy on its own. Together they set a new state of the art for rendering photorealistic new views of real, complicated scenes from ordinary photos.

A scene is a function you query

Everything else rests on one idea. A NeRF is one continuous function, written $F_\Theta$ , with learnable weights $\Theta$ . It takes a 3D location $\mathbf{x}=(x,y,z)$ and a viewing direction $\mathbf{d}$ , and it returns an emitted color $\mathbf{c}=(r,g,b)$ and a scalar volume density $\sigma$ :

$$F_\Theta : (\mathbf{x},\, \mathbf{d}) \;\longmapsto\; (\mathbf{c},\, \sigma)$$

The input is called 5D because a position has three numbers and a direction on the sphere has two (the angles $\theta,\phi$ ). In practice the direction is fed as a 3-component unit vector, so the network actually reads six numbers, not five. The function is continuous: there is no grid, no list of stored points. Every location in space has a color and a density, defined by whatever the weights compute when you query that point.

The function is a multilayer perceptron, an MLP, which is the plainest kind of neural network: a stack of fully-connected layers with no convolutions and no attention. It is small. The entire scene fits in about 5 MB of weights, less than the input photos themselves.

One point trips up every newcomer, and stating it flatly now keeps everything later from resting on a wrong picture. A NeRF is not a model trained on a big dataset of scenes that then generalizes to a new one. It is fit from scratch to a single scene, by overfitting that one scene's photos, and it learns nothing transferable. The weights are the scene, the way a compressed file is one specific song. Point NeRF at a new object and you start a fresh optimization, one to two days of it on a high-end GPU. What you get out is not a mesh you can drop into a game engine and not a point cloud you can measure. The geometry only exists implicitly, as the pattern of densities the network will report if you query it. NeRF is a renderer for one scene, not a 3D scanner.

Density from where, color from how you look

The two outputs are not treated symmetrically, and the asymmetry is doing real work. Density $\sigma$ is predicted from the position $\mathbf{x}$ alone. Color $\mathbf{c}$ is predicted from the position and the viewing direction $\mathbf{d}$ . The network is wired so that $\sigma$ is computed and emitted before the direction is even fed in.

Why split it this way, rather than letting both depend on everything? Because density is geometry, and the geometry of a real object does not change when you walk around it. A corner of a table is in the same place from every angle. If you let density depend on viewing direction, the network could cheat: it could invent a different shape for each photo, fit them all perfectly, and reconstruct nonsense that happens to look right from exactly the training cameras and falls apart everywhere in between. Forcing $\sigma$ to depend on position only removes that freedom. The single density field has to explain all the photos at once, which is what pins it to the true geometry. Without that constraint, nothing would force the geometry to agree across views at all.

Color earns the opposite treatment because appearance genuinely does change with viewpoint. A glossy surface throws a bright specular highlight that sits in different places as you move; wet paint, varnished wood, and metal all look different head-on than at a glancing angle. The output here is radiance, the light a point sends out in a particular direction, which is why letting $\mathbf{c}$ depend on $\mathbf{d}$ lets NeRF paint a moving highlight that a position-only color could only smear into a dull average.

Orbit the viewing angle in the figure and watch the two outputs behave differently. The density bar never moves. The color swatch flares toward white as your view lines up with the light's reflection, then settles back to the surface's dull amber as you swing away. That flare is the specular highlight, and it is exactly the effect a model without view-dependence cannot reproduce.

Figure 1 · view-dependent color, view-independent geometry

One surface point under a fixed light. The teal lobe is the radiance it emits in each viewing direction, peaking toward the light's mirror reflection. As you orbit, the density stays fixed (geometry is the same from everywhere) while the color brightens into a specular highlight and fades away again. NeRF predicts $\sigma$ from position alone, $\mathbf{c}$ from position and direction.

The network that pulls this off is unglamorous. The encoded position runs through eight fully-connected layers of 256 units each, with one skip connection that re-injects the input partway through (a habit borrowed from DeepSDF, an earlier coordinate-to-shape network, which keeps the early coordinate information from washing out). That trunk emits the density $\sigma$ plus a 256-number feature vector. Only then is the viewing direction joined on, run through one more 128-unit layer, and turned into the RGB color. Direction gets just that one shallow layer because view-dependence is a small adjustment to an already-computed appearance, a moving highlight rather than a whole new object, so it does not need the depth that geometry does. The wiring enforces the constraint directly: direction cannot touch density because density has already left the building.
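The layer sizes above are enough to count the weights. The sketch below is a back-of-the-envelope tally, not the authors' code. It assumes the printed encoding sizes (60 numbers for position, 24 for direction), float32 storage, and that the "about 5 MB" figure covers both the coarse and the fine network. The helper name `linear_params` is made up for illustration.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out

ENC_POS = 3 * 2 * 10   # gamma(x): 3 coordinates, L = 10 -> 60 numbers (assumed)
ENC_DIR = 3 * 2 * 4    # gamma(d): 3 components,  L = 4  -> 24 numbers (assumed)
WIDTH = 256

# Trunk: eight 256-unit layers; one of them (the skip layer) also sees gamma(x).
trunk = linear_params(ENC_POS, WIDTH)              # first layer
trunk += 6 * linear_params(WIDTH, WIDTH)           # six ordinary hidden layers
trunk += linear_params(WIDTH + ENC_POS, WIDTH)     # the layer fed by the skip

heads = linear_params(WIDTH, 1)                    # density sigma
heads += linear_params(WIDTH, WIDTH)               # 256-number feature vector
heads += linear_params(WIDTH + ENC_DIR, 128)       # direction joins here
heads += linear_params(128, 3)                     # RGB

params_per_net = trunk + heads
megabytes_two_nets = 2 * params_per_net * 4 / 1e6  # coarse + fine, float32 (assumed)

print(params_per_net, round(megabytes_two_nets, 2))
```

Under those assumptions each network has a little under 0.6 million parameters, and the pair lands just below 5 MB, consistent with the size quoted earlier.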

From a field to a pixel

A function that returns color and density at points in space is not yet a picture. To make a pixel you have to decide what a camera ray traveling through that field actually sees. NeRF answers with physics that computer graphics has used since the 1980s, made differentiable, so the same render that turns densities and colors into a pixel can be run backward to ask which densities and colors would have made that pixel match the photo.

Each pixel defines a ray leaving the camera center $\mathbf{o}$ in a direction $\mathbf{d}$ set by the pixel's position and the camera's calibration:

$$\mathbf{r}(t) = \mathbf{o} + t\,\mathbf{d}$$

March along this ray, sampling points, query the network at each, and combine what comes back into one color. Do that for every pixel and you have rendered the image. In the figure below, the camera on the left casts one ray per pixel into a little scene holding a teal blob and an amber blob. Drag the slider to pick a pixel: its ray lights up with sample points, each dot tinted by the color the network returns there and sized by the density, and compositing them paints the pixel on the image plane. Sweep the ray onto a blob and the pixel takes the blob's color; sweep it into the gap and the ray sails through to the dark background. The rendered column on the plane traces out the object's silhouette, one ray at a time.

Figure 2 · marching a ray through the field

A pinhole camera casts a ray per pixel into a scene. The selected ray shows its sample points, each colored and sized by the network's output there; compositing them gives the pixel on the image plane. The whole rendered column matches the object's shape. Every pixel of every NeRF image is one of these marches.
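Turning a pixel into a ray can be sketched in a few lines. This is a minimal pinhole model under assumed conventions, not the paper's exact code: the camera looks down its own negative z axis, x points right, y points up, and the pose is a 3x3 rotation plus a camera center. The function names `pixel_ray` and `point_on_ray` are hypothetical.

```python
import math

def pixel_ray(col, row, width, height, focal, cam_origin, cam_to_world):
    """Return (o, d) for the ray through the centre of pixel (col, row).

    Assumed convention: camera looks down its own -z axis, +x right, +y up,
    and cam_to_world is a 3x3 rotation given as three rows.
    """
    # Direction in camera coordinates, through the pixel centre.
    x = (col + 0.5 - width / 2) / focal
    y = -(row + 0.5 - height / 2) / focal
    d_cam = (x, y, -1.0)
    # Rotate into world coordinates and normalise to unit length.
    d = [sum(cam_to_world[i][k] * d_cam[k] for k in range(3)) for i in range(3)]
    norm = math.sqrt(sum(v * v for v in d))
    return tuple(cam_origin), tuple(v / norm for v in d)

def point_on_ray(o, d, t):
    """r(t) = o + t * d."""
    return tuple(oi + t * di for oi, di in zip(o, d))

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
o, d = pixel_ray(col=1, row=1, width=4, height=4, focal=2.0,
                 cam_origin=(0.0, 0.0, 4.0), cam_to_world=identity)
sample = point_on_ray(o, d, t=3.0)   # one point three units along the ray
```

Every pixel gets its own `d` but shares the same `o`, which is why the rays fan out from a single point.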

The exact rule for combining the samples is the volume-rendering integral. The expected color of a ray between a near bound $t_n$ and a far bound $t_f$ is:

$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,dt, \qquad T(t) = \exp\!\Big(\!-\!\int_{t_n}^{t}\sigma(\mathbf{r}(s))\,ds\Big) \tag{1}$$

At each point along the ray the medium does two things. It adds light, the emission term $\sigma \cdot \mathbf{c}$ , a dense point glowing with its own color. And it blocks light coming from behind, the absorption tracked by $T(t)$ , the transmittance: the fraction of light that survives the trip from the near bound out to $t$ without being absorbed. The final color is every point's emission, each one dimmed by how clear the path in front of it was. Pile dense material near the front of the ray and the transmittance behind it collapses, so everything farther along is hidden and contributes nothing. That is occlusion, and it falls straight out of the integral.

One word deserves care, because the loose usage is everywhere and it will bite you later. $\sigma$ is the volume density, an extinction coefficient with units of one-over-length. It is a rate: how much light a tiny step absorbs per unit distance, and it ranges from zero up to as large as you like. It is not the opacity. A point is not 90% opaque; a short segment of the ray is, and how opaque depends on both the density and how long the segment is. Hold onto that distinction. Discretizing the integral is exactly where density turns into opacity, and getting that relationship backwards is the single most common NeRF error.

One caveat about the physics. This is an absorption-and-emission model only. Real light scatters: it bounces off one surface and lights another, which is how soft shadows and color bleeding happen. NeRF has none of that. Its points emit a fixed, baked-in radiance and absorb; they never relay light to each other. The view-dependent color is a learned stand-in for the appearance a real scene would have under its capture-time lighting, not a simulation of how light actually travels. This is also why you cannot relight a NeRF: the lighting is welded into the colors, with no separate notion of a light source to move.

Transmittance and compositing

The integral in (1) is exact but uncomputable: you cannot evaluate a continuous function at infinitely many points. So NeRF estimates it the way every renderer does, by sampling. Partition the ray into $N$ equal bins and draw one sample at random inside each:

$$t_i \sim \mathcal{U}\!\left[\, t_n + \tfrac{i-1}{N}(t_f - t_n),\;\; t_n + \tfrac{i}{N}(t_f - t_n) \,\right] \tag{2}$$

The randomness matters. If you sampled the same fixed depths every iteration, the network would only ever be asked about that handful of planes and would learn a scene made of slabs. Re-drawing the samples every pass means that over training the network is queried at a continuous spread of depths, so it learns a genuinely continuous field even though any single render only touches $N$ points.
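Equation (2) is short enough to write out directly. The sketch below is one plain-Python reading of it. The function name `stratified_samples` and the seeded generator are illustrative choices, not anything the paper specifies.

```python
import random

def stratified_samples(t_near, t_far, n_bins, rng=random):
    """One uniform draw inside each of n_bins equal-width bins (Eq 2)."""
    width = (t_far - t_near) / n_bins
    # Bin i (counting from 0) covers [t_near + i*width, t_near + (i+1)*width).
    return [t_near + (i + rng.random()) * width for i in range(n_bins)]

rng = random.Random(0)                      # seeded only to make the demo repeatable
ts = stratified_samples(2.0, 6.0, n_bins=64, rng=rng)
ts_next = stratified_samples(2.0, 6.0, n_bins=64, rng=rng)  # a fresh draw per pass
```

The samples come out sorted for free, because each one is confined to its own bin, and two successive calls land on different depths inside the same bins.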

With samples in hand, the integral becomes a sum. This is where density becomes opacity:

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i\, \alpha_i\, \mathbf{c}_i, \qquad \alpha_i = 1 - e^{-\sigma_i \delta_i}, \qquad T_i = \prod_{j=1}^{i-1} (1 - \alpha_j) \tag{3}$$

Three quantities build that sum. $\delta_i = t_{i+1} - t_i$ is the gap to the next sample. The opacity of that segment is $\alpha_i = 1 - e^{-\sigma_i \delta_i}$ , which is the density-times-length passed through $1 - e^{-x}$ . That form is not arbitrary. Over a span of constant density $\sigma$ and length $\delta$ , transmittance decays from 1 to $e^{-\sigma\delta}$ , so the fraction the segment absorbs is one minus that, the same exponential decay that governs light through fog or coffee. A naive guess would be $\alpha_i \approx \sigma_i \delta_i$ , and that is only the first term of the exponential, fine when the segment is nearly transparent and wrong when it is not. The accumulated transmittance $T_i = \prod_{j<i}(1-\alpha_j)$ is the light that survives every segment before sample $i$ , the product of all the prior see-through fractions. Note the bound carefully: it stops at $i-1$ . Sample $i$ contributes its own glow $\alpha_i \mathbf{c}_i$ dimmed by the path to reach it, not including its own blocking. Including its own blocking would double-count it, and that off-by-one is a classic bug.
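A quick numerical check shows why the exponential form is the consistent one. The sketch below uses arbitrary illustrative values for density and length. It verifies that cutting a segment in half leaves the light getting through unchanged, then prints the exact opacity next to the linear guess.

```python
import math

def alpha(sigma, delta):
    """Opacity of a segment of length delta with constant density sigma."""
    return 1.0 - math.exp(-sigma * delta)

# Splitting a segment in two must not change how much light gets through.
whole = alpha(20.0, 0.06)
half = alpha(20.0, 0.03)
through_two_halves = (1.0 - half) * (1.0 - half)

# The linear guess sigma * delta tracks alpha only while the product is small.
for sd in (0.01, 0.1, 1.0, 5.0):
    print(f"sigma*delta={sd:<5} exact={alpha(sd, 1.0):.4f} linear={sd:.4f}")
```

The linear guess has no such consistency: it depends on how finely the ray is chopped, and for large products it exceeds one, which no opacity can.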

Stacked up, $\hat{C} = \sum_i T_i \alpha_i \mathbf{c}_i$ is ordinary front-to-back alpha compositing, the same "over" operator that stacks transparent layers in any image editor. The render weight on sample $i$ is $w_i = T_i \alpha_i$ , how visible that point is from the camera. The figure makes the mechanism draggable. A teal surface sits in front of an amber one. The teal line is the transmittance falling from 1 as the ray spends its light; the violet area is the weight, where along the ray the pixel's color is actually coming from. Make the front surface opaque and the transmittance crashes at the front, so all the weight piles onto the teal surface and the amber one behind it earns nothing: it is occluded, and the pixel goes teal. Thin the front surface out and the ray sees through, the weight splits across both surfaces, and the pixel blends toward the amber behind.

Figure 3 · transmittance gates what the pixel sees

One ray through a front and a back surface. The teal curve is transmittance $T(t)$, falling from 1; the violet area is the render weight $w = T\alpha$. Drag the front density: opaque, and the weight piles on the front while the back is occluded; see-through, and the weight splits and the pixel blends toward the back. $\Sigma w$ is the accumulated opacity, the fraction of the pixel that is object rather than background.

A subtlety the figure also shows: the weights do not always add up to one. $\sum_i w_i = 1 - \prod_i(1-\alpha_i)$ , which equals one only if the ray is eventually fully blocked. A ray that passes through thin material or empty space keeps some transmittance to the end, and that leftover is the background showing through. So treating the weights as a probability distribution over "where the ray stops" is exact only when the ray is guaranteed to hit something. It is a near-perfect intuition with one asterisk, and the asterisk matters once these same weights get turned into a sampling distribution, which has to be normalized first.
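The leftover transmittance is easy to see in a toy ray. The sketch below composites four hand-picked samples in plain Python and returns the weights alongside the pixel. The densities, colors, and the choice to blend the leftover with a background color are illustrative assumptions, and `composite` is a hypothetical helper.

```python
import math

def composite(sigmas, deltas, colors, background=(0.0, 0.0, 0.0)):
    """Front-to-back compositing (Eq 3). Returns (pixel, weights, leftover)."""
    transmittance = 1.0          # T_1 = 1: nothing in front of the first sample
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, delta, color in zip(sigmas, deltas, colors):
        a = 1.0 - math.exp(-sigma * delta)
        w = transmittance * a    # uses T_i, which excludes sample i itself
        weights.append(w)
        for k in range(3):
            pixel[k] += w * color[k]
        transmittance *= 1.0 - a # becomes T_{i+1}
    # Whatever transmittance is left at the far end shows the background.
    pixel = [p + transmittance * b for p, b in zip(pixel, background)]
    return pixel, weights, transmittance

teal, amber = (0.0, 0.5, 0.5), (1.0, 0.75, 0.0)
pixel, w, leftover = composite(
    sigmas=[0.0, 8.0, 0.0, 30.0], deltas=[0.1] * 4,
    colors=[teal, teal, amber, amber])
```

Here `sum(w) + leftover` equals one, so the weights alone fall short of one by exactly the light that escaped out the back of the ray.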

The entire render is a handful of array operations, every one differentiable, so a gradient can flow from the pixel error all the way back to the density and color the network produced:

# render one camera ray  r(t) = o + t·d  ->  pixel color  (NumPy-style pseudocode)
ts    = stratified(t_near, t_far, N=64)      # Eq 2: one jittered sample per bin
pts   = o + ts[:, None] * d                  # (N, 3) points along the ray
c, sg = F(gamma_x(pts), gamma_d(d))          # MLP query: color (N, 3), density (N,)
delta = diff(ts, append=1e10)                # gap to the next sample; last one is open-ended
alpha = 1 - exp(-sg * delta)                 # per-segment opacity (Eq 3)
T     = cumprod(1 - alpha, exclusive=True)   # transmittance reaching each sample
C     = sum((T * alpha)[:, None] * c, 0)     # composite front to back -> RGB

Why a plain MLP blurs

Feed raw coordinates straight into the network and the renderings come out soft and washed-out, missing every sharp edge and fine texture. The cause is a known bias. Plain ReLU networks, trained by gradient descent, fit low-frequency functions first and high-frequency ones reluctantly or not at all (the "spectral bias" that Rahaman and colleagues documented in 2019). A big enough MLP can represent almost any function, a sharp scene included, so the limit is not capacity. It is optimization: a crisp scene needs exactly the high frequencies that gradient descent reaches last, if it reaches them at all.

So NeRF does not feed the network raw coordinates. It feeds a bouquet of sines and cosines at doubling frequencies. NeRF writes $F_\Theta = F'_\Theta \circ \gamma$ , a fixed encoding $\gamma$ followed by the learned MLP $F'_\Theta$ . The encoding of one coordinate $p$ is:

$$\gamma(p) = \big(\, \sin(2^0\pi p),\, \cos(2^0\pi p),\, \dots,\, \sin(2^{L-1}\pi p),\, \cos(2^{L-1}\pi p) \,\big) \tag{4}$$

Each coordinate becomes a set of $2L$ numbers, the same coordinate wrapped at frequencies that double from $2^0\pi$ up to $2^{L-1}\pi$ . The paper uses $L=10$ for position and $L=4$ for direction. Two nearby points that the raw coordinate could barely tell apart now differ sharply in their high-frequency channels, so the network can hang different colors and densities on them and paint a crisp edge. The theory for why this works (it reshapes the network's effective kernel so high frequencies become learnable) came in a follow-up by the same group on Fourier features; the NeRF paper itself offers the encoding as a trick that works and points at spectral bias for intuition.
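Equation (4) as printed can be sketched directly. The helper names `gamma` and `encode` and the two test points are illustrative. The example encodes two positions that differ by 0.001 in one raw coordinate and measures how far apart they end up in the encoded channels.

```python
import math

def gamma(p, num_bands):
    """Eq 4 for one scalar coordinate: 2 * num_bands numbers."""
    out = []
    for k in range(num_bands):
        freq = (2.0 ** k) * math.pi
        out += [math.sin(freq * p), math.cos(freq * p)]
    return out

def encode(vec, num_bands):
    """Apply gamma to each component and concatenate."""
    return [v for p in vec for v in gamma(p, num_bands)]

a = encode((0.500, 0.0, 0.0), num_bands=10)
b = encode((0.501, 0.0, 0.0), num_bands=10)
# Largest per-channel gap between two points 0.001 apart in raw coordinates.
max_gap = max(abs(x - y) for x, y in zip(a, b))
```

A raw difference of 0.001 becomes a difference of order one in the highest band, which is the separation the network needs to put an edge between the two points.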

The figure shows the payoff directly. The amber curve is a target scene signal with detail at several frequencies; the teal curve is the best the network can do with $L$ frequency bands available. At $L=0$ (raw coordinate) it captures nothing above the average and the fit is flat, the oversmoothed NeRF. Each band you add roughly doubles the frequency it can reach, and the teal curve snaps onto more of the detail. Push past the signal's finest frequency and nothing more happens, which is exactly why the paper finds $L=10$ and $L=15$ tie: once the bands out-resolve the photos, extra bands have nothing left to fit.

Figure 4 · the encoding buys back high frequencies

A high-frequency target and the MLP's fit. With no frequency bands the fit is flat (a plain network can only manage the low frequencies). Each band of the positional encoding doubles the reachable frequency, so the fit sharpens, then plateaus once the bands out-resolve the signal. Drag $L$ and watch the error fall and stop falling.

The encoding is fixed: no extra weights, no gradient, just sines and cosines of the coordinates. That alone separates a soft blob from a sharp edge, which is why the ablation later punishes its removal almost as hard as dropping view-dependence.

A note for anyone who opens the official code expecting equation (4) and finds something different. The released implementation also prepends the raw coordinate to the encoding, so position becomes $3 + 6L = 63$ numbers, not $60$ , and direction becomes $27$ , not $24$ . It also drops the $\pi$ factor, using frequencies $2^k$ rather than $2^k\pi$ . Neither changes the idea, and since the inputs are normalized to $[-1,1]$ the missing $\pi$ only rescales which band is which. They are real differences between the printed formula and the running code, useful to know if you are matching one against the other.
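The dimension counts for the released variant can be checked with a short sketch. This follows the description in the paragraph above (raw input kept, frequencies $2^k$ with no $\pi$). The function name and the channel ordering are assumptions, and the ordering does not affect the counts.

```python
import math

def encode_released(vec, num_bands):
    """Variant described above: raw input kept, frequencies 2**k without pi."""
    out = list(vec)                          # the prepended raw coordinates
    for k in range(num_bands):
        for fn in (math.sin, math.cos):
            out += [fn((2.0 ** k) * p) for p in vec]
    return out

position_dim = len(encode_released((0.1, -0.2, 0.3), num_bands=10))
direction_dim = len(encode_released((0.0, 0.0, 1.0), num_bands=4))
print(position_dim, direction_dim)
```

If a layer's input width in a checkpoint looks three too large compared with equation (4), this is the likely reason.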

Don't sample empty space

Marching 64 evenly spaced samples down every ray is wasteful, and the integral itself tells you why. Most of a ray is empty space or sits behind an opaque surface, where the render weight $w_i = T_i\alpha_i$ is essentially zero. Those samples cost a full network query and contribute nothing to the pixel. The samples that matter cluster in the thin shell where the ray meets a surface, and you do not know where that is until you have looked.

So look cheaply first, then look carefully where it counts. NeRF trains two networks together, a "coarse" one and a "fine" one. The coarse network renders the ray with 64 stratified samples and produces a set of weights. Normalize those weights, $\hat{w}_i = w_i / \sum_j w_j$ , and they become a probability distribution along the ray, peaked wherever the coarse pass found something. (This is the normalization the earlier asterisk demanded: the raw weights sum to less than one, so you have to divide by their total before they are a valid distribution.) Draw 128 new samples from that distribution by inverse-transform sampling: build the running total of the weights along the ray (their cumulative sum), throw uniform numbers into the interval from 0 to 1, and read off where each one crosses that running total, which lands more samples where the weights climb fastest. Then run the fine network on all 192 samples together, the original 64 plus the new 128, and render the final color from those.
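The resampling step can be sketched with the standard library alone. This is a simplified version, assuming the bin edges are handed in directly and the PDF is constant inside each bin. The name `sample_pdf`, the eight-bin toy ray, and the weights are all illustrative, and the official implementation differs in details such as how it derives its bins.

```python
import bisect
import random

def sample_pdf(bin_edges, weights, n_samples, rng=random):
    """Draw n_samples depths from a piecewise-constant PDF over the bins."""
    total = sum(weights)
    if total <= 0.0:                        # no signal: fall back to uniform
        weights, total = [1.0] * len(weights), float(len(weights))
    cdf = [0.0]
    for w in weights:                       # running total of normalized weights
        cdf.append(cdf[-1] + w / total)
    samples = []
    for _ in range(n_samples):
        u = rng.random() * cdf[-1]          # guard against cdf[-1] != 1 exactly
        i = min(bisect.bisect_right(cdf, u) - 1, len(weights) - 1)
        span = cdf[i + 1] - cdf[i]
        frac = (u - cdf[i]) / span if span > 0.0 else 0.5
        samples.append(bin_edges[i] + frac * (bin_edges[i + 1] - bin_edges[i]))
    return sorted(samples)

edges = [2.0 + 0.5 * i for i in range(9)]           # 8 bins on [2, 6]
coarse_w = [0.0, 0.0, 0.01, 0.6, 0.3, 0.0, 0.0, 0.0]
fine_ts = sample_pdf(edges, coarse_w, 128, rng=random.Random(0))
near_surface = sum(3.5 <= t <= 4.5 for t in fine_ts)
```

Almost all of the 128 draws fall in the two bins that carry the weight, and none are spent on the bins the coarse pass found empty.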

Drag the surface in the figure. The coarse samples are spread evenly and light up only near the surface; the weight distribution they produce is the violet curve; the fine samples drawn from it crowd the surface and ignore the empty space on either side. Move the surface and all three track it, because the fine samples are placed by what the coarse pass just found, not by any fixed schedule.

Figure 5 · coarse finds the surface, fine spends its samples there

Top: 64 coarse samples, evenly spread, bright only where they carry weight. Middle: the normalized weight distribution along the ray. Bottom: 128 fine samples drawn from it, clustered at the surface. Drag the surface and the fine cluster follows the coarse weights, not a fixed grid.

Two clarifications matter here. This is not Monte-Carlo importance sampling, even though it looks like it. NeRF does not divide the integrand by the sampling density or treat samples as independent estimates of the integral. It just uses the drawn positions as better-placed evaluation points for the same deterministic compositing sum, a non-uniform grid concentrated where the action is. And the coarse network is not a throwaway. Its rendering is kept in the training loss alongside the fine one, precisely so its weights learn to point the fine sampler at the right place. Both networks improve together. Why a separate fine network at all, instead of just re-running the coarse one with more samples? Division of labor: the fine net can pour its capacity into the thin surface shell the coarse pass located, while the coarse net stays a cheap, blurry scout.

Geometry from nothing but photos

Everything so far is a differentiable function from network weights to rendered pixels. Fitting it is then the most ordinary thing in deep learning: render the pixels you have ground truth for, measure the error, and descend. The loss is the squared difference between rendered and true color, summed over a batch of rays, for both the coarse and the fine render:

$$\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \Big[\, \big\lVert \hat{C}_c(\mathbf{r}) - C(\mathbf{r}) \big\rVert_2^2 + \big\lVert \hat{C}_f(\mathbf{r}) - C(\mathbf{r}) \big\rVert_2^2 \,\Big] \tag{6}$$

This loss contains no 3D supervision, no ground-truth depth, no silhouette, no hint about geometry at all. The only signal is "this pixel came out the wrong color." And yet a coherent 3D shape emerges, because the same density field has to explain every photo at once. The one arrangement of densities that renders correctly from all the cameras simultaneously is the true surface. Color consistency across views, and nothing else, sculpts the geometry. It is like carving a hidden object by checking only how it looks from every camera and adjusting until all the views agree.

The training details are routine. Each step samples 4096 random rays from across all the input images, runs the coarse-then-fine render, and steps Adam (with a learning rate decaying from $5\times10^{-4}$ to $5\times10^{-5}$ ). A scene converges in 100,000 to 300,000 iterations, one to two days on a single V100.

# fit ONE scene: gradient descent on photometric error (Eq 6)
for step in range(200_000):                          # ~1-2 days on one V100
    rays, px = sample_rays(images, poses)            # 4096 random rays / batch
    C_coarse, w, ts = render(rays, coarse, N=64)     # stratified pass, keep its weights
    ts_fine = sample_pdf(ts, w.detach(), 128)        # resample where the mass landed
    C_fine, _, _ = render(rays, fine, ts=merge(ts, ts_fine))  # all 192 samples
    loss = mse(C_coarse, px) + mse(C_fine, px)
    adam.zero_grad(); loss.backward(); adam.step()   # only the photos supervise

Two pieces of real-world machinery make this work on actual photographs. Real images do not come with camera poses, so NeRF recovers them first with COLMAP, an off-the-shelf structure-from-motion package that estimates where each photo was taken. And real scenes, unlike a centered object, can stretch to the horizon, which breaks the near-and-far bounds the integral needs. For forward-facing captures NeRF reparameterizes depth into normalized device coordinates, the projective space a graphics pipeline uses, where distance is measured as disparity, meaning one over the depth. Inverse depth runs from a finite value near the camera down to zero at infinity, so the unbounded ray folds into a finite interval and a mountain at infinity lands on a well-defined far plane.
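The depth part of that reparameterization fits in one line. The sketch below assumes depth is the positive distance in front of the camera and a near plane at `near`. Sign conventions differ between write-ups, so treat the exact form as an assumption. The function name `ndc_depth` is made up for illustration.

```python
def ndc_depth(depth, near=1.0):
    """Map camera-space depth in [near, inf) to an NDC coordinate in [-1, 1].

    Assumed convention: depth is positive in front of the camera.
    The result is linear in 1 / depth, i.e. in disparity.
    """
    return 1.0 - 2.0 * near / depth

for depth in (1.0, 2.0, 10.0, 1e9):
    print(depth, ndc_depth(depth))
```

Because the mapped coordinate is linear in inverse depth, sampling it evenly spends most samples close to the camera and only a few on the far distance.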

To make the numbers concrete: take a synthetic object scaled into a cube two units wide, with the camera about four units from its center, so a ray runs from $t_n=2$ to $t_f=6$ . With 64 stratified samples the gap is about $\delta \approx 0.06$ . A sample landing on a solid surface might report density $\sigma = 20$ , giving opacity $\alpha = 1 - e^{-20 \cdot 0.06} \approx 0.70$ , while a sample in empty space reports $\sigma \approx 0$ and contributes nothing. Stack a few opaque samples and the transmittance falls to near zero, so every sample behind them is invisible to the camera, exactly the occlusion the figure showed.
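The same arithmetic in code, using the unrounded gap. The density of 20 is the illustrative value from the paragraph above, not a measured one.

```python
import math

t_near, t_far, n_samples = 2.0, 6.0, 64
delta = (t_far - t_near) / n_samples          # 0.0625, the "about 0.06" above
sigma_surface = 20.0                          # illustrative surface density
alpha = 1.0 - math.exp(-sigma_surface * delta)

# Count how many such samples in a row push transmittance below 1%.
transmittance, n_opaque = 1.0, 0
while transmittance >= 0.01:
    transmittance *= 1.0 - alpha
    n_opaque += 1

print(round(alpha, 3), n_opaque, transmittance)
```

With the exact gap the opacity comes out near 0.71 rather than 0.70, and four consecutive samples at that density are enough to hide everything behind them.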

What it buys, and what it can't

On the paper's hardest benchmark, eight pathtraced objects with complicated geometry and shiny materials, NeRF reaches 31.01 dB PSNR, nearly five decibels above the next-best baseline (Neural Volumes at 26.05), with Local Light Field Fusion at 24.88 and the scene-network baseline SRN at 22.26. PSNR is a log scale for pixel error, where each extra 3 dB roughly halves the mean squared error, so a five-decibel gap is closer to a threefold cut in error than a narrow edge. If you have never stared at the images the number sounds modest; it is the difference between a clean specular highlight and a blurred gray smear. On simpler synthetic objects NeRF hits 40.15 dB. It does this while storing the entire scene in roughly 5 MB of weights, about a 3000-fold compression next to the 15+ GB that Local Light Field Fusion stores per scene.
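The decibel-to-error conversion is worth doing once by hand. The sketch below uses the standard PSNR definition for pixel values in $[0, 1]$ and the two scores quoted above. The helper names are illustrative.

```python
import math

def mse_ratio(psnr_gap_db):
    """How many times smaller the MSE is when PSNR is higher by this gap."""
    return 10.0 ** (psnr_gap_db / 10.0)

def psnr(mse, peak=1.0):
    """PSNR in dB for pixel values in [0, peak]."""
    return 10.0 * math.log10(peak * peak / mse)

gap = 31.01 - 26.05            # NeRF vs Neural Volumes, figures quoted above
print(round(mse_ratio(3.0), 2), round(mse_ratio(gap), 2))
```

A 3 dB gap is a factor of about two in mean squared error, and the gap to Neural Volumes is a factor of a little over three.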

The honest scope matters as much as the headline. NeRF wins on every metric except one: on real forward-facing scenes, LLFF edges it on the LPIPS perceptual score (0.212 against NeRF's 0.250), even as NeRF leads on PSNR and multiview consistency. It is the kind of exception worth stating plainly rather than rounding up to a clean sweep.

The ablation is where the design choices prove their worth, and it lines up with the figures above. Removing the view-dependent color costs the most, dropping PSNR from 31.0 to 27.7. Removing the positional encoding costs nearly as much, down to 28.8. The hierarchical sampler matters least, 30.1 without it, though it still earns its place. Hover the bars to read each configuration's full scores.

Figure 6 · every piece pulls its weight

NeRF's ablation on the realistic synthetic scenes (PSNR, higher is better). Removing the view direction or the positional encoding costs the most; the hierarchical sampler costs less but still helps. The minimal model sits at the bottom. Hover a bar for its SSIM and LPIPS.

What you cannot do is as instructive as what you can. A NeRF is slow to render: about 256 network queries per ray (the 64 coarse plus the 192 fine), 150 to 200 million per image, roughly 30 seconds a frame. It is slow to fit: a fresh day or two per scene, with no reuse across scenes. Its lighting is baked in, so you cannot relight it or pull out material properties. Its geometry is implicit, so there is no mesh to hand a downstream tool without a lossy extra step. Every one of these became a research program: faster training, real-time rendering, relightable and editable variants, generalization across scenes. A hundred follow-ups exist because the core idea, a scene as a small queryable function rendered by classical volume integration, was the right thing to build on. The first version was slow and narrow and still reset the field.
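The query count follows from the sample counts and the image size. The resolutions below are an assumption brought in from the paper's datasets (800 × 800 synthetic renders, 1008 × 756 real captures), not something stated in this article.

```python
QUERIES_PER_RAY = 64 + (64 + 128)      # coarse pass + fine pass

# Assumed image sizes (from the paper's datasets, not stated in this article):
# 800 x 800 synthetic renders and 1008 x 756 real forward-facing captures.
RESOLUTIONS = {"synthetic": (800, 800), "real": (1008, 756)}

queries = {}
for name, (w, h) in RESOLUTIONS.items():
    queries[name] = w * h * QUERIES_PER_RAY
    print(f"{name}: {w * h:,} rays, {queries[name] / 1e6:.0f} million queries")
```

Both totals fall inside the 150 to 200 million range, so the cost is simply one ray per pixel times 256 network evaluations.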

Provenance: verified against primary literature

Max (1995): The absorption-plus-emission volume-rendering quadrature NeRF discretizes (Eq 3).

Kajiya & Von Herzen (1984): Brought volume rendering / the equation of transfer into graphics.

Porter & Duff (1984): The front-to-back "over" compositing the discrete sum reduces to.

Rahaman et al. (2019): Spectral bias: networks fit low frequencies first.

Tancik et al. (2020): Fourier-features / NTK theory behind the encoding (a follow-up, not in NeRF).

COLMAP (2016): Structure-from-motion poses and intrinsics for real captures.

Correction: The printed positional-encoding formula (Eq 4) is not what the released code runs: the code prepends the raw coordinate (so the encoded position is 3+6L = 63 numbers, not 60) and uses frequencies 2^k, not 2^k·π. We teach both, and the conventions they change.

Questions you might still have

Can I reuse a trained NeRF on a different scene?
No. A NeRF is overfit to one scene from scratch, one to two days of optimization, and learns nothing transferable. The weights are that scene, the way a compressed file is one song. A new object means a new optimization. (Later work like pixelNeRF and instant-NGP attacked exactly this.)

Can I extract a mesh or relight it?
Not cleanly. The geometry is implicit in the density field, so a mesh needs a lossy extra step (run marching cubes on the densities). And the lighting is baked into the emitted radiance with no separate light source, so a vanilla NeRF cannot be relit. View-dependent color fakes appearance under capture-time lighting; it does not simulate light transport.

If an MLP is a universal approximator, why does the positional encoding help?
It is not about capacity, it is about optimization. Plain networks have a spectral bias: gradient descent fits low frequencies first and high frequencies barely at all. The sin/cos encoding reshapes the training so high frequencies become learnable. It adds input dimensions, not representational power.

Why two networks, coarse and fine?
The coarse pass is a cheap first pass: 64 even samples to find roughly where the surface is. The fine pass then spends its 128 samples there instead of on empty space. Both are trained together, since the coarse network has to learn to point the fine sampler at the right place.

Is σ the opacity of a point?
No. σ is a density, a rate with units of one-over-length. The opacity of a ray segment of length δ is α = 1 − exp(−σδ), which depends on both the density and the segment length, and getting that relationship backwards is the most common NeRF error. Writing α ≈ σδ is only the first-order approximation, accurate when the segment is nearly transparent.

Footnotes & further reading

The paper: Mildenhall, Srinivasan, Tancik, Barron, Ramamoorthi, Ng, NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis (ECCV 2020).
The volume-rendering integral and its quadrature: Max, Optical Models for Direct Volume Rendering (1995); Kajiya & Von Herzen, Ray Tracing Volume Densities (SIGGRAPH 1984), which introduced the equation of transfer (with scattering) to graphics. NeRF uses only its no-scattering subset.
Spectral bias and Fourier features: Rahaman et al., On the Spectral Bias of Neural Networks (ICML 2019); Tancik et al., Fourier Features Let Networks Learn High Frequency Functions (NeurIPS 2020), the same group's theoretical follow-up. The positional-encoding form echoes the one in the Transformer, used there for a different purpose (ordering tokens, not lifting continuous coordinates).
Camera poses for real photos come from COLMAP: Schönberger & Frahm, Structure-from-Motion Revisited (CVPR 2016). The optimizer is Adam.
The baselines NeRF compares against: Scene Representation Networks (Sitzmann et al. 2019), Neural Volumes (Lombardi et al. 2019), and Local Light Field Fusion (Mildenhall et al. 2019). The last is the only one that generalizes across scenes; the rest, like NeRF, fit one scene at a time.
