NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
Fit a small network to a scene's photos, and render it from any new camera angle.
NeRF stores a 3D scene as a continuous function living inside one small network. To render a new view it marches a camera ray through the scene, asks the network for the color and density at points along the way, and adds them up. No mesh, no voxel grid, just the weights.
You have a few dozen photos of an object, shot from different angles. You want a video that flies the camera along a smooth path none of those photos ever took.
This is the view-synthesis problem, and it is older than deep learning. The classic recipe builds an explicit model of the object first, a triangle mesh or a grid of colored voxels, then renders the model from the new camera. Both run into the same wall. A mesh needs the right topology before you can optimize it, and real scenes rarely hand you one. A voxel grid that is detailed enough to look sharp costs memory that grows with the cube of the resolution, so high quality means gigabytes, and rendering a crisp image means sampling that grid ever more finely.
NeRF, from a Berkeley and Google team in 2020, throws the explicit model away. It represents the scene as a function: give it a point in space and a direction you are looking from, and it returns the color and the density there. That function is a small neural network, and the method is a way to fit the network so that images rendered from it match the photos you took. Once it is fit, you render a brand-new view by querying the function along every camera ray and compositing the results.
A handful of ideas carry the whole paper: a scene written as a queryable function, the split that makes geometry consistent across views, the classical volume-rendering integral that turns the function into a pixel, a coordinate trick that rescues the network from blurriness, and a sampling scheme that stops it wasting work on empty air. None is heavy on its own. Together they set a new state of the art for rendering photorealistic new views of real, complicated scenes from ordinary photos.
A scene is a function you query
Everything else rests on one idea. A NeRF is one continuous function, written $F_\Theta$, with learnable weights $\Theta$. It takes a 3D location $\mathbf{x} = (x, y, z)$ and a viewing direction $\mathbf{d}$, and it returns an emitted color $\mathbf{c} = (r, g, b)$ and a scalar volume density $\sigma$:
$$F_\Theta : (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma)$$
The input is called 5D because a position has three numbers and a direction on the sphere has two (the angles $\theta$ and $\phi$). In practice the direction is fed as a 3-component unit vector, so the network actually reads six numbers, not five. The function is continuous: there is no grid, no list of stored points. Every location in space has a color and a density, defined by whatever the weights compute when you query that point.
The function is a multilayer perceptron, an MLP, which is the plainest kind of neural network: a stack of fully-connected layers with no convolutions and no attention. It is small. The entire scene fits in about 5 MB of weights, less than the input photos themselves.
One point trips up every newcomer, and stating it flatly now keeps everything later from resting on a wrong picture. A NeRF is not a model trained on a big dataset of scenes that then generalizes to a new one. It is fit from scratch to a single scene, by overfitting that one scene's photos, and it learns nothing transferable. The weights are the scene, the way a compressed file is one specific song. Point NeRF at a new object and you start a fresh optimization, one to two days of it on a high-end GPU. What you get out is not a mesh you can drop into a game engine and not a point cloud you can measure. The geometry only exists implicitly, as the pattern of densities the network will report if you query it. NeRF is a renderer for one scene, not a 3D scanner.
Density from where, color from how you look
The two outputs are not treated symmetrically, and the asymmetry is doing real work. Density $\sigma$ is predicted from the position $\mathbf{x}$ alone. Color $\mathbf{c}$ is predicted from the position and the viewing direction $\mathbf{d}$. The network is wired so that $\sigma$ is computed and emitted before the direction is even fed in.
Why split it this way, rather than letting both depend on everything? Because density is geometry, and the geometry of a real object does not change when you walk around it. A corner of a table is in the same place from every angle. If you let density depend on viewing direction, the network could cheat: it could invent a different shape for each photo, fit them all perfectly, and reconstruct nonsense that happens to look right from exactly the training cameras and falls apart everywhere in between. Forcing $\sigma$ to depend on position only removes that freedom. The single density field has to explain all the photos at once, which is what pins it to the true geometry. Without that constraint, nothing would force the geometry to agree across views at all.
Color earns the opposite treatment because appearance genuinely does change with viewpoint. A glossy surface throws a bright specular highlight that sits in different places as you move; wet paint, varnished wood, and metal all look different head-on than at a glancing angle. The output here is radiance, the light a point sends out in a particular direction, which is why letting $\mathbf{c}$ depend on $\mathbf{d}$ lets NeRF paint a moving highlight that a position-only color could only smear into a dull average.
Orbit the viewing angle in the figure and watch the two outputs behave differently. The density bar never moves. The color swatch flares toward white as your view lines up with the light's reflection, then settles back to the surface's dull amber as you swing away. That flare is the specular highlight, and it is exactly the effect a model without view-dependence cannot reproduce.
The network that pulls this off is unglamorous. The encoded position runs through eight fully-connected layers of 256 units each, with one skip connection that re-injects the input partway through (a habit borrowed from DeepSDF, an earlier coordinate-to-shape network, which keeps the early coordinate information from washing out). That trunk emits the density plus a 256-number feature vector. Only then is the viewing direction joined on, run through one more 128-unit layer, and turned into the RGB color. Direction gets just that one shallow layer because view-dependence is a small adjustment to an already-computed appearance, a moving highlight rather than a whole new object, so it does not need the depth that geometry does. The wiring enforces the constraint directly: direction cannot touch density because density has already left the building.
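The wiring described above can be sketched as a forward pass in NumPy. This is a minimal sketch with random, untrained weights, not the released implementation: the layer index `SKIP`, the ReLU on density, the sigmoid on color, and the input widths 60 and 24 (the encoded position and direction sizes discussed later) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
POS_DIM, DIR_DIM, WIDTH, SKIP = 60, 24, 256, 4  # SKIP: assumed layer that re-reads the input

def dense(n_in, n_out):
    """Random, untrained weights and zero biases for one fully-connected layer."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)), np.zeros(n_out)

def relu(x):
    return np.maximum(x, 0.0)

# Trunk: eight 256-unit layers; layer SKIP takes the hidden state plus the input again.
trunk = [dense(POS_DIM if i == 0 else WIDTH + (POS_DIM if i == SKIP else 0), WIDTH)
         for i in range(8)]
sigma_head, feature_head = dense(WIDTH, 1), dense(WIDTH, WIDTH)
dir_layer, rgb_head = dense(WIDTH + DIR_DIM, 128), dense(128, 3)

def radiance_field(enc_x, enc_d):
    h = enc_x
    for i, (w, b) in enumerate(trunk):
        if i == SKIP:
            h = np.concatenate([h, enc_x], axis=-1)  # skip connection
        h = relu(h @ w + b)
    # Density leaves the network here, before the direction is seen.
    sigma = relu(h @ sigma_head[0] + sigma_head[1])[..., 0]
    feat = h @ feature_head[0] + feature_head[1]
    # Only now is the viewing direction joined on, for one 128-unit layer.
    h = relu(np.concatenate([feat, enc_d], axis=-1) @ dir_layer[0] + dir_layer[1])
    rgb = 1.0 / (1.0 + np.exp(-(h @ rgb_head[0] + rgb_head[1])))
    return rgb, sigma

enc_x = rng.uniform(-1, 1, (5, POS_DIM))
rgb_a, sigma_a = radiance_field(enc_x, rng.uniform(-1, 1, (5, DIR_DIM)))
rgb_b, sigma_b = radiance_field(enc_x, rng.uniform(-1, 1, (5, DIR_DIM)))
```

Querying the same five positions from two different directions gives identical densities and different colors, which is the constraint the wiring is meant to enforce.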
From a field to a pixel
A function that returns color and density at points in space is not yet a picture. To make a pixel you have to decide what a camera ray traveling through that field actually sees. NeRF answers with physics that computer graphics has used since the 1980s, made differentiable, so the same render that turns densities and colors into a pixel can be run backward to ask which densities and colors would have made that pixel match the photo.
Each pixel defines a ray leaving the camera center $\mathbf{o}$ in a direction $\mathbf{d}$ set by the pixel's position and the camera's calibration:
$$\mathbf{r}(t) = \mathbf{o} + t\,\mathbf{d}$$
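As a rough illustration of how a pixel becomes a ray, here is a sketch for an ideal pinhole camera. The function names, the axis convention (x right, y up, camera looking down negative z), and the half-pixel offset are assumptions for this example, not details taken from the paper.

```python
import numpy as np

def pixel_ray(col, row, height, width, focal, cam_to_world):
    """Origin o and unit direction d of the ray through one pixel (pinhole model)."""
    # Direction in camera space under the assumed convention: x right, y up, looking down -z.
    d_cam = np.array([(col + 0.5 - width / 2) / focal,
                      -(row + 0.5 - height / 2) / focal,
                      -1.0])
    d_world = cam_to_world[:3, :3] @ d_cam   # rotate into world space
    o = cam_to_world[:3, 3]                  # camera center
    return o, d_world / np.linalg.norm(d_world)

def point_on_ray(o, d, t):
    """r(t) = o + t * d"""
    return o + t * d

# Center pixel of a 5x5 image, camera at the origin with an identity pose.
o, d = pixel_ray(col=2, row=2, height=5, width=5, focal=4.0, cam_to_world=np.eye(4))
p = point_on_ray(o, d, t=3.0)
```

With an identity pose the center pixel looks straight down the negative z axis, so `p` sits three units in front of the camera.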
March along this ray, sampling points, query the network at each, and combine what comes back into one color. Do that for every pixel and you have rendered the image. In the figure below, the camera on the left casts one ray per pixel into a little scene holding a teal blob and an amber blob. Drag the slider to pick a pixel: its ray lights up with sample points, each dot tinted by the color the network returns there and sized by the density, and compositing them paints the pixel on the image plane. Sweep the ray onto a blob and the pixel takes the blob's color; sweep it into the gap and the ray sails through to the dark background. The rendered column on the plane traces out the object's silhouette, one ray at a time.
The exact rule for combining the samples is the volume-rendering integral. The expected color of a ray between a near bound $t_n$ and a far bound $t_f$ is:
$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt, \qquad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right) \tag{1}$$
At each point along the ray the medium does two things. It adds light, the emission term $\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})$, a dense point glowing with its own color. And it blocks light coming from behind, the absorption tracked by $T(t)$, the transmittance: the fraction of light that survives the trip from the near bound out to $t$ without being absorbed. The final color is every point's emission, each one dimmed by how clear the path in front of it was. Pile dense material near the front of the ray and its transmittance collapses, so everything behind it is in shadow and contributes nothing. That is occlusion, and it falls straight out of the integral.
One word deserves care, because the loose usage is everywhere and it will bite you later. $\sigma$ is the volume density, an extinction coefficient with units of one-over-length. It is a rate: how much light a tiny step absorbs per unit distance, and it ranges from zero up to as large as you like. It is not the opacity. A point is not 90% opaque; a short segment of the ray is, and how opaque depends on both the density and how long the segment is. Hold onto that distinction. Discretizing the integral is exactly where density turns into opacity, and getting that relationship backwards is the single most common NeRF error.
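A small numeric sketch of that distinction, assuming a segment of constant density so that transmittance across it is the exponential of minus density times length. The density value is hypothetical.

```python
import math

def segment_opacity(sigma, length):
    """Fraction of light absorbed crossing `length` units of constant density `sigma`."""
    return 1.0 - math.exp(-sigma * length)

sigma = 5.0                               # hypothetical density, units of 1/length
thin = segment_opacity(sigma, 0.01)       # same density, very short segment
thick = segment_opacity(sigma, 1.0)       # same density, long segment

# Two half-length segments in a row block exactly as much as one full segment.
half = segment_opacity(sigma, 0.5)
two_halves = 1.0 - (1.0 - half) ** 2
```

The same density gives an almost transparent segment and an almost opaque one, depending only on length, and splitting a segment in two leaves the total blocking unchanged.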
One caveat about the physics. This is an absorption-and-emission model only. Real light scatters: it bounces off one surface and lights another, which is how soft shadows and color bleeding happen. NeRF has none of that. Its points emit a fixed, baked-in radiance and absorb; they never relay light to each other. The view-dependent color is a learned stand-in for the appearance a real scene would have under its capture-time lighting, not a simulation of how light actually travels. This is also why you cannot relight a NeRF: the lighting is welded into the colors, with no separate notion of a light source to move.
Transmittance and compositing
The integral in (1) is exact but uncomputable: you cannot evaluate a continuous function at infinitely many points. So NeRF estimates it the way every renderer does, by sampling. Partition the ray into $N$ equal bins and draw one sample at random inside each:
$$t_i \sim \mathcal{U}\!\left[\,t_n + \frac{i-1}{N}(t_f - t_n),\; t_n + \frac{i}{N}(t_f - t_n)\right] \tag{2}$$
The randomness matters. If you sampled the same fixed depths every iteration, the network would only ever be asked about that handful of planes and would learn a scene made of slabs. Re-drawing the samples every pass means that over training the network is queried at a continuous spread of depths, so it learns a genuinely continuous field even though any single render only touches $N$ points.
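Stratified sampling as described above takes only a few lines. This is a sketch; the function name, the ray bounds, and the seed are illustrative choices.

```python
import numpy as np

def stratified(t_near, t_far, n_samples, rng):
    """One uniformly random depth inside each of n_samples equal bins."""
    edges = np.linspace(t_near, t_far, n_samples + 1)
    lower, upper = edges[:-1], edges[1:]
    return lower + rng.uniform(size=n_samples) * (upper - lower)

rng = np.random.default_rng(0)
ts_a = stratified(2.0, 6.0, 64, rng)   # one training iteration
ts_b = stratified(2.0, 6.0, 64, rng)   # the next iteration sees different depths
```

Every draw covers the whole ray, one sample per bin, but no two draws land on the same depths.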
With samples in hand, the integral becomes a sum. This is where density becomes opacity:
$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i, \qquad T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right) \tag{3}$$
Three quantities build that sum. $\delta_i = t_{i+1} - t_i$ is the gap to the next sample. The opacity of that segment is $\alpha_i = 1 - e^{-\sigma_i \delta_i}$, which is the density-times-length passed through $1 - e^{-x}$. That form is not arbitrary. Over a span of constant density $\sigma$ and length $\delta$, transmittance decays from 1 to $e^{-\sigma\delta}$, so the fraction the segment absorbs is one minus that, the same exponential decay that governs light through fog or coffee. A naive guess would be $\alpha \approx \sigma\delta$, and that is only the first term of the exponential, fine when the segment is nearly transparent and wrong when it is not. The accumulated transmittance $T_i = \prod_{j=1}^{i-1} (1 - \alpha_j)$ is the light that survives every segment before sample $i$, the product of all the prior see-through fractions. Note the bound carefully: it stops at $i - 1$. Sample $i$ contributes its own glow dimmed by the path to reach it, not including its own blocking. Including itself double-counts, and the off-by-one is a classic bug.
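To see the bound matter, here is a sketch that composites a four-sample ray twice, once with the transmittance stopping before each sample and once with the off-by-one version. The densities, gaps, and colors are hypothetical.

```python
import numpy as np

def composite(sigma, delta, color, exclusive=True):
    """Front-to-back compositing of samples along one ray."""
    alpha = 1.0 - np.exp(-sigma * delta)        # per-segment opacity
    survive = np.cumprod(1.0 - alpha)           # light left AFTER each segment
    if exclusive:
        T = np.concatenate([[1.0], survive[:-1]])   # light arriving AT each sample
    else:
        T = survive                                 # off-by-one: counts own blocking
    weights = T * alpha
    return weights @ color, weights

teal, amber, empty = [0.0, 0.5, 0.5], [1.0, 0.75, 0.0], [0.0, 0.0, 0.0]
sigma = np.array([0.0, 50.0, 0.0, 50.0])   # empty, teal surface, empty, amber surface
delta = np.full(4, 0.1)
color = np.array([empty, teal, empty, amber])

pixel, w = composite(sigma, delta, color)
bad_pixel, bad_w = composite(sigma, delta, color, exclusive=False)
naive_alpha = sigma * delta   # first-order guess: 5.0 on the surfaces, not a valid opacity
```

With the correct bound almost all the weight lands on the front teal surface and the amber one is occluded. With the inclusive bound the opaque surface dims itself and nearly vanishes from the pixel.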
Stacked up, $\hat{C} = \sum_i T_i \alpha_i \mathbf{c}_i$ is ordinary front-to-back alpha compositing, the same "over" operator that stacks transparent layers in any image editor. The render weight on sample $i$ is $w_i = T_i \alpha_i$, how visible that point is from the camera. The figure makes the mechanism draggable. A teal surface sits in front of an amber one. The teal line is the transmittance falling from 1 as the ray spends its light; the violet area is the weight, where along the ray the pixel's color is actually coming from. Make the front surface opaque and the transmittance crashes at the front, so all the weight piles onto the teal surface and the amber one behind it earns nothing: it is occluded, and the pixel goes teal. Thin the front surface out and the ray sees through, the weight splits across both surfaces, and the pixel blends toward the amber behind.
A subtlety the figure also shows: the weights do not always add up to one. $\sum_i w_i = 1 - T_{N+1}$, where $T_{N+1}$ is the transmittance left after the last sample, which equals one only if the ray is eventually fully blocked. A ray that passes through thin material or empty space keeps some transmittance to the end, and that leftover is the background showing through. So treating the weights as a probability distribution over "where the ray stops" is exact only when the ray is guaranteed to hit something. It is a near-perfect intuition with one asterisk, and the asterisk matters once these same weights get turned into a sampling distribution, which has to be normalized first.
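The claim about the weights is easy to check numerically. The sketch below compares a ray through thin haze with a ray that hits a wall; both density profiles are made up for the example.

```python
import numpy as np

def render_weights(sigma, delta):
    """Weights w_i = T_i * alpha_i and the transmittance left after the last sample."""
    alpha = 1.0 - np.exp(-np.asarray(sigma) * np.asarray(delta))
    survive = np.cumprod(1.0 - alpha)               # light left after each segment
    T = np.concatenate([[1.0], survive[:-1]])       # light arriving at each sample
    return T * alpha, survive[-1]

delta = np.full(8, 0.25)
haze_w, haze_left = render_weights(np.full(8, 0.2), delta)    # thin material
wall_w, wall_left = render_weights(np.full(8, 40.0), delta)   # fully blocking

# Dividing by the total turns the haze weights into a valid distribution.
haze_pdf = haze_w / haze_w.sum()
```

For the haze the weights sum to well under one and the remainder is exactly the leftover transmittance; for the wall they sum to one.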
The entire render is a handful of array operations, every one differentiable, so a gradient can flow from the pixel error all the way back to the density and color the network produced:
# render one camera ray r(t) = o + t·d -> pixel color
ts = stratified(t_near, t_far, N=64) # Eq 2: one jittered sample/bin
pts = o + ts[:, None] * d # 3D points along the ray
c, sg = F(gamma_x(pts), gamma_d(d)) # MLP query: color + density
delta = diff(ts, append=t_far) # gap to the next sample; the last gap runs to t_far
alpha = 1 - exp(-sg * delta) # per-segment opacity (Eq 3)
T = cumprod(1 - alpha, exclusive) # transmittance to reach each
C = sum(T * alpha * c) # composite front to back
Why a plain MLP blurs
Feed raw coordinates straight into the network and the renderings come out soft and washed-out, missing every sharp edge and fine texture. The cause is a known bias. Plain ReLU networks, trained by gradient descent, fit low-frequency functions first and high-frequency ones reluctantly or not at all (the "spectral bias" that Rahaman and colleagues documented in 2019). A big enough MLP can represent almost any function, a sharp scene included, so the limit is not capacity. Gradient descent just will not find it: it fits the low frequencies first, and a crisp scene needs exactly the high frequencies that come last, if at all.
So NeRF does not feed the network raw coordinates. It feeds a bouquet of sines and cosines at doubling frequencies. NeRF writes $F_\Theta = F'_\Theta \circ \gamma$, a fixed encoding $\gamma$ followed by the learned MLP $F'_\Theta$. The encoding of one coordinate $p$ is:
$$\gamma(p) = \left(\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\right) \tag{4}$$
Each coordinate becomes a set of $2L$ numbers, the same coordinate wrapped at frequencies that double from $2^0 \pi$ up to $2^{L-1} \pi$. The paper uses $L = 10$ for position and $L = 4$ for direction. Two nearby points that the raw coordinate could barely tell apart now differ sharply in their high-frequency channels, so the network can hang different colors and densities on them and paint a crisp edge. The theory for why this works (it reshapes the network's effective kernel so high frequencies become learnable) came in a follow-up by the same group on Fourier features; the NeRF paper itself offers the encoding as a trick that works and points at spectral bias for intuition.
The figure shows the payoff directly. The amber curve is a target scene signal with detail at several frequencies; the teal curve is the best the network can do with $L$ frequency bands available. At $L = 0$ (raw coordinate) it captures nothing above the average and the fit is flat, the oversmoothed NeRF. Each band you add roughly doubles the frequency it can reach, and the teal curve snaps onto more of the detail. Push past the signal's finest frequency and nothing more happens, which is exactly why the paper finds $L = 10$ and $L = 15$ tie: once the bands out-resolve the photos, extra bands have nothing left to fit.
The encoding is fixed: no extra weights, no gradient, just sines and cosines of the coordinates. That alone separates a soft blob from a sharp edge, which is why the ablation later punishes its removal almost as hard as dropping view-dependence.
A note for anyone who opens the official code expecting equation (4) and finds something different. The released implementation also prepends the raw coordinate to the encoding, so position becomes 63 numbers, not 60, and direction becomes 27, not 24. It also drops the $\pi$ factor, using frequencies $2^k$ rather than $2^k \pi$. Neither changes the idea, and since the inputs are normalized to $[-1, 1]$ the missing $\pi$ only rescales which band is which. They are real differences between the printed formula and the running code, useful to know if you are matching one against the other.
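Both variants of the encoding fit in one short function. This is an illustrative sketch, not the released code: the function name and the `include_input` and `use_pi` flags are made up here to switch between the printed formula and the behavior described for the implementation.

```python
import numpy as np

def positional_encoding(p, num_bands, include_input=False, use_pi=True):
    """Encode each coordinate as sin/cos pairs at frequencies doubling over num_bands."""
    p = np.asarray(p, dtype=float)
    freqs = 2.0 ** np.arange(num_bands) * (np.pi if use_pi else 1.0)
    angles = p[..., None] * freqs                                  # (..., dims, bands)
    bands = np.stack([np.sin(angles), np.cos(angles)], axis=-1)    # (..., dims, bands, 2)
    enc = bands.reshape(*p.shape[:-1], -1)                         # flatten per point
    return np.concatenate([p, enc], axis=-1) if include_input else enc

x = np.array([0.1, -0.4, 0.7])
printed = positional_encoding(x, 10)                                       # as in the formula
released = positional_encoding(x, 10, include_input=True, use_pi=False)    # as described for the code

# Two coordinates 0.001 apart are almost identical raw, very different once encoded.
near_a = positional_encoding(np.array([0.5]), 10)
near_b = positional_encoding(np.array([0.501]), 10)
largest_gap = np.max(np.abs(near_a - near_b))
```

The highest band turns a raw difference of 0.001 into a phase shift of more than a radian, which is what lets the network separate the two points.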
Don't sample empty space
Marching 64 evenly spaced samples down every ray is wasteful, and the integral itself tells you why. Most of a ray is empty space or sits behind an opaque surface, where the render weight is essentially zero. Those samples cost a full network query and contribute nothing to the pixel. The samples that matter cluster in the thin shell where the ray meets a surface, and you do not know where that is until you have looked.
So look cheaply first, then look carefully where it counts. NeRF trains two networks together, a "coarse" one and a "fine" one. The coarse network renders the ray with 64 stratified samples and produces a set of weights. Normalize those weights, $\hat{w}_i = w_i / \sum_j w_j$, and they become a probability distribution along the ray, peaked wherever the coarse pass found something. (This is the normalization the earlier asterisk demanded: the raw weights sum to less than one, so you have to divide by their total before they are a valid distribution.) Draw 128 new samples from that distribution by inverse-transform sampling: build the running total of the weights along the ray (their cumulative sum), throw uniform numbers into the interval from 0 to 1, and read off where each one crosses that running total, which lands more samples where the weights climb fastest. Then run the fine network on all 192 samples together, the original 64 plus the new 128, and render the final color from those.
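The resampling step can be sketched as follows. It is a simplification: it treats each coarse weight as the mass of its own stratified bin, which is an assumption of this example and may not match how the released code defines its bins. The weight values are made up.

```python
import numpy as np

def sample_fine(bin_edges, weights, n_fine, rng):
    """Inverse-transform sampling from the piecewise-constant PDF given by the weights."""
    pdf = weights / weights.sum()                     # normalize to a distribution
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])     # running total along the ray
    u = rng.uniform(size=n_fine)                      # uniform numbers in [0, 1)
    idx = np.clip(np.searchsorted(cdf, u, side="right") - 1, 0, len(pdf) - 1)
    # Position inside the chosen bin, proportional to how far u is past the bin's start.
    frac = np.clip((u - cdf[idx]) / np.maximum(pdf[idx], 1e-12), 0.0, 1.0)
    return np.sort(bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx]))

rng = np.random.default_rng(0)
edges = np.linspace(2.0, 6.0, 65)                     # 64 coarse bins
coarse_w = np.zeros(64)
coarse_w[30:34] = [0.1, 0.6, 0.25, 0.05]              # hypothetical: surface near t = 4
t_fine = sample_fine(edges, coarse_w, 128, rng)

t_coarse = 0.5 * (edges[:-1] + edges[1:])             # stand-in for the 64 coarse depths
t_all = np.sort(np.concatenate([t_coarse, t_fine]))   # 192 depths for the fine network
```

All 128 new samples land inside the four bins that carried weight, and most of them fall in the bin with the largest weight.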
Drag the surface in the figure. The coarse samples are spread evenly and light up only near the surface; the weight distribution they produce is the violet curve; the fine samples drawn from it crowd the surface and ignore the empty space on either side. Move the surface and all three track it, because the fine samples are placed by what the coarse pass just found, not by any fixed schedule.
Two clarifications matter here. This is not Monte-Carlo importance sampling, even though it looks like it. NeRF does not divide the integrand by the sampling density or treat samples as independent estimates of the integral. It just uses the drawn positions as better-placed evaluation points for the same deterministic compositing sum, a non-uniform grid concentrated where the action is. And the coarse network is not a throwaway. Its rendering is kept in the training loss alongside the fine one, precisely so its weights learn to point the fine sampler at the right place. Both networks improve together. Why a separate fine network at all, instead of just re-running the coarse one with more samples? Division of labor: the fine net can pour its capacity into the thin surface shell the coarse pass located, while the coarse net stays a cheap, blurry scout.
Geometry from nothing but photos
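Everything so far is a differentiable function from network weights to rendered pixels. Fitting it is then the most ordinary thing in deep learning: render the pixels you have ground truth for, measure the error, and descend. The loss is the squared difference between rendered and true color, summed over a batch of rays $\mathcal{R}$, for both the coarse render $\hat{C}_c$ and the fine render $\hat{C}_f$:
$$\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \left[ \left\lVert \hat{C}_c(\mathbf{r}) - C(\mathbf{r}) \right\rVert_2^2 + \left\lVert \hat{C}_f(\mathbf{r}) - C(\mathbf{r}) \right\rVert_2^2 \right] \tag{6}$$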
This loss contains no 3D supervision, no ground-truth depth, no silhouette, no hint about geometry at all. The only signal is "this pixel came out the wrong color." And yet a coherent 3D shape emerges, because the same density field has to explain every photo at once. The one arrangement of densities that renders correctly from all the cameras simultaneously is the true surface. Color consistency across views, and nothing else, sculpts the geometry. It is like carving a hidden object by checking only how it looks from every camera and adjusting until all the views agree.
The training details are routine. Each step samples 4096 random rays from across all the input images, runs the coarse-then-fine render, and steps Adam (with a learning rate decaying from $5 \times 10^{-4}$ to $5 \times 10^{-5}$). A scene converges in 100,000 to 300,000 iterations, one to two days on a single V100.
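The loss and the learning-rate decay can be written out directly. This is a sketch: the function names are illustrative, the endpoints 5e-4 and 5e-5 are the values the paper reports, and the smooth exponential shape of the decay between them is an assumption of this example.

```python
import numpy as np

def photometric_loss(c_coarse, c_fine, c_true):
    """Summed squared color error of both renders against the photo pixels."""
    return np.sum((c_coarse - c_true) ** 2) + np.sum((c_fine - c_true) ** 2)

def learning_rate(step, total_steps, lr_start=5e-4, lr_end=5e-5):
    """Exponential interpolation from lr_start to lr_end (assumed schedule shape)."""
    return lr_start * (lr_end / lr_start) ** (step / total_steps)

c_true = np.zeros((2, 3))                  # two rays, RGB each
c_coarse = np.full((2, 3), 0.1)            # coarse render is off by 0.1 per channel
c_fine = c_true.copy()                     # fine render is perfect
loss = photometric_loss(c_coarse, c_fine, c_true)

lr_mid = learning_rate(100_000, 200_000)   # halfway through a 200k-step run
```

Because the coarse term stays in the loss, the coarse network keeps receiving gradient even when the fine render is already right.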
# fit ONE scene: gradient descent on photometric error (Eq 6)
for step in range(200_000): # ~1-2 days on one V100
    rays, px = sample_rays(images, poses) # 4096 random rays / batch
    C_coarse = render(rays, coarse, N=64) # stratified pass
    pdf = weights(C_coarse) # where the mass landed
    C_fine = render(rays, fine, 64+128) # resample, union of samples
    loss = mse(C_coarse, px) + mse(C_fine, px)
    loss.backward(); adam.step() # only the photos supervise
Two pieces of real-world machinery make this work on actual photographs. Real images do not come with camera poses, so NeRF recovers them first with COLMAP, an off-the-shelf structure-from-motion package that estimates where each photo was taken. And real scenes, unlike a centered object, can stretch to the horizon, which breaks the near-and-far bounds the integral needs. For forward-facing captures NeRF reparameterizes depth into normalized device coordinates, the projective space a graphics pipeline uses, where distance is measured as disparity, meaning one over the depth. Inverse depth runs from a finite value near the camera down to zero at infinity, so the unbounded ray becomes a finite interval with the horizon parked at the far end. That reparameterization folds an unbounded scene into a finite box, so a mountain at infinity lands at a well-defined far plane.
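The inverse-depth idea on its own is small enough to show in code. This sketch only illustrates why stepping evenly in disparity tames an unbounded ray; it is not the paper's full normalized-device-coordinate transform, and the function name and values are illustrative.

```python
import numpy as np

def depths_uniform_in_disparity(near, n_samples):
    """Depths whose inverse (disparity) is evenly spaced from 1/near down toward 0."""
    # endpoint=False keeps disparity strictly positive, so no division by zero.
    disparity = np.linspace(1.0 / near, 0.0, n_samples, endpoint=False)
    return disparity, 1.0 / disparity

disparity, depth = depths_uniform_in_disparity(near=1.0, n_samples=4)
# disparity: 1.0, 0.75, 0.5, 0.25  ->  depth: 1.0, 1.333..., 2.0, 4.0
```

Equal steps on the finite disparity interval become ever larger steps in depth, so a fixed number of samples reaches arbitrarily far while staying dense near the camera.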
To make the worked example concrete: take a synthetic object scaled into a cube two units wide, so a ray runs from a near bound $t_n$ to a far bound $t_f$ that bracket the cube. With 64 stratified samples the gap is about $\delta \approx (t_f - t_n)/64$. A sample landing on a solid surface might report a density $\sigma$ large enough that $\sigma\delta$ is well above one, giving opacity $\alpha = 1 - e^{-\sigma\delta}$ close to one, while a sample in empty space reports $\sigma \approx 0$ and contributes nothing. Stack a few opaque samples and the transmittance falls to near zero, so every sample behind them is invisible to the camera, exactly the occlusion the figure showed.
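Plugging numbers into that example shows the scale of the effect. Every value below is hypothetical, chosen only to make the arithmetic easy to follow: the ray bounds 2 and 6 and the surface density 80 are assumptions, not figures from the text above.

```python
import math

t_near, t_far, n_samples = 2.0, 6.0, 64      # hypothetical ray bounds
delta = (t_far - t_near) / n_samples         # gap between samples: 0.0625

sigma_surface, sigma_empty = 80.0, 0.0       # hypothetical densities
alpha_surface = 1.0 - math.exp(-sigma_surface * delta)   # 1 - e^-5
alpha_empty = 1.0 - math.exp(-sigma_empty * delta)       # exactly 0

# Light left after three opaque samples in a row.
transmittance_after_3 = (1.0 - alpha_surface) ** 3       # e^-15
```

One such sample is already more than 99% opaque, and after three of them less than a millionth of the light is left for anything behind.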
What it buys, and what it can't
On the paper's hardest benchmark, eight pathtraced objects with complicated geometry and shiny materials, NeRF reaches 31.01 dB PSNR, nearly five decibels above the next-best baseline (Neural Volumes at 26.05), with Local Light Field Fusion at 24.88 and the scene-network baseline SRN at 22.26. PSNR is a log scale for pixel error, where each extra 3 dB roughly halves the mean squared error, so a five-decibel gap is closer to a threefold cut in error than a narrow edge. If you have never stared at the images the number sounds modest; it is the difference between a clean specular highlight and a blurred gray smear. On simpler synthetic objects NeRF hits 40.15 dB. It does this while storing the entire scene in roughly 5 MB of weights, about a 3000-fold compression next to the 15+ GB that Local Light Field Fusion stores per scene.
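The decibel arithmetic is worth checking once. The sketch below converts a PSNR gap into a ratio of mean squared errors, using the usual definition of PSNR as ten times the base-10 log of peak-squared over MSE.

```python
import math

def psnr(mse, peak=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)

def mse_ratio(psnr_gap_db):
    """How many times smaller the MSE is for a given PSNR advantage."""
    return 10.0 ** (psnr_gap_db / 10.0)

three_db = mse_ratio(3.0)                    # about 2: the "3 dB halves the error" rule
nerf_vs_nv = mse_ratio(31.01 - 26.05)        # gap quoted above, about 3.1
```

A 4.96 dB lead corresponds to roughly a 3.1-fold reduction in mean squared error.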
The honest scope matters as much as the headline. NeRF wins on every metric except one: on real forward-facing scenes, LLFF edges it on the LPIPS perceptual score (0.212 against NeRF's 0.250), even as NeRF leads on PSNR and multiview consistency. It is the kind of exception worth stating plainly rather than rounding up to a clean sweep.
The ablation is where the design choices prove their worth, and it lines up with the figures above. Removing the view-dependent color costs the most, dropping PSNR from 31.0 to 27.7. Removing the positional encoding costs nearly as much, down to 28.8. The hierarchical sampler matters least, 30.1 without it, though it still earns its place. Hover the bars to read each configuration's full scores.
What you cannot do is as instructive as what you can. A NeRF is slow to render: about 256 network queries per ray (the 64 coarse plus the 192 fine), 150 to 200 million per image, roughly 30 seconds a frame. It is slow to fit: a fresh day or two per scene, with no reuse across scenes. Its lighting is baked in, so you cannot relight it or pull out material properties. Its geometry is implicit, so there is no mesh to hand a downstream tool without a lossy extra step. Every one of these became a research program: faster training, real-time rendering, relightable and editable variants, generalization across scenes. A hundred follow-ups exist because the core idea, a scene as a small queryable function rendered by classical volume integration, was the right thing to build on. The first version was slow and narrow and still reset the field.
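The per-image query count quoted above follows from simple multiplication. The image resolutions used here, 800 by 800 for synthetic scenes and 1008 by 756 for real ones, are assumptions about the paper's datasets rather than numbers stated in this explainer.

```python
COARSE_QUERIES = 64          # coarse network, stratified samples
FINE_QUERIES = 64 + 128      # fine network, coarse samples plus resampled ones
QUERIES_PER_RAY = COARSE_QUERIES + FINE_QUERIES

def queries_per_image(width, height):
    """Network evaluations needed to render one image, one ray per pixel."""
    return width * height * QUERIES_PER_RAY

synthetic = queries_per_image(800, 800)     # assumed synthetic resolution
real = queries_per_image(1008, 756)         # assumed real-capture resolution
```

Both land inside the 150 to 200 million range, which is why a single frame takes on the order of tens of seconds.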
Questions you might still have
Can I reuse a trained NeRF on a different scene?
No. A NeRF is overfit to one scene from scratch, one to two days of optimization, and learns nothing transferable. The weights are that scene, the way a compressed file is one song. A new object means a new optimization. (Later work like pixelNeRF and instant-NGP attacked exactly this.)
Can I extract a mesh or relight it?
Not cleanly. The geometry is implicit in the density field, so a mesh needs a lossy extra step (run marching cubes on the densities). And the lighting is baked into the emitted radiance with no separate light source, so a vanilla NeRF cannot be relit. View-dependent color fakes appearance under capture-time lighting; it does not simulate light transport.
If an MLP is a universal approximator, why does the positional encoding help?
It is not about capacity, it is about optimization. Plain networks have a spectral bias: gradient descent fits low frequencies first and high frequencies barely at all. The sin/cos encoding reshapes the training so high frequencies become learnable. It adds input dimensions, not representational power.
Why two networks, coarse and fine?
The coarse pass is a cheap first pass: 64 even samples to find roughly where the surface is. The fine pass then spends its 128 samples there instead of on empty space. Both are trained together, since the coarse network has to learn to point the fine sampler at the right place.
Is σ the opacity of a point?
No. σ is a density, a rate with units of one-over-length. The opacity of a ray segment of length δ is α = 1 − exp(−σδ), which depends on both the density and the segment length, and getting that relationship backwards is the most common NeRF error. Writing α ≈ σδ is only the first-order approximation, accurate when the segment is nearly transparent.
Footnotes & further reading
- The paper: Mildenhall, Srinivasan, Tancik, Barron, Ramamoorthi, Ng, NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis (ECCV 2020). Project page · code.
- The volume-rendering integral and its quadrature: Max, Optical Models for Direct Volume Rendering (1995); Kajiya & Von Herzen, Ray Tracing Volume Densities (SIGGRAPH 1984), which introduced the equation of transfer (with scattering) to graphics. NeRF uses only its no-scattering subset.
- Spectral bias and Fourier features: Rahaman et al., On the Spectral Bias of Neural Networks (ICML 2019); Tancik et al., Fourier Features Let Networks Learn High Frequency Functions (NeurIPS 2020), the same group's theoretical follow-up. The positional-encoding form echoes the one in the Transformer, used there for a different purpose (ordering tokens, not lifting continuous coordinates).
- Camera poses for real photos come from COLMAP: Schönberger & Frahm, Structure-from-Motion Revisited (CVPR 2016). The optimizer is Adam.
- The baselines NeRF compares against: Scene Representation Networks (Sitzmann et al. 2019), Neural Volumes (Lombardi et al. 2019), and Local Light Field Fusion (Mildenhall et al. 2019). The last is the only one that generalizes across scenes; the rest, like NeRF, fit one scene at a time.