Verified · arXiv:2512.08924 · 24 min
Vision · 4D Reconstruction

Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

Rebuild a moving scene by asking it one point at a time.

Most 4D reconstruction pipelines bolt together a depth model, a pose model, and a motion model, then optimize to make them agree. D4RT replaces all of it with a single question, asked over and over: point at a pixel, name two moments, and the model returns that point's 3D location. Depth, point clouds, tracks, and cameras are just patterns of that one query.

Explaining the paper: Efficiently Reconstructing Dynamic Scenes One D4RT at a Time · Zhang, Le Moing, Koppula, Rocco, … Sajjadi · Google DeepMind · Oxford · UCL · arXiv:2512.08924

What if you could ask a video, for any pixel and any two moments, where that point is in 3D, and get an answer without ever decoding the whole frame?

A video of a swan gliding across a pond is, underneath, a 3D scene changing in time. Geometry plus motion. People call that 4D: three dimensions of space and one of time. Recovering it from the flat pixels of a single moving camera is one of the oldest hard problems in computer vision. You have to work out how far away every surface is, where each point travels, and where the camera was standing. The world keeps moving, and a video records only its projection onto a sensor.

The standard way to attack this is to chop it into pieces and bolt the pieces together. One model estimates depth, another estimates camera pose, another segments what is moving, and then an expensive optimization runs at test time to force the pieces into agreement. MegaSaM works roughly this way. Newer feedforward models like VGGT drop the optimization but still hang a separate decoder head on the backbone for each output, and, more tellingly, none of these methods can say where a moving point goes. They reconstruct the static furniture of a room and lose the swan.

D4RT, out of Google DeepMind with Oxford and UCL, throws out the per-frame, per-task decoding. The name unpacks to Dynamic 4D Reconstruction and Tracking. Instead of producing a dense output for every frame, it answers one small question at a time. Point at a pixel in one frame, name a moment for its position and a frame for the coordinate system, and the model returns a single 3D point. Everything else (depth maps, point clouds, point tracks, camera pose) turns out to be just a pattern of those questions.

A video is compressed once into a scene representation. A query names a point in space and time. A small decoder reads the answer off the representation. And because the answers are independent of each other, you can ask billions of them to reconstruct everything, or just the few you actually need. None of the pieces is hard on its own.

Look once, remember everything

The architecture has two stages, and the split is the whole trick. It is borrowed from the Scene Representation Transformer (SRT), which first showed that you can compress a set of images into one latent and then decode novel views from it by cross-attention. D4RT carries that shape over to video and geometry.

Stage one is the encoder $\mathcal{E}$. It reads the entire video $V$ at once, every frame together, and produces a single bundle of feature vectors called the Global Scene Representation $F$:

$$F = \mathcal{E}(V) \in \mathbb{R}^{N \times C}, \qquad V \in \mathbb{R}^{T \times H \times W \times 3} \tag{1}$$

Read the shapes. The input is $T$ frames of an $H \times W$ image with 3 color channels. The output is $N$ tokens, each a $C$-dimensional vector. The encoder is a Vision Transformer with the alternating attention that VGGT introduced: layers that let tokens attend within a single frame interleave with layers that let them attend across all frames, so the representation knows both what each frame looks like and how the frames relate over time. (The biggest model uses a ViT-g encoder with 40 layers, about a billion parameters.)

The first word of the name is doing the work. $F$ is global. It captures the whole environment: dense correspondence across every frame, and how time changes the scene, held in one place. Once computed, $F$ is fixed. The expensive part of looking at the video happens exactly once. Everything after that just reads from this memory.

A model that decodes a dense map for every frame pays the heavy cost again and again. D4RT pays it once, builds $F$, and then answers questions against it. Figure 1 shows what the split buys as the number of questions grows:

Figure 1 · encode once, query many
The split, costed. Re-encoding the video for every question pays the heavy encoder price Q times; D4RT pays it once, and every question after that is one cheap decoder pass against the frozen F, so the average cost per query collapses toward the cost of a single decode. Units are relative: the encoder is drawn at 100× a decode for illustration, while the real split is a billion-parameter encoder against the eight-layer, 144M-parameter decoder.
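The arithmetic behind that figure is short enough to write down. The sketch below uses the caption's illustrative 100:1 ratio of encoder cost to decode cost; the function names and the unit costs are assumptions for illustration, not measurements from the paper.

```python
def average_cost_per_query(num_queries: int,
                           encode_cost: float = 100.0,
                           decode_cost: float = 1.0) -> float:
    """Average cost per query when the video is encoded once
    and then decoded num_queries times against the frozen F."""
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    return (encode_cost + num_queries * decode_cost) / num_queries


def reencode_cost_per_query(encode_cost: float = 100.0,
                            decode_cost: float = 1.0) -> float:
    """Cost per query if the encoder had to run again for every question."""
    return encode_cost + decode_cost


for q in (1, 8, 100, 10_000):
    print(q, average_cost_per_query(q), reencode_cost_per_query())
```

With these illustrative units, eight questions average 13.5 per query, and the average keeps falling toward the decode cost of 1 as the number of questions grows, while re-encoding stays at 101 per query no matter how many are asked.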

A query names a point in space and time

Stage two is a small decoder $\mathcal{D}$, a lightweight cross-attention transformer (eight layers, 144M parameters, against the encoder's billion). You hand it a query, it cross-attends into the frozen $F$, and it returns one 3D point. The query has five fields:

$$\mathbf{q} = (\,u,\; v,\; t_{\text{src}},\; t_{\text{tgt}},\; t_{\text{cam}}\,), \qquad \mathbf{P} = \mathcal{D}(\mathbf{q}, F) \in \mathbb{R}^3 \tag{2}$$

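Three of the fields describe the source: a normalized 2D pixel $(u,v) \in [0,1]^2$ in a source frame $t_{\text{src}}$. That is the point you are asking about, pinned by where it appears in one frame. The other two fields are time indices in $[1,\dots,T]$: $t_{\text{tgt}}$, the moment at which you want the point's position, and $t_{\text{cam}}$, the frame whose camera coordinate system that position is expressed in.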

These two indices need not coincide. You can ask for the position of a point at frame 30, expressed in the camera coordinates of frame 1. Where a point is, and which frame you measure it from, are separate knobs. A method that ties them together cannot place a moving object into a single shared coordinate frame, which is the gap that loses the swan. Tied indices would mean a point's position at frame 30 can only ever be written in frame 30's own coordinates, every moment stuck in its own private frame, with no common coordinate system in which to assemble the moments into one 4D scene; decoupling them is what lets any moment be rendered from any camera.

It is worth saying exactly why a shared coordinate system is the thing at stake. A 3D position is meaningless without an origin: "two meters forward" is only a location once you fix forward from where. The camera in each frame supplies one such origin, its own viewpoint. If $t_{\text{tgt}}$ and $t_{\text{cam}}$ were forced equal, every point would be reported in the origin belonging to its own moment, the frame-30 swan in frame 30's origin, the frame-31 swan in frame 31's, and so on, with no fixed point any two of them share. You would have a stack of separate little reconstructions, each correct in its own frame, and no way to say where the frame-30 swan sits relative to the frame-31 swan. Letting $t_{\text{cam}}$ stay free is exactly what lets you pin $t_{\text{cam}}$ to one reference frame for every query you ask. That one frame's camera becomes the single origin of the whole 4D scene, every point at every moment expressed against it, which is precisely what makes one consistent point cloud possible rather than a pile of incompatible ones.
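To make the five fields and the pinned reference frame concrete, here is a minimal sketch of the query as a plain data structure. The `Query` class, its `validate` method, and `shared_frame_queries` are hypothetical names introduced for illustration; they are not the paper's code, and no model is involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    u: float      # normalized pixel x in [0, 1]
    v: float      # normalized pixel y in [0, 1]
    t_src: int    # frame in which the pixel (u, v) is pinned
    t_tgt: int    # moment at which the point's position is wanted
    t_cam: int    # frame whose camera coordinates the answer is written in

    def validate(self, num_frames: int) -> None:
        """Check the ranges given in the text: (u, v) in [0, 1]^2, times in 1..T."""
        if not (0.0 <= self.u <= 1.0 and 0.0 <= self.v <= 1.0):
            raise ValueError("(u, v) must lie in [0, 1]^2")
        for name in ("t_src", "t_tgt", "t_cam"):
            t = getattr(self, name)
            if not 1 <= t <= num_frames:
                raise ValueError(f"{name}={t} is outside 1..{num_frames}")


def shared_frame_queries(u, v, t_src, num_frames, ref_frame=1):
    """One physical point at every moment, all expressed in ONE reference camera."""
    return [Query(u, v, t_src, t_tgt, ref_frame)
            for t_tgt in range(1, num_frames + 1)]


queries = shared_frame_queries(u=0.5, v=0.25, t_src=1, num_frames=4)
for q in queries:
    q.validate(num_frames=4)
```

Every query in the list shares the same `t_cam`, so the answers would all be written against one origin; only `t_tgt` moves.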

The simplest use of this is a point track. Pin a pixel in a source frame, then sweep both time indices together across the whole video, $t_{\text{tgt}} = t_{\text{cam}} = 1,\dots,T$. Each answer is the 3D location of that one physical point at one moment, and the sequence of answers is its full 3D trajectory. Pin the pixel, sweep time, read off the path. Figure 2 shows one such sweep:

Figure 2 · one point through time
Pin one source pixel in the first frame, then sweep the target time. The decoder returns that same physical point's 3D position at every moment, and the teal trail is its full trajectory. It keeps moving even after it would leave the source view, because each answer is read from the whole video's representation, not from frame-to-frame matching.

The query token is assembled from a few parts. The continuous coordinates $(u,v)$ are passed through a Fourier feature embedding, sines and cosines at a range of frequencies, the standard way to feed a smooth coordinate into a network without losing fine spatial detail. The three time indices get learned embeddings. And there is one more piece, covered in the training section.
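A Fourier feature embedding is only a few lines. The sketch below is a generic version, not the paper's: the function name, the number of frequencies, and the power-of-two frequency schedule are all assumptions chosen for illustration.

```python
import math


def fourier_features(u: float, v: float, num_frequencies: int = 4) -> list:
    """Embed a normalized pixel (u, v) as sines and cosines at rising frequencies.

    Assumed schedule: frequency k uses the angle 2^k * pi * coordinate.
    Returns 2 coordinates * num_frequencies * 2 (sin, cos) values.
    """
    features = []
    for coord in (u, v):
        for k in range(num_frequencies):
            angle = (2.0 ** k) * math.pi * coord
            features.append(math.sin(angle))
            features.append(math.cos(angle))
    return features


embedding = fourier_features(0.3, 0.7)
```

Low frequencies change slowly across the image and locate the pixel coarsely; high frequencies change quickly and separate neighboring pixels, which is what lets a network resolve fine spatial detail from a smooth input.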

Every query is decoded fully independently. Queries do not attend to each other, only to $F$. That sounds like it would cost consistency, but the consistency was already paid for: it lives in the shared $F$. Two queries about the same surface agree because they read from the same frozen source, not because they compare notes; the agreement is inherited from $F$, never negotiated between answers. Keeping queries apart turned out to help, not hurt. The authors report that letting queries attend to one another caused large performance drops in early experiments, a kind of out-of-distribution effect once you change how many you ask at inference. Independence also buys trivial parallelism: a few queries for a cheap supervision signal during training, or billions in parallel at inference.

One interface, every task

A whole catalog of geometry tasks comes from choosing which fields you pin and which you sweep. No task-specific heads, no separate models. Just different patterns over the same five-field query and the same decoder.

Figure 3 · the same decoder, four tasks
The five query fields. Amber fields are pinned to one value; teal fields sweep a grid or the whole timeline. A point track fixes the pixel and sweeps time. A point cloud sweeps every pixel of every frame into one shared frame. A depth map ties the three times together and keeps Z. Camera pose sweeps a small grid across two frames.

Walk the first three cases, written as query patterns in pseudocode:

# encode the whole video once, then ask cheap point questions
F = encoder(video)                      # Global Scene Rep, computed ONCE

# a query: a 2D pixel + three time indices -> one 3D point
def P(u, v, t_src, t_tgt, t_cam):
    return decoder(query(u, v, t_src, t_tgt, t_cam), F)

times = range(1, T + 1)                 # time indices run 1..T, as in the text

# point track: fix the pixel, sweep t_tgt and t_cam together
track = [P(u, v, t_src, k, k) for k in times]
# point cloud: every pixel of every frame, read in one reference camera `ref`
cloud = [P(u, v, t, t, ref) for (u, v, t) in pixels]
# depth map of frame t: tie all three times together and keep only Z
depth = [P(u, v, t, t, t).z for (u, v) in frame(t)]

One representation, one decoder, and the "heads" are gone. What used to be four trained components is four ways of writing the query. Cameras need slightly more than a pattern of queries, but only slightly.

Cameras fall out of the points

Camera extrinsics, where the camera sat and how it pointed, are a relative thing: the pose of frame $j$ with respect to frame $i$. D4RT gets it almost for free. Take a small grid of points and decode each one twice: once expressed in frame $i$'s coordinates, once in frame $j$'s. The only field that changes is $t_{\text{cam}}$:

$$\mathbf{q}_{i,k} = (u_k, v_k, i, i, i), \qquad \mathbf{q}_{j,k} = (u_k, v_k, i, i, j) \tag{3}$$

These describe the same 3D points, just written in two different coordinate frames. So the two point sets differ by exactly one rigid motion, a rotation and a translation. Recovering it is a classic problem with a closed-form answer: Umeyama's algorithm from 1991. Stack the two clouds, center them, and take the singular value decomposition of their $3 \times 3$ cross-covariance matrix. The rotation reads straight off the SVD (with a small determinant correction so you get a proper rotation and not a mirror image), and the translation follows from the centroids. That rotation and translation are the relative camera pose. Figure 4 shows the transform being applied:

Figure 4 · pose from two decodings
The same points, decoded in frame i's coordinates and in frame j's. They differ by one rigid transform. Umeyama's algorithm reads that transform off an SVD of their cross-covariance; applying it lands one cloud on the other and the misalignment falls to zero. The recovered rotation and translation are the relative camera pose.
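The alignment step can be sketched in a few lines of NumPy. This is the rotation-and-translation case of the Umeyama procedure described above, with the scale factor left out; the function name `rigid_align` and the synthetic test points are assumptions for illustration, and the two clouds here are generated from a known transform rather than decoded by any model.

```python
import numpy as np


def rigid_align(src, dst):
    """Least-squares rotation R and translation t such that dst ≈ src @ R.T + t."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    # 3x3 cross-covariance of the centered clouds
    cov = (dst - mu_dst).T @ (src - mu_src) / len(src)
    U, _, Vt = np.linalg.svd(cov)
    # determinant correction: force a proper rotation, never a reflection
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    t = mu_dst - R @ mu_src
    return R, t


# Synthetic check: the same points written in two frames that differ by a known pose.
rng = np.random.default_rng(0)
points_i = rng.normal(size=(16, 3))
theta = np.pi / 6
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.2, -0.1, 0.5])
points_j = points_i @ R_true.T + t_true

R_est, t_est = rigid_align(points_i, points_j)
residual = np.abs(points_i @ R_est.T + t_est - points_j).max()
```

On exact data the recovered `R_est` and `t_est` reproduce the known pose and the residual is at floating-point level. With real decoded points the fit is a least-squares compromise, which is why a few badly wrong points can matter.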

Camera intrinsics, the focal length, come out by inverting the pinhole camera equation. A pinhole camera projects a 3D point $\mathbf{P} = (p_x, p_y, p_z)$ to a pixel by $u = f_x\,(p_x/p_z) + c_x$. Assume the principal point sits at the image center, $(c_x, c_y) = (0.5, 0.5)$ in normalized coordinates, and solve for the focal length:

$$f_x = p_z\,(u - 0.5)\,/\,p_x, \qquad f_y = p_z\,(v - 0.5)\,/\,p_y \tag{4}$$

Each decoded point gives one estimate of the focal length. D4RT takes the median over the grid of points for robustness. So a model that was only ever asked to report 3D positions of points hands you full camera calibration as a side effect, with no calibration head anywhere. (The assumptions are the usual pinhole ones: a centered principal point and no lens distortion. Distortion can be handled by adding a nonlinear refinement on top.)
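The focal-length recovery in Eq. (4) and the median over the grid can be sketched as follows. The helper names `project` and `estimate_focal`, the guard against near-zero $p_x$ or $p_y$, and the synthetic points are assumptions for illustration; the points are generated from a known focal length rather than decoded by a model.

```python
from statistics import median


def project(point, fx, fy, cx=0.5, cy=0.5):
    """Pinhole projection of a 3D point to normalized pixel coordinates."""
    px, py, pz = point
    return fx * px / pz + cx, fy * py / pz + cy


def estimate_focal(points_3d, pixels, eps=1e-6):
    """Invert the pinhole equation per point (Eq. 4), then take the median."""
    fx_estimates, fy_estimates = [], []
    for (px, py, pz), (u, v) in zip(points_3d, pixels):
        if abs(px) > eps:                      # skip points on the vertical axis
            fx_estimates.append(pz * (u - 0.5) / px)
        if abs(py) > eps:                      # skip points on the horizontal axis
            fy_estimates.append(pz * (v - 0.5) / py)
    return median(fx_estimates), median(fy_estimates)


points = [(1.0, 1.0, 4.0), (-1.0, 0.5, 5.0), (0.5, -1.0, 4.0),
          (0.8, 0.6, 3.0), (-0.6, -0.9, 6.0)]
pixels = [project(p, fx=1.2, fy=1.2) for p in points]
points[0] = (1.0, 1.0, 40.0)                   # corrupt one point: a gross depth outlier
fx, fy = estimate_focal(points, pixels)
```

The corrupted point alone would suggest a focal length ten times too large, but the median ignores it and returns the value the other four points agree on. A mean would not.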

Teaching it to point

Training is supervised regression, which is about as plain as it gets. Sample a batch of queries whose true 3D answers are known from the training data, decode them, and penalize the error. The main term is an $L_1$ loss on the 3D point, borrowed in spirit from DUSt3R, with two wrinkles. It is one term of a larger weighted loss:

$$\mathcal{L}_{\text{point}} = \sum_i\, c_i\,\big\lVert\, \psi(\hat{\mathbf{P}}_i / \bar{z}) - \psi(\mathbf{P}_i / z) \,\big\rVert_1 \; - \; \lambda_{\text{conf}} \log c_i, \qquad \psi(x) = \operatorname{sign}(x)\,\log(1 + |x|) \tag{5}$$

The first wrinkle is the normalization. Predicted and target point sets are each divided by their own mean depth ($\bar{z}$ and $z$) before comparison, then squashed by $\psi(x) = \operatorname{sign}(x)\log(1+|x|)$. Dividing by mean depth makes the loss scale-invariant, which it has to be, because from one video the absolute scale of the world cannot be recovered. The log squash then keeps a few very distant points from dominating the error. (This is also why the benchmarks align predictions to ground truth by a single scale, or a scale and a shift, before scoring. The geometry is correct up to that global factor, not in literal meters.)
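The scale invariance is easy to check numerically. The sketch below implements only the normalized, squashed $L_1$ part of Eq. (5), with the confidence weighting left out; the function names and the sample points are assumptions for illustration.

```python
import math


def psi(x: float) -> float:
    """Signed log squash: sign(x) * log(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)


def mean_depth(points) -> float:
    """Mean of the z coordinate over a set of (x, y, z) points."""
    return sum(p[2] for p in points) / len(points)


def normalized_l1(pred, target) -> float:
    """Sum of L1 errors after each set is divided by its own mean depth and squashed."""
    z_pred = mean_depth(pred)
    z_target = mean_depth(target)
    total = 0.0
    for p, g in zip(pred, target):
        total += sum(abs(psi(a / z_pred) - psi(b / z_target))
                     for a, b in zip(p, g))
    return total


target = [(0.5, -0.2, 2.0), (1.0, 0.3, 4.0)]
pred_scaled = [(3 * x, 3 * y, 3 * z) for (x, y, z) in target]   # right shape, wrong scale
pred_wrong = [(0.5, -0.2, 2.0), (1.0, 0.3, 8.0)]                # one depth is off

loss_scaled = normalized_l1(pred_scaled, target)
loss_wrong = normalized_l1(pred_wrong, target)
```

A prediction that is the ground truth multiplied by any global factor costs nothing, because the factor cancels in the division by mean depth. A prediction with the wrong shape still pays.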

The second is the confidence $c_i$, a weighting from Kendall and Gal. The model predicts, per point, how much it trusts its own answer. That confidence multiplies the error, so a confident wrong answer is punished hard, but the term $-\lambda_{\text{conf}} \log c_i$ stops it from declaring everything uncertain to dodge the loss. The model is rewarded for being confident and right. In practice this term is what made camera pose estimation work: turning it off more than doubled the translation error. There is a reason pose needs this more than depth does: pose is read off by aligning whole point clouds (the Umeyama step above), and one confidently wrong point can swing the whole alignment, so down-weighting the points the model distrusts lets the fit lean on the points it trusts; a depth map, read out pointwise, just inherits one bad pixel.
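One way to see why the confidence term behaves this way is to minimize a single point's contribution over $c_i$ with the error held fixed. This is a sketch under an assumption not made in the text above: $c_i$ is treated as a free positive number, whereas the model's actual parameterization may bound it. The symbol $e_i$ for the per-point error is introduced here for illustration.

```latex
\begin{align}
\ell_i(c_i) &= c_i\, e_i - \lambda_{\text{conf}} \log c_i,
  \qquad e_i = \big\lVert \psi(\hat{\mathbf{P}}_i / \bar{z}) - \psi(\mathbf{P}_i / z) \big\rVert_1 > 0 \\
\frac{\partial \ell_i}{\partial c_i} &= e_i - \frac{\lambda_{\text{conf}}}{c_i} = 0
  \;\Longrightarrow\; c_i^{\star} = \frac{\lambda_{\text{conf}}}{e_i} \\
\ell_i(c_i^{\star}) &= \lambda_{\text{conf}} \Big( 1 + \log \frac{e_i}{\lambda_{\text{conf}}} \Big)
\end{align}
```

Under that assumption the best confidence is inversely proportional to the error: the model does best by being confident where its error is small and hedging where it is large. The loss at the optimum still grows with $e_i$, so lowering confidence softens a mistake without erasing it.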

A handful of auxiliary losses ride alongside, each a small linear head on the decoder output: an $L_1$ loss on the point's 2D image position, a cosine loss on surface normals, a binary cross-entropy on whether the point is visible or occluded, and an $L_1$ on its motion. They are applied only where ground truth exists. The ablations show 2D position and normals sharpen depth the most, and confidence is what rescues pose.

Each query also carries an embedding of the local 9×9 RGB patch around its source pixel, alongside the coordinates and time indices. That small window of raw color is, empirically, the most useful single addition to a query. It gives the decoder a low-level appearance fingerprint to match against $F$, and it carries the edges and texture needed to nail an object boundary. Most prior methods need dedicated machinery for this (skip connections from encoder to decoder, in the style of DPT); here it is a few extra numbers in the query. Figure 5 shows the effect on the swan:

Figure 5 · the local patch sharpens everything
Without the local 9×9 patch, depth can only be as sharp as the encoder's coarse token grid, and the outline goes blocky. With it, the query reads detail straight from the source frame and the boundary snaps into focus. On Sintel the depth error drops from 0.366 to 0.302 and the camera pose error (ATE) from 0.173 to 0.091.

The whole model is trained end to end in just over two days on 64 TPU chips, on 48-frame clips at 256×256, decoding 2048 random queries per step across a mixture of synthetic and real datasets. Because supervision only needs a sparse set of decoded queries, training is cheap even though the outputs can be dense.

Tracking every pixel without redundant work

The independent query is what makes the headline capability possible: a dense, holistic reconstruction where every pixel, static or moving, gets a full trajectory. That is what fills in the gaps a tracker leaves when an object is occluded in the first frame and then reappears.

Make that concrete with one point. Say there is a speck on the swan's wing, clearly visible in frames 1 through 12, hidden behind a reed from frame 13 to frame 18, and visible again from frame 19 on. A frame-to-frame tracker, the usual design, carries the point forward by matching it from one frame to the next. At frame 13 the speck is gone, the match fails, and the tracker either freezes the point on the reed or drops it; when the speck comes back at frame 19 the tracker has no idea it is the same speck. D4RT does not chain frame to frame. To get the speck's position at the hidden frame 15, you pin its source pixel $(u, v)$ in frame 1, set $t_{\text{tgt}} = 15$ to ask where that physical point is at frame 15, and read it in some visible frame's coordinates with $t_{\text{cam}}$. The decoder answers that single $(\text{point}, \text{time})$ question against the one frozen scene representation $F$, which was built from the whole video at once, including the frames where the speck reappears. So it returns a position for frame 15 even though nothing in frame 15 itself shows the speck, the same way it would for a visible frame. (As everywhere here, that position is recovered up to the video's global scale, not in literal meters.) The occlusion was only ever a problem for a method that walks the timeline step by step; a method that asks about each moment independently never has a chain to break.

But the naive version does far too much work. Tracking every pixel of every frame across the whole video is $O(T^2 H W)$ queries, and almost all of them are redundant. A point you tracked from frame 1 already tells you where its pixel went in frames 2 through $T$, so re-tracking those pixels from scratch is wasted work.

The fix is an occupancy grid $G \in \{0,1\}^{T \times H \times W}$, one bit per spatio-temporal pixel, marking whether it has already been explained. The algorithm only ever starts a new track from an unvisited pixel. Each full-video track, once decoded, marks every pixel it visibly passes through as covered. So a single track claims a whole swath of the video at once, and you spawn new tracks only for what is left. Read it as amortization: one track answers for every pixel it visibly passes through, the grid marks them done, and the budget of queries is spent only on pixels no earlier track has claimed:

# Algorithm 1: track all pixels without redundant work
def track_all_pixels(video):
    F = encoder(video)                  # encode once
    G = zeros(T, H, W)                  # occupancy grid of visited pixels
    tracks = []
    while not G.all():                  # while any pixel is still unvisited
        for (u, v, t_src) in sample_unvisited(G):    # a batch, in parallel
            # one full-video track: sweep t_tgt = t_cam = k over times 1..T
            q = [query(u, v, t_src, k, k) for k in range(1, T + 1)]
            pts = [decoder(qk, F) for qk in q]
            G[visible(pts)] = 1         # mark every pixel it visibly passes through
            tracks.append(pts)
    return tracks
Figure 6 · the occupancy grid
One frame's pixels. Each new track starts from an unvisited pixel and immediately marks a whole connected patch covered, because it visibly passes through those pixels across the video. A handful of tracks blankets the frame. The paper reports a 5–15× speedup over the naive sweep, depending on how much the scene moves.
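The bookkeeping can be run end to end on a toy scene. The sketch below is not the paper's algorithm as implemented: it swaps the learned decoder for a known motion (the whole scene shifts one pixel to the right per frame), visits pixels one at a time instead of in parallel batches, and uses 0-indexed frames. All names and sizes are assumptions for illustration.

```python
def track_all_pixels(T: int, H: int, W: int):
    """Cover every spatio-temporal pixel with full-video tracks, skipping visited ones.

    Toy motion model (an assumption): the point at column x in frame t_src
    sits at column x + (k - t_src) in frame k, and is invisible off-screen.
    """
    visited = [[[False] * W for _ in range(H)] for _ in range(T)]
    tracks = []
    for t_src in range(T):
        for y in range(H):
            for x in range(W):
                if visited[t_src][y][x]:
                    continue                    # already explained by an earlier track
                track = []
                for k in range(T):              # sweep the whole video
                    xk = x + (k - t_src)
                    if 0 <= xk < W:             # visible in frame k
                        visited[k][y][xk] = True
                        track.append((k, y, xk))
                tracks.append(track)
    return tracks, visited


T, H, W = 8, 4, 8
tracks, visited = track_all_pixels(T, H, W)
naive_tracks = T * H * W                        # one track started from every pixel
speedup = naive_tracks / len(tracks)
print(len(tracks), naive_tracks, round(speedup, 2))
```

In this toy, 60 tracks cover all 256 spatio-temporal pixels, because each distinct scene point needs exactly one track no matter how many frames it appears in. The saving depends entirely on the motion model, so the ratio here says nothing about the speedups measured on real video.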

This only works because the decoder is both sparse (you can ask for one point) and lightweight (asking is cheap). Methods with dense per-frame decoders are stuck paying the full cost; methods with heavy sparse decoders pay too much per query to do this at scale. The query design is what unlocks the algorithm.

What it actually does

D4RT sets a new state of the art on dynamic 4D reconstruction and tracking, and it is far faster. On TAPVid-3D, the standard 3D point-tracking benchmark, it leads on the headline 3D Average Jaccard metric on most subsets, and on all of them when given ground-truth intrinsics. Without intrinsics, SpatialTrackerV2 edges it on one subset (ADT). On point-cloud reconstruction it beats the strong static models π³ and VGGT, cutting the Sintel point-cloud error from above 1.1 down to 0.77. On video depth it is top-tier across Sintel, ScanNet, KITTI and Bonn, and on camera pose it leads across Sintel, ScanNet and Re10K, with the largest gaps on the dynamic Sintel sequences.

Because each point is a tiny parallel cross-attention and there is no iterative refinement, D4RT produces far more tracks per second than the dense or iterative baselines: 18 to 300 times faster, and two orders of magnitude more throughput than MegaSaM on camera pose. Figure 7 compares the models at a target frame rate:

Figure 7 · tracks per second
How many full-video tracks each model can produce while hitting a target frame rate (log scale, because the gap is that large). D4RT leads DELTA and SpatialTrackerV2 by a wide margin at every rate. At 24 FPS it produces 1,570 tracks to DELTA's 5. The cost scales with the points you ask for, not with the frame.

What "state of the art" rests on is worth spelling out. Every comparison is after a global alignment of scale, because monocular geometry is only ever recovered up to that scale. Camera pose is reported after a similarity alignment to the ground truth. And the headline encoder is large, a billion-parameter ViT-g; the ablations show quality climbing steadily on depth and rotation error from ViT-B to ViT-g, so a good chunk of the result is the backbone doing its job. None of this undercuts the contribution. The win is a unified, fast interface that matches or beats specialized pipelines. It is not metric 3D from nothing.

The limits the design implies are real and the paper is candid about them where it can be. The intrinsics recovery assumes a centered principal point and no distortion, handled only by an extra refinement step. Long videos are processed by splitting into overlapping segments and stitching them with the same Umeyama alignment, rather than in one shot. And the scale ambiguity is structural, not a bug to be fixed: one camera, one video, no way to know if you are watching a real swan or a model of one.

Step back and the argument is short. Compress the video once into a global representation. Make the only operation a query that names a point in space and time and returns its 3D position. Then depth, point clouds, tracks, and cameras stop being separate problems with separate machinery. They become patterns of that one question. 4D reconstruction looked like it needed a pipeline of specialists because the old methods decoded whole frames. D4RT only asks for points.

Provenance · Verified against primary literature
SRT (2022): Encode once into a scene latent, then decode each query independently by cross-attention.
VGGT (2025): The alternating frame-wise / global self-attention video encoder.
DUSt3R (2024): Scale-normalized 3D point regression: divide each point set by its mean depth before the L1 loss.
Kendall & Gal (2017): The −log(c) confidence weighting that lets the model down-weight points it is unsure of.
Umeyama (1991): Closed-form rigid transform between two point sets via an SVD of their cross-covariance.
Fourier features (2020): Sinusoidal coordinate embedding so the decoder can resolve fine spatial detail.
TAPVid-3D (2024): The AJ / APD₃D / OA tracking metrics, evaluated with and without ground-truth intrinsics.
Correction: From one video, absolute scale is unknowable. D4RT predicts geometry up to a global scale (and often a shift): every benchmark aligns the prediction to ground truth before scoring. We say "up to scale" where the paper’s evaluation protocol implies it, rather than calling the output metric.

Questions you might still have

If every point is decoded on its own, how do they stay consistent?
They all read from the same fixed scene representation F. Consistency lives in F, not in any link between queries. Decoding independently even helps: letting queries attend to each other caused large drops in early experiments, so they are kept apart.

Is the output metric, in real meters?
No. From a single video the absolute scale is ambiguous. D4RT predicts geometry up to a global scale, and often an extra shift, which is why every benchmark aligns the prediction to ground truth before scoring. Feeding in known intrinsics pins it down further, but the headline numbers normalize scale.

What is the difference between t_tgt and t_cam?
t_tgt is when the point is, so it slides along the point’s trajectory through time. t_cam is which frame’s camera coordinate system you read that position in. Decoupling them lets one query place a moving point in any frame’s coordinates.

Why is it so much faster than a tracker like SpatialTrackerV2?
No iterative refinement and no dense per-frame decoder. Each point is a single small cross-attention into a representation computed once, and the queries run in parallel. You pay for exactly the points you ask for, not for the whole frame.

Footnotes & further reading

  1. The paper: Zhang, Le Moing, Koppula, Rocco, Momeni, Xie, Sun, Sukthankar, Barral, Hadsell, Ghahramani, Zisserman, Zhang, Sajjadi, Efficiently Reconstructing Dynamic Scenes One D4RT at a Time (Google DeepMind / Oxford / UCL, 2025). Project page with animated results.
  2. The encode-once, decode-by-query design comes from Sajjadi et al., Scene Representation Transformer (CVPR 2022). The alternating frame-wise / global attention encoder is from Wang et al., VGGT (CVPR 2025), which predicts geometry with separate heads but produces no dynamic correspondences.
  3. The scale-normalized point regression follows Wang et al., DUSt3R (CVPR 2024); the −log(c) confidence weighting follows Kendall & Gal, What Uncertainties Do We Need in Bayesian Deep Learning? (NeurIPS 2017). Fourier feature coordinate embeddings: Tancik et al., Fourier Features Let Networks Learn High Frequency Functions (NeurIPS 2020).
  4. The rigid-transform recovery is Umeyama, Least-Squares Estimation of Transformation Parameters Between Two Point Patterns (IEEE PAMI 1991): rotation from the SVD of the cross-covariance, with a determinant correction so the result is a proper rotation rather than a reflection.
  5. The tracking benchmark and its AJ / APD₃D / OA metrics: Koppula et al., TAPVid-3D (NeurIPS 2024). The baselines compared against include MegaSaM, π³, SpatialTrackerV2, St4RTrack, and DELTA.