Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

Rebuild a moving scene by asking it one point at a time.

Most 4D reconstruction pipelines bolt together a depth model, a pose model, and a motion model, then optimize to make them agree. D4RT replaces all of it with a single question, asked over and over: point at a pixel, name two moments, and the model returns that point's 3D location. Depth, point clouds, tracks, and cameras are patterns of that one query.

Explaining the paper: Efficiently Reconstructing Dynamic Scenes One D4RT at a Time. Zhang, Le Moing, Koppula, Rocco, … Sajjadi · Google DeepMind · Oxford · UCL · arXiv:2512.08924

Encode the whole video once; after that, any 3D point is a single question away, billions of them or just the few you need, and no frame is ever fully decoded.

A video of a swan gliding across a pond is, underneath, a 3D scene changing in time. Geometry plus motion. People call that 4D: three dimensions of space and one of time. Recovering it from the flat pixels of a single moving camera is one of the oldest hard problems in computer vision. You have to work out how far away every surface is, where each point travels, and where the camera was standing. The world keeps moving, and a video records only its projection onto a sensor.

The standard way to attack this is to chop it into pieces and bolt the pieces together. One model estimates depth, another estimates camera pose, another segments what is moving, and then an expensive optimization runs at test time to force the pieces into agreement. MegaSaM works roughly this way. Newer feedforward models like VGGT (Visual Geometry Grounded Transformer) drop the optimization but still hang a separate decoder head on the backbone for each output, and, more tellingly, none of these methods can say where a moving point goes. They reconstruct the static furniture of a room and lose the swan.

D4RT, out of Google DeepMind with Oxford and UCL, throws out the per-frame, per-task decoding. The name unpacks to Dynamic 4D Reconstruction and Tracking. Instead of producing a dense output for every frame, it answers one small question at a time. Point at a pixel in one frame, name a moment for its position and a frame for the coordinate system, and the model returns a single 3D point. Everything else (depth maps, point clouds, point tracks, camera pose) is a pattern of those questions.

A video is compressed once into a scene representation. A query names a point in space and time. A small decoder reads the answer off the representation. And because the answers are independent of each other, you can ask billions of them to reconstruct everything, or only the few you actually need.

Look once, remember everything

The architecture has two stages, and everything else rests on the split between them. It is borrowed from the Scene Representation Transformer (SRT), which first showed that you can compress a set of images into one latent and then decode novel views from it by cross-attention. D4RT carries that shape over to video and geometry.

Stage one is the encoder $\mathcal{E}$. It reads the entire video $V$ at once, every frame together, and produces a single bundle of feature vectors called the Global Scene Representation $F$:

$$F = \mathcal{E}(V) \in \mathbb{R}^{N \times C}, \qquad V \in \mathbb{R}^{T \times H \times W \times 3} \tag{1}$$

Read the shapes. The input is $T$ frames of an $H \times W$ image with 3 color channels. The output is $N$ tokens, each a $C$-dimensional vector. The encoder is a Vision Transformer with the alternating attention that VGGT introduced: layers that let tokens attend within a single frame interleave with layers that let them attend across all frames, so the representation captures both what each frame looks like and how the frames relate over time. (The biggest model uses a ViT-g encoder with 40 layers, about a billion parameters.)
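The alternating pattern is easier to see as two attention masks. The following is a minimal sketch, not the paper's or VGGT's code: the function names, the toy sizes, and the choice to put frame-wise attention on even layers are all assumptions made for illustration.

```python
def frame_mask(num_frames, tokens_per_frame):
    """Frame-wise attention: token i may attend to token j only inside the same frame."""
    n = num_frames * tokens_per_frame
    return [[i // tokens_per_frame == j // tokens_per_frame for j in range(n)]
            for i in range(n)]


def global_mask(num_frames, tokens_per_frame):
    """Global attention: every token may attend to every token of every frame."""
    n = num_frames * tokens_per_frame
    return [[True] * n for _ in range(n)]


def layer_schedule(num_layers):
    """Assumed ordering (hypothetical): frame-wise on even layers, global on odd layers."""
    return ["frame" if layer % 2 == 0 else "global" for layer in range(num_layers)]


# Toy example: 3 frames with 2 tokens each, 4 layers.
masks = {"frame": frame_mask(3, 2), "global": global_mask(3, 2)}
schedule = layer_schedule(4)
```

The frame-wise mask is block-diagonal, one block per frame, while the global mask is full. Interleaving the two is what lets the same tokens carry both per-frame appearance and cross-frame correspondence.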

$F$ is global. It captures the entire environment: dense correspondence across every frame, and how time changes the scene, held in one place. Once computed, $F$ is fixed. The expensive part of looking at the video happens exactly once. Everything after that reads from this memory.

A model that decodes a dense map for every frame pays the heavy cost again and again. D4RT pays it once, builds $F$, and then answers questions against it. Figure 1 shows what the split buys as the number of questions grows:

Figure 1 · encode once, query many

The split, costed. Re-encoding the video for every question pays the heavy encoder price Q times; D4RT pays it once, and every question after that is one cheap decoder pass against the frozen F, so the average cost per query collapses toward the cost of a single decode. Units are relative: the encoder is drawn at 100× a decode for illustration, whereas the real split is a billion-parameter encoder against the eight-layer, 144M-parameter decoder. At Q = 8 queries, for instance, the average is (100 + 8) / 8 = 13.5 units per query.
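The arithmetic behind the figure fits in a few lines. This sketch uses the figure's illustrative units (an encoder pass costing 100 decodes); the function names are made up for the example, and the numbers are not measured costs.

```python
def avg_cost_per_query(num_queries, encode_cost=100.0, decode_cost=1.0):
    """Average cost per query when the encoder runs once and F is shared."""
    return (encode_cost + num_queries * decode_cost) / num_queries


def reencode_cost_per_query(encode_cost=100.0, decode_cost=1.0):
    """Cost per query if the video were re-encoded for every question."""
    return encode_cost + decode_cost


for q in (1, 8, 1_000, 1_000_000):
    print(q, avg_cost_per_query(q), reencode_cost_per_query())
```

At 8 queries the shared-encoder average is 13.5 units against 101 for re-encoding, and as the query count grows the average approaches the cost of one decode.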

A query names a point in space and time

Stage two is a small decoder $\mathcal{D}$, a lightweight cross-attention transformer (eight layers, 144M parameters, against the encoder's billion). You hand it a query, it cross-attends into the frozen $F$, and it returns one 3D point. The query has five fields:

$$\mathbf{q} = (\,u,\; v,\; t_{\text{src}},\; t_{\text{tgt}},\; t_{\text{cam}}\,), \qquad \mathbf{P} = \mathcal{D}(\mathbf{q}, F) \in \mathbb{R}^3 \tag{2}$$

Three of the fields describe the source: a normalized 2D pixel $(u,v) \in [0,1]^2$ in a source frame $t_{\text{src}}$. That is the point you are asking about, pinned by where it appears in one frame. The other two fields are also frame indices in $\{1,\dots,T\}$:

$t_{\text{tgt}}$ is when the point is. The same physical speck of swan is in a different place at frame 5 than at frame 30. Choosing $t_{\text{tgt}}$ picks which moment's position you want, so it slides the answer along the point's trajectory through time.
$t_{\text{cam}}$ is whose coordinates you want it in. A 3D position is only meaningful relative to some origin. $t_{\text{cam}}$ names the frame whose camera viewpoint defines that origin.

These two indices need not coincide. You can ask for the position of a point at frame 30, expressed in the camera coordinates of frame 1. Where a point is, and which frame you measure it from, are independent. A method that ties them together cannot place a moving object into a single shared coordinate frame, and so it loses the swan. With tied indices, a point's position at frame 30 could only ever be written in frame 30's own coordinates: every moment would be stuck in its own private frame, with no common coordinate system in which to assemble the moments into one 4D scene. Decoupling them lets any moment be expressed in any camera's coordinates.

A shared coordinate system matters for a concrete reason. A 3D position is meaningless without an origin: "two meters forward" is only a location once you fix forward from where. The camera in each frame supplies one such origin, its own viewpoint. If $t_{\text{tgt}}$ and $t_{\text{cam}}$ were forced equal, every point would be reported in the origin belonging to its own moment, the frame-30 swan in frame 30's origin, the frame-31 swan in frame 31's, and so on, with no fixed point any two of them share. You would have a stack of separate little reconstructions, each correct in its own frame, and no way to say where the frame-30 swan sits relative to the frame-31 swan. Because $t_{\text{cam}}$ is free, you can pin it to one reference frame for every query you ask. That one frame's camera becomes the single origin of the whole 4D scene, with every point at every moment expressed against it, which makes one consistent point cloud possible rather than a pile of incompatible ones.

The simplest use of this is a point track. Pin a pixel in a source frame, then sweep both time indices together across the whole video, $t_{\text{tgt}} = t_{\text{cam}} = 1,\dots,T$. Each answer is the 3D location of that one physical point at one moment, expressed in that moment's camera coordinates, and the sequence of answers is its full 3D trajectory (Figure 2).

Figure 2 · one point through time

Pin one source pixel in the first frame, then sweep the target time. The decoder returns that same physical point's 3D position at every moment, and the teal trail is its full trajectory. It keeps moving even after it would leave the source view, because each answer is read from the whole video's representation, not from frame-to-frame matching.

The query token is assembled from a few parts. The continuous coordinates $(u,v)$ are passed through a Fourier feature embedding, sines and cosines at a range of frequencies, the standard way to feed a smooth coordinate into a network without losing fine spatial detail. The three time indices get learned embeddings. And there is one more piece, covered in the training section.
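A Fourier feature embedding of the pixel coordinates can be sketched as follows. The octave-spaced frequencies and the count of four are assumptions borrowed from common positional-encoding practice; the paper's exact frequencies and dimensionality may differ, and the function name is hypothetical.

```python
import math


def fourier_features(u, v, num_frequencies=4):
    """Embed normalized (u, v) in [0, 1]^2 as sines and cosines at octave-spaced frequencies."""
    features = []
    for coord in (u, v):
        for k in range(num_frequencies):
            angle = (2.0 ** k) * math.pi * coord  # frequency doubles with each k
            features.append(math.sin(angle))
            features.append(math.cos(angle))
    return features


# Two coordinates x four frequencies x (sin, cos) = 16 numbers.
embedding = fourier_features(0.25, 0.5)
```

The high-frequency terms change quickly as the pixel moves, which is what lets a network tell neighboring pixels apart instead of blurring them together.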

Every query is decoded fully independently. Queries do not attend to each other, only to $F$. That sounds like it would cost consistency, but the consistency was already paid for: it lives in the shared $F$. Two queries about the same surface agree because they read from the same frozen source. The agreement comes from $F$, not from any exchange between the answers. Keeping queries apart helps rather than hurts. The authors report that letting queries attend to one another caused large performance drops in early experiments, a kind of out-of-distribution effect once you change how many you ask at inference. Independence also buys trivial parallelism: a few queries for a cheap supervision signal during training, or billions in parallel at inference.
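Independence has a checkable consequence: the answer to a query does not depend on which other queries share its batch. The sketch below uses a bare single-head cross-attention in NumPy as a stand-in for the decoder. It has no learned projections and random data, so it illustrates only the attention pattern, not D4RT's actual decoder.

```python
import numpy as np


def cross_attend(queries, F):
    """Each query row attends only to the rows of F, never to other queries."""
    scores = queries @ F.T / np.sqrt(F.shape[1])
    scores -= scores.max(axis=1, keepdims=True)   # subtract row max for a stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ F


rng = np.random.default_rng(0)
F = rng.normal(size=(32, 16))        # stand-in scene representation: N=32 tokens, C=16
queries = rng.normal(size=(5, 16))   # five stand-in query tokens

batched = cross_attend(queries, F)
one_by_one = np.vstack([cross_attend(queries[i:i + 1], F) for i in range(len(queries))])
```

Here `batched` and `one_by_one` agree up to floating-point error, so the number of queries asked at once cannot change any single answer. That is the property self-attention between queries would break.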

One interface, every task

A range of geometry tasks comes from choosing which fields you pin and which you sweep. No task-specific heads, no separate models. Different patterns over the same five-field query and the same decoder are all it takes.

Figure 3 · the same decoder, four tasks

The five query fields. Amber fields are pinned to one value; teal fields sweep a grid or the whole timeline. A point track fixes the pixel and sweeps time. A point cloud sweeps every pixel of every frame into one shared frame. A depth map ties the three times together and keeps Z. Camera pose sweeps a small grid across two frames.

The cases:

Point track: pin $(u,v,t_{\text{src}})$, sweep $t_{\text{tgt}} = t_{\text{cam}}$ over all frames. One point, its whole path.
Point cloud: sweep $(u,v)$ over every pixel and $t_{\text{src}} = t_{\text{tgt}}$ over every frame, pin $t_{\text{cam}}$ to one reference. Every pixel of the whole video lands in a single shared coordinate frame directly, with no stitching together of separately estimated cameras, the usual source of noisy reconstructions.
Depth map: for a frame, set $t_{\text{src}} = t_{\text{tgt}} = t_{\text{cam}}$ and keep only the $Z$ coordinate of each returned point. Depth is the third coordinate of the geometry you already have.

# Encode the whole video once, then ask cheap point questions.
# (encoder, decoder, query, pixels and frame are placeholders, not a real API.)
F = encoder(video)                      # Global Scene Representation, computed ONCE

# A query: a 2D pixel + three time indices -> one 3D point.
def P(u, v, t_src, t_tgt, t_cam):
    return decoder(query(u, v, t_src, t_tgt, t_cam), F)

track = [P(u, v, t_src, k, k) for k in range(1, T + 1)]  # fix pixel, sweep time 1..T
cloud = [P(u, v, t, t, ref) for (u, v, t) in pixels]     # all pixels of all frames, one camera frame
depth = [P(u, v, t, t, t).z for (u, v) in frame(t)]      # one frame t, keep only Z

One representation, one decoder, and the "heads" are gone. What used to be four trained components is four ways of writing the query. Cameras need slightly more than a pattern of queries, but only slightly.
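To show that the three patterns differ only in which fields are pinned, here is a small runnable sketch that builds the query lists themselves, with no model involved. The `Query` type, the helper names and the pixel-center convention are illustrative assumptions, not the paper's code.

```python
from typing import NamedTuple


class Query(NamedTuple):
    u: float
    v: float
    t_src: int
    t_tgt: int
    t_cam: int


def pixel_centers(height, width):
    """Normalized (u, v) centers of every pixel in an H x W frame (assumed convention)."""
    return [((x + 0.5) / width, (y + 0.5) / height)
            for y in range(height) for x in range(width)]


def track_queries(u, v, t_src, num_frames):
    """Pin the pixel, sweep t_tgt = t_cam over frames 1..T."""
    return [Query(u, v, t_src, k, k) for k in range(1, num_frames + 1)]


def cloud_queries(height, width, num_frames, t_ref=1):
    """Sweep every pixel of every frame, pin t_cam to one reference frame."""
    return [Query(u, v, t, t, t_ref)
            for t in range(1, num_frames + 1)
            for (u, v) in pixel_centers(height, width)]


def depth_queries(height, width, t):
    """Tie all three time indices to frame t; the caller keeps only Z of each answer."""
    return [Query(u, v, t, t, t) for (u, v) in pixel_centers(height, width)]
```

A 48-frame track is 48 queries, and a depth map is one query per pixel. Only the point cloud scales with both frames and pixels.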

Cameras fall out of the points

Camera extrinsics, where the camera sat and how it pointed, are a relative thing: the pose of frame $j$ with respect to frame $i$. D4RT gets it at almost no extra cost. Take a small grid of points and decode each one twice: once expressed in frame $i$'s coordinates, once in frame $j$'s. The only field that changes is $t_{\text{cam}}$:

$$\mathbf{q}_{i,k} = (u_k, v_k, i, i, i), \qquad \mathbf{q}_{j,k} = (u_k, v_k, i, i, j) \tag{3}$$

These describe the same 3D points, written in two different coordinate frames. So the two point sets differ by exactly one rigid motion, a rotation and a translation. Recovering it is a classic problem with a closed-form answer: Umeyama's algorithm from 1991. Stack the two clouds, center them, and take the singular value decomposition of their $3 \times 3$ cross-covariance matrix. The rotation reads straight off the SVD (with a small determinant correction so you get a proper rotation and not a mirror image), and the translation follows from the centroids. That rotation and translation are the relative camera pose.

Figure 4 · pose from two decodings

The same points, decoded in frame i's coordinates and in frame j's. They differ by one rigid transform. Umeyama's algorithm reads that transform off an SVD of their cross-covariance; applying it lands one cloud on the other and the misalignment falls to zero. The recovered rotation and translation are the relative camera pose.
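The alignment step is short enough to write out. This is a minimal NumPy sketch of the rigid (rotation plus translation, no scale) case of Umeyama's method, run on synthetic points. The function name and the test data are made up for the example, and it is not the authors' implementation.

```python
import numpy as np


def rigid_align(src, dst):
    """Least-squares R, t such that dst ≈ src @ R.T + t (no scale factor)."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    # 3x3 cross-covariance of the centered clouds.
    H = (dst - mu_dst).T @ (src - mu_src) / len(src)
    U, _, Vt = np.linalg.svd(H)
    # Determinant correction: flip the last axis if U @ Vt would be a reflection.
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ S @ Vt
    t = mu_dst - R @ mu_src
    return R, t


# Synthetic check: the same points written in two frames related by a known motion.
rng = np.random.default_rng(0)
points_i = rng.normal(size=(16, 3))                 # stand-in for points decoded with t_cam = i
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.1, -0.2, 0.3])
points_j = points_i @ R_true.T + t_true             # stand-in for points decoded with t_cam = j

R_hat, t_hat = rigid_align(points_i, points_j)
```

The recovered transform maps frame-$i$ coordinates to frame-$j$ coordinates. Whether that transform or its inverse is called "the pose" depends on the convention in use.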

Camera intrinsics, the focal length, come out by inverting the pinhole camera equation. A pinhole camera projects a 3D point $\mathbf{P} = (p_x, p_y, p_z)$ to a pixel by $u = f_x\,(p_x/p_z) + c_x$ and $v = f_y\,(p_y/p_z) + c_y$. Assume the principal point sits at the image center, $(c_x, c_y) = (0.5, 0.5)$ in normalized coordinates, and solve for the focal length:

$$f_x = p_z\,(u - 0.5)\,/\,p_x, \qquad f_y = p_z\,(v - 0.5)\,/\,p_y \tag{4}$$

Each decoded point gives one estimate of the focal length. D4RT takes the median over the grid of points for robustness. So a model that was only ever asked to report 3D positions of points yields full camera calibration as a side effect, with no calibration head anywhere. (The assumptions are the usual pinhole ones: a centered principal point and no lens distortion. Distortion can be handled by adding a nonlinear refinement on top.)
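A small sketch of the per-point estimate and the median, on synthetic data. The point values, the focal lengths of 1.2 and 1.6, and the helper names are invented for the example; one pixel is deliberately corrupted to show why a median is the safer choice.

```python
from statistics import median


def project(point, fx, fy, cx=0.5, cy=0.5):
    """Pinhole projection of a 3D point to normalized pixel coordinates."""
    px, py, pz = point
    return (fx * px / pz + cx, fy * py / pz + cy)


def estimate_focal(points, pixels, principal_point=(0.5, 0.5)):
    """Invert the pinhole equation per point, then take the median estimate."""
    cx, cy = principal_point
    fx = [pz * (u - cx) / px
          for (px, py, pz), (u, v) in zip(points, pixels) if abs(px) > 1e-9]
    fy = [pz * (v - cy) / py
          for (px, py, pz), (u, v) in zip(points, pixels) if abs(py) > 1e-9]
    return median(fx), median(fy)


points = [(-0.4, 0.3, 2.0), (0.2, -0.1, 3.0), (0.5, 0.4, 4.0),
          (-0.3, -0.2, 2.5), (0.1, 0.25, 5.0)]
pixels = [project(p, fx=1.2, fy=1.6) for p in points]
pixels[0] = (pixels[0][0] + 0.2, pixels[0][1])   # corrupt one sample on purpose

fx_hat, fy_hat = estimate_focal(points, pixels)
```

Points with $p_x$ or $p_y$ near zero are skipped because the division is ill-conditioned there. That guard is this sketch's choice, not something the paper specifies.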

Teaching it to point

Training is supervised regression, which is about as plain as it gets. Sample a batch of queries whose true 3D answers are known from the training data, decode them, and penalize the error. The main term is an $L_1$ loss on the 3D point, borrowed in spirit from DUSt3R, with two wrinkles. It is one term of a larger weighted loss:

$$\mathcal{L}_{\text{point}} = \sum_i \Big(\, c_i\,\big\lVert\, \psi(\hat{\mathbf{P}}_i / \bar{z}) - \psi(\mathbf{P}_i / z) \,\big\rVert_1 \; - \; \lambda_{\text{conf}} \log c_i \Big), \qquad \psi(x) = \operatorname{sign}(x)\,\log(1 + |x|) \tag{5}$$

The first wrinkle is the normalization. Predicted and target point sets are each divided by their own mean depth ($\bar{z}$ and $z$) before comparison, then squashed by $\psi(x) = \operatorname{sign}(x)\log(1+|x|)$. Dividing by mean depth makes the loss scale-invariant, which it has to be, because from one video the absolute scale of the world cannot be recovered. The log squash then keeps a few very distant points from dominating the error. (This is also why the benchmarks align predictions to ground truth by a single scale, or a scale and a shift, before scoring. The geometry is correct up to that global factor, not in literal meters.)

The second is the confidence $c_i$, a weighting from Kendall and Gal. The model predicts, per point, a confidence value for its own answer. That confidence multiplies the error, so a confident wrong answer incurs a large penalty, while the term $-\lambda_{\text{conf}} \log c_i$ penalizes low confidence, so the model cannot reduce the loss by marking every point uncertain. The model does best by being confident only where it is right. In practice this term made camera pose estimation work: turning it off more than doubled the translation error. Pose needs this more than depth does for a specific reason: pose is read off by aligning whole point clouds (the Umeyama step above), and one confidently wrong point can swing the entire alignment, so down-weighting the low-confidence points lets the fit lean on the high-confidence ones; a depth map, read out pointwise, inherits only one bad pixel.
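Both wrinkles can be checked numerically. The sketch below implements the point term as the equation writes it, in plain Python. The weight `lam=0.2`, the helper names and the sample points are assumptions for illustration, and the paper's full loss has further terms.

```python
import math


def psi(x):
    """Signed log squash: sign(x) * log(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)


def mean_depth(points):
    return sum(p[2] for p in points) / len(points)


def point_loss(pred, target, conf, lam=0.2):
    """Confidence-weighted L1 on mean-depth-normalized, log-squashed 3D points."""
    z_pred, z_tgt = mean_depth(pred), mean_depth(target)
    total = 0.0
    for p_hat, p, c in zip(pred, target, conf):
        l1 = sum(abs(psi(a / z_pred) - psi(b / z_tgt)) for a, b in zip(p_hat, p))
        total += c * l1 - lam * math.log(c)   # low confidence pays through -log(c)
    return total


target = [(0.1, 0.2, 1.0), (-0.3, 0.4, 2.0), (0.5, -0.6, 3.0)]
scaled = [tuple(3.0 * x for x in p) for p in target]      # same shape, 3x the scale

loss_scaled = point_loss(scaled, target, [1.0, 1.0, 1.0])
loss_hedged = point_loss(scaled, target, [0.5, 0.5, 0.5])
```

A prediction that is right up to a global scale gives `loss_scaled` of zero up to rounding, which is the scale invariance. Halving every confidence on that same prediction makes `loss_hedged` positive, which is the penalty for being needlessly unsure.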

A handful of auxiliary losses ride alongside, each a small linear head on the decoder output: an $L_1$ loss on the point's 2D image position, a cosine loss on surface normals, a binary cross-entropy on whether the point is visible or occluded, and an $L_1$ on its motion. They are applied only where ground truth exists. The ablations show 2D position and normals sharpen depth the most, and confidence improves pose the most.

Each query also carries an embedding of the local 9×9 RGB patch around its source pixel, alongside the coordinates and time indices. That small window of raw color is, empirically, the most useful single addition to a query. It gives the decoder a low-level appearance fingerprint to match against $F$, and it carries the edges and texture needed to nail an object boundary. Most prior methods need dedicated machinery for this (skip connections from encoder to decoder, in the style of DPT, the Dense Prediction Transformer); here it is a few extra numbers in the query.

Figure 5 · the local patch sharpens everything

Without the local 9×9 patch, depth can only be as sharp as the encoder's coarse token grid, and the outline goes blocky. With it, the query reads detail straight from the source frame and the boundary snaps into focus. On Sintel the depth error drops from 0.366 to 0.302 and the camera pose error (ATE) from 0.173 to 0.091.
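Extracting the patch itself is simple. This NumPy sketch cuts a 9×9 window around a normalized source pixel. The edge padding at image borders and the rounding of $(u, v)$ to a pixel index are assumptions, since the text does not say how borders are handled, and the learned embedding of the patch is not shown.

```python
import numpy as np


def local_patch(frame, u, v, size=9):
    """Return the size x size RGB window centered on normalized pixel (u, v)."""
    height, width, _ = frame.shape
    radius = size // 2
    col = min(int(u * width), width - 1)     # u runs along the horizontal axis
    row = min(int(v * height), height - 1)   # v runs along the vertical axis
    # Assumed border handling: repeat edge pixels so the window is always full-size.
    padded = np.pad(frame, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    return padded[row:row + size, col:col + size]


frame = np.arange(6 * 8 * 3).reshape(6, 8, 3)   # toy 6x8 RGB frame
corner_patch = local_patch(frame, 0.0, 0.0)
center_patch = local_patch(frame, 0.5, 0.5)
```

The window is 9 × 9 × 3 = 243 raw values per query, small next to the scene representation it is matched against.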

The model is trained end to end in a little over two days on 64 TPU chips, on 48-frame clips at 256×256, decoding 2048 random queries per step across a mixture of synthetic and real datasets. Because supervision only needs a sparse set of decoded queries, training is cheap even though the outputs can be dense.

Tracking every pixel without redundant work

The independent query enables the central capability: a dense, holistic reconstruction where every pixel, static or moving, gets a full trajectory. That fills in the gaps a tracker leaves when an object is hidden in the first frame and shows up only later, or is occluded mid-video and then reappears.

Say there is a speck on the swan's wing, clearly visible in frames 1 through 12, hidden behind a reed from frame 13 to frame 18, and visible again from frame 19 on. A frame-to-frame tracker, the usual design, carries the point forward by matching it from one frame to the next. At frame 13 the speck is gone, the match fails, and the tracker either freezes the point on the reed or drops it; when the speck comes back at frame 19 the tracker has no idea it is the same speck. D4RT does not chain frame to frame. To get the speck's position at the hidden frame 15, you pin its source pixel $(u, v)$ in frame 1, set $t_{\text{tgt}} = 15$ to ask where that physical point is at frame 15, and read it in some visible frame's coordinates with $t_{\text{cam}}$. The decoder answers that single $(\text{point}, \text{time})$ question against the one frozen scene representation $F$, which was built from the whole video at once, including the frames where the speck reappears. So it returns a position for frame 15 even though nothing in frame 15 itself shows the speck, the same way it would for a visible frame. (As everywhere here, that position is recovered up to the video's global scale, not in literal meters.) The occlusion is only a problem for a method that propagates point positions frame by frame; a method that queries each moment independently has no chain to break.

But the naive version does far too much work. Tracking every pixel of every frame across the whole video is $O(T^2 H W)$ queries, and almost all of them are redundant. A point you tracked from frame 1 already tells you where its pixel went in frames 2 through $T$, so re-tracking those pixels from scratch is wasted work.

D4RT keeps an occupancy grid $G \in \{0,1\}^{T \times H \times W}$, one bit per spatio-temporal pixel, marking whether it has already been explained. The algorithm only ever starts a new track from an unvisited pixel. Each full-video track, once decoded, marks every pixel it visibly passes through as covered, so a single track claims a large swath of the video at once and the query budget is spent only on pixels no earlier track has claimed:

# Algorithm 1: track all pixels without redundant work.
# (encoder, decoder, zeros, sample_unvisited and visible are placeholders, not a real API.)
def track_all_pixels(video, T, H, W):
    F = encoder(video)                  # encode once
    G = zeros(T, H, W)                  # occupancy grid of visited pixels
    tracks = []
    while not G.all():                  # while any pixel is still unvisited
        for (u, v, t_src) in sample_unvisited(G):    # a batch, in parallel
            q = [(u, v, t_src, k, k) for k in range(1, T + 1)]   # one full-video track
            pts = [decoder(qk, F) for qk in q]
            G[visible(pts)] = 1         # mark every pixel it visibly passes through,
                                        # including the source pixel itself
            tracks.append(pts)
    return tracks

Figure 6 · the occupancy grid

One frame's pixels. Each new track starts from an unvisited pixel and immediately marks a whole connected patch covered, because it visibly passes through those pixels across the video. A handful of tracks blankets the frame. The paper reports a 5–15× speedup over the naive sweep, depending on how much the scene moves.
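The bookkeeping can be exercised on a toy scene with no model at all. In the sketch below the "decoder" is a stand-in that drifts every point one pixel to the right per frame, visibility means only "still inside the image", frames are 0-indexed, and unvisited pixels are scanned in order rather than sampled in parallel batches. All of these are simplifying assumptions, so the resulting ratio says nothing about the paper's measured speedup.

```python
def shift_right_track(x, y, t_src, num_frames):
    """Toy stand-in for the decoder: every point drifts one pixel right per frame."""
    return [(k, x + (k - t_src), y) for k in range(num_frames)]


def track_all_pixels(num_frames, height, width, decode_track):
    """Start a track only from pixels no earlier track has already covered."""
    visited = [[[False] * width for _ in range(height)] for _ in range(num_frames)]
    tracks = []
    for t_src in range(num_frames):
        for y in range(height):
            for x in range(width):
                if visited[t_src][y][x]:
                    continue                      # already explained by an earlier track
                track = decode_track(x, y, t_src, num_frames)
                for k, xk, yk in track:
                    if 0 <= xk < width and 0 <= yk < height:   # in view at frame k
                        visited[k][yk][xk] = True
                tracks.append(track)
    return tracks, visited


T, H, W = 8, 4, 8
tracks, visited = track_all_pixels(T, H, W, shift_right_track)
naive_tracks = T * H * W          # one track per pixel per frame
print(len(tracks), "tracks instead of", naive_tracks)
```

In this toy scene every pixel ends up covered by 60 tracks where the naive sweep would start 256, because each track accounts for a whole diagonal of space-time pixels at once.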

This only works because the decoder is both sparse (you can ask for one point) and lightweight (asking is cheap). Methods with dense per-frame decoders are stuck paying the full cost; methods with heavy sparse decoders pay too much per query to do this at scale.

Faster, and state of the art

D4RT sets a new state of the art on dynamic 4D reconstruction and tracking, and it is far faster. On TAPVid-3D, the standard 3D point-tracking benchmark, it leads on the primary 3D Average Jaccard metric (AJ, a combined position-and-visibility accuracy score, averaged over distance thresholds) on most subsets, and on all of them when given ground-truth intrinsics. Without intrinsics, SpatialTrackerV2 edges it on one subset (ADT). On point-cloud reconstruction it beats the strong static models π³ and VGGT, cutting the Sintel point-cloud error from above 1.1 down to 0.77. On video depth it is top-tier across Sintel, ScanNet, KITTI and Bonn, and on camera pose it leads across Sintel, ScanNet and Re10K, with the largest gaps on the dynamic Sintel sequences.

Because each point is a tiny parallel cross-attention and there is no iterative refinement, D4RT produces far more tracks per second than the dense or iterative baselines: 18 to 300 times faster, and two orders of magnitude more throughput than MegaSaM on camera pose.

Figure 7 · tracks per second

How many full-video tracks each model can produce while hitting a target frame rate (log scale, because the gap is that large). D4RT leads DELTA and SpatialTrackerV2 by a wide margin at every rate. At 24 FPS it produces 1,570 tracks to DELTA's 5. The cost scales with the points you ask for, not with the frame.

A few caveats sit under those numbers. Every comparison is after a global alignment of scale, because monocular geometry is only ever recovered up to that scale. Camera pose is reported after a similarity alignment to the ground truth. And the flagship encoder is large, a billion-parameter ViT-g; the ablations show quality climbing steadily on depth and rotation error from ViT-B to ViT-g, so a good chunk of the result comes from the size of the backbone. None of this undercuts the contribution, a unified, fast interface that matches or beats specialized pipelines.

The design carries real limits, and the paper states them where it can. The intrinsics recovery assumes a centered principal point and no distortion, handled only by an extra refinement step. Long videos are processed by splitting into overlapping segments and stitching them with the same Umeyama alignment, rather than in one shot. And the scale ambiguity is structural, not a bug to be fixed: one camera, one video, no way to know if you are watching a real swan or a model of one.

The pipeline of specialists was always an artifact of decoding whole frames. Compress the video once, make the only operation a query that names a point in space and time and returns its 3D position, and depth, point clouds, tracks, and cameras stop being separate problems. They are patterns of that one question.

Provenance: verified against primary literature

SRT (2022): Encode once into a scene latent, then decode each query independently by cross-attention.

VGGT (2025): The alternating frame-wise / global self-attention video encoder.

DUSt3R (2024): Scale-normalized 3D point regression: divide each point set by its mean depth before the L1 loss.

Kendall & Gal (2017): The −log(c) confidence weighting that lets the model down-weight low-confidence points.

Umeyama (1991): Closed-form rigid transform between two point sets via an SVD of their cross-covariance.

Fourier features (2020): Sinusoidal coordinate embedding so the decoder can resolve fine spatial detail.

TAPVid-3D (2024): The AJ / APD₃D / OA tracking metrics, evaluated with and without ground-truth intrinsics.

Caveat: From one video, absolute scale is unknowable. D4RT predicts geometry up to a global scale (and often a shift): every benchmark aligns the prediction to ground truth before scoring. We say "up to scale" where the paper’s evaluation protocol implies it, rather than calling the output metric.

Questions you might still have

If every point is decoded on its own, how do they stay consistent?
They all read from the same fixed scene representation F. Consistency lives in F, not in any link between queries. Decoding independently even helps: letting queries attend to each other caused large drops in early experiments, so they are kept apart.

Is the output metric, in real meters?
No. From a single video the absolute scale is ambiguous. D4RT predicts geometry up to a global scale, and often an extra shift, which is why every benchmark aligns the prediction to ground truth before scoring. Feeding in known intrinsics pins it down further, but the headline numbers normalize scale.

What is the difference between t_tgt and t_cam?
t_tgt is when the point is, so it slides along the point’s trajectory through time. t_cam is which frame’s camera coordinate system you read that position in. Decoupling them lets one query place a moving point in any frame’s coordinates.

Why is it so much faster than a tracker like SpatialTrackerV2?
No iterative refinement and no dense per-frame decoder. Each point is a single small cross-attention into a representation computed once, and the queries run in parallel. You pay for exactly the points you ask for, not for the whole frame.

Footnotes & further reading

The paper: Zhang, Le Moing, Koppula, Rocco, Momeni, Xie, Sun, Sukthankar, Barral, Hadsell, Ghahramani, Zisserman, Zhang, Sajjadi, Efficiently Reconstructing Dynamic Scenes One D4RT at a Time (Google DeepMind / Oxford / UCL, 2025). Project page with animated results.
The encode-once, decode-by-query design comes from Sajjadi et al., Scene Representation Transformer (CVPR 2022). The alternating frame-wise / global attention encoder is from Wang et al., VGGT (CVPR 2025), which predicts geometry with separate heads but produces no dynamic correspondences.
The scale-normalized point regression follows Wang et al., DUSt3R (CVPR 2024); the −log(c) confidence weighting follows Kendall & Gal, What Uncertainties Do We Need in Bayesian Deep Learning? (NeurIPS 2017). Fourier feature coordinate embeddings: Tancik et al., Fourier Features Let Networks Learn High Frequency Functions (NeurIPS 2020).
The rigid-transform recovery is Umeyama, Least-Squares Estimation of Transformation Parameters Between Two Point Patterns (IEEE PAMI 1991): rotation from the SVD of the cross-covariance, with a determinant correction so the result is a proper rotation rather than a reflection.
The tracking benchmark and its AJ / APD₃D / OA metrics: Koppula et al., TAPVid-3D (NeurIPS 2024). The baselines compared against include MegaSaM, π³, SpatialTrackerV2, St4RTrack, and DELTA.
