
Native and Compact Structured Latents for 3D Generation

Record where a shape's surface crosses each voxel, not whether each point is inside or outside, and any topology fits.

A field-free voxel grid captures open, hollow, and tangled shapes along with their material, and a sparse autoencoder squeezes a full 1024³ asset into about 9,600 tokens, small enough to generate from an image in seconds.

Explaining the paper: Native and Compact Structured Latents for 3D Generation. Microsoft (TRELLIS.2 team) · arXiv · 2025 · arXiv:2512.14692

A 3D generator can only produce what its shape format can represent, and until this paper that format could hold only clean, closed, opaque shapes.

Image generators had it easy. A picture is already a grid of numbers, so a network can read it and write it with no translation step. A 3D asset is not a grid of numbers. It is a mesh: a soup of triangles, with a texture image wrapped on by a separate set of coordinates, and maybe material settings layered on top of that. Before any network can learn from it, somebody has to decide how to turn that soup into numbers a network can process and, just as importantly, turn the numbers back into a usable asset. That choice of representation is the bottleneck, and most of the recent progress in 3D generation has come from chipping away at it.

This paper, the system its authors call TRELLIS.2 (the sequel to TRELLIS), attacks the representation head-on. It introduces a new way to hold a 3D asset called O-Voxel (short for omni-voxel, because it holds geometry and appearance together), designed so that two things are true at once: it can represent any shape, including the awkward ones earlier formats had to discard, and it can be squeezed into a small enough latent that a large generator can produce a fully textured, relightable asset from a single image in a few seconds. To see why both halves are hard, and how the paper gets them together, we have to start with the format everyone else uses and the shapes it cannot hold.

Why a field can't hold every shape

The dominant way to hand a shape to a neural network is an iso-surface field. Pick a function over all of space that is negative inside the object and positive outside, the canonical one being a signed distance function (SDF), which returns how far you are from the surface with a sign for which side you are on. The surface is wherever the function crosses zero. Sample that field on a regular grid and you get the 3D analog of an image, a dense grid of numbers, which is why generators reach for it.
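To make the "dense grid of numbers" concrete, here is a minimal sketch that samples the signed distance of a sphere at the cell centers of a small grid. The sphere, the grid size, and the helper names (`sphere_sdf`, `sample_grid`) are illustrative assumptions, not anything from the paper.

```python
import math

def sphere_sdf(p, center=(0.0, 0.0, 0.0), radius=0.5):
    """Signed distance to a sphere: negative inside, positive outside."""
    return math.dist(p, center) - radius

def sample_grid(sdf, n=8, lo=-1.0, hi=1.0):
    """Sample the field at the n^3 cell centers of a regular grid."""
    step = (hi - lo) / n
    coords = [lo + (i + 0.5) * step for i in range(n)]
    return [[[sdf((x, y, z)) for z in coords] for y in coords] for x in coords]

grid = sample_grid(sphere_sdf)

# The surface is implied by the sign change: cells with a negative value are inside.
inside_cells = sum(v < 0 for plane in grid for row in plane for v in row)
```

The surface itself is never stored; it exists only as the place where neighboring samples change sign, which is why the format needs an inside to begin with.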

Everything turns on the word signed. A sign requires a consistent answer to "inside or outside?" at every point, and that answer only exists for a closed, watertight surface that cleanly separates an inside from an outside. Three common kinds of geometry break that assumption. An open surface, like a leaf, a sheet of cloth, or a coat, is infinitely thin and has no interior, so there is no inside for the sign to mark. A non-manifold shape, where surfaces meet along an edge or a point instead of tiling cleanly (think of three walls meeting at a corner, or two cones joined at their tips), has places where "which side" has no single answer. And a fully-enclosed interior, like the seats sealed inside a car body or a smaller shape nested inside a larger shell, is invisible to a field that only encodes the outermost boundary.

The usual fix makes things worse, not better. To force a messy asset into a signed field, pipelines run a watertight repair: flood-fill or winding-number heuristics that seal every gap. Repair does not just hide the awkward parts. It actively rewrites the geometry. An open leaf gets inflated into a solid slab so it has an inside to be negative in; a hollow interior gets filled with concrete; a thin gap snaps shut. By the time the shape is "clean" enough to be a field, it is a different shape. A thermometer makes the picture concrete: a signed field reads cold inside the body and hot outside, and the surface is where cold meets hot. A bedsheet on a line has no inside to be cold, so the thermometer has nothing to read.

O-Voxel throws out the field. Instead of asking "is this point inside or outside," it reads where the actual surface crosses the edges of a voxel grid, a purely local fact that does not care whether the surface ever closes. The two readings come apart the moment a shape opens up, which is the motivation for the design. Drag the gap open below and watch the signed field collapse while the edge-crossing reading holds:

Figure 1 · field vs field-free


The same loop-shaped surface in the same grid. Left, a signed field floods the outside in from the border and marks the enclosed cells as inside. Right, O-Voxel marks the voxel edges the surface crosses and the active voxels. Open the gap and the left panel's interior leaks out, so "inside" drops to zero; the right panel captures the open arc unchanged. The signed-field panel shows the failure a field cannot avoid on an open surface.
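The failure in the figure can be reproduced in a few lines. The sketch below is a 2D, cell-level toy under stated assumptions: the surface is rasterized as a set of grid cells (a simplification of O-Voxel's edge crossings), and `ring` and `count_inside` are hypothetical helper names. It floods the outside in from the border and counts the empty cells the flood never reaches.

```python
from collections import deque

def ring(n=7, gap=False):
    """Surface cells of a square loop on an n x n grid; optionally cut a gap."""
    cells = {(x, y) for x in range(1, n - 1) for y in range(1, n - 1)
             if x in (1, n - 2) or y in (1, n - 2)}
    if gap:
        cells.discard((1, n // 2))  # remove one cell so the loop is open
    return cells

def count_inside(surface, n=7):
    """Flood the outside from the border; unreached empty cells count as inside."""
    start = (0, 0)
    seen, queue = {start}, deque([start])
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            cell = (nx, ny)
            if 0 <= nx < n and 0 <= ny < n and cell not in surface and cell not in seen:
                seen.add(cell)
                queue.append(cell)
    return n * n - len(surface) - len(seen)

closed_inside = count_inside(ring())          # the loop encloses a 3 x 3 interior
open_inside = count_inside(ring(gap=True))    # the flood leaks in through the gap
surface_kept = len(ring(gap=True))            # the surface cells are still all there
```

Opening the gap changes the "inside" answer for every enclosed cell at once, while the set of surface cells loses only the one cell that was cut. That locality is the property the field-free reading relies on.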

That picture carries the argument; everything after it is engineering to make a field-free representation trainable, compressible, and generatable. But first, the format O-Voxel is descended from, because every choice it makes is a fix for something its predecessor could not do.

Structured latents, and the blind spot

The direct ancestor is TRELLIS (Xiang et al., 2024), which introduced the structured latent, abbreviated SLAT. The idea is to stop treating a 3D asset as one monolithic blob and instead attach a small latent vector to each of a sparse set of active voxels on a coarse grid. The voxel positions carry the coarse layout of the shape, which voxels are occupied at all, and each attached vector carries the local detail. The design has two useful properties: it is sparse, so it spends capacity only where the object is, and it is structured, so a generator can reason about geometry and detail separately. TRELLIS.2 keeps this skeleton.

But look at where TRELLIS's latent vectors come from. It does not encode the 3D asset directly. It renders the asset from roughly 150 camera views, runs each rendered image through a 2D vision encoder (DINOv2), then projects those 2D image features back onto the active voxels and averages them. The latent is multiview-derived: it only captures what the cameras saw. Anything the cameras could not see, a sealed interior, the back of an occluded part, never enters the latent. And material is reduced to whatever the renders happened to capture under whatever lighting was used, not the intrinsic surface properties you would need to relight the asset later.

The title's word native names exactly that fix. TRELLIS.2 builds the structured latent by encoding raw 3D data, the O-Voxel, with a 3D autoencoder, rather than by photographing the asset from the outside. Native encoding lets the latent carry interiors and true material, the two things a camera rig structurally misses. With that contrast set, we can build O-Voxel itself.

O-Voxel: read the surface, drop the field

An O-Voxel is a sparse set of active voxels on an $N\times N\times N$ grid, each carrying a tuple of features:

$$\boldsymbol{f}=\big\{(\boldsymbol{f}^{\text{shape}}_{i},\,\boldsymbol{f}^{\text{mat}}_{i},\,\boldsymbol{p}_{i})\big\}_{i=1}^{L},\qquad \boldsymbol{p}_{i}\in\{0,1,\dots,N-1\}^{3} \tag{1}$$

Here $\boldsymbol{p}_i$ is the integer grid coordinate of the $i$-th active voxel, $\boldsymbol{f}^{\text{shape}}_i$ describes the local geometry, and $\boldsymbol{f}^{\text{mat}}_i$ the local material. Only voxels the surface passes through are stored; empty space costs nothing. The question is what to put in $\boldsymbol{f}^{\text{shape}}$ so that you can rebuild a real mesh from it, sharp edges and all, without ever computing a field.

The answer comes from a 2002 graphics algorithm, so it helps to know what it improved on. The textbook way to turn a grid into a mesh is Marching Cubes: label each grid corner inside or outside, and wherever an edge has one corner of each kind, drop a vertex on that edge and stitch the vertices into triangles. It always produces a closed mesh, but because every vertex is pinned to an edge, it can never land on a crease. Run it on a cube and you get a cube with shaved, rounded edges, at any resolution.

Dual Contouring (Ju et al., 2002) flips the bookkeeping. Instead of vertices on edges, it places exactly one vertex inside each cell the surface passes through, a dual vertex, and connects the dual vertices of cells that share a crossed edge into quads. Because the vertex now lives inside the cell, free to sit anywhere, it can slide right into a sharp corner. Where it goes is decided by Hermite data: for every edge the surface crosses, you record the crossing point $\boldsymbol{q}_i$ and the surface normal $\boldsymbol{n}_i$ there. Each pair defines a little tangent plane the vertex should lie on, and the vertex is placed where it is collectively closest to all of them, by minimizing a quadratic error function (QEF). TRELLIS.2 keeps this dual-grid machinery but, crucially, drops Dual Contouring's reliance on a signed field: it reads the asset's mesh surface directly to find the crossings and normals. That single change makes it field-free, and its QEF (their "Flexible Dual Grid") adds two terms to the classic one:

$$\min_{\boldsymbol{v}\in\text{cell}}\; e(\boldsymbol{v}) = \underbrace{\sum_i \big(\boldsymbol{n}_i\!\cdot\!(\boldsymbol{v}-\boldsymbol{q}_i)\big)^2}_{\text{align to surface planes}} + \lambda_{\text{bound}}\underbrace{\sum_j d_{L,j}^2}_{\text{open boundaries}} + \lambda_{\text{reg}}\underbrace{\lVert\boldsymbol{v}-\bar{\boldsymbol{q}}\rVert^2}_{\text{stay near center}} \tag{2}$$

The first term is the original Dual Contouring energy, and the detail that matters is that it measures point-to-plane distance, not point-to-point. $\big(\boldsymbol{n}_i\!\cdot\!(\boldsymbol{v}-\boldsymbol{q}_i)\big)^2$ is the squared distance from the vertex to the tangent plane through the crossing, so the vertex is free to slide along a flat surface and only gets penalized for leaving it. Where two planes meet at an angle, the one point flush with both is their intersection, the sharp corner. A point-to-point penalty would instead drag the vertex toward the average crossing and blur every feature. The second term, $d_{L,j}$, is new in this paper: a point-to-line distance to the mesh's open boundary edges, which lets a dual vertex track the rim of an open surface where no "other side" exists. The third, $\lVert\boldsymbol{v}-\bar{\boldsymbol{q}}\rVert^2$, pulls the vertex toward the centroid $\bar{\boldsymbol{q}}$ of the crossings, a stabilizer for cells where the surface data leaves the vertex under-determined.
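A small worked version of Eq. 2 helps here. The sketch below solves the first and third terms in 2D by setting the gradient to zero, which gives the normal equations $(\sum_i \boldsymbol{n}_i\boldsymbol{n}_i^\top + \lambda_{\text{reg}} I)\,\boldsymbol{v} = \sum_i \boldsymbol{n}_i(\boldsymbol{n}_i\cdot\boldsymbol{q}_i) + \lambda_{\text{reg}}\bar{\boldsymbol{q}}$. It is an illustration under assumptions, not the paper's solver: the open-boundary term and the clamp to the cell are left out, and `solve_qef_2d` is a hypothetical name.

```python
def solve_qef_2d(crossings, normals, lam_reg=0.0):
    """Minimize sum_i (n_i . (v - q_i))^2 + lam_reg * |v - q_bar|^2 in 2D.

    Assumption: the open-boundary term of Eq. 2 and the clamp to the cell
    are omitted to keep the sketch short.
    """
    k = len(crossings)
    q_bar = (sum(q[0] for q in crossings) / k, sum(q[1] for q in crossings) / k)

    # Accumulate the symmetric 2x2 system A v = b.
    a00 = a01 = a11 = b0 = b1 = 0.0
    for (qx, qy), (nx, ny) in zip(crossings, normals):
        d = nx * qx + ny * qy          # plane offset n . q
        a00 += nx * nx
        a01 += nx * ny
        a11 += ny * ny
        b0 += nx * d
        b1 += ny * d
    a00 += lam_reg
    a11 += lam_reg
    b0 += lam_reg * q_bar[0]
    b1 += lam_reg * q_bar[1]

    det = a00 * a11 - a01 * a01
    if abs(det) < 1e-12:
        return q_bar                   # under-determined: fall back to the centroid
    return ((a11 * b0 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det)

# A sharp corner at (0.7, 0.6): one vertical and one horizontal surface plane.
crossings = [(0.7, 0.0), (0.0, 0.6)]
normals = [(1.0, 0.0), (0.0, 1.0)]
sharp = solve_qef_2d(crossings, normals, lam_reg=0.0)
rounded = solve_qef_2d(crossings, normals, lam_reg=1.0)
```

With `lam_reg=0.0` the vertex sits on the intersection of the two planes; with `lam_reg=1.0` it moves halfway toward the centroid of the crossings.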

The regularizer is the knob to try by hand, because it controls the exact tradeoff between sharp and smooth. Drag $\lambda_{\text{reg}}$ below: at zero, the vertex lands on the true corner and the reconstruction is crisp; raise it and the vertex slides toward the centroid and the corner rounds off.

Figure 2 · placing the dual vertex

One cell, a surface passing through as a sharp corner. The two crossings carry surface normals (the amber arrows); the dual vertex minimizes the QEF. With $\lambda_{\text{reg}}=0$ it sits exactly on the true corner and the teal reconstruction is sharp. Raise $\lambda_{\text{reg}}$ and it slides toward the centroid $\bar{q}$, and the corner rounds off. Only point-to-plane alignment makes that sharp setting possible.

So $\boldsymbol{f}^{\text{shape}}_i$ ends up holding three things per voxel: the dual vertex $\boldsymbol{v}_i\in[0,1]^3$ (where the surface sits within the cell), three edge-intersection flags $\boldsymbol{\delta}_i\in\{0,1\}^3$ (which of the cell's three canonical edges the surface crosses, telling you how to connect neighbors into faces), and a splitting weight $\gamma_i$ that decides how each quad is cut into two triangles. (Connecting dual vertices yields four-sided quads, and a quad can be cut along either diagonal; the weight picks the cut that follows the surface.) That last piece is borrowed, and only that piece, from FlexiCubes. It is worth being precise about what FlexiCubes does and does not give you, because its reputation oversells it for this purpose: FlexiCubes is still a field-based method, and its "arbitrary topology" means an arbitrary number of handles on a closed, watertight surface, not the open and non-manifold shapes O-Voxel is built for. Those come from reading the surface directly, not from FlexiCubes, which contributes only the triangle-splitting rule.

The payoff of reading the surface directly is that conversion is cheap and lossless in both directions. Going mesh to O-Voxel is a few seconds of CPU work, no optimization and no rendering; going back is tens of milliseconds. The whole build, per active voxel, is short enough to read:

# mesh -> O-Voxel  (field-free, one entry per surface voxel)
for cell in voxels_touched_by(mesh):          # only surface voxels activate
    q, n = edge_crossings(cell, mesh)          # Hermite data: points + normals
    v     = solve_qef(q, n, mean(q))           # dual vertex (Eq 2), point-to-plane
    delta = crossed_edges(cell, mesh)          # 3 face flags in {0,1}
    gamma = split_weights(cell)                # quad -> triangle rule (FlexiCubes)
    c,m,r,a = sample_material(cell, texture)   # color, metallic, rough, opacity
    store(cell.pos, shape=(v, delta, gamma), mat=(c, m, r, a))
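The `store` call in the loop above implies a sparse container keyed by integer voxel position. One way to sketch it in Python follows; the class and field names (`OVoxelEntry`, `OVoxelGrid`) are hypothetical, and the paper's actual storage layout may differ, for example by using packed tensors instead of a dictionary.

```python
from dataclasses import dataclass, field

@dataclass
class OVoxelEntry:
    """Features of one active voxel (hypothetical layout)."""
    dual_vertex: tuple    # v in [0,1]^3, position of the surface within the cell
    edge_flags: tuple     # delta in {0,1}^3, which canonical edges are crossed
    split_weight: float   # gamma, chooses the diagonal when a quad is triangulated
    base_color: tuple     # c in [0,1]^3
    metallic: float       # m in [0,1]
    roughness: float      # r in [0,1]
    opacity: float        # alpha in [0,1]

@dataclass
class OVoxelGrid:
    """Sparse N x N x N grid: only voxels the surface passes through are stored."""
    resolution: int
    voxels: dict = field(default_factory=dict)   # (x, y, z) -> OVoxelEntry

    def store(self, pos, entry):
        if len(pos) != 3 or not all(0 <= c < self.resolution for c in pos):
            raise ValueError(f"position {pos} is outside the grid")
        self.voxels[tuple(pos)] = entry

grid = OVoxelGrid(resolution=4)
grid.store((1, 2, 3), OVoxelEntry(
    dual_vertex=(0.5, 0.25, 0.75), edge_flags=(1, 0, 1), split_weight=0.5,
    base_color=(0.8, 0.6, 0.2), metallic=1.0, roughness=0.3, opacity=1.0))
```

Empty space costs nothing in this layout because a voxel that the surface never touches simply has no key.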

Appearance: six numbers per voxel

Geometry is half of an asset. The other half is how it looks, and O-Voxel stores that in the same place, attached to the same voxels, as six numbers:

$$\boldsymbol{f}^{\text{mat}}_{i}=(\boldsymbol{c}_{i},\,m_{i},\,r_{i},\,\alpha_{i}),\qquad \boldsymbol{c}_i\in[0,1]^3,\;\; m_i,r_i,\alpha_i\in[0,1] \tag{3}$$

This is the standard physically-based rendering (PBR) metallic-roughness material, the same one game engines and glTF files use, so an O-Voxel asset drops straight into a normal renderer. The three channels of $\boldsymbol{c}$ are the base color, $m$ is metallic, $r$ is roughness, and $\alpha$ is opacity. Each cashes out into how light leaves the surface. Roughness sets how spread out the reflection is: near 0 the surface is a mirror and the highlight is a tight bright dot, near 1 it is matte and the highlight is a broad dim sheen. Metallic is close to a switch rather than a dial: at 0 the surface is a dielectric (plastic, wood, stone) with a colored diffuse body and a small white highlight, and at 1 it is a conductor (gold, steel) with no diffuse body at all, where the base color instead tints the reflection itself. This dual role of the base color, diffuse color for non-metals and reflection tint for metals, is specific to the metallic-roughness model. Opacity, finally, lets O-Voxel hold translucent surfaces like glass, which earlier shape-only formats had no channel for.

Because these are intrinsic surface properties and not lighting baked into a texture, the asset can be relit. Drive metallic, roughness, and opacity below, and let the orbiting light show the same stored material re-shading correctly under new illumination, which a baked-in texture could never do.

Figure 3 · PBR material


A sphere lit by a slowly orbiting light. Metallic turns it from plastic to gold; roughness swings the highlight from a tight mirror dot to a broad matte sheen; opacity makes it translucent so the checker shows through. The highlight tracks the moving light because the material is stored intrinsically, not baked in. (Shading is illustrative, not a full BRDF.)

One subtlety the figure smooths over, because it matters when you go to render for real: the "roughness" a renderer feeds its reflection model is usually $r^2$, not $r$, a reparameterization that keeps the slider feeling linear to the eye. The sphere here uses roughness directly for the highlight size, which is fine as intuition, but since $r^2 < r$ for every value strictly between 0 and 1, skipping the squaring makes mid-range roughness look broader and blurrier than a physically exact pipeline would render it.
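How the stored numbers feed a shading model can be sketched as follows. The mapping follows the glTF 2.0 metallic-roughness convention as commonly stated (diffuse fades to black with metallic, specular reflectance interpolates from 0.04 to the base color, and the microfacet width is roughness squared); treating that as what TRELLIS.2's renderer does is an assumption, and `metallic_roughness_terms` is a hypothetical name.

```python
def metallic_roughness_terms(base_color, metallic, roughness, dielectric_f0=0.04):
    """Split metallic-roughness inputs into the terms a shading model consumes.

    Assumption: glTF-style convention with a fixed 4% reflectance for dielectrics.
    """
    # Metals have no diffuse body, so diffuse color fades to black as metallic -> 1.
    diffuse = tuple(c * (1.0 - metallic) for c in base_color)
    # Reflectance at normal incidence: 0.04 for dielectrics, base color for metals.
    f0 = tuple(dielectric_f0 * (1.0 - metallic) + c * metallic for c in base_color)
    # The reflection model takes alpha = r^2, not the slider value r.
    alpha = roughness ** 2
    return diffuse, f0, alpha

gold = (1.0, 0.77, 0.34)
plastic_terms = metallic_roughness_terms(gold, metallic=0.0, roughness=0.5)
metal_terms = metallic_roughness_terms(gold, metallic=1.0, roughness=0.5)
```

The same base color ends up in the diffuse slot for the plastic and in the reflectance slot for the metal, which is the dual role described above.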

Squeezing a grid into 9,600 tokens

O-Voxel can describe any asset, but at full resolution it is far too big to generate directly. A $1024^3$ grid has a billion cells; even sparsely populated, the active voxels number in the hundreds of thousands. No generator can afford to emit that many tokens. So, just as image diffusion learned to run in a compressed latent rather than on raw pixels, TRELLIS.2 trains a Sparse Compression VAE (SC-VAE) to encode the O-Voxel into a far smaller latent and decode it back. The headline is a $16\times$ spatial reduction: a fully-textured $1024^3$ asset becomes roughly 9,600 latent tokens with little visible loss.

Two design problems stand between that goal and a working autoencoder, and the SC-VAE has a specific answer for each. The first is that 3D data is almost all empty, so an ordinary 3D convolution wastes nearly all its work on void. Submanifold sparse convolution answers that: it computes only at active voxels, and, the defining rule, it keeps an output voxel active only if the central input voxel was active. That freezes the set of active sites instead of letting it grow. An ordinary sparse convolution activates a voxel if any input in its window was active, so a thin shell bloats outward layer after layer (one voxel becomes 27, then 125, in 3D), and the data stops being sparse. The submanifold rule keeps a deep sparse network on a thin surface thin.
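The difference between the two rules shows up if you track only which sites are active and ignore the feature values. The sketch below assumes a 3×3×3 kernel with stride 1; the function names are illustrative.

```python
from itertools import product

# Offsets covered by a 3 x 3 x 3 kernel window.
OFFSETS = list(product((-1, 0, 1), repeat=3))

def regular_active(active):
    """Ordinary sparse conv: an output site is active if any input in its window is."""
    return {(x + dx, y + dy, z + dz) for (x, y, z) in active for dx, dy, dz in OFFSETS}

def submanifold_active(active):
    """Submanifold rule: an output site is active only if its center input was."""
    return set(active)

seed = {(0, 0, 0)}
after_one = regular_active(seed)
after_two = regular_active(after_one)
sub_after_two = submanifold_active(submanifold_active(seed))
```

Starting from one voxel, the ordinary rule produces 27 and then 125 active sites after two layers, while the submanifold rule stays at one.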

The second problem is harder, and it is the one the ablation makes vivid. Push spatial compression high enough and the autoencoder stops training well. This is not a capacity limit, since a network of the same latent size has perfectly good solutions and the same architecture trains fine at lower compression; it is an optimization problem, those good solutions are just hard to reach. TRELLIS.2 borrows the answer from DC-AE: a residual autoencoding shortcut. When you downsample by a factor of two, instead of letting a learned layer figure out how to merge a voxel's eight children, you first stack those eight children into the channel dimension and average them into the coarse feature, a fixed, non-parametric reshuffle of space into channels:

$$\boldsymbol{F}^{\text{raw}}_{\text{coarse}}=\operatorname{stack}\!\big(\boldsymbol{F}_{\text{child}_1},\dots,\boldsymbol{F}_{\text{child}_8}\big)\in\mathbb{R}^{8C},\qquad \boldsymbol{F}_{\text{coarse}}=\operatorname{avg\_groups}\!\big(\boldsymbol{F}^{\text{raw}}_{\text{coarse}}\big)\in\mathbb{R}^{C'} \tag{4}$$

Stacking eight children gives $8C$ channels; averaging them in groups collapses that to the coarse width $C'=2C$, twice as wide as one child. Upsampling runs it backward, unstacking the channels into eight children and duplicating within groups to refill the width. The learned layers then only have to predict a residual on top of this reshuffle, a much easier target than reconstructing everything from scratch, and that is enough for the optimization to reach the good high-compression solution. (The shortcut is not an exact bijection, because the average and the duplication lose information; it is a strong prior, not a perfect inverse.) Toggle it off below and watch the reconstruction error climb, gently at $16\times$ and catastrophically at $32\times$:

Figure 4 · the residual shuffle

Downsampling stacks a voxel's eight children into channels, then averages to the coarse width $C'=2C$; upsampling reverses it. That reshuffle acts as the residual shortcut the learned block corrects on top of. Turn it off and reconstruction error (Mesh Distance, lower is better) rises from 1.03 to 1.75 at $16\times$, and from 1.41 to 7.39 at $32\times$. Without it, high compression does not train.
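The non-parametric shortcut of Eq. 4 is small enough to write out on plain lists. This is a sketch under assumptions: the children are stacked in list order and averaged in consecutive groups of four, which gives the $8C \to 2C$ widths described above, but the exact channel ordering used by DC-AE or TRELLIS.2 is not specified here. Both function names are hypothetical.

```python
def downsample_shortcut(children, group=4):
    """Stack 8 child features (width C each) into 8C channels, then average
    consecutive groups of `group` channels to reach the coarse width 8C / group."""
    stacked = [x for child in children for x in child]
    return [sum(stacked[i:i + group]) / group for i in range(0, len(stacked), group)]

def upsample_shortcut(coarse, n_children=8, group=4):
    """Reverse direction: duplicate each coarse channel `group` times, then
    unstack the result into n_children child features."""
    expanded = [x for x in coarse for _ in range(group)]
    width = len(expanded) // n_children
    return [expanded[i * width:(i + 1) * width] for i in range(n_children)]

children = [[float(i), float(i)] for i in range(8)]   # 8 children, C = 2
coarse = downsample_shortcut(children)                # width 2C = 4
restored = upsample_shortcut(coarse)                  # 8 children of width 2 again
```

In this example `restored` has the right shape but not the original values, which is the lossy behavior the parenthetical above describes; the learned layers are left to predict the difference.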

A couple of smaller pieces round out the autoencoder. The decoder uses an early-pruning upsampler: before subdividing a coarse voxel, it predicts a small binary mask of which of the eight children will actually be occupied, and skips the empty ones, so the decoder never wastes work materializing voxels it is about to discard. And the residual blocks themselves are slimmed in a ConvNeXt style, one convolution plus a wide pointwise MLP instead of two convolutions, which improves reconstruction at the same cost. The VAE trains in two stages: a first stage on direct O-Voxel targets (squared error on the dual vertices, binary cross-entropy on the face flags and the pruning mask, an L1 term on material, and the KL term that keeps the latent close to a unit Gaussian so it stays smooth enough to sample from),

$$\mathcal{L}_{\text{s1}}=\lambda_{v}\lVert\hat{\boldsymbol{v}}-\boldsymbol{v}\rVert_2^2+\lambda_{\delta}\,\operatorname{BCE}(\hat{\boldsymbol{\delta}},\boldsymbol{\delta})+\lambda_{\rho}\,\operatorname{BCE}(\hat{\boldsymbol{\rho}},\boldsymbol{\rho})+\lambda_{\text{mat}}\lVert\hat{\boldsymbol{f}}^{\text{mat}}-\boldsymbol{f}^{\text{mat}}\rVert_1+\lambda_{\text{KL}}\,\mathcal{L}_{\text{KL}} \tag{6}$$

and a second stage that adds rendering supervision. Per-voxel error rewards getting each number close, but not whether the decoded surface looks right, so the second stage renders the decoded mesh into mask, depth, and normal maps and compares those against the ground truth with L1, SSIM, and LPIPS terms. One nice trick there: the training cameras are placed with a shallow near-plane that slices through the surface, which forces the decoder to get the interior right rather than only the parts a normal camera would see. To support generating shape first and material second, TRELLIS.2 actually trains two SC-VAEs with decoupled latent spaces, one for geometry and one for material, with the material encoder conditioned on the geometry's voxel structure so the two stay aligned. The ~9,600-token figure above is the geometry latent.

What does $16\times$ compression buy? Fewer tokens at higher fidelity than anyone else. Plotting reconstruction quality against token count, the two TRELLIS.2 points land in the top-left corner, best quality for the fewest tokens. Hover the points:

Figure 5 · compact and accurate

Reconstruction quality (normal-map PSNR, higher is better) against latent token count, log scale, on the Toys4K benchmark. TRELLIS.2 at 9,600 tokens reaches 43.1 PSNR, while SparseFlex needs 225,000 tokens to reach 37.3 and the original TRELLIS scores 30.3 at the same 9,600 tokens. The 512-resolution model hits 39.5 PSNR with only 2,200 tokens. Top-left is best; that is the "compact" claim, checkable.

Generating in the native latent

With a compact latent in hand, generation is the part that looks familiar. TRELLIS.2 trains DiT models, plain Transformers over the latent tokens, with the flow matching objective. Flow matching learns a time-dependent velocity field $\boldsymbol{v}_\theta$ that transports noise to data along a straight path. The paper's loss is the conditional flow-matching loss:

$$\mathcal{L}_{\text{CFM}}(\theta)=\mathbb{E}_{t,\,\boldsymbol{x}_0,\,\boldsymbol{\epsilon}}\,\big\lVert\,\boldsymbol{v}_\theta(\boldsymbol{x}(t),t)-(\boldsymbol{\epsilon}-\boldsymbol{x}_0)\,\big\rVert_2^2,\qquad \boldsymbol{x}(t)=(1-t)\,\boldsymbol{x}_0+t\,\boldsymbol{\epsilon} \tag{9}$$

Read the convention carefully, because it is the mirror of the original flow-matching papers and will look sign-flipped if you do not. Here $\boldsymbol{x}_0$ is the clean data latent at $t=0$, $\boldsymbol{\epsilon}$ is pure noise at $t=1$, and the straight path between them has constant velocity $\boldsymbol{\epsilon}-\boldsymbol{x}_0$, which points from data toward noise. The network learns to predict that velocity at every point and time. To generate, you start at noise ($t=1$) and integrate the field backward down to $t=0$, walking against the velocity into the data. (Lipman's and Liu's flow-matching papers put noise at $t=0$ and data at $t=1$; the math is identical, only the labeling of the time axis is reversed. We follow this paper's convention throughout.)

The training and sampling loops are short. Training regresses the velocity on a randomly-timed point of the straight path:

# train one flow-matching DiT (geometry stage shown)
x0   = encode(o_voxel)                  # data latent     [tokens, 32]
eps  = randn_like(x0)                   # noise           [tokens, 32]
t    = rand()                           # time in [0, 1]
xt   = (1 - t) * x0 + t * eps           # straight path: t=0 data, t=1 noise
vhat = dit(xt, t, cond=dinov3(image))   # predict the velocity
loss = mse(vhat, eps - x0)              # target points data -> noise  (Eq 9)

# sample: start at noise (t=1), integrate back to data (t=0)
x  = randn(tokens, 32)                  # pure noise
ts = linspace(1, 0, steps + 1)          # steps + 1 time points, from 1 down to 0
for t, t_next in zip(ts[:-1], ts[1:]):  # walk the time axis backward
    dt = t - t_next                     # positive Euler step size
    x  = x - dt * dit(x, t, cond=dinov3(image))   # step against the velocity
latent = x                              # decode -> O-Voxel -> textured mesh
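To check the sign convention without training anything, one can swap the network for a velocity field that is known in closed form. The toy below assumes the data distribution is a single scalar `x0`; then $x(t)=(1-t)x_0+t\epsilon$ gives $\epsilon - x_0 = (x(t)-x_0)/t$, so the ideal velocity is available exactly. This stands in for the DiT and is purely illustrative; `exact_velocity` and `sample` are hypothetical names.

```python
import random

def exact_velocity(x, t, x0):
    """Ideal velocity for a one-point data distribution.

    From x(t) = (1 - t) * x0 + t * eps it follows that eps - x0 = (x - x0) / t.
    """
    return (x - x0) / t

def sample(x0, steps=10, seed=0):
    """Euler integration from noise (t=1) back to data (t=0)."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)                            # pure noise at t = 1
    ts = [1.0 - i / steps for i in range(steps + 1)]   # 1.0, ..., 0.0
    for t, t_next in zip(ts[:-1], ts[1:]):
        # t_next < t, so this moves against the data -> noise velocity.
        x = x + (t_next - t) * exact_velocity(x, t, x0)
    return x

result = sample(x0=2.0)
```

Because the path is straight, each Euler step shrinks the offset from `x0` by the factor `t_next / t`, and the final step lands on the data point up to floating-point error. Flipping the sign of the update would send the sample away from the data instead.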

The generation pipeline is three of these models in sequence, which is the second place "native" pays off. The first DiT generates the sparse structure, which voxels are active at all, the coarse occupancy of the shape. The second fills those voxels with geometry latents. The third, new in this paper, generates material latents, conditioned not only on the input image but also on the geometry the second stage just produced, so the material lands exactly on the surface it belongs to. The first two stages follow TRELLIS, which had only those two; the material stage is the addition that makes the output a finished, textured asset rather than a bare shape. Scrub the pipeline below:

Figure 6 · the native cascade


One input image conditions three flow-matching DiTs in sequence: sparse structure (which voxels exist) → geometry latents → material. The material stage also sees the generated geometry, so color and PBR land on the right surface. Press play or scrub to watch the asset build up, occupancy to geometry to textured material.

The conditioning stack is built from recent, well-chosen parts. The image features come from DINOv3-L, a self-supervised vision Transformer (it is trained with no labels and no text, unlike CLIP, and TRELLIS.2 freezes it and feeds its features in by cross-attention). The timestep modulates each block through AdaLN-single, the parameter-thrifty variant from PixArt that computes one global modulation and shares it across blocks, and positions are encoded with rotary embeddings. Each DiT is about 1.3 billion parameters (width 1536, 30 blocks, 12 heads, an 8192-wide MLP), so the three together come to roughly 4 billion. Because the latent is already so compact, the DiTs can drop the convolutional packing and skip connections TRELLIS needed and run as plain vanilla Transformers, which is simpler and faster.
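The 1.3-billion figure can be sanity-checked from the stated shape. The estimate below counts only the large weight matrices in each block (self-attention, cross-attention, MLP) and rests on several assumptions: biases, norms, modulation, and input/output layers are ignored, and the cross-attention context is taken to have the same width as the model unless `context_dim` says otherwise. `dit_params` is a hypothetical helper, not the paper's accounting.

```python
def dit_params(width=1536, blocks=30, mlp_width=8192, context_dim=None):
    """Rough weight count for a DiT with self-attention, cross-attention, and MLP.

    Assumptions: no biases, norms, or embeddings are counted, and the
    cross-attention keys/values project from `context_dim` (default: width).
    """
    context_dim = width if context_dim is None else context_dim
    self_attn = 4 * width * width                                # Q, K, V, output
    cross_attn = 2 * width * width + 2 * context_dim * width     # Q, output + K, V
    mlp = 2 * width * mlp_width                                  # up and down projections
    return blocks * (self_attn + cross_attn + mlp)

per_model = dit_params()
all_three = 3 * per_model
```

Under these assumptions one model comes to about 1.32 billion weights and three come to just under 4 billion, in line with the figures quoted above.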

Take a concrete case. You hand the system a photo of a translucent glass with a metal rim. Stage one looks at the photo and emits a coarse occupancy, the few thousand voxels the glass occupies, hollow center included. Stage two, conditioned on the photo and that occupancy, denoises a tensor of geometry latents, one 32-dim vector per active voxel, from pure noise down through the flow to the clean latent, which the geometry SC-VAE decodes into dual vertices and face flags, then into a mesh. Stage three denoises material latents on those same voxels and decodes them into per-voxel color, metallic, roughness, and opacity, so the body comes out transparent and the rim comes out reflective metal. The hollow center and the transparency are the two things the old multiview latent could not have carried. Decode, and you have a relightable asset, in a few seconds for a $512^3$ asset and tens of seconds at $1536^3$.

What it buys

The numbers back the design. On shape reconstruction, TRELLIS.2 beats Dora, TRELLIS, Direct3D-S2, and SparseFlex across every metric while using far fewer tokens, the result Figure 5 plots. Material reconstruction, which has no real baseline because no prior method encodes intrinsic material this way, comes in at 38.89 dB PSNR on the PBR attribute maps and 38.69 dB on the shaded renders. On image-to-3D generation it tops every alignment score (CLIP 0.894, ULIP-2 0.477, Uni3D 0.436), and a user study of about 40 participants preferred its results 66.5% of the time overall and 69.0% on shape, against 13.3% for the next best, Hunyuan3D 2.1.

A word on the metrics, because two of them look alike and are not the same lens. Mesh Distance and Chamfer Distance are the same formula, a symmetric average of squared nearest-neighbor distances between two point sets. What differs is the point set. Chamfer Distance samples points from depth maps rendered over a hundred views, so it only ever sees the outer shell. Mesh Distance samples a million points from the full mesh surface, interiors included, so it is the metric that scores the enclosed structure O-Voxel was built to keep. TRELLIS.2's margin is largest there, which is the field-free argument showing up in the numbers. (The F-scores that accompany them count points landing within a distance threshold; note those thresholds, $10^{-8}$ and $10^{-6}$, are on squared distances.)
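The shared formula is short enough to write down. The brute-force sketch below is for illustration only: real evaluations use millions of points and a spatial index, and whether the two directions are averaged or summed is an assumption here, as are the function names.

```python
def nearest_sq(p, points):
    """Squared distance from p to its nearest neighbor in `points`."""
    return min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in points)

def chamfer_sq(a, b):
    """Symmetric average of squared nearest-neighbor distances between two sets."""
    a_to_b = sum(nearest_sq(p, b) for p in a) / len(a)
    b_to_a = sum(nearest_sq(q, a) for q in b) / len(b)
    return 0.5 * (a_to_b + b_to_a)

def f_score(a, b, sq_threshold):
    """Harmonic mean of precision and recall, thresholding squared distances."""
    precision = sum(nearest_sq(p, b) <= sq_threshold for p in a) / len(a)
    recall = sum(nearest_sq(q, a) <= sq_threshold for q in b) / len(b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.1)]
distance = chamfer_sq(predicted, reference)
score = f_score(predicted, reference, sq_threshold=1e-6)
```

Feeding this function points sampled from rendered depth maps gives the shell-only Chamfer Distance; feeding it points sampled from the whole mesh surface gives Mesh Distance. The code is identical, only the sampling changes.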

The ablation gives the cleanest evidence for the compression design. At $16\times$ compression, removing the residual shuffle lifts Mesh Distance by about 70%, from 1.03 to 1.75. At $32\times$ the same removal sends it to 7.39, about 5.2 times the baseline of 1.41 going by the rounded values. The paper phrases that as "worsening to 526%," which reads as a percent-of-baseline figure (roughly a 5.3× value), not a 526% increase; the increase is a little over 420%. Either way the lesson holds: without the shuffle, high compression collapses.

The compact latent enables one more capability: test-time scaling beyond the training resolution. Because generating is cheap, you can run the geometry stage, downsample the result into a coarse sparse occupancy, then run the geometry stage again to emit it at a higher resolution, cascading up to a $1536^3$ asset the model was never trained to produce directly. The same trick, downsampling and regenerating within the trained resolution, also cleans up local errors. It is a controllable quality-for-compute dial that a heavier latent could not afford.

The bottleneck in 3D generation was a representation that could only hold clean, closed, opaque shells, and only the appearance a camera caught. TRELLIS.2 replaces it with a representation that reads the surface directly, so it keeps open sheets, hollow interiors, non-manifold joins, and real material; then it compresses that representation hard enough that a 4-billion-parameter generator can produce a finished, relightable asset from one photo in seconds. Give the generator a format that can hold any asset, and it finally has any asset to generate.

Provenance: verified against primary literature

TRELLIS / SLAT (Xiang 2024): The structured-latent skeleton and two-stage pipeline. TRELLIS’s latent was multiview-derived; this one is native.

Dual Contouring (Ju 2002): The dual-grid + QEF vertex placement. TRELLIS.2 makes it field-free and adds the open-boundary term.

FlexiCubes (Shen 2023): Only the splitting weights for adaptive quad-to-triangle subdivision; not its field machinery.

DC-AE (Chen 2024): The space-to-channel residual autoencoding that makes high compression trainable.

DiT + Flow Matching: Peebles & Xie (2022) and Lipman et al. (2022): the transformer generator and its objective.

PixArt / DINOv3 / RoPE: AdaLN-single modulation, self-supervised image features, rotary positions.

Correction: Field methods like FlexiCubes are often described as handling 'arbitrary topology', but that means arbitrary genus of a closed, watertight surface, not the open, non-manifold, or fully-enclosed shapes O-Voxel's field-free design captures. The paper reuses only FlexiCubes' triangle-splitting weights, not its field machinery.

Questions you might still have

Why not just use an SDF and repair the mesh first?
Repair is destructive, not cosmetic. Flood-fill or winding-number watertighting inflates an open sheet into a solid slab, fills hollow interiors with material, and snaps thin gaps shut. By the time the asset is a valid signed field it is a different shape. Reading where the surface crosses voxel edges needs no inside/outside, so nothing has to be repaired away.

How is this different from TRELLIS?
TRELLIS built its structured latent by rendering ~150 views, running a 2D encoder, and averaging the features onto voxels, so it only captured what cameras saw: no interiors, no intrinsic material. TRELLIS.2 encodes the raw 3D O-Voxel with a 3D VAE (native, not multiview-derived), and adds a third generation stage for PBR material on top of TRELLIS’s two.

Why a VAE at all? Why not generate the voxels directly?
A 1024³ grid has hundreds of thousands of active voxels, far too many tokens for a generator to emit. The Sparse Compression VAE squeezes that to ~9,600 tokens at 16× spatial compression, which makes a large flow-matching model affordable and fast.

Does FlexiCubes not already handle arbitrary topology?
Only in the genus sense: FlexiCubes is field-based, so its output is a watertight mesh that can have any number of handles, but it still cannot represent open surfaces, non-manifold junctions, or sealed interiors. Those need the field-free reading. TRELLIS.2 borrows only FlexiCubes’ triangle-splitting weights.

Is the flow-matching direction the same as the flow-matching paper?
The math is the same; the time axis is reversed. This paper puts data at t=0 and noise at t=1, with the velocity target pointing data→noise, so sampling integrates backward from t=1. The original flow-matching and rectified-flow papers put noise at t=0. Watch the convention if you cross-reference them.

Footnotes & further reading

The paper: Microsoft (TRELLIS.2 team), Native and Compact Structured Latents for 3D Generation (2025). Project page (code, model, data).
The predecessor: Xiang et al., Structured 3D Latents for Scalable and Versatile 3D Generation (TRELLIS / SLAT, CVPR 2025).
Dual Contouring of Hermite data: Ju, Losasso, Schaefer, Warren, Dual Contouring of Hermite Data (SIGGRAPH 2002); and FlexiCubes, Shen et al., Flexible Isosurface Extraction for Gradient-Based Mesh Optimization (SIGGRAPH 2023), source of the splitting weights.
High-compression autoencoding: Chen et al., Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models (DC-AE), the space-to-channel residual idea; and submanifold sparse convolution, Graham & van der Maaten, Submanifold Sparse Convolutional Networks (2017).
The generator: Peebles & Xie, Scalable Diffusion Models with Transformers (DiT); Lipman et al., Flow Matching for Generative Modeling; AdaLN-single from PixArt-α; image features from DINOv3.
Reported runtimes (~3s at 512³, ~17s at 1024³, ~60s at 1536³) are stated on an NVIDIA H100 in the abstract, while the experiments section notes its runtime statistics are on an A100; the paper is internally inconsistent on the device, so read the times as approximate.
