Theory · Architecture

Neural Ordinary Differential Equations

A residual network, taken all the way to its limit.

Stop writing down the layers. Write down the derivative of the hidden state, hand it to an ODE solver, and let the solver decide how deep the network needs to be. It can then be trained with memory that does not grow with depth.

Explaining the paperNeural Ordinary Differential EquationsChen, Rubanova, Bettencourt, Duvenaud · U. Toronto / Vector Institute · NeurIPS 2018 · arXiv:1806.07366 ↗

The same reframing makes generative flows cheap to train and handles data that arrives at irregular times.

A deep network has a number you set before training: its depth. Pick 50 layers, or 100, or 1000, and you have committed. Each layer is a discrete step, and the steps are baked in. This explainer is about a paper that takes that integer and dissolves it into a continuous knob, by noticing that a particular kind of layer is already doing arithmetic that has a much older name.

The residual connection in every modern network is, line for line, one step of the oldest method for solving a differential equation. Once you see that, the layers stop being the model. The model is the equation underneath, and the layers are one crude way to solve it. Swap in a better solver and several things you thought were fixed become dials.

The method follows a chain of equivalences. A residual layer is one Euler step. Many Euler steps are an ODE solve. An ODE solve can be differentiated cheaply by an old method from control theory. And the same continuous view that buys cheap gradients also converts an expensive determinant in generative models into a cheap trace.

A network is a stack of small nudges

Start with the residual block, the idea that let networks get deep. A plain layer replaces its input with something new. A residual layer does not. It computes a small correction and adds it to what it was handed:

\mathbf{h}_{t+1} = \mathbf{h}_t + f(\mathbf{h}_t, \theta_t)

(1)

Reading the symbols: $\mathbf{h}_t$ is the hidden state going into layer $t$ (a vector, or for a Transformer a whole tensor of token vectors), $\mathbf{h}_{t+1}$ is what comes out, and $f$ is that layer's learned transformation with its own weights $\theta_t$ . The layer does not rebuild the state from scratch. It nudges it. This is the idea behind ResNets, and it is in every Transformer block: each layer only has to learn how to adjust a running state, never to regenerate it.

The hidden state is a point, and the network is a sequence of nudges that walk that point from where it starts (the input) to where it ends (a representation you can read an answer off). Layer one nudges it a little. Layer two nudges the result a little more. Stack $T$ layers and you have walked the point along a path in $T$ discrete hops.

The shape of (1) is familiar. "New position equals old position plus a step." If you have ever solved a differential equation by hand, that is the most basic thing you can write down.

Shrink the step to nothing

An ordinary differential equation is a rule that says, at every point and every moment, which way to move and how fast. A velocity. Euler's method, the simplest way to actually solve one on a computer, says: stand where you are, read off the velocity, take a small step of size $h$ in that direction, repeat.

\mathbf{h}_{t+1} = \mathbf{h}_t + h \cdot f(\mathbf{h}_t, t)

Set the step size $h = 1$ and that is exactly equation (1). A residual layer is an Euler step, with the step size fixed to one and left implicit, and the papers that first drew this picture wrote the step size in on purpose, precisely so the connection to a continuous equation would be explicit.

A coarse Euler solver takes a few big steps and cuts the corners of the true curve. A good solver takes many small steps and tracks it closely. What happens to a residual network if you add more layers and shrink the step toward zero? In the limit, the discrete stack of nudges becomes a smooth, continuous flow, and the network is no longer specifying a sequence of layers. It is specifying the velocity field itself:

\frac{\mathrm{d}\mathbf{h}(t)}{\mathrm{d}t} = f(\mathbf{h}(t), t, \theta)

(2)

Here a single small network $f$ , with one shared set of weights $\theta$ , reports the velocity $\mathrm{d}\mathbf{h}/\mathrm{d}t$ of the hidden state at any position $\mathbf{h}$ and any depth $t$ . The layer index has turned into a continuous time, and "depth" is now how long you let the equation run. To get the output, you start at the input $\mathbf{h}(0)$ and integrate this velocity forward to some final time $T$ :

\mathbf{h}(T) = \mathbf{h}(0) + \int_0^T f(\mathbf{h}(t), t, \theta)\,\mathrm{d}t = \texttt{ODESolve}(\mathbf{h}(0),\, f,\, 0,\, T,\, \theta)

(3)

The forward pass is now a call to an off-the-shelf ODE solver. The solver evaluates the little velocity network $f$ wherever it needs to, as many times as it needs to, and hands back $\mathbf{h}(T)$ . That box, ODESolve, is the entire network. It is a layer whose internals you do not write down; you write down its derivative and let the solver do the rest.

Drag the slider below. The smooth amber curves are four trajectories of the true continuous-depth ODE, with depth running up the page. The teal staircases are a residual network trying to follow each one with $L$ discrete Euler steps, one dot per layer. Few layers and the staircase overshoots and cuts corners. Add layers, shrink the step $h = T/L$ , and the staircase melts onto the curve. The limit being taken is precise: step size to zero and layer count to infinity together, with the total depth $T$ held fixed. That an Euler staircase converges onto the smooth solution as its steps shrink is exactly the convergence an ODE solver guarantees.

Figure 1 · depth becomes time

layers LL = 4 · h = 1.05

Depth runs up the vertical axis; the hidden state runs across. The continuous ODE flow is fixed. A residual network approximates it with

L

Euler steps, one dot per layer. As

L

grows and the step

h=T/L

shrinks, the staircase converges onto the curve. A layer is one Euler step; depth is elapsed time.

For a real trained residual network, though, the weights $\theta_t$ change however they like from layer to layer, so "take the limit and you get a clean ODE" is a motivating story, not a theorem; a genuine continuous limit needs the weights to vary smoothly with depth. The paper sidesteps the gap: instead of hoping a stack converges, it defines the model as the ODE from the start, with a single $f$ shared across all of $t$ , and trains that.

Let the solver pick the depth

Once the forward pass is an ODE solve, two centuries of numerical analysis come with it, and the first thing they buy you is adaptive depth. Euler is the crudest solver there is. Modern ones (Runge–Kutta and its adaptive descendants) carry an internal estimate of their own error. Where the trajectory is bending sharply they take small, careful steps; where it is nearly straight they stride. You hand the solver an error tolerance, and it spends exactly as many evaluations of $f$ as that tolerance demands.

That count, the number of times the solver calls $f$ , is the closest thing a Neural ODE has to a depth. The paper calls it the number of function evaluations, $\tilde{L}$ (NFE), and unlike a layer count it is not fixed. It is chosen per input, and it can change as training proceeds. The forward-pass time is proportional to it.

Drag the tolerance below. Tighten it and the solver places more evaluation points (they cluster where the curve bends) and the error shrinks. Loosen it and the same trained network runs cheaper, with visible error. You can train at high accuracy and then, at deployment, dial the tolerance down for a real-time or low-power setting, without retraining anything.

Figure 2 · the solver chooses how hard to work

tolerance1e-2

The true solution is fixed. An adaptive solver places evaluation points where it needs them: dense where the curve bends, sparse where it is flat. Tightening the tolerance spends more function evaluations (NFE) for a closer fit; loosening it trades accuracy for speed on the same network. Compute scales with NFE.

The paper reports the flip side too: as a Neural ODE trains, its NFE creeps up; the NFE rises as the dynamics it is fitting grow more complex. Depth stops being a hyperparameter and becomes something the solver sets per input.

Backprop without storing the path

The forward pass is a black-box solver call. How do you train it? The obvious answer is to record every operation the solver performed and backpropagate through all of them. That works, and it is exactly what you want to avoid, because it means storing the entire forward trajectory: every intermediate state, every internal stage of every step. Memory grows with the number of evaluations, which is the very thing we just made adaptive and potentially large.

The alternative is a method from 1960s optimal control called the adjoint sensitivity method. The idea is to ask, at every moment along the trajectory, a single question: how much would the final loss $L$ change if the state at this instant were nudged? Call that quantity the adjoint, following the paper's notation with $\mathbf{z}$ for the state:

\mathbf{a}(t) = \frac{\partial L}{\partial \mathbf{z}(t)}

In an ordinary network, the gradient at one layer comes from the gradient at the next layer by the chain rule. The adjoint is the continuous version of that. It obeys its own differential equation, the instantaneous chain rule, run backward in time:

\frac{\mathrm{d}\mathbf{a}(t)}{\mathrm{d}t} = -\,\mathbf{a}(t)^{\!\top} \frac{\partial f(\mathbf{z}(t), t, \theta)}{\partial \mathbf{z}}

(4)

The minus sign matters, and it is correct as printed: the adjoint flows from the final time back to the start, accumulating sensitivity the way backprop accumulates gradient from output toward input. The term $\partial f/\partial \mathbf{z}$ is the Jacobian of the little velocity network, and the product $\mathbf{a}^{\top}\,\partial f/\partial \mathbf{z}$ is a vector–Jacobian product, the same primitive ordinary backprop already computes, at about the cost of one evaluation of $f$ . The gradient you actually want, with respect to the weights, is then one more integral along the way back:

\frac{\mathrm{d}L}{\mathrm{d}\theta} = -\int_{T}^{0} \mathbf{a}(t)^{\!\top} \frac{\partial f(\mathbf{z}(t), t, \theta)}{\partial \theta}\,\mathrm{d}t

(5)

The limits run from $T$ down to $0$ : this is a single solve backward in time. This single backward solve is the reason the method costs constant memory. The backward equations (4) and (5) both need the state $\mathbf{z}(t)$ along the trajectory, but we threw the trajectory away. So we recompute it on the fly: we re-integrate the original ODE backward from the final state $\mathbf{z}(T)$ , right alongside the adjoint, in one combined reverse solve. The state, the adjoint, and the parameter gradient are stacked into one augmented system and integrated together from $T$ back to $0$ .

# gradients with O(1) memory in depth: never store the forward trajectory
def ode_block_backward(hT, f, theta, dL_dhT):
    # augmented state s = [h, a, grad], with the adjoint a = dL/dh
    def aug(s, t):
        h, a = s.h, s.a
        return [ f(h, t, theta),               # rebuild h by re-integrating
                 -a @ df_dh(f, h, t, theta),    # the adjoint ODE, eq (4)
                 -a @ df_dtheta(f, h, t, theta) ]  # the eq (5) integrand
    s0 = [hT, dL_dhT, zeros_like(theta)]
    h0, dL_dh0, dL_dtheta = odesolve(s0, aug, t1=1.0, t0=0.0, theta)
    return dL_dh0, dL_dtheta                  # one reverse solve, eq (5)

Because nothing from the forward pass is stored, the memory cost does not depend on how many steps the solver took. It is constant in depth. On MNIST (the standard handwritten-digit classification benchmark) the paper measures exactly this: the ODE-Net trains at $O(1)$ memory, against $O(\tilde{L})$ for backpropagating through the solver and $O(L)$ for an ordinary residual network. Press play and watch where the memory goes.

Figure 3 · the memory trick

forward

Forward (first half): the solver runs the state from

t_0

t_1

. The store-every-step baseline saves an activation at each step, so its meter climbs to full. The adjoint method saves only the endpoint. Backward: the adjoint flows back while the state is reconstructed on the way. The baseline still holds its whole tape (

O(\tilde{L})

); the adjoint never held more than one point (

O(1)

), at any depth.

Concrete numbers make the gap clear. Say the hidden state $\mathbf{z}$ is a vector of 100 numbers and the solver takes, on this input, some number of steps $\tilde{L}$ . Backpropagating through the solver keeps one copy of the state for every step it took, so its activation memory is $100 \times \tilde{L}$ numbers and climbs with every step the solver adds. At 50 steps that is 5,000 stored numbers; at 500 steps, 50,000; the meter tracks the trajectory. The adjoint method stores the endpoint $\mathbf{z}(T)$ and reconstructs the rest, so during the backward solve it is holding the current state, the current adjoint, and the running parameter gradient, a fixed handful of buffers whose size is set by the state and the parameter count, not by how many steps were taken. Double the number of solver steps and the backprop column doubles while the adjoint column does not move. That is the precise sense of the claim: the cost is flat in depth, the number of solver steps $\tilde{L}$ . It is emphatically not flat in bytes, since the 100-number state, the parameters, and the augmented adjoint buffers are all still resident; what depth no longer multiplies is the part that used to grow without bound.

The reconstruction is affordable for the same reason the forward pass is. The backward equations (4) and (5) advance using $\partial f/\partial \mathbf{z}$ and $\partial f/\partial \theta$ , the Jacobians of the little velocity network $f$ , and they touch them only through the vector–Jacobian products $\mathbf{a}^{\top}\,\partial f/\partial \mathbf{z}$ and $\mathbf{a}^{\top}\,\partial f/\partial \theta$ . A vector–Jacobian product is exactly what ordinary reverse-mode autodiff computes in one backward sweep of $f$ , and for a small network that sweep costs about one forward evaluation of $f$ , never the full $D \times D$ Jacobian written out. So a single backward step of the augmented system costs about the same as a single forward step, the backward solve runs at roughly the cost of the forward one (the paper measures its NFE at roughly half the forward count), and the trade is a fair one: you pay a forward-pass-sized recomputation on the way back to avoid storing a trajectory whose length you no longer control.

Three caveats. First, "constant memory" means constant in depth, not constant in bytes: you still store the parameters, the current state, and the augmented adjoint state. Second, what you save is memory, not time: the backward pass is itself an ODE solve, the adjoint system run backward with its own adaptive steps and its own NFE (the paper finds roughly half the forward count), and NFE rises during training as the learned dynamics get stiffer and harder to integrate and the solver takes more steps to hold the same tolerance. Third, the gradient is exact only for the continuous problem: on a real computer, reconstructing $\mathbf{z}(t)$ by running the dynamics backward can drift away from the true forward path, especially on stiff problems (ones where the dynamics has sharply different time scales, forcing a solver into very small steps to stay stable), and then the gradient is slightly wrong. The paper flags this itself and offers checkpointing (store a few states on the way forward, re-integrate between them) as the fix; later work, ANODE, makes the failure case precise.

The same trick makes flows cheap

The continuous view has a side benefit the authors call unexpected, and it lands in generative modeling. A normalizing flow builds a complicated probability distribution by taking simple noise and pushing it through an invertible function $f$ . To know the density of a generated sample, you use the change of variables formula, which corrects for how much the function locally stretches or squeezes space:

\mathbf{z}_1 = f(\mathbf{z}_0) \quad\Longrightarrow\quad \log p(\mathbf{z}_1) = \log p(\mathbf{z}_0) - \log\left| \det \frac{\partial f}{\partial \mathbf{z}_0} \right|

(6)

That determinant is expensive. For a transformation on $D$ dimensions it costs on the order of $D^3$ to compute. So ordinary flows are forced into awkward, restricted architectures (triangular Jacobians, or layers with a single hidden unit) just to keep the determinant tractable.

Now make the transformation continuous, a flow driven by an ODE instead of a stack of discrete maps. The paper's Theorem 1, the instantaneous change of variables, says the log-density then obeys its own simple differential equation:

\frac{\mathrm{d}\log p(\mathbf{z}(t))}{\mathrm{d}t} = -\operatorname{tr}\!\left( \frac{\partial f}{\partial \mathbf{z}(t)} \right)

(7)

The determinant has become a trace. Instead of an $O(D^3)$ determinant of the Jacobian, you need only the sum of its diagonal, which is cheap and, when the dynamics is a sum over hidden units, scales linearly in their number. The geometry explains the swap: the determinant measures the volume change of one whole discrete jump, while in continuous time volume changes at a rate, the divergence of the velocity field, and the trace of the Jacobian is exactly that divergence, the sum of each coordinate's own stretching. That rate is also the rate at which probability mass piles up or thins out: where the flow squeezes, density rises; where it spreads, density falls.

The figure makes the price difference checkable. Slide the dimension $D$ : the determinant needs every cell of the $D \times D$ Jacobian, so the determinant's cost rises steeply with $D$ , while the trace needs only the diagonal and its cost stays nearly flat.

Figure 4 · determinant cost vs trace cost

DD = 16

Left, the

D \times D

Jacobian. Equation (6)'s log-determinant needs all

D^2

cells and costs on the order of

D^3

, which is what forces discrete flows into restricted architectures. Equation (7)'s trace needs only the

D

diagonal entries and scales linearly. Right, both costs against

D

on a log axis, the gap is a factor of

D^2

. That cheapness is what lets the dynamics

f

be a free-form network.

Drag the time slider. A square mesh of probability mass starts as a plain Gaussian and is carried by a smooth velocity field out toward three data clusters. Where the cells shrink, density concentrates (the trace accounts for it); where they stretch, it thins. The tracer points ride the same flow from noise to data.

Figure 5 · a flow you can read the density off

time t0.00

A mesh of probability mass (the base Gaussian) is carried by a velocity field toward three data clusters. Cells that shrink pack density in; cells that stretch thin it out. That local area-change rate is

-\operatorname{tr}(\partial f/\partial \mathbf{z})

, the trace that replaces the discrete flow's

\log\det

. The same flow run backward evaluates any point's density.

Two consequences fall out. The velocity field $f$ no longer has to be invertible by construction: as long as the ODE has a unique solution, the flow it generates is automatically reversible, so you can run it backward to score real data and forward to sample. And the architecture is freed. You can use wide, ordinary networks for $f$ instead of the contortions a tractable determinant used to force. The paper calls these continuous normalizing flows, and shows a continuous flow with $M$ hidden units can be at least as expressive as a discrete one with $M$ stacked single-unit layers, and often more.

Cheap is relative, though. The trace this paper buys is cheap next to the determinant, but computing an exact trace of a free-form Jacobian still costs on the order of $D$ passes. What collapses it to a single pass, a stochastic estimator, came a few months later in FFJORD; that speedup belongs to the follow-up.

Data that arrives whenever it likes

The continuous picture has a third home, time series whose observations do not land on a tidy grid, and of the three it is the most natural fit. Medical records, network traffic, a patient's lab tests. A recurrent network requires evenly spaced steps, so the usual fix is to chop time into bins and pretend the data is regular, which throws away exactly the timing information that often matters most.

A continuous model has no grid to work around. The paper's latent ODE works like a variational autoencoder over a trajectory. An encoder (a recurrent net reading the observations backward) infers a distribution over a single latent starting point $\mathbf{z}_{t_0}$ . One shared ODE then evolves that point forward through continuous time, and a decoder reads out a prediction at any moment you ask for, observed or not:

\mathbf{z}_{t_0} \sim p(\mathbf{z}_{t_0}), \qquad \mathbf{z}_{t_1}, \ldots, \mathbf{z}_{t_N} = \texttt{ODESolve}(\mathbf{z}_{t_0}, f, \theta, t_0, \ldots, t_N), \qquad \mathbf{x}_{t_i} \sim p(\mathbf{x} \mid \mathbf{z}_{t_i})

(8)

One starting point is enough because the ODE is deterministic: the entire trajectory is pinned by where it starts, so all the uncertainty lives in which $\mathbf{z}_{t_0}$ to start from, and inferring that is the encoder's whole job. Because the latent trajectory is defined for all $t$ , you can evaluate it at irregular observation times with no special handling, since the solver can stop wherever an observation happens to land, and you can run it past the last observation to extrapolate. The figure shows the paper's spiral benchmark: a handful of noisy, irregularly-timed points over the first part of a spiral are all the model gets. The latent ODE fits a smooth trajectory through them and keeps going, past the data, to extrapolate. A recurrent baseline tracks the early part but wobbles and diverges once the observations run out.

Figure 6 · fit the irregular points, then keep going

observationsn = 30

A few noisy, irregularly-timed observations cover only the first part of a spiral. The latent ODE reconstructs a smooth trajectory (solid) and extrapolates past the data (dashed). A gray RNN baseline drifts once the points run out. Drag the observation count: the paper's Table 2 predictive RMSE (lower is better) favors the latent ODE at every setting.

On that benchmark the latent ODE's predictive error is roughly half the RNN's with only 30 of 100 points observed (RMSE $0.164$ versus $0.394$ ), and it stays ahead as you add points. The model can also tie a Poisson process onto the same latent trajectory, so the times at which events occur become evidence about the hidden state too, which is the right way to think about a patient who takes a test because they already feel sick.

What the three uses deliver

For a forward-pass ODESolve block, the layer looks like this. "Depth" is now a solver argument rather than an architecture decision:

# an ODE block: the "layer" is a solver call, depth is continuous time
def ode_block(h0, f, theta):
    # f(h, t, theta) is a small net: it returns the velocity dh/dt at (h, t)
    return odesolve(h0, f, t0=0.0, t1=1.0, theta=theta)   # #steps is adaptive

On MNIST, the ODE-Net reaches $0.42\%$ test error with about $0.22$ M parameters, against a residual network's $0.41\%$ at $0.60$ M. The accuracy is tied (the ODE-Net is a hair worse), at roughly a third of the parameters, and at constant memory in depth. The paper itself walked back an early parameter-efficiency claim in its acknowledgements, and a plain Runge–Kutta network hits the same parameter count without being a continuous model at all, so the ODE-Net matches accuracy at far less memory rather than classifying better.

The continuous normalizing flows fit two-dimensional densities at least as well as discrete planar flows while training by maximum likelihood, and with smoother, more interpretable transformations. The latent ODE wins on the irregular time-series benchmark by a clear margin. Across all three, the through-line is the same: an ODE solver, dropped in as a differentiable model component, trained end-to-end by the adjoint method.

The contribution is sharper than "networks are ODEs," which predates this paper. The paper's own additions are the three that needed the continuous view to exist at all: a way to backpropagate through any black-box solver at memory constant in depth; the instantaneous change of variables that turns a determinant into a trace; and a generative time-series model that eats irregular data without a grid. The shared engine is one general-purpose vector–Jacobian product that lets a solver sit inside a larger model and be trained like any other layer.

Where the continuous picture breaks

A continuous flow is elegant, and the elegance is also the constraint. Because the state moves along a smooth trajectory governed by a single velocity field, two trajectories can never cross. If they touched, the velocity at that point would have to send the state two different directions at once, which a function cannot do. The map from input to output is therefore a homeomorphism, a continuous deformation with a continuous inverse: it can stretch and bend space, but it cannot tear it or fold one region through another.

That sounds abstract until you notice it forbids simple things. A Neural ODE on the line cannot learn the map that sends $-1$ to $+1$ and $+1$ to $-1$ , because the two paths would have to cross at the origin. A plain residual network can, precisely because its discrete jumps are allowed to leap over each other. The fix, from Augmented Neural ODEs, is to give the state a few extra dimensions to move in, so trajectories that would collide on the line can pass each other in the added room. It also tends to make training faster and the solver cheaper. Both halves of the argument, the forbidden crossing and the augmented escape, are drawn below; scrub the time and watch the swap need a collision in one dimension and a detour in two:

Figure 7 · why trajectories cannot cross, and the way out

time tt = 1.00 · T

Left: state

x

against time

t

. The teal trajectories belong to one smooth flow and never intersect. The dashed pink pair is the forbidden swap of

-1

and

+1

: the two paths must cross, and at the crossing both share the same

(x, t)

yet need different slopes, while one velocity field gives one slope. That is the homeomorphism limit in a single picture. Right: augment the state with one extra dimension and the same two endpoints swap by circling around each other, the Augmented Neural ODEs fix.

The smaller frictions are practical. Mini-batching is awkward, because batching distinct inputs into one solver means the solver must satisfy its tolerance on the hardest element of the batch, which can cost extra evaluations. Uniqueness of the solution (so the model is even well-defined) needs the velocity network to be Lipschitz (its output can't change faster than some fixed multiple of its input, the condition that guarantees the ODE has exactly one solution), which $\tanh$ and ReLU satisfy, so this is a mild condition rather than a real obstacle. And the backward reconstruction can drift, the caveat from the adjoint section, which is why stiff problems may want checkpointing. None of these undo the core idea; they are the texture of working with a solver instead of a fixed stack.

Read end to end, the path is short. A residual layer is one Euler step. Many Euler steps are an ODE solve, whose depth a good solver chooses for you. That solve can be differentiated backward at constant memory by carrying an adjoint. And the same continuous view turns the determinant in a normalizing flow into a trace, and irregular time series into something a single shared dynamics can model. The layers were a discretization the whole time, and this paper put the underlying equation back in charge.

Provenance Verified against primary literature

Chen, Rubanova, Bettencourt, Duvenaud (2018)The paper: black-box ODE solvers as a model component, trained by the adjoint method.

E (2017); Lu et al.; Haber & RuthottoResidual networks read as Euler steps of an ODE. The analogy this paper builds on, not one it introduced.

Pontryagin et al. (1962)The adjoint (costate) sensitivity method, borrowed from optimal control.

ANODE (2019); Augmented Neural ODEs (2019)Later work that qualifies the constant-memory and expressiveness claims.

correctionTwo things the popular telling overstates. The residual-network-as-ODE reading predates this paper (E 2017; Lu et al.; Haber & Ruthotto); the contribution here is the adjoint training and the continuous flows. And “constant memory” means constant in depth, bought by recomputing the forward trajectory backward, which can drift from the true path (ANODE). We teach both, with the correct signs.

Questions you might still have

If a residual network is already an ODE, what does this paper actually add?
The continuous reading is older (E 2017; Lu et al.; Haber & Ruthotto, all 2017). The contribution is training through a black-box solver by the adjoint method, at memory that is constant in depth, plus continuous normalizing flows.

How can memory be constant if you still have to backpropagate?
You never store the forward trajectory. You recompute it backward from the final state while carrying the adjoint a(t) = dL/dz. Only the current point is in memory, so the cost is constant in depth, not constant in everything.

Is the gradient from the adjoint method exact?
In continuous time, yes. On a computer it is only as good as the backward solve, which can drift from the true forward path. ANODE (2019) shows this can bite on stiff problems; the paper falls back to checkpointing when it does.

What can a Neural ODE not do?
Its flow cannot let two trajectories cross, so it cannot represent some maps a plain residual network can (Augmented Neural ODEs, 2019). Adding a few extra dimensions to the state fixes it.

Footnotes & further reading

The paper: Chen, Rubanova, Bettencourt, Duvenaud, Neural Ordinary Differential Equations (NeurIPS 2018, best paper award). torchdiffeq is the authors' PyTorch implementation with differentiable ODE solvers.
Residual networks as discretized dynamics, the analogy this paper builds on: Weinan E, A Proposal on Machine Learning via Dynamical Systems (2017); Lu et al., Beyond Finite Layer Neural Networks; and Haber & Ruthotto, Stable Architectures for Deep Neural Networks.
The adjoint (costate) sensitivity method comes from optimal control: Pontryagin, Boltyanskii, Gamkrelidze, Mishchenko, The Mathematical Theory of Optimal Processes (1962).
The discrete change of variables and normalizing flows: Rezende & Mohamed, Variational Inference with Normalizing Flows (2015), and Dinh et al., NICE (2014).
The single-pass trace estimator that makes continuous flows scale (a follow-up, not in this paper): Grathwohl et al., FFJORD (2018).
The two caveats we flag: Gholami et al., ANODE (2019), on the accuracy of the backward reconstruction, and Dupont et al., Augmented Neural ODEs (2019), on the no-crossing limitation and its fix.

Got feedback?

How could this explainer be improved? Found an error, or something unclear? I read every message.