VerifiedarXiv:1806.0736624 min
Theory · Architecture

Neural Ordinary Differential Equations

A residual network, taken all the way to its limit.

Stop writing down the layers. Write down the derivative of the hidden state, hand it to an ODE solver, and let the solver decide how deep the network needs to be. It can then be trained with memory that does not grow with depth.

Explaining the paperNeural Ordinary Differential EquationsChen, Rubanova, Bettencourt, Duvenaud · U. Toronto / Vector Institute · NeurIPS 2018 · arXiv:1806.07366

What if "number of layers" were not a setting you pick, but a question you hand to a solver?

A deep network has a number you set before training: its depth. Pick 50 layers, or 100, or 1000, and you have committed. Each layer is a discrete step, and the steps are baked in. This explainer is about a paper that takes that integer and dissolves it into a continuous knob, by noticing that a particular kind of layer is already doing arithmetic that has a much older name.

The catch the whole paper hangs on is small and a little uncomfortable: the residual connection in every modern network is, line for line, one step of the oldest method for solving a differential equation. Once you see that, the layers stop being the model. The model is the equation underneath, and the layers are just one crude way to solve it. So you swap in a better solver, and several things you thought were fixed turn out to be dials.

None of the pieces are hard on their own. A residual layer is one Euler step. Many Euler steps are an ODE solve. An ODE solve can be differentiated cheaply by an old trick from control theory. And the same continuous view that buys cheap gradients also turns an expensive determinant in generative models into a cheap trace. We will build those up in order, and the paper falls out.

A network is a stack of small nudges

Start with the residual block, the idea that let networks get genuinely deep. A plain layer replaces its input with something new. A residual layer does not. It computes a small correction and adds it to what it was handed:

ht+1=ht+f(ht,θt)\mathbf{h}_{t+1} = \mathbf{h}_t + f(\mathbf{h}_t, \theta_t)
(1)

Reading the symbols: ht\mathbf{h}_t is the hidden state going into layer tt (a vector, or for a Transformer a whole tensor of token vectors), ht+1\mathbf{h}_{t+1} is what comes out, and ff is that layer's learned transformation with its own weights θt\theta_t. The layer does not rebuild the state from scratch. It nudges it. This is the idea behind ResNets, and it is in every Transformer block: each layer only has to learn how to adjust a running state, never to regenerate it.

Picture the hidden state as a point, and the network as a sequence of nudges that walk that point from where it starts (the input) to where it ends (a representation you can read an answer off). Layer one nudges it a little. Layer two nudges the result a little more. Stack TT layers and you have walked the point along a path in TT discrete hops.

Now look at the shape of (1) with fresh eyes. "New position equals old position plus a step." If you have ever solved a differential equation by hand, that is the most basic thing you can write down.

Shrink the step to nothing

An ordinary differential equation is a rule that says, at every point and every moment, which way to move and how fast. A velocity. Euler's method, the simplest way to actually solve one on a computer, says: stand where you are, read off the velocity, take a small step of size hh in that direction, repeat.

ht+1=ht+hf(ht,t)\mathbf{h}_{t+1} = \mathbf{h}_t + h \cdot f(\mathbf{h}_t, t)

Set the step size h=1h = 1 and that is exactly equation (1). A residual layer is an Euler step, with the step size quietly fixed to one, and the papers that first drew this picture wrote the step size in on purpose, precisely so the connection to a continuous equation would be clean.

A coarse Euler solver takes a few big steps and cuts the corners of the true curve. A good solver takes many small steps and tracks it closely. What happens to a residual network if you add more layers and shrink the step toward zero? In the limit, the discrete stack of nudges becomes a smooth, continuous flow, and the network is no longer specifying a sequence of layers. It is specifying the velocity field itself:

dh(t)dt=f(h(t),t,θ)\frac{\mathrm{d}\mathbf{h}(t)}{\mathrm{d}t} = f(\mathbf{h}(t), t, \theta)
(2)

Read this as: a single small network ff, with one shared set of weights θ\theta, reports the velocity dh/dt\mathrm{d}\mathbf{h}/\mathrm{d}t of the hidden state at any position h\mathbf{h} and any depth tt. The layer index has turned into a continuous time, and "depth" is now how long you let the equation run. To get the output, you start at the input h(0)\mathbf{h}(0) and integrate this velocity forward to some final time TT:

h(T)=h(0)+0Tf(h(t),t,θ)dt=ODESolve(h(0),f,0,T,θ)\mathbf{h}(T) = \mathbf{h}(0) + \int_0^T f(\mathbf{h}(t), t, \theta)\,\mathrm{d}t = \texttt{ODESolve}(\mathbf{h}(0),\, f,\, 0,\, T,\, \theta)
(3)

The forward pass is now a call to an off-the-shelf ODE solver. The solver evaluates the little velocity network ff wherever it needs to, as many times as it needs to, and hands back h(T)\mathbf{h}(T). That box, ODESolve, is the whole network. It is a layer whose internals you do not write down; you write down its derivative and let the solver do the rest.

Drag the slider below. The smooth amber curves are four trajectories of the true continuous-depth ODE, with depth running up the page. The teal staircases are a residual network trying to follow each one with LL discrete Euler steps, one dot per layer. Few layers and the staircase overshoots and cuts corners. Add layers, shrink the step h=T/Lh = T/L, and the staircase melts onto the curve. The limit being taken is precise: step size to zero and layer count to infinity together, with the total depth TT held fixed, and an Euler staircase converging onto the smooth solution as its steps shrink is exactly the convergence an ODE solver guarantees.

Figure 1 · depth becomes time
L = 4 · h = 1.05
Depth runs up the vertical axis; the hidden state runs across. The continuous ODE flow is fixed. A residual network approximates it with LL Euler steps, one dot per layer. As LL grows and the step h=T/Lh=T/L shrinks, the staircase converges onto the curve. A layer is one Euler step; depth is elapsed time.

For a real trained residual network, though, the weights θt\theta_t change however they like from layer to layer, so "take the limit and you get a clean ODE" is a motivating story, not a theorem; a genuine continuous limit needs the weights to vary smoothly with depth. The paper sidesteps the gap: instead of hoping a stack converges, it defines the model as the ODE from the start, with a single ff shared across all of tt, and trains that.

Let the solver pick the depth

Once the forward pass is an ODE solve, you inherit two centuries of numerical analysis for free, and the first thing it buys you is adaptive depth. Euler is the crudest solver there is. Modern ones (Runge–Kutta and its adaptive descendants) carry an internal estimate of their own error. Where the trajectory is bending sharply they take small, careful steps; where it is nearly straight they stride. You hand the solver an error tolerance, and it spends exactly as many evaluations of ff as that tolerance demands.

That count, the number of times the solver calls ff, is the closest thing a Neural ODE has to a depth. The paper calls it the number of function evaluations, L~\tilde{L} (NFE), and unlike a layer count it is not fixed. It is chosen per input, and it can change as training proceeds. The forward-pass time is proportional to it.

Drag the tolerance below. Tighten it and the solver places more evaluation points (they cluster where the curve bends) and the error shrinks. Loosen it and the same trained network runs cheaper, with visible error. You can train at high accuracy and then, at deployment, dial the tolerance down for a real-time or low-power setting, without retraining anything.

Figure 2 · the solver chooses how hard to work
1e-2
The true solution is fixed. An adaptive solver places evaluation points where it needs them: dense where the curve bends, sparse where it is flat. Tightening the tolerance spends more function evaluations (NFE) for a closer fit; loosening it trades accuracy for speed on the same network. Compute scales with NFE.

The paper reports the flip side too: as a Neural ODE trains, its NFE creeps up, as if the model grows itself deeper to match the rising complexity of what it is learning. Depth stops being a hyperparameter and becomes something the model negotiates with the solver.

Backprop without the breadcrumbs

Now the hard part. The forward pass is a black-box solver call. How do you train it? The obvious answer is to record every operation the solver performed and backpropagate through all of them. That works, and it is exactly what you want to avoid, because it means storing the entire forward trajectory: every intermediate state, every internal stage of every step. Memory grows with the number of evaluations, which is the very thing we just made adaptive and potentially large.

The alternative is a method from 1960s optimal control called the adjoint sensitivity method. The idea is to ask, at every moment along the trajectory, a single question: how much would the final loss LL change if the state at this instant were nudged? Call that quantity the adjoint, following the paper's notation with z\mathbf{z} for the state:

a(t)=Lz(t)\mathbf{a}(t) = \frac{\partial L}{\partial \mathbf{z}(t)}

In an ordinary network, the gradient at one layer comes from the gradient at the next layer by the chain rule. The adjoint is the continuous version of that. It obeys its own differential equation, the instantaneous chain rule, run backward in time:

da(t)dt=a(t) ⁣f(z(t),t,θ)z\frac{\mathrm{d}\mathbf{a}(t)}{\mathrm{d}t} = -\,\mathbf{a}(t)^{\!\top} \frac{\partial f(\mathbf{z}(t), t, \theta)}{\partial \mathbf{z}}
(4)

The minus sign matters, and it is correct as printed: the adjoint flows from the final time back to the start, accumulating sensitivity the way backprop accumulates gradient from output toward input. The term f/z\partial f/\partial \mathbf{z} is just the Jacobian of the little velocity network, and the product af/z\mathbf{a}^{\top}\,\partial f/\partial \mathbf{z} is a vector–Jacobian product, the same primitive ordinary backprop already computes, at about the cost of one evaluation of ff. The gradient you actually want, with respect to the weights, is then one more integral along the way back:

dLdθ=T0a(t) ⁣f(z(t),t,θ)θdt\frac{\mathrm{d}L}{\mathrm{d}\theta} = -\int_{T}^{0} \mathbf{a}(t)^{\!\top} \frac{\partial f(\mathbf{z}(t), t, \theta)}{\partial \theta}\,\mathrm{d}t
(5)

Notice the limits run from TT down to 00: this is a single solve backward in time. And this is the trick that makes it constant-memory. The backward equations (4) and (5) both need the state z(t)\mathbf{z}(t) along the trajectory, but we threw the trajectory away. So we recompute it on the fly: we re-integrate the original ODE backward from the final state z(T)\mathbf{z}(T), right alongside the adjoint, in one combined reverse solve. The state, the adjoint, and the parameter gradient are stacked into one augmented system and integrated together from TT back to 00.

# gradients with O(1) memory in depth: never store the forward trajectory
def ode_block_backward(hT, f, theta, dL_dhT):
    # augmented state s = [h, a, grad], with the adjoint a = dL/dh
    def aug(s, t):
        h, a = s.h, s.a
        return [ f(h, t, theta),               # rebuild h by re-integrating
                 -a @ df_dh(f, h, t, theta),    # the adjoint ODE, eq (4)
                 -a @ df_dtheta(f, h, t, theta) ]  # the eq (5) integrand
    s0 = [hT, dL_dhT, zeros_like(theta)]
    h0, dL_dh0, dL_dtheta = odesolve(s0, aug, t1=1.0, t0=0.0, theta)
    return dL_dh0, dL_dtheta                  # one reverse solve, eq (5)

Because nothing from the forward pass is stored, the memory cost does not depend on how many steps the solver took. It is constant in depth. On MNIST the paper measures exactly this: the ODE-Net trains at O(1)O(1) memory, against O(L~)O(\tilde{L}) for backpropagating through the solver and O(L)O(L) for an ordinary residual network. Press play and watch where the memory goes.

Figure 3 · the memory trick
forward
Forward (first half): the solver runs the state from t0t_0 to t1t_1. The store-every-step baseline saves an activation at each step, so its meter climbs to full. The adjoint method saves only the endpoint. Backward: the adjoint flows back while the state is reconstructed on the way. The baseline still holds its whole tape (O(L~)O(\tilde{L})); the adjoint never held more than one point (O(1)O(1)), at any depth.

Put numbers on the two columns to feel the gap. Say the hidden state z\mathbf{z} is a vector of 100 numbers and the solver takes, on this input, some number of steps L~\tilde{L}. Backpropagating through the solver keeps one copy of the state for every step it took, so its activation memory is 100×L~100 \times \tilde{L} numbers and climbs with every step the solver decides to add. At 50 steps that is 5,000 stored numbers; at 500 steps, 50,000; the meter tracks the trajectory. The adjoint method stores the endpoint z(T)\mathbf{z}(T) and reconstructs the rest, so during the backward solve it is holding the current state, the current adjoint, and the running parameter gradient, a fixed handful of buffers whose size is set by the state and the parameter count, not by how many steps were taken. Double the number of solver steps and the backprop column doubles while the adjoint column does not move. That is the precise sense of the claim: the cost is flat in depth, the number of solver steps L~\tilde{L}. It is emphatically not flat in bytes, since the 100-number state, the parameters, and the augmented adjoint buffers are all still resident; what depth no longer multiplies is the part that used to grow without bound.

The reconstruction is affordable for the same reason the forward pass is. The backward equations (4) and (5) advance using f/z\partial f/\partial \mathbf{z} and f/θ\partial f/\partial \theta, the Jacobians of the little velocity network ff, and they touch them only through the vector–Jacobian products af/z\mathbf{a}^{\top}\,\partial f/\partial \mathbf{z} and af/θ\mathbf{a}^{\top}\,\partial f/\partial \theta. A vector–Jacobian product is exactly what ordinary reverse-mode autodiff computes in one backward sweep of ff, and for a small network that sweep costs about one forward evaluation of ff, never the full D×DD \times D Jacobian written out. So a single backward step of the augmented system costs about the same as a single forward step, the backward solve runs at roughly the cost of the forward one (the paper measures its NFE at roughly half the forward count), and the whole method is a clean trade: you pay a forward-pass-sized recomputation on the way back to avoid storing a trajectory whose length you no longer control.

Three caveats, and they are the parts the headline drops. First, "constant memory" means constant in depth, not constant in bytes: you still store the parameters, the current state, and the augmented adjoint state. Second, what you save is memory, not time: the backward pass is itself an ODE solve, the adjoint system run backward with its own adaptive steps and its own NFE (the paper finds roughly half the forward count), and NFE rises during training as the learned dynamics get stiffer and harder to integrate and the solver works harder to hold the same tolerance. Third, the gradient is exact only for the continuous problem: on a real computer, reconstructing z(t)\mathbf{z}(t) by running the dynamics backward can drift away from the true forward path, especially on stiff problems, and then the gradient is slightly wrong. The paper flags this itself and offers checkpointing (store a few states on the way forward, re-integrate between them) as the fix; later work, ANODE, makes the failure case precise. It is real, and it is the price of throwing the breadcrumbs away.

The same trick makes flows cheap

The continuous view has a side benefit the authors call unexpected, and it lands in generative modeling. A normalizing flow builds a complicated probability distribution by taking simple noise and pushing it through an invertible function ff. To know the density of a generated sample, you use the change of variables formula, which corrects for how much the function locally stretches or squeezes space:

z1=f(z0)logp(z1)=logp(z0)logdetfz0\mathbf{z}_1 = f(\mathbf{z}_0) \quad\Longrightarrow\quad \log p(\mathbf{z}_1) = \log p(\mathbf{z}_0) - \log\left| \det \frac{\partial f}{\partial \mathbf{z}_0} \right|
(6)

That determinant is the problem. For a transformation on DD dimensions it costs on the order of D3D^3 to compute. So ordinary flows are forced into awkward, restricted architectures (triangular Jacobians, or layers with a single hidden unit) just to keep the determinant tractable. It is the bottleneck of the whole approach.

Now make the transformation continuous, a flow driven by an ODE instead of a stack of discrete maps. The paper's Theorem 1, the instantaneous change of variables, says the log-density then obeys its own simple differential equation:

dlogp(z(t))dt=tr ⁣(fz(t))\frac{\mathrm{d}\log p(\mathbf{z}(t))}{\mathrm{d}t} = -\operatorname{tr}\!\left( \frac{\partial f}{\partial \mathbf{z}(t)} \right)
(7)

The determinant has become a trace. Instead of an O(D3)O(D^3) determinant of the Jacobian, you need only the sum of its diagonal, which is cheap and, when the dynamics is a sum over hidden units, scales linearly in their number. The geometry explains the swap: the determinant measures the volume change of one whole discrete jump, while in continuous time volume changes at a rate, the divergence of the velocity field, and the trace of the Jacobian is exactly that divergence, the sum of each coordinate's own stretching. That rate is also the rate at which probability mass piles up or thins out: where the flow squeezes, density rises; where it spreads, density falls.

The figure makes the price difference checkable. Slide the dimension DD: the determinant needs every cell of the D×DD \times D Jacobian and its cost curve explodes, the trace needs only the diagonal and its curve crawls.

Figure 4 · determinant cost vs trace cost
D = 16
Left, the D×DD \times D Jacobian. Equation (6)'s log-determinant needs all D2D^2 cells and costs on the order of D3D^3, which is what forces discrete flows into restricted architectures. Equation (7)'s trace needs only the DD diagonal entries and scales linearly. Right, both costs against DD on a log axis, the gap is a factor of D2D^2. That cheapness is what lets the dynamics ff be a free-form network.

Drag the time slider. A square mesh of probability mass starts as a plain Gaussian and is carried by a smooth velocity field out toward three data clusters. Watch the cells: where they shrink, density concentrates (the trace is doing its bookkeeping); where they stretch, density thins. The tracer points ride the same flow from noise to data.

Figure 5 · a flow you can read the density off
0.00
A mesh of probability mass (the base Gaussian) is carried by a velocity field toward three data clusters. Cells that shrink pack density in; cells that stretch thin it out. That local area-change rate is tr(f/z)-\operatorname{tr}(\partial f/\partial \mathbf{z}), the trace that replaces the discrete flow's logdet\log\det. The same flow run backward evaluates any point's density.

Two consequences fall out. The velocity field ff no longer has to be invertible by construction: as long as the ODE has a unique solution, the flow it generates is automatically reversible, so you can run it backward to score real data and forward to sample. And the architecture is freed. You can use wide, ordinary networks for ff instead of the contortions a tractable determinant used to force. The paper calls these continuous normalizing flows, and shows a continuous flow with MM hidden units can be at least as expressive as a discrete one with MM stacked single-unit layers, and often more.

Cheap is relative, though. The trace this paper buys is cheap next to the determinant, but computing an exact trace of a free-form Jacobian still costs on the order of DD passes. The trick that makes it a single pass, a stochastic estimator, came a few months later in FFJORD, and it is easy to import that speedup backward in memory and credit it to this paper. It belongs to the follow-up.

Data that arrives whenever it likes

The continuous picture has a third home, time series whose observations do not land on a tidy grid, and of the three it is the most natural fit. Medical records, network traffic, a patient's lab tests. A recurrent network wants evenly spaced steps, so the usual fix is to chop time into bins and pretend the data is regular, which throws away exactly the timing information that often matters most.

A continuous model has no grid to fight. The paper's latent ODE works like a variational autoencoder over a trajectory. An encoder (a recurrent net reading the observations backward) infers a distribution over a single latent starting point zt0\mathbf{z}_{t_0}. One shared ODE then evolves that point forward through continuous time, and a decoder reads out a prediction at any moment you ask for, observed or not:

zt0p(zt0),zt1,,ztN=ODESolve(zt0,f,θ,t0,,tN),xtip(xzti)\mathbf{z}_{t_0} \sim p(\mathbf{z}_{t_0}), \qquad \mathbf{z}_{t_1}, \ldots, \mathbf{z}_{t_N} = \texttt{ODESolve}(\mathbf{z}_{t_0}, f, \theta, t_0, \ldots, t_N), \qquad \mathbf{x}_{t_i} \sim p(\mathbf{x} \mid \mathbf{z}_{t_i})
(8)

One starting point is enough because the ODE is deterministic: the entire trajectory is pinned by where it starts, so all the uncertainty lives in which zt0\mathbf{z}_{t_0} to start from, and inferring that is the encoder's whole job. Because the latent trajectory is defined for all tt, you can evaluate it at irregular observation times with no special handling, since the solver can stop wherever an observation happens to land, and you can run it past the last observation to extrapolate. The figure shows the paper's spiral benchmark: a handful of noisy, irregularly-timed points over the first part of a spiral are all the model gets. The latent ODE fits a smooth trajectory through them and keeps going, past the data, to extrapolate. A recurrent baseline tracks the early part but wobbles and loses the thread once the observations run out.

Figure 6 · fit the irregular points, then keep going
n = 30
A few noisy, irregularly-timed observations cover only the first part of a spiral. The latent ODE reconstructs a smooth trajectory (solid) and extrapolates past the data (dashed). A gray RNN baseline drifts once the points run out. Drag the observation count: the paper's Table 2 predictive RMSE (lower is better) favors the latent ODE at every setting.

On that benchmark the latent ODE's predictive error is roughly half the RNN's with only 30 of 100 points observed (RMSE 0.1640.164 versus 0.3940.394), and it stays ahead as you add points. The model can also tie a Poisson process onto the same latent trajectory, so the times at which events occur become evidence about the hidden state too, which is the right way to think about a patient who takes a test because they already feel sick.

So what does it actually do

For a forward-pass ODESolve block, the layer looks like this, and the point of it is that "depth" is now a solver argument, not an architecture decision:

# an ODE block: the "layer" is a solver call, depth is continuous time
def ode_block(h0, f, theta):
    # f(h, t, theta) is a small net: it returns the velocity dh/dt at (h, t)
    return odesolve(h0, f, t0=0.0, t1=1.0, theta=theta)   # #steps is adaptive

On MNIST, the ODE-Net reaches 0.42%0.42\% test error with about 0.220.22M parameters, against a residual network's 0.41%0.41\% at 0.600.60M. The accuracy is tied (the ODE-Net is a hair worse), at roughly a third of the parameters, and at constant memory in depth. The paper itself walked back an early parameter-efficiency claim in its acknowledgements, and a plain Runge–Kutta network hits the same parameter count without being a continuous model at all, so the result is matched accuracy at far less memory, not a stronger classifier.

The continuous normalizing flows fit two-dimensional densities at least as well as discrete planar flows while training by maximum likelihood, and with smoother, more interpretable transformations. The latent ODE wins on the irregular time-series benchmark by a clear margin. Across all three, the through-line is the same: an ODE solver, dropped in as a differentiable model component, trained end-to-end by the adjoint method.

The contribution is sharper than "networks are ODEs," which predates this paper. The paper's own additions are the three that needed the continuous view to exist at all: a way to backpropagate through any black-box solver at memory constant in depth; the instantaneous change of variables that turns a determinant into a trace; and a generative time-series model that eats irregular data without a grid. The shared engine is one general-purpose vector–Jacobian product that lets a solver sit inside a larger model and be trained like any other layer.

Where the continuous picture breaks

A continuous flow is elegant, and the elegance is also the constraint. Because the state moves along a smooth trajectory governed by a single velocity field, two trajectories can never cross. If they touched, the velocity at that point would have to send the state two different directions at once, which a function cannot do. The map from input to output is therefore a homeomorphism, a continuous deformation with a continuous inverse: it can stretch and bend space, but it cannot tear it or fold one region through another.

That sounds abstract until you notice it forbids simple things. A Neural ODE on the line cannot learn the map that sends 1-1 to +1+1 and +1+1 to 1-1, because the two paths would have to cross at the origin. A plain residual network can, precisely because its discrete jumps are allowed to leap over each other. The fix, from Augmented Neural ODEs, is to give the state a few extra dimensions to move in, so trajectories that would collide on the line can pass each other in the added room. It also tends to make training faster and the solver cheaper. Both halves of the argument, the forbidden crossing and the augmented escape, are drawn below; scrub the time and watch the swap need a collision in one dimension and a detour in two:

Figure 7 · why trajectories cannot cross, and the way out
t = 1.00 · T
Left: state xx against time tt. The teal trajectories belong to one smooth flow and never intersect. The dashed pink pair is the forbidden swap of 1-1 and +1+1: the two paths must cross, and at the crossing both share the same (x,t)(x, t) yet need different slopes, while one velocity field gives one slope. That is the homeomorphism limit in a single picture. Right: augment the state with one extra dimension and the same two endpoints swap by circling around each other, the Augmented Neural ODEs fix.

The smaller frictions are practical. Mini-batching is awkward, because batching distinct inputs into one solver means the solver must satisfy its tolerance on the hardest element of the batch, which can cost extra evaluations. Uniqueness of the solution (so the model is even well-defined) needs the velocity network to be Lipschitz, whichtanh\tanh and ReLU satisfy, so this is a mild condition rather than a real obstacle. And the backward reconstruction can drift, the caveat from the adjoint section, which is why stiff problems may want checkpointing. None of these undo the core idea; they are the texture of working with a solver instead of a fixed stack.

The argument, end to end, is four steps long. A residual layer is one Euler step. Many Euler steps are an ODE solve, whose depth a good solver chooses for you. An ODE solve can be differentiated backward at constant memory by carrying an adjoint. And the same continuous view turns the determinant in a normalizing flow into a trace, and irregular time series into something a single shared dynamics can model. The layers were a discretization the whole time. This paper put the underlying equation back in charge.

Provenance Verified against primary literature
Chen, Rubanova, Bettencourt, Duvenaud (2018)The paper: black-box ODE solvers as a model component, trained by the adjoint method.
E (2017); Lu et al.; Haber & RuthottoResidual networks read as Euler steps of an ODE. The analogy this paper builds on, not one it introduced.
Pontryagin et al. (1962)The adjoint (costate) sensitivity method, borrowed from optimal control.
ANODE (2019); Augmented Neural ODEs (2019)Later work that qualifies the constant-memory and expressiveness claims.
correctionTwo things the popular telling overstates. The residual-network-as-ODE reading predates this paper (E 2017; Lu et al.; Haber & Ruthotto); the contribution here is the adjoint training and the continuous flows. And “constant memory” means constant in depth, bought by recomputing the forward trajectory backward, which can drift from the true path (ANODE). We teach both, with the correct signs.

Questions you might still have

?

If a residual network is already an ODE, what does this paper actually add?
The continuous reading is older (E 2017; Lu et al.; Haber & Ruthotto, all 2017). The contribution is training through a black-box solver by the adjoint method, at memory that is constant in depth, plus continuous normalizing flows.

?

How can memory be constant if you still have to backpropagate?
You never store the forward trajectory. You recompute it backward from the final state while carrying the adjoint a(t) = dL/dz. Only the current point is in memory, so the cost is constant in depth, not constant in everything.

?

Is the gradient from the adjoint method exact?
In continuous time, yes. On a computer it is only as good as the backward solve, which can drift from the true forward path. ANODE (2019) shows this can bite on stiff problems; the paper falls back to checkpointing when it does.

?

What can a Neural ODE not do?
Its flow cannot let two trajectories cross, so it cannot represent some maps a plain residual network can (Augmented Neural ODEs, 2019). Adding a few extra dimensions to the state fixes it.

Footnotes & further reading

  1. The paper: Chen, Rubanova, Bettencourt, Duvenaud, Neural Ordinary Differential Equations (NeurIPS 2018, best paper award). torchdiffeq is the authors' PyTorch implementation with differentiable ODE solvers.
  2. Residual networks as discretized dynamics, the analogy this paper builds on: Weinan E, A Proposal on Machine Learning via Dynamical Systems (2017); Lu et al., Beyond Finite Layer Neural Networks; and Haber & Ruthotto, Stable Architectures for Deep Neural Networks.
  3. The adjoint (costate) sensitivity method comes from optimal control: Pontryagin, Boltyanskii, Gamkrelidze, Mishchenko, The Mathematical Theory of Optimal Processes (1962).
  4. The discrete change of variables and normalizing flows: Rezende & Mohamed, Variational Inference with Normalizing Flows (2015), and Dinh et al., NICE (2014).
  5. The single-pass trace estimator that makes continuous flows scale (a follow-up, not in this paper): Grathwohl et al., FFJORD (2018).
  6. The two caveats we flag: Gholami et al., ANODE (2019), on the accuracy of the backward reconstruction, and Dupont et al., Augmented Neural ODEs (2019), on the no-crossing limitation and its fix.