Neural Ordinary Differential Equations
A residual network, taken all the way to its limit.
Stop writing down the layers. Write down the derivative of the hidden state, hand it to an ODE solver, and let the solver decide how deep the network needs to be. It can then be trained with memory that does not grow with depth.
Explaining the paperNeural Ordinary Differential EquationsWhat if "number of layers" were not a setting you pick, but a question you hand to a solver?
A deep network has a number you set before training: its depth. Pick 50 layers, or 100, or 1000, and you have committed. Each layer is a discrete step, and the steps are baked in. This explainer is about a paper that takes that integer and dissolves it into a continuous knob, by noticing that a particular kind of layer is already doing arithmetic that has a much older name.
The catch the whole paper hangs on is small and a little uncomfortable: the residual connection in every modern network is, line for line, one step of the oldest method for solving a differential equation. Once you see that, the layers stop being the model. The model is the equation underneath, and the layers are just one crude way to solve it. So you swap in a better solver, and several things you thought were fixed turn out to be dials.
None of the pieces are hard on their own. A residual layer is one Euler step. Many Euler steps are an ODE solve. An ODE solve can be differentiated cheaply by an old trick from control theory. And the same continuous view that buys cheap gradients also turns an expensive determinant in generative models into a cheap trace. We will build those up in order, and the paper falls out.
A network is a stack of small nudges
Start with the residual block, the idea that let networks get genuinely deep. A plain layer replaces its input with something new. A residual layer does not. It computes a small correction and adds it to what it was handed:
Reading the symbols: is the hidden state going into layer (a vector, or for a Transformer a whole tensor of token vectors), is what comes out, and is that layer's learned transformation with its own weights . The layer does not rebuild the state from scratch. It nudges it. This is the idea behind ResNets, and it is in every Transformer block: each layer only has to learn how to adjust a running state, never to regenerate it.
Picture the hidden state as a point, and the network as a sequence of nudges that walk that point from where it starts (the input) to where it ends (a representation you can read an answer off). Layer one nudges it a little. Layer two nudges the result a little more. Stack layers and you have walked the point along a path in discrete hops.
Now look at the shape of (1) with fresh eyes. "New position equals old position plus a step." If you have ever solved a differential equation by hand, that is the most basic thing you can write down.
Shrink the step to nothing
An ordinary differential equation is a rule that says, at every point and every moment, which way to move and how fast. A velocity. Euler's method, the simplest way to actually solve one on a computer, says: stand where you are, read off the velocity, take a small step of size in that direction, repeat.
Set the step size and that is exactly equation (1). A residual layer is an Euler step, with the step size quietly fixed to one, and the papers that first drew this picture wrote the step size in on purpose, precisely so the connection to a continuous equation would be clean.
A coarse Euler solver takes a few big steps and cuts the corners of the true curve. A good solver takes many small steps and tracks it closely. What happens to a residual network if you add more layers and shrink the step toward zero? In the limit, the discrete stack of nudges becomes a smooth, continuous flow, and the network is no longer specifying a sequence of layers. It is specifying the velocity field itself:
Read this as: a single small network , with one shared set of weights , reports the velocity of the hidden state at any position and any depth . The layer index has turned into a continuous time, and "depth" is now how long you let the equation run. To get the output, you start at the input and integrate this velocity forward to some final time :
The forward pass is now a call to an off-the-shelf ODE solver. The solver evaluates the little velocity network wherever it needs to, as many times as it needs to, and hands back . That box, ODESolve, is the whole network. It is a layer whose internals you do not write down; you write down its derivative and let the solver do the rest.
Drag the slider below. The smooth amber curves are four trajectories of the true continuous-depth ODE, with depth running up the page. The teal staircases are a residual network trying to follow each one with discrete Euler steps, one dot per layer. Few layers and the staircase overshoots and cuts corners. Add layers, shrink the step , and the staircase melts onto the curve. The limit being taken is precise: step size to zero and layer count to infinity together, with the total depth held fixed, and an Euler staircase converging onto the smooth solution as its steps shrink is exactly the convergence an ODE solver guarantees.
For a real trained residual network, though, the weights change however they like from layer to layer, so "take the limit and you get a clean ODE" is a motivating story, not a theorem; a genuine continuous limit needs the weights to vary smoothly with depth. The paper sidesteps the gap: instead of hoping a stack converges, it defines the model as the ODE from the start, with a single shared across all of , and trains that.
Let the solver pick the depth
Once the forward pass is an ODE solve, you inherit two centuries of numerical analysis for free, and the first thing it buys you is adaptive depth. Euler is the crudest solver there is. Modern ones (Runge–Kutta and its adaptive descendants) carry an internal estimate of their own error. Where the trajectory is bending sharply they take small, careful steps; where it is nearly straight they stride. You hand the solver an error tolerance, and it spends exactly as many evaluations of as that tolerance demands.
That count, the number of times the solver calls , is the closest thing a Neural ODE has to a depth. The paper calls it the number of function evaluations, (NFE), and unlike a layer count it is not fixed. It is chosen per input, and it can change as training proceeds. The forward-pass time is proportional to it.
Drag the tolerance below. Tighten it and the solver places more evaluation points (they cluster where the curve bends) and the error shrinks. Loosen it and the same trained network runs cheaper, with visible error. You can train at high accuracy and then, at deployment, dial the tolerance down for a real-time or low-power setting, without retraining anything.
The paper reports the flip side too: as a Neural ODE trains, its NFE creeps up, as if the model grows itself deeper to match the rising complexity of what it is learning. Depth stops being a hyperparameter and becomes something the model negotiates with the solver.
Backprop without the breadcrumbs
Now the hard part. The forward pass is a black-box solver call. How do you train it? The obvious answer is to record every operation the solver performed and backpropagate through all of them. That works, and it is exactly what you want to avoid, because it means storing the entire forward trajectory: every intermediate state, every internal stage of every step. Memory grows with the number of evaluations, which is the very thing we just made adaptive and potentially large.
The alternative is a method from 1960s optimal control called the adjoint sensitivity method. The idea is to ask, at every moment along the trajectory, a single question: how much would the final loss change if the state at this instant were nudged? Call that quantity the adjoint, following the paper's notation with for the state:
In an ordinary network, the gradient at one layer comes from the gradient at the next layer by the chain rule. The adjoint is the continuous version of that. It obeys its own differential equation, the instantaneous chain rule, run backward in time:
The minus sign matters, and it is correct as printed: the adjoint flows from the final time back to the start, accumulating sensitivity the way backprop accumulates gradient from output toward input. The term is just the Jacobian of the little velocity network, and the product is a vector–Jacobian product, the same primitive ordinary backprop already computes, at about the cost of one evaluation of . The gradient you actually want, with respect to the weights, is then one more integral along the way back:
Notice the limits run from down to : this is a single solve backward in time. And this is the trick that makes it constant-memory. The backward equations (4) and (5) both need the state along the trajectory, but we threw the trajectory away. So we recompute it on the fly: we re-integrate the original ODE backward from the final state , right alongside the adjoint, in one combined reverse solve. The state, the adjoint, and the parameter gradient are stacked into one augmented system and integrated together from back to .
# gradients with O(1) memory in depth: never store the forward trajectory
def ode_block_backward(hT, f, theta, dL_dhT):
# augmented state s = [h, a, grad], with the adjoint a = dL/dh
def aug(s, t):
h, a = s.h, s.a
return [ f(h, t, theta), # rebuild h by re-integrating
-a @ df_dh(f, h, t, theta), # the adjoint ODE, eq (4)
-a @ df_dtheta(f, h, t, theta) ] # the eq (5) integrand
s0 = [hT, dL_dhT, zeros_like(theta)]
h0, dL_dh0, dL_dtheta = odesolve(s0, aug, t1=1.0, t0=0.0, theta)
return dL_dh0, dL_dtheta # one reverse solve, eq (5)Because nothing from the forward pass is stored, the memory cost does not depend on how many steps the solver took. It is constant in depth. On MNIST the paper measures exactly this: the ODE-Net trains at memory, against for backpropagating through the solver and for an ordinary residual network. Press play and watch where the memory goes.
Put numbers on the two columns to feel the gap. Say the hidden state is a vector of 100 numbers and the solver takes, on this input, some number of steps . Backpropagating through the solver keeps one copy of the state for every step it took, so its activation memory is numbers and climbs with every step the solver decides to add. At 50 steps that is 5,000 stored numbers; at 500 steps, 50,000; the meter tracks the trajectory. The adjoint method stores the endpoint and reconstructs the rest, so during the backward solve it is holding the current state, the current adjoint, and the running parameter gradient, a fixed handful of buffers whose size is set by the state and the parameter count, not by how many steps were taken. Double the number of solver steps and the backprop column doubles while the adjoint column does not move. That is the precise sense of the claim: the cost is flat in depth, the number of solver steps . It is emphatically not flat in bytes, since the 100-number state, the parameters, and the augmented adjoint buffers are all still resident; what depth no longer multiplies is the part that used to grow without bound.
The reconstruction is affordable for the same reason the forward pass is. The backward equations (4) and (5) advance using and , the Jacobians of the little velocity network , and they touch them only through the vector–Jacobian products and . A vector–Jacobian product is exactly what ordinary reverse-mode autodiff computes in one backward sweep of , and for a small network that sweep costs about one forward evaluation of , never the full Jacobian written out. So a single backward step of the augmented system costs about the same as a single forward step, the backward solve runs at roughly the cost of the forward one (the paper measures its NFE at roughly half the forward count), and the whole method is a clean trade: you pay a forward-pass-sized recomputation on the way back to avoid storing a trajectory whose length you no longer control.
Three caveats, and they are the parts the headline drops. First, "constant memory" means constant in depth, not constant in bytes: you still store the parameters, the current state, and the augmented adjoint state. Second, what you save is memory, not time: the backward pass is itself an ODE solve, the adjoint system run backward with its own adaptive steps and its own NFE (the paper finds roughly half the forward count), and NFE rises during training as the learned dynamics get stiffer and harder to integrate and the solver works harder to hold the same tolerance. Third, the gradient is exact only for the continuous problem: on a real computer, reconstructing by running the dynamics backward can drift away from the true forward path, especially on stiff problems, and then the gradient is slightly wrong. The paper flags this itself and offers checkpointing (store a few states on the way forward, re-integrate between them) as the fix; later work, ANODE, makes the failure case precise. It is real, and it is the price of throwing the breadcrumbs away.
The same trick makes flows cheap
The continuous view has a side benefit the authors call unexpected, and it lands in generative modeling. A normalizing flow builds a complicated probability distribution by taking simple noise and pushing it through an invertible function . To know the density of a generated sample, you use the change of variables formula, which corrects for how much the function locally stretches or squeezes space:
That determinant is the problem. For a transformation on dimensions it costs on the order of to compute. So ordinary flows are forced into awkward, restricted architectures (triangular Jacobians, or layers with a single hidden unit) just to keep the determinant tractable. It is the bottleneck of the whole approach.
Now make the transformation continuous, a flow driven by an ODE instead of a stack of discrete maps. The paper's Theorem 1, the instantaneous change of variables, says the log-density then obeys its own simple differential equation:
The determinant has become a trace. Instead of an determinant of the Jacobian, you need only the sum of its diagonal, which is cheap and, when the dynamics is a sum over hidden units, scales linearly in their number. The geometry explains the swap: the determinant measures the volume change of one whole discrete jump, while in continuous time volume changes at a rate, the divergence of the velocity field, and the trace of the Jacobian is exactly that divergence, the sum of each coordinate's own stretching. That rate is also the rate at which probability mass piles up or thins out: where the flow squeezes, density rises; where it spreads, density falls.
The figure makes the price difference checkable. Slide the dimension : the determinant needs every cell of the Jacobian and its cost curve explodes, the trace needs only the diagonal and its curve crawls.
Drag the time slider. A square mesh of probability mass starts as a plain Gaussian and is carried by a smooth velocity field out toward three data clusters. Watch the cells: where they shrink, density concentrates (the trace is doing its bookkeeping); where they stretch, density thins. The tracer points ride the same flow from noise to data.
Two consequences fall out. The velocity field no longer has to be invertible by construction: as long as the ODE has a unique solution, the flow it generates is automatically reversible, so you can run it backward to score real data and forward to sample. And the architecture is freed. You can use wide, ordinary networks for instead of the contortions a tractable determinant used to force. The paper calls these continuous normalizing flows, and shows a continuous flow with hidden units can be at least as expressive as a discrete one with stacked single-unit layers, and often more.
Cheap is relative, though. The trace this paper buys is cheap next to the determinant, but computing an exact trace of a free-form Jacobian still costs on the order of passes. The trick that makes it a single pass, a stochastic estimator, came a few months later in FFJORD, and it is easy to import that speedup backward in memory and credit it to this paper. It belongs to the follow-up.
Data that arrives whenever it likes
The continuous picture has a third home, time series whose observations do not land on a tidy grid, and of the three it is the most natural fit. Medical records, network traffic, a patient's lab tests. A recurrent network wants evenly spaced steps, so the usual fix is to chop time into bins and pretend the data is regular, which throws away exactly the timing information that often matters most.
A continuous model has no grid to fight. The paper's latent ODE works like a variational autoencoder over a trajectory. An encoder (a recurrent net reading the observations backward) infers a distribution over a single latent starting point . One shared ODE then evolves that point forward through continuous time, and a decoder reads out a prediction at any moment you ask for, observed or not:
One starting point is enough because the ODE is deterministic: the entire trajectory is pinned by where it starts, so all the uncertainty lives in which to start from, and inferring that is the encoder's whole job. Because the latent trajectory is defined for all , you can evaluate it at irregular observation times with no special handling, since the solver can stop wherever an observation happens to land, and you can run it past the last observation to extrapolate. The figure shows the paper's spiral benchmark: a handful of noisy, irregularly-timed points over the first part of a spiral are all the model gets. The latent ODE fits a smooth trajectory through them and keeps going, past the data, to extrapolate. A recurrent baseline tracks the early part but wobbles and loses the thread once the observations run out.
On that benchmark the latent ODE's predictive error is roughly half the RNN's with only 30 of 100 points observed (RMSE versus ), and it stays ahead as you add points. The model can also tie a Poisson process onto the same latent trajectory, so the times at which events occur become evidence about the hidden state too, which is the right way to think about a patient who takes a test because they already feel sick.
So what does it actually do
For a forward-pass ODESolve block, the layer looks like this, and the point of it is that "depth" is now a solver argument, not an architecture decision:
# an ODE block: the "layer" is a solver call, depth is continuous time
def ode_block(h0, f, theta):
# f(h, t, theta) is a small net: it returns the velocity dh/dt at (h, t)
return odesolve(h0, f, t0=0.0, t1=1.0, theta=theta) # #steps is adaptiveOn MNIST, the ODE-Net reaches test error with about M parameters, against a residual network's at M. The accuracy is tied (the ODE-Net is a hair worse), at roughly a third of the parameters, and at constant memory in depth. The paper itself walked back an early parameter-efficiency claim in its acknowledgements, and a plain Runge–Kutta network hits the same parameter count without being a continuous model at all, so the result is matched accuracy at far less memory, not a stronger classifier.
The continuous normalizing flows fit two-dimensional densities at least as well as discrete planar flows while training by maximum likelihood, and with smoother, more interpretable transformations. The latent ODE wins on the irregular time-series benchmark by a clear margin. Across all three, the through-line is the same: an ODE solver, dropped in as a differentiable model component, trained end-to-end by the adjoint method.
The contribution is sharper than "networks are ODEs," which predates this paper. The paper's own additions are the three that needed the continuous view to exist at all: a way to backpropagate through any black-box solver at memory constant in depth; the instantaneous change of variables that turns a determinant into a trace; and a generative time-series model that eats irregular data without a grid. The shared engine is one general-purpose vector–Jacobian product that lets a solver sit inside a larger model and be trained like any other layer.
Where the continuous picture breaks
A continuous flow is elegant, and the elegance is also the constraint. Because the state moves along a smooth trajectory governed by a single velocity field, two trajectories can never cross. If they touched, the velocity at that point would have to send the state two different directions at once, which a function cannot do. The map from input to output is therefore a homeomorphism, a continuous deformation with a continuous inverse: it can stretch and bend space, but it cannot tear it or fold one region through another.
That sounds abstract until you notice it forbids simple things. A Neural ODE on the line cannot learn the map that sends to and to , because the two paths would have to cross at the origin. A plain residual network can, precisely because its discrete jumps are allowed to leap over each other. The fix, from Augmented Neural ODEs, is to give the state a few extra dimensions to move in, so trajectories that would collide on the line can pass each other in the added room. It also tends to make training faster and the solver cheaper. Both halves of the argument, the forbidden crossing and the augmented escape, are drawn below; scrub the time and watch the swap need a collision in one dimension and a detour in two:
The smaller frictions are practical. Mini-batching is awkward, because batching distinct inputs into one solver means the solver must satisfy its tolerance on the hardest element of the batch, which can cost extra evaluations. Uniqueness of the solution (so the model is even well-defined) needs the velocity network to be Lipschitz, which and ReLU satisfy, so this is a mild condition rather than a real obstacle. And the backward reconstruction can drift, the caveat from the adjoint section, which is why stiff problems may want checkpointing. None of these undo the core idea; they are the texture of working with a solver instead of a fixed stack.
The argument, end to end, is four steps long. A residual layer is one Euler step. Many Euler steps are an ODE solve, whose depth a good solver chooses for you. An ODE solve can be differentiated backward at constant memory by carrying an adjoint. And the same continuous view turns the determinant in a normalizing flow into a trace, and irregular time series into something a single shared dynamics can model. The layers were a discretization the whole time. This paper put the underlying equation back in charge.
Questions you might still have
If a residual network is already an ODE, what does this paper actually add?
The continuous reading is older (E 2017; Lu et al.; Haber & Ruthotto, all 2017). The contribution is training through a black-box solver by the adjoint method, at memory that is constant in depth, plus continuous normalizing flows.
How can memory be constant if you still have to backpropagate?
You never store the forward trajectory. You recompute it backward from the final state while carrying the adjoint a(t) = dL/dz. Only the current point is in memory, so the cost is constant in depth, not constant in everything.
Is the gradient from the adjoint method exact?
In continuous time, yes. On a computer it is only as good as the backward solve, which can drift from the true forward path. ANODE (2019) shows this can bite on stiff problems; the paper falls back to checkpointing when it does.
What can a Neural ODE not do?
Its flow cannot let two trajectories cross, so it cannot represent some maps a plain residual network can (Augmented Neural ODEs, 2019). Adding a few extra dimensions to the state fixes it.
Footnotes & further reading
- The paper: Chen, Rubanova, Bettencourt, Duvenaud, Neural Ordinary Differential Equations (NeurIPS 2018, best paper award). torchdiffeq is the authors' PyTorch implementation with differentiable ODE solvers.
- Residual networks as discretized dynamics, the analogy this paper builds on: Weinan E, A Proposal on Machine Learning via Dynamical Systems (2017); Lu et al., Beyond Finite Layer Neural Networks; and Haber & Ruthotto, Stable Architectures for Deep Neural Networks.
- The adjoint (costate) sensitivity method comes from optimal control: Pontryagin, Boltyanskii, Gamkrelidze, Mishchenko, The Mathematical Theory of Optimal Processes (1962).
- The discrete change of variables and normalizing flows: Rezende & Mohamed, Variational Inference with Normalizing Flows (2015), and Dinh et al., NICE (2014).
- The single-pass trace estimator that makes continuous flows scale (a follow-up, not in this paper): Grathwohl et al., FFJORD (2018).
- The two caveats we flag: Gholami et al., ANODE (2019), on the accuracy of the backward reconstruction, and Dupont et al., Augmented Neural ODEs (2019), on the no-crossing limitation and its fix.
How could this explainer be improved? Found an error, or something unclear? I read every message.