Adam: A Method for Stochastic Optimization
Two running averages give every weight its own learning rate.
Adam is the optimizer that trains almost everything. It is six lines of arithmetic, and every line is there to fix a specific way that plain gradient descent fails. Build those fixes one at a time and the whole method, including the part the authors got wrong, falls out.
How do you set a learning rate for a model with a hundred billion parameters, when every one of them wants a different one?
Almost every neural network you have heard of was trained by Adam or a close relative. It is the first optimizer most people reach for in PyTorch or JAX, the thing you use before you have a reason to use anything else. The paper that introduced it is nine pages long and the core algorithm is a single short box. That combination, total ubiquity and apparent simplicity, is a good reason to actually understand it, because the simplicity is hiding a few careful decisions, and one careful decision that was quietly wrong.
The plan here is to earn the algorithm rather than memorize it. We start from plain stochastic gradient descent and watch it fail in two ordinary, fixable ways. Each fix is one of Adam's running averages. A third piece patches a cold-start problem the averages create. Then we read what the assembled update actually means, which turns out to explain why Adam barely needs tuning. Last, the convergence proof in the paper does not hold: there is a three-line example where Adam walks to the wrong answer, and the fix has a name. None of the pieces is hard. Stacked up, they are the whole method.
A noisy slope, and only the slope
Training a model means minimizing a loss $f(\theta)$: a single number that measures how wrong the model is, as a function of its parameters $\theta$ (all the weights, stacked into one long vector). The loss defines a landscape over parameter space, and we want its lowest valley. The only local information we have is the gradient $g_t = \nabla_\theta f(\theta_{t-1})$, the vector of slopes that points in the direction the loss increases fastest. So we step the other way:
$$\theta_t = \theta_{t-1} - \alpha\, g_t \tag{1}$$
That is gradient descent. The minus sign is the whole idea, since the gradient points uphill and we want down. The scalar $\alpha$ is the learning rate, the size of the step. (Walking downhill in fog is the usual picture, and it is a good one: you feel the slope under your feet and step downhill, blind to anything past your boots.)
Two things make this harder than the picture suggests. First, we never compute the true gradient. The true loss is an average over the whole dataset, which is far too expensive to touch every step, so we estimate the gradient on a small random minibatch of examples. That estimate is unbiased, meaning on average it points the right way, but any single batch is noisy, and the noise only shrinks like one over the square root of the batch size. Doubling the batch buys you less than you would hope, so the cheaper move is to average gradients across steps instead. This is stochastic gradient descent, and the stochastic part is the fog flickering. Second, we use only the slope, never the curvature. The curvature lives in a matrix (the Hessian) with one row and column per parameter; for a model with billions of parameters you could not store it, let alone invert it. So we navigate by first-order information alone and try to be clever about the one knob we have, the step.
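To make the square-root scaling concrete, here is a minimal NumPy sketch. It assumes a toy setting that is not from the paper: every per-example gradient is a draw from a normal distribution with mean 1 and standard deviation 1, and the helper name `gradient_noise` is made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_noise(batch_size, trials=20_000):
    """Spread of a minibatch-mean gradient when per-example gradients are N(1, 1)."""
    per_example = rng.normal(loc=1.0, scale=1.0, size=(trials, batch_size))
    batch_means = per_example.mean(axis=1)   # one gradient estimate per trial
    return batch_means.std()

noise_16 = gradient_noise(16)
noise_64 = gradient_noise(64)
print(noise_16, noise_64, noise_64 / noise_16)
```

Under these assumptions, a batch four times larger should only about halve the noise, which is the one-over-square-root law in miniature.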
A single fixed step size runs into two failures that show up constantly. When the landscape is a long narrow valley, steep across and gentle along, a step large enough to make progress down the gentle axis is far too large for the steep one, and you bounce between the walls instead of descending. When the gradient is noisy, a fixed step jitters around the true direction and wastes most of its motion. Adam is two fixes, one for each failure: average the gradient over time to calm the noise and the bouncing, and give every parameter its own step size so the steep and gentle axes can be served at once. We take them in that order.
First fix: average the direction
If consecutive gradient estimates disagree because of noise, the obvious move is to average them. Adam keeps a running average of the gradient, the first moment $m_t$, and steps along the average rather than the raw sample:
$$m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t \tag{2}$$
This is an exponential moving average (EMA). Each step it keeps a fraction $\beta_1$ of the old average and mixes in a fraction $1 - \beta_1$ of the new gradient, so the weight on a gradient from $k$ steps ago decays like $\beta_1^k$. With the default $\beta_1 = 0.9$ the average has a memory of roughly $1/(1 - \beta_1) = 10$ steps. Opposite-sign wobbles from the noise cancel in the average; a persistent direction survives. In the narrow-valley picture, the side-to-side bouncing averages away while the slow drift down the valley accumulates. This is the same idea as classical momentum, a heavy ball that keeps rolling in the direction it has been going, so it is often called the momentum term.
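The decaying weights can be read straight off the recursion. The sketch below assumes the zero-seeded average described above; the helper `ema` is illustrative and not taken from any library. It feeds a single unit impulse through the average and compares the response with the geometric weights.

```python
import numpy as np

def ema(values, beta):
    """Exponential moving average, seeded at zero."""
    avg, out = 0.0, []
    for x in values:
        avg = beta * avg + (1.0 - beta) * x
        out.append(avg)
    return np.array(out)

beta1 = 0.9
k = np.arange(50)
weights = (1.0 - beta1) * beta1 ** k   # weight on the gradient from k steps ago

# A single unit gradient followed by zeros reads those weights back out.
impulse = np.zeros(50)
impulse[0] = 1.0
response = ema(impulse, beta1)

memory = 1.0 / (1.0 - beta1)           # rule-of-thumb memory length in steps
print(memory, weights[:3], response[:3])
```

The two arrays agree, and the rule-of-thumb memory comes out at ten steps for the default decay rate.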
One detail is worth flagging, because it trips people who know the classical version. Textbook heavy-ball momentum keeps a velocity that accumulates: with a momentum coefficient of $0.9$ a steady gradient builds a velocity ten times its size. Adam's $m_t$ instead uses the matched weights $\beta_1$ and $1 - \beta_1$, which sum to one, so a steady gradient drives $m_t$ to the gradient itself, not ten times it. It is the same averaging idea kept on the raw-gradient scale, which is what lets the step size downstream stay interpretable. (And plain Adam evaluates the gradient at the current point, not at a look-ahead point. The look-ahead variant is Nesterov momentum, and the Adam version of that is a separate method called Nadam.)
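The factor of ten is easy to check numerically. This is a small sketch under the assumption that "textbook heavy ball" means the accumulating form `u = mu * u + g`; the names `steady_state`, `mu`, and `u` are chosen for this example.

```python
def steady_state(update, g=1.0, steps=500):
    """Apply an averaging rule to a constant gradient until it settles."""
    state = 0.0
    for _ in range(steps):
        state = update(state, g)
    return state

mu = beta1 = 0.9
heavy_ball = steady_state(lambda u, g: mu * u + g)                # accumulating velocity
adam_m = steady_state(lambda m, g: beta1 * m + (1 - beta1) * g)   # matched weights

print(heavy_ball, adam_m)
```

With a steady unit gradient the accumulating velocity settles near ten while the matched-weight average settles near one, which is the difference in scale the paragraph describes.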
Drag $\beta_1$ below. The amber dots are the raw, noisy gradient, which here flips sign partway through to stand in for a real change of direction during training. The teal curve is the average. Near zero it chases every sample; near one it is smooth but slow to notice the flip. That lag is the cost of a long memory, and it is why $\beta_1$ is not set higher than it is.
Averaging the direction is half of Adam. It calms the noise and damps the valley bouncing, but it does nothing about the deeper problem, which is that one global step size cannot be right for every parameter at once.
Second fix: a learning rate per parameter
Different parameters live on different slopes. A weight that always sees large gradients needs small steps or it will overshoot; a weight that sees small or rare gradients needs large steps or it will never move. A single $\alpha$ forces one compromise on all of them. The fix is to give each parameter its own effective step, set from how large that parameter's gradients have actually been.
The first method to do this well was AdaGrad. For each parameter it accumulates the sum of its squared gradients and divides the step by the square root of that sum:
$$\theta_t = \theta_{t-1} - \frac{\alpha}{\epsilon + \sqrt{\sum_{i=1}^{t} g_i^2}}\; g_t \tag{3}$$
A parameter with a long history of big gradients gets a big divisor and so a small step; a rarely-active one keeps a large step. That is exactly the behavior you want for sparse features, and it is why AdaGrad caught on. (The smoothing constant $\epsilon$ sits outside the square root and only guards against dividing by zero; its default is zero. Many code libraries instead tuck a small constant inside the root, which is a convenient variant, not the original.)
But AdaGrad has a built-in problem for long training runs. The divisor is a cumulative sum, so it only ever grows, which means the effective step only ever shrinks, heading to zero whether or not the gradients are still carrying useful signal. On a convex problem with a decaying schedule that is a feature. On a long non-convex deep-learning run it is a way to stall.
RMSProp, an unpublished method from a Coursera lecture, fixed this with one change: replace the ever-growing sum with an exponential moving average of the squared gradient. This is Adam's second moment:
$$v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2 \tag{4}$$
Same shape as the first moment in (2), but on the squared gradient. Because it is an average rather than a sum, it forgets: it tracks the recent size of a parameter's gradients instead of their whole history, so the effective step adapts up and down instead of decaying to nothing. The decay rate $\beta_2$ sets the length of that memory.
The figure makes the difference concrete. The gradient magnitude drops to a third of its value partway through, the kind of thing that happens when training settles. AdaGrad's step (amber) was already small from the early large gradients and stays pinned near zero forever; it cannot recover. Adam's step (teal) climbs back once the gradients calm down. Push $\beta_2$ toward one and Adam's memory grows long, its recovery slows, and it starts to behave like AdaGrad again. That is not a coincidence; the two are the same method in different limits, which we will pin down later.
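The same comparison can be reproduced in a few lines. The sketch below is not the figure's actual data: it assumes a gradient that sits at 3 for 2,000 steps and then drops to 1, and it tracks the per-unit-gradient step size (the learning rate divided by each method's divisor). All variable names are chosen for this example.

```python
import numpy as np

alpha, beta2, eps = 1e-3, 0.999, 1e-8
g = np.concatenate([np.full(2000, 3.0), np.full(8000, 1.0)])  # magnitude drops to a third

G = 0.0   # AdaGrad: cumulative sum of squared gradients
v = 0.0   # Adam / RMSProp: exponential moving average of squared gradients
adagrad_step, ema_step = [], []
for t, g_t in enumerate(g, start=1):
    G += g_t ** 2
    v = beta2 * v + (1 - beta2) * g_t ** 2
    v_hat = v / (1 - beta2 ** t)                    # bias-corrected average
    adagrad_step.append(alpha / (np.sqrt(G) + eps))
    ema_step.append(alpha / (np.sqrt(v_hat) + eps))

print(adagrad_step[1999], adagrad_step[-1])   # keeps shrinking
print(ema_step[1999], ema_step[-1])           # recovers after the drop
```

In this toy run the cumulative divisor never lets the step grow again, while the forgetting average lets it climb back by roughly the factor of three that the gradient lost.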
Adam: momentum over RMS
Now assemble the two fixes. Use the first moment $m_t$ as the direction (the smoothed gradient) and the square root of the second moment $v_t$ as the per-parameter divisor (the recent gradient size). One number on top for where to go, one number underneath for how far. With a small constant $\epsilon$ to keep the division safe, the update is:
$$\theta_t = \theta_{t-1} - \alpha\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} \tag{5}$$
The hats on $\hat m_t$ and $\hat v_t$ are a correction we unpack in the next section; for now read them as $m_t$ and $v_t$. That is the entire method. The name Adam comes from adaptive moment estimation: the two moving averages are estimates of the first and second moments of the gradient, and they adapt the step. Here it is as code, the whole optimizer in six lines:
# Adam: one step on the parameter vector theta (all ops elementwise)
g = grad(loss, theta) # gradient at the current theta
m = b1 * m + (1 - b1) * g # 1st moment: a running mean of g
v = b2 * v + (1 - b2) * g * g # 2nd moment: a running mean of g^2
m_hat = m / (1 - b1**t) # undo the cold-start bias
v_hat = v / (1 - b2**t)
theta = theta - a * m_hat / (sqrt(v_hat) + eps) # the update
# start: m = v = 0, t counts from 1
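# defaults: a = 1e-3, b1 = 0.9, b2 = 0.999, eps = 1e-8
It helps to push one gradient through by hand. Take a single weight, the defaults, and a first gradient $g_1$ at step $t = 1$. The averages start at zero, so $m_1 = (1 - \beta_1)\, g_1 = 0.1\, g_1$ and $v_1 = (1 - \beta_2)\, g_1^2 = 0.001\, g_1^2$. After the correction (next section) these become $\hat m_1 = g_1$ and $\hat v_1 = g_1^2$, so $\sqrt{\hat v_1} = |g_1|$, and the step is
$$\Delta\theta_1 = -\alpha\, \frac{\hat m_1}{\sqrt{\hat v_1} + \epsilon} = -\alpha\, \frac{g_1}{|g_1| + \epsilon} \approx -\alpha\, \operatorname{sign}(g_1)$$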
The weight moves by almost exactly $\alpha$ itself, against the gradient, not $\alpha$ times the gradient (which could be anything). That is what dividing by the gradient's own size buys you: the raw scale of the gradient cancels, and what is left is a step you actually chose. Hold that thought; it is the subject of the trust-region section.
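For readers who want to run the box, here is one way to wrap those six lines in NumPy. This is a sketch, not the authors' reference code: the function name `adam_step`, its argument order, and the three test gradients are choices made for this example, and the gradient is passed in rather than computed so the block needs no autodiff library.

```python
import numpy as np

def adam_step(theta, g, m, v, t, a=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; returns the new (theta, m, v). All ops are elementwise."""
    m = b1 * m + (1 - b1) * g            # 1st moment: running mean of g
    v = b2 * v + (1 - b2) * g * g        # 2nd moment: running mean of g^2
    m_hat = m / (1 - b1 ** t)            # undo the cold-start bias
    v_hat = v / (1 - b2 ** t)
    theta = theta - a * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# First step from a cold start, for gradients of very different sizes.
theta0 = np.zeros(3)
g1 = np.array([1e-4, 1.0, -250.0])       # hypothetical first gradients
theta1, m1, v1 = adam_step(theta0, g1, m=np.zeros(3), v=np.zeros(3), t=1)
print(theta1)
```

Whatever the size of the first gradient, each weight should move by close to the learning rate in the direction opposite its gradient's sign, which matches the hand calculation.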
First, watch the assembled optimizer earn its keep. Below, three optimizers start at the same point in the same valley and share one global step size. The valley is ill-conditioned: steep across, gentle along, with a knob for how lopsided. Plain SGD bounces across the steep axis and crawls along the floor, and once the valley is steep enough it overshoots and diverges. Momentum helps but overshoots and oscillates. Adam gives each axis its own step, so it goes nearly straight to the bottom, and changing the conditioning barely touches it.
That robustness to conditioning is the practical reason Adam is a good default. SGD can match or beat it once you have tuned the learning rate and a schedule to the specific problem. Adam tends to be close to right out of the box, because the per-parameter scaling absorbs differences in slope that SGD makes you handle by hand.
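A stripped-down version of that experiment fits in a short script. The setup is an assumption for illustration, not the figure's code: a two-parameter quadratic $f(x, y) = \tfrac{1}{2}(x^2 + \kappa y^2)$ with exact gradients, a shared step size of 0.01, and a hypothetical helper `run`.

```python
import numpy as np

def run(optimizer, kappa, steps=50, a=0.01):
    """Minimize f(x, y) = 0.5 * (x**2 + kappa * y**2), starting from (1, 1)."""
    scale = np.array([1.0, kappa])
    theta = np.array([1.0, 1.0])
    m = np.zeros(2)
    v = np.zeros(2)
    for t in range(1, steps + 1):
        g = scale * theta                      # exact gradient of the quadratic
        if optimizer == "sgd":
            theta = theta - a * g
        else:                                  # Adam with the default decay rates
            m = 0.9 * m + 0.1 * g
            v = 0.999 * v + 0.001 * g * g
            m_hat = m / (1 - 0.9 ** t)
            v_hat = v / (1 - 0.999 ** t)
            theta = theta - a * m_hat / (np.sqrt(v_hat) + 1e-8)
    return theta

sgd_theta = run("sgd", kappa=1000.0)
adam_theta = run("adam", kappa=1000.0)
adam_mild = run("adam", kappa=10.0)
print(sgd_theta, adam_theta, adam_mild)
```

With the steep axis a thousand times steeper than the gentle one, plain SGD at this step size blows up along the steep axis. Adam moves both coordinates along essentially the same path, and rerunning it with a much milder valley gives essentially the same result, because the per-parameter divisor cancels each axis's scale.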
The cold start
Both moving averages start at zero, because we have no gradients yet when we begin. That seems harmless and is not. For the first many steps the average is a blend of real gradients and the zero it was seeded with, so it reads too low, biased toward zero. The bias is worst for the second moment, where $\beta_2 = 0.999$ means the average mixes in only a thousandth of each new sample, so it crawls up from zero over hundreds of steps.
The exact size of the bias is easy to compute. Unroll the recursion (4) and the average is a weighted sum of every past squared gradient:
$$v_t = (1 - \beta_2) \sum_{i=1}^{t} \beta_2^{\,t-i}\, g_i^2$$
Take expectations, and if the underlying squared gradient is roughly steady the weights pull out into a clean factor:
$$\mathbb{E}[v_t] = \mathbb{E}[g_t^2]\,(1 - \beta_2) \sum_{i=1}^{t} \beta_2^{\,t-i} + \zeta = \mathbb{E}[g_t^2]\,(1 - \beta_2^t) + \zeta$$
The leftover $\zeta$ is zero if the squared gradient is stationary and small otherwise, because $\beta_2$ down-weights the distant past where things differ most. (The paper's text calls this decay rate $\beta_1$ at one point, but the whole derivation is about the second moment, so the symbol should read $\beta_2$. A harmless typo, but worth knowing if you read the original.) What matters is the factor $1 - \beta_2^t$: the weights form a geometric series that sums to exactly this, so the average is exactly that fraction of the true value, the total weight it has put on real, nonzero gradients so far. Divide it back out and the estimate is unbiased from the first step:
$$\hat m_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat v_t = \frac{v_t}{1 - \beta_2^t}$$
Those are the hats from (5). Think of a thermometer started at zero in a warm room: its reading climbs toward the true temperature, and dividing by "the fraction of the way it has warmed up" recovers the real value early instead of waiting. At $t = 1$ the correction for the second moment is a division by $1 - \beta_2 = 0.001$, which is exactly what turned $v_1$ into $\hat v_1$ in the worked example. As $t$ grows $\beta_2^t$ goes to zero and the correction fades to one.
Skip the correction and the early steps misbehave in a specific way. A second moment biased low makes its square root too small, and dividing by a too-small number makes the step too large, exactly when the estimates are least trustworthy. The figure shows the raw average (grey) crawling up toward the true value (amber) while the corrected one (teal) sits on the truth from the start. Push $\beta_2$ toward one and the gap yawns open: at $\beta_2 = 0.999$ the raw average is still only a few percent of the truth after dozens of steps. That is why the correction matters most when the memory is long, which is also when you most need it (sparse gradients call for a large $\beta_2$).
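The size of that gap can be checked directly. The sketch below assumes the simplest possible case, a squared gradient that is exactly constant (here 4.0, an arbitrary value), so the "true" second moment is known and the leftover term is zero.

```python
beta2 = 0.999
true_second_moment = 4.0      # assumed steady squared gradient (g = 2 every step)

v = 0.0
raw, corrected = [], []
for t in range(1, 51):
    v = beta2 * v + (1 - beta2) * true_second_moment
    raw.append(v)                              # biased toward the zero seed
    corrected.append(v / (1 - beta2 ** t))     # bias-corrected estimate

print(raw[0], raw[-1])              # still a small fraction of 4.0 after 50 steps
print(corrected[0], corrected[-1])  # on the true value from the first step
```

In this idealized case the raw average has reached only about five percent of the true value after fifty steps, while the corrected estimate equals it from the first step.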
What the ratio means
Set aside $\epsilon$ and look at the heart of the update, the ratio $\hat m_t / \sqrt{\hat v_t}$. It is the average gradient divided by the root-mean-square size of the gradient. Be precise about what the denominator is: the second moment estimates $\mathbb{E}[g^2]$, the raw second moment, not the variance. They coincide only when the gradient has zero mean. So $\sqrt{\hat v_t}$ is an RMS magnitude, a measure of how big the gradient typically is, signal and noise together.
Dividing the mean by the RMS does two useful things at once. It is scale-invariant: multiply every gradient by a constant $c > 0$ and the top scales by $c$, the bottom by $\sqrt{c^2} = c$, and the ratio is unchanged. The update does not care whether your loss is measured in dollars or cents. And the ratio is bounded. Because the average of $g$ can never exceed its RMS in size ($|\mathbb{E}[g]| \le \sqrt{\mathbb{E}[g^2]}$, which is just the statement that variance is nonnegative), the ratio sits between about $-1$ and $+1$, so
$$|\Delta\theta_t| = \alpha\, \frac{|\hat m_t|}{\sqrt{\hat v_t} + \epsilon} \lesssim \alpha$$
Each step moves a parameter by at most about $\alpha$, regardless of how steep or gentle its slope is. The authors call this a trust region: $\alpha$ sets how far you trust the current gradient before re-checking, and you can usually guess the right order of magnitude in advance, which is most of why Adam needs so little tuning. (The strict worst case is a touch larger. The bound is really two cases, and at the default decay rates the looser one can reach about $3.2\,\alpha$, the prefactor being $(1 - \beta_1)/\sqrt{1 - \beta_2}$ with the bias-correction factors pulled apart, in the extreme where a parameter sees a single nonzero gradient and nothing else. In ordinary, non-sparse training you are safely near $\alpha$.)
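That extreme case can be simulated. The sketch assumes a parameter whose gradient is exactly zero for 9,999 steps and then takes one nonzero value (5.0, chosen arbitrarily), late enough that both bias-correction factors are close to one.

```python
import math

b1, b2 = 0.9, 0.999
m = v = 0.0
T = 10_000
for t in range(1, T + 1):
    g = 5.0 if t == T else 0.0        # zero gradient until the very last step
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g

m_hat = m / (1 - b1 ** T)
v_hat = v / (1 - b2 ** T)
ratio = m_hat / math.sqrt(v_hat)      # step size in units of the learning rate, epsilon set aside
prefactor = (1 - b1) / math.sqrt(1 - b2)
print(ratio, prefactor)
```

Under this assumption the step comes out a little over three times the learning rate, matching the prefactor, which is why the bound of one learning rate per step is a typical-case statement rather than a strict one.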
The ratio also explains a behavior the paper leans on. A parameter whose gradient is consistent gets a near-full step: the average and the RMS are both about the gradient's size, so the ratio is near $\pm 1$. A parameter whose gradient is mostly noise gets a near-zero step: the average cancels toward zero while the RMS stays large, so the ratio collapses. The same thing happens to every parameter as training approaches a minimum, where the true gradient shrinks and what is left is mostly noise. The steps then shrink on their own, a kind of automatic annealing that nobody has to schedule. Slide the noise level below and watch a confident step dissolve into a cautious one.
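Both ends of that behavior show up with two hand-built gradient sequences. This is a deterministic stand-in for the noisy case, assumed for illustration: one sequence keeps the same sign every step, the other flips sign every step so that its mean is zero. The helper `adam_ratio` is not from the paper.

```python
import math

def adam_ratio(grads, b1=0.9, b2=0.999):
    """Return m_hat / sqrt(v_hat) after consuming grads (epsilon set aside)."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
    return (m / (1 - b1 ** t)) / math.sqrt(v / (1 - b2 ** t))

consistent = [1.0] * 200                                   # same sign every step
flipping = [(-1.0) ** k for k in range(200)]               # pure sign noise, zero mean

r_consistent = adam_ratio(consistent)
r_flipping = adam_ratio(flipping)
r_rescaled = adam_ratio([100.0 * g for g in consistent])   # scale-invariance check
print(r_consistent, r_flipping, r_rescaled)
```

The consistent sequence earns a ratio of about one, the flipping sequence a ratio near zero even though its gradients are just as large, and rescaling the gradients by a hundred leaves the ratio unchanged.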
When the proof breaks
The paper does not only propose Adam; it offers a proof that Adam converges, in the language of online convex optimization. The setup is a game: at each round you pick parameters, then an adversary reveals a convex loss, and your regret is how much worse you did than the single best fixed parameter you could have committed to in hindsight. The adversarial framing is deliberately pessimistic: a method that holds up against a worst-case sequence of losses will certainly handle the friendlier fixed loss of real training.
If regret grows slower than $T$, the number of rounds, then the average regret goes to zero, which is the online way of saying you converge to the best answer. The paper claims a bound of order $O(\sqrt{T})$, the standard target.
Three years later, Reddi, Kale, and Kumar showed the proof is wrong. The argument relies on a quantity built from the change in the inverse step size (with $\alpha_t$ the step size at round $t$),
$$\Gamma_{t+1} = \frac{\sqrt{v_{t+1}}}{\alpha_{t+1}} - \frac{\sqrt{v_t}}{\alpha_t}$$
Regret proofs of this kind add up these per-step changes and need the sum to collapse to a single bounded quantity, which happens only if every term has the same sign. So the argument assumes $\Gamma_t$ is never negative: that the effective step never grows. For SGD and AdaGrad that is true, because their divisors only increase. For Adam it is false. The whole point of the forgetting average is that $v_t$ can fall when gradients shrink, so the effective step can rise, so $\Gamma_t$ can be negative, and that telescoping collapse the proof needs never happens. The bound was never established.
Reddi and colleagues also built an explicit example where Adam converges to the wrong point. Take a single parameter $\theta$ on the interval $[-1, 1]$. Once every three steps the gradient is a large positive $C$; the other two steps it is $-1$. The sum over a cycle is $C - 2 > 0$, so the average gradient points positive and the true optimum is the lower corner, $\theta = -1$. But Adam's second moment is given a short memory here (the example uses a deliberately small $\beta_2$, not the usual $0.999$, to make the forgetting fast), so the rare big gradient inflates the denominator and is then forgotten within a step or two, while the two frequent $-1$ gradients each get a near-full step the other way. Adam drifts to $\theta = +1$, the worst point in the interval. Watch it happen, and watch the fix.
The same paper proposed the fix, AMSGrad. Keep a running maximum of the second moment and divide by that instead of by $v_t$ itself:
$$v^{\max}_t = \max\!\left(v^{\max}_{t-1},\; v_t\right)$$
The maximum is a high-water mark: once a parameter has seen a big gradient, its divisor never slips back below the level that gradient set, so the effective step can only decrease, $\Gamma_t$ is nonnegative again, and the regret bound goes through. In the counterexample the big $C$ sets a high mark that keeps damping the parameter, so AMSGrad heads to $\theta = -1$. (The story does not end cleanly: AMSGrad's own convergence proof was later found to have a gap of its own, and the convergence theory of these methods is still an active area. The honest summary is that the original guarantee was wrong, the first patch needed patching, and Adam works extraordinarily well in practice anyway. The algorithm shipped first and the theory has been catching up ever since.)
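The counterexample is small enough to simulate. The sketch below makes several assumptions that simplify the setup in Reddi, Kale, and Kumar: no momentum ($\beta_1 = 0$), no bias correction, a constant step size instead of a decaying one, $C = 10$, and $\beta_2 = 1/(1 + C^2)$ as the short memory. The function `run` and its arguments are names chosen for this example.

```python
import math

def run(amsgrad, C=10.0, a=0.01, steps=3000, eps=1e-8):
    """One parameter on [-1, 1]; gradient is C every third step and -1 otherwise."""
    b2 = 1.0 / (1.0 + C * C)          # deliberately short second-moment memory
    theta, v, v_max = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = C if t % 3 == 1 else -1.0
        v = b2 * v + (1 - b2) * g * g
        v_max = max(v_max, v)         # AMSGrad's high-water mark
        denom = math.sqrt(v_max if amsgrad else v) + eps
        theta = theta - a * g / denom
        theta = min(1.0, max(-1.0, theta))   # project back onto the interval
    return theta

adam_theta = run(amsgrad=False)
amsgrad_theta = run(amsgrad=True)
print(adam_theta, amsgrad_theta)
```

Under these assumptions the forgetting divisor ends up near the upper corner, the wrong end of the interval, while the running-maximum divisor ends up near the lower corner where the average gradient says the optimum is.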
The family, and the recap
A few relatives are worth a sentence, because they show what the pieces are made of. AdaMax replaces the RMS denominator with a decayed running maximum of the gradient's absolute value:
$$u_t = \max\!\left(\beta_2\, u_{t-1},\; |g_t|\right)$$
which is what the second-moment denominator becomes if you push its norm from a square root all the way to an infinity norm. It needs no bias correction (a maximum of nonnegative numbers starting from zero is never biased low) and no $\epsilon$, and it has the clean bound $|\Delta\theta_t| \le \alpha$. The right limits also tie the family together: set $\beta_1 = 0$, push $\beta_2 \to 1$, and anneal $\alpha$ like $t^{-1/2}$, and Adam becomes AdaGrad exactly. Each piece earns its place: $\beta_2 \to 1$ stops the average from ever forgetting, turning the EMA back into AdaGrad's running sum; $\beta_1 = 0$ drops the momentum; and the $t^{-1/2}$ schedule supplies the shrinking step AdaGrad gets for free from its growing divisor. (This limit needs the bias correction; without it, it blows up.)
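The AdaGrad limit can be checked numerically. The sketch assumes an arbitrary random gradient sequence for a single parameter, approximates "push the decay rate toward one" with a value of $1 - 10^{-6}$, and leaves out the smoothing constant on both sides. Variable names are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=50)          # arbitrary gradient sequence for one parameter
alpha = 0.1
beta2 = 1.0 - 1e-6               # stand-in for the limit beta2 -> 1

v, G = 0.0, 0.0
adam_steps, adagrad_steps = [], []
for t, g_t in enumerate(g, start=1):
    # Adam with beta1 = 0 (so m_hat is just g_t), bias-corrected v, step alpha / sqrt(t)
    v = beta2 * v + (1 - beta2) * g_t ** 2
    v_hat = v / (1 - beta2 ** t)
    adam_steps.append(alpha / np.sqrt(t) * g_t / np.sqrt(v_hat))
    # AdaGrad: divide by the root of the cumulative sum of squared gradients
    G += g_t ** 2
    adagrad_steps.append(alpha * g_t / np.sqrt(G))

print(np.max(np.abs(np.array(adam_steps) - np.array(adagrad_steps))))
```

The two step sequences agree to within a small tolerance. The bias correction is what turns the nearly-frozen average into a plain mean of the squared gradients, and the annealed step size converts that mean back into AdaGrad's sum.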
There is also a tidy way to read the whole thing as a cheap cousin of natural gradient descent. The second moment approximates the diagonal of the Fisher information matrix, the object natural gradient uses to measure distance in terms of how much the model's predictions change. A preconditioner is just a per-parameter rescaling of the step, which is what dividing by $\sqrt{\hat v_t}$ has been doing all along; Adam preconditions by the square root of the inverse diagonal, a deliberately gentler correction than the full inverse a true natural gradient would apply. Treat this as an analogy rather than an identity: the clean Fisher connection needs the loss to be a log-likelihood and the labels to come from the model's own predictions, so what Adam actually tracks is an empirical Fisher, which matches the real one only near a well-fit optimum.
Step back and Adam is four decisions, each answering a way the naive step fails. Average the gradient, so noise and valley bouncing cancel. Divide by the gradient's typical size per parameter, so steep and gentle directions get steps that fit them. Correct both averages for their cold start, so the early steps are not oversized. And read the result as a step of about $\alpha$ in a trusted direction, which is why one learning rate works across wildly different problems. The convergence proof that came stapled to it was wrong, and fixing it properly is still being worked out. The algorithm did not wait. It is six lines, it asks almost nothing of you, and it trained most of the models you have heard of.
Questions you might still have
If v is the variance of the gradient, why divide by it?
v is not the variance; it estimates the second raw moment E[g^2], so its square root is the RMS size of the gradient. Dividing by the RMS rescales every coordinate to a comparable step and caps that step near alpha. It equals the variance only when the gradient has zero mean.
So Adam without bias correction is just RMSProp?
Not quite. Drop the bias correction and you get RMSProp with momentum, and even that is not algebraically Adam: it carries momentum on the already-rescaled gradient, while Adam averages the raw gradient and the raw squared gradient separately, then divides.
Does the broken proof mean Adam is broken?
No. It means the original guarantee was unproven, not that Adam fails in practice. On real problems it works extremely well. The theory was patched (AMSGrad, then patches to that), and tighter convergence results for Adam itself came later.
Why is the default beta2 (0.999) so much closer to 1 than beta1 (0.9)?
The second moment is a noisier thing to estimate than the mean, so it needs a longer memory to be stable, especially with sparse gradients. That long memory is exactly why the second moment warms up slowly from zero, and exactly why bias correction matters most for it.
Footnotes & further reading
- The paper: Kingma & Ba, Adam: A Method for Stochastic Optimization (ICLR 2015). The name is from "adaptive moment estimation," and the authors note that the author order was decided by a coin flip.
- The convergence proof is corrected in Reddi, Kale & Kumar, On the Convergence of Adam and Beyond (ICLR 2018), which introduces AMSGrad and the counterexample of Figure 6. AMSGrad's own proof is patched in Tran & Phong (2019).
- AdaGrad: Duchi, Hazan & Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (JMLR 2011). RMSProp: Tieleman & Hinton, Lecture 6.5, Neural Networks for Machine Learning (Coursera, 2012).
- Momentum: Polyak's heavy ball (1964) and Sutskever, Martens, Dahl & Hinton, On the importance of initialization and momentum in deep learning (ICML 2013).
- The online-learning framing is Zinkevich, Online Convex Programming and Generalized Infinitesimal Gradient Ascent (ICML 2003); the stochastic-approximation roots are Robbins & Monro (1951).
- On the Fisher connection and its limits: Kunstner, Balles & Hennig, Limitations of the Empirical Fisher Approximation (NeurIPS 2019). For the optimization background generally, Goodfellow, Bengio & Courville, Deep Learning, chapter 8.