
Adam: A Method for Stochastic Optimization

Two running averages give every weight its own learning rate.

Adam is the optimizer that trains almost everything. It is six lines of arithmetic, and every line is there to fix a specific way that plain gradient descent fails. Build those fixes one at a time and you have the method, including the part the authors got wrong.

Explaining the paper: Adam: A Method for Stochastic Optimization. Kingma, Ba · ICLR 2015 · arXiv:1412.6980

A single learning rate cannot suit a hundred billion parameters at once, because each one needs a rate of its own.

Almost every neural network you have heard of was trained by Adam or a close relative. It is the first optimizer most people reach for in PyTorch or JAX, the default before you have a reason to use anything else. The paper that introduced it is nine pages long and the core algorithm is a single short box. That combination, total ubiquity and apparent simplicity, is a good reason to actually understand it, because under the simplicity sit a few careful decisions, and one careful decision that was wrong.

The plan here is to build the algorithm rather than memorize it. We start from plain stochastic gradient descent (SGD) and watch it fail in two ordinary, fixable ways. Each fix is one of Adam's running averages. A third piece patches a cold-start problem the averages create. The assembled update then has a clear meaning, which explains why Adam barely needs tuning. Last, the convergence proof in the paper does not hold: there is a three-line example where Adam converges to the wrong answer, and the fix has a name.

A noisy slope, and only the slope

Training a model means minimizing a loss: a single number that measures how wrong the model is, as a function of its parameters $\theta$ (all the weights, stacked into one long vector). The loss defines a landscape over parameter space, and we want its lowest valley. The only local information we have is the gradient $g$ , the vector of slopes that points in the direction the loss increases fastest. So we step the other way:

\theta_t = \theta_{t-1} - \alpha\, g_t

(1)

That is gradient descent. The minus sign makes it descend: the gradient points uphill and we want down. The scalar $\alpha$ is the learning rate, the size of the step. (Walking downhill in fog is the usual picture, and it is a good one: you feel the slope under your feet and step downhill, blind to anything past your boots.)

Two things make this harder than the picture suggests. First, we never compute the true gradient. The true loss is an average over the full dataset, which is far too expensive to touch every step, so we estimate the gradient on a small random minibatch of examples. That estimate is unbiased, meaning on average it points the right way, but any single batch is noisy, and the noise only shrinks like one over the square root of the batch size. Doubling the batch buys you less than you would hope, so the cheaper move is to average gradients across steps instead. This is stochastic gradient descent, and the stochastic part is the fog flickering. Second, we use only the slope, never the curvature. The curvature lives in a matrix (the Hessian) with one row and column per parameter; for a model with billions of parameters you could not store it, let alone invert it. So we navigate by first-order information alone and try to be clever about the one knob we have, the step.
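The square-root claim can be checked numerically. The sketch below is an illustration under stated assumptions, not anything from the paper: per-example gradients are modelled as draws from a normal distribution with mean 1 and standard deviation 1, and `minibatch_noise` is a hypothetical helper that measures the spread of the minibatch mean.

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_noise(batch_size, trials=4000):
    # Each row is one minibatch of per-example gradients drawn from N(1, 1):
    # the true gradient is 1 and the per-example noise has standard deviation 1.
    per_example = rng.normal(loc=1.0, scale=1.0, size=(trials, batch_size))
    estimates = per_example.mean(axis=1)    # one gradient estimate per minibatch
    return estimates.std()                  # spread of the estimate around the truth

noise_16 = minibatch_noise(16)
noise_64 = minibatch_noise(64)
print(noise_16, noise_64, noise_16 / noise_64)   # the ratio should be close to 2
```

Quadrupling the batch only halves the noise, which is the poor exchange rate the paragraph describes.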

A single fixed step size runs into two failures that show up constantly. When the landscape is a long narrow valley, steep across and gentle along, a step large enough to make progress down the gentle axis is far too large for the steep one, and you bounce between the walls instead of descending. When the gradient is noisy, a fixed step jitters around the true direction and wastes most of its motion. Adam is two fixes, one for each failure: average the gradient over time to calm the noise and the bouncing, and give every parameter its own step size so the steep and gentle axes can be served at once. We take them in that order.
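The narrow-valley failure is easy to reproduce on a toy problem. The sketch below assumes a two-parameter quadratic loss with curvature 1 along one axis and 50 along the other; the loss, the starting point, and the two step sizes are illustrative choices, not values from the paper.

```python
import numpy as np

def gradient_descent(curvatures, alpha, steps=200):
    # Loss is 0.5 * sum(curvatures * theta**2), so the gradient is curvatures * theta.
    theta = np.ones_like(curvatures)
    for _ in range(steps):
        theta = theta - alpha * curvatures * theta
    return theta

curvatures = np.array([1.0, 50.0])   # gentle axis, steep axis (condition number 50)

safe = gradient_descent(curvatures, alpha=0.01)   # stable on both axes, slow on the gentle one
bold = gradient_descent(curvatures, alpha=0.05)   # faster on the gentle axis, unstable on the steep one
print(safe)
print(bold)
```

With the small step the steep coordinate is solved almost at once while the gentle one has barely moved; with the larger step the gentle coordinate finally makes progress but the steep one grows without bound. No single value of `alpha` serves both.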

First fix: average the direction

If consecutive gradient estimates disagree because of noise, the obvious move is to average them. Adam keeps a running average of the gradient, the first moment, and steps along the average rather than the raw sample:

m_t = \beta_1\, m_{t-1} + (1-\beta_1)\, g_t

(2)

This is an exponential moving average (EMA). Each step it keeps a fraction $\beta_1$ of the old average and mixes in a fraction $1-\beta_1$ of the new gradient, so the weight on a gradient from $k$ steps ago decays like $\beta_1^k$ . With the default $\beta_1 = 0.9$ the average has a memory of roughly $1/(1-\beta_1) = 10$ steps. Opposite-sign wobbles from the noise cancel in the average; a persistent direction is preserved. In the narrow-valley picture, the side-to-side bouncing averages away while the slow drift down the valley accumulates. This is the same idea as classical momentum, a heavy ball that keeps rolling in the direction it has been going, so it is often called the momentum term.
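A minimal sketch of equation (2) in action, assuming a made-up gradient stream: a steady drift of 0.2 plus a wobble of size 1 that flips sign every step. The helper name `ema` is hypothetical.

```python
def ema(values, beta):
    # Equation (2): keep a fraction beta of the old average, mix in 1 - beta of the new value.
    m, history = 0.0, []
    for g in values:
        m = beta * m + (1 - beta) * g
        history.append(m)
    return history

# A steady drift of +0.2 with a wobble of +/-1 that flips sign every step.
gradients = [0.2 + (1.0 if t % 2 == 0 else -1.0) for t in range(400)]
smoothed = ema(gradients, beta=0.9)

print(max(abs(g) for g in gradients[-20:]))   # raw samples swing by more than 1
print(smoothed[-1])                           # the average sits close to the 0.2 drift
```

The wobble is several times larger than the drift in every raw sample, yet the average stays within a few hundredths of the drift.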

This is not quite textbook momentum. Heavy-ball momentum keeps a velocity $v \leftarrow \mu v - \alpha g$ that accumulates: feed it a steady gradient of 1 with $\mu = 0.9$ and it balances only once it has grown to ten times the gradient ( $0.9 \times 10 + 1 = 10$ ). Adam's $m_t$ instead uses the matched weights $\beta_1$ and $1-\beta_1$ , which sum to one, so the same steady gradient drives $m_t$ to the gradient itself ( $0.9 \times 1 + 0.1 \times 1 = 1$ ), not ten times it. It keeps the same averaging idea on the raw-gradient scale, so the downstream step size stays interpretable. (And plain Adam evaluates the gradient at the current point, not at a look-ahead point. The look-ahead variant is Nesterov momentum, and the Adam version of that is a separate method called Nadam.)
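The difference in scale can be seen by feeding both recursions the same steady gradient. This sketch keeps only the accumulation part of heavy-ball momentum (the sign and the learning rate are left out for clarity, which is a simplification of this example).

```python
mu = beta1 = 0.9
g = 1.0                       # a steady gradient

velocity, m = 0.0, 0.0
for _ in range(500):
    velocity = mu * velocity + g            # heavy-ball accumulation (sign and step size left out)
    m = beta1 * m + (1 - beta1) * g         # Adam's first moment, equation (2)

print(velocity, m)   # velocity approaches 10 * g, m approaches g
```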

Drag $\beta_1$ below. The amber dots are the raw, noisy gradient, which here flips sign partway through to stand in for a real change of direction during training. The teal curve is the average. Near zero it follows every sample; near one it is smooth but slow to notice the flip. That lag is the cost of a long memory, and it is why $\beta_1$ is not set higher than it is.

Figure 1 · the first moment (interactive; slider β₁, shown at 0.90)

The raw gradient is noisy and changes sign midway. The moving average $m$ smooths it, cancelling the wobble while keeping the drift. Higher $\beta_1$ means a longer memory (about $1/(1-\beta_1)$ steps), smoother but slower to turn.

Averaging the direction is the first of Adam's two fixes. It quiets the noise and damps the valley bouncing, but it does nothing about the deeper problem, which is that one global step size cannot be right for every parameter at once.

Second fix: a learning rate per parameter

Different parameters live on different slopes. A weight that always sees large gradients needs small steps or it will overshoot; a weight that sees small or rare gradients needs large steps or it will never move. A single $\alpha$ forces one compromise on all of them. Adam gives each parameter its own effective step, set from how large that parameter's gradients have actually been.

The first method to do this well was AdaGrad. For each parameter it accumulates the sum of its squared gradients and divides the step by the square root of that sum:

\theta_t = \theta_{t-1} - \alpha\,\frac{g_t}{\delta + \sqrt{\sum_{i=1}^{t} g_i^2}}

(3)

A parameter with a long history of big gradients gets a big divisor and so a small step; a rarely-active one keeps a large step. That is exactly the behavior you want for sparse features, and it is why AdaGrad caught on. (The smoothing constant $\delta$ sits outside the square root and only guards against dividing by zero; its default is zero. Many code libraries instead tuck a small constant inside the root, which is a convenient variant, not the original.)

But AdaGrad has a built-in problem for long training runs. The divisor is a cumulative sum, so it only ever grows, which means the effective step only ever shrinks, heading to zero whether or not the gradients are still carrying useful signal. On a convex problem with a decaying schedule that is a feature. On a long non-convex deep-learning run it is a way to stall.

RMSProp, an unpublished method from a Coursera lecture, fixed this with one change: replace the ever-growing sum with an exponential moving average of the squared gradient. This is Adam's second moment:

v_t = \beta_2\, v_{t-1} + (1-\beta_2)\, g_t^2

(4)

Same shape as the first moment in (2), but on the squared gradient. Because it is an average rather than a sum, it forgets: it tracks the recent size of a parameter's gradients instead of their entire history, so the effective step adapts up and down instead of decaying to nothing. The decay rate $\beta_2$ is the length of that memory.

The figure makes the difference concrete. The gradient magnitude drops to a third of its value partway through, the kind of thing that happens when training settles. AdaGrad's step (amber) was already small from the early large gradients and stays pinned near zero forever; it cannot recover. Adam's step (teal) climbs back once the gradients ease. As $\beta_2$ approaches one, Adam's memory grows long, its recovery slows, and it starts to behave like AdaGrad again. That is not a coincidence; the two are the same method in different limits.
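The same experiment can be sketched in a few lines. The numbers here are assumptions chosen for illustration: a gradient magnitude of 3 for 200 steps, then 1 for 200 more, and a short memory of $\beta_2 = 0.9$ so the recovery is quick.

```python
import numpy as np

alpha, beta2 = 1e-3, 0.9
# Gradient magnitude 3.0 for 200 steps, then 1.0 for 200 more (a threefold drop).
grads = np.concatenate([np.full(200, 3.0), np.full(200, 1.0)])

running_sum, v = 0.0, 0.0
adagrad_step, ema_step = [], []
for t, g in enumerate(grads, start=1):
    running_sum += g * g                          # AdaGrad: cumulative sum, equation (3)
    v = beta2 * v + (1 - beta2) * g * g           # forgetting average, equation (4)
    v_hat = v / (1 - beta2 ** t)                  # bias-corrected, as in Adam
    adagrad_step.append(alpha / np.sqrt(running_sum))
    ema_step.append(alpha / np.sqrt(v_hat))

print(adagrad_step[199], adagrad_step[-1])   # keeps shrinking after the drop
print(ema_step[199], ema_step[-1])           # climbs back after the drop
```

AdaGrad's effective step is smaller at the end than at the moment the gradients eased, while the forgetting average lets the step grow roughly threefold.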

Figure 2 · the denominator's memory (interactive; slider β₂, shown at 0.900)

Effective step size $\alpha/\sqrt{\cdot}$ over time, after the gradients shrink threefold. AdaGrad divides by a cumulative sum, so its step is pinned near zero and never recovers. Adam divides by a forgetting average, so its step climbs back. Larger $\beta_2$ means a longer memory and a slower recovery.

Adam: momentum over RMS

Now assemble the two fixes. Use the first moment $m_t$ as the direction (the smoothed gradient) and the square root of the second moment $v_t$ as the per-parameter divisor (the recent gradient size). One number on top for where to go, one number underneath for how far. With a small constant $\epsilon$ to keep the division safe, the update is:

\theta_t = \theta_{t-1} - \alpha\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}

(5)

The hats on $\hat m_t$ and $\hat v_t$ are a correction that the bias-correction equations make precise; for now read them as $m_t$ and $v_t$ . That completes the update. The name Adam comes from adaptive moment estimation: the two moving averages are estimates of the first and second moments of the gradient, and they adapt the step. Here it is as code, the optimizer in six lines:

# Adam: one step on the parameter vector theta (all ops elementwise)
g     = grad(loss, theta)            # gradient at the current theta
m     = b1 * m + (1 - b1) * g        # 1st moment: a running mean of g
v     = b2 * v + (1 - b2) * g * g    # 2nd moment: a running mean of g^2
m_hat = m / (1 - b1**t)              # undo the cold-start bias
v_hat = v / (1 - b2**t)
theta = theta - a * m_hat / (sqrt(v_hat) + eps)        # the update
# start: m = v = 0, t counts from 1
# defaults: a = 1e-3, b1 = 0.9, b2 = 0.999, eps = 1e-8
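For readers who want to run it, here is one way the six lines might look in NumPy. It is a sketch rather than a reference implementation: the function name `adam_step` and the toy quadratic loss are assumptions made for this example, and the gradient is written out by hand instead of calling an autodiff `grad`.

```python
import numpy as np

def adam_step(theta, g, m, v, t, a=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; every operation is elementwise."""
    m = b1 * m + (1 - b1) * g            # first moment: running mean of g
    v = b2 * v + (1 - b2) * g * g        # second moment: running mean of g^2
    m_hat = m / (1 - b1 ** t)            # undo the cold-start bias
    v_hat = v / (1 - b2 ** t)
    theta = theta - a * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Hypothetical toy loss for illustration: 0.5 * sum(scale * theta**2).
scale = np.array([1.0, 100.0])
theta = np.array([1.0, 1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

for t in range(1, 6):                    # t counts from 1
    g = scale * theta                    # gradient of the toy loss
    theta, m, v = adam_step(theta, g, m, v, t)

print(theta)
```

Both coordinates move by about five times `a` over the five steps, even though one sees gradients a hundred times larger than the other.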

Run one case through by hand: a single weight, the defaults, and a first gradient $g_1 = 0.1$ at step $t=1$ . The averages start at zero, so $m_1 = 0.1\,g_1 = 0.01$ and $v_1 = 0.001\,g_1^2 = 10^{-5}$ . After the bias correction these become $\hat m_1 = 0.1$ and $\hat v_1 = 10^{-2}$ , so $\sqrt{\hat v_1} = 0.1$ , and the step is

\Delta\theta_1 = \alpha\,\frac{0.1}{0.1 + 10^{-8}} \approx \alpha = 0.001.

The weight moves by almost exactly $\alpha$ itself, against the gradient, not $\alpha$ times the gradient (which could be anything). Dividing by the gradient's own size cancels the raw scale of the gradient, and what is left is a step you actually chose. The trust-region argument makes this precise.
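The same hand calculation can be repeated for gradients of very different sizes. The helper `first_step` below is hypothetical and covers only the first update, where both averages start at zero.

```python
import math

def first_step(g1, a=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m1 = (1 - b1) * g1                  # averages start at zero
    v1 = (1 - b2) * g1 * g1
    m_hat = m1 / (1 - b1)               # bias correction at t = 1
    v_hat = v1 / (1 - b2)
    return a * m_hat / (math.sqrt(v_hat) + eps)

for g1 in (0.1, 1e-4, 1e3):
    print(g1, first_step(g1))           # each step is close to a = 1e-3
```

Seven orders of magnitude in the gradient collapse to essentially the same step, with only $\epsilon$ making a small difference for the tiniest gradient.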

The assembled optimizer behaves like this on a hard valley. Below, three optimizers start at the same point in the same valley and share one global step size. The valley is ill-conditioned: steep across, gentle along. Its condition number (the ratio of the steepest curvature to the gentlest) measures how lopsided it is, with 1 a round bowl and large values a long thin trough. Plain SGD bounces across the steep axis and crawls along the floor, and once the valley is steep enough it overshoots and diverges. Momentum helps, but it overshoots and oscillates. Where both struggle, Adam gives each axis its own step and goes nearly straight to the bottom, and changing the conditioning barely touches it.

Figure 3 · three optimizers, one valley (interactive; slider κ, the valley's condition number, shown at 24:1)

The same elongated bowl, one shared step size. SGD zig-zags and, past a condition number near 29, diverges. Momentum overshoots and rings. Adam rescales each axis by its own gradient size and heads almost straight in, at the same pace however lopsided the valley gets. Drag $\kappa$ to make the valley nastier.

Adam is a good default in practice because of this robustness to conditioning. SGD can match or beat it once you have tuned the learning rate and a schedule to the specific problem. Adam tends to be close to right out of the box, because the per-parameter scaling absorbs differences in slope that SGD makes you handle by hand.

The cold start

Both moving averages start at zero, because we have no gradients yet when we begin. That seeding has a cost. For many of the early steps the average is a blend of real gradients and the zero it was seeded with, so it reads too low, biased toward zero. The bias is worst for the second moment, where $\beta_2 = 0.999$ means the average mixes in only a thousandth of each new sample, so it crawls up from zero over hundreds of steps.

The exact size of the bias is easy to compute. Unroll the recursion (4) and the average is a weighted sum of every past squared gradient:

v_t = (1-\beta_2)\sum_{i=1}^{t}\beta_2^{t-i}\, g_i^2

Take expectations, and if the underlying squared gradient is roughly steady the weights pull out into one factor:

\mathbb{E}[v_t] = \mathbb{E}[g_t^2]\,\big(1 - \beta_2^{t}\big) + \zeta

(6)

The leftover $\zeta$ is zero if the squared gradient is stationary and small otherwise, because $\beta_2$ down-weights the distant past where things differ most. (The paper's text calls this decay rate $\beta_1$ at one point, but the derivation is about the second moment, so the symbol should read $\beta_2$ . A harmless typo, but useful to know if you read the original.) What matters is the factor $1-\beta_2^t$ : the weights $(1-\beta_2)\beta_2^{t-i}$ form a geometric series that sums to exactly this, so the average is exactly that fraction of the true value, the total weight it has put on real, nonzero gradients so far. The missing weight, the remaining $\beta_2^t$ , is still sitting on the zero the average was seeded with, and zero contributes nothing, so the estimate reads low by exactly that factor and by nothing else, which is why the division is an exact fix rather than a heuristic one. Divide it back out and the estimate is unbiased from the first step:

\hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^t}

(7)

Those are the hats from (5). A thermometer started at zero in a warm room reads low the same way: the reading climbs toward the true temperature, and dividing by the fraction of the way it has warmed up recovers the real value early instead of waiting. At $t=1$ the second moment has warmed up by only $1-\beta_2$ , so the correction is $1/(1-\beta_2) = 1000$ , the factor that turned $v_1 = 10^{-5}$ into $\hat v_1 = 10^{-2}$ in the worked example. As $t$ grows $\beta_2^t$ goes to zero and the correction fades to one.
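A quick numerical check of (6) and (7), assuming the simplest stationary case: a squared gradient that is the same at every step, so the leftover $\zeta$ is zero.

```python
beta2 = 0.999
g_squared = 4.0                 # a steady squared gradient (g = 2)

v = 0.0
for t in range(1, 51):
    v = beta2 * v + (1 - beta2) * g_squared      # equation (4)
    warmed_up = 1 - beta2 ** t                   # fraction of weight on real gradients
    v_hat = v / warmed_up                        # equation (7)
    if t in (1, 10, 50):
        print(t, v, warmed_up, v_hat)
```

The raw average is still a small fraction of the true value after fifty steps, while the corrected estimate equals the true value at every step shown.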

Skip the correction and the early steps misbehave in a specific way. A second moment biased low makes its square root too small, and dividing by a too-small number makes the step too large, exactly when the estimates are least trustworthy. The figure shows the raw average (grey) crawling up toward the true value (amber) while the corrected one (teal) sits on the truth from the start. As $\beta_2$ approaches one, the gap widens sharply: at $\beta_2 = 0.999$ the raw average is still only a few percent of the truth after dozens of steps. That is why the correction matters most when the memory is long, which is also when you most need it (sparse gradients call for a large $\beta_2$ ).
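The oversized early steps can be seen directly. This sketch assumes a steady gradient of 1, for which the ideal ratio of first moment to root second moment is exactly 1, and compares the ratio with and without the correction.

```python
import math

b1, b2 = 0.9, 0.999
g = 1.0                          # a steady gradient, so the ideal ratio is exactly 1

m = v = 0.0
for t in range(1, 11):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    raw = m / math.sqrt(v)                                        # no bias correction
    corrected = (m / (1 - b1 ** t)) / math.sqrt(v / (1 - b2 ** t))
    print(t, round(raw, 2), round(corrected, 2))
```

Without the correction the ratio starts above 3 and keeps growing over these ten steps, so the early updates are several times larger than intended.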

Figure 4 · bias correction (interactive; slider β₂, shown at 0.950)

The raw average of the squared gradient (grey) starts at zero and reads far too low. Dividing by $1-\beta_2^t$ lifts the corrected estimate onto the true value from step one. The larger $\beta_2$ is, the longer the raw average takes to warm up, so the more the correction is doing.

What the ratio means

With $\epsilon$ set aside, the heart of the update is the ratio $\hat m_t/\sqrt{\hat v_t}$ . It is the average gradient divided by the root-mean-square size of the gradient. The denominator deserves a precise statement: the second moment estimates $\mathbb{E}[g^2]$ , the raw second moment, not the variance. They coincide only when the gradient has zero mean. So $\sqrt{\hat v_t}$ is an RMS magnitude, a measure of how big the gradient typically is, signal and noise together.

Dividing the mean by the RMS does two useful things at once. It is scale-invariant: multiply every gradient by a constant $c$ and the top scales by $c$ , the bottom by $c$ , and the ratio is unchanged. The result is the same whether your loss is measured in dollars or cents. And the ratio is bounded. Because the average of $g$ can never exceed its RMS in size ( $|\mathbb{E}[g]| \le \sqrt{\mathbb{E}[g^2]}$ , which restates that variance is nonnegative), the ratio sits between about $-1$ and $+1$ , so

\Delta_t = \alpha\,\frac{\hat m_t}{\sqrt{\hat v_t}}, \qquad |\Delta_t| \lesssim \alpha.

The inequality has a plain reading: averaging can only shrink magnitude. The mean of a list of numbers can never be bigger than their typical size, because cancellation pulls the mean down while the RMS counts every entry at full strength. Each step therefore moves a parameter by at most about $\alpha$ , regardless of how steep or gentle its slope is. The authors call this a trust region: $\alpha$ sets how far you trust the current gradient before re-checking, and you can usually guess the right order of magnitude in advance, which is most of why Adam needs so little tuning. (The strict worst case is larger. The bound is really two cases, and at the default decay rates the loose $|\Delta_t| \lesssim \alpha$ can reach about $3.16\,\alpha$ , the prefactor being $(1-\beta_1)/\sqrt{1-\beta_2}$ once the bias-correction factors have faded to one. Reaching that ceiling takes a parameter whose gradient has been zero at every step except the current one, a long silence broken by a single nonzero gradient, which dense training never looks like. In ordinary, non-sparse training the per-step travel stays at or below about $\alpha$ .) That cap allows a single default learning rate to travel across problems whose raw gradient scales differ by orders of magnitude, because the ratio has already cancelled the scale before $\alpha$ is applied.
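Both properties of the ratio, and its worst case, can be probed with a small helper. This is a sketch with $\epsilon$ left out; `adam_ratio` is a hypothetical name, and the dense and sparse gradient streams are made up for the example. The sparse stream is the severe case in which a coordinate sees only zeros and then one nonzero gradient.

```python
import math

def adam_ratio(grads, b1=0.9, b2=0.999):
    # Returns m_hat / sqrt(v_hat) after the last gradient (epsilon left out).
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
    return (m / (1 - b1 ** t)) / math.sqrt(v / (1 - b2 ** t))

dense = [1.0 + 0.5 * math.sin(t) for t in range(1, 201)]
print(adam_ratio(dense))                          # close to 1, inside the trust region
print(adam_ratio([1000.0 * g for g in dense]))    # same ratio: the scale cancels

sparse = [0.0] * 9999 + [1.0]                     # long silence, then one gradient
print(adam_ratio(sparse))                         # near (1 - b1) / sqrt(1 - b2)
```

The dense stream gives the same ratio at either scale, and the sparse stream lands just under the $3.16$ ceiling.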

The ratio also explains a behavior the paper leans on. A parameter whose gradient is consistent gets a near-full step: the average and the RMS are both about the gradient's size, so the ratio is near $\pm 1$ . A parameter whose gradient is mostly noise gets a near-zero step: the average cancels toward zero while the RMS stays large, so the ratio collapses. The same thing happens to every parameter as training approaches a minimum, where the true gradient shrinks and what is left is mostly noise. The steps then shrink on their own, a kind of automatic annealing that nobody has to schedule. Slide the noise level below and watch a near-full step shrink to a near-zero one.
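A deterministic stand-in for the noise slider, under assumptions of this example only: the "consistent" coordinate sees a constant gradient, and the "noisy" one sees a tiny mean of 0.05 buried under a wobble of size 1 that flips sign every step. The helper `adam_ratio` is hypothetical and omits $\epsilon$.

```python
import math

def adam_ratio(grads, b1=0.9, b2=0.999):
    # m_hat / sqrt(v_hat) after the last gradient, i.e. the step in units of alpha.
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
    return (m / (1 - b1 ** t)) / math.sqrt(v / (1 - b2 ** t))

steady = [1.0] * 500                                               # consistent direction
noisy = [0.05 + (1.0 if t % 2 == 0 else -1.0) for t in range(500)] # small mean, large wobble

ratio_steady = adam_ratio(steady)
ratio_noisy = adam_ratio(noisy)
print(ratio_steady, ratio_noisy)   # near-full step versus near-zero step
```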

Figure 5 · signal, noise, and the step (interactive; slider noise σ, shown at 0.60)

One coordinate's gradient over time, mean fixed, noise on the slider. The average stays within the RMS band. Their ratio, shown on the gauge at the bottom of the figure as the step in units of $\alpha$ , sits near $+1$ when the direction is consistent and collapses toward zero as noise takes over, never leaving the trust region.

When the proof breaks

The paper does not only propose Adam; it offers a proof that Adam converges, in the language of online convex optimization. The setup is a game: at each round you pick parameters, then an adversary reveals a convex loss, and your regret is how much worse you did than the single best fixed parameter you could have committed to in hindsight. The adversarial framing is deliberately pessimistic: a method that holds up against a worst-case sequence of convex losses will also handle the friendlier case where the loss is the same at every round.

R(T) = \sum_{t=1}^{T}\big[f_t(\theta_t) - f_t(\theta^*)\big]

(8)

If regret grows slower than $T$ , then the average regret goes to zero, which is the online way of saying you converge to the best answer. The paper claims a bound of order $\sqrt{T}$ , the standard target.

Three years later, Reddi, Kale, and Kumar showed the proof is wrong. The argument relies on a quantity built from the change in the inverse step size,

\Gamma_{t} = \frac{\sqrt{V_{t}}}{\alpha_{t}} - \frac{\sqrt{V_{t-1}}}{\alpha_{t-1}}.

Here $V_t$ is the second-moment estimate and $\alpha_t$ the step size at step $t$ , so $\sqrt{V_t}/\alpha_t$ is the inverse of the effective step. Regret proofs of this kind add these per-step changes into a telescoping sum, where each term cancels part of the next and same-sign terms let the chain collapse to its two endpoints, a single bounded quantity. So the argument assumes $\Gamma_t$ is never negative: that the effective step never grows, that it is monotone, only ever shrinking or holding steady. For SGD and AdaGrad that is true, because their divisors only increase. For Adam it is false by design. The forgetting average exists so that $v_t$ can fall when gradients shrink, so the effective step can rise, so $\Gamma_t$ can be negative. One negative term breaks a link, the partial sums can drift far from the endpoints, the telescoping collapse never happens, and the bound was never established.
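The sign of $\Gamma_t$ is easy to inspect numerically. The sketch below makes several simplifying assumptions: a single parameter, a constant step size $\alpha$ , no bias correction, a short memory $\beta_2 = 0.9$ , and a made-up gradient stream that shrinks tenfold halfway through.

```python
import math

alpha, beta2 = 1e-3, 0.9
grads = [1.0] * 50 + [0.1] * 50          # gradients shrink partway through

v, running_sum = 0.0, 0.0
prev_adam, prev_adagrad = 0.0, 0.0
gamma_adam, gamma_adagrad = [], []
for g in grads:
    v = beta2 * v + (1 - beta2) * g * g          # Adam's forgetting average
    running_sum += g * g                         # AdaGrad's cumulative sum
    inv_adam = math.sqrt(v) / alpha              # inverse effective step size
    inv_adagrad = math.sqrt(running_sum) / alpha
    gamma_adam.append(inv_adam - prev_adam)
    gamma_adagrad.append(inv_adagrad - prev_adagrad)
    prev_adam, prev_adagrad = inv_adam, inv_adagrad

print(min(gamma_adam), min(gamma_adagrad))       # negative for Adam, positive for AdaGrad
```

As soon as the gradients shrink, the forgetting average falls and the change in the inverse step turns negative, which is exactly the case the telescoping argument cannot absorb.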

Reddi, Kale, and Kumar also built an explicit example where Adam converges to the wrong point. The example uses a single parameter on the interval $[-1, 1]$ . Once every three steps the gradient is a large positive $C$ (any $C > 2$ ); the other two steps it is $-1$ . The sum over a cycle is $C - 2 > 0$ , so the average gradient points positive and the true optimum is the lower corner, $x^* = -1$ . But Adam's second moment is given a short memory here (the example uses a deliberately small $\beta_2$ , not the usual $0.999$ , to make the forgetting fast), so the rare big gradient inflates the denominator, which then decays back within a step or two, while the two frequent $-1$ gradients each get a near-full step the other way. Adam drifts to $x = +1$ , the worst point in the interval. The figure traces this wrong-way drift, and then the fix.

Figure 6 · the counterexample (interactive; slider spike C, shown at C = 5)

A convex online problem with a large gradient $C$ once every three steps and $-1$ otherwise, so the optimum is $x^*=-1$ . Adam's average decays away the rare big gradient and converges to the wrong corner, $+1$ . AMSGrad retains it via the running maximum and reaches $-1$ . Drag $C$ to sharpen the trap.

The same paper proposed the fix, AMSGrad. Keep a running maximum of the second moment and divide by that instead of by $v_t$ itself (in the equation below the hat marks this running maximum, not the bias correction of (7)):

\hat v_t = \max\big(\hat v_{t-1},\, v_t\big), \qquad \theta_t = \theta_{t-1} - \alpha\,\frac{m_t}{\sqrt{\hat v_t}}.

The maximum is a high-water mark: once a parameter has seen a big gradient, its divisor never slips back below the level that gradient set, so the effective step can only decrease, $\Gamma_t$ is never negative again, and the regret bound goes through. In the counterexample the big $C$ sets a high mark that keeps damping the parameter, so AMSGrad converges to $-1$ . (The story does not end cleanly: AMSGrad's own convergence proof was later found to have a gap of its own, and the convergence theory of these methods is still an active area. The original guarantee was wrong, the first patch needed patching, and Adam works extraordinarily well in practice anyway. The algorithm came first, and the theory has been catching up ever since.)
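The trap and the fix can be reproduced in a short simulation. This is a sketch under assumptions that are choices of this example rather than details confirmed by the text above: a constant step size, no momentum ( $\beta_1 = 0$ ), no bias correction, a start at $x = 0$ , and a short memory set to $\beta_2 = 1/(1+C^2)$ . The function name `run` is hypothetical.

```python
import math

def run(use_amsgrad, C=5.0, alpha=0.01, steps=3000):
    beta2 = 1.0 / (1.0 + C * C)      # deliberately short memory
    x, v, v_max = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = C if t % 3 == 1 else -1.0            # gradient of C*x, or of -x
        v = beta2 * v + (1 - beta2) * g * g      # forgetting average of g^2
        v_max = max(v_max, v)                    # AMSGrad's high-water mark
        denom = math.sqrt(v_max if use_amsgrad else v)
        x = x - alpha * g / denom                # beta1 = 0, so m_t is just g_t
        x = min(1.0, max(-1.0, x))               # project back onto [-1, 1]
    return x

print(run(use_amsgrad=False))   # Adam drifts toward +1
print(run(use_amsgrad=True))    # AMSGrad drifts toward -1
```

The only difference between the two runs is which quantity goes under the square root, and that alone flips the corner the iterate ends up in.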

The family, and the recap

A few relatives show what the pieces are made of. AdaMax replaces the RMS denominator with a decayed running maximum of the gradient's absolute value:

u_t = \max\big(\beta_2\, u_{t-1},\; |g_t|\big)

which is the second-moment denominator with its norm pushed from an $L^2$ norm all the way to an infinity norm. The denominator $u_t$ needs no bias correction (a maximum of nonnegative numbers starting from zero is never biased low) and no $\epsilon$ , though the first moment is still corrected, and the paper gives it the clean bound $|\Delta_t| \le \alpha$ . The right limits also tie the family together: set $\beta_1 = 0$ , push $\beta_2 \to 1$ , and anneal $\alpha$ like $1/\sqrt{t}$ , and Adam becomes AdaGrad exactly. Each piece serves a distinct role: $\beta_2 \to 1$ stops the average from ever forgetting, so the bias-corrected EMA becomes the plain mean of all past squared gradients, which is AdaGrad's running sum divided by $t$ ; $\beta_1 = 0$ drops the momentum; and the $1/\sqrt{t}$ schedule cancels that division by $t$ under the square root, supplying the shrinking step AdaGrad obtains from its growing divisor. (This limit needs the bias correction; without it, the update blows up.)
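The AdaGrad limit can be checked numerically. In this sketch $\beta_2 = 1 - 10^{-6}$ stands in for the limit $\beta_2 \to 1$ , $\epsilon$ and $\delta$ are both set to zero, and the gradient stream is an arbitrary nonzero sequence made up for the example.

```python
import numpy as np

alpha = 0.1
beta2 = 1 - 1e-6                                  # stand-in for the limit beta2 -> 1
grads = 1.5 + np.sin(np.arange(1, 51))            # arbitrary nonzero gradients

v, running_sum = 0.0, 0.0
adam_steps, adagrad_steps = [], []
for t, g in enumerate(grads, start=1):
    v = beta2 * v + (1 - beta2) * g * g
    v_hat = v / (1 - beta2 ** t)                  # bias correction is essential here
    adam_steps.append(alpha / np.sqrt(t) * g / np.sqrt(v_hat))   # beta1 = 0, so m_hat = g
    running_sum += g * g
    adagrad_steps.append(alpha * g / np.sqrt(running_sum))       # equation (3) with delta = 0

print(np.max(np.abs(np.array(adam_steps) - np.array(adagrad_steps))))
```

The two sequences of steps agree to within the small error introduced by using a finite $\beta_2$ instead of the exact limit.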

There is also a tidy way to read Adam as a cheap cousin of natural gradient descent. The second moment $\hat v_t$ approximates the diagonal of the Fisher information matrix, the object natural gradient uses to measure distance in terms of how much the model's predictions change. A preconditioner is a per-parameter rescaling of the step, which dividing by $\sqrt{\hat v_t}$ has done all along; Adam preconditions by the square root of the inverse diagonal, a deliberately gentler correction than the full inverse a true natural gradient would apply. Treat this as an analogy rather than an identity: the exact Fisher connection needs the loss to be a log-likelihood and the labels to come from the model's own predictions, so what Adam actually tracks is an empirical Fisher, which matches the real one only near a well-fit optimum.

Each of Adam's pieces fixes a specific failure of plain gradient descent: the gradient average cancels noise, the per-parameter division handles uneven slopes, and the bias correction stabilizes the early steps. Assembled, the update reads as a step of about $\alpha$ in a trusted direction, which is why one learning rate carries across wildly different problems. The convergence proof that came stapled to it was wrong, and patching it properly is still being worked out. The algorithm was adopted regardless. Six lines, almost nothing to tune, and it trained most of the models you have heard of.

Provenance: verified against primary literature

Robbins & Monro (1951). Stochastic approximation: the descending-with-noise update Adam inherits.

Duchi et al. (2011). AdaGrad. The per-parameter accumulator; its smoothing constant sits outside the root, and it is a running sum, not an average.

Tieleman & Hinton (2012). RMSProp. The EMA of squared gradients. The original slide has no epsilon, no momentum, and no bias correction.

Reddi et al. (2018). The O(sqrt T) convergence proof is wrong; AMSGrad is the (later also-patched) fix.

Kunstner et al. (2019). The "Fisher information" link is the empirical Fisher, exact only under a log-likelihood loss.

Correction. Section 3 of the paper writes the second-moment decay rate as β₁ ("the exponential decay rate β₁ can ... be chosen"). The entire derivation is in β₂, so the symbol must be β₂. A benign typo, unchanged from v1 (2014) through v9. We teach it as β₂.

Questions you might still have

If v is the variance of the gradient, why divide by it?
v is not the variance; it estimates the second raw moment E[g^2], so its square root is the RMS size of the gradient. Dividing by the RMS rescales every coordinate to a comparable step and caps that step near alpha. It equals the variance only when the gradient has zero mean.

So Adam without bias correction reduces to RMSProp?
Not quite. Drop the bias correction and you get something close to RMSProp with momentum, but even that is not algebraically Adam: RMSProp with momentum carries momentum on the already-rescaled gradient, while Adam averages the raw gradient and the raw squared gradient separately, then divides.

Does the broken proof mean Adam is broken?
No. It means the original guarantee was unproven, not that Adam fails in practice. On real problems it works extremely well. The theory was patched (AMSGrad, then patches to that), and tighter convergence results for Adam itself came later.

Why is the default beta2 (0.999) so much closer to 1 than beta1 (0.9)?
The second moment is a noisier thing to estimate than the mean, so it needs a longer memory to be stable, especially with sparse gradients. That long memory is exactly why the second moment warms up slowly from zero, and exactly why bias correction matters most for it.

Footnotes & further reading

The paper: Kingma & Ba, Adam: A Method for Stochastic Optimization (ICLR 2015). The name is from "adaptive moment estimation," and the authors note that the author order was decided by a coin flip.
The convergence proof is corrected in Reddi, Kale & Kumar, On the Convergence of Adam and Beyond (ICLR 2018), which introduces AMSGrad and the counterexample of Figure 6. AMSGrad's own proof is patched in Tran & Phong (2019).
AdaGrad: Duchi, Hazan & Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (JMLR 2011). RMSProp: Tieleman & Hinton, Lecture 6.5, Neural Networks for Machine Learning (Coursera, 2012).
Momentum: Polyak's heavy ball (1964) and Sutskever, Martens, Dahl & Hinton, On the importance of initialization and momentum in deep learning (ICML 2013).
The online-learning framing is Zinkevich, Online Convex Programming and Generalized Infinitesimal Gradient Ascent (ICML 2003); the stochastic-approximation roots are Robbins & Monro (1951).
On the Fisher connection and its limits: Kunstner, Balles & Hennig, Limitations of the Empirical Fisher Approximation (NeurIPS 2019). For the optimization background generally, Goodfellow, Bengio & Courville, Deep Learning, chapter 8.
