Verified · arXiv:1711.05101 · 21 min
Optimization · Regularization

Decoupled Weight Decay Regularization

Weight decay and L2 regularization are the same for SGD, but not for Adam.

Most deep-learning libraries implement L2 regularization and call it weight decay. For adaptive optimizers like Adam the two are not the same, and separating them is a one-line change that lets Adam generalize as well as SGD.

Explaining the paper: Decoupled Weight Decay Regularization. Loshchilov & Hutter · University of Freiburg · ICLR 2019 · arXiv:1711.05101

Open any training script and you will find a knob labeled weight decay. For Adam, that knob usually does something subtly different from weight decay, and the difference is why Adam earned a reputation for generalizing worse than plain SGD.

Two techniques in machine learning are treated as the same thing so routinely that the libraries themselves conflate them. One is L2 regularization: add a penalty for large weights to the loss. The other is weight decay: pull every weight a little toward zero on every step. PyTorch, TensorFlow, and their predecessors all expose a weight_decay argument that, under the hood, does L2. For decades nobody worried about the difference, because for ordinary stochastic gradient descent there isn't one.

This paper, by Ilya Loshchilov and Frank Hutter, makes a narrow and surprisingly consequential point: the equivalence holds for SGD and fails for adaptive optimizers like Adam. Because the popular libraries only implemented the L2 version, Adam was getting a weaker, lopsided form of regularization, and that is a large part of why it generalized worse than SGD with momentum on image classification. Applying weight decay the original way, separate from the gradient step, repairs it. The authors call the result AdamW, and it has since become the default optimizer for training large models.

A handful of ideas explain all of it: what L2 regularization is, what the original weight decay is, why they coincide for SGD, where a decay term enters the Adam update, and why that one placement decision breaks the equivalence. Then two practical refinements and the empirical payoff. None of the pieces is hard.

L2 regularization, and the one-half

Start with the version everyone knows. To discourage a network from leaning on large weights, you add a penalty to the loss that grows with the size of the weights:

$$f^{\text{reg}}(\boldsymbol{\theta}) = f(\boldsymbol{\theta}) + \frac{\lambda'}{2}\,\lVert\boldsymbol{\theta}\rVert_2^2$$
(L2)

Here $f$ is the ordinary loss, $\boldsymbol{\theta}$ is the weight vector, $\lVert\boldsymbol{\theta}\rVert_2^2 = \sum_i \theta_i^2$ is the squared length of that vector, and $\lambda'$ sets how hard the penalty pushes. Minimizing this modified loss trades a little extra training error for smaller weights. Smaller weights usually mean a smoother, less brittle function, the kind that tends to do better on data it has not seen, which is why we regularize in the first place.

The $\tfrac12$ in front is not decoration. The gradient of the penalty is what actually enters training, and differentiating $\tfrac{\lambda'}{2}\lVert\boldsymbol{\theta}\rVert_2^2$ brings the square's exponent down to cancel the two:

$$\nabla f^{\text{reg}}(\boldsymbol{\theta}) = \nabla f(\boldsymbol{\theta}) + \lambda'\boldsymbol{\theta}$$

Without the one-half you would carry a stray factor of two through every formula below. So the penalty contributes a clean term $\lambda'\boldsymbol{\theta}$ to the gradient, pointing back toward the origin in proportion to each weight. Run one step of gradient descent with learning rate $\alpha$ and the update is $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \alpha\nabla f - \alpha\lambda'\boldsymbol{\theta}$: the ordinary step, plus a small tug toward zero. That tug, scaled by $\alpha\lambda'$, is the only thing the penalty ever does. Hold onto that scaling by $\alpha$; it is where the trouble starts.

Weight decay: shrink the weights directly

There is an older way to keep weights small that never touches the loss function. Before (or after) each gradient step, multiply every weight by a number slightly less than one. That is weight decay, introduced in this form by Hanson and Pratt in 1988:

$$\boldsymbol{\theta}_{t+1} = (1-\lambda)\,\boldsymbol{\theta}_t - \alpha\,\nabla f_t(\boldsymbol{\theta}_t)$$
(1)

Read it left to right: take the current weights, shave off a fixed fraction $\lambda$ of every one of them, then take the usual gradient step. The decay is a flat haircut. A weight at $3.0$ with $\lambda = 0.01$ loses $0.03$; a weight at $0.2$ loses $0.002$; both lose the same one percent. Left alone, with no gradient pushing back, a weight would shrink geometrically toward zero, compound interest run in reverse.
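To make the geometric shrink concrete, here is a minimal Python sketch that applies only the decay part of equation (1), with the gradient assumed to be zero. The function name `decay_only` and the step counts are illustrative choices, not something the paper defines.

```python
def decay_only(theta: float, lam: float, steps: int) -> float:
    """Apply theta <- (1 - lam) * theta for `steps` updates (zero gradient assumed)."""
    for _ in range(steps):
        theta = (1.0 - lam) * theta
    return theta


if __name__ == "__main__":
    lam = 0.01
    for start in (3.0, 0.2):
        one = decay_only(start, lam, 1)
        many = decay_only(start, lam, 100)
        # After n steps the weight equals start * (1 - lam) ** n.
        print(f"start={start}: after 1 step {one:.4f}, after 100 steps {many:.4f}")
```

Both starting values lose the same fraction per step, so their ratio stays fixed while each one shrinks toward zero.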

The contrast with L2 that matters later is small and easy to miss. In equation (1) the decay term $\lambda\boldsymbol{\theta}$ is applied directly to the weights. It is not part of the loss, not part of the gradient, and crucially not multiplied by the learning rate. It is its own separate operation, sitting beside the gradient step rather than inside it. Compare that with L2, where the penalty entered as a gradient term $\lambda'\boldsymbol{\theta}$ and then got scaled by $\alpha$ along with everything else. The two recipes look almost identical. The next section shows that for SGD they are, and the sections after it show why the resemblance breaks for Adam.

For plain SGD, the two coincide

Set the two updates side by side. L2 regularization, expanded, gives $\boldsymbol{\theta} - \alpha\nabla f - \alpha\lambda'\boldsymbol{\theta}$. Weight decay, from equation (1), gives $(1-\lambda)\boldsymbol{\theta} - \alpha\nabla f$, which is the same as $\boldsymbol{\theta} - \lambda\boldsymbol{\theta} - \alpha\nabla f$. Line them up:

$$\underbrace{\boldsymbol{\theta} - \alpha\nabla f - \alpha\lambda'\boldsymbol{\theta}}_{\text{L2 regularization}} \qquad\text{vs}\qquad \underbrace{\boldsymbol{\theta} - \alpha\nabla f - \lambda\boldsymbol{\theta}}_{\text{weight decay}}$$

They are identical except for the shrink term. L2 shrinks by $\alpha\lambda'$; weight decay shrinks by $\lambda$. Set those equal and you get the paper's first proposition: the two are exactly the same optimizer, provided you choose

$$\lambda' = \frac{\lambda}{\alpha}$$
(2)

That asymmetry drives everything that follows. Weight decay's shrink term is not multiplied by the learning rate; L2's is, because it enters through the gradient. To produce the same shrink, L2's coefficient has to be larger by a factor of $1/\alpha$, and for a typical $\alpha < 1$ that makes $\lambda'$ larger than $\lambda$, sometimes much larger. It also couples the two knobs: the effective amount of regularization applied to a weight is $\alpha\lambda'$, so if you find a good $\lambda'$ and then change the learning rate, the regularization strength changes with it, and the setting that was optimal is now wrong. You cannot tune one knob and leave the other alone. The figure below shows what this does to the search you run when you train a model: a grid over the learning rate $\alpha$ and the regularization factor, colored by test error, brightest where the model does best.

Figure 1 · the hyperparameter basin
A stylized test-error landscape over learning rate α (x) and the decay factor λ (y), brighter is better. With L2 (coupled) the low-error basin sits on a diagonal: change α and the best λ moves with it. Toggle to decoupled weight decay and the basin straightens into a flat horizontal band, one good λ across a wide range of α. The amber dot marks the joint optimum. (Stylized: read the diagonal-versus-flat orientation, not the exact colors. The paper's real grids come from a 26-layer ResNet, the "2×64d" meaning two residual branches 64 channels wide, on CIFAR-10.)

Watch the bright region rotate as you toggle. In the coupled panel the good settings lie along a diagonal, so optimizing the two hyperparameters means searching a tilted valley, and a one-at-a-time sweep keeps walking off the ridge. Decoupling, which the rest of this piece works out, swings that valley flat: pick a learning rate, then tune the decay independently, and one decay value keeps working as you change the rate. For SGD the coupling is a nuisance you can rescale away with equation (2). This paper exists because for Adam no single rescale works.
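The SGD equivalence from equation (2) is easy to check numerically. The sketch below runs both updates on a one-dimensional toy loss $f(\theta) = \tfrac12(\theta-1)^2$, which is an assumption made purely for illustration, as are the function names and the starting point.

```python
def sgd_l2_step(theta, grad, alpha, lam_prime):
    """SGD on the L2-regularized loss: the penalty gradient lam_prime * theta is scaled by alpha."""
    return theta - alpha * (grad + lam_prime * theta)


def sgd_weight_decay_step(theta, grad, alpha, lam):
    """Equation (1): shrink by (1 - lam), then take the plain gradient step."""
    return (1.0 - lam) * theta - alpha * grad


def run(step, coeff, alpha, theta=3.0, steps=50):
    """Iterate `step` on the toy loss f(theta) = 0.5 * (theta - 1)^2."""
    for _ in range(steps):
        grad = theta - 1.0
        theta = step(theta, grad, alpha, coeff)
    return theta


if __name__ == "__main__":
    lam = 0.01
    for alpha in (0.1, 0.01):
        decayed = run(sgd_weight_decay_step, lam, alpha)
        matched = run(sgd_l2_step, lam / alpha, alpha)   # lam' = lam / alpha, equation (2)
        unmatched = run(sgd_l2_step, lam, alpha)         # same number reused as lam'
        print(f"alpha={alpha}: decay {decayed:.6f}, matched L2 {matched:.6f}, unmatched L2 {unmatched:.6f}")
```

With the rescaled coefficient the two trajectories agree to rounding error at either learning rate. Reusing the same number for both coefficients gives a different answer, and how different depends on $\alpha$, which is the coupling the basin figure depicts.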

Adam, in one line

To see why, recall what Adam does differently from SGD. Plain SGD multiplies the gradient by one global learning rate. Adam gives every parameter its own effective rate, set by how big that parameter's gradients have recently been. It keeps two running averages: $\boldsymbol{m}_t$, a smoothed gradient (momentum), and $\boldsymbol{v}_t$, a smoothed average of the squared gradient. After a small bias correction (writing $\hat{\boldsymbol{m}}_t$ and $\hat{\boldsymbol{v}}_t$), the update is:

$$\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \alpha\,\frac{\hat{\boldsymbol{m}}_t}{\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon}$$
(3)

Everything adaptive about Adam lives in that denominator. $\sqrt{\hat{\boldsymbol{v}}_t}$ is a per-parameter root-mean-square of recent gradient sizes (a magnitude, not a standard deviation, since $\boldsymbol{v}_t$ averages the raw square). Dividing by it means a parameter whose gradients have been large takes smaller steps, and one whose gradients have been small takes larger steps, so every parameter moves at a comparable pace regardless of how steep its slice of the loss is. That per-parameter rescaling is what makes Adam robust to badly scaled problems and easy to use out of the box. You can write it compactly as a preconditioner, a diagonal matrix $\boldsymbol{M}_t = \text{diag}\big(1/(\sqrt{\hat{\boldsymbol{v}}_t}+\epsilon)\big)$ that reshapes the step, so the update is $\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \alpha\,\boldsymbol{M}_t\hat{\boldsymbol{m}}_t$.
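The "comparable pace" claim can be seen on the very first update. This is a minimal sketch of equation (3) at $t = 1$ for a single parameter, starting from zero moments; the helper name `adam_first_step` and the three gradient values are made up for the illustration.

```python
import math


def adam_first_step(grad, alpha=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Size of the very first Adam step (t = 1) for one parameter, starting from m = v = 0."""
    m = (1 - b1) * grad
    v = (1 - b2) * grad * grad
    m_hat = m / (1 - b1)          # bias correction at t = 1
    v_hat = v / (1 - b2)
    return alpha * m_hat / (math.sqrt(v_hat) + eps)


if __name__ == "__main__":
    for grad in (100.0, 1.0, 0.01):
        sgd = 1e-3 * grad
        adam = adam_first_step(grad)
        print(f"grad={grad:>6}: SGD step {sgd:.2e}, Adam step {adam:.2e}")
```

The SGD steps span four orders of magnitude, while every Adam step comes out at roughly $\alpha$, because the gradient is divided by its own magnitude.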

Now the question this paper turns on. We want to add regularization, a pull toward zero, to this update. That pull has to enter somewhere. Does it go in before the $\sqrt{\hat{\boldsymbol{v}}_t}$ division, or after it?

Where the decay enters

There are exactly two places to put it, and they are the two techniques from the start of this piece.

L2 in Adam (coupled). Fold the penalty into the gradient, exactly as L2 prescribes: $\boldsymbol{g}_t = \nabla f_t + \lambda\boldsymbol{\theta}_{t-1}$. Now $\lambda\boldsymbol{\theta}$ is part of $\boldsymbol{g}_t$, so it flows into both running averages and gets divided by $\sqrt{\hat{\boldsymbol{v}}_t}+\epsilon$ right along with the real gradient. This is what torch.optim.Adam(weight_decay=...) does.

AdamW (decoupled). Do what the original weight decay did: leave the gradient alone, and subtract $\lambda\boldsymbol{\theta}$ from the weights directly, after the normalized step, where the $\sqrt{\hat{\boldsymbol{v}}_t}$ division can no longer touch it:

$$\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta_t\!\left(\alpha\,\frac{\hat{\boldsymbol{m}}_t}{\sqrt{\hat{\boldsymbol{v}}_t}+\epsilon} \;+\; \lambda\,\boldsymbol{\theta}_{t-1}\right)$$
(4)

Equation (4) is the complete algorithm: the $\lambda\boldsymbol{\theta}_{t-1}$ term sits outside the fraction, added at the end. The $\eta_t$ in front is a schedule multiplier (a global factor you can ramp up or down over training); it scales both terms together, and you can ignore it for now by reading it as one. The figure traces a single decay packet through the pipeline so you can see where it joins.

Figure 2 · where λθ joins the update
The Adam pipeline is the same either way: gradient, moment EMAs, divide by $\sqrt{\hat{\boldsymbol{v}}}+\epsilon$, update θ. Toggle where the decay term λθ joins. In Adam + L2 it enters before the EMAs, so the amber packet rides through the normalization box and visibly shrinks. In AdamW it joins after, reaching the update untouched.

The two differ by where one term is written, nothing else. The moments, the bias correction, the defaults ($\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) are all unchanged. Move $\lambda\boldsymbol{\theta}$ from inside the $\sqrt{\hat{\boldsymbol{v}}_t}$ division to outside it and you have turned Adam into AdamW. The rest of this piece is about why so small a change matters so much.

# One Adam-family step. Adam+L2 and AdamW differ only at (A) and (B);
# `decoupled` selects which of the two is active (never both).
import numpy as np

def adam_step(theta, grad_f, m, v, t, lam, decoupled,
              alpha=1e-3, b1=0.9, b2=0.999, eps=1e-8, eta=1.0):
    g = grad_f                                  # loss gradient only
    if not decoupled:
        g = g + lam * theta                     # (A) Adam+L2: fold decay into g
    m = b1 * m + (1 - b1) * g                   # 1st moment EMA
    v = b2 * v + (1 - b2) * g * g               # 2nd raw moment EMA
    mh = m / (1 - b1**t)                        # bias-corrected moments (t starts at 1)
    vh = v / (1 - b2**t)
    step = alpha * mh / (np.sqrt(vh) + eps)     # the normalized Adam step
    decay = lam * theta if decoupled else 0.0   # (B) AdamW: decay theta directly
    theta = theta - eta * (step + decay)        # equation (4) when decoupled
    return theta, m, v

Why Adam breaks the match

For SGD, the coupling was annoying but curable: one rescale, $\lambda' = \lambda/\alpha$, made L2 and weight decay identical. The paper's second proposition says that for Adam no such rescale exists. The argument is short: when you fold L2 into the gradient, the penalty term $\lambda'\boldsymbol{\theta}$ is multiplied by the same preconditioner as the real gradient, so the L2 step shrinks the weights by $\alpha\lambda'\boldsymbol{M}_t\boldsymbol{\theta}$, while decoupled weight decay shrinks them by $\lambda\boldsymbol{\theta}$. For those to be equal at every weight you would need

$$\lambda\,\boldsymbol{\theta} = \alpha\lambda'\,\boldsymbol{M}_t\,\boldsymbol{\theta}\quad\text{for all }\boldsymbol{\theta}$$

and a single scalar $\lambda'$ can pull that off only if $\boldsymbol{M}_t$ is a scalar multiple of the identity, the same number on every parameter. For SGD it is (the preconditioner is the identity), so the rescale works. For Adam, $\boldsymbol{M}_t$ is deliberately non-uniform, so no scalar reproduces a uniform shrink. L2 and weight decay have genuinely come apart.

Make the gap concrete, because it has a direction. Under L2, the decay a weight actually receives is its intended decay times $\alpha/(\sqrt{\hat{v}}+\epsilon)$, the same factor the normalization applies to its gradient. A weight whose gradients have been large, say $\sqrt{\hat{v}} = 0.1$ at the default $\alpha = 0.001$, gets only about one percent of the intended decay; a weight with a tiny gradient history gets several times too much. Decoupled weight decay gives every weight exactly the intended amount. Drag a weight along its gradient history below and watch the two diverge.

Figure 3 · how much decay each weight actually gets
The fraction of the intended decay a weight actually receives, against its typical gradient size $\sqrt{\hat{v}}$. Decoupled weight decay is flat at $1\times$: every weight gets the full amount. Adam + L2 applies $\alpha/(\sqrt{\hat{v}}+\epsilon)$, crossing $1\times$ at $\sqrt{\hat{v}}=\alpha$, over-decaying small-gradient weights and under-decaying the large-gradient ones the paper cares about.
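The curve in that figure is one line of arithmetic. The sketch below tabulates it under a simplifying assumption: $\sqrt{\hat{v}}$ is treated as a fixed number per weight, ignoring momentum smoothing and the fact that under L2 the decay term itself feeds into $\hat{v}$. The function names are illustrative.

```python
def l2_decay_fraction(rms_grad, alpha=1e-3, eps=1e-8):
    """Fraction of the intended decay lam * theta that Adam + L2 applies to one weight."""
    return alpha / (rms_grad + eps)


def decoupled_decay_fraction(rms_grad):
    """AdamW applies the full intended decay regardless of gradient history."""
    return 1.0


if __name__ == "__main__":
    for rms in (1e-1, 1e-2, 1e-3, 1e-4):
        print(f"sqrt(v_hat)={rms:.0e}: Adam+L2 {l2_decay_fraction(rms):.3g}x, "
              f"AdamW {decoupled_decay_fraction(rms):.3g}x")
```

At the default $\alpha = 0.001$ the Adam + L2 fraction runs from about $0.01\times$ for the large-gradient weight to about $10\times$ for the small-gradient one, while the decoupled fraction never moves.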

There is a partial way to describe what decoupled weight decay does instead, and it points at why the change might help. The paper's third proposition freezes the preconditioner at a fixed matrix $\text{diag}(\boldsymbol{s})^{-1}$ and shows that, in that idealized case, decoupled weight decay is equivalent to a scale-adjusted L2 penalty:

$$f^{\text{sreg}}(\boldsymbol{\theta}) = f(\boldsymbol{\theta}) + \frac{\lambda'}{2}\,\big\lVert\boldsymbol{\theta}\odot\sqrt{\boldsymbol{s}}\,\big\rVert_2^2,\qquad \lambda'=\frac{\lambda}{\alpha}$$
(5)

The $\sqrt{\boldsymbol{s}}$ weighting is the interesting part. A weight with a large $s_i$ (a large historic gradient) is penalized more here: the weight is scaled by $\sqrt{s_i}$ before squaring, so its share of the penalty grows with $s_i$, the exact opposite of what L2 inside Adam does to it. The motivation the authors offer is the flat-versus-sharp-minimum idea: weights whose gradients are large are the ones where small changes move the loss a lot, the brittle directions, and leaning on them harder may push the optimizer toward flatter regions that generalize better. That last step is a hypothesis, not something proposition 3 proves; the proposition only establishes the single-step equivalence.
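The equivalence in equation (5) can be checked in a few lines. The sketch assumes the idealized setting of the proposition: a fixed diagonal preconditioner $\text{diag}(\boldsymbol{s})^{-1}$ and no momentum. The vectors `theta`, `grad`, and `s` are arbitrary values chosen for the check.

```python
def decoupled_step(theta, grad, s, alpha, lam):
    """Preconditioned step on f with decoupled decay: theta - alpha * grad / s - lam * theta."""
    return [t - alpha * g / si - lam * t for t, g, si in zip(theta, grad, s)]


def scaled_l2_step(theta, grad, s, alpha, lam):
    """Preconditioned step on f + (lam'/2) * ||theta * sqrt(s)||^2 with lam' = lam / alpha."""
    lam_prime = lam / alpha
    # Gradient of the penalty with respect to theta_i is lam' * s_i * theta_i.
    reg_grad = [g + lam_prime * si * t for t, g, si in zip(theta, grad, s)]
    return [t - alpha * rg / si for t, rg, si in zip(theta, reg_grad, s)]


if __name__ == "__main__":
    theta = [3.0, -0.2, 1.5]
    grad = [0.4, -2.0, 0.05]      # arbitrary loss gradient, made up for the check
    s = [10.0, 0.5, 0.01]         # fixed preconditioner entries, M = diag(s)^-1
    a = decoupled_step(theta, grad, s, alpha=1e-3, lam=1e-2)
    b = scaled_l2_step(theta, grad, s, alpha=1e-3, lam=1e-2)
    print(a)
    print(b)
```

The penalty's gradient carries a factor $s_i$ that the preconditioner's $1/s_i$ cancels exactly, leaving a uniform shrink of $\lambda\theta_i$ on every coordinate.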

A second, independent argument arrives at the same place. Aitchison (2018) showed that adaptive optimizers can be read as Bayesian filtering, tracking a distribution over the best weights as the other weights shift underneath them. In that reading Adam's preconditioner is the posterior uncertainty (you take bigger steps on weights you are less sure of), and the slow drift of the estimate between steps is modeled by a prior $\mathcal{N}\big((\boldsymbol{I}-\boldsymbol{A})\boldsymbol{\theta}_t, \boldsymbol{Q}\big)$. Set $\boldsymbol{A}=\lambda\boldsymbol{I}$ and that prior multiplies the mean by $(1-\lambda)$ each step, which is exactly decoupled weight decay. The pull comes from the data-independent prior and does not depend on how certain you are about any weight; an L2 penalty would have made the pull vary with that uncertainty. So under this theory, as under proposition 3, it is weight decay, not L2, that emerges naturally.

Two practical add-ons

With the core idea in hand, the paper adds two refinements that make AdamW pleasant to use. Both are independent of the decoupling; they just remove sharp edges.

Normalized weight decay. The best raw $\lambda$ depends on how long you train. Every update applies another small shrink, so a longer run accumulates more total decay, and the per-step value that was right for a short run is too strong for a long one. The authors absorb this by writing the raw decay in terms of a budget-free number $\lambda_{\text{norm}}$:

$$\lambda = \lambda_{\text{norm}}\sqrt{\frac{b}{B\,T}} = \frac{\lambda_{\text{norm}}}{\sqrt{I}}\,,\qquad I = \frac{T B}{b}$$
(6)

where $b$ is the batch size, $B$ the dataset size, and $T$ the number of epochs, so $I$ is the total number of weight updates. The raw $\lambda$ falls off as one over the square root of the number of updates; $\lambda_{\text{norm}}$, the decay you would use for a single batch pass, stays put. You tune it once and reuse it across budgets and datasets.

Figure 4 · normalized weight decay
The raw optimal λ slides down a $1/\sqrt{I}$ curve as you train longer (more weight updates I), while the normalized λ_norm you set once stays flat. CIFAR-10 and ImageNet32×32 at the same 100 epochs differ about $24\times$ in updates per epoch, so their raw λ differ about $5\times$ ($\sqrt{24}\approx 4.9$), yet the same λ_norm is right for both.

That factor of five is not hypothetical. Train on ImageNet32×32, whose epochs are about 24 times longer than CIFAR-10's, with the raw $\lambda$ that was optimal for CIFAR-10, and you would be decaying roughly five times too hard. Normalizing makes the optimal $\lambda_{\text{norm}}$ land near the same value ($0.025$ to $0.05$) across both datasets and across SGDW and AdamW alike, which is most of the convenience.
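Equation (6) translates directly into a helper. In this sketch the batch size of 128 and the two dataset sizes (50,000 and a round 24 times that) are assumptions chosen to reproduce the factor discussed above, not settings quoted from the paper.

```python
import math


def raw_weight_decay(lam_norm, batch_size, dataset_size, epochs):
    """Equation (6): lam = lam_norm * sqrt(b / (B * T))."""
    return lam_norm * math.sqrt(batch_size / (dataset_size * epochs))


if __name__ == "__main__":
    # Dataset sizes and batch size below are illustrative assumptions.
    small = raw_weight_decay(0.05, batch_size=128, dataset_size=50_000, epochs=100)
    large = raw_weight_decay(0.05, batch_size=128, dataset_size=1_200_000, epochs=100)
    print(f"CIFAR-10-sized run: lam = {small:.2e}")
    print(f"24x larger dataset: lam = {large:.2e}")
    print(f"ratio = {small / large:.2f}")   # sqrt(24), about 4.9
```

The same $\lambda_{\text{norm}}$ yields a raw decay about 4.9 times smaller on the larger dataset, which is the adjustment you would otherwise have to find by re-tuning.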

Cosine annealing with warm restarts. The second add-on schedules the multiplier $\eta_t$ from equation (4). Within a run it cools the multiplier from one down to zero along a cosine, then restarts, snapping it back to one and beginning a new, longer run:

$$\eta_t = 0.5 + 0.5\cos\!\left(\pi\,\frac{T_{\text{cur}}}{T_i}\right)$$
(7)

where $T_{\text{cur}}$ counts epochs since the last restart and $T_i$ is the length of the current run, each run a factor of $T_{\text{mult}}$ longer than the last. Because $\eta_t$ scales both the gradient step and the decay, the cosine cools the step and the decay together, and a restart brings both back to full strength. The low points, where $\eta_t = 0$, are good moments to snapshot the model.

Figure 5 · cosine annealing with warm restarts
The multiplier η cosine-anneals from 1 to 0 within each run, then restarts back to 1; each run is $T_{\text{mult}}$ times longer than the last. The amber dots are the $\eta = 0$ snapshots. With $T_0 = 100$ and $T_{\text{mult}} = 2$ the runs are 100, 200, 400, 800 epochs, so the restarts fall at 100, 300, 700, 1500. Drag the sliders to reshape the schedule.
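The schedule in equation (7) is short enough to write out. This is a minimal sketch with hypothetical helper names (`eta`, `restart_epochs`); it treats an epoch that lands exactly on a boundary as the start of the next run, which is one reasonable convention rather than the paper's stated one.

```python
import math


def eta(epoch, t0=100, t_mult=2):
    """Schedule multiplier from equation (7) at a (possibly fractional) epoch."""
    t_i, start = t0, 0
    # Walk forward through the runs until `epoch` falls inside the current one.
    while epoch >= start + t_i:
        start += t_i
        t_i *= t_mult
    t_cur = epoch - start
    return 0.5 + 0.5 * math.cos(math.pi * t_cur / t_i)


def restart_epochs(t0=100, t_mult=2, runs=4):
    """Epochs at which each of the first `runs` runs ends."""
    ends, total, t_i = [], 0, t0
    for _ in range(runs):
        total += t_i
        ends.append(total)
        t_i *= t_mult
    return ends


if __name__ == "__main__":
    print(restart_epochs())                       # [100, 300, 700, 1500]
    for e in (0, 50, 99.9, 100, 200, 299.9, 300):
        print(e, round(eta(e), 4))
```

For everyday use PyTorch ships a comparable scheduler, `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`; the hand-rolled version here is only meant to make the arithmetic visible.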

Combining AdamW with this schedule gives AdamWR. Restarts do not change the final answer much, but they reach a good answer far sooner, up to ten times faster to a usable model than running the cosine straight through. Normalized weight decay is what made the combination practical: it let the authors keep one decay setting even as each restart stretched the run length.

AdamW closes the gap

The headline number is a 15% relative reduction in test error over Adam with L2, at the default learning rate of $0.001$, and it holds on both CIFAR-10 and ImageNet32×32. Relative is the operative word: if Adam was at a 6% error, AdamW lands near 5.1%, enough to close most of the gap to SGD with momentum. The improvement comes from better generalization, not faster convergence alone; matched at equal training loss, AdamW still tests better.

Figure 1's basin shows the other half of the result. With decoupled weight decay the good region of the $(\alpha, \lambda)$ grid straightens out, so the two knobs can be tuned almost independently. That alone helps a lot: it turns a search over a tilted diagonal into two nearly separate one-dimensional searches.

Be precise about the claim, because it is easy to inflate. AdamW does not set a new state of the art; the $2.86\%$ CIFAR-10 figure floating around the experiments belongs to the Shake-Shake network trained with SGD that the authors borrow as their setup, not to AdamW. What AdamW does is make Adam competitive with SGD plus momentum on problems where Adam used to trail it, so practitioners no longer have to switch optimizers to get good generalization. This mattered: the decoupled form is now the default in PyTorch's torch.optim.AdamW and the optimizer behind most large language models. (The older torch.optim.Adam(weight_decay=...) still gives you coupled L2, and its default decay is zero where AdamW's is $0.01$, so the two are genuinely different calls.)

Everything traces back to one relocation. L2 regularization writes a decay term into the gradient, where Adam's per-parameter normalization then shrinks it unevenly, hardest on exactly the weights you most want to regularize. Move that term out of the gradient and apply it straight to the weights, the way weight decay was originally defined, and the unevenness disappears: every weight decays by the same fraction, the learning rate and the decay no longer interact, and Adam catches up to SGD. One line of code, in a place nobody had thought to look.

Provenance: verified against primary literature
Hanson & Pratt (1988): The original multiplicative weight decay, equation (1): a fixed fractional shrink applied directly to the weights.
Adam (Kingma & Ba, 2015): The base optimizer: momentum over a per-parameter RMS-normalized gradient.
SGDR (Loshchilov & Hutter, 2016): Cosine annealing with warm restarts, reused unchanged to build AdamWR.
Aitchison (2018): Adaptive methods as Bayesian filtering; decoupled weight decay as the state-transition prior.
Correction: Proposition 3's printed statement gives the scale-adjusted regularizer coefficient as λ′/(2α); its own proof uses λ′/2 = λ/(2α). We teach the proof's coefficient. The statement double-counts the 1/α already inside λ′=λ/α, which would make the penalty ~1000× too strong at the default α=0.001.

Questions you might still have

Is this the same as the weight_decay argument in PyTorch’s Adam?
No. torch.optim.Adam(weight_decay=...) adds the decay to the gradient, so it gets divided by the sqrt(v-hat) normalization, which is coupled L2. Only torch.optim.AdamW applies the decay directly to the weights. They are different algorithms, and even their defaults differ: Adam decays by 0, AdamW by 0.01.

If decoupled weight decay is better, why did everyone use L2 for years?
Because for SGD the two are exactly equivalent once you rescale the coefficient by 1/alpha, and the deep-learning libraries inherited the L2 implementation. The inequivalence only appears for adaptive optimizers, and it went unnoticed until this paper.

Does AdamW change anything at inference, or only training?
Only training. AdamW is an optimizer; the model architecture and the forward pass are untouched. You can take a model trained with Adam+L2 and one trained with AdamW and run them identically; the difference is in the weights they converged to.

Why would penalizing large-gradient weights more help generalization?
This is a hypothesis, not a theorem. The intuition (which the paper cites, via flat-versus-sharp minima) is that weights with large gradients are the brittle directions where small changes move the loss a lot, so shrinking them harder may steer toward flatter, better-generalizing solutions. Proposition 3 only proves the single-step equivalence, not the generalization claim.

Is AdamW a new optimizer, or just Adam with a flag?
It is Adam with the decay term moved outside the normalization. The moments, bias correction, and all the defaults are unchanged. That is exactly why it was easy to adopt: one line, no retuning of the rest.

Footnotes & further reading

  1. The paper: Loshchilov & Hutter, Decoupled Weight Decay Regularization (University of Freiburg, ICLR 2019). Code.
  2. The base optimizer: Kingma & Ba, Adam: A Method for Stochastic Optimization (2015). We have a fuller walkthrough at Adam, explained.
  3. The original multiplicative weight decay: Hanson & Pratt, "Comparing biases for minimal network construction with back-propagation" (NeurIPS 1988). A related early study is Krogh & Hertz, A Simple Weight Decay Can Improve Generalization (1992), which (illustrating the very conflation this paper untangles) defines weight decay as the L2 cost.
  4. Cosine annealing with warm restarts: Loshchilov & Hutter, SGDR: Stochastic Gradient Descent with Warm Restarts (arXiv 2016, ICLR 2017).
  5. The Bayesian-filtering view: Aitchison, A Unified Theory of Adaptive Stochastic Gradient Descent as Bayesian Filtering (2018). (The paper's reference list mis-cites the arXiv id as 1507.02030; the correct id is 1807.07540.)
  6. Implementations: PyTorch's torch.optim.AdamW applies the decoupled decay; the experiments here use 26-layer ResNets (the 2×64d and 2×96d variants, 11.6M and 25.6M parameters) with Shake-Shake regularization on CIFAR-10 and ImageNet32×32.