On the $O\left(\frac{\sqrt{d}}{K^{1/4}}\right)$ Convergence Rate of AdamW Measured by $\ell_{1}$ Norm

Nankai University · Peking University · Pazhou Laboratory

AdamW, Adam, convergence analysis, nonconvex optimization

Abstract

As the default optimizer for training large language models, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not well understood theoretically.

This paper establishes the convergence rate $\frac{1}{K} \sum_{k=1}^{K} \mathbb{E} \|\nabla f(x^{k})\|_{1} \leq O\left(\frac{\sqrt{d}\, C}{K^{1/4}}\right)$ for AdamW measured by $\ell_{1}$ norm, where $K$ represents the iteration number, $d$ denotes the model dimension, and $C$ matches the constant in the optimal convergence rate of SGD.

Theoretically, we have $\mathbb{E} \|\nabla f(x)\|_{1} \geq \sqrt{\frac{2d}{\pi}}\, \mathbb{E} \|\nabla f(x)\|_{2}$ when each element of $\nabla f(x)$ is generated from Gaussian distribution $\mathcal{N}(0, 1)$. Empirically, our experimental results on real-world deep learning tasks reveal $\|\nabla f(x)\|_{1} = \Theta(\sqrt{d}) \|\nabla f(x)\|_{2}$. Both support that our convergence rate can be considered to be analogous to the optimal convergence rate of SGD.
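The Gaussian relation between the two norms can be checked numerically. For a standard normal entry, $\mathbb{E}|g_i| = \sqrt{2/\pi}$, so $\mathbb{E}\|g\|_{1} = d\sqrt{2/\pi}$, while Jensen's inequality gives $\mathbb{E}\|g\|_{2} \leq \sqrt{d}$; together these give the stated bound. The following is a minimal Monte Carlo sketch, not taken from the paper: the function name `gaussian_norm_means`, the sample count, and the seed are illustrative assumptions, and the estimates carry sampling noise.

```python
import math
import random


def gaussian_norm_means(d, num_samples=200, seed=0):
    """Monte Carlo estimates of E||g||_1 and E||g||_2 for g with i.i.d. N(0, 1) entries.

    Hypothetical helper for illustration; d is the vector dimension.
    """
    rng = random.Random(seed)
    l1_total = 0.0
    l2_total = 0.0
    for _ in range(num_samples):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        l1_total += sum(abs(v) for v in g)            # ell_1 norm of this sample
        l2_total += math.sqrt(sum(v * v for v in g))  # ell_2 norm of this sample
    return l1_total / num_samples, l2_total / num_samples


if __name__ == "__main__":
    target = math.sqrt(2.0 / math.pi)  # theoretical lower bound on the ratio below
    for d in (10, 100, 1000):
        mean_l1, mean_l2 = gaussian_norm_means(d)
        # Ratio E||g||_1 / (sqrt(d) * E||g||_2); theory says it is at least sqrt(2/pi).
        ratio = mean_l1 / (math.sqrt(d) * mean_l2)
        print(f"d={d:5d}  ratio={ratio:.4f}  sqrt(2/pi)={target:.4f}")
```

Because the Jensen gap shrinks as $d$ grows, the estimated ratio should sit close to $\sqrt{2/\pi}$ for large $d$, and sampling noise can push an individual estimate slightly below it.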