We observe that Adam’s performance in training transformers degrades to different degrees depending on the type of random rotation applied to the objective function, highlighting the need for new, basis-dependent theory to fully understand Adam’s success.
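To make the notion of basis dependence concrete, the following is a minimal sketch on a toy quadratic, not the transformer experiments summarized above. It assumes a single global random rotation `Q` and a hypothetical diagonal curvature `h`; all names and hyperparameters are illustrative. Plain gradient descent produces the same trajectory in either basis (up to the rotation), whereas Adam's elementwise normalization does not.

```python
import numpy as np

def run_gd(grad, x0, lr, steps):
    """Plain gradient descent; returns the final iterate."""
    x = x0.copy()
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def run_adam(grad, x0, lr, steps, b1=0.9, b2=0.999, eps=1e-8):
    """Standard Adam with bias correction; returns the final iterate."""
    x = x0.copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g       # elementwise square: basis-dependent
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

rng = np.random.default_rng(0)
d = 5
# Hypothetical diagonal curvature for the toy objective f(x) = 0.5 * sum(h * x**2).
h = np.array([1.0, 10.0, 100.0, 0.1, 5.0])
x0 = rng.standard_normal(d)

# Random orthogonal matrix from a QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Gradient of f, and of the rotated objective f_Q(y) = f(Q y).
grad_orig = lambda x: h * x
grad_rot = lambda y: Q.T @ (h * (Q @ y))

# Start the rotated run from the same point expressed in the rotated basis.
y0 = Q.T @ x0

# Map the rotated iterates back with Q and compare against the original run.
gd_gap = np.linalg.norm(
    run_gd(grad_orig, x0, 1e-3, 50) - Q @ run_gd(grad_rot, y0, 1e-3, 50)
)
adam_gap = np.linalg.norm(
    run_adam(grad_orig, x0, 1e-2, 50) - Q @ run_adam(grad_rot, y0, 1e-2, 50)
)

print(f"GD gap after rotation:   {gd_gap:.2e}")
print(f"Adam gap after rotation: {adam_gap:.2e}")
```

In this sketch the gradient descent gap should sit at floating-point round-off, while the Adam gap is visibly nonzero. That only shows Adam's iterates depend on the choice of basis; it says nothing about which basis is better, which is the empirical question the summary addresses.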