3 papers across 2 sessions
A new scaling law formula that accounts for learning rate annealing and can fit and predict full loss curves.
We introduce a Functional Scaling Law that predicts full SGD loss dynamics under arbitrary learning rate schedules.