We investigate how to scale second-order optimizers effectively, showing that they outperform Adam and reduce the data required for compute-optimal transformer training.