We introduce gradient multi-normalization, a new design principle for LLM matrix optimizers that unifies previous work and enables faster, more memory-efficient training of LLMs.
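As a rough illustration of the idea, one can imagine alternately applying several normalizations to a gradient matrix until it is approximately normalized under all of them. The sketch below is a toy example only: the choice of two normalizations (row-wise and column-wise L2), the iteration count, and the function name `multi_normalize` are assumptions for illustration, not the paper's actual procedure.

```python
import numpy as np

def multi_normalize(grad, num_iters=10, eps=1e-8):
    """Toy multi-normalization: alternate two normalization maps on a
    gradient matrix so the result is (approximately) a fixed point of both.
    Illustrative sketch only, not the paper's algorithm."""
    g = grad.astype(np.float64)
    for _ in range(num_iters):
        # Normalization 1: scale each row to (near) unit L2 norm.
        g = g / (np.linalg.norm(g, axis=1, keepdims=True) + eps)
        # Normalization 2: scale each column to (near) unit L2 norm.
        g = g / (np.linalg.norm(g, axis=0, keepdims=True) + eps)
    return g

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 3))
G_hat = multi_normalize(G)
# After the final column step, column norms are close to 1.
print(np.allclose(np.linalg.norm(G_hat, axis=0), 1.0, atol=1e-3))
```

Because the normalized update has a fixed scale regardless of the raw gradient's magnitude, such a step needs no optimizer state like Adam's moment estimates, which is one route to the memory savings the abstract alludes to.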