We introduce gradient multi-normalization, a new design principle for LLM matrix optimizers that unifies previous work and enables faster, more memory-efficient LLM training.