2 papers across 2 sessions
We derive scaling laws for optimal weight decay and batch size in LLM pre-training, finding that the optimal (and critical) batch size scales primarily with dataset size; we discuss the implications for time and compute efficiency.
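As a minimal sketch of what such a scaling law looks like in practice, the snippet below evaluates a power-law fit of optimal batch size against dataset size, B_opt(D) = B_0 * (D / D_0)^beta. The functional form is the standard one for scaling laws, but the constants (b0, d0, beta) are illustrative assumptions, not fitted values from the paper.

```python
def optimal_batch_size(num_tokens: float,
                       b0: float = 256.0,    # assumed reference batch size (sequences)
                       d0: float = 1e9,      # assumed reference dataset size (tokens)
                       beta: float = 0.4) -> float:
    """Power-law scaling rule B_opt = b0 * (D / d0) ** beta.

    All constants are illustrative; a real fit would estimate them from
    sweeps over batch size at several dataset sizes.
    """
    return b0 * (num_tokens / d0) ** beta


if __name__ == "__main__":
    # Larger datasets tolerate (and benefit from) larger batches.
    for d in (1e9, 1e10, 1e11):
        print(f"D = {d:.0e} tokens -> B_opt ~ {optimal_batch_size(d):,.0f} sequences")
```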
We propose AlphaDecay, a method that assigns per-module weight decay guided by each module's heavy-tailedness, improving large language model performance.
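To illustrate the general shape of a per-module scheme like this, here is a hedged PyTorch sketch: it estimates a heavy-tail exponent for each Linear module's weight spectrum with a Hill estimator, then builds optimizer parameter groups whose decay scales with that exponent. The helpers `hill_alpha` and `alpha_weighted_decay`, and in particular the direction of the alpha-to-decay mapping, are illustrative assumptions rather than the paper's exact procedure.

```python
import torch


def hill_alpha(weight: torch.Tensor, k_frac: float = 0.1) -> float:
    """Hill estimate of the power-law exponent of the weight spectrum.

    Uses the eigenvalues of W^T W (squared singular values of W); a smaller
    alpha indicates a heavier tail. The +1 converts the tail index to the
    density-exponent convention (an assumed choice for this sketch).
    """
    eigs = torch.linalg.svdvals(weight.detach().float()) ** 2
    eigs, _ = torch.sort(eigs, descending=True)
    k = max(int(k_frac * eigs.numel()), 2)
    top = eigs[:k]
    return 1.0 + k / (torch.log(top / top[-1]).sum().item() + 1e-12)


def alpha_weighted_decay(model: torch.nn.Module, base_wd: float = 0.1):
    """Per-module weight decay scaled by each Linear module's alpha.

    Heavier-tailed modules (smaller alpha) get weaker decay here, with the
    mean decay normalized to base_wd; the real AlphaDecay rule may differ.
    """
    linears = [(n, m) for n, m in model.named_modules()
               if isinstance(m, torch.nn.Linear)]
    alphas = {n: hill_alpha(m.weight) for n, m in linears}
    mean_a = sum(alphas.values()) / len(alphas)
    groups = [{"params": [m.weight],
               "weight_decay": base_wd * alphas[n] / mean_a}
              for n, m in linears]
    # Remaining parameters (biases, norms, embeddings) get no decay.
    decayed = {id(m.weight) for _, m in linears}
    rest = [p for p in model.parameters() if id(p) not in decayed]
    groups.append({"params": rest, "weight_decay": 0.0})
    return groups


if __name__ == "__main__":
    model = torch.nn.Sequential(torch.nn.Linear(512, 512),
                                torch.nn.GELU(),
                                torch.nn.Linear(512, 512))
    opt = torch.optim.AdamW(alpha_weighted_decay(model), lr=3e-4)
```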