2 papers across 2 sessions
We propose a simple way to measure the critical batch size for language model pretraining that alleviates issues with existing methods, and show that this can be used to train language models with fewer gradient steps in practice.
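The summary does not describe the proposed measurement itself; purely as background, here is a minimal sketch of the gradient-noise-scale estimator of McCandlish et al. (2018), presumably one of the "existing methods" referred to, not the method of this paper. The function name and the synthetic gradients are ours, for illustration only.

```python
import numpy as np

def gradient_noise_scale(g_small, g_big, b_small, b_big):
    """Estimate the critical ("noise") batch size B_noise = tr(Sigma) / |G|^2
    from two gradient estimates computed at batch sizes b_small < b_big."""
    s_small = np.sum(g_small ** 2)   # |g_small|^2
    s_big = np.sum(g_big ** 2)       # |g_big|^2
    # Unbiased estimates of the true-gradient squared norm and the noise trace.
    g_norm_sq = (b_big * s_big - b_small * s_small) / (b_big - b_small)
    trace_sigma = (s_small - s_big) / (1.0 / b_small - 1.0 / b_big)
    return trace_sigma / g_norm_sq

# Toy usage with synthetic gradients standing in for per-batch model gradients.
rng = np.random.default_rng(0)
true_grad = rng.normal(size=10_000)
g_small = true_grad + rng.normal(scale=1.0 / np.sqrt(32), size=10_000)
g_big = true_grad + rng.normal(scale=1.0 / np.sqrt(512), size=10_000)
print(gradient_noise_scale(g_small, g_big, b_small=32, b_big=512))
```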
We derive scaling laws for the optimal weight decay and batch size in LLM pretraining, finding that the optimal (and critical) batch size scales primarily with dataset size; we discuss the implications for optimizing time and compute efficiency.
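For illustration only (the functional form is the standard power-law ansatz for such fits; the constants are placeholders, not values reported in the paper), a scaling law of this kind relates the critical batch size $B^{*}$ to the dataset size $D$ (e.g., in tokens) as

$B^{*}(D) \approx B_0 \, D^{\alpha}$, with $B_0$ and $\alpha > 0$ fit empirically from training runs.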