Research Scientist, Cerebras Systems, Inc.
2 papers at NeurIPS 2025
We introduce CompleteP, which offers depth-wise hyperparameter (HP) transfer, FLOP savings when training deep models, and a larger range of compute-efficient width/depth ratios.
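To make the depth-wise idea concrete, the sketch below shows one common ingredient of depth parameterizations: scaling each residual branch by 1/L so that per-block updates stay comparable as depth grows, which lets a learning rate tuned at small depth carry over to larger depths. The module names, the MLP branch, and the bare 1/L factor are illustrative assumptions, not CompleteP's full prescription.

```python
import torch
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    """Residual block whose branch output is scaled by 1/L (total depth).

    With this scaling, each block's contribution shrinks as depth grows,
    so per-block updates remain comparable across depths -- the intuition
    behind depth-wise HP transfer. (Illustrative sketch, not CompleteP.)
    """
    def __init__(self, width: int, depth: int):
        super().__init__()
        self.branch = nn.Sequential(
            nn.LayerNorm(width),
            nn.Linear(width, 4 * width),
            nn.GELU(),
            nn.Linear(4 * width, width),
        )
        self.scale = 1.0 / depth  # assumed depth-wise scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.scale * self.branch(x)

def make_trunk(width: int, depth: int) -> nn.Sequential:
    return nn.Sequential(*[ScaledResidualBlock(width, depth) for _ in range(depth)])

if __name__ == "__main__":
    # Only `depth` changes between these two models; under depth-wise
    # scaling, the same base learning rate can be reused for both.
    shallow = make_trunk(width=256, depth=4)
    deep = make_trunk(width=256, depth=24)
    x = torch.randn(8, 16, 256)
    print(shallow(x).shape, deep(x).shape)  # both torch.Size([8, 16, 256])
```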
We derive scaling laws for optimal weight decay and batch size in LLM pre-training, finding that the optimal (and critical) batch size scales primarily with dataset size; we discuss the implications for optimizing time and compute efficiency.
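A claim that optimal batch size scales primarily with dataset size suggests a power-law form of roughly B* ≈ c · D^β. The sketch below shows how such a law can be fit from measurements by linear regression in log-log space; the data points and the resulting exponent are placeholder values invented for illustration, not the paper's results.

```python
import numpy as np

# Hypothetical (D, B*) pairs: dataset size in tokens vs. measured
# optimal batch size. These numbers are made up for illustration.
D = np.array([1e9, 4e9, 1.6e10, 6.4e10])
B_opt = np.array([0.25e6, 0.5e6, 1.0e6, 2.0e6])  # tokens per batch

# Fit B* = c * D^beta via least squares in log-log space.
beta, log_c = np.polyfit(np.log(D), np.log(B_opt), deg=1)
c = np.exp(log_c)
print(f"fitted exponent beta = {beta:.3f}, coefficient c = {c:.3g}")

# Extrapolate the fitted law to a larger dataset size.
D_new = 2.56e11
print(f"predicted optimal batch size at D={D_new:.2e}: {c * D_new**beta:.3g} tokens")
```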