Associate Professor, University of Maryland, College Park
2 papers at NeurIPS 2025
We fit scaling laws for large language models with varying width-to-depth ratios and parameter counts.
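A minimal sketch of what fitting such a scaling law can look like, assuming a generic Chinchilla-style power-law form and made-up (parameter count, loss) pairs; the paper's actual functional form and data are not shown here:

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_law(n, E, a, alpha):
    # E: irreducible loss; a, alpha: power-law coefficients.
    # N is expressed in units of 1e8 parameters to keep the fit well-scaled.
    return E + a * (n / 1e8) ** (-alpha)

# Hypothetical final losses from training runs at several model sizes.
n_params = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
loss = np.array([3.10, 2.81, 2.60, 2.47, 2.38])

(E, a, alpha), _ = curve_fit(loss_law, n_params, loss, p0=[2.0, 1.0, 0.3])
print(f"L(N) ~ {E:.2f} + {a:.2f} * (N/1e8)^-{alpha:.3f}")
```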
We show that recurrent-depth transformers can be scaled into effective language models, with particularly strong gains on reasoning tasks from additional test-time compute.
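To illustrate the core idea, here is a minimal sketch of a recurrent-depth block in PyTorch: one weight-shared transformer layer applied a variable number of times, so test-time compute can be raised without adding parameters. The layer choice, sizes, and iteration counts are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class RecurrentDepthBlock(nn.Module):
    """One weight-shared transformer block, iterated a variable number of times."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )

    def forward(self, x, num_iters=4):
        # The same parameters are reused at every step, so effective depth
        # (and compute) scales with num_iters at constant parameter count.
        for _ in range(num_iters):
            x = self.block(x)
        return x

model = RecurrentDepthBlock()
tokens = torch.randn(1, 16, 256)      # (batch, sequence, d_model)
cheap = model(tokens, num_iters=2)    # less compute
deep = model(tokens, num_iters=16)    # more compute for harder inputs
```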