Assistant Professor, Toyota Technological Institute at Chicago
3 papers at NeurIPS 2025
We show that letting transformers' depth grow slightly with the input length increases their expressive power, under standard complexity-theoretic conjectures.
We propose a simple method for measuring the critical batch size in language model pretraining that alleviates issues with existing approaches, and we show that it can be used to train language models with fewer gradient steps in practice.
We exactly characterize the expressive power of transformers with padding tokens as $\mathsf{TC}^0$, and we give a corresponding characterization for transformers with both looping and padding.