Research Scientist, DeepMind
1 paper at NeurIPS 2025
We show that learning sparse attention is prone to emergent behaviors during training, and we study, both theoretically and empirically, how data and model design influence the speed of emergence.