Researcher, Google
3 papers at NeurIPS 2025
We prove a purely statistical separation between Transformers and other architectures such as feedforward and recurrent networks: Transformers need fewer samples to learn sparse sequence models.
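For intuition, here is a minimal sketch of what a sparse sequence task looks like (an illustrative assumption, not the paper's exact construction): the label depends on only q of the T input positions, and attention can in principle locate those q positions directly, while a feedforward or recurrent model must account for all T of them. The names `T`, `q`, `support`, and `sparse_target` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

T, q = 64, 3  # sequence length, sparsity level
# The q relevant positions; everything else is noise for the label.
support = rng.choice(T, size=q, replace=False)

def sparse_target(x):
    """Label depends only on the q coordinates in `support`."""
    return np.sign(x[support].sum())

x = rng.standard_normal(T)
print(support, sparse_target(x))
```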
We show that Transformers of linear width can solve many graph problems at constant depth, revealing a width-depth trade-off: increasing width enables shallower, faster models, though some tasks still demand quadratic width.
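As a back-of-the-envelope illustration of why the width regime matters (hypothetical constants; the `layer_params` count and its 12w² estimate are illustrative assumptions, not figures from the paper): per-layer cost is dominated by the O(w²) projection and MLP matrices, so moving from width Θ(n) to width Θ(n²) on an n-node graph squares the dominant term.

```python
def layer_params(width):
    # Rough per-layer count: attention projections (~4w^2) plus MLP (~8w^2).
    return 4 * width**2 + 8 * width**2

for n in (64, 256, 1024):
    linear, quadratic = layer_params(n), layer_params(n**2)
    print(f"n={n:5d}  linear-width params ~{linear:.2e}  "
          f"quadratic-width params ~{quadratic:.2e}")
```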